Dataset asset | Open Source Community | Bioinformatics | Named Entity Recognition
spyysalo/bc2gm_corpus
Bc2GmCorpus is a dataset for named entity recognition focusing on gene‑related entities. It comprises a training set, validation set, and test set containing 12,500, 2,500, and 5,000 samples respectively. Each sample includes an `id`, a list of `tokens`, and `ner_tags` indicating gene‑related entity annotations.
Source
hugging_face
Created
Nov 28, 2025
Updated
Jan 10, 2024
Dataset Overview
Basic Information
- Dataset Name: Bc2GmCorpus
- Language: English
- License: Unknown
- Multilinguality: Monolingual
- Size: 10K < n < 100K
- Source Data: Raw data
- Task Category: Token Classification
- Task ID: Named Entity Recognition
Dataset Structure
Features
- id: String identifier for the sentence.
- tokens: Sequence of strings representing the words in the sentence.
- ner_tags: Sequence of integer labels, where `0` denotes a token outside any gene mention, `1` the first token of a gene entity, and `2` subsequent tokens of the same gene entity.
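The 0/1/2 tagging scheme above can be turned back into entity spans with a single pass over each sample. The following is a minimal sketch, not part of the dataset tooling: the function name `extract_gene_spans` is hypothetical, the sample sentence is made up for illustration rather than taken from the corpus, and the tagged entities are treated as gene mentions in line with the dataset description.

```python
from typing import List, Tuple


def extract_gene_spans(tokens: List[str], ner_tags: List[int]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, text) for every tagged mention in one sample."""
    spans = []
    start = None  # index of the first token of the mention being built, if any
    for i, tag in enumerate(ner_tags):
        if tag == 1:  # a new mention begins; close any mention still open
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == 2:  # continuation of the current mention
            if start is None:  # stray continuation tag: treat it as a mention start
                start = i
        else:  # tag 0 closes any open mention
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None
    if start is not None:  # mention that runs to the end of the sentence
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


# Made-up sample in the dataset's schema (assumption: not an actual corpus sentence).
sample = {
    "id": "0",
    "tokens": ["Mutations", "in", "BRCA1", "and", "tumor", "protein", "p53", "were", "found", "."],
    "ner_tags": [0, 0, 1, 0, 1, 2, 2, 0, 0, 0],
}
print(extract_gene_spans(sample["tokens"], sample["ner_tags"]))
# → [(2, 3, 'BRCA1'), (4, 7, 'tumor protein p53')]
```

Joining tokens with a single space is a simplification; the original character offsets are not stored in these three fields, so the recovered text may differ in spacing from the source sentence.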
Data Splits
- Training set: 12,500 samples, 6,095,123 bytes
- Validation set: 2,500 samples, 1,215,919 bytes
- Test set: 5,000 samples, 2,454,589 bytes
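As a quick illustration of how the three splits relate, the sketch below computes each split's share of the samples and its average size per sample from the figures listed above. The helper names `split_share` and `avg_bytes_per_sample` are hypothetical.

```python
# Sample counts and byte sizes as listed for each split above.
SPLITS = {
    "train": {"samples": 12_500, "bytes": 6_095_123},
    "validation": {"samples": 2_500, "bytes": 1_215_919},
    "test": {"samples": 5_000, "bytes": 2_454_589},
}

total_samples = sum(split["samples"] for split in SPLITS.values())


def split_share(name: str) -> float:
    """Fraction of all samples that belong to the named split."""
    return SPLITS[name]["samples"] / total_samples


def avg_bytes_per_sample(name: str) -> float:
    """Average stored size of one sample in the named split."""
    return SPLITS[name]["bytes"] / SPLITS[name]["samples"]


print(f"total samples: {total_samples}")
for name in SPLITS:
    print(f"{name}: {split_share(name):.1%} of samples, "
          f"about {avg_bytes_per_sample(name):.0f} bytes per sample")
```

The counts give a 62.5% / 12.5% / 25% division across training, validation, and test, with a similar average sample size in each split.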
Size Metrics
- Download size: 2,154,630 bytes
- Dataset size: 9,765,631 bytes
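The size figures can be cross-checked against the per-split byte counts. This sketch assumes the usual Hugging Face convention that the dataset size is the total size of the prepared splits while the download size is the size of the files fetched from the hub; the variable names are illustrative.

```python
# Byte counts copied from the split and size listings on this page.
SPLIT_BYTES = {"train": 6_095_123, "validation": 1_215_919, "test": 2_454_589}
DOWNLOAD_SIZE = 2_154_630
DATASET_SIZE = 9_765_631

split_total = sum(SPLIT_BYTES.values())

# True when the three splits account for the whole reported dataset size.
sizes_consistent = split_total == DATASET_SIZE

# Ratio of downloaded bytes to prepared bytes (smaller means stronger compression).
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"sum of splits: {split_total} bytes, consistent: {sizes_consistent}")
print(f"download is {download_ratio:.1%} of the dataset size")
```

The three split sizes add up exactly to the reported dataset size, and the download is roughly 22% of that figure.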
Configuration
- Configuration Name: bc2gm_corpus
- Data Files:
  - Training: bc2gm_corpus/train-*
  - Validation: bc2gm_corpus/validation-*
  - Test: bc2gm_corpus/test-*
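Given the repository and configuration names above, loading the data with the Hugging Face `datasets` library might look like the sketch below. It assumes `datasets` is installed and the hub is reachable; the wrapper `load_bc2gm` is a hypothetical helper, and the `DATA_FILES` mapping only restates the file patterns listed above for reference.

```python
from typing import Optional

REPO_ID = "spyysalo/bc2gm_corpus"
CONFIG_NAME = "bc2gm_corpus"

# File patterns per split, as listed in the configuration above (reference only).
DATA_FILES = {
    "train": "bc2gm_corpus/train-*",
    "validation": "bc2gm_corpus/validation-*",
    "test": "bc2gm_corpus/test-*",
}


def load_bc2gm(split: Optional[str] = None):
    """Load the corpus from the hub; returns all splits when `split` is None."""
    # Imported lazily so that this module can be read without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG_NAME, split=split)


# Example usage (requires network access):
# train = load_bc2gm("train")
# print(train[0]["tokens"], train[0]["ner_tags"])
```

Passing the configuration name explicitly is a cautious choice; with a single configuration, the library would normally select it by default.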