Dataset asset, Open Source Community, Natural Language Processing, Bioinformatics
BioCreative II Gene Mention corpus
The BioCreative II Gene Mention corpus is a dataset in the field of bioinformatics, primarily used for gene mention recognition tasks. It includes training and test data to support biomedical text mining and natural language processing research.
Source: GitHub
Created: Jun 9, 2016
Updated: Apr 25, 2024
Dataset Overview
Data Source
- The original dataset consists of bc2GMtrain_1.1.tar.gz and bc2GMtest_1.0.tar.gz, downloaded from the BioCreative II corpus and extracted to the original-data directory.
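The extraction step above can be sketched in Python as follows. This is a minimal illustration, not the script used to build the corpus: the function name extract_archives is hypothetical, and the archives are assumed to be already downloaded into the working directory.

```python
import tarfile
from pathlib import Path


def extract_archives(archives, dest="original-data"):
    """Extract each .tar.gz archive into dest; return the archive member names."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in archives:
        # "r:gz" opens a gzip-compressed tar archive for reading.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest_path)
            extracted.extend(tar.getnames())
    return extracted
```

A call such as extract_archives(["bc2GMtrain_1.1.tar.gz", "bc2GMtest_1.0.tar.gz"]) would unpack both archives under original-data, assuming the files are present locally. Only extract archives from a source you trust, since tar members can carry unexpected paths.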
Data Processing
- Standoff Format Conversion
  - The raw data are converted to the BioNLP shared‑task style standoff format and stored in the standoff/{train,devel,test} directories.
  - 2,500 documents are moved from standoff/train to standoff/devel to form a development set.
- CoNLL Format Conversion
  - The standoff2conll tool converts the standoff format to CoNLL format, stored in the conll directory.
- Combined Dataset
  - A combined dataset incorporating the GENE and ALTGENE versions is created and stored in combined-data/{train,test}.
  - Both standoff and CoNLL formats are generated and saved in combined-data/standoff/{train,devel,test} and combined-data/conll-{wide,narrow} respectively.
- Train / Devel Split
  - A development set of 2,500 sentences is split from the original data and stored in devel-split/{train,devel}.
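The description does not say how the 2,500 development items were selected. The sketch below assumes a seeded random sample, which is only one plausible choice; the function name split_devel and the seed value are hypothetical and not taken from the corpus repository.

```python
import random


def split_devel(items, devel_size=2500, seed=0):
    """Split items into (train, devel), holding out devel_size items.

    Assumption: the held-out items are chosen by a seeded random sample.
    The original corpus may have used a different selection rule.
    """
    if devel_size > len(items):
        raise ValueError("devel_size exceeds the number of items")
    rng = random.Random(seed)
    devel_idx = set(rng.sample(range(len(items)), devel_size))
    # Preserve the original order within each part.
    train = [x for i, x in enumerate(items) if i not in devel_idx]
    devel = [x for i, x in enumerate(items) if i in devel_idx]
    return train, devel
```

Fixing the seed makes the split reproducible, so the train and devel parts can be regenerated identically when the pipeline is rerun.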
Data Formats
- Original format: plain text files.
- Standoff format: BioNLP shared‑task style.
- CoNLL format: TSV files similar to CoNLL.
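For the CoNLL-style TSV files, a reader along the following lines may be a useful starting point. The exact column layout is not stated here, so this sketch assumes one token per line, the token in the first column, the tag in the last column, and a blank line between sentences. The tag names in the usage example are illustrative only.

```python
def read_conll(lines):
    """Parse CoNLL-style lines into a list of sentences of (token, tag) pairs.

    Assumptions: tab-separated fields, token in the first column, tag in
    the last column, blank lines separating sentences.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences
```

Because the function accepts any iterable of lines, it can be used directly on an open file handle, for example `with open(path, encoding="utf-8") as f: sentences = read_conll(f)`, where path is a file under the conll directory.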
Data Versions
- Two CoNLL‑format versions are provided, differing in how overlapping annotations are resolved:
  - "wide" version: keeps the longer of two overlapping annotations and discards the shorter one.
  - "narrow" version: keeps the shorter of two overlapping annotations and discards the longer one.
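The difference between the two versions can be illustrated with a small greedy sketch over (start, end) character spans. This is an assumed approach for illustration, not necessarily the algorithm used to produce the corpus files; the function name resolve_overlaps is hypothetical and spans are treated as half-open intervals.

```python
def resolve_overlaps(spans, mode="wide"):
    """Return a non-overlapping subset of (start, end) spans.

    mode="wide" prefers longer spans; mode="narrow" prefers shorter ones.
    Greedy sketch: visit spans in preference order and keep each one that
    does not overlap a span already kept.
    """
    if mode not in ("wide", "narrow"):
        raise ValueError("mode must be 'wide' or 'narrow'")
    longest_first = mode == "wide"
    ordered = sorted(spans, key=lambda s: s[1] - s[0], reverse=longest_first)
    kept = []
    for start, end in ordered:
        # Half-open spans overlap unless one ends before the other starts.
        if all(end <= ks or start >= ke for ks, ke in kept):
            kept.append((start, end))
    return sorted(kept)
```

For example, given the spans (0, 10), (0, 4), and (20, 25), the wide mode keeps (0, 10) and (20, 25), while the narrow mode keeps (0, 4) and (20, 25).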