BioCreative II Gene Mention corpus
The BioCreative II Gene Mention corpus is a dataset in the field of bioinformatics, primarily used for gene mention recognition tasks. It includes training and test data to support biomedical text mining and natural language processing research.
Updated 4/25/2024
github
Description
Dataset Overview
Data Source
- The original dataset consists of bc2GMtrain_1.1.tar.gz and bc2GMtest_1.0.tar.gz, downloaded from the BioCreative II corpus and extracted to the original-data directory.
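The extraction step can be sketched with Python's standard library. This is a minimal illustration, not the repository's own script: the helper name `extract_archives` is hypothetical, and it assumes the two archives have already been downloaded to the working directory.

```python
import tarfile
from pathlib import Path


def extract_archives(archives, dest="original-data"):
    """Extract each .tar.gz archive into dest and return the member names."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in archives:
        # "r:gz" opens a gzip-compressed tar archive for reading.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest_path)
            extracted.extend(tar.getnames())
    return extracted
```

With both archives present, a call such as `extract_archives(["bc2GMtrain_1.1.tar.gz", "bc2GMtest_1.0.tar.gz"])` would populate the original-data directory.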
Data Processing
Standoff Format Conversion
- The raw data are converted to the BioNLP shared-task style standoff format, stored in the standoff/{train,devel,test} directories.
- 2,500 documents are moved from standoff/train to standoff/devel to form a development set.
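Moving documents from standoff/train to standoff/devel might look like the sketch below. The source does not say how the 2,500 documents were chosen, so the seeded random sample here is an assumption, as are the `.txt`/`.ann` file pairing and the helper name `move_to_devel`.

```python
import random
import shutil
from pathlib import Path


def move_to_devel(train_dir, devel_dir, n_docs=2500, seed=0):
    """Move n_docs documents (text plus annotation file) from train to devel."""
    train, devel = Path(train_dir), Path(devel_dir)
    devel.mkdir(parents=True, exist_ok=True)
    # A document is identified by the stem shared by its .txt and .ann files.
    doc_ids = sorted(p.stem for p in train.glob("*.txt"))
    if n_docs > len(doc_ids):
        raise ValueError("not enough documents in the training directory")
    # Assumption: a seeded random sample; the original selection rule is unknown.
    chosen = random.Random(seed).sample(doc_ids, n_docs)
    for doc_id in chosen:
        for suffix in (".txt", ".ann"):
            src = train / (doc_id + suffix)
            if src.exists():
                shutil.move(str(src), str(devel / src.name))
    return chosen
```

Fixing the seed keeps the split reproducible, which matters when results on the development set are compared across runs.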
CoNLL Format Conversion
- The standoff2conll tool converts the standoff format to CoNLL format, stored in the conll directory.
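To illustrate what a standoff-to-CoNLL conversion involves, the sketch below maps character-offset annotations onto token-level BIO tags. It is not the standoff2conll implementation: it assumes text-bound annotation lines of the form `T1<TAB>GENE 4 7<TAB>p53` with continuous, non-overlapping spans, and it uses a simple regex tokenizer chosen for illustration only.

```python
import re


def parse_ann(ann_text):
    """Parse text-bound annotation lines into (start, end, label) tuples."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip anything that is not a text-bound annotation
        _ann_id, type_offsets, _covered = line.split("\t", 2)
        # Assumption: continuous spans only, i.e. "LABEL START END".
        label, start, end = type_offsets.split(" ")
        spans.append((int(start), int(end), label))
    return spans


def to_bio(text, spans):
    """Tokenize text and assign each token a B-/I-/O tag from the spans."""
    rows = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            # A token is tagged only if it lies fully inside the span.
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        rows.append((match.group(), tag))
    return rows


def to_conll(rows):
    """Render (token, tag) rows as tab-separated lines."""
    return "\n".join(f"{token}\t{tag}" for token, tag in rows)
```

A real converter also has to handle sentence splitting and annotations whose boundaries fall inside a token, which this sketch leaves out.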
Combined Dataset
- A combined dataset incorporating the GENE and ALTGENE versions is created, stored in combined-data/{train,test}.
- Both standoff and CoNLL formats are generated, saved in combined-data/standoff/{train,devel,test} and combined-data/conll-{wide,narrow} respectively.
Train / Devel Split
- A development set of 2,500 sentences is split from the original data and stored in devel-split/{train,devel}.
Data Formats
- Original format: plain text files.
- Standoff format: BioNLP shared‑task style.
- CoNLL format: TSV files similar to CoNLL.
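A CoNLL-style TSV file can be read with a few lines of Python. The sketch below assumes the token is in the first column, the tag is in the last column, and blank lines separate sentences. The exact column layout of the files in this dataset is not stated here, so treat those positions as assumptions; `read_conll` is a hypothetical helper name.

```python
def read_conll(lines):
    """Group tab-separated (token, tag) rows into sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        # Assumption: token in the first column, tag in the last.
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly.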
Data Versions
- Two CoNLL‑format versions are provided:
- "wide" version: retains longer overlapping annotations, discarding shorter ones.
- "narrow" version: retains shorter annotations, discarding longer ones.
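The difference between the two versions can be shown with a small overlap-resolution sketch. The greedy strategy below is an assumption made for illustration, not necessarily how the files in this dataset were produced: spans are visited longest-first for "wide" or shortest-first for "narrow", and a span is kept only if it does not overlap one already kept. Spans are (start, end) character offsets with an exclusive end.

```python
def resolve_overlaps(spans, mode="wide"):
    """Keep longer ("wide") or shorter ("narrow") spans where spans overlap."""
    if mode not in ("wide", "narrow"):
        raise ValueError("mode must be 'wide' or 'narrow'")
    # Visit longest spans first for "wide", shortest first for "narrow".
    ordered = sorted(spans, key=lambda s: s[1] - s[0], reverse=(mode == "wide"))
    kept = []
    for start, end in ordered:
        # Keep the span only if it is disjoint from every span kept so far.
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)
```

For the spans (0, 20), (9, 20) and (25, 30), the "wide" mode keeps (0, 20) and (25, 30), while the "narrow" mode keeps (9, 20) and (25, 30).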
Topics
Bioinformatics
Natural Language Processing
Source
Organization: github
Created: 6/9/2016