BioCreative II Gene Mention corpus

The BioCreative II Gene Mention corpus is a bioinformatics dataset used primarily for gene mention recognition. It provides training and test data for biomedical text mining and natural language processing research.

Source: GitHub
Created: Jun 9, 2016
Updated: Apr 25, 2024
Dataset Overview

Data Source

  • The original data are the archives bc2GMtrain_1.1.tar.gz and bc2GMtest_1.0.tar.gz, downloaded from BioCreative II Corpus and extracted into the original-data directory.
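
The extraction step can be sketched as follows. This is a minimal illustration, not the repository's own script: the helper name extract_archives is hypothetical, and it assumes both archives have already been downloaded to the working directory.

```python
import tarfile
from pathlib import Path


def extract_archives(archives, dest="original-data"):
    """Extract each .tar.gz archive into dest and return the member names."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    extracted = []
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            extracted.extend(tar.getnames())
            tar.extractall(dest_path, **kwargs)
    return extracted


# Example call (assumes the two archives are present locally):
# extract_archives(["bc2GMtrain_1.1.tar.gz", "bc2GMtest_1.0.tar.gz"])
```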

Data Processing

  1. Standoff Format Conversion

    • The raw data are converted to the BioNLP shared‑task style standoff format, stored in the standoff/{train,devel,test} directories.
    • 2,500 documents are moved from standoff/train to standoff/devel to form a development set.
  2. CoNLL Format Conversion

    • The standoff2conll tool converts the standoff format to CoNLL format, stored in the conll directory.
  3. Combined Dataset

    • A combined dataset that merges the GENE and ALTGENE annotation versions is created and stored in combined-data/{train,test}.
    • Both standoff and CoNLL formats are generated for it, saved in combined-data/standoff/{train,devel,test} and combined-data/conll-{wide,narrow}, respectively.
  4. Train / Devel Split

    • A development set of 2,500 sentences is split off from the original data, and the two resulting parts are stored in devel-split/{train,devel}.
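
The move of 2,500 documents from standoff/train to standoff/devel in step 1 might look like the sketch below. Several things here are assumptions rather than facts from the source: the function name split_off_devel is hypothetical, each document is assumed to be a .txt file with a matching .ann file, and the documents are chosen with a seeded random sample because the source does not say how they were selected.

```python
import random
import shutil
from pathlib import Path


def split_off_devel(train_dir, devel_dir, n_devel=2500, seed=0):
    """Move n_devel documents (.txt plus matching .ann) from train_dir to devel_dir."""
    train, devel = Path(train_dir), Path(devel_dir)
    devel.mkdir(parents=True, exist_ok=True)
    # One document ID per .txt file; sorting keeps the sample reproducible.
    doc_ids = sorted(p.stem for p in train.glob("*.txt"))
    if n_devel > len(doc_ids):
        raise ValueError("not enough documents in the training directory")
    # Assumption: a seeded random sample stands in for the unspecified selection rule.
    chosen = random.Random(seed).sample(doc_ids, n_devel)
    for doc_id in chosen:
        for suffix in (".txt", ".ann"):
            src = train / (doc_id + suffix)
            if src.exists():
                shutil.move(str(src), str(devel / src.name))
    return sorted(chosen)


# Example call (assumes the standoff directories from step 1 exist):
# split_off_devel("standoff/train", "standoff/devel", n_devel=2500)
```

Moving the text and annotation files together keeps each document's pair intact in whichever directory it ends up in.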

Data Formats

  • Original format: plain text files.
  • Standoff format: BioNLP shared‑task style.
  • CoNLL format: tab-separated (TSV) files in a CoNLL-like layout.
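
To make the two derived formats concrete, here is a minimal reading sketch. It rests on assumptions that should be checked against the actual files: standoff entity lines are taken to have the form ID, tab, "TYPE START END", tab, text, with contiguous spans only, and the CoNLL-like files are taken to hold one token per line with the token in the first column, the tag in the last column, and a blank line between sentences. Both function names are hypothetical.

```python
def parse_standoff_entities(ann_text):
    """Parse text-bound annotation lines such as 'T1<TAB>GENE 0 5<TAB>BRCA1'."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip anything that is not a text-bound annotation
        ann_id, type_and_span, text = line.split("\t", 2)
        # Assumption: a single contiguous span, written as "TYPE START END".
        label, start, end = type_and_span.split(" ")
        entities.append((ann_id, label, int(start), int(end), text))
    return entities


def read_conll(tsv_text):
    """Split CoNLL-like TSV text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in tsv_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences
```

For example, parse_standoff_entities("T1\tGENE 0 5\tBRCA1") yields one entity with label GENE covering characters 0 to 5.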

Data Versions

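  • Two CoNLL-format versions are provided, differing in how overlapping annotations are resolved:
    • "wide" version: where annotations overlap, the longer one is kept and the shorter ones are discarded.
    • "narrow" version: where annotations overlap, the shorter one is kept and the longer ones are discarded.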
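
The difference between the two versions can be illustrated with a small greedy sketch over (start, end) character spans. This is an assumption about one reasonable way to resolve overlaps, not necessarily the procedure the conversion tool uses, and the function name resolve_overlaps is hypothetical.

```python
def resolve_overlaps(spans, mode="wide"):
    """Return a non-overlapping subset of (start, end) spans.

    mode="wide" prefers longer spans; mode="narrow" prefers shorter ones.
    """
    if mode not in ("wide", "narrow"):
        raise ValueError("mode must be 'wide' or 'narrow'")
    # Visit spans in order of preference: longest first for wide, shortest first for narrow.
    ordered = sorted(spans, key=lambda s: s[1] - s[0], reverse=(mode == "wide"))
    kept = []
    for start, end in ordered:
        # Keep a span only if it does not overlap any span already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


spans = [(0, 5), (0, 10), (20, 25)]
print(resolve_overlaps(spans, "wide"))    # [(0, 10), (20, 25)]
print(resolve_overlaps(spans, "narrow"))  # [(0, 5), (20, 25)]
```

Spans that overlap nothing, such as (20, 25) here, survive in both versions; only the overlapping pair is treated differently.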