BioCreative II Gene Mention corpus

The BioCreative II Gene Mention corpus is a biomedical text-mining dataset for gene mention recognition, the task of locating gene names in text. It provides training and test data for natural language processing research.

Updated 4/25/2024

Description

Dataset Overview

Data Source

  • The original dataset includes bc2GMtrain_1.1.tar.gz and bc2GMtest_1.0.tar.gz, downloaded from BioCreative II Corpus, and extracted to the original-data directory.
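
The extraction step above can be sketched in Python as follows. This is a minimal illustration, not the repository's own script: the function name `extract_archives` is hypothetical, and the archives are assumed to have been downloaded to the working directory already.

```python
import tarfile
from pathlib import Path


def extract_archives(archives, dest="original-data"):
    """Extract each .tar.gz archive into dest and return the member names."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in archives:
        # "r:gz" opens a gzip-compressed tar archive for reading.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest_path)
            extracted.extend(tar.getnames())
    return extracted
```

A call such as `extract_archives(["bc2GMtrain_1.1.tar.gz", "bc2GMtest_1.0.tar.gz"])` would unpack both archives under original-data, provided the files are present locally.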

Data Processing

  1. Standoff Format Conversion

    • The raw data are converted to the BioNLP shared‑task style standoff format, stored in the standoff/{train,devel,test} directories.
    • 2,500 documents are moved from standoff/train to standoff/devel to form a development set.
  2. CoNLL Format Conversion

    • The standoff2conll tool converts the standoff format to CoNLL format, stored in the conll directory.
  3. Combined Dataset

    • A combined dataset incorporating GENE and ALTGENE versions is created, stored in combined-data/{train,test}.
    • Both standoff and CoNLL formats are generated, saved in combined-data/standoff/{train,devel,test} and combined-data/conll-{wide,narrow} respectively.
  4. Train / Devel Split

    • A development set of 2,500 sentences is split from the original data and stored in devel-split/{train,devel}.
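
Moving 2,500 documents from a training directory to a development directory could look roughly like the sketch below. The source does not say how the documents are chosen, so the seeded random sample is an assumption, as are the function name `split_devel` and the convention that each standoff document is a `.txt` file with an optional `.ann` file sharing the same stem.

```python
import random
import shutil
from pathlib import Path


def split_devel(train_dir, devel_dir, n_devel=2500, seed=0):
    """Move n_devel documents (.txt plus matching .ann) from train_dir to devel_dir."""
    train, devel = Path(train_dir), Path(devel_dir)
    devel.mkdir(parents=True, exist_ok=True)
    # Sort the IDs first so the same seed always selects the same documents.
    doc_ids = sorted(p.stem for p in train.glob("*.txt"))
    if n_devel > len(doc_ids):
        raise ValueError("not enough documents to form the development set")
    chosen = random.Random(seed).sample(doc_ids, n_devel)
    for doc_id in chosen:
        for suffix in (".txt", ".ann"):
            src = train / f"{doc_id}{suffix}"
            if src.exists():
                shutil.move(str(src), str(devel / src.name))
    return sorted(chosen)
```

Fixing the seed keeps the split reproducible, which matters when results on the development set are compared across runs.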

Data Formats

  • Original format: plain text files.
  • Standoff format: BioNLP shared‑task style.
  • CoNLL format: tab-separated (TSV) files in a CoNLL-like layout.
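
To make the two derived formats concrete, here is a small sketch of readers for both. It assumes contiguous text-bound annotation lines of the form `T1<TAB>GENE 0 4<TAB>text`, as used in BioNLP shared-task standoff files. For the TSV files it assumes one token per line, blank lines between sentences, the token in the first column and the tag in the last; the actual column layout in this dataset may differ. The names `TextBound`, `parse_standoff` and `read_conll` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_standoff(ann_text: str) -> List[TextBound]:
    """Parse contiguous text-bound ("T") annotation lines from a .ann file."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip any non text-bound annotation lines
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, start, end = type_and_offsets.split(" ")
        spans.append(TextBound(ann_id, label, int(start), int(end), text))
    return spans


def read_conll(tsv_text: str) -> List[List[Tuple[str, str]]]:
    """Read CoNLL-like TSV into sentences of (token, tag) pairs.

    Assumes the token is the first column and the tag is the last.
    """
    sentences, current = [], []
    for line in tsv_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences
```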

Data Versions

  • Two CoNLL‑format versions are provided, differing in how overlapping annotations are resolved:
    • "wide" version: where annotations overlap, keeps the longer one and discards the shorter ones.
    • "narrow" version: where annotations overlap, keeps the shorter ones and discards the longer one.
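
The difference between the two versions can be illustrated with a greedy overlap-resolution sketch. This is one plausible reading of the description above, not necessarily the procedure the conversion tool uses: spans are assumed to be half-open `(start, end)` character offsets, and the function name `resolve_overlaps` is hypothetical.

```python
def resolve_overlaps(spans, mode="wide"):
    """Greedily keep non-overlapping (start, end) spans.

    mode="wide" prefers longer spans; mode="narrow" prefers shorter spans.
    Offsets are treated as half-open intervals [start, end).
    """
    if mode not in ("wide", "narrow"):
        raise ValueError("mode must be 'wide' or 'narrow'")
    sign = -1 if mode == "wide" else 1
    # Order by preferred length first, then by start offset to break ties.
    ordered = sorted(spans, key=lambda s: (sign * (s[1] - s[0]), s[0]))
    kept = []
    for start, end in ordered:
        # Keep a span only if it does not overlap any span already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)
```

For the spans `(0, 10)`, `(0, 4)`, `(6, 10)` and `(20, 25)`, the "wide" mode keeps `(0, 10)` and `(20, 25)`, while the "narrow" mode keeps `(0, 4)`, `(6, 10)` and `(20, 25)`.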

Topics

Bioinformatics
Natural Language Processing

Source

Organization: github

Created: 6/9/2016
