Dataset asset · Open Source Community · Bioinformatics · Named Entity Recognition

spyysalo/bc2gm_corpus

Bc2GmCorpus (the BioCreative II Gene Mention corpus) is a named entity recognition dataset for gene‑related entities. It is split into training, validation, and test sets of 12,500, 2,500, and 5,000 samples respectively. Each sample has an `id`, a list of `tokens`, and a parallel list of `ner_tags` marking which tokens belong to a gene mention.

Source
hugging_face
Created
Nov 28, 2025
Updated
Jan 10, 2024
Dataset Overview

Basic Information

  • Dataset Name: Bc2GmCorpus
  • Language: English
  • License: Unknown
  • Multilinguality: Monolingual
  • Size: 10K < n < 100K
  • Source Data: Raw data
  • Task Category: Token Classification
  • Task ID: Named Entity Recognition

Dataset Structure

Features

  • id: String identifier for the sentence.
  • tokens: Sequence of strings representing the words in the sentence.
  • ner_tags: Sequence of integer labels, one per token, where 0 marks a token outside any gene mention, 1 the first token of a gene mention, and 2 each subsequent token of the same gene mention.
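
The 0/1/2 tagging scheme above can be turned back into entity spans with a short helper. The following is a minimal sketch, not part of the dataset itself: the function name `extract_gene_spans`, the label names in `LABELS`, and the sample sentence are all illustrative assumptions. Only the integer scheme (0 outside, 1 first token, 2 continuation) comes from the card.

```python
from typing import List, Tuple

# Assumed human-readable names for the integer tags 0, 1, 2.
# The card documents only the integers, not these strings.
LABELS = ["O", "B-GENE", "I-GENE"]


def extract_gene_spans(tokens: List[str], ner_tags: List[int]) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for each tagged mention; `end` is exclusive."""
    if len(tokens) != len(ner_tags):
        raise ValueError("tokens and ner_tags must have the same length")

    spans = []
    start = None  # index where the currently open mention began, if any
    for i, tag in enumerate(ner_tags):
        if tag == 1:
            # A new mention starts; close any mention that is still open.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == 2:
            # Continuation; tolerate a stray 2 with no preceding 1.
            if start is None:
                start = i
        else:
            # Outside any mention; close the open one, if present.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None
    if start is not None:
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


# Hypothetical sample in the documented shape (not taken from the corpus).
sample = {
    "id": "0",
    "tokens": ["Mutations", "in", "the", "BRCA1", "gene", "and", "p53", "were", "found", "."],
    "ner_tags": [0, 0, 0, 1, 2, 0, 1, 0, 0, 0],
}
spans = extract_gene_spans(sample["tokens"], sample["ner_tags"])
print(spans)  # [(3, 5, 'BRCA1 gene'), (6, 7, 'p53')]
```

Returning token offsets alongside the joined text keeps the spans usable for span-level evaluation, which normally compares boundaries rather than strings.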

Data Splits

  • Training set: 12,500 samples, 6,095,123 bytes
  • Validation set: 2,500 samples, 1,215,919 bytes
  • Test set: 5,000 samples, 2,454,589 bytes

Size Metrics

  • Download size: 2,154,630 bytes
  • Dataset size: 9,765,631 bytes
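
As a quick sanity check, the per-split figures listed above can be compared against the reported dataset size. This sketch only does arithmetic on the numbers from this card; the variable names are illustrative.

```python
# (samples, bytes) per split, copied from the Data Splits list.
SPLITS = {
    "train": (12_500, 6_095_123),
    "validation": (2_500, 1_215_919),
    "test": (5_000, 2_454_589),
}
DATASET_SIZE_BYTES = 9_765_631  # from Size Metrics

total_samples = sum(n for n, _ in SPLITS.values())
total_bytes = sum(b for _, b in SPLITS.values())

# Fraction of all samples that falls into each split.
shares = {name: n / total_samples for name, (n, _) in SPLITS.items()}

print(total_samples)                      # 20000
print(total_bytes == DATASET_SIZE_BYTES)  # True
print(shares)                             # {'train': 0.625, 'validation': 0.125, 'test': 0.25}
```

The split byte counts sum exactly to the reported dataset size. The smaller download size is a separate figure for the files as transferred.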

Configuration

  • Configuration Name: bc2gm_corpus
  • Data Files:
    • Training: bc2gm_corpus/train-*
    • Validation: bc2gm_corpus/validation-*
    • Test: bc2gm_corpus/test-*
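
Given the repository ID and configuration name above, loading the dataset might look like the following sketch. It assumes the Hugging Face `datasets` package is installed and that the repository resolves under this ID with this configuration; the helper names `load_bc2gm` and `summarize_splits` are hypothetical.

```python
REPO_ID = "spyysalo/bc2gm_corpus"  # repository ID shown at the top of this page
CONFIG_NAME = "bc2gm_corpus"       # configuration name listed above


def load_bc2gm():
    """Download and return all splits (needs network access and `pip install datasets`)."""
    # Imported lazily so the rest of this module works without the package.
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG_NAME)


def summarize_splits(splits):
    """Map each split name to its number of rows; works on any mapping of sized objects."""
    return {name: len(rows) for name, rows in splits.items()}


# Usage, assuming the download succeeds:
#   ds = load_bc2gm()
#   print(summarize_splits(ds))
#   print(ds["train"][0])  # a dict with "id", "tokens", "ner_tags"
```

If the card's figures are accurate, `summarize_splits` should report 12,500, 2,500, and 5,000 rows for the train, validation, and test splits.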