spyysalo/bc2gm_corpus

Bc2GmCorpus is a dataset for named entity recognition focusing on gene‑related entities. It comprises a training set, validation set, and test set containing 12,500, 2,500, and 5,000 samples respectively. Each sample includes an `id`, a list of `tokens`, and `ner_tags` indicating gene‑related entity annotations.

Updated 1/10/2024
hugging_face

Description

Dataset Overview

Basic Information

  • Dataset Name: Bc2GmCorpus
  • Language: English
  • License: Unknown
  • Multilinguality: Monolingual
  • Size: 10K < n < 100K
  • Source Data: Raw data
  • Task Category: Token Classification
  • Task ID: Named Entity Recognition

Dataset Structure

Features

  • id: String identifier for the sentence.
  • tokens: Sequence of strings representing the words in the sentence.
  • ner_tags: Sequence of integer labels, one per token, where 0 marks a token outside any gene mention, 1 the first token of a gene mention, and 2 each subsequent token of the same gene mention.
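
To make the 0/1/2 tagging scheme concrete, the following is a minimal sketch of how tagged tokens could be grouped back into entity spans. The helper name `extract_spans` and the sample sentence are illustrative assumptions, not taken from the dataset; only the field names `id`, `tokens`, and `ner_tags` come from the card.

```python
from typing import List, Tuple


def extract_spans(tokens: List[str], ner_tags: List[int]) -> List[Tuple[int, int, str]]:
    """Group tagged tokens into (start, end_exclusive, text) spans.

    Tag meaning, as listed in the feature description:
    0 = outside any entity, 1 = first token of an entity,
    2 = continuation of the current entity.
    """
    spans = []
    start = None  # index where the currently open entity began, if any
    for i, tag in enumerate(ner_tags):
        if tag == 1:
            # A new entity begins; close the previous one first.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == 2:
            # A continuation without a preceding start is treated as a start.
            if start is None:
                start = i
        else:
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None
    if start is not None:
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


# Hypothetical sample in the card's record layout (not a real dataset row).
sample = {
    "id": "0",
    "tokens": ["Mutations", "in", "BRCA1", "and", "tumor", "protein", "p53", "."],
    "ner_tags": [0, 0, 1, 0, 1, 2, 2, 0],
}

spans = extract_spans(sample["tokens"], sample["ner_tags"])
```

For the hypothetical sample, `spans` holds two entries: one covering the single token at index 2 and one covering the three tokens at indices 4 to 6.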

Data Splits

  • Training set: 12,500 samples, 6,095,123 bytes
  • Validation set: 2,500 samples, 1,215,919 bytes
  • Test set: 5,000 samples, 2,454,589 bytes

Size Metrics

  • Download size: 2,154,630 bytes
  • Dataset size: 9,765,631 bytes
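
The split and size figures above can be cross-checked with a little arithmetic. This sketch uses only the numbers listed on this page; the variable names are illustrative.

```python
# (samples, bytes) per split, copied from the figures on this page.
SPLITS = {
    "train": (12_500, 6_095_123),
    "validation": (2_500, 1_215_919),
    "test": (5_000, 2_454_589),
}
DATASET_SIZE_BYTES = 9_765_631
DOWNLOAD_SIZE_BYTES = 2_154_630

total_samples = sum(n for n, _ in SPLITS.values())
total_bytes = sum(b for _, b in SPLITS.values())

# Share of all samples that falls into each split.
fractions = {name: n / total_samples for name, (n, _) in SPLITS.items()}

# The per-split byte counts should add up to the reported dataset size.
sizes_consistent = total_bytes == DATASET_SIZE_BYTES

# Ratio of download size to in-memory dataset size.
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES
```

The three splits total 20,000 samples in a 62.5% / 12.5% / 25% ratio, and their byte counts sum exactly to the reported dataset size. The download is roughly 22% of that size, which would be consistent with a compressed storage format, although the page does not state the format.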

Configuration

  • Configuration Name: bc2gm_corpus
  • Data Files:
    • Training: bc2gm_corpus/train-*
    • Validation: bc2gm_corpus/validation-*
    • Test: bc2gm_corpus/test-*
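
As a hedged sketch, the configuration above could be loaded with the Hugging Face `datasets` library as shown below. It assumes that the repository id `spyysalo/bc2gm_corpus` and the configuration name `bc2gm_corpus` resolve on the Hub as listed here and that `datasets` is installed. The helpers `load_bc2gm` and `check_split_sizes` are hypothetical names introduced for illustration.

```python
REPO_ID = "spyysalo/bc2gm_corpus"
CONFIG_NAME = "bc2gm_corpus"

# Row counts per split, as reported on this page.
EXPECTED_ROWS = {"train": 12_500, "validation": 2_500, "test": 5_000}


def load_bc2gm():
    """Download and return all splits (requires network access)."""
    # Imported lazily so the rest of this module works without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG_NAME)


def check_split_sizes(num_rows):
    """Return {split: (actual, expected)} for every split whose count differs."""
    return {
        split: (num_rows.get(split), expected)
        for split, expected in EXPECTED_ROWS.items()
        if num_rows.get(split) != expected
    }
```

A possible usage, assuming the download succeeds, is `ds = load_bc2gm()` followed by `check_split_sizes({name: split.num_rows for name, split in ds.items()})`. An empty result means the loaded splits match the counts listed on this page.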

Topics

Bioinformatics
Named Entity Recognition

Source

Organization: hugging_face

Created: Unknown
