spyysalo/bc2gm_corpus
Bc2GmCorpus is a dataset for named entity recognition focusing on gene‑related entities. It comprises a training set, validation set, and test set containing 12,500, 2,500, and 5,000 samples respectively. Each sample includes an `id`, a list of `tokens`, and `ner_tags` indicating gene‑related entity annotations.
Updated 1/10/2024
hugging_face
Description
Dataset Overview
Basic Information
- Dataset Name: Bc2GmCorpus
- Language: English
- License: Unknown
- Multilinguality: Monolingual
- Size: 10K < n < 100K
- Source Data: Raw data
- Task Category: Token Classification
- Task ID: Named Entity Recognition
Dataset Structure
Features
- id: String identifier for the sentence.
- tokens: Sequence of strings representing the words in the sentence.
- ner_tags: Sequence of labels where `0` denotes a token outside any gene mention, `1` the first token of a gene mention, and `2` subsequent tokens of the same gene mention.
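As a minimal sketch of how the tag scheme above can be read, the following function groups tokens into entity mentions. The function name `extract_mentions` and the sample sentence are invented for illustration and are not taken from the corpus.

```python
from typing import List, Tuple

# Tag ids as described above: 0 = outside, 1 = first token, 2 = continuation.
OUTSIDE, BEGIN, INSIDE = 0, 1, 2


def extract_mentions(tokens: List[str], ner_tags: List[int]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, text) for each tagged mention."""
    mentions = []
    start = None  # index where the current mention began, if any
    for i, tag in enumerate(ner_tags):
        # A mention ends when we reach an outside tag or a new begin tag.
        if start is not None and tag != INSIDE:
            mentions.append((start, i, " ".join(tokens[start:i])))
            start = None
        if tag == BEGIN:
            start = i
        elif tag == INSIDE and start is None:
            # Stray continuation without a begin tag: treat it as a new mention.
            start = i
    # Close a mention that runs to the end of the sentence.
    if start is not None:
        mentions.append((start, len(tokens), " ".join(tokens[start:])))
    return mentions


# Invented example sentence, not a real record from the dataset.
sample_tokens = ["Mutations", "in", "the", "BRCA1", "gene", "and", "p53", "were", "found", "."]
sample_tags = [0, 0, 0, 1, 2, 0, 1, 0, 0, 0]
print(extract_mentions(sample_tokens, sample_tags))
```

Treating a stray continuation tag as the start of a new mention is one possible convention; stricter evaluation scripts may discard such tokens instead.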
Data Splits
- Training set: 12,500 samples, 6,095,123 bytes
- Validation set: 2,500 samples, 1,215,919 bytes
- Test set: 5,000 samples, 2,454,589 bytes
Size Metrics
- Download size: 2,154,630 bytes
- Dataset size: 9,765,631 bytes
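The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed under Data Splits and Size Metrics; the variable names are illustrative.

```python
# Sample counts and byte sizes copied from the Data Splits list.
splits = {
    "train": {"samples": 12_500, "bytes": 6_095_123},
    "validation": {"samples": 2_500, "bytes": 1_215_919},
    "test": {"samples": 5_000, "bytes": 2_454_589},
}
# Values copied from the Size Metrics list.
download_bytes = 2_154_630
dataset_bytes = 9_765_631

total_samples = sum(s["samples"] for s in splits.values())
total_bytes = sum(s["bytes"] for s in splits.values())

# Share of samples and average record size for each split.
for name, s in splits.items():
    share = 100 * s["samples"] / total_samples
    per_sample = s["bytes"] / s["samples"]
    print(f"{name}: {share:.1f}% of samples, {per_sample:.0f} bytes per sample")

# The per-split byte sizes should add up to the reported dataset size.
print("split bytes match dataset size:", total_bytes == dataset_bytes)
print(f"download / dataset size: {download_bytes / dataset_bytes:.3f}")
```

The three splits hold 62.5%, 12.5%, and 25% of the 20,000 samples, the per-split byte sizes sum exactly to the reported dataset size, and the download is roughly 22% of that size.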
Configuration
- Configuration Name: bc2gm_corpus
- Data Files:
  - Training: bc2gm_corpus/train-*
  - Validation: bc2gm_corpus/validation-*
  - Test: bc2gm_corpus/test-*
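The data file entries above are glob patterns, so each split can span one or more files sharing a prefix. A minimal sketch of how a file path could be matched to its split follows; the helper `split_for` and the shard file names are hypothetical, since the card does not list actual file names.

```python
from fnmatch import fnmatch
from typing import Optional

# Glob patterns copied from the configuration listed above.
DATA_FILES = {
    "train": "bc2gm_corpus/train-*",
    "validation": "bc2gm_corpus/validation-*",
    "test": "bc2gm_corpus/test-*",
}


def split_for(path: str) -> Optional[str]:
    """Return the split whose pattern matches the path, or None if none does."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


# Hypothetical shard names, used only to exercise the patterns.
example_paths = [
    "bc2gm_corpus/train-00000-of-00001.parquet",
    "bc2gm_corpus/validation-00000-of-00001.parquet",
    "bc2gm_corpus/test-00000-of-00001.parquet",
    "README.md",
]
for path in example_paths:
    print(path, "->", split_for(path))
```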
Topics
Bioinformatics
Named Entity Recognition
Source
Organization: hugging_face
Created: Unknown