spyysalo/bc2gm_corpus

Bc2GmCorpus is a dataset for named entity recognition focusing on gene‑related entities. It comprises a training set, validation set, and test set containing 12,500, 2,500, and 5,000 samples respectively. Each sample includes an `id`, a list of `tokens`, and `ner_tags` indicating gene‑related entity annotations.

Updated 1/10/2024

hugging_face

Dataset Overview

Basic Information

Dataset Name: Bc2GmCorpus
Language: English
License: Unknown
Multilinguality: Monolingual
Size: 10K < n < 100K
Source Data: Raw data
Task Category: Token Classification
Task ID: Named Entity Recognition

Dataset Structure

Features

id: String identifier for the sentence.
tokens: Sequence of strings representing the words in the sentence.
ner_tags: Sequence of labels where 0 denotes a token outside any gene mention, 1 the first token of a gene mention, and 2 subsequent tokens of the same gene mention.
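
The 0/1/2 tag scheme above can be turned back into entity spans with a single pass over the tags. The following is a minimal sketch, not part of the dataset's own tooling: the function name `extract_mentions`, the constant names, and the example sentence are all made up for illustration.

```python
from typing import List, Tuple

# Integer tags as described in the Features list.
# The constant names are illustrative assumptions.
OUTSIDE, BEGIN, INSIDE = 0, 1, 2


def extract_mentions(tokens: List[str], ner_tags: List[int]) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for each tagged mention; end is exclusive."""
    mentions = []
    start = None  # index where the currently open mention began
    for i, tag in enumerate(ner_tags):
        if tag == BEGIN:
            # A new mention starts; close any mention that is still open.
            if start is not None:
                mentions.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == INSIDE:
            # Tolerate a continuation tag with no preceding BEGIN.
            if start is None:
                start = i
        else:
            if start is not None:
                mentions.append((start, i, " ".join(tokens[start:i])))
                start = None
    if start is not None:
        mentions.append((start, len(tokens), " ".join(tokens[start:])))
    return mentions


# Hypothetical sample, not taken from the corpus.
sample_tokens = ["The", "BRCA1", "gene", "is", "mutated", "."]
sample_tags = [0, 1, 2, 0, 0, 0]
print(extract_mentions(sample_tokens, sample_tags))
```

Joining tokens with a space is a simplification; the original sentence spacing is not stored in the `tokens` field.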

Data Splits

Training set: 12,500 samples, 6,095,123 bytes
Validation set: 2,500 samples, 1,215,919 bytes
Test set: 5,000 samples, 2,454,589 bytes

Size Metrics

Download size: 2,154,630 bytes
Dataset size: 9,765,631 bytes
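
As a quick consistency check on the figures above, the per-split byte counts should add up to the reported dataset size, and the sample counts should fall inside the stated 10K < n < 100K range. A small sketch, using only the numbers listed on this page (the variable names are illustrative):

```python
# Figures copied from the Data Splits and Size Metrics sections.
split_samples = {"train": 12_500, "validation": 2_500, "test": 5_000}
split_bytes = {"train": 6_095_123, "validation": 1_215_919, "test": 2_454_589}
dataset_size = 9_765_631
download_size = 2_154_630

total_samples = sum(split_samples.values())
total_bytes = sum(split_bytes.values())

# The split byte counts should account for the whole dataset size.
print("samples:", total_samples)
print("bytes match dataset size:", total_bytes == dataset_size)

# Ratio of the download to the in-memory dataset size.
compression_ratio = download_size / dataset_size
print(f"download / dataset size: {compression_ratio:.3f}")
```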

Configuration

Configuration Name: bc2gm_corpus
Data Files:
- Training: bc2gm_corpus/train-*
- Validation: bc2gm_corpus/validation-*
- Test: bc2gm_corpus/test-*
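
One plausible way to load these splits is through the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the repository is reachable under the ID shown in the page title; the wrapper name `load_bc2gm` is hypothetical, and the import is kept inside the function so that nothing is downloaded until it is called.

```python
from typing import Optional

# Repository ID as shown in the page title; split names follow the
# data file prefixes listed under Configuration.
REPO_ID = "spyysalo/bc2gm_corpus"
SPLITS = ("train", "validation", "test")


def load_bc2gm(split: Optional[str] = None):
    """Load one split, or all splits when `split` is None.

    Requires the third-party `datasets` package and network access
    on first use.
    """
    if split is not None and split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(REPO_ID, split=split)
```

Calling `load_bc2gm("train")` should return a dataset whose rows expose the `id`, `tokens`, and `ner_tags` fields described under Features.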
