
DFKI-SLT/GDA

The GDA dataset is a sentence‑level evaluation dataset for extracting gene‑disease associations, developed by Nourani and Reshadat (2020). Built on the DisGeNET and PubTator databases, it contains 8,000 sentences covering 1,904 diseases and 3,635 genes. The dataset is split into training, validation, and test sets, and each instance provides fields such as gene ID, disease name, and association type. Construction involved extracting relevant sentences from PubMed abstracts and applying systematic filtering to ensure high‑quality negative samples.

Updated 6/22/2024
hugging_face

Description

GDA Dataset Overview

Dataset Description

Dataset Summary

The GDA dataset was developed by Nourani and Reshadat (2020) as a sentence‑level evaluation resource for extracting gene‑disease associations from biomedical literature. It is built on the DisGeNET and PubTator databases and comprises 8,000 sentences covering 1,904 unique diseases and 3,635 unique genes.

Language

The language of the dataset is English.

Dataset Structure

Data Fields

  • NofPmids: Number of PubMed IDs associated with the gene‑disease pair, type float64.
  • NofSnps: Number of single‑nucleotide polymorphisms (SNPs) linked to the gene‑disease pair, type float64.
  • associationType: Type of association between gene and disease (e.g., Negative, Biomarker, Therapeutic), type string.
  • diseaseId: Unique identifier for the disease, type string.
  • diseaseName: Disease name, type string.
  • diseaseType: Category of disease (e.g., disease, group, phenotype), type string.
  • disease_mention: Specific mention of the disease in the source text, type string.
  • geneId: Unique identifier for the gene, type string.
  • geneSymbol: Symbolic representation of the gene, type string.
  • gene_mention: Specific mention of the gene in the source text, type string.
  • originalSource: Original source, type string.
  • pmid: PubMed ID linked to the sentence, type int64.
  • raw_sentence: Original sentence from the source document, type string.
  • score: Confidence or relevance score for the gene‑disease association, type float64.
  • sentence: Sentence with span annotations, type string.
  • source: Database or repository providing the association data, type string.
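The field list above can be turned into a simple schema check. The following is a minimal sketch, not part of the dataset's own tooling: it assumes float64 maps to Python float, int64 to int, and string to str, and that no field is null. The names GDA_FIELDS and validate_record are hypothetical, and the example record uses placeholder values that are not taken from the dataset.

```python
# Field names and types as listed in the Data Fields section.
# Assumed mapping: float64 -> float, int64 -> int, string -> str.
GDA_FIELDS = {
    "NofPmids": float,
    "NofSnps": float,
    "associationType": str,
    "diseaseId": str,
    "diseaseName": str,
    "diseaseType": str,
    "disease_mention": str,
    "geneId": str,
    "geneSymbol": str,
    "gene_mention": str,
    "originalSource": str,
    "pmid": int,
    "raw_sentence": str,
    "score": float,
    "sentence": str,
    "source": str,
}


def validate_record(record):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    for name, expected in GDA_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    for name in record:
        if name not in GDA_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems


# Placeholder record for illustration only; none of these values are real data.
example = {name: ftype() for name, ftype in GDA_FIELDS.items()}
example["associationType"] = "Negative"

print(validate_record(example))  # prints []

# Dropping a field is reported as a problem.
broken = {k: v for k, v in example.items() if k != "pmid"}
print(validate_record(broken))  # prints ['missing field: pmid']
```

Real records may contain missing values in the numeric fields, in which case the check would need to be relaxed accordingly.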

Data Splits

  • train: Training set with 4,000 samples, size 1,907,978 bytes.
  • validation: Validation set with 2,400 samples, size 1,134,075 bytes.
  • test: Test set with 1,600 samples, size 756,401 bytes.
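As a quick consistency check, the split sizes above sum to the 8,000 sentences reported in the summary, which corresponds to a 50/30/20 division. A short sketch of that arithmetic follows; the variable names are illustrative.

```python
# Sample counts as listed in the Data Splits section.
SPLIT_SIZES = {"train": 4000, "validation": 2400, "test": 1600}

total = sum(SPLIT_SIZES.values())
fractions = {name: count / total for name, count in SPLIT_SIZES.items()}

print(total)  # prints 8000
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.0%}")
# prints:
# train: 50%
# validation: 30%
# test: 20%
```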

Citation

  • Nourani, E., & Reshadat, V. (2020). Association extraction from biomedical literature based on representation and transfer learning. Journal of Theoretical Biology, 488, 110112. https://doi.org/10.1016/j.jtbi.2019.110112


Topics

Gene‑Disease Association
Biomedical Text Mining

Source

Organization: hugging_face

Created: Unknown
