DFKI-SLT/GDA
The GDA dataset is a sentence-level evaluation dataset for gene-disease association extraction, developed by Nourani and Reshadat (2020). Built on the DisGeNET and PubTator databases, it contains 8,000 sentences covering 1,904 diseases and 3,635 genes. It is split into training, validation, and test sets, and each instance provides fields such as gene ID, disease name, and association type. Construction involved extracting relevant sentences from PubMed abstracts and applying systematic filtering to ensure high-quality negative samples.
GDA Dataset Overview
Dataset Description
Dataset Summary
The GDA dataset was developed by Nourani and Reshadat (2020) as a sentence-level evaluation resource for extracting gene-disease associations from biomedical literature. It is built on the DisGeNET and PubTator databases and comprises 8,000 sentences covering 1,904 unique diseases and 3,635 unique genes.
Language
The language of the dataset is English.
Dataset Structure
Data Fields
- NofPmids (float64): Number of PubMed IDs associated with the gene-disease pair.
- NofSnps (float64): Number of single-nucleotide polymorphisms (SNPs) linked to the gene-disease pair.
- associationType (string): Type of association between gene and disease (e.g., Negative, Biomarker, Therapeutic).
- diseaseId (string): Unique identifier for the disease.
- diseaseName (string): Disease name.
- diseaseType (string): Category of disease (e.g., disease, group, phenotype).
- disease_mention (string): Specific mention of the disease in the source text.
- geneId (string): Unique identifier for the gene.
- geneSymbol (string): Symbolic representation of the gene.
- gene_mention (string): Specific mention of the gene in the source text.
- originalSource (string): Original source.
- pmid (int64): PubMed ID linked to the sentence.
- raw_sentence (string): Original sentence from the source document.
- score (float64): Confidence or relevance score for the gene-disease association.
- sentence (string): Sentence with span annotations.
- source (string): Database or repository providing the association data.
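The field list above can be turned into a small schema check, which may be useful when inspecting records after loading. The following is a minimal sketch, not part of the dataset's tooling: the names `GDA_FIELDS` and `check_record` are hypothetical, and mapping float64, int64, and string to Python `float`, `int`, and `str` is an assumption about how a loaded record is represented.

```python
from typing import Any

# Field name -> expected Python type, transcribed from the field list.
# Assumption: float64 / int64 / string arrive as float / int / str.
GDA_FIELDS: dict[str, type] = {
    "NofPmids": float,
    "NofSnps": float,
    "associationType": str,
    "diseaseId": str,
    "diseaseName": str,
    "diseaseType": str,
    "disease_mention": str,
    "geneId": str,
    "geneSymbol": str,
    "gene_mention": str,
    "originalSource": str,
    "pmid": int,
    "raw_sentence": str,
    "score": float,
    "sentence": str,
    "source": str,
}


def check_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record; an empty list means it matches."""
    problems = []
    # Every documented field must be present with the expected type.
    for name, expected in GDA_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Flag anything the field list does not mention.
    for name in record:
        if name not in GDA_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems
```

Calling `check_record(row)` on a single row (a plain dict) returns the mismatches as readable strings, so it can be dropped into a quick loop over a split. The check is deliberately strict: an integer in a float64 field is reported, which may or may not be desirable depending on how the data was loaded.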
Data Splits
- train: Training set with 4,000 samples, size 1,907,978 bytes.
- validation: Validation set with 2,400 samples, size 1,134,075 bytes.
- test: Test set with 1,600 samples, size 756,401 bytes.
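As a quick sanity check, the split sizes listed above can be compared against the 8,000-sentence total given in the summary. This is a small sketch using only the numbers stated in this card; the names `SPLITS` and `split_fractions` are illustrative and not part of any library.

```python
# Sample counts and sizes in bytes, copied from the split list.
SPLITS = {
    "train": {"samples": 4000, "bytes": 1_907_978},
    "validation": {"samples": 2400, "bytes": 1_134_075},
    "test": {"samples": 1600, "bytes": 756_401},
}


def split_fractions(splits: dict[str, dict[str, int]]) -> dict[str, float]:
    """Return each split's share of the total number of samples."""
    total = sum(info["samples"] for info in splits.values())
    return {name: info["samples"] / total for name, info in splits.items()}


total_samples = sum(info["samples"] for info in SPLITS.values())
fractions = split_fractions(SPLITS)

# Average stored size per sample, a rough indicator of record length.
avg_bytes = {name: info["bytes"] / info["samples"] for name, info in SPLITS.items()}

print(total_samples)
for name, share in fractions.items():
    print(f"{name}: {share:.0%} of samples, {avg_bytes[name]:.0f} bytes per sample")
```

The three splits add up to 8,000 samples, matching the summary, and correspond to a 50/30/20 division between training, validation, and test data.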
Citation
- Nourani, E., & Reshadat, V. (2020). Association extraction from biomedical literature based on representation and transfer learning. Journal of Theoretical Biology, 488, 110112. https://doi.org/10.1016/j.jtbi.2019.110112