High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

bigbio/mlee

MLEE is a corpus of event extraction annotations on abstracts of papers about angiogenesis. It includes manually annotated entities, relations, events, and coreference information covering processes at the molecular, cellular, tissue, and organ levels.

hugging_face


DFKI-SLT/GDA

Gene‑Disease Association

Biomedical Text Mining

The GDA dataset is a sentence-level evaluation dataset for extracting gene-disease associations, developed by Nourani and Reshadat (2020). Built on the DisGeNET and PubTator databases, it contains 8,000 sentences covering 1,904 diseases and 3,635 genes. The dataset is split into training, validation, and test sets, and each instance provides fields such as gene ID, disease name, and association type. It was constructed by extracting relevant sentences from PubMed abstracts and applying systematic filtering to ensure high-quality negative samples.

hugging_face
