High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

damlab/human_hiv_ppi

This dataset is derived from the NCBI-maintained Human-HIV Interaction dataset and contains over 16,000 interaction pairs between HIV and human proteins. Fields include HIV protein product, HIV protein name, interaction type, human protein product, human protein name, reference list, description, HIV protein sequence, and human protein sequence. The dataset was created to train models that identify proteins that interact with HIV. It was manually curated by experts, which may bias it toward well-studied proteins and known interactions.

hugging_face

katielink/dm_alphamissense

Gene Variant Prediction

Bioinformatics

The Google DeepMind AlphaMissense database contains predictions for all possible single-nucleotide missense variants in human protein-coding genes, covering both the hg19 and hg38 genome builds. The dataset provides gene-level average predictions, predictions for all possible single-amino-acid substitutions, and predictions for non-canonical transcript isoforms. Each file includes chromosome, genomic position, reference and alternate nucleotides, UniProtKB identifier, transcript ID, protein variant, and the AlphaMissense pathogenicity score with its classification, among other fields. The dataset is distributed under the CC BY-NC-SA 4.0 license and may be used only for non-commercial research.

hugging_face
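
As a minimal sketch of how a file with the fields listed above might be filtered by classification, the following reads tab-separated rows from an in-memory sample. The column names (`CHROM`, `POS`, `REF`, `ALT`, `uniprot_id`, `transcript_id`, `protein_variant`, `am_pathogenicity`, `am_class`), the class labels, and every sample value are assumptions made for illustration; check them against the actual files before use.

```python
import csv
import io

# Made-up sample rows. Column names, identifiers, and scores are placeholders
# (assumptions), not real AlphaMissense predictions.
SAMPLE_TSV = (
    "CHROM\tPOS\tREF\tALT\tuniprot_id\ttranscript_id\tprotein_variant\t"
    "am_pathogenicity\tam_class\n"
    "chr1\t1000\tG\tA\tP12345\tENST00000000001.1\tV2M\t0.91\tlikely_pathogenic\n"
    "chr1\t1000\tG\tC\tP12345\tENST00000000001.1\tV2L\t0.12\tlikely_benign\n"
    "chr1\t1003\tT\tC\tP12345\tENST00000000001.1\tS3P\t0.45\tambiguous\n"
)


def filter_variants(handle, wanted_class):
    """Return rows whose classification equals wanted_class.

    The pathogenicity score of each kept row is converted to float.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["am_class"] == wanted_class:
            row["am_pathogenicity"] = float(row["am_pathogenicity"])
            rows.append(row)
    return rows


pathogenic = filter_variants(io.StringIO(SAMPLE_TSV), "likely_pathogenic")
for row in pathogenic:
    print(row["protein_variant"], row["am_pathogenicity"])
```

For a real file, the `io.StringIO` object would be replaced by an open file handle; the full tables are large, so streaming row by row as shown avoids loading everything into memory.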

bigbio/genia_term_corpus

Bioinformatics

Text Mining

The GENIA Term Corpus focuses on recognizing entities of interest in molecular biology such as proteins, genes, and cells, which is a fundamental task in biomedical text mining. The GENIA technical term annotations cover physical biological entities as well as other important terminology. The corpus annotates abstracts from the main GENIA corpus, totaling 1,999 abstracts.

hugging_face

UCLA Consortium for Neuropsychiatric Phenomics LA5c Study

Neuropsychiatric Disorders

Bioinformatics

This dataset is part of the UCLA Consortium for Neuropsychiatric Phenomics (LA5c) study and provides preprocessed data, including participant information, scan data, and derivative files. It records detailed scanning parameters, physiological recordings, and task events, and includes data visualizations and quality-control results.

github

mhc-peptides-dataset

Bioinformatics

Peptide Binding Prediction

The dataset contains 86,000 peptides with their binding affinity measurements and is intended for predicting whether a peptide binds.

github

CHEN11, ASTEX, metapocket2 datasets, FPTRAIN, HOLO4K

Bioinformatics

Drug Discovery

CHEN11: 251 proteins with 476 ligands, used as a benchmark for ligand binding site (LBS) prediction. ASTEX: the Astex diverse dataset. metapocket2: comprises U/B48 (48 proteins in bound and unbound states), DT198 (198 drug-target complexes), and B210 (210 bound-state proteins). FPTRAIN: a dataset for training the Fpocket pocket-scoring function. HOLO4K: a large set of protein-ligand complexes containing large multi-chain structures downloaded directly from the PDB.

github

liupf/ChEBI-20-MM

Chemoinformatics

Bioinformatics

The ChEBI-20-MM dataset is a multimodal benchmark extended from the ChEBI-20 dataset, focusing on molecular science. It integrates multiple molecular data modalities, including InChI, IUPAC, SELFIES, and images, to evaluate models on molecular generation, image recognition, IUPAC identification, molecular description, and retrieval tasks. By increasing modality diversity, the dataset provides a more comprehensive assessment of model performance on multimodal data processing.

hugging_face

spyysalo/bc2gm_corpus

Bioinformatics

Named Entity Recognition

Bc2GmCorpus is a dataset for named entity recognition focusing on gene‑related entities. It comprises a training set, validation set, and test set containing 12,500, 2,500, and 5,000 samples respectively. Each sample includes an `id`, a list of `tokens`, and `ner_tags` indicating gene‑related entity annotations.

hugging_face
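
To illustrate how the `tokens` and `ner_tags` fields described above fit together, here is a minimal sketch that groups tagged tokens into gene mentions. It assumes a BIO scheme with the integer mapping `0 = O`, `1 = B-GENE`, `2 = I-GENE`; that mapping, the function name, and the hand-written sample are illustrative assumptions and should be verified against the dataset's feature metadata.

```python
# Assumed label mapping (BIO scheme); verify against the dataset's metadata.
LABELS = {0: "O", 1: "B-GENE", 2: "I-GENE"}


def extract_gene_mentions(tokens, ner_tags):
    """Join consecutive B-GENE/I-GENE tokens into mention strings."""
    mentions = []
    current = []
    for token, tag in zip(tokens, ner_tags):
        label = LABELS[tag]
        if label == "B-GENE":
            # A new mention starts; close any mention still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif label == "I-GENE" and current:
            # Continue the open mention.
            current.append(token)
        else:
            # "O", or an I-GENE with no open mention (treated as outside).
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# Hand-written sample in the assumed format, not taken from the corpus.
sample = {
    "id": "0",
    "tokens": ["Mutations", "in", "BRCA1", "and", "tumor", "protein", "p53",
               "were", "found", "."],
    "ner_tags": [0, 0, 1, 0, 1, 2, 2, 0, 0, 0],
}
mentions = extract_gene_mentions(sample["tokens"], sample["ner_tags"])
print(mentions)
```

Joining tokens with a single space is a simplification; recovering the exact original spelling of a mention would require character offsets, which this sketch does not assume are available.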

BioCreative II Gene Mention corpus

Bioinformatics

Natural Language Processing

The BioCreative II Gene Mention corpus is a bioinformatics dataset used primarily for gene mention recognition. It includes training and test data that support research in biomedical text mining and natural language processing.

github

damlab/HIV_FLT

Bioinformatics

Viral Genomics

The dataset originates from the Los Alamos National Laboratory (LANL) HIV sequence database and contains 1,609 high-quality full-genome HIV sequences from the 2016 release. Sequences were processed with the GeneCutter tool and translated to amino-acid sequences using Biopython's `Seq.translate`. The dataset is intended to train an HIV-BERT model for predicting various HIV-related features. It includes the fields ID, gag, pol, env, nef, tat, rev, and proteome; each gene field holds the amino-acid sequence of the protein encoded by the corresponding HIV gene. The dataset can be used to study the sequence characteristics of HIV, a virus that has caused millions of deaths globally over the past decades.

hugging_face
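
The translation step mentioned above can be sketched without external dependencies. The snippet below is a simplified stand-in for Biopython's `Seq.translate`, not the dataset's actual pipeline: it applies the standard genetic code, writes stop codons as `*`, drops any trailing partial codon, and maps codons with unrecognized bases to `X`. The function name `translate` and these handling choices are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a nucleotide string codon by codon.

    Stop codons become '*', unknown codons become 'X', and a trailing
    partial codon is ignored.
    """
    seq = nucleotides.upper().replace("U", "T")  # accept RNA input as well
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


protein = translate("ATGGGTTAA")
print(protein)
```

In practice Biopython would normally be used, since it also handles ambiguity codes and alternative codon tables.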