Dataset Catalog

Browse trusted datasets for evaluation, enrichment, and production use.

Showing 10 of 10 datasets
Category: Bioinformatics

damlab/human_hiv_ppi

Bioinformatics | Protein-Protein Interaction

This dataset is extracted from the NCBI-maintained Human-HIV Interaction dataset and contains more than 16,000 interaction pairs between HIV and human proteins. Fields include HIV protein product, HIV protein name, interaction type, human protein product, human protein name, reference list, description, HIV protein sequence, and human protein sequence. The dataset was created to train models that identify proteins interacting with HIV. It was manually curated by experts, which may bias it toward well-studied proteins and known interactions.

Source hugging_face | Updated Apr 4, 2022 | 88 views | Linked

katielink/dm_alphamissense

Gene Variant Prediction | Bioinformatics

The Google DeepMind AlphaMissense database contains predictions for all possible single-nucleotide missense variants in human protein-coding genes, covering both the hg19 and hg38 genome builds. The dataset provides gene-level average predictions, predictions for all possible single-amino-acid substitutions, and predictions for non-canonical transcript isoforms. Each file includes chromosome, genomic position, reference and alternate nucleotides, UniProtKB identifier, transcript ID, protein variant, and the AlphaMissense pathogenicity score and its classification, among other fields. The dataset is licensed under CC BY-NC-SA 4.0 and may be used only for non-commercial research.

Source hugging_face | Updated Oct 5, 2023 | 259 views | Linked

bigbio/genia_term_corpus

Bioinformatics | Text Mining

The GENIA Term Corpus focuses on recognizing entities of interest in molecular biology, such as proteins, genes, and cells, a fundamental task in biomedical text mining. The GENIA technical term annotations cover physical biological entities as well as other important terminology. The annotations span 1,999 abstracts drawn from the main GENIA corpus.

Source hugging_face | Updated Dec 22, 2022 | 267 views | Linked

UCLA Consortium for Neuropsychiatric Phenomics LA5c Study

Neuropsychiatric Disorders | Bioinformatics

This dataset is part of the UCLA Consortium for Neuropsychiatric Phenomics (LA5c) study and provides preprocessed data, including participant information, scan data, and derivative files. It records detailed scanning parameters, physiological recordings, and task events, and includes data visualizations and quality-control results.

Source github | Updated Jan 29, 2024 | 316 views | Linked

mhc-peptides-dataset

Bioinformatics | Peptide Binding Prediction

The dataset contains 86,000 peptides with their binding affinity measurements and is intended for predicting whether a given peptide binds.

Source github | Updated Apr 13, 2020 | 112 views | Linked

CHEN11, ASTEX, metapocket2 datasets, FPTRAIN, HOLO4K

Bioinformatics | Drug Discovery

CHEN11: 251 proteins with 476 ligands, used for ligand binding site (LBS) prediction benchmarks.
ASTEX: the Astex diverse dataset.
metapocket2: includes U/B48 (48 proteins in bound and unbound states), DT198 (198 drug-target complexes), and B210 (210 bound-state proteins).
FPTRAIN: the dataset used to train the Fpocket pocket-scoring function.
HOLO4K: a large set of protein-ligand complexes, comprising large multi-chain structures downloaded directly from the PDB.

Source github | Updated Apr 11, 2024 | 145 views | Linked

liupf/ChEBI-20-MM

Chemoinformatics | Bioinformatics

The ChEBI-20-MM dataset is a multimodal benchmark extended from the ChEBI-20 dataset, focusing on molecular science. It integrates multiple molecular data modalities, including InChI, IUPAC, SELFIES, and images, to evaluate models on molecular generation, image recognition, IUPAC identification, molecular description, and retrieval tasks. By increasing modality diversity, the dataset provides a more comprehensive assessment of model performance on multimodal data processing.

Source hugging_face | Updated Jun 17, 2024 | 193 views | Linked

spyysalo/bc2gm_corpus

Bioinformatics | Named Entity Recognition

Bc2GmCorpus is a dataset for named entity recognition focusing on gene‑related entities. It comprises a training set, validation set, and test set containing 12,500, 2,500, and 5,000 samples respectively. Each sample includes an `id`, a list of `tokens`, and `ner_tags` indicating gene‑related entity annotations.
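
To make the sample layout concrete, here is a minimal sketch of turning one sample's `tokens` and `ner_tags` into a list of gene mentions. The integer-to-label mapping (0 = O, 1 = B-GENE, 2 = I-GENE) is an assumption that the entry above does not state, and the sample sentence is made up for illustration; check the dataset's own label names before relying on either.

```python
# Assumed label scheme (not stated in the catalog entry): 0 = O, 1 = B-GENE, 2 = I-GENE.
LABELS = ["O", "B-GENE", "I-GENE"]


def gene_mentions(tokens, ner_tags):
    """Collect gene mentions from parallel token and tag lists (BIO scheme)."""
    mentions = []
    current = []  # tokens of the mention currently being built
    for token, tag in zip(tokens, ner_tags):
        label = LABELS[tag]
        if label == "B-GENE":
            # A new mention starts; close any mention still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif label == "I-GENE" and current:
            # Continue the open mention.
            current.append(token)
        else:
            # "O" tag, or a stray I-GENE with no open mention (ignored here).
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# Hypothetical sample in the documented shape: id, tokens, ner_tags.
sample = {
    "id": "0",
    "tokens": ["Mutations", "in", "BRCA1", "and", "p53", "protein", "were", "found"],
    "ner_tags": [0, 0, 1, 0, 1, 2, 0, 0],
}
print(gene_mentions(sample["tokens"], sample["ner_tags"]))
```

Joining tokens with spaces is a simplification; recovering exact character offsets would need the original untokenized text.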

Source hugging_face | Updated Jan 10, 2024 | 179 views | Linked

BioCreative II Gene Mention corpus

Bioinformatics | Natural Language Processing

The BioCreative II Gene Mention corpus is a bioinformatics dataset used primarily for gene mention recognition. It includes training and test data to support research in biomedical text mining and natural language processing.

Source github | Updated Apr 25, 2024 | 164 views | Linked

damlab/HIV_FLT

Bioinformatics | Viral Genomics

The dataset originates from the Los Alamos National Laboratory (LANL) HIV sequence database and contains 1,609 high-quality full-genome HIV sequences from the 2016 release. Sequences were processed with the GeneCutter tool and translated to amino-acid sequences using Biopython's `Seq.translate`. The dataset is intended to train an HIV-BERT model for predicting various HIV-related features. It includes the fields ID, gag, pol, env, nef, tat, rev, and proteome; each gene field holds the amino-acid sequence of the protein encoded by the corresponding HIV gene. The dataset can be used for research on the sequence characteristics of HIV, a virus that has caused millions of deaths globally over the past decades.
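
The translation step can be illustrated with a small, dependency-free sketch. It is a stand-in for, not a copy of, the `Seq.translate` call mentioned above: it uses the standard genetic code, writes stop codons as `*`, maps unrecognized codons to `X`, and silently drops a trailing partial codon. Those handling choices, the `translate` helper, and the toy input are assumptions made for illustration and are not taken from the dataset's actual pipeline.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a DNA string to amino acids, reading from position 0.

    Stop codons become '*', codons with unknown bases become 'X',
    and a trailing partial codon is ignored (an assumption of this sketch).
    """
    seq = nucleotides.upper().replace("U", "T")  # accept RNA input too
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


# Toy input, not a real HIV sequence.
print(translate("ATGGGTGCGAGATAA"))
```

For real work, Biopython's own translation handles ambiguity codes and alternative codon tables, which this sketch does not.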

Source hugging_face | Updated Feb 8, 2022 | 175 views | Linked