JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

ChEMBL v25

Drug Discovery
Chemoinformatics

All processed datasets are based on data extracted from ChEMBL version 25, which is provided under the Creative Commons Attribution‑ShareAlike 3.0 Unported license.

github

smiles-molecules-chembl

Drug Discovery
Molecular Generation

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It integrates chemical, bioactivity, and genomic data to help translate genomic information into effective new drugs. This dataset contains 1,941,405 molecules, split into training, validation, and test sets for two kinds of molecular generation tasks: distribution learning and goal-directed generation (generating new molecules with desired properties).
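The listing does not say how the training, validation, and test sets were produced, so the following is only a minimal sketch of one common way to get a reproducible three-way split over SMILES strings. The function name `assign_split` and the 5% validation and test fractions are assumptions made for illustration, not values taken from the dataset.

```python
import hashlib


def assign_split(smiles, valid_frac=0.05, test_frac=0.05):
    """Map a SMILES string to 'train', 'validation', or 'test'.

    The assignment depends only on the string itself, so it stays stable
    across runs and machines. The fractions are illustrative assumptions.
    """
    digest = hashlib.sha256(smiles.encode("utf-8")).digest()
    # Keep the top 53 bits so the ratio is an exact float in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + valid_frac:
        return "validation"
    return "train"


if __name__ == "__main__":
    for smi in ["CCO", "c1ccccc1", "CC(=O)O"]:
        print(smi, assign_split(smi))
```

Hashing the string instead of shuffling means a molecule keeps its split when the file is reordered or extended. Because the hash is taken over the raw text, SMILES should be canonicalized first so that two spellings of the same molecule land in the same split.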

huggingface

CHEN11, ASTEX, metapocket2 datasets, FPTRAIN, HOLO4K

Bioinformatics
Drug Discovery

CHEN11: 251 proteins with 476 ligands, used for ligand binding site (LBS) prediction benchmarks. ASTEX: the Astex diverse dataset. metapocket2: includes U/B48 (48 proteins in bound and unbound states), DT198 (198 drug-target complexes), and B210 (210 bound-state proteins). FPTRAIN: the dataset used to train the Fpocket pocket-scoring function. HOLO4K: a large set of protein-ligand complexes, including large multi-chain structures, downloaded directly from the PDB.

github

katielink/moses

Molecular Generation
Drug Discovery

The dataset is derived from the Clean Leads subset of the ZINC database, filtered by molecular weight (250–350 Da), number of rotatable bonds (≤ 7), and XlogP (≤ 3.5). Molecules were removed if they contained charged atoms, atoms other than C, N, S, O, F, Cl, Br, or H, or rings larger than eight atoms. Medicinal chemistry filters (MCFs) and PAINS filters were also applied. The final set comprises 1,936,962 molecular structures, split into a training set (~1.6 M molecules), a test set (~176 k molecules), and a scaffold test set (~176 k molecules). The scaffold test set contains unique Bemis-Murcko scaffolds that do not appear in the training or test sets, which makes it possible to evaluate a model's ability to generate novel scaffolds.
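The property filters and the scaffold hold-out described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used to build the dataset: descriptors and Bemis-Murcko scaffolds are assumed to be precomputed by a cheminformatics toolkit, the `MolRecord` type and both function names are hypothetical, the example records carry placeholder values, and the MCF and PAINS filters are not implemented.

```python
from dataclasses import dataclass

# Elements allowed by the description above.
ALLOWED_ELEMENTS = frozenset({"C", "N", "S", "O", "F", "Cl", "Br", "H"})


@dataclass(frozen=True)
class MolRecord:
    """Hypothetical record holding precomputed descriptors for one molecule."""
    mol_id: str
    mol_weight: float      # in Da
    rotatable_bonds: int
    xlogp: float
    elements: frozenset    # element symbols present in the molecule
    has_charged_atom: bool
    max_ring_size: int     # 0 for acyclic molecules
    scaffold: str          # precomputed Bemis-Murcko scaffold identifier


def passes_filters(m):
    """Apply the weight, rotatable-bond, XlogP, charge, element, and ring rules."""
    return (
        250.0 <= m.mol_weight <= 350.0
        and m.rotatable_bonds <= 7
        and m.xlogp <= 3.5
        and not m.has_charged_atom
        and m.elements <= ALLOWED_ELEMENTS
        and m.max_ring_size <= 8
    )


def split_by_scaffold(records, held_out_scaffolds):
    """Filter records, then move every molecule with a held-out scaffold aside."""
    kept = [m for m in records if passes_filters(m)]
    scaffold_test = [m for m in kept if m.scaffold in held_out_scaffolds]
    rest = [m for m in kept if m.scaffold not in held_out_scaffolds]
    return rest, scaffold_test


# Placeholder records; the descriptor values are made up for illustration.
CNOH = frozenset({"C", "N", "O", "H"})
EXAMPLES = [
    MolRecord("mol_a", 300.0, 4, 2.1, CNOH, False, 6, "scaf_1"),
    MolRecord("mol_b", 410.0, 4, 2.1, CNOH, False, 6, "scaf_1"),        # too heavy
    MolRecord("mol_c", 300.0, 4, 2.1, CNOH | {"P"}, False, 6, "scaf_1"),  # contains P
    MolRecord("mol_d", 280.0, 2, 1.0, CNOH, False, 5, "scaf_2"),
]

if __name__ == "__main__":
    rest, scaffold_test = split_by_scaffold(EXAMPLES, {"scaf_2"})
    print([m.mol_id for m in rest], [m.mol_id for m in scaffold_test])
```

Holding out whole scaffolds, and not randomly chosen molecules, is what ensures that no scaffold in the scaffold test set also appears in the remaining data.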

huggingface