smiles-molecules-chembl
ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It integrates chemical, bioactivity, and genomic data to facilitate the translation of genomic information into effective new drugs. The dataset contains 1,941,405 molecules, split into training, validation, and test sets for distribution-learning and goal-directed molecular generation tasks, i.e., generating new molecules with desired properties.
Description
ChEMBL Molecule Generation Dataset
Dataset Description
ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It aggregates chemical, bioactivity, and genomic data to facilitate the translation of genomic information into effective new drugs.
Task Description
The dataset is suitable for distribution-learning and goal-directed molecular generation tasks, i.e., generating new molecules that resemble the training distribution or that have predefined attributes.
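As a rough illustration of how a distribution-learning run on this dataset might be checked, the sketch below computes two metrics that are commonly reported for generative models: uniqueness and novelty. These metrics are not specified by the dataset card; the function names and the toy SMILES lists are hypothetical. The comparison is plain string matching, so it assumes the SMILES strings have already been canonicalized and checked for validity with a cheminformatics toolkit.

```python
def uniqueness(generated):
    """Fraction of generated SMILES strings that are distinct."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated SMILES that do not appear in the training set."""
    unique_generated = set(generated)
    if not unique_generated:
        return 0.0
    return len(unique_generated - set(training)) / len(unique_generated)


# Toy data for illustration only; real inputs would be the training split
# and the output of a generative model.
training = ["CCO", "c1ccccc1", "CC(=O)O"]
generated = ["CCO", "CCN", "CCN", "CCCl"]

print(f"uniqueness: {uniqueness(generated):.2f}")
print(f"novelty:    {novelty(generated, training):.2f}")
```

String-level comparison keeps the example free of dependencies, but two different SMILES strings can encode the same molecule, so a real evaluation would canonicalize first.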
Dataset Statistics
- Total molecules: 1,941,405
- Training set: 1,358,980 molecules
- Validation set: 194,123 molecules
- Test set: 388,302 molecules
The dataset was randomly split by the Therapeutics Data Commons (TDC), and missing values were removed.
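The split sizes listed above can be sanity-checked with a few lines of Python: they sum to the stated total and correspond to roughly 70% / 10% / 20%. The `random_split` helper is a hypothetical sketch of a seeded random split in those proportions. It is not the Therapeutics Data Commons implementation, and the seed and rounding behavior are assumptions.

```python
import random

# Split sizes as reported in the dataset statistics.
SPLIT_SIZES = {"train": 1_358_980, "valid": 194_123, "test": 388_302}
TOTAL_MOLECULES = 1_941_405


def split_fractions(sizes):
    """Return each split's share of the total molecule count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def random_split(items, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle items with a fixed seed and cut them into train/valid/test lists.

    Hypothetical helper: the test split receives whatever remains after the
    train and valid splits are taken.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# The three splits account for every molecule in the dataset.
assert sum(SPLIT_SIZES.values()) == TOTAL_MOLECULES

for name, fraction in split_fractions(SPLIT_SIZES).items():
    print(f"{name}: {fraction:.1%}")
```

Fixing the seed makes the split reproducible, which matters when results from different generative models are compared on the same held-out molecules.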
References
- Mendez, David, et al. “ChEMBL: towards direct deposition of bioassay data.” Nucleic Acids Research 47.D1 (2019): D930‑D940.
- Davies, Mark, et al. “ChEMBL web services: streamlining access to drug discovery data and utilities.” Nucleic Acids Research 43.W1 (2015): W612‑W620.
Source
Organization: huggingface
Created: 8/6/2024