smiles-molecules-chembl
ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It integrates chemical, bioactivity, and genomic data to help translate genomic information into effective new drugs. This dataset contains 1,941,405 molecules, split into training, validation, and test sets for distribution-learning and goal-directed molecular generation tasks, that is, generating new molecules with desired properties.
Dataset description and usage context
ChEMBL Molecule Generation Dataset
Dataset Description
ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It aggregates chemical, bioactivity, and genomic data to help translate genomic information into effective new drugs.
Task Description
The dataset is suitable for distribution-learning and goal-directed molecular generation tasks, that is, generating new molecules with predefined properties.
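As a rough illustration of how generated molecules are often scored in distribution-learning setups, the sketch below computes uniqueness and novelty on toy SMILES strings. The function names and the toy data are hypothetical, and the comparison is on raw strings; a real evaluation would normally canonicalize SMILES with a cheminformatics toolkit first, since one molecule can be written as several different strings.

```python
def uniqueness(generated):
    """Fraction of generated SMILES strings that are distinct."""
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated strings that do not occur in the training set."""
    unique = set(generated)
    return len(unique - set(training)) / len(unique)


# Toy data for illustration only (not taken from the dataset).
training = ["CCO", "CCN", "c1ccccc1"]
generated = ["CCO", "CCC", "CCC", "CCCl"]

uniq_score = uniqueness(generated)          # 3 distinct strings out of 4
novel_score = novelty(generated, training)  # 2 of the 3 distinct strings are unseen
print(f"uniqueness={uniq_score:.2f}, novelty={novel_score:.2f}")
```

Goal-directed generation would additionally score each molecule against a property objective, which is not shown here.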
Dataset Statistics
- Total molecules: 1,941,405
- Training set: 1,358,980 molecules
- Validation set: 194,123 molecules
- Test set: 388,302 molecules
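The listed split sizes can be checked with a few lines of arithmetic. This small sketch only uses the numbers given above; it confirms that the three sets add up to the stated total and shows that the split is roughly 70% / 10% / 20%.

```python
# Split sizes as listed in the dataset statistics.
split_sizes = {"train": 1_358_980, "valid": 194_123, "test": 388_302}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name:>5}: {size:>9,} molecules ({fractions[name]:.1%})")
print(f"total: {total:,}")
```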
The dataset was randomly split by the Therapeutics Data Commons and missing values were removed.
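A minimal sketch of this kind of preprocessing, dropping missing values and then splitting at random, might look as follows. It is not the Therapeutics Data Commons implementation: the function name `random_split`, the seed, and the 70/10/20 fractions (inferred from the split sizes listed above) are assumptions made for illustration.

```python
import random


def random_split(smiles, fractions=(0.7, 0.1, 0.2), seed=0):
    """Drop missing entries, shuffle, and cut into train/valid/test lists."""
    # Remove missing values (None or empty strings) before splitting.
    cleaned = [s for s in smiles if s]
    rng = random.Random(seed)  # local generator keeps the split reproducible
    rng.shuffle(cleaned)
    n_train = round(fractions[0] * len(cleaned))
    n_valid = round(fractions[1] * len(cleaned))
    return {
        "train": cleaned[:n_train],
        "valid": cleaned[n_train:n_train + n_valid],
        "test": cleaned[n_train + n_valid:],  # remainder goes to the test set
    }


# Toy input with two missing entries (illustrative, not from the dataset).
toy = ["CCO", "c1ccccc1", None, "CC(=O)O", "CCN", "", "CCC",
       "C1CCCCC1", "CO", "CC", "CCCl", "C=C"]
split = random_split(toy)
print({name: len(part) for name, part in split.items()})
```

Using a seeded `random.Random` instance rather than the module-level functions keeps the split reproducible without touching global random state.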
References
- Mendez, David, et al. “ChEMBL: towards direct deposition of bioassay data.” Nucleic Acids Research 47.D1 (2019): D930‑D940.
- Davies, Mark, et al. “ChEMBL web services: streamlining access to drug discovery data and utilities.” Nucleic Acids Research 43.W1 (2015): W612‑W620.