
smiles-molecules-chembl

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It integrates chemical, bioactivity, and genomic data to help translate genomic information into effective new drugs. The dataset contains 1,941,405 molecules, split into training, validation, and test sets for distribution-learning and goal-directed molecular generation tasks, that is, generating new molecules with desired properties.

Source

huggingface

Created

Aug 6, 2024

Updated

Aug 6, 2024

Overview

Dataset description and usage context

ChEMBL Molecule Generation Dataset

Dataset Description

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It aggregates chemical, bioactivity, and genomic data to help translate genomic information into effective new drugs.

Task Description

The dataset is suitable for distribution-learning tasks (generating new molecules that resemble the training data) and goal-directed tasks (generating new molecules with predefined properties).
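
To make the distribution-learning task more concrete, here is a minimal sketch of two measures often reported for generative models: uniqueness and novelty. This card does not prescribe any evaluation metrics, so the function names and the toy molecules are illustrative assumptions. The sketch compares raw strings, which is only meaningful if all SMILES have been canonicalized beforehand (for example with a cheminformatics toolkit).

```python
def uniqueness(generated):
    """Fraction of generated molecules that are distinct."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated molecules absent from the training set."""
    unique = set(generated)
    if not unique:
        return 0.0
    return len(unique - set(training)) / len(unique)


# Toy data (hypothetical); assumed to be canonical SMILES already.
training = ["CCO", "CCN", "c1ccccc1"]
generated = ["CCO", "CCC", "CCC", "CCCl"]

print(f"uniqueness: {uniqueness(generated):.2f}")
print(f"novelty:    {novelty(generated, training):.2f}")
```

In practice, `training` would be the training split of this dataset and `generated` the output of a model trained on it.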

Dataset Statistics

Total molecules: 1,941,405
- Training set: 1,358,980 molecules
- Validation set: 194,123 molecules
- Test set: 388,302 molecules
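
As a quick consistency check on the figures above, the following sketch sums the three splits and computes the share of each. The counts come from this card; the interpretation as a roughly 70/10/20 split is inferred from the arithmetic and is not stated by the source.

```python
# Split sizes as listed in the dataset statistics.
splits = {"train": 1_358_980, "valid": 194_123, "test": 388_302}
total = sum(splits.values())

# Fraction of the full dataset held by each split.
fractions = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name:>5}: {count:>9,} molecules ({fractions[name]:.1%})")
print(f"total: {total:>9,} molecules")
```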

The random split was produced by Therapeutics Data Commons (TDC); entries with missing values were removed.
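
The sketch below illustrates what a random split with missing-value removal can look like. It is not the TDC implementation: the `random_split` helper, the 0.7/0.1/0.2 fractions (inferred from the split sizes), and the seed are all assumptions made for illustration.

```python
import random


def random_split(smiles, fractions=(0.7, 0.1, 0.2), seed=0):
    """Drop missing entries, shuffle, and cut into train/valid/test lists."""
    # Treat None and blank strings as missing values.
    cleaned = [s for s in smiles if s is not None and s.strip()]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(cleaned)

    n_train = int(fractions[0] * len(cleaned))
    n_valid = int(fractions[1] * len(cleaned))
    return {
        "train": cleaned[:n_train],
        "valid": cleaned[n_train:n_train + n_valid],
        "test": cleaned[n_train + n_valid:],  # remainder goes to test
    }


# Toy input: ten simple alkane SMILES plus two missing entries.
toy = ["C" * k for k in range(1, 11)] + [None, ""]
parts = random_split(toy)
print({name: len(items) for name, items in parts.items()})
```

Because the published dataset already ships with fixed splits, a helper like this is only needed when re-splitting the molecules for a custom experiment.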

References

Mendez, David, et al. “ChEMBL: towards direct deposition of bioassay data.” Nucleic Acids Research 47.D1 (2019): D930‑D940.
Davies, Mark, et al. “ChEMBL web services: streamlining access to drug discovery data and utilities.” Nucleic Acids Research 43.W1 (2015): W612‑W620.
