Dataset asset · Open Source Community · Drug Discovery · Molecular Generation

smiles-molecules-chembl

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It integrates chemical, bioactivity, and genomic data to help translate genomic information into effective new drugs. This dataset contains 1,941,405 molecules split into training, validation, and test sets for distribution-learning and goal-directed molecular generation tasks, i.e., generating new molecules with desired properties.

Source: Hugging Face
Created: Aug 6, 2024
Updated: Aug 6, 2024
Signals: 1,027 views
Availability: Linked source ready

Overview

Dataset description and usage context

ChEMBL Molecule Generation Dataset

Dataset Description

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It aggregates chemical, bioactivity, and genomic data to help translate genomic information into effective new drugs.

Task Description

The dataset is suitable for distribution-learning and goal-directed molecular generation tasks, i.e., generating new molecules with predefined attributes.
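
As a rough illustration of how output from a distribution-learning model is often scored, the sketch below computes two simple statistics over generated SMILES strings: uniqueness and novelty relative to the training set. These metrics and the helper names `uniqueness` and `novelty` are assumptions made for illustration and are not specified by this dataset card. The comparison is purely string-based; a real evaluation would usually canonicalize SMILES and check chemical validity with a cheminformatics toolkit first.

```python
def uniqueness(generated):
    """Fraction of generated strings that are distinct."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated strings that do not appear in the training set."""
    unique = set(generated)
    if not unique:
        return 0.0
    return len(unique - set(training)) / len(unique)


# Toy data, not taken from the dataset.
training_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
generated_smiles = ["CCO", "CCN", "CCN", "CCCl"]

print(uniqueness(generated_smiles))                # 3 distinct out of 4 generated
print(novelty(generated_smiles, training_smiles))  # 2 of the 3 distinct strings are unseen
```

Because different SMILES strings can encode the same molecule, string-level scores like these can overestimate both quantities unless the strings are canonicalized beforehand.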

Dataset Statistics

  • Total molecules: 1,941,405
    • Training set: 1,358,980 molecules
    • Validation set: 194,123 molecules
    • Test set: 388,302 molecules

The dataset was randomly split by the Therapeutics Data Commons, and entries with missing values were removed.
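
The split sizes listed above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the counts reported in this card; the names `split_sizes`, `reported_total`, and `split_fractions` are illustrative.

```python
# Counts as reported in the Dataset Statistics section.
split_sizes = {"train": 1_358_980, "valid": 194_123, "test": 388_302}
reported_total = 1_941_405


def split_fractions(sizes):
    """Return each split's share of the total number of molecules."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# The three splits should account for every molecule in the dataset.
assert sum(split_sizes.values()) == reported_total

fractions = split_fractions(split_sizes)
for name, frac in fractions.items():
    print(f"{name}: {frac:.4f}")
```

The counts add up to the reported total and correspond to roughly a 70/10/20 split between training, validation, and test.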

References

  1. Mendez, David, et al. “ChEMBL: towards direct deposition of bioassay data.” Nucleic Acids Research 47.D1 (2019): D930‑D940.
  2. Davies, Mark, et al. “ChEMBL web services: streamlining access to drug discovery data and utilities.” Nucleic Acids Research 43.W1 (2015): W612‑W620.