smiles-molecules-chembl

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It integrates chemical, bioactivity, and genomic data to help translate genomic information into effective new drugs. The dataset contains 1,941,405 molecules, split into training, validation, and test sets for distribution-learning and goal-directed molecular generation tasks, i.e., generating new molecules with desired properties.

Updated 8/6/2024

Description

ChEMBL Molecule Generation Dataset

Dataset Description

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It aggregates chemical, bioactivity, and genomic data to help translate genomic information into effective new drugs.

Task Description

The dataset is suitable for distribution-learning and goal-directed molecular generation tasks, i.e., generating new molecules with predefined properties.
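
For distribution learning, generated molecules are commonly scored on how many are distinct and how many are absent from the training set. The following is a minimal sketch of those two checks, not part of the dataset itself. The function names and the toy SMILES lists are illustrative assumptions, and the comparison is on raw strings, so it assumes the SMILES are already canonicalized. A real evaluation would also validate and canonicalize each string with a cheminformatics toolkit.

```python
def uniqueness(generated):
    """Fraction of generated SMILES strings that are distinct."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated SMILES that do not occur in the training set."""
    unique = set(generated)
    if not unique:
        return 0.0
    return len(unique - set(training)) / len(unique)


# Toy data for illustration only (assumed canonical SMILES, not taken from the dataset).
training = ["CCO", "c1ccccc1"]
generated = ["CCO", "CCN", "CCN", "CC(=O)O"]

u = uniqueness(generated)         # 3 distinct strings out of 4 generated
n = novelty(generated, training)  # 2 of the 3 distinct strings are not in training
```

String comparison keeps the sketch dependency-free, but it treats two different spellings of the same molecule as different, which is why canonicalization matters in practice.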

Dataset Statistics

  • Total molecules: 1,941,405
    • Training set: 1,358,980 molecules
    • Validation set: 194,123 molecules
    • Test set: 388,302 molecules

The Therapeutics Data Commons split the dataset randomly and removed entries with missing values.
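
The split sizes above sum to the stated total and correspond to roughly 70% / 10% / 20%. The sketch below checks that arithmetic and illustrates what a seeded random split with missing-value removal can look like. It is not the Therapeutics Data Commons implementation: the function names, the default fractions, and the seed are assumptions made for illustration.

```python
import random

# Split sizes as stated in the dataset statistics.
SPLIT_SIZES = {"train": 1_358_980, "valid": 194_123, "test": 388_302}
TOTAL = 1_941_405


def split_fractions(sizes):
    """Return each split's share of the total number of molecules."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def random_split(smiles, fractions=(0.7, 0.1, 0.2), seed=0):
    """Drop missing entries, shuffle with a fixed seed, and cut into three parts.

    Hypothetical helper; the fractions and seed are illustrative assumptions.
    """
    cleaned = [s for s in smiles if s]  # removes None and empty strings
    rng = random.Random(seed)
    rng.shuffle(cleaned)
    n_train = int(fractions[0] * len(cleaned))
    n_valid = int(fractions[1] * len(cleaned))
    return {
        "train": cleaned[:n_train],
        "valid": cleaned[n_train:n_train + n_valid],
        "test": cleaned[n_train + n_valid:],  # remainder goes to the test set
    }


fractions = split_fractions(SPLIT_SIZES)
```

Fixing the seed makes the split reproducible, which matters when results on the validation and test sets are compared across models.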

References

  1. Mendez, David, et al. “ChEMBL: towards direct deposition of bioassay data.” Nucleic Acids Research 47.D1 (2019): D930‑D940.
  2. Davies, Mark, et al. “ChEMBL web services: streamlining access to drug discovery data and utilities.” Nucleic Acids Research 43.W1 (2015): W612‑W620.

Topics

Drug Discovery
Molecular Generation

Source

Organization: huggingface

Created: 8/6/2024
