
katielink/moses

The dataset is derived from the Clean Leads subset of the ZINC database, filtered by molecular weight (250–350 Da), number of rotatable bonds (≤ 7), and XlogP (≤ 3.5). Molecules were removed if they contained charged atoms, atoms other than C, N, S, O, F, Cl, Br, or H, or rings larger than eight atoms. Medicinal chemistry filters (MCFs) and PAINS filters were also applied. The final set comprises 1,936,962 molecular structures, split into a training set (~1.6 M molecules), a test set (~176 k molecules), and a scaffold test set (~176 k molecules). The scaffold test set contains unique Bemis‑Murcko scaffolds that do not appear in the training or test sets, which makes it possible to evaluate a model's ability to generate novel scaffolds.

Updated 1/23/2024
hugging_face

Description

Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models

Dataset Overview

MOSES is a benchmark platform for machine‑learning research in drug discovery. It implements several popular molecular generation models and provides a suite of metrics to evaluate the quality and diversity of generated molecules. MOSES aims to standardize molecular generation research and facilitate sharing and comparison of new models.
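As an illustration of the kind of metric such a suite covers, the sketch below computes uniqueness and novelty for a list of generated molecules. It is a minimal sketch, not the MOSES implementation: the function names are hypothetical, and it assumes every molecule is already a valid, canonical SMILES string so that string equality means molecular identity.

```python
def uniqueness(generated):
    """Fraction of generated molecules that are distinct."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated molecules absent from the training set."""
    unique_generated = set(generated)
    if not unique_generated:
        return 0.0
    return len(unique_generated - set(training)) / len(unique_generated)


# Toy inputs, assumed to be canonical SMILES (illustrative only).
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
training = ["CCO", "CCC"]

print(uniqueness(generated))         # 0.75 (3 distinct out of 4)
print(novelty(generated, training))  # 2 of the 3 distinct molecules are new
```

Computing novelty over the distinct molecules rather than the raw list keeps a model from inflating its score by repeating one new molecule many times.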

Dataset Details

  • Source: Derived from the Clean Leads subset of the ZINC database.
  • Size: 1,936,962 molecules after filtering; the unfiltered ZINC Clean Leads collection contains 4,591,276.
  • Filter Criteria:
    • Molecular weight: 250–350 Da.
    • Number of rotatable bonds: ≤ 7.
    • XlogP: ≤ 3.5.
    • Only atoms C, N, S, O, F, Cl, Br, H are allowed; no charged atoms or rings larger than eight atoms.
    • Passed medicinal chemistry filters (MCFs) and PAINS filters.
  • Splits:
    • Training set: ~1.6 M molecules.
    • Test set: ~176 k molecules.
    • Scaffold test set: ~176 k molecules, containing unique Bemis‑Murcko scaffolds absent from the training and test sets.
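
The property filters and the scaffold hold-out rule listed above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code used to build the dataset: `MolRecord`, `passes_property_filters`, and `split_by_scaffold` are hypothetical names, the descriptors are assumed to be precomputed by a cheminformatics toolkit, the 250–350 Da range is assumed to be inclusive, and the MCF and PAINS filters are not covered. All identifiers and values in the example are made up.

```python
from dataclasses import dataclass

ALLOWED_ATOMS = frozenset({"C", "N", "S", "O", "F", "Cl", "Br", "H"})


@dataclass(frozen=True)
class MolRecord:
    """Hypothetical record of precomputed descriptors (not a MOSES type)."""
    mol_id: str
    scaffold_id: str        # identifier of the Bemis-Murcko scaffold
    mol_weight: float       # in Da
    rotatable_bonds: int
    xlogp: float
    elements: frozenset     # element symbols present in the molecule
    has_charged_atom: bool
    largest_ring: int       # atoms in the largest ring, 0 if acyclic


def passes_property_filters(m):
    """Check the listed property criteria (bounds assumed inclusive)."""
    return (
        250 <= m.mol_weight <= 350
        and m.rotatable_bonds <= 7
        and m.xlogp <= 3.5
        and m.elements <= ALLOWED_ATOMS   # subset test on element symbols
        and not m.has_charged_atom
        and m.largest_ring <= 8
    )


def split_by_scaffold(records, held_out_scaffolds):
    """Send molecules with a held-out scaffold to the scaffold test set."""
    held_out = set(held_out_scaffolds)
    remaining, scaffold_test = [], []
    for m in records:
        (scaffold_test if m.scaffold_id in held_out else remaining).append(m)
    return remaining, scaffold_test


# Made-up records for illustration only.
records = [
    MolRecord("mol_a", "scaffold_1", 300.0, 4, 2.1, frozenset({"C", "N", "O", "H"}), False, 6),
    MolRecord("mol_b", "scaffold_2", 320.5, 5, 3.0, frozenset({"C", "O", "H"}), False, 6),
    MolRecord("mol_c", "scaffold_1", 410.0, 4, 2.0, frozenset({"C", "H"}), False, 6),       # too heavy
    MolRecord("mol_d", "scaffold_3", 280.0, 3, 1.5, frozenset({"C", "H", "I"}), False, 5),  # iodine not allowed
]

kept = [m for m in records if passes_property_filters(m)]
remaining, scaffold_test = split_by_scaffold(kept, {"scaffold_2"})

print([m.mol_id for m in remaining])      # ['mol_a']
print([m.mol_id for m in scaffold_test])  # ['mol_b']
```

In this sketch the `remaining` molecules would then be divided into the training and test sets, so that no scaffold in the scaffold test set appears in either of them.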

Citation Information

If you use the MOSES dataset in your research, please cite the following paper:

@article{10.3389/fphar.2020.565644,
  title={{M}olecular {S}ets ({MOSES}): {A} {B}enchmarking {P}latform for {M}olecular {G}eneration {M}odels},
  author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and Chen, Hongming and Nikolenko, Sergey and Aspuru-Guzik, Alan and Zhavoronkov, Alex},
  journal={Frontiers in Pharmacology},
  year={2020}
}

Topics

Molecular Generation
Drug Discovery

Source

Organization: hugging_face

Created: Unknown