katielink/moses

The dataset is derived from the Clean Leads subset of the ZINC database, filtered by molecular weight (250–350 Da), number of rotatable bonds (≤ 7), and XlogP (≤ 3.5). Molecules were removed if they contained charged atoms, atoms other than C, N, S, O, F, Cl, Br, or H, or rings larger than eight atoms. The remaining molecules were then screened with medicinal chemistry filters (MCFs) and PAINS filters. The final set comprises 1,936,962 molecular structures, split into a training set (~1.6 M molecules), a test set (~176 k molecules), and a scaffold test set (~176 k molecules). The scaffold test set contains only molecules whose Bemis‑Murcko scaffolds do not appear in the training or test sets, which makes it possible to evaluate how well a model generates scaffolds it has not seen.

Source

hugging_face

Created

Nov 28, 2025

Updated

Jan 23, 2024

Overview

Molecular Set (MOSES): Benchmark Platform for Molecular Generation Models

Dataset Overview

MOSES is a benchmark platform for machine‑learning research in drug discovery. It implements several popular molecular generation models and provides a suite of metrics to evaluate the quality and diversity of generated molecules. MOSES aims to standardize molecular generation research and facilitate sharing and comparison of new models.

Dataset Details

Source: Derived from the Clean Leads subset of the ZINC database.
Size: 1,936,962 molecules after filtering (the ZINC Clean Leads collection it is drawn from contains 4,591,276 molecules).
Filter Criteria:
- Molecular weight: 250–350 Da.
- Number of rotatable bonds: ≤ 7.
- XlogP: ≤ 3.5.
- Only atoms C, N, S, O, F, Cl, Br, H are allowed; no charged atoms or rings larger than eight atoms.
- Passed medicinal chemistry filters (MCFs) and PAINS filters.
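The descriptor-based criteria above can be sketched as a single predicate. This is a minimal illustration, not the MOSES implementation: the `MolDescriptors` container and `passes_basic_filters` function are hypothetical names, the descriptor values are assumed to have been computed beforehand with a cheminformatics toolkit, the range bounds are assumed to be inclusive, and the MCF and PAINS screens are not covered.

```python
from dataclasses import dataclass

# Elements permitted by the filter criteria listed above.
ALLOWED_ELEMENTS = frozenset({"C", "N", "S", "O", "F", "Cl", "Br", "H"})


@dataclass(frozen=True)
class MolDescriptors:
    """Hypothetical container for descriptors computed elsewhere."""
    mol_weight: float        # molecular weight in Da
    rotatable_bonds: int     # number of rotatable bonds
    xlogp: float             # XlogP estimate
    elements: frozenset      # element symbols present in the molecule
    has_charged_atom: bool   # True if any atom carries a formal charge
    largest_ring: int        # size of the largest ring (0 if acyclic)


def passes_basic_filters(d: MolDescriptors) -> bool:
    """Apply the descriptor-based criteria (bounds assumed inclusive).

    MCF and PAINS screening would be separate, substructure-based steps.
    """
    return (
        250.0 <= d.mol_weight <= 350.0
        and d.rotatable_bonds <= 7
        and d.xlogp <= 3.5
        and d.elements <= ALLOWED_ELEMENTS   # subset check
        and not d.has_charged_atom
        and d.largest_ring <= 8
    )
```

Keeping the check as a pure function over precomputed values makes it easy to unit-test independently of whichever toolkit supplies the descriptors.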
Splits:
- Training set: ~1.6 M molecules.
- Test set: ~176 k molecules.
- Scaffold test set: ~176 k molecules, containing unique Bemis‑Murcko scaffolds absent from the training and test sets.
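The idea behind a scaffold-held-out split can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used to build the published splits: `scaffold_split` is a hypothetical helper, the fractions and seed are arbitrary, and each record is assumed to already carry its Bemis‑Murcko scaffold as a string.

```python
import random
from collections import defaultdict


def scaffold_split(records, holdout_fraction=0.1, test_fraction=0.1, seed=0):
    """Split (smiles, scaffold) pairs into train, test, and scaffold_test.

    Scaffolds are assumed to be precomputed strings. Every molecule whose
    scaffold is held out goes to scaffold_test, so those scaffolds never
    appear in train or test.
    """
    rng = random.Random(seed)

    # Group molecules by scaffold so a scaffold is never divided.
    by_scaffold = defaultdict(list)
    for smiles, scaffold in records:
        by_scaffold[scaffold].append(smiles)

    # Pick a random subset of scaffolds to hold out.
    scaffolds = sorted(by_scaffold)
    rng.shuffle(scaffolds)
    n_holdout = max(1, round(len(scaffolds) * holdout_fraction))
    held_out = set(scaffolds[:n_holdout])

    scaffold_test, remaining = [], []
    for scaffold, members in by_scaffold.items():
        (scaffold_test if scaffold in held_out else remaining).extend(members)

    # Divide the remaining molecules at random into train and test.
    rng.shuffle(remaining)
    n_test = round(len(remaining) * test_fraction)
    return remaining[n_test:], remaining[:n_test], scaffold_test
```

Because the hold-out is decided per scaffold rather than per molecule, the scaffold test set is disjoint from the other two splits at the scaffold level by construction, while the ordinary test set may still share scaffolds with training.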

Citation Information

If you use the MOSES dataset in your research, please cite the following paper:

@article{10.3389/fphar.2020.565644,
  title={{M}olecular {S}ets ({MOSES}): {A} {B}enchmarking {P}latform for {M}olecular {G}eneration {M}odels},
  author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez‑Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and Chen, Hongming and Nikolenko, Sergey and Aspuru‑Guzik, Alan and Zhavoronkov, Alex},
  journal={Frontiers in Pharmacology},
  year={2020}
}
