katielink/moses
The dataset is derived from the Clean Leads subset of the ZINC database, filtered by molecular weight (250–350 Da), number of rotatable bonds (≤ 7), and XlogP (≤ 3.5). Molecules containing charged atoms, atoms other than C, N, S, O, F, Cl, Br, or H, or rings larger than eight atoms were removed. The remaining molecules were then screened with medicinal chemistry filters (MCFs) and PAINS filters. The final set comprises 1,936,962 molecular structures, split into a training set (~1.6 M molecules), a test set (~176 k molecules), and a scaffold test set (~176 k molecules). Every Bemis‑Murcko scaffold in the scaffold test set is absent from the training and test sets, so this split evaluates a model's ability to generate molecules with novel scaffolds.
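The scaffold hold-out idea can be illustrated with a minimal Python sketch. It assumes each molecule's Bemis‑Murcko scaffold has already been computed elsewhere (for example with a cheminformatics toolkit) and is available as a string. The function names, the `mol_*` identifiers, and the example scaffolds are hypothetical, and this is not necessarily how the MOSES authors built the split.

```python
def hold_out_by_scaffold(mol_to_scaffold, held_out_scaffolds):
    """Send every molecule whose scaffold is held out to the scaffold test set.

    mol_to_scaffold: dict mapping a molecule ID to its precomputed scaffold string.
    held_out_scaffolds: iterable of scaffold strings reserved for the scaffold test set.
    Returns (remaining, scaffold_test) as lists of molecule IDs.
    """
    held_out = set(held_out_scaffolds)
    remaining, scaffold_test = [], []
    for mol, scaffold in mol_to_scaffold.items():
        (scaffold_test if scaffold in held_out else remaining).append(mol)
    return remaining, scaffold_test


def shared_scaffolds(mol_to_scaffold, split_a, split_b):
    """Return the scaffolds that occur in both splits (empty set if disjoint)."""
    scaffolds_a = {mol_to_scaffold[m] for m in split_a}
    scaffolds_b = {mol_to_scaffold[m] for m in split_b}
    return scaffolds_a & scaffolds_b


# Illustrative data only: molecule IDs are placeholders, scaffolds are simple ring systems.
example = {
    "mol_1": "c1ccccc1",        # benzene scaffold
    "mol_2": "c1ccccc1",        # same scaffold as mol_1
    "mol_3": "c1ccncc1",        # pyridine scaffold
    "mol_4": "c1ccc2ccccc2c1",  # naphthalene scaffold, held out below
}

remaining, scaffold_test = hold_out_by_scaffold(example, {"c1ccc2ccccc2c1"})
overlap = shared_scaffolds(example, remaining, scaffold_test)
```

An empty `overlap` confirms that no scaffold is shared between the held-out molecules and the rest, which is the property the scaffold test set is described as having.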
Description
Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models
Dataset Overview
MOSES is a benchmark platform for machine‑learning research in drug discovery. It implements several popular molecular generation models and provides a suite of metrics to evaluate the quality and diversity of generated molecules. MOSES aims to standardize molecular generation research and facilitate sharing and comparison of new models.
Dataset Details
- Source: Refined from the Clean Leads subset of the ZINC database.
- Size: 1,936,962 molecules after filtering (the unfiltered ZINC Clean Leads subset contains 4,591,276 molecules).
- Filter Criteria:
  - Molecular weight: 250–350 Da.
  - Number of rotatable bonds: ≤ 7.
  - XlogP: ≤ 3.5.
  - Only atoms C, N, S, O, F, Cl, Br, H are allowed; no charged atoms or rings larger than eight atoms.
  - Passed medicinal chemistry filters (MCFs) and PAINS filters.
- Splits:
  - Training set: ~1.6 M molecules.
  - Test set: ~176 k molecules.
  - Scaffold test set: ~176 k molecules, containing unique Bemis‑Murcko scaffolds absent from the training and test sets.
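The descriptor and atom-level criteria above can be expressed as a small Python predicate. This is a sketch under stated assumptions: the descriptors are taken as already computed by some cheminformatics toolkit, the `MoleculeRecord` class and its field names are hypothetical, the range bounds are treated as inclusive, and the MCF and PAINS filters are not covered.

```python
from dataclasses import dataclass

# Element symbols permitted by the filter criteria listed above.
ALLOWED_ELEMENTS = frozenset({"C", "N", "S", "O", "F", "Cl", "Br", "H"})


@dataclass(frozen=True)
class MoleculeRecord:
    """Hypothetical container for descriptors computed elsewhere."""
    label: str
    mol_weight: float        # molecular weight in Da
    rotatable_bonds: int     # number of rotatable bonds
    xlogp: float             # XlogP value
    elements: frozenset      # element symbols present in the molecule
    has_charged_atom: bool   # True if any atom carries a formal charge
    largest_ring_size: int   # atoms in the largest ring, 0 if acyclic


def passes_basic_filters(rec: MoleculeRecord) -> bool:
    """Apply the listed thresholds (bounds assumed inclusive; no MCF/PAINS check)."""
    return (
        250.0 <= rec.mol_weight <= 350.0
        and rec.rotatable_bonds <= 7
        and rec.xlogp <= 3.5
        and rec.elements <= ALLOWED_ELEMENTS
        and not rec.has_charged_atom
        and rec.largest_ring_size <= 8
    )


# Illustrative records with made-up descriptor values, not real compounds.
ok = MoleculeRecord("example_ok", 300.0, 4, 2.1, frozenset({"C", "N", "O", "H"}), False, 6)
too_heavy = MoleculeRecord("example_heavy", 420.0, 4, 2.1, frozenset({"C", "H"}), False, 6)
has_iodine = MoleculeRecord("example_iodine", 300.0, 4, 2.1, frozenset({"C", "H", "I"}), False, 6)

kept = [r.label for r in (ok, too_heavy, has_iodine) if passes_basic_filters(r)]
```

Keeping the thresholds in one predicate makes it easy to check whether molecules produced by a generative model fall inside the same property ranges as the dataset.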
Citation Information
If you use the MOSES dataset in your research, please cite the following paper:
@article{10.3389/fphar.2020.565644,
  title={{M}olecular {S}ets ({MOSES}): {A} {B}enchmarking {P}latform for {M}olecular {G}eneration {M}odels},
  author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and Chen, Hongming and Nikolenko, Sergey and Aspuru-Guzik, Alan and Zhavoronkov, Alex},
  journal={Frontiers in Pharmacology},
  year={2020}
}
Source
Organization: hugging_face
Created: Unknown