Datasets | JuheAPI

katielink/moses

Molecular Generation

Drug Discovery

The dataset is derived from the Clean Leads subset of the ZINC database, filtered by molecular weight (250–350 Da), number of rotatable bonds (≤ 7), and XlogP (≤ 3.5). Molecules were removed if they contained charged atoms, atoms other than C, N, S, O, F, Cl, Br, or H, or rings larger than eight atoms. Medicinal chemistry filters (MCFs) and PAINS filters were then applied. The final set comprises 1,936,962 molecular structures, split into a training set (~1.6 M molecules), a test set (~176 k molecules), and a scaffold test set (~176 k molecules). The scaffold test set contains molecules whose Bemis-Murcko scaffolds do not appear in the training or test sets, so it measures how well a model generates scaffolds it has not seen during training.
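The property filters and the scaffold hold-out described above can be sketched in a few lines of Python. This is an illustrative sketch, not the dataset authors' pipeline: the `MolRecord` type, its field names, and both helper functions are hypothetical, and the descriptors (molecular weight, XlogP, ring sizes, scaffold) are assumed to have been computed beforehand with a cheminformatics toolkit. Treating the 250–350 Da range as inclusive is also an assumption. The MCF and PAINS filters are not covered here.

```python
from dataclasses import dataclass

# Element set and thresholds as stated in the dataset description.
ALLOWED_ELEMENTS = frozenset({"C", "N", "S", "O", "F", "Cl", "Br", "H"})
MW_RANGE = (250.0, 350.0)  # Da; inclusive bounds are an assumption
MAX_ROTATABLE_BONDS = 7
MAX_XLOGP = 3.5
MAX_RING_SIZE = 8


@dataclass(frozen=True)
class MolRecord:
    """Hypothetical record of precomputed descriptors for one molecule."""
    smiles: str
    mol_weight: float
    rotatable_bonds: int
    xlogp: float
    elements: frozenset      # element symbols present in the molecule
    has_charged_atom: bool
    largest_ring: int        # 0 for acyclic molecules
    scaffold: str            # Bemis-Murcko scaffold, e.g. as a SMILES string


def passes_property_filters(rec: MolRecord) -> bool:
    """Return True if the record satisfies every property filter listed above."""
    return (
        MW_RANGE[0] <= rec.mol_weight <= MW_RANGE[1]
        and rec.rotatable_bonds <= MAX_ROTATABLE_BONDS
        and rec.xlogp <= MAX_XLOGP
        and not rec.has_charged_atom
        and rec.elements <= ALLOWED_ELEMENTS   # subset check
        and rec.largest_ring <= MAX_RING_SIZE
    )


def split_off_scaffold_test(records, held_out_scaffolds):
    """Separate molecules with a held-out scaffold from all other molecules."""
    held_out = set(held_out_scaffolds)
    rest = [r for r in records if r.scaffold not in held_out]
    scaffold_test = [r for r in records if r.scaffold in held_out]
    return rest, scaffold_test
```

Because every molecule with a held-out scaffold goes to the scaffold test set, none of those scaffolds can appear in the remaining molecules, which would then be divided into the training and test sets.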

hugging_face

