Explore high-quality datasets for your AI and machine learning projects.
The ESOL dataset is part of the MoleculeNet benchmark and is used for predicting aqueous solubility. Target values are log‑transformed solubilities, expressed in log mol/L. The dataset contains 1,128 samples; a scaffold split is recommended, and the evaluation metric is RMSE.
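The two evaluation choices named for ESOL, RMSE and a scaffold split, can be sketched in plain Python. This is a minimal illustration, not the MoleculeNet implementation: the function names and the `train_frac` parameter are hypothetical, and the scaffold keys are assumed to be precomputed per sample (for example, Murcko scaffold SMILES from a cheminformatics toolkit).

```python
import math
from collections import defaultdict


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def scaffold_group_split(scaffolds, train_frac=0.8):
    """Split sample indices so that no scaffold appears in both subsets.

    `scaffolds` holds one precomputed scaffold key per sample (assumed input).
    Groups are visited from largest to smallest; a group joins the training
    set only if it still fits under the requested training fraction.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    cutoff = train_frac * len(scaffolds)
    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Toy example with made-up scaffold keys and log-solubility values.
toy_scaffolds = ["a", "a", "a", "b", "b", "c", "d", "e", "f", "g"]
train_idx, test_idx = scaffold_group_split(toy_scaffolds, train_frac=0.8)
toy_error = rmse([-1.0, -2.5, -3.0], [-1.0, -2.0, -3.5])
```

Because whole scaffold groups are assigned to one side, the held-out set contains only scaffolds unseen during training, which is the reason a scaffold split tends to give a more demanding estimate than a random split.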
AqSolDB, created by the Autonomous Materials Discovery (AMD) research group, contains aqueous solubility data for 9,982 unique compounds aggregated from nine publicly available solubility datasets. It is the largest publicly accessible dataset of its kind, serving both as a reference for measured solubility values and as broader, more generalizable training data for data‑driven models. The dataset provides 2D descriptors for each compound, along with standardized and validated molecular representations and reliability labels.
This dataset contains per‑atom formation energy data for 133,420 materials. It is provided as two main files: `index.json`, which includes material indices, IDs, formulas, atom counts, and per‑atom formation energies; and `data.hdf5`, which stores structural information (lattice, number of atoms, per‑atom energy, atom pointers) and atomic data (positions, atomic numbers).
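The description of `data.hdf5` suggests a flat layout in which the atoms of all materials are stored back to back and the atom pointers mark where each material starts. The sketch below illustrates that reading with plain Python lists. It is an assumption about the layout, not the dataset's documented schema: the names `atom_ptr`, `n_atoms`, `energy_per_atom`, `positions`, and `atomic_numbers` are hypothetical, and the toy values are made up.

```python
def atoms_for_material(i, atom_ptr, n_atoms, positions, atomic_numbers):
    """Return (atomic_numbers, positions) for material i.

    Assumes atom_ptr[i] is the offset of the material's first atom in the
    flat per-atom arrays and n_atoms[i] is how many atoms it has.
    """
    start = atom_ptr[i]
    stop = start + n_atoms[i]
    return atomic_numbers[start:stop], positions[start:stop]


def total_formation_energy(i, energy_per_atom, n_atoms):
    """Convert a per-atom formation energy into a total for material i."""
    return energy_per_atom[i] * n_atoms[i]


# Toy data: two materials with 2 and 3 atoms (values are illustrative only).
n_atoms = [2, 3]
atom_ptr = [0, 2]
energy_per_atom = [-2.0, -0.5]
atomic_numbers = [11, 17, 8, 1, 1]
positions = [
    (0.0, 0.0, 0.0),
    (0.5, 0.5, 0.5),
    (0.0, 0.0, 0.1),
    (0.1, 0.0, 0.0),
    (0.0, 0.1, 0.0),
]

numbers_1, positions_1 = atoms_for_material(
    1, atom_ptr, n_atoms, positions, atomic_numbers
)
total_1 = total_formation_energy(1, energy_per_atom, n_atoms)
```

With the real files, the same slicing would apply to arrays read from `data.hdf5`, once the actual dataset names inside the file have been checked.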