Explore high-quality datasets for your AI and machine learning projects.
This dataset contains single‑turn dialogues with SMILES molecular descriptions, formatted as JSON and including SMILES strings with their corresponding textual descriptions. The dataset is split into training, validation, and test sets containing 264,391, 33,072, and 32,987 samples respectively. Dialogue templates consist of human queries and GPT‑generated molecule descriptions. Additionally, 14 query templates are provided for generating the query portion of the dialogues.
The dataset is primarily intended for research in chemistry, biology, and medicine, containing three features: CID, SMILES, and SELFIES, which identify compounds, describe molecular structures, and provide self‑descriptive molecular representations, respectively. The dataset is split into training, validation, and test sets, comprising a large number of samples with a total size of 36.6 TB and a download size of 12.6 GB.