OpenMol/PubChemSFT
This dataset contains single‑turn dialogues with SMILES molecular descriptions, formatted as JSON and including SMILES strings with their corresponding textual descriptions. The dataset is split into training, validation, and test sets containing 264,391, 33,072, and 32,987 samples respectively. Dialogue templates consist of human queries and GPT‑generated molecule descriptions. Additionally, 14 query templates are provided for generating the query portion of the dialogues.
Dataset description and usage context
Dataset Overview
Dataset Files
- Filename:
all_clean.json-
Overlaps with the ChEBI‑20 test set have been removed.
-
SMILES strings without descriptions have been removed.
-
Data format:
{ "SMILES": <str>: [ ["Please describe the molecule", DESCRIPTION], ... ] } -
Statistics:
- Maximum token length: 6113
- Minimum token length: 20
- Average token length: 191
- Median token length: 149
- Total single‑turn dialogues: 326,689
- Total SMILES examples: 293,302
-
Dataset Size
- Training set: 264,391
- Validation set: 33,072
- Test set: 32,987
Dialogue Templates
conversation:{
[
"from": "human",
"value": <QUERY>, # Randomly sampled from query templates
],
[
"from": "gpt",
"value": <TEXT>, # Description of the given molecule
],
]
Dataset Entry Content
"graph":
[
"edge_index":, # array (int64)
"edge_feat":, # array (int64)
"node_feat":, # array (int64)
"num_nodes":, # integer
],
"conversation": # as described above
Query Templates
{
<image>
Could you give me a brief overview of this molecule?,
<image>
Could you provide a description of this molecule?,
<image>
Describe this molecule.,
<image>
Please give me some details about this molecule.,
<image>
Provide a brief overview of this molecule.,
<image>
Provide a description of this molecule.,
<image>
What can you tell me about this molecule?,
Could you give me a brief overview of this molecule?
<image>,
Could you provide a description of this molecule?
<image>,
Describe this molecule.
<image>,
Please give me some details about this molecule.
<image>,
Provide a brief overview of this molecule.
<image>,
Provide a description of this molecule.
<image>,
What can you tell me about this molecule?
<image>
}
Pair the dataset with AI analysis and content workflows.
Once the source passes your review, move straight into summarization, transformation, report drafting, or presentation generation with the JuheAI toolchain.