Tags: Open Source Community, Question Answering Systems, Chemical Molecules

OpenMol/PubChemSFT

This dataset contains single‑turn dialogues that pair SMILES strings with textual descriptions of the corresponding molecules, stored as JSON. It is split into training, validation, and test sets of 264,391, 33,072, and 32,987 samples respectively. Each dialogue consists of a human query and a GPT‑generated molecule description. The query portion is generated from 14 provided query templates.

Source: hugging_face
Created: Nov 28, 2025
Updated: Apr 7, 2024

Dataset Overview

Dataset Files

  • Filename: all_clean.json
    • Overlaps with the ChEBI‑20 test set have been removed.

    • SMILES strings without descriptions have been removed.

    • Data format:

      {
        <SMILES string>: [
          ["Please describe the molecule", DESCRIPTION],
          ...
        ]
      }
      
    • Statistics:

      • Maximum token length: 6113
      • Minimum token length: 20
      • Average token length: 191
      • Median token length: 149
      • Total single‑turn dialogues: 326,689
      • Total SMILES examples: 293,302
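The format above can be read as a mapping from each SMILES string to a list of [query, description] pairs, which would explain why there are more single‑turn dialogues than SMILES examples. The sketch below flattens such a mapping into one record per dialogue. It assumes the file parses as a JSON object keyed by SMILES string; the function name `flatten_dialogues` and the sample entries are hypothetical and not taken from the dataset.

```python
import json


def flatten_dialogues(data):
    """Yield one (smiles, query, description) tuple per single-turn dialogue."""
    for smiles, pairs in data.items():
        for query, description in pairs:
            yield smiles, query, description


# Hypothetical sample in the documented shape (NOT real dataset content).
sample = json.loads("""
{
  "CCO": [
    ["Please describe the molecule", "A primary alcohol."],
    ["Please describe the molecule", "A two-carbon alcohol."]
  ],
  "O": [
    ["Please describe the molecule", "An oxygen hydride."]
  ]
}
""")

rows = list(flatten_dialogues(sample))
print(len(sample), "SMILES entries,", len(rows), "single-turn dialogues")
```

For the real file, `sample` would be replaced by the result of `json.load` on a local copy of all_clean.json.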

Dataset Size

  • Training set: 264,391
  • Validation set: 33,072
  • Test set: 32,987
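As a quick arithmetic check, the split sizes listed above correspond to roughly an 80/10/10 partition. The short sketch below computes the exact fractions from the three counts; the variable names are illustrative only.

```python
# Split sizes as listed in the dataset card.
splits = {"train": 264_391, "validation": 33_072, "test": 32_987}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, frac in fractions.items():
    print(f"{name}: {splits[name]:,} samples ({frac:.1%})")
print(f"total: {total:,} samples")
```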

Dialogue Templates

"conversation": [
  {
    "from":  "human",
    "value": <QUERY>,  # Randomly sampled from query templates
  },
  {
    "from":  "gpt",
    "value": <TEXT>,  # Description of the given molecule
  },
]
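A minimal sketch of how one such dialogue might be assembled, reading the template as a list of two from/value records: a query is sampled at random for the human turn and the molecule description fills the gpt turn. The helper name `build_conversation`, the two‑item template list, and the placeholder description are assumptions made for illustration.

```python
import random


def build_conversation(description, query_templates, rng=random):
    """Return a two-turn conversation: a sampled human query, then the description."""
    query = rng.choice(query_templates)
    return [
        {"from": "human", "value": query},
        {"from": "gpt", "value": description},
    ]


# Two of the documented query templates, used here as a small stand-in list.
templates = [
    "<image>\nDescribe this molecule.",
    "What can you tell me about this molecule?\n<image>",
]

# Placeholder description (hypothetical, not from the dataset).
conversation = build_conversation("A placeholder description.", templates,
                                  rng=random.Random(0))
for turn in conversation:
    print(turn["from"], "->", repr(turn["value"]))
```

Passing a seeded `random.Random` instance keeps the sampled query reproducible across runs.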

Dataset Entry Content

{
  "graph": {
    "edge_index": <array>,  # int64
    "edge_feat":  <array>,  # int64
    "node_feat":  <array>,  # int64
    "num_nodes":  <int>,
  },
  "conversation": <list>,   # as described above
}

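The entry layout above can be sanity‑checked with a small validator. The sketch below uses plain Python lists in place of int64 arrays and rests on assumptions the card does not confirm: that `edge_index` has two rows (source and target node indices) with one column per edge, and that `edge_feat` and `node_feat` hold one row per edge and per node. The function name `check_entry` and the sample entry are hypothetical.

```python
def check_entry(entry):
    """Return a list of problems found in one entry (empty list if none)."""
    graph = entry.get("graph", {})
    required = ("edge_index", "edge_feat", "node_feat", "num_nodes")
    missing = [key for key in required if key not in graph]
    if missing:
        return [f"missing graph field: {key}" for key in missing]

    problems = []
    edge_index = graph["edge_index"]
    num_nodes = graph["num_nodes"]
    # ASSUMPTION: edge_index is shaped 2 x num_edges (sources, targets).
    if len(edge_index) != 2 or len(edge_index[0]) != len(edge_index[1]):
        return ["edge_index is not shaped 2 x num_edges"]
    num_edges = len(edge_index[0])
    # ASSUMPTION: one feature row per edge and one per node.
    if len(graph["edge_feat"]) != num_edges:
        problems.append("edge_feat rows do not match the number of edges")
    if len(graph["node_feat"]) != num_nodes:
        problems.append("node_feat rows do not match num_nodes")
    if any(not 0 <= i < num_nodes for row in edge_index for i in row):
        problems.append("edge_index refers to a node outside 0..num_nodes-1")

    roles = [turn.get("from") for turn in entry.get("conversation", [])]
    if roles != ["human", "gpt"]:
        problems.append("conversation is not a human turn followed by a gpt turn")
    return problems


# Hypothetical three-node entry with placeholder feature values.
entry = {
    "graph": {
        "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]],
        "edge_feat": [[0], [0], [0], [0]],
        "node_feat": [[5], [5], [7]],
        "num_nodes": 3,
    },
    "conversation": [
        {"from": "human", "value": "<image>\nDescribe this molecule."},
        {"from": "gpt", "value": "A placeholder description."},
    ],
}
print(check_entry(entry))
```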
Query Templates

{
  "<image>\nCould you give me a brief overview of this molecule?",
  "<image>\nCould you provide a description of this molecule?",
  "<image>\nDescribe this molecule.",
  "<image>\nPlease give me some details about this molecule.",
  "<image>\nProvide a brief overview of this molecule.",
  "<image>\nProvide a description of this molecule.",
  "<image>\nWhat can you tell me about this molecule?",
  "Could you give me a brief overview of this molecule?\n<image>",
  "Could you provide a description of this molecule?\n<image>",
  "Describe this molecule.\n<image>",
  "Please give me some details about this molecule.\n<image>",
  "Provide a brief overview of this molecule.\n<image>",
  "Provide a description of this molecule.\n<image>",
  "What can you tell me about this molecule?\n<image>"
}
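The 14 templates follow a regular pattern: seven base questions, each with the `<image>` placeholder placed either before or after the question. A minimal sketch that rebuilds the list from that pattern is shown below. It assumes a newline separates the placeholder from the question, as the listing suggests; the names `BASE_QUERIES`, `IMAGE_TOKEN`, and `make_query_templates` are illustrative.

```python
import random

# The seven base questions that appear in the template list.
BASE_QUERIES = [
    "Could you give me a brief overview of this molecule?",
    "Could you provide a description of this molecule?",
    "Describe this molecule.",
    "Please give me some details about this molecule.",
    "Provide a brief overview of this molecule.",
    "Provide a description of this molecule.",
    "What can you tell me about this molecule?",
]
IMAGE_TOKEN = "<image>"


def make_query_templates(base_queries, token=IMAGE_TOKEN):
    """Build placeholder-first templates, then placeholder-last templates."""
    # ASSUMPTION: the placeholder and the question are joined by a newline.
    before = [f"{token}\n{query}" for query in base_queries]
    after = [f"{query}\n{token}" for query in base_queries]
    return before + after


templates = make_query_templates(BASE_QUERIES)
print(len(templates), "templates")
print(repr(random.Random(0).choice(templates)))
```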