JUHE API Marketplace
DATASET
Open Source Community

OpenMol/PubChemSFT

This dataset contains single‑turn dialogues with SMILES molecular descriptions, formatted as JSON and including SMILES strings with their corresponding textual descriptions. The dataset is split into training, validation, and test sets containing 264,391, 33,072, and 32,987 samples respectively. Dialogue templates consist of human queries and GPT‑generated molecule descriptions. Additionally, 14 query templates are provided for generating the query portion of the dialogues.

Updated 4/7/2024
hugging_face

Description

Dataset Overview

Dataset Files

  • Filename: all_clean.json
    • Overlaps with the ChEBI‑20 test set have been removed.

    • SMILES strings without descriptions have been removed.

    • Data format:

      {
        "SMILES": <str>:
        [
          ["Please describe the molecule", DESCRIPTION],
          ...
        ]
      }
      
    • Statistics:

      • Maximum token length: 6113
      • Minimum token length: 20
      • Average token length: 191
      • Median token length: 149
      • Total single‑turn dialogues: 326,689
      • Total SMILES examples: 293,302

Dataset Size

  • Training set: 264,391
  • Validation set: 33,072
  • Test set: 32,987

Dialogue Templates

conversation:{
  [
    "from":  "human",
    "value": <QUERY>, # Randomly sampled from query templates
  ],
  [
    "from":  "gpt",
    "value": <TEXT>, # Description of the given molecule
  ],
]

Dataset Entry Content

"graph":
  [
    "edge_index":, # array (int64)
    "edge_feat":,  # array (int64)
    "node_feat":,  # array (int64)
    "num_nodes":,  # integer
  ],
  "conversation": # as described above

Query Templates

{
 <image>
Could you give me a brief overview of this molecule?,
 <image>
Could you provide a description of this molecule?,
 <image>
Describe this molecule.,
 <image>
Please give me some details about this molecule.,
 <image>
Provide a brief overview of this molecule.,
 <image>
Provide a description of this molecule.,
 <image>
What can you tell me about this molecule?,
 Could you give me a brief overview of this molecule?
<image>,
 Could you provide a description of this molecule?
<image>,
 Describe this molecule.
<image>,
 Please give me some details about this molecule.
<image>,
 Provide a brief overview of this molecule.
<image>,
 Provide a description of this molecule.
<image>,
 What can you tell me about this molecule?
<image>
}

AI studio

Generate PPTs instantly with Nano Banana Pro.

Generate PPT Now

Access Dataset

Login to Access

Please login to view download links and access full dataset details.

Topics

Chemical Molecules
Question Answering Systems

Source

Organization: hugging_face

Created: Unknown

Power Your Data Analysis with Premium AI Models

Supporting GPT-5, Claude-4, DeepSeek v3, Gemini and more.

Enjoy a free trial and save 20%+ compared to official pricing.