Dataset asset, Open Source Community, Machine Learning, Chemical Molecules

PubChem

The dataset is primarily intended for research in chemistry, biology, and medicine. It contains three features: CID, SMILES, and SELFIES, which respectively identify each compound, describe its molecular structure as a line notation, and give a SELFIES (self-referencing embedded strings) representation of the same structure. The dataset is split into training, validation, and test sets totaling 119,009,905 samples, with a total size of about 36.6 GB and a download size of about 12.6 GB.

Source: Hugging Face
Created: Dec 11, 2024
Updated: Dec 12, 2024
Overview

Dataset description and usage context

PubChem Dataset Overview

Dataset Information

Features

  • CID: PubChem Compound Identifier (CID), data type int64.
  • SMILES: Simplified Molecular Input Line Entry System (SMILES) representation of the chemical structure, data type large_string.
  • SELFIES: Self-referencing embedded strings (SELFIES) representation of the chemical structure, data type string.
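
To make the three-column schema concrete, the sketch below models one row as a Python dataclass and splits a SELFIES string into its bracketed symbols. The `CompoundRecord` class, the `selfies_symbols` helper, and the ethanol example row are illustrative assumptions rather than anything taken from the dataset; the split relies only on the SELFIES convention that every symbol is enclosed in square brackets.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CompoundRecord:
    """One row of the dataset (hypothetical container, not a dataset API)."""
    cid: int       # stored as int64 in the dataset
    smiles: str    # stored as large_string in the dataset
    selfies: str   # stored as string in the dataset


def selfies_symbols(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\[\]]*\]", selfies)


# Illustrative row written by hand (assumed values): ethanol.
row = CompoundRecord(cid=702, smiles="CCO", selfies="[C][C][O]")
print(selfies_symbols(row.selfies))  # ['[C]', '[C]', '[O]']
```

For real conversion between SMILES and SELFIES, a dedicated library is the better choice; this helper only tokenizes an existing string.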

Data Splits

  • train: Training set, containing 95,207,924 samples, reported size 29,280,467,548.8 bytes (about 29.3 GB).
  • val: Validation set, containing 11,900,990 samples, reported size 3,660,058,289.828831 bytes (about 3.66 GB).
  • test: Test set, containing 11,900,991 samples, reported size 3,660,058,597.371169 bytes (about 3.66 GB).
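
The sample counts above correspond to an approximately 80/10/10 split. The short check below uses only the counts listed on this page; the variable names are chosen for illustration.

```python
# Sample counts as listed for each split.
split_samples = {
    "train": 95_207_924,
    "val": 11_900_990,
    "test": 11_900_991,
}

total_samples = sum(split_samples.values())

# Fraction of all samples that falls in each split.
split_fractions = {name: n / total_samples for name, n in split_samples.items()}

print(f"total: {total_samples:,}")  # total: 119,009,905
for name, fraction in split_fractions.items():
    print(f"{name}: {fraction:.2%}")
```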

Dataset Size

  • Download Size: 12,629,892,833 bytes (about 12.6 GB).
  • Total Size: 36,600,584,436 bytes (about 36.6 GB).
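
The two figures differ by a factor of roughly 2.9. On Hugging Face the download size usually refers to the stored files and the total size to the dataset once prepared locally; assuming that convention applies here, the sketch below converts both byte counts to decimal gigabytes and derives the ratio and the average size per sample. The sample total is the sum of the three split counts listed on this page.

```python
DOWNLOAD_BYTES = 12_629_892_833
TOTAL_BYTES = 36_600_584_436
TOTAL_SAMPLES = 119_009_905  # sum of the train, val, and test counts


def to_gb(n_bytes: float) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 1e9


# How much larger the prepared dataset is than the download.
expansion_ratio = TOTAL_BYTES / DOWNLOAD_BYTES
# Average prepared size of a single row.
bytes_per_sample = TOTAL_BYTES / TOTAL_SAMPLES

print(f"download: {to_gb(DOWNLOAD_BYTES):.1f} GB")
print(f"total:    {to_gb(TOTAL_BYTES):.1f} GB")
print(f"expansion ratio: {expansion_ratio:.2f}x")
print(f"average bytes per sample: {bytes_per_sample:.0f}")
```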

Configuration

  • default: Default configuration, containing file paths for training, validation, and test set data.
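
Given the size of the dataset, streaming a split is often more practical than downloading everything first. The following is a minimal sketch, assuming the Hugging Face `datasets` package is installed and that the split names match those listed above. `REPO_ID` is a placeholder because this page does not state the repository ID, and `preview` and `stream_split` are hypothetical helper names.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

# Placeholder: the repository ID is not given on this page, so fill it in.
REPO_ID = "<namespace>/<dataset-name>"
SPLITS = ("train", "val", "test")


def preview(rows: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n rows of any iterable of row dicts."""
    return list(islice(rows, n))


def stream_split(split: str, repo_id: str = REPO_ID):
    """Open one split lazily instead of downloading it in full.

    Assumes the Hugging Face `datasets` package is installed.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    from datasets import load_dataset  # imported lazily; optional dependency
    return load_dataset(repo_id, split=split, streaming=True)


# Usage (requires network access and a real REPO_ID):
# for row in preview(stream_split("val")):
#     print(row["CID"], row["SMILES"], row["SELFIES"])
```

Because `preview` accepts any iterable, it can be tried on a plain list of dicts before pointing it at the streamed dataset.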

Labels

  • chemistry: Chemistry
  • biology: Biology
  • medical: Medicine

Dataset Scale

  • 100M < n < 1B: The dataset contains between 100 million and 1 billion samples.