PubChem

The dataset is intended primarily for research in chemistry, biology, and medicine. It has three features: CID, the PubChem compound identifier; SMILES, a line notation that describes molecular structure; and SELFIES (self-referencing embedded strings), an alternative string representation of the same structure. The data is split into training, validation, and test sets totaling about 119 million samples, with a total size of about 36.6 GB and a download size of about 12.6 GB.

Updated 12/12/2024

huggingface

PubChem Dataset Overview

Dataset Information

Features

CID: PubChem compound identifier, data type int64.
SMILES: Simplified Molecular Input Line Entry System representation of chemical structures, data type large_string.
SELFIES: Self-referencing embedded strings representation of chemical structures, data type string.
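
A minimal sketch of how one row with these three features might be held in Python. The PubChemRecord class and the split_selfies_tokens helper are hypothetical names introduced here for illustration, and the sample row uses illustrative values rather than a row read from the dataset. The token splitter assumes the usual SELFIES convention that every symbol is enclosed in square brackets.

```python
import re
from dataclasses import dataclass

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1

# Assumption: each SELFIES symbol is a bracketed token such as "[C]" or "[=O]".
_SELFIES_TOKEN = re.compile(r"\[[^\[\]]*\]")


@dataclass(frozen=True)
class PubChemRecord:
    """Hypothetical container mirroring the CID, SMILES, and SELFIES features."""

    cid: int
    smiles: str
    selfies: str

    def __post_init__(self):
        # The listing gives CID as int64, so reject values outside that range.
        if not (INT64_MIN <= self.cid <= INT64_MAX):
            raise ValueError(f"CID {self.cid} does not fit in int64")


def split_selfies_tokens(selfies: str) -> list[str]:
    """Return the bracketed symbols of a SELFIES string, in order.

    Characters outside brackets (for example a "." separator) are ignored.
    """
    return _SELFIES_TOKEN.findall(selfies)


# Illustrative values only (ethanol), not a row taken from the dataset.
record = PubChemRecord(cid=702, smiles="CCO", selfies="[C][C][O]")
tokens = split_selfies_tokens(record.selfies)
print(tokens)
```

Splitting into bracketed symbols is a common first step when SELFIES strings are fed to a sequence model, since each symbol then becomes one vocabulary entry.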

Data Splits

train: Training set, 95,207,924 samples, 29,280,467,548.8 bytes (about 29.3 GB).
val: Validation set, 11,900,990 samples, 3,660,058,289.828831 bytes (about 3.66 GB).
test: Test set, 11,900,991 samples, 3,660,058,597.371169 bytes (about 3.66 GB).
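
The split proportions follow directly from the sample counts listed above. The short calculation below is a sketch that uses only those counts; the variable names are introduced here for illustration.

```python
# Sample counts as listed for each split.
SPLIT_SAMPLES = {
    "train": 95_207_924,
    "val": 11_900_990,
    "test": 11_900_991,
}

total_samples = sum(SPLIT_SAMPLES.values())

# Fraction of all samples that falls in each split.
split_fractions = {
    name: count / total_samples for name, count in SPLIT_SAMPLES.items()
}

print(f"total: {total_samples:,} samples")
for name, fraction in split_fractions.items():
    print(f"{name}: {SPLIT_SAMPLES[name]:,} samples ({fraction:.2%})")
```

The counts work out to roughly an 80/10/10 split over 119,009,905 samples.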

Dataset Size

Download Size: 12,629,892,833 bytes (about 12.6 GB).
Total Size: 36,600,584,436 bytes (about 36.6 GB).
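
As a quick consistency check, the per-split sizes can be summed and compared with the reported total, and the byte counts converted to more readable units. This is a sketch that uses only the figures listed in this overview; the helper names are introduced here for illustration.

```python
# Byte counts as listed in this overview.
DOWNLOAD_BYTES = 12_629_892_833
TOTAL_BYTES = 36_600_584_436
SPLIT_BYTES = {
    "train": 29_280_467_548.8,
    "val": 3_660_058_289.828831,
    "test": 3_660_058_597.371169,
}
TOTAL_SAMPLES = 95_207_924 + 11_900_990 + 11_900_991


def to_gb(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


def to_gib(num_bytes: float) -> float:
    """Convert bytes to binary gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


split_sum = sum(SPLIT_BYTES.values())
size_ratio = TOTAL_BYTES / DOWNLOAD_BYTES
bytes_per_sample = TOTAL_BYTES / TOTAL_SAMPLES

print(f"sum of split sizes: {split_sum:,.1f} bytes")
print(f"total size: {to_gb(TOTAL_BYTES):.1f} GB ({to_gib(TOTAL_BYTES):.1f} GiB)")
print(f"download size: {to_gb(DOWNLOAD_BYTES):.1f} GB")
print(f"total / download: {size_ratio:.2f}x")
print(f"average bytes per sample: {bytes_per_sample:.0f}")
```

The split sizes add up to the reported total, and the total is a little under three times the download size, which is consistent with the downloaded files being stored in a compressed form.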

Configuration

default: Default configuration, containing file paths for training, validation, and test set data.
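
Because the listing names Hugging Face as the host, the splits can presumably be read with the Hugging Face datasets library. The sketch below assumes that library is installed and streams rows rather than downloading everything first. The repository id is a placeholder, since this overview does not state it, and peek_rows is a hypothetical helper name.

```python
from itertools import islice

# Placeholder: this overview does not give the repository id on the Hub.
REPO_ID = "<namespace>/PubChem"

# Split names as listed under Data Splits.
SPLITS = ("train", "val", "test")


def peek_rows(repo_id: str, split: str = "train", n: int = 3) -> list[dict]:
    """Stream the first n rows of one split without a full download.

    Assumes the `datasets` package is installed and the repository is
    reachable; the import is local so this module loads without it.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}, expected one of {SPLITS}")

    from datasets import load_dataset

    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))


# Example call, once REPO_ID is replaced with the real repository id:
# for row in peek_rows(REPO_ID, split="val", n=2):
#     print(row["CID"], row["SMILES"], row["SELFIES"])
```

Streaming is a reasonable default here given the download size of about 12.6 GB, since it lets a few rows be inspected before committing to the full download.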

Labels

chemistry: Chemistry
biology: Biology
medical: Medicine

Dataset Scale

100M < n < 1B: The dataset contains between 100 million and 1 billion samples.
