DATASET
Open Source Community
PubChem
The dataset is intended primarily for research in chemistry, biology, and medicine. It has three features: CID, SMILES, and SELFIES, which respectively identify each compound, describe its molecular structure as a line notation, and give a SELFIES (self-referencing embedded strings) representation of the same structure. The data is split into training, validation, and test sets totaling about 119 million samples, with a total size of 36.6 GB and a download size of 12.6 GB.
Updated 12/12/2024
huggingface
Description
PubChem Dataset Overview
Dataset Information
Features
- CID: Chemical identifier, data type int64.
- SMILES: Simplified molecular input line entry system representation of chemical structures, data type large_string.
- SELFIES: Symbolic representation of chemical structures, data type string.
Data Splits
- train: Training set, containing 95,207,924 samples, size approximately 29,280,467,549 bytes (about 29.3 GB).
- val: Validation set, containing 11,900,990 samples, size approximately 3,660,058,290 bytes (about 3.7 GB).
- test: Test set, containing 11,900,991 samples, size approximately 3,660,058,597 bytes (about 3.7 GB).
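The sample counts above work out to roughly an 80/10/10 split. A minimal sketch that checks this arithmetic, using only the counts listed on this card (the variable names are illustrative and not part of the dataset):

```python
# Sample counts copied from the Data Splits list on this card.
split_counts = {
    "train": 95_207_924,
    "val": 11_900_990,
    "test": 11_900_991,
}

total_samples = sum(split_counts.values())

# Fraction of the whole dataset that each split represents.
split_fractions = {
    name: count / total_samples for name, count in split_counts.items()
}

for name, fraction in split_fractions.items():
    print(f"{name}: {split_counts[name]:,} samples ({fraction:.1%})")
print(f"total: {total_samples:,} samples")
```

The total of 119,009,905 samples is consistent with the "100M < n < 1B" scale label.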
Dataset Size
- Download Size: 12,629,892,833 bytes (about 12.6 GB).
- Total Size: 36,600,584,436 bytes (about 36.6 GB).
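To put the byte counts in more readable terms, the sketch below converts them to decimal gigabytes and derives two rough figures: the ratio of download size to total size, and the average size per sample. It assumes 1 GB = 10^9 bytes and takes the sample total from the split counts on this card; the variable names are illustrative.

```python
# Byte counts copied from the Dataset Size list on this card.
download_bytes = 12_629_892_833
total_bytes = 36_600_584_436

# Sum of the train, val, and test sample counts listed on this card.
total_samples = 119_009_905

GB = 10**9  # decimal gigabyte (assumption: sizes are reported in SI units)

download_gb = download_bytes / GB
total_gb = total_bytes / GB

# How large the download is relative to the full dataset size.
download_ratio = download_bytes / total_bytes

# Average number of bytes per sample across all three features.
bytes_per_sample = total_bytes / total_samples

print(f"download: {download_gb:.1f} GB")
print(f"total:    {total_gb:.1f} GB")
print(f"download / total: {download_ratio:.2f}")
print(f"average bytes per sample: {bytes_per_sample:.0f}")
```

The download is therefore about a third of the total size, and each sample averages a few hundred bytes.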
Configuration
- default: Default configuration, containing file paths for training, validation, and test set data.
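With a total size in the tens of gigabytes, streaming a split is often more practical than downloading everything first. The following is a minimal sketch, assuming the dataset is hosted on the Hugging Face Hub and the `datasets` library is installed. The card does not state the repository ID, so `REPO_ID` is a placeholder that must be replaced, and `stream_split` is a hypothetical helper written for this example.

```python
# Split names as listed on this card.
SPLITS = ("train", "val", "test")

# Placeholder only: the card does not state the Hub repository ID.
REPO_ID = "<namespace>/<dataset-name>"


def stream_split(repo_id: str, split: str):
    """Return a streaming iterable over one split of the default configuration."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields rows on demand instead of downloading the full split.
    return load_dataset(repo_id, split=split, streaming=True)
```

Once `REPO_ID` points at the real repository, iterating over `stream_split(REPO_ID, "val")` should yield one dictionary per sample, with the keys `CID`, `SMILES`, and `SELFIES` described under Features.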
Labels
- chemistry: Chemistry
- biology: Biology
- medical: Medicine
Dataset Scale
- 100M < n < 1B: The dataset contains between 100 million and 1 billion samples.
Topics
Chemical Molecules
Machine Learning
Source
Organization: huggingface
Created: 12/11/2024