PubChem

The dataset is intended primarily for research in chemistry, biology, and medicine. It has three features: CID, which identifies each compound; SMILES, which describes its molecular structure as a line notation; and SELFIES (self-referencing embedded strings), an alternative string representation of the same structure. The data is split into training, validation, and test sets totaling 119,009,905 samples, with a total size of about 36.6 GB and a download size of about 12.6 GB.

Updated 12/12/2024
huggingface

Description

PubChem Dataset Overview

Dataset Information

Features

  • CID: PubChem compound identifier, data type int64.
  • SMILES: Simplified Molecular Input Line Entry System representation of the chemical structure, data type large_string.
  • SELFIES: Self-referencing embedded strings representation of the chemical structure, data type string.
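
As a rough illustration of the three columns, the sketch below models one row as a Python dataclass and splits a SELFIES string into its bracketed symbols. The `Record` class, the `split_selfies_symbols` helper, and the ethanol example row (CID 702) are illustrative assumptions and are not read from the dataset; the SMILES and SELFIES strings stored for a given CID in the actual files may be written differently.

```python
import re
from dataclasses import dataclass


# Hypothetical row type mirroring the three listed columns.
@dataclass(frozen=True)
class Record:
    cid: int       # int64 in the dataset
    smiles: str    # large_string in the dataset
    selfies: str   # string in the dataset


# A SELFIES string is a sequence of bracketed symbols such as "[C]" or "[=O]",
# with "." separating disconnected fragments.
_SYMBOL = re.compile(r"\[[^\[\]]+\]|\.")


def split_selfies_symbols(selfies: str) -> list[str]:
    """Return the symbols of a SELFIES string, in order."""
    symbols = _SYMBOL.findall(selfies)
    # Reject input that contains anything other than symbols and dots.
    if "".join(symbols) != selfies:
        raise ValueError(f"unexpected characters in SELFIES string: {selfies!r}")
    return symbols


# Illustrative row (ethanol); not read from the dataset.
example = Record(cid=702, smiles="CCO", selfies="[C][C][O]")
print(split_selfies_symbols(example.selfies))
```

Symbol-level splitting of this kind is a common first step when building a vocabulary for a sequence model, since each bracketed symbol can serve as one token.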

Data Splits

  • train: Training set, containing 95,207,924 samples, size approximately 29,280,467,549 bytes (about 29.3 GB).
  • val: Validation set, containing 11,900,990 samples, size approximately 3,660,058,290 bytes (about 3.7 GB).
  • test: Test set, containing 11,900,991 samples, size approximately 3,660,058,597 bytes (about 3.7 GB).
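
The split proportions are not stated on the page, but they follow from the listed sample counts. A minimal sketch of that arithmetic, using only the three counts above (the variable names are illustrative):

```python
# Sample counts as listed for each split.
splits = {"train": 95_207_924, "val": 11_900_990, "test": 11_900_991}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(f"total samples: {total:,}")
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")

# The listing's scale label is 100M < n < 1B; check the total against it.
in_listed_bucket = 100_000_000 < total < 1_000_000_000
print(f"within 100M < n < 1B: {in_listed_bucket}")
```

The counts work out to an 80/10/10 split of 119,009,905 samples, which is consistent with the scale label given for the dataset.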

Dataset Size

  • Download Size: 12,629,892,833 bytes (about 12.6 GB).
  • Total Size: 36,600,584,436 bytes (about 36.6 GB).
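
To put the two byte counts in more familiar units, the sketch below converts them to decimal gigabytes and computes their ratio. It assumes 1 GB = 10^9 bytes. Reading the gap between the two figures as the difference between compressed files on disk and the loaded data is an assumption based on how Hugging Face dataset cards usually report these fields, not something this page states.

```python
# Byte counts as listed under "Dataset Size".
download_bytes = 12_629_892_833
total_bytes = 36_600_584_436


def to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 10**9


ratio = total_bytes / download_bytes

print(f"download: {to_gb(download_bytes):.1f} GB")
print(f"total:    {to_gb(total_bytes):.1f} GB")
print(f"total / download: {ratio:.2f}x")
```

Both figures are in the gigabyte range, so the download is feasible on a typical workstation, though streaming may still be preferable to loading all rows into memory.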

Configuration

  • default: Default configuration, containing file paths for training, validation, and test set data.

Labels

  • chemistry: Chemistry
  • biology: Biology
  • medical: Medicine

Dataset Scale

  • 100M < n < 1B: The dataset contains between 100 million and 1 billion samples.

Topics

Chemical Molecules
Machine Learning

Source

Organization: huggingface

Created: 12/11/2024
