liupf/ChEBI-20-MM

The ChEBI-20-MM dataset is a multimodal benchmark extended from the ChEBI-20 dataset, focusing on molecular science. It integrates multiple molecular data modalities, including InChI, IUPAC, SELFIES, and images, to evaluate models on molecular generation, image recognition, IUPAC identification, molecular description, and retrieval tasks. By increasing modality diversity, the dataset provides a more comprehensive assessment of model performance on multimodal data processing.

Updated 6/17/2024
hugging_face

Description

ChEBI-20-MM Dataset

Overview

ChEBI-20-MM is a multimodal benchmark derived from the ChEBI-20 dataset. It is designed to evaluate models in the molecular science domain across a broad set of tasks. The benchmark integrates multiple data modalities, including InChI, IUPAC names, SELFIES, and images, so it can be used for a wide range of molecular tasks.

Dataset Description

ChEBI-20-MM expands the original ChEBI-20 dataset, emphasizing the integration of multiple molecular data modalities. The benchmark assesses model capabilities in the following key areas:

  • Molecular Generation: Evaluates a model's ability to generate accurate molecular structures.
  • Image Recognition: Tests a model's proficiency in converting molecular images to other representation formats.
  • IUPAC Identification: Assesses a model's ability to generate IUPAC names from other representations.
  • Molecular Description: Evaluates a model's ability to produce textual descriptions of molecular structures.
  • Retrieval Tasks: Measures a model's effectiveness in accurately and efficiently retrieving molecular information.
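As an illustration of how a retrieval task like the one listed above might be scored, the following is a minimal sketch of two common ranking metrics, recall@k and mean reciprocal rank. The source does not state which metrics the benchmark uses, so the metric choice, the function names, and the molecule IDs below are all assumptions made for this example.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries whose target appears in the top-k ranked candidates."""
    hits = sum(
        1 for ranked, target in zip(ranked_ids, target_ids) if target in ranked[:k]
    )
    return hits / len(target_ids)


def mean_reciprocal_rank(ranked_ids, target_ids):
    """Average of 1/rank of the target; a query whose target is missing scores 0."""
    total = 0.0
    for ranked, target in zip(ranked_ids, target_ids):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(target_ids)


# Hypothetical ranked candidate lists for three queries (IDs are made up).
ranked = [["m1", "m7", "m3"], ["m4", "m2", "m9"], ["m5", "m6", "m8"]]
targets = ["m1", "m2", "m0"]

r_at_1 = recall_at_k(ranked, targets, k=1)
r_at_2 = recall_at_k(ranked, targets, k=2)
mrr = mean_reciprocal_rank(ranked, targets)
```

Recall@k rewards any hit within the top k, while mean reciprocal rank also reflects how high the correct item is placed, so the two are often reported together.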

Utility and Significance

By expanding modality diversity, the benchmark enables a more thorough evaluation of model performance on multimodal data processing.

Data Visualization

We use visualization to analyze the dataset's suitability for language models and its coverage of chemical space. The figure below shows the distributions of text lengths and of the token counts produced by each model's tokenizer, which indicate how well each language model accommodates the textual characteristics of the dataset.

[Figure: Data Visualization]
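The text-length and token-count analysis described above can be sketched as follows. This is a simplified illustration, not the authors' code: a whitespace split stands in for a real model tokenizer, the example descriptions are made up, and the names `whitespace_tokenize` and `length_summary` are hypothetical. To compare models, each model's own tokenizer would be passed in as `tokenize`.

```python
from statistics import mean, median


def whitespace_tokenize(text):
    # Stand-in tokenizer (assumption); a real analysis would use each model's tokenizer.
    return text.split()


def length_summary(texts, tokenize=whitespace_tokenize):
    """Summarize character lengths and token counts for a list of texts."""
    char_lengths = [len(t) for t in texts]
    token_counts = [len(tokenize(t)) for t in texts]
    return {
        "max_chars": max(char_lengths),
        "mean_tokens": mean(token_counts),
        "median_tokens": median(token_counts),
        "max_tokens": max(token_counts),
    }


# Made-up molecular descriptions, used only to exercise the function.
descriptions = [
    "Ethanol is a primary alcohol.",
    "Water is an oxygen hydride.",
    "Benzene is an aromatic hydrocarbon with six carbon atoms.",
]

summary = length_summary(descriptions)
```

Comparing `max_tokens` against a model's context limit is one simple way to check whether descriptions would be truncated for that model.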

We also examine the top 10 scaffolds in the dataset, counting the number of molecules per scaffold. Transparent bars represent the total count, while solid bars indicate the training set count. For chemical space coverage analysis, we select molecular weight (MW), LogP, aromatic ring count, and topological polar surface area (TPSA) as descriptors. We examine their distribution and correlations within the dataset, providing insight into chemical diversity and complexity.
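The scaffold counts described above (total per scaffold, plus the share falling in the training set) can be sketched as below. This assumes each molecule has already been reduced to a scaffold string by a cheminformatics toolkit and tagged with its split. The record layout, the split names, and the function name `top_scaffold_counts` are assumptions for illustration, and the example records are made up.

```python
from collections import Counter


def top_scaffold_counts(records, k=10, train_split="train"):
    """Return (scaffold, total_count, train_count) for the k most frequent scaffolds.

    records: iterable of (scaffold, split) pairs, one pair per molecule.
    """
    records = list(records)
    total = Counter(scaffold for scaffold, _ in records)
    train = Counter(scaffold for scaffold, split in records if split == train_split)
    # total_count maps to the transparent bar, train_count to the solid bar.
    return [(s, n, train[s]) for s, n in total.most_common(k)]


# Made-up (scaffold, split) records for six molecules.
records = [
    ("c1ccccc1", "train"),
    ("c1ccccc1", "train"),
    ("c1ccccc1", "test"),
    ("c1ccncc1", "train"),
    ("c1ccncc1", "validation"),
    ("C1CCCCC1", "test"),
]

rows = top_scaffold_counts(records, k=2)
```

Keeping the total and training counts side by side makes it easy to see whether a frequent scaffold is concentrated in the training set or spread across splits.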


Topics

Chemoinformatics
Bioinformatics

Source

Organization: hugging_face

Created: Unknown
