liupf/ChEBI-20-MM
The ChEBI-20-MM dataset is a multimodal benchmark extended from the ChEBI-20 dataset, focusing on molecular science. It integrates multiple molecular data modalities, including InChI, IUPAC, SELFIES, and images, to evaluate models on molecular generation, image recognition, IUPAC identification, molecular description, and retrieval tasks. By increasing modality diversity, the dataset provides a more comprehensive assessment of model performance on multimodal data processing.
ChEBI-20-MM Dataset
Overview
ChEBI-20-MM is an extensive multimodal benchmark derived from the ChEBI-20 dataset. It aims to provide a comprehensive evaluation of models in the molecular science domain. The benchmark integrates multiple data modalities, including InChI, IUPAC, SELFIES, and images, making it a versatile tool for a wide range of molecular tasks.
Dataset Description
ChEBI-20-MM expands the original ChEBI-20 dataset, emphasizing the integration of multiple molecular data modalities. The benchmark assesses model capabilities in the following key areas:
- Molecular Generation: Evaluates a model's ability to generate accurate molecular structures.
- Image Recognition: Tests a model's proficiency in converting molecular images to other representation formats.
- IUPAC Identification: Assesses a model's ability to generate IUPAC names from other representations.
- Molecular Description: Evaluates a model's ability to generate accurate textual descriptions of molecular structures.
- Retrieval Tasks: Measures a model's effectiveness in accurately and efficiently retrieving molecular information.
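Each task above pairs two of the dataset's modalities as input and target. As a minimal sketch of that layout, the snippet below builds a toy table with pandas; the column names (`SMILES`, `InChI`, `IUPAC`, `description`) and the two example molecules are illustrative assumptions, not the dataset's actual schema or contents.

```python
import pandas as pd

# Hypothetical schema: column names and rows are illustrative only --
# consult the released ChEBI-20-MM files for the real field names.
records = [
    {
        "SMILES": "c1ccccc1",
        "InChI": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
        "IUPAC": "benzene",
        "description": "The simplest aromatic hydrocarbon.",
    },
    {
        "SMILES": "CCO",
        "InChI": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
        "IUPAC": "ethanol",
        "description": "A simple primary alcohol.",
    },
]
df = pd.DataFrame(records)

# Each benchmark task pairs two modality columns, e.g. molecular
# description maps a structure column to the free-text column.
pairs = list(zip(df["SMILES"], df["description"]))
print(len(pairs))
```

A real run would load the released split files instead of the in-memory records and select the input/target columns that match the task being evaluated.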
Utility and Significance
By expanding modality diversity, the benchmark enables a more thorough evaluation of model performance on multimodal data processing.
Data Visualization
We employ visualization techniques to analyze the dataset's suitability for language models and its chemical space coverage. The figure below shows text length distributions and the token counts produced by each model's tokenizer, assessing how well language models adapt to the dataset's textual characteristics.

We also examine the top 10 scaffolds in the dataset, counting the number of molecules per scaffold. Transparent bars represent the total count, while solid bars indicate the training set count. For chemical space coverage analysis, we select molecular weight (MW), LogP, aromatic ring count, and topological polar surface area (TPSA) as descriptors. We examine their distribution and correlations within the dataset, providing insight into chemical diversity and complexity.
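The scaffold counting and descriptor analysis above can be sketched with RDKit's standard descriptor functions. The molecules below are illustrative examples, not drawn from the dataset; the real analysis would iterate over the dataset's structures.

```python
from collections import Counter

from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors
from rdkit.Chem.Scaffolds import MurckoScaffold

# Illustrative molecules (phenol, aniline, ethanol); the actual
# analysis runs over the dataset's SMILES strings.
smiles_list = ["c1ccccc1O", "c1ccccc1N", "CCO"]

# Bemis-Murcko scaffolds, counted per scaffold as in the top-10 plot.
# Acyclic molecules yield an empty scaffold string.
scaffolds = Counter(
    MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in smiles_list
)

def describe(smiles: str) -> dict:
    """Compute the four chemical-space descriptors used in the coverage analysis."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "MW": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "aromatic_rings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "TPSA": Descriptors.TPSA(mol),
    }

props = {s: describe(s) for s in smiles_list}
print(scaffolds.most_common(1))
```

Both aromatic examples share the benzene scaffold, so the counter groups them together, mirroring how the scaffold bar chart aggregates molecules per scaffold.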