liupf/ChEBI-20-MM
The ChEBI-20-MM dataset is a multimodal benchmark for molecular science, extended from the ChEBI-20 dataset. It integrates multiple molecular data modalities, including InChI, IUPAC names, SELFIES, and images, to evaluate models on molecular generation, image recognition, IUPAC identification, molecular description, and retrieval tasks. By increasing modality diversity, the dataset provides a more comprehensive assessment of model performance on multimodal data processing.
Description
ChEBI-20-MM Dataset
Overview
ChEBI-20-MM is an extensive multimodal benchmark derived from the ChEBI-20 dataset. It aims to provide a comprehensive evaluation of models in the molecular science domain. The benchmark integrates multimodal data, including InChI, IUPAC names, SELFIES, and images, so it can be used across a wide range of molecular tasks.
Dataset Description
ChEBI-20-MM expands the original ChEBI-20 dataset, emphasizing the integration of multiple molecular data modalities. The benchmark assesses model capabilities in the following key areas:
- Molecular Generation: Evaluates a model's ability to generate accurate molecular structures.
- Image Recognition: Tests a model's proficiency in converting molecular images to other representation formats.
- IUPAC Identification: Assesses a model's ability to generate IUPAC names from other representations.
- Molecular Description: Evaluates a model's ability to produce textual descriptions of molecular structures.
- Retrieval Tasks: Measures a model's effectiveness in accurately and efficiently retrieving molecular information.
Utility and Significance
By expanding modality diversity, the benchmark enables a more thorough evaluation of model performance on multimodal data processing.
Data Visualization
We use visualization to analyze the dataset's suitability for language models and its chemical space coverage. The accompanying figure shows the text length distributions and the token counts produced by each model's tokenizer, which indicate how well each language model adapts to the textual characteristics of the dataset.
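As a rough illustration of the text length and token count analysis, the sketch below summarizes a list of texts using only the Python standard library. The function names, the `limit` parameter, and the whitespace tokenizer are assumptions made for this example; the whitespace split is only a stand-in for a model's real tokenizer, which can be passed in through the `tokenize` argument.

```python
from statistics import mean, median


def whitespace_tokenize(text):
    # Stand-in tokenizer (assumption): a real analysis would use each
    # model's own tokenizer instead of splitting on whitespace.
    return text.split()


def length_summary(texts, tokenize=whitespace_tokenize):
    """Summarize character lengths and token counts for a list of texts."""
    char_lengths = [len(t) for t in texts]
    token_counts = [len(tokenize(t)) for t in texts]
    return {
        "n": len(texts),
        "mean_chars": mean(char_lengths),
        "median_chars": median(char_lengths),
        "mean_tokens": mean(token_counts),
        "max_tokens": max(token_counts),
    }


def fraction_over_limit(texts, limit, tokenize=whitespace_tokenize):
    """Fraction of texts whose token count exceeds a model's input limit."""
    over = sum(1 for t in texts if len(tokenize(t)) > limit)
    return over / len(texts)
```

Comparing `fraction_over_limit` across tokenizers gives one simple view of how much text a given model would have to truncate.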

We also examine the top 10 scaffolds in the dataset, counting the number of molecules per scaffold. Transparent bars represent the total count, while solid bars indicate the training set count. For chemical space coverage analysis, we select molecular weight (MW), LogP, aromatic ring count, and topological polar surface area (TPSA) as descriptors. We examine their distribution and correlations within the dataset, providing insight into chemical diversity and complexity.
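The scaffold counts and descriptor correlations described above can be sketched with the standard library alone. This is a minimal example under stated assumptions, not the authors' analysis code: it assumes each record is a dictionary with hypothetical `scaffold` and `split` keys, and that scaffolds and descriptor values (MW, LogP, and so on) were computed beforehand with a cheminformatics toolkit.

```python
import math
from collections import Counter


def top_scaffolds(records, k=10):
    """Return (scaffold, total_count, train_count) for the k most common scaffolds.

    `records` is assumed to be a list of dicts with hypothetical keys
    "scaffold" (a precomputed scaffold string) and "split" ("train", ...).
    """
    total = Counter(r["scaffold"] for r in records)
    train = Counter(r["scaffold"] for r in records if r["split"] == "train")
    # A Counter returns 0 for scaffolds that never occur in the training set.
    return [(s, n, train[s]) for s, n in total.most_common(k)]


def pearson(xs, ys):
    """Pearson correlation between two equally long descriptor columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

The `(total, train)` pairs map directly onto the transparent and solid bars, and `pearson` can be applied to any two descriptor columns, for example MW against TPSA.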
Source
Organization: hugging_face
Created: Unknown