liupf/ChEBI-20-MM
The ChEBI-20-MM dataset is a multimodal benchmark extended from the ChEBI-20 dataset, focusing on molecular science. It integrates multiple molecular data modalities, including InChI, IUPAC, SELFIES, and images, to evaluate models on molecular generation, image recognition, IUPAC identification, molecular description, and retrieval tasks. By increasing modality diversity, the dataset provides a more comprehensive assessment of model performance on multimodal data processing.
ChEBI-20-MM Dataset
Overview
ChEBI-20-MM is an extensive multimodal benchmark derived from the ChEBI-20 dataset. It aims to provide a comprehensive evaluation of models in the molecular science domain. The benchmark integrates multiple data modalities, including InChI, IUPAC, SELFIES, and images, making it a versatile tool for a wide range of molecular tasks.
Dataset Description
ChEBI-20-MM expands the original ChEBI-20 dataset, emphasizing the integration of multiple molecular data modalities. The benchmark assesses model capabilities in the following key areas:
- Molecular Generation: Evaluates a model's ability to generate accurate molecular structures.
- Image Recognition: Tests a model's proficiency in converting molecular images to other representation formats.
- IUPAC Identification: Assesses a model's ability to generate IUPAC names from other representations.
- Molecular Description: Evaluates a model's ability to generate accurate textual descriptions of molecular structures.
- Retrieval Tasks: Measures a model's effectiveness in accurately and efficiently retrieving molecular information.
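Each task above pairs two of the dataset's modalities as input and target. As a minimal sketch of that layout, the snippet below builds a toy table with pandas; the column names (`SMILES`, `InChI`, `IUPAC`, `description`) and the two example molecules are illustrative assumptions, not the dataset's actual schema or contents.

```python
import pandas as pd

# Hypothetical schema: column names and rows are illustrative only --
# consult the released ChEBI-20-MM files for the real field names.
records = [
    {
        "SMILES": "c1ccccc1",
        "InChI": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
        "IUPAC": "benzene",
        "description": "The simplest aromatic hydrocarbon.",
    },
    {
        "SMILES": "CCO",
        "InChI": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
        "IUPAC": "ethanol",
        "description": "A simple primary alcohol.",
    },
]
df = pd.DataFrame(records)

# Each benchmark task pairs two modality columns, e.g. molecular
# description maps a structure column to the free-text column.
pairs = list(zip(df["SMILES"], df["description"]))
print(len(pairs))
```

A real run would load the released split files instead of the in-memory records and select the input/target columns that match the task being evaluated.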
Utility and Significance
By expanding modality diversity, the benchmark enables a more thorough evaluation of model performance on multimodal data processing.
Data Visualization
We employ visualization techniques to analyze the dataset's suitability for language models and its chemical space coverage. The figure below shows text length distributions and the token counts produced by each model's tokenizer, assessing how well language models adapt to the dataset's textual characteristics.

We also examine the top 10 scaffolds in the dataset, counting the number of molecules per scaffold. Transparent bars represent the total count, while solid bars indicate the training set count. For chemical space coverage analysis, we select molecular weight (MW), LogP, aromatic ring count, and topological polar surface area (TPSA) as descriptors. We examine their distribution and correlations within the dataset, providing insight into chemical diversity and complexity.
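The scaffold counting and descriptor analysis above can be sketched with RDKit's standard descriptor functions. The molecules below are illustrative examples, not drawn from the dataset; the real analysis would iterate over the dataset's structures.

```python
from collections import Counter

from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors
from rdkit.Chem.Scaffolds import MurckoScaffold

# Illustrative molecules (phenol, aniline, ethanol); the actual
# analysis runs over the dataset's SMILES strings.
smiles_list = ["c1ccccc1O", "c1ccccc1N", "CCO"]

# Bemis-Murcko scaffolds, counted per scaffold as in the top-10 plot.
# Acyclic molecules yield an empty scaffold string.
scaffolds = Counter(
    MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in smiles_list
)

def describe(smiles: str) -> dict:
    """Compute the four chemical-space descriptors used in the coverage analysis."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "MW": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "aromatic_rings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "TPSA": Descriptors.TPSA(mol),
    }

props = {s: describe(s) for s in smiles_list}
print(scaffolds.most_common(1))
```

Both aromatic examples share the benzene scaffold, so the counter groups them together, mirroring how the scaffold bar chart aggregates molecules per scaffold.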