Explore high-quality datasets for your AI and machine learning projects.
All processed datasets are based on data extracted from ChEMBL version 25, which is provided under the Creative Commons Attribution‑ShareAlike 3.0 Unported license.
The ChEBI-20-MM dataset is a multimodal benchmark for molecular science, extended from the ChEBI-20 dataset. It integrates multiple molecular data modalities, including InChI strings, IUPAC names, SELFIES strings, and molecular images, to evaluate models on molecular generation, image recognition, IUPAC identification, molecular description, and retrieval tasks. Because it covers more modalities than ChEBI-20, the dataset supports a more comprehensive assessment of how models handle multimodal molecular data.
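To make the modalities listed above concrete, here is a minimal sketch of how one multimodal entry and its task pairings might be represented. Everything in it is an assumption for illustration: the `MoleculeRecord` class, its field names, the presence of a SMILES field, and the `TASKS` mapping of each task to an input and output modality are hypothetical and do not reflect the dataset's actual schema or file layout.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class MoleculeRecord:
    """One hypothetical multimodal entry; field names are illustrative only."""
    smiles: str                       # assumed base representation
    inchi: str
    iupac: str
    selfies: str
    description: str
    image_path: Optional[str] = None  # path to a rendered image, if any


# Hypothetical task table: task name -> (input field, output field).
# The pairings are assumptions, not the benchmark's official definitions.
TASKS: Dict[str, Tuple[str, str]] = {
    "molecular_description": ("smiles", "description"),
    "molecular_generation": ("description", "smiles"),
    "iupac_identification": ("smiles", "iupac"),
    "image_recognition": ("image_path", "smiles"),
}


def make_example(record: MoleculeRecord, task: str) -> Optional[Tuple[str, str]]:
    """Return an (input, target) pair for a task, or None if a modality is missing."""
    src, tgt = TASKS[task]
    source, target = getattr(record, src), getattr(record, tgt)
    if source is None or target is None:
        return None
    return source, target


# Ethanol, written out by hand in each text modality.
ethanol = MoleculeRecord(
    smiles="CCO",
    inchi="InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    iupac="ethanol",
    selfies="[C][C][O]",
    description="Ethanol is a primary alcohol.",
)
```

With this layout, `make_example(ethanol, "iupac_identification")` pairs the SMILES string with the IUPAC name, while `make_example(ethanol, "image_recognition")` returns `None` because no image path was supplied for this record.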