Explore high-quality datasets for your AI and machine learning projects.
A dataset of 1 billion images and 12 billion texts for training multilingual language-image models.
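FineWeb 2 is the second iteration of the popular FineWeb dataset, providing high-quality pretraining data for more than 1,000 languages. The data was produced by a sophisticated processing pipeline adapted to the multilingual setting, with steps including language identification, deduplication, and filtering.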
xCodeEval is described by its authors as the largest executable multilingual multitask benchmark dataset, containing 25 million document-level code examples covering approximately 7,500 unique problems across 17 programming languages. The dataset comprises seven tasks involving code understanding, generation, translation, and retrieval, and uses execution-based evaluation. It also introduces a code execution engine, ExecEval, that supports all of these languages, and proposes a data splitting and selection scheme based on geometric mean and graph-theoretic principles to balance the distribution of multiple attributes.
STSb Multi MT is a multilingual semantic textual similarity benchmark containing sentence pairs and similarity scores for German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Russian, and Chinese. Built from the STS‑benchmark dataset and translated via deepl.com, it can be used to train sentence‑embedding models such as T‑Systems‑onsite/cross‑en‑de‑roberta‑sentence‑transformer. The collection includes a training set (5,749 pairs), development set (1,500 pairs), and test set (1,379 pairs).
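As an illustration of how sentence pairs with similarity scores are commonly used, the sketch below scores a model by the Spearman correlation between its cosine similarities and the gold scores. This is a minimal, dependency-free sketch, not the dataset's official evaluation script: the embeddings are made-up toy vectors standing in for a real sentence-embedding model, and the gold scores are assumed to be on the 0 to 5 scale of the original STS benchmark.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical embeddings for three sentence pairs (toy values, not real data).
embedding_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # nearly identical meaning
    ([1.0, 0.0], [0.5, 0.5]),  # partially related
    ([1.0, 0.0], [0.0, 1.0]),  # unrelated
]
gold_scores = [4.8, 2.5, 0.2]  # hypothetical gold scores, assumed 0-5 scale

predicted = [cosine(u, v) for u, v in embedding_pairs]
print("Spearman correlation:", spearman(predicted, gold_scores))
```

Spearman correlation compares rankings rather than raw values, so the cosine similarities do not need to be rescaled to the range of the gold scores before comparison.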
MedQA is the first free-form multiple-choice open-domain QA dataset for medicine, derived from professional medical examinations. It covers English, Simplified Chinese, and Traditional Chinese (Taiwan), with 12,723, 34,251, and 14,123 questions respectively. In addition to the QA pairs, the authors release a large corpus of medical text extracted from textbooks to support reading-comprehension models.