JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

Browse by Category

WebLI

Multilingual Processing

A dataset of 10 billion images and 12 billion texts for training multilingual language-image models.

GitHub

FineWeb 2

Multilingual Processing
Natural Language Processing

FineWeb 2 is the second release of the popular FineWeb dataset, providing high-quality pretraining data for more than 1,000 languages. The data passes through a processing pipeline adapted to the multilingual setting, including steps such as language identification, deduplication, and filtering.

GitHub
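The deduplication step in a pipeline like the one described above can be sketched in a few lines. This is a minimal exact-match sketch (FineWeb 2's actual pipeline uses more sophisticated fuzzy deduplication); the function names are illustrative, not from the dataset's tooling.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep only the first occurrence of each normalized document."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox.",
    "the  quick brown  fox.",  # duplicate after normalization
    "A different sentence.",
]
print(deduplicate(corpus))  # two documents remain
```

Hashing normalized text keeps memory bounded by digest size rather than document length, which matters when the corpus spans 1,000+ languages.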

NTU-NLP-sg/xCodeEval

Code Analysis
Multilingual Processing

xCodeEval is currently the largest executable multilingual multitask benchmark dataset, containing 25 million document‑level code examples covering approximately 7,500 unique problems across 17 programming languages. The dataset comprises seven tasks involving code understanding, generation, translation, and retrieval, and uses execution‑based evaluation. It also introduces a code execution engine, ExecEval, supporting all languages, and proposes a data splitting and selection scheme based on geometric mean and graph‑theoretic principles to balance the distribution of multiple attributes.

hugging_face
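Execution-based evaluation, as used by xCodeEval, can be illustrated with a toy harness: run a candidate program against an I/O test case and compare its stdout to the expected output. This is a simplified stand-in for Python only, not ExecEval itself, and the function name is hypothetical.

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, stdin: str, expected_stdout: str,
                  timeout: float = 5.0) -> bool:
    """Execute a candidate Python solution against one I/O test case."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            input=stdin, capture_output=True, text=True, timeout=timeout,
        )
        # Pass only if the program exits cleanly and output matches.
        return proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()
    finally:
        os.unlink(path)

candidate = "a, b = map(int, input().split())\nprint(a + b)\n"
print(run_candidate(candidate, "2 3", "5"))  # True
```

A production harness like ExecEval additionally sandboxes execution and supports all 17 languages; the comparison logic, however, follows this pass/fail shape.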

mteb/stsb_multi_mt

Multilingual Processing
Semantic Similarity Evaluation

STSb Multi MT is a multilingual semantic textual similarity benchmark containing sentence pairs and similarity scores for German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Russian, and Chinese. Built from the STS-benchmark dataset and translated via deepl.com, it can be used to train sentence-embedding models such as T-Systems-onsite/cross-en-de-roberta-sentence-transformer. The collection includes a training set (5,749 pairs), development set (1,500 pairs), and test set (1,379 pairs).

hugging_face
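Models trained on similarity-scored pairs like these are conventionally evaluated by Spearman rank correlation between predicted and gold scores. A dependency-free sketch (the helper names are illustrative, and `scipy.stats.spearmanr` would normally be used instead):

```python
def rank(values):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

gold = [5.0, 3.2, 1.1, 4.4]   # annotator similarity scores
pred = [0.9, 0.6, 0.1, 0.8]   # model cosine similarities
print(spearman(gold, pred))   # 1.0: the orderings agree exactly
```

Rank correlation is preferred over Pearson here because embedding cosine similarities need only order the pairs correctly, not match the annotation scale.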

bigbio/med_qa

Medical QA
Multilingual Processing

MedQA is the first free-form multiple-choice open-domain QA dataset for medicine, derived from professional medical examinations. It covers three languages: English, Simplified Chinese, and Traditional Chinese (Taiwan), with 12,723, 34,251, and 14,123 questions respectively. In addition to the QA pairs, the authors release a large corpus of medical text extracted from textbooks to support reading-comprehension models.

hugging_face