JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

Phi3_intent_v43_3_w_unknown

Intent Recognition
Natural Language Processing

This dataset is used for intent recognition tasks, containing user queries and their corresponding correct intents. The dataset is divided into training and validation sets, used respectively for model training and performance evaluation.

huggingface

prm800k

Natural Language Processing
Machine Learning

This dataset contains data from [openai/prm800k](https://github.com/openai/prm800k). It is divided into two phases (phase1 and phase2), each with train and test splits. Features include labeler, timestamp, question, etc.; detailed feature types are described in the README.

huggingface

google-research-datasets/nq_open

Open‑Domain Question Answering
Natural Language Processing

NQ-Open, introduced by Lee et al. (2019), is an open-domain question answering benchmark derived from Natural Questions. The goal is to predict an English answer string for an input English question; all questions can be answered using the contents of English Wikipedia. Each instance contains a `question` string and an `answer` list of acceptable answer strings. The dataset provides a training split of 87,925 examples and a validation split of 3,610 examples and is released under the CC BY-SA 3.0 license. Homepage: https://efficientqa.github.io/; paper: https://www.aclweb.org/anthology/P19-1612.pdf.
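
Predictions on open-domain QA benchmarks like NQ-Open are typically scored by normalized exact match against the `answer` list. A minimal sketch of such a metric using SQuAD-style normalization (an illustration, not the official scoring script):

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, answers):
    """True if the normalized prediction equals any normalized gold answer."""
    return normalize(prediction) in {normalize(a) for a in answers}

# An instance shaped like an nq_open record
example = {
    "question": "names of the metropolitan municipalities in south africa",
    "answer": ["Mangaung Metropolitan Municipality",
               "eThekwini Metropolitan Municipality"],
}
print(exact_match("the mangaung metropolitan municipality", example["answer"]))  # → True
```

Because `answer` is a list, a prediction counts as correct if it matches any one of the acceptable strings.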

huggingface

talby/spamassassin

Spam Filtering
Natural Language Processing

The SpamAssassin public email corpus is a collection of email messages assembled by members of the SpamAssassin project, suitable for testing spam‑filtering systems. The dataset contains various email samples divided into spam and ham categories, with further sub‑groups such as hard_ham, spam_2, spam, easy_ham, and easy_ham_2. Structure includes fields like label, group, text, and raw; only a training split is provided.

huggingface

open-llm-leaderboard-old/details_CultriX__MonaTrix-v4-7B-DPO

Model Evaluation
Natural Language Processing

This dataset was automatically generated during the evaluation of model CultriX/MonaTrix‑v4‑7B‑DPO. It comprises 63 configurations, each mapping to a specific evaluation task. Each run creates a split named after its timestamp; the `train` split always points to the latest results. An additional `results` configuration aggregates outcomes from all runs for metric computation on the Open LLM Leaderboard.

huggingface

AdapterOcean/med_alpaca_standardized_cluster_8

Natural Language Processing
Text Clustering

The dataset provides a single train split of 14,666 examples (dataset size 145,562,012 bytes; download size 42,803,368 bytes). Features are text (string), conversation_id (int64), embedding (a float64 sequence), and cluster (int64).

huggingface

meta-math/MetaMathQA

Math Problem Solving
Natural Language Processing

MetaMathQA is augmented from the training sets of GSM8K and MATH; no test-set data is used. Every item traces back to an original question drawn from the GSM8K or MATH training set.

huggingface

clarin-pl/2021-punctuation-restoration

Speech Recognition
Natural Language Processing

The 2021‑punctuation‑restoration dataset is primarily used to restore punctuation in the output of automatic speech recognition (ASR) systems. It contains Polish text and audio data, divided into two parts: WikiTalks (conversational) and WikiNews (informational). The dataset aims to improve the readability of ASR‑generated transcripts and may also enhance performance on other NLP tasks. It comprises 1,200 texts, totaling over 240,000 words, spoken by over 100 different native speakers. The dataset provides training and test splits, with the test set containing ASR transcriptions of texts from both sources (WikiNews and WikiTalks).

huggingface

open-llm-leaderboard-old/details_CalderaAI__13B-Legerdemain-L2

Model Evaluation
Natural Language Processing

This dataset was automatically created during the evaluation of the model CalderaAI/13B‑Legerdemain‑L2 on the Open LLM Leaderboard. It consists of 64 configurations, each corresponding to an evaluation task. The dataset was generated from two runs, with each run represented as a specific split within each configuration. The "train" split always points to the latest results. An additional "results" configuration stores aggregated results from all runs for computing and displaying aggregated metrics on the Open LLM Leaderboard. The README also provides an example of how to load run details using the `load_dataset` function from the `datasets` library. The latest run results are provided in JSON format, showing metrics such as EM, F1, and accuracy for various tasks.

huggingface

abdiharyadi/eli5-id-preprocessed-tokenized-filtered

Natural Language Processing
Text Classification

The dataset contains the training features input_ids, attention_mask, and labels, each represented as an integer sequence. The training split comprises 443,918 examples totaling approximately 1,004,301,614 bytes (about 1.0 GB); the download size is 235,069,151 bytes.

huggingface

spot-the-diff

Image Difference Recognition
Natural Language Processing

This dataset is used for learning to describe the differences between pairs of similar images. It contains image features (img_a, img_b, img_diff) and one sentence-sequence feature (sentences). The dataset is split into training, test, and validation sets with 9,524, 1,404, and 1,634 samples respectively.

huggingface

g-ronimo/oasst2_top4k_en

Dialogue Generation
Natural Language Processing

The dataset's primary feature is messages, where each message comprises the sub-features content and role. It is split into a training set with 4,000 samples and a test set with 400 samples. The data were selected from top-ranked dialogues in OpenAssistant/oasst2, followed by deduplication and similarity filtering (long answers with similarity > 0.8 were excluded). The dataset includes only English content and was generated using a dedicated script.
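
Samples in this shape store each dialogue as a list of messages with content and role. A minimal sketch of flattening one sample into a plain-text transcript (the `role: content` line format is an arbitrary illustrative choice, not what any particular trainer expects):

```python
def render_chat(messages):
    """Flatten role/content message dicts into a plain-text transcript."""
    return "\n".join(f"{msg['role']}: {msg['content']}" for msg in messages)

sample = {
    "messages": [
        {"role": "user", "content": "What is OASST2?"},
        {"role": "assistant", "content": "An open-source assistant-dialogue dataset."},
    ]
}
print(render_chat(sample["messages"]))
```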

huggingface

FaceCaption-15M

Face Recognition
Natural Language Processing

FaceCaption‑15M is a large‑scale, diverse, high‑quality dataset of facial images and their natural‑language descriptions, containing over 15 million facial image‑description pairs, intended to promote research on face‑centric tasks. The dataset construction includes image collection, facial attribute annotation, facial description generation, and statistical analysis.

huggingface

C-MTEB/LCQMC

Natural Language Processing
Sentence Similarity

LCQMC is a Chinese question/sentence matching corpus. Each instance contains sentence1 (string), sentence2 (string), and a score label (int32). The dataset provides train, validation, and test splits with 238,766, 8,802, and 12,500 examples respectively (dataset size 20,056,457 bytes; download size 14,084,841 bytes).

huggingface

allenai/WildChat

Dialogue Systems
Natural Language Processing

WildChat is a dataset of 650,000 conversations between human users and ChatGPT, collected by offering online users free access to OpenAI's GPT-3.5 and GPT-4. It covers a wide range of user-chatbot interaction scenarios, such as ambiguous user requests, code-switching, topic switching, and political discussions. WildChat can serve both as an instruction-tuning dataset and as a valuable resource for studying user behavior. Note that the dataset contains toxic user inputs and ChatGPT responses; a non-toxic subset is also provided. The dataset is multilingual, covering 66 languages, and has been anonymized.

huggingface

C4

Natural Language Processing
Pre‑training Models

The C4 (Colossal Clean Crawled Corpus) dataset, created by Google Research, is a large unlabeled text corpus widely used for pre-training large language models. It contains hundreds of millions of cleaned text documents extracted from the Common Crawl web archive. During creation, heuristic rules were applied to discard boilerplate and retain well-structured, valuable content. C4 is mainly used for pre-training large language models (it was introduced alongside T5), supporting transfer learning across a broad range of NLP tasks.

arXiv

indiejoseph/cc100-yue

Cantonese
Natural Language Processing

The filtered Cantonese dataset is a subset of the CC100 corpus, containing only Cantonese content after filtering. It is intended to support various natural language processing tasks such as text classification, sentiment analysis, named entity recognition, and machine translation. The filtering process references an article by ToastyNews.

huggingface

zjunlp/KnowEdit

Natural Language Processing
Knowledge Editing

KnowEdit is a benchmark dataset for knowledge editing of large language models, covering six tasks: ZsRE, Wiki_recent, Wiki_counterfact, WikiBio, ConvSent, and Sanitation. It is built by reorganizing and extending existing datasets (e.g., WikiBio, ZsRE, WikiData_Counterfact) to provide comprehensive evaluation for knowledge editing in LLMs. The dataset structure includes knowledge insertion, modification, and deletion tasks, each with corresponding training and test data. The README also provides example code for loading and using the dataset with the EasyEdit tool, and details differences in loading methods and data structures across the constituent datasets.

huggingface

bigbio/mlee

Biomedical Text Mining
Natural Language Processing

MLEE is a corpus of event extraction annotations for angiogenesis paper abstracts. It includes manually annotated entities, relations, events, and coreference information covering processes at the molecular, cellular, tissue, and organ levels.

huggingface

open-llm-leaderboard-old/details_yleo__EmertonOmniBeagle-7B-dpo

Model Evaluation
Natural Language Processing

This dataset was automatically created during the evaluation run of model yleo/EmertonOmniBeagle-7B-dpo on the Open LLM Leaderboard. It comprises 63 configurations, each corresponding to an evaluated task, containing results from a single run. The "train" split always points to the latest results. An additional configuration named "results" stores aggregated results from all runs, used to compute and display aggregated metrics on the Open LLM Leaderboard. The README also provides a Python example for loading the dataset using the 🤗 datasets library and includes the latest results for a specific run.

huggingface

Traditional-Chinese-Medicine-Dataset-SFT

Traditional Chinese Medicine
Natural Language Processing

This is a high-quality Traditional Chinese Medicine (TCM) dataset composed mainly of internal, non-web-sourced data, containing roughly 1 GB of quality material from across the TCM field: clinical cases, classic texts by renowned practitioners, medical encyclopedia entries, terminology explanations, and more. About 99% of the content is Simplified Chinese, with excellent quality and considerable information density. The dataset is suited for pre-training or continued pre-training; multi-turn dialogue and QA datasets targeting SFT/IFT will be released in the future. It can be used on its own, but the recommended workflow is to first continue pre-training a model on the companion pre-training dataset and then use this dataset for further instruction fine-tuning. A certain proportion of Chinese common-sense data, Chinese multi-turn dialogues, and Classical Chinese <-> Modern Chinese translation data is also included to prevent catastrophic forgetting and strengthen model performance.

huggingface

FineWeb 2

Multilingual Processing
Natural Language Processing

FineWeb 2 is the second release of the popular FineWeb dataset, providing high-quality pre-training data for more than 1,000 languages. The data go through a sophisticated processing pipeline adapted to the multilingual setting, including steps such as language identification, deduplication, and filtering.

github

Starlento/DPO-En-Zh-20k-handbook

Natural Language Processing
Machine Learning

This dataset is a rearranged version of the original DPO‑En‑Zh‑20k dataset, split into 9,900 + 9,900 samples for training and 100 + 100 for testing. It contains fields such as language, prompt, rejected response (content and role), and chosen response (content and role), suitable for text generation and QA tasks in both Chinese and English.

huggingface

Chinese-SimpleQA

Natural Language Processing
Language Model Evaluation

Chinese SimpleQA is a comprehensive Chinese benchmark for evaluating factual correctness of language models on short questions. It features five characteristics: Chinese language, diversity, high quality, static references, and ease of evaluation. The dataset covers six major topics with 99 fine‑grained sub‑topics, spanning humanities to science and engineering, containing 3,000 high‑quality questions to help developers assess factual accuracy in Chinese and support algorithm research.

huggingface

FlagEval/CLCC_v1

Natural Language Processing
Chinese Linguistics

CLCC is a Chinese linguistics and cognition challenge dataset created by the FlagEval team. Task categories include question answering and text generation, supporting both Chinese and English. Evaluation results can be obtained via human assessment or the judgeLLM model.

huggingface

COYO-700M

Image Recognition
Natural Language Processing

COYO‑700M is a massive dataset comprising 747 million image‑text pairs and various other metadata, intended for training diverse models. It is constructed by collecting alt‑text from HTML documents along with their associated images, aiming to support training of large foundation models and complement existing datasets.

github

fhamborg/news_sentiment_newsmtsc

Sentiment Analysis
Natural Language Processing

NewsMTSC is a high‑quality dataset containing over 11k manually annotated sentences from English news articles. Each sentence is labeled by five human annotators and includes only examples where the annotators’ sentiment judgments are the same or similar. The dataset is split into two subsets (`rw` and `mt`), each containing training, validation, and test parts.

huggingface

SemCor and Masc documents

Word Sense Disambiguation
Natural Language Processing

Contains SemCor and Masc documents annotated with NOAD word senses for evaluating word‑sense disambiguation systems. Data are in XML format with detailed part‑of‑speech tags and segmentation information.

github

strombergnlp/nlpcc-stance

Natural Language Processing
Social Media Analysis

This is a Chinese stance‑prediction dataset specifically designed for detecting stance in Chinese micro‑blogs. The data originate from the NLPCC‑ICCPOL 2016 shared task, aiming to identify stance toward five target topics given annotated data. Each instance contains a unique ID, target, text, and stance label (against, favor, or none). The dataset was annotated by Chinese students, ensuring consistency and reliability. It contains only Chinese data and is released under a CC‑BY‑4.0 license.

huggingface

curated_20k_spanish

Natural Language Processing
Spanish

This dataset includes a feature named 'messages', which is a list containing two sub‑features: 'content' (string) and 'role' (string). The dataset is divided into a training split (train) with 20,207 samples, totaling 48,020,454 bytes. The download size is 24,914,380 bytes, and it is licensed under Apache 2.0. The language is Spanish.

huggingface

NeviduJ/Sample_WritingPrompts

Natural Language Processing
Text Generation

The dataset contains two primary features, prompt and story, both of string type. It is split into training, validation, and test sets with 1,400, 200, and 400 examples respectively. The download size is 4,002,221 bytes and the total dataset size is 6,296,928 bytes.

huggingface

CLUECorpus2020

Natural Language Processing
Pretraining Corpus

By cleaning the Chinese portion of the Common Crawl corpus, a high‑quality 100 GB Chinese pre‑training corpus was obtained. This dataset can be directly used for pre‑training, language‑modeling, or language‑generation tasks, and a small vocabulary specifically for Simplified‑Chinese NLP tasks has been released.

github

open-llm-leaderboard-old/details_Danielbrdz__Barcenas-Tiny-1.1b-DPO

Model Evaluation
Natural Language Processing

This dataset was automatically generated during the evaluation runs of the model Danielbrdz/Barcenas‑Tiny‑1.1b‑DPO. It comprises 63 configurations, each representing a distinct evaluation task. For each run, a split named after the run’s timestamp is created; the "train" split always points to the latest results. An additional "results" configuration stores aggregated metrics for all runs, which are used to compute and display aggregate scores on the Open LLM Leaderboard.

huggingface

ChineseWebText2.0

Natural Language Processing
Language Models

ChineseWebText 2.0 is a large‑scale high‑quality Chinese web‑text dataset containing 3.8 TB of data. Each text is accompanied by a quality score, single‑label and multi‑label domain tags, as well as toxicity classification and scores, enabling LLM researchers to select data based on new quality thresholds. The dataset was constructed and filtered using the MDFG‑tool, ensuring high data quality and multidimensional fine‑grained information.

huggingface

Yahoo_Answers_10_categories_for_NLP

Natural Language Processing
Text Classification

The Yahoo Answers topic classification dataset is constructed using the 10 largest primary categories. Each category contains 140,000 training samples and 6,000 test samples, totaling 1,400,000 training samples and 60,000 test samples. The dataset files include classes.txt, train.csv, and test.csv, where each sample has four columns: category index, question title, question content, and best answer.
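
Given the four-column layout described above, rows of train.csv can be parsed with the standard `csv` module. A sketch over a fabricated in-memory sample (the row values are invented for illustration):

```python
import csv
import io

# Fabricated two-row sample in the described train.csv layout:
# category index, question title, question content, best answer
sample_csv = io.StringIO(
    '"5","Why is the sky blue?","Saw it today.","Rayleigh scattering of sunlight."\n'
    '"2","Best way to learn Python?","","Practice small projects daily."\n'
)

rows = [
    {"category": int(cat), "title": title, "content": content, "answer": answer}
    for cat, title, content, answer in csv.reader(sample_csv)
]
print(rows[0]["category"], rows[0]["title"])
```

The quoted-field handling and the mapping of the numeric first column to an integer category index follow the column description in the card; labels for the ten categories would come from classes.txt.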

huggingface

MER2023

Machine Learning
Natural Language Processing

This dataset focuses on the Chinese language environment and can be used as a benchmark for multi‑label learning, noise robustness, and semi‑supervised learning research.

arXiv

mandarjoshi/trivia_qa

Reading Comprehension
Natural Language Processing

TriviaQA is a reading‑comprehension dataset containing over 650,000 question‑answer‑evidence triples. It includes 95,000 question‑answer pairs authored by trivia enthusiasts and independently collected evidence documents, with an average of six documents per question, providing high‑quality distant supervision. The dataset is monolingual (English) and is suitable for QA and text‑generation tasks.

huggingface

matthewfranglen/aste-v2

Sentiment Analysis
Natural Language Processing

Aspect Sentiment Triplet Extraction v2 is designed for extracting tuples consisting of a target entity, its associated sentiment, and the opinion span that explains the sentiment. It focuses on aspect‑based sentiment analysis (ABSA) to identify aspects of target entities and the polarity expressed for each aspect. The data are derived from SemEval 2014, 2015, and 2016 datasets, pre‑processed with spell correction and tokenization. The dataset includes training, validation, and test splits, each line containing index, text, start and end indices for aspect and opinion spans, the aspect and opinion terms, and the sentiment class.
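
Each ASTE example can be thought of as a sentence plus a set of (aspect span, opinion span, sentiment) triplets. A small illustrative in-memory representation (field names and the token-offset convention are ours, not the dataset's exact schema):

```python
from dataclasses import dataclass

@dataclass
class AspectTriplet:
    """One ASTE triplet: target aspect span, opinion span, and sentiment."""
    aspect: str
    aspect_span: tuple   # (start, end) token offsets, end exclusive
    opinion: str
    opinion_span: tuple
    sentiment: str       # "positive" | "negative" | "neutral"

tokens = "The battery life is great but the screen is dim".split()
triplets = [
    AspectTriplet("battery life", (1, 3), "great", (4, 5), "positive"),
    AspectTriplet("screen", (7, 8), "dim", (9, 10), "negative"),
]
# Recover the aspect text from its span to sanity-check the offsets
assert " ".join(tokens[1:3]) == triplets[0].aspect
```

One sentence can carry several triplets with different polarities, which is exactly what distinguishes ASTE from sentence-level sentiment classification.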

huggingface

pixelprose

Image Processing
Natural Language Processing

PixelProse is a comprehensive dataset containing 16 million synthetically generated image captions created with the Gemini 1.0 Pro Vision model. The dataset provides rich variables such as image unique identifiers, URLs, captioning model, and caption text, and supports multiple download and usage options.

huggingface

zhengr/ultrachat_200k

Chatbot
Natural Language Processing

UltraChat 200k is a rigorously filtered subset of the UltraChat corpus of 1.4M ChatGPT-generated dialogues, covering a wide range of topics. The data have been processed by selecting a subset of dialogues, correcting casing, and removing dialogues that contain certain undesirable phrases, making the dataset suitable for supervised fine-tuning and generation-ranking tasks.

huggingface

CLEVR

Visual Reasoning
Natural Language Processing

The CLEVR dataset is a diagnostic dataset for compositional language and elementary visual reasoning, designed to help researchers evaluate and develop models that can understand and answer questions about complex visual scenes.

github

irds/nfcorpus

Natural Language Processing
Information Retrieval

The nfcorpus dataset is a text retrieval collection for medical information retrieval, consisting of 5,371 documents. Each document includes a document ID, URL, title, and abstract. The dataset was introduced by Vera Boteva et al. at the 2016 European Conference on Information Retrieval and has been used in several related sets such as `nfcorpus_dev`, `nfcorpus_test`, etc.

huggingface

open-llm-leaderboard-old/details_OpenBuddy__openbuddy-qwen1.5-14b-v21.1-32k

Model Evaluation
Natural Language Processing

This dataset was automatically created during the evaluation run of the model OpenBuddy/openbuddy-qwen1.5-14b-v21.1-32k on the Open LLM Leaderboard. It comprises 63 configurations, each corresponding to an evaluation task. Each run produces a split named after the run's timestamp within each configuration, and the 'train' split always points to the latest results. An additional 'results' configuration stores aggregated results from all runs, used to compute and display aggregated metrics on the leaderboard.

huggingface

KShivendu/dbpedia-entities-openai-1M

Natural Language Processing
Text Embedding

OpenAI 1M with DBPedia Entities is a dataset of one million samples designed for feature‑extraction tasks. Each sample includes an `_id`, `title`, `text`, and an `openai` field containing a 1536‑dimensional float32 embedding generated with the text‑embedding‑ada‑002 model. The dataset is English, created in June 2023 for benchmarking pgvector and VectorDB (Qdrant) performance, and will later be expanded to ten million vectors. It is derived from the first one million entries of the BeIR/DBpedia‑Entity dataset.
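
A common first step when benchmarking such embeddings is computing cosine similarity between a query vector and document vectors. A minimal pure-Python sketch, using toy 3-dimensional vectors in place of the dataset's 1536-dimensional `openai` embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d stand-ins for the dataset's 1536-d float32 embeddings
query = [0.1, 0.3, 0.5]
doc = [0.2, 0.6, 1.0]   # parallel to the query, so similarity is 1.0
print(round(cosine_similarity(query, doc), 4))  # → 1.0
```

Vector databases such as Qdrant or pgvector perform this same computation (or an equivalent distance) at scale over the stored embeddings.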

huggingface

neulab/conala

Programming Language Processing
Natural Language Processing

The CoNaLa dataset is a benchmark for code generation tasks, containing code‑natural language pairs. The data were crawled from Stack Overflow, automatically filtered and manually annotated, comprising 2,379 training samples and 500 test samples. Additionally, an automatically mined set with nearly 600,000 samples is provided. The dataset is used to evaluate code generation, with English language and Python code. It includes two versions: a manually annotated version and an automatically mined version, each with different fields and splits.

huggingface

Lauler/flan-norwegian

Natural Language Processing
Machine Translation

This dataset includes multiple feature fields, such as inputs, targets, task, and index, along with normalized and back-translated variants of the input and target fields. It is split into training, validation, and test sets containing 2,771,562, 23,860, and 734,178 examples respectively. The total dataset size is 12,154,335,861 bytes, with a download size of 5,880,786,502 bytes.

huggingface

ATIS dataset

Natural Language Processing
Intent Recognition

The ATIS dataset is used in training and evaluation phases, containing 4,978 training sentences and 850 evaluation sentences. It is utilized for natural language understanding (NLU) training, involving tokenization, featurization, intent classification, and entity recognition and extraction.

github

tner/mit_movie_trivia

Natural Language Processing
Entity Recognition

The MIT Movie NER dataset is part of the T‑NER project and is specifically designed for named entity recognition tasks in the movie domain. It includes 12 entity types such as Actor, Plot, Opinion, Award, Year, Genre, Origin, Director, Soundtrack, Relationship, Character_Name, and Quote. The dataset is split into training (6,816 instances), validation (1,000 instances), and test (1,953 instances).

hugging_face
View Details

Yelp Reviews Dataset

Sentiment Analysis
Natural Language Processing

The dataset comprises Yelp review data for sentiment analysis, specifically comparing the effectiveness of BERT and RoBERTa models on Yelp review sentiment classification.

github
View Details

Cornell Movie Dialogs Corpus

Natural Language Processing
Dialogue Systems

The Cornell Movie Dialogs Corpus is a collection of fictional dialogues extracted from movie scripts. Due to its richness and diversity, it is well suited for training and evaluating dialogue agents.

github
View Details

GenderAlign

Gender Bias
Natural Language Processing

GenderAlign is a dataset co‑developed by Southern University of Science and Technology and the PaZhou Laboratory, focusing on mitigating gender bias in large language models. It contains 8,000 single‑turn dialogues, each paired with a 'chosen' and a 'rejected' response, intended to contrast unbiased and biased conversational patterns. In its creation, researchers first collected seed texts exhibiting gender bias or describing gender differences from existing datasets and books, then automatically generated dialogues using GPT‑3.5. GenderAlign's primary application area is improving gender bias in language models by providing high‑quality unbiased dialogue samples to help models better understand and generate fair text.

arXiv
View Details

openai/gsm8k

Mathematical Word Problems
Natural Language Processing

GSM8K (Grade School Math 8K) is a dataset of 8.5K high‑quality, linguistically diverse elementary mathematics word problems. It supports question‑answering tasks that require multi‑step reasoning, typically involving 2–8 steps of basic arithmetic (+, –, ×, ÷). Problems are of middle‑school difficulty and most can be solved without explicitly defining variables. Solutions are provided in natural language rather than pure mathematical notation. The dataset offers two configurations, "main" and "socratic," each with different answer formats.
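In the publicly released GSM8K files, each natural‑language solution conventionally ends with a final line of the form `#### <answer>`; a minimal extractor under that assumption might look like:

```python
import re

def extract_final_answer(solution: str) -> str:
    """Pull the final numeric answer from a GSM8K-style solution,
    which conventionally ends with a line like '#### 72'."""
    match = re.search(r"####\s*(-?[\d,\.]+)", solution)
    return match.group(1).replace(",", "") if match else ""

demo = "She sells 16 - 3 - 4 = 9 eggs a day.\n9 * 2 = 18\n#### 18"
print(extract_final_answer(demo))  # → 18
```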

hugging_face
View Details

fine-tuned/NFCorpus-512-192-gpt-4o-2024-05-13-43315

Natural Language Processing
News Articles

The dataset "news articles" is a generated dataset designed to support the development of domain‑specific embedding models for retrieval tasks.

hugging_face
View Details

GeoQuestions1089

Geospatial QA
Natural Language Processing

GeoQuestions1089 is a crowdsourced geospatial question‑answering dataset containing 1,089 triples of natural‑language questions, SPARQL/GeoSPARQL queries, and answers, targeting the YAGO2geo knowledge graph. The dataset is split into two parts: GeoQuestions_c (1,017 entries without linguistic errors) and GeoQuestions_w (72 entries with grammar, syntax, or spelling errors). Version 1.1 introduced several improvements, including unified query format, corrected natural‑language case handling, query classification fixes, and replacement of erroneous triples. Questions are categorized into nine groups covering various aspects of geospatial QA.

hugging_face
View Details

afmck/text8

Natural Language Processing
Text Analysis

The dataset contains three splits: training (train), validation (validation), and test (test), each consisting of a single text sample of a different byte size. The only feature is text (string). The total download size is 54,357,043 bytes and the total dataset size is 100,000,012 bytes. Under the default configuration, the data file paths correspond to the train, validation, and test splits.

hugging_face
View Details

CLVR Jaco Play Dataset

Robot Teleoperation
Natural Language Processing

The CLVR Jaco Play Dataset focuses on the tele‑operation robotics domain. Released in 2023 by research teams from the University of Southern California and KAIST, it provides 1,085 tele‑operated Jaco 2 robot segments with accompanying language annotations. The dataset is valuable for researchers and developers working on robot tele‑operation, natural language processing, and human‑computer interaction.

github
View Details

wmt/wmt16

Machine Translation
Natural Language Processing

This is a translation dataset based on statmt.org data, supporting multiple language pairs including cs‑en, de‑en, fi‑en, ro‑en, ru‑en and tr‑en. The dataset size ranges from 10 MB to 100 MB and is primarily used for translation tasks. The creators did not provide annotations; the data originates from several extended corpora such as europarl_bilingual, news_commentary, setimes and un_multi. The download size is 1.69 GB, the generated dataset size is 297.28 MB, and total disk usage is 1.99 GB.
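On the Hugging Face Hub, WMT translation datasets typically expose each example as a `translation` dict keyed by language code; a sketch under that assumption, using the de‑en pair:

```python
# Sketch of a WMT-style example: a 'translation' dict keyed by
# ISO language codes (structure assumed from the Hub convention).
example = {
    "translation": {
        "de": "Das Haus ist klein.",
        "en": "The house is small.",
    }
}

# Unpack a source/target pair for a de->en translation task.
source, target = example["translation"]["de"], example["translation"]["en"]
print(f"{source} -> {target}")
```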

hugging_face
View Details

MathCritique-76k

Mathematical Reasoning
Natural Language Processing

MathCritique‑76k is a dataset for training and testing large language models (LLMs) on mathematical reasoning tasks, containing model responses and step‑level feedback. The dataset was collected via an automated, scalable framework and aims to help models generate natural‑language feedback, improving performance on mathematical reasoning tasks.

github
View Details

BelleGroup/train_0.5M_CN

Natural Language Processing
Chinese Language Models

Contains roughly 500,000 Chinese instruction examples generated by the BELLE project. Each example consists of an instruction, an input (empty throughout this dataset), and an output.

hugging_face
View Details

lilacai/lilac-wikitext-2-raw-v1

Natural Language Processing
Machine Learning

This dataset was generated by Lilac for a HuggingFace Space. The original source dataset is wikitext. The configuration includes the namespace, name, source dataset name, configuration name, as well as the signal‑processing path and embedding method. Signal processing covers various signals such as near‑duplicate detection, PII detection, language detection, text statistics, sentiment analysis, code detection, and toxicity detection.

hugging_face
View Details

ucinlp/drop

Natural Language Processing
Reading Comprehension

DROP is a crowdsourced reading‑comprehension benchmark of roughly 96,000 questions that require a system to resolve references within a passage and perform discrete operations over its content, such as addition, counting, or sorting. These operations demand a more comprehensive understanding of the passage than earlier datasets required. Each example includes a passage, a question, and answer spans, and the data is divided into training and validation sets of 77,400 and 9,535 examples respectively.

hugging_face
View Details

resume-dataset

Resume Generation
Natural Language Processing

The dataset contains three primary fields: prompt (input), completion (output), and rejected_text (text that was rejected), all of which are string types. The dataset is split into a single training split (train) with 21,503 samples and a total size of 217,325,151 bytes. The download size is 95,854,833 bytes.

hugging_face
View Details

nlp_chinese_corpus

Natural Language Processing
Chinese Corpus

A large‑scale Chinese natural‑language‑processing corpus containing diverse types of Chinese text such as Wikipedia, news, and encyclopedia Q&A, intended to support research and applications in Chinese NLP.

github
View Details

stas/openwebtext-10k

Natural Language Processing
Text Dataset

This is a subset of the OpenWebText dataset, named stas/openwebtext-10k, which is an open‑source replica of OpenAI's WebText dataset. The subset contains the first 10,000 records of the original dataset, primarily for testing purposes. It includes a single split called `train` with a `text` feature, comprising 10,000 rows. The compressed size is approximately 15 MB and the uncompressed size is about 50 MB.

hugging_face
View Details

pfb30/multi_woz_v22

Dialogue Systems
Natural Language Processing

The Multi‑Domain Wizard‑of‑Oz (MultiWOZ) dataset is a fully annotated collection of written human‑human dialogues spanning multiple domains and topics. Version 2.1 fixes numerous annotation errors from the original release, while version 2.2 further corrects dialogue state errors, redefines the ontology, and introduces standardized slot‑span annotations. The dataset supports tasks such as dialogue modeling, intent‑state tracking, and dialogue act prediction. It is split into training, validation, and test sets containing 8,437, 1,000, and 1,000 dialogues respectively.

hugging_face
View Details

JParaCrawl

Machine Translation
Natural Language Processing

JParaCrawl is the largest publicly available English‑Japanese parallel corpus, created by NTT through large‑scale web crawling and automatic alignment of parallel sentences. The dataset provides a single large training split in which each instance is an English‑Japanese sentence pair, and it is distributed under its own custom license.

hugging_face
View Details

THUDM/CogVLM-SFT-311K

Natural Language Processing
Cognitive Language Model

CogVLM‑SFT‑311K is the bilingual visual‑instruction corpus used for the initial alignment training of CogVLM v1.0. It was constructed in three steps: approximately 3,500 high‑quality samples were selected from the open‑source MiniGPT‑4 dataset (minigpt4‑3500); these were merged with Llava‑Instruct‑150K and machine‑translated into Chinese; and, after significant noise was found in the detailed‑description portions, the Chinese corpora were corrected and retranslated into English. The release contains three subsets: llava_details-minigpt4_3500_formate (22,464 images with descriptions), llava_instruction_multi_conversations_formate (56,673 images with multi‑turn conversations), and llava_instruction_single_conversation_formate (76,634 images with single‑turn conversations). Each subset stores images in an images folder and JSON labels in a labels folder, with captions recorded under a "captions" key and dialogues as role‑tagged "conversations" turns. The dataset is licensed under CC BY‑NC 4.0; due to this non‑commercial agreement, it was not used in the bilingual version of CogVLM or any commercialized models, and it may not be used for commercial activities.

hugging_face
View Details

Source82/osa-alpaca_dataset_augmented_cleaned

Natural Language Processing
Data Augmentation

This dataset includes three features: instruction, input, and output, all of type string. The dataset contains only a training split (train) with 6,856 samples, total size 1,958,991 bytes. Download size is 792,005 bytes. In the default configuration, the data file path is data/train-*.

hugging_face
View Details

gretelai/synthetic_text_to_sql

Natural Language Processing
Database Query

The gretelai/synthetic_text_to_sql dataset is a high‑quality synthetic Text‑to‑SQL sample dataset generated using Gretel Navigator, containing 105,851 records, split into 100,000 training records and 5,851 test records. The dataset covers 100 different domains and includes various SQL tasks such as data definition, retrieval, manipulation, analysis, and reporting. Additionally, the dataset provides natural‑language explanations of SQL queries and contextual tags to optimize model training. The dataset quality is evaluated using LLM‑as‑a‑judge techniques, showing excellent performance in SQL standard compliance, correctness, and instruction adherence.

hugging_face
View Details

NLPCC-KBQA

Natural Language Processing
Knowledge Base QA

The NLPCC‑KBQA dataset contains data used in the NLPCC open‑domain QA evaluations from 2016 to 2018. It includes 24,479 training instances and test sets for each of the three years. Each instance consists of a knowledge triple and a manually annotated natural‑language question derived from that triple.

github
View Details

openai/webgpt_comparisons

Natural Language Processing
Question Answering Systems

This is the dataset of all comparisons marked as suitable for reward modeling by the end of the WebGPT project (arXiv:2112.09332), in which a reward model was trained from human feedback and used to align a long‑form question‑answering model with human preferences. It contains 19,578 comparisons, each pairing two model answers to the same question along with associated metadata. The fields are: `question` (the question text, with the name of its source dataset and a unique ID); `quotes_0` and `quotes_1` (the extracts each model found while browsing, with the title of the page each was found on); `answer_0` and `answer_1` (the final answers composed from those quotes); `tokens_0` and `tokens_1` (the GPT‑2‑tokenized prefix and completion whose concatenation is the reward‑modeling input); and `score_0` and `score_1` (preference strengths from −1 to 1 that sum to zero, an answer being preferred if and only if its score is positive; scores of 0 are treated as soft 50% labels and all other scores as hard labels by sign). The dataset was added to the Hugging Face Hub by @Tristan and @natolambert.
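The scoring rule for the comparison pairs (scores sum to zero; a score of 0 is a tie treated as a soft 50% label, otherwise only the sign matters) can be sketched as:

```python
def preference_label(score_0: float) -> float:
    """Convert a WebGPT comparison score for answer_0 into a
    reward-modeling label: 0.5 for ties, else a hard 0/1 by sign."""
    if score_0 == 0:
        return 0.5  # soft 50% label for ties
    return 1.0 if score_0 > 0 else 0.0

print(preference_label(0))  # → 0.5
```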

hugging_face
View Details

tau/commonsense_qa

Common-sense Reasoning
Natural Language Processing

CommonsenseQA is a new multiple‑choice QA dataset that requires various types of commonsense knowledge to predict the correct answer. It contains 12,102 questions, each with one correct answer and four distractors. The dataset is split into training, validation, and test sets, primarily in English.
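On the Hub, each CommonsenseQA example carries a `choices` dict of parallel `label`/`text` lists plus an `answerKey`; resolving the key to its answer text (field names assumed from the Hub release) looks like:

```python
# A CommonsenseQA-style example: five labeled choices and an
# answerKey (field layout assumed from the Hub release).
example = {
    "question": "Where would you put a plant to get sunlight?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["closet", "windowsill", "drawer", "box", "cellar"],
    },
    "answerKey": "B",
}

def resolve_answer(ex: dict) -> str:
    """Map the answerKey letter to its choice text."""
    idx = ex["choices"]["label"].index(ex["answerKey"])
    return ex["choices"]["text"][idx]

print(resolve_answer(example))  # → windowsill
```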

hugging_face
View Details

PsyDTCorpus

Mental Health
Natural Language Processing

PsyDTCorpus is a high‑quality multi‑turn psychological‑health dialogue dataset created by a team at South China University of Technology, designed to simulate the personalized counseling style of a specific therapist. It contains 5,000 dialogues synthesized with GPT‑4: single‑turn long‑text exchanges were first generated in a single pass, modeling clients on the five major personality traits, and then expanded into multi‑turn conversations. Real‑world counseling cases were incorporated during creation to ensure complexity and diversity. PsyDTCorpus is mainly applied in psychological counseling: by providing high‑quality samples in a personalized counseling style, it aims to improve the performance of LLMs for mental‑health support and to address the lack of personalization in existing models.

arXiv
View Details

IDEA-CCNL/laion2B-multi-chinese-subset

Multimodal Learning
Natural Language Processing

laion2B-multi-chinese-subset (IDEA‑CCNL, part of the Fengshenbang‑LM project) is the Chinese portion of the multilingual Laion2B multimodal dataset, comprising roughly 143M image‑text pairs. The release occupies about 19 GB, covering only the textual metadata (URLs, captions, etc.) rather than the images themselves, and is distributed as 13 parquet shards (train-00000-of-00013 through train-00012-of-00013) on the Hugging Face Hub. The dataset is licensed under CC BY 4.0; users are asked to cite the Fengshenbang 1.0 paper (arXiv:2209.02970) or the Fengshenbang‑LM repository.

hugging_face
View Details

medical-qa-id-filtered-split

Medical QA
Natural Language Processing

This dataset is a medical question‑answering collection containing system prompts, question IDs, question texts, original answer texts, answer lengths, and other features. It is split into training, validation, and test sets with 89,101, 4,950 and 4,951 samples respectively. The download size is 42,351,649 bytes and the total size is 83,382,248 bytes. The source is https://huggingface.co/datasets/lintangbs/medical-qa-id-llama, and preprocessing steps include removing empty lines and limiting the maximum token count to 1,024.

hugging_face
View Details

wmt/wmt14

Machine Translation
Natural Language Processing

The WMT14 dataset is a multilingual dataset for machine translation tasks, containing translation pairs for several language pairs such as Czech‑English (cs‑en), German‑English (de‑en), French‑English (fr‑en), Hindi‑English (hi‑en) and Russian‑English (ru‑en). Dataset size varies from a few MB to several tens of GB depending on the language pair. The dataset comprises training, validation, and test splits; each language pair includes a `translation` field containing the source and target texts. It is built from statmt.org data and allows users to customize language pairs and data sources.

hugging_face
View Details

wenhu/tab_fact

Fact Verification
Natural Language Processing

TabFact is a large‑scale dataset comprising 16k Wikipedia tables as evidence and 118k manually annotated statements for fact verification based on semi‑structured evidence. Statements are labeled as ENTAILED or REFUTED. The dataset is challenging because it requires both soft linguistic reasoning and hard symbolic reasoning.

hugging_face
View Details

Murple/ksponspeech

Speech Recognition
Natural Language Processing

The KsponSpeech dataset contains 969 hours of Korean conversational speech recorded by approximately 2,000 native Korean speakers in clean environments. All data were created by recording dialogues between two people and manually transcribing the audio. Transcriptions provide both orthographic and phonetic versions, along with disfluency tags (e.g., filler words, repeated words, word fragments) to indicate spontaneous speech. The dataset is primarily used for automatic speech recognition tasks and has been publicly released on the Korean government open data platform.

hugging_face
View Details

llm-blender/mix-instruct

Natural Language Processing
Machine Learning

MixInstruct is a dataset released for the LLM‑Blender project. It contains responses from 11 currently popular instruction‑following LLMs, including Stanford Alpaca, FastChat Vicuna, Dolly V2, StableLM, Open Assistant, Koala, Baize, Flan‑T5, ChatGLM, MOSS, and Mosaic MPT. The dataset is evaluated with automatic metrics (BLEU, ROUGE, BERTScore, BARTScore) and pairwise comparisons of 4,771 test samples performed by ChatGPT. The format is JSON, with fields for instruction, input, output, and candidate responses, each accompanied by detailed scores.

hugging_face
View Details

allenai/math_qa

Math Problem Solving
Natural Language Processing

MathQA is a large‑scale dataset of mathematical word problems, created by annotating the AQuA‑RAT dataset with a novel representation language to generate fully specified programs. Each AQuA‑RAT item provides the problem, answer options, rationale, and correct answer.

hugging_face
View Details

SimpleQA-Bench

Natural Language Processing
Question Answering Systems

SimpleQA‑Bench combines the SimpleQA and Chinese‑SimpleQA datasets into a multiple‑choice question (MCQ) format. The original datasets contain a large amount of long‑tail and niche knowledge, yielding low direct answer accuracy. To facilitate factuality evaluation, GPT‑4o generated three plausible yet incorrect options for each question, converting the QA pairs into MCQ format. A total of 7,324 samples were transformed, with fields including dataset name, metadata, question, answer, messages, options, and the correct option ID.
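Given the fields described (an options list and a correct option ID), scoring a model's choice reduces to an index comparison; a minimal sketch with hypothetical field names:

```python
# Hypothetical record in the SimpleQA-Bench MCQ layout described
# above (field names 'options' and 'correct_option_id' are assumptions).
sample = {
    "question": "(example long-tail factual question)",
    "options": ["option 0", "option 1", "option 2", "option 3"],
    "correct_option_id": 2,
}

def is_correct(record: dict, predicted_id: int) -> bool:
    """Compare a model's predicted option index to the gold option ID."""
    return predicted_id == record["correct_option_id"]

print(is_correct(sample, 2))  # → True
```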

hugging_face
View Details

vaishali/spider-tableQA

Table Question Answering
Natural Language Processing

The spider‑tableQA dataset is a resource designed for multi‑table question answering tasks, containing a total of 7,700 samples across training and validation splits. Each sample includes a query, question, table name, table content, answer, source and target. The dataset is intended for training and evaluating QA models capable of handling multi‑table operations, with an emphasis on generating tabular answers.

hugging_face
View Details

HamdiJr/Egyptian_hieroglyphs

Hieroglyph Recognition
Natural Language Processing

The dataset contains 10 images of Egyptian hieroglyphs extracted from the book "The Pyramid of Unas", accompanied by a language model. Each hieroglyph is manually annotated and labeled according to the Gardiner sign list. The dataset also includes automated detection results, tools for building the language model (e.g., vocabulary and n‑gram grammars), a description of its structure, and a GPL non‑commercial license.

hugging_face
View Details

DocuQA

Document QA
Natural Language Processing

This dataset is designed for testing document‑based question‑answering applications or APIs and accepts PDF documents as input. It contains 20 distinct documents, each accompanied by 5 different question types, for a total of 100 evaluation questions. Document types vary widely, including journal articles, news reports, financial statements, and tutorials, aiming to assess a QA system's ability to understand context, recognize keywords, and extract specific information.

github
View Details

chiayewken/bamboogle

Language Models
Natural Language Processing

The Bamboogle dataset contains data for studying the compositionality gap in language models. It includes two features—question and answer—and consists of a test split with 125 examples, totalling 10,747 bytes. The dataset is associated with the paper "Measuring and Narrowing the Compositionality Gap in Language Models" and is released under the MIT License.

hugging_face
View Details

CreativeLang/vua20_metaphor

Natural Language Processing
Linguistics

VUA20 is a metaphor detection dataset, likely the largest used in the FigLang2020 workshop. The dataset comprises 200k instances and was created in 2020. The annotation methodology is detailed in the MIP paper.

hugging_face
View Details

cimec/lambada

Natural Language Processing
Text Understanding

The LAMBADA dataset is used to evaluate computational models' text‑understanding ability, specifically testing whether a model can handle long‑range dependencies via a word‑prediction task. The dataset consists of narrative passages extracted from BookCorpus, split into development and test sets, with training data covering the full text of 2,662 novels. Its structure includes text and label fields, and it is partitioned into training, development, and test sets. The dataset was created to assess whether language models can retain long‑term contextual memory. Annotation involved paid crowdworkers ensuring that the target word could only be guessed by reading the entire passage. The language is English and the license is CC BY 4.0.
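LAMBADA's task is to predict a passage's final word from its full context; splitting a passage into context and target (whitespace tokenization, a simplification that ignores the dataset's actual tokenization details) can be sketched as:

```python
def split_context_target(passage: str):
    """Split a LAMBADA-style passage into its context and the
    final target word (naive whitespace tokenization)."""
    words = passage.split()
    return " ".join(words[:-1]), words[-1]

# Illustrative passage, not drawn from the dataset itself.
passage = "He opened the door and stepped out into the pouring rain"
context, target = split_context_target(passage)
print(target)  # → rain
```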

hugging_face
View Details

weibo-comments-v1

Weibo Comments
Natural Language Processing

The dataset includes features such as id, text content, labeled id, user nickname, comments, and label. It is split into a training set (2,325 samples) and a test set (582 samples). Download size is 810,622 bytes; total size is 1,266,259 bytes.

hugging_face
View Details

FreedomIntelligence/DxBench

Medical
Natural Language Processing

This dataset is a benchmark for text generation and label‑classification tasks in the medical domain, supporting both English and Chinese. It consists of multiple configurations (DxBench, Dxy, Muzhi), each with corresponding English and Chinese data files.

hugging_face
View Details

openwebtext-sentences

Natural Language Processing
Text Analysis

The OpenWebText‑Sentences dataset is extracted from the OpenWebText corpus, containing the original textual content split into individual sentences. It is stored in Parquet format for fast access. Sentences were split using the NLTK 3.9.1 pre‑trained "Punkt" tokenizer. The dataset size is 25.7 GB and includes 307,432,490 sentences in English.
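The corpus was split with NLTK's pre‑trained Punkt model; as a rough stand‑in for that step (a naive regex, not Punkt itself, which additionally handles abbreviations and other edge cases), sentence splitting can be sketched as:

```python
import re

def naive_sent_split(text: str) -> list:
    """Split text on sentence-ending punctuation followed by whitespace.
    A crude stand-in for NLTK's Punkt tokenizer."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "OpenWebText is large. It has many sentences! Does splitting work?"
print(naive_sent_split(text))
# → ['OpenWebText is large.', 'It has many sentences!', 'Does splitting work?']
```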

hugging_face
View Details