This dataset is for supervised fine‑tuning (SFT) and direct preference optimization (DPO), available in English and Chinese versions. It is based on the four MBTI dimensions, each with two opposing attributes: energy (Extraversion E – Introversion I), information (Sensing S – Intuition N), decision (Thinking T – Feeling F), and execution (Judging J – Perceiving P). The dataset follows the Alpaca format, containing instruction, input, and output. Users can select the appropriate file for SFT or DPO based on the MBTI type.
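As a rough illustration of the Alpaca layout described above, the sketch below loads one per-type SFT file and prints a record; the file name `en_INTJ_sft.json` is hypothetical and not part of the dataset's documented structure.

```python
import json

# A minimal sketch, assuming each MBTI type ships as an Alpaca-format JSON file;
# the file name "en_INTJ_sft.json" is hypothetical.
with open("en_INTJ_sft.json", encoding="utf-8") as f:
    records = json.load(f)

# Each record follows the Alpaca schema: instruction, input, output.
example = records[0]
print(example["instruction"])
print(example["input"])
print(example["output"])
```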
The dataset contains user action types, timestamps, and final goals, split into training and testing sets. Each set includes three files: action type, action time, and goal.
This dataset is used for stroke analysis and prediction, employing machine learning models together with resampling techniques such as SMOTEENN to address class imbalance and improve prediction accuracy.
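A minimal sketch of the SMOTEENN workflow with imbalanced-learn follows, assuming a table whose feature columns are already numeric and whose binary target column is named "stroke"; the file and column names are hypothetical.

```python
import pandas as pd
from imblearn.combine import SMOTEENN
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("stroke.csv")                 # hypothetical file name
X, y = df.drop(columns=["stroke"]), df["stroke"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Resample only the training split so the test set keeps its natural class imbalance.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))
```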
The FungiTastic dataset was created jointly by the University of West Bohemia, INRIA and the Czech Technical University in Prague. It comprises approximately 350,000 records and over 650,000 fungal photographs with detailed metadata. The dataset stems from twenty years of continuous data collection and supports various machine learning tasks such as closed‑set and open‑set classification. Rich metadata includes timestamps, camera settings, geographic coordinates, satellite imagery, and biological taxonomy. It is widely used for image classification problems in biology, particularly for fungal identification.
The dataset contains a feature named `signal` of type float32 and a feature named `label_id` of type int32. It is split into training, validation, and test sets with 9,900, 539, and 1,100 samples respectively. Total download size is 35,524,532 bytes; dataset size is 12,000,560 bytes.
This dataset contains data from [openai/prm800k](https://github.com/openai/prm800k). It is divided into two phases (phase1 and phase2), each with train and test splits. Features include labeler, timestamp, question, etc.; detailed feature types are described in the README.
The dataset contains 5,000 phishing URLs and 5,000 legitimate URLs for training machine learning models to predict phishing websites. Features of URLs and site content such as domain, IP, URL length, etc., were extracted, resulting in a dataset with 18 features.
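The sketch below illustrates the kind of URL-level features described above (domain, IP usage, URL length); it is not a reproduction of the dataset's exact 18-feature set, and the helper and feature names are chosen for illustration only.

```python
import re
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Extract a few simple lexical features from a URL."""
    parsed = urlparse(url)
    host = parsed.netloc.split(":")[0]
    return {
        "url_length": len(url),
        "domain": host,
        "uses_ip": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "num_dots": url.count("."),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
    }

print(url_features("http://192.168.0.1/login@secure-update.example"))
```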
The dataset includes three primary features—question, chain‑of‑thought, and answer—and is split into training, validation, and test sets containing 385,620, 500, and 1,319 samples respectively. The download size is 50,052,843 bytes and the total size is 91,978,048 bytes.
The MIL‑QUALAIR dataset was constructed for predicting urban air pollution in the Milan metropolitan area. It includes Sentinel‑5P satellite observations, meteorological conditions, terrain features, and ground‑station measurements. The dataset spans 2018‑2023 and supports prediction of five major pollutants: PM10, PM2.5, NO2, O3, and SO2. It was compiled by the LINKS Foundation, funded by the UP2030 project, and released under the MIT license.
CC6204‑Hackaton‑CUB200 is a multimodal dataset for image‑classification and text‑classification tasks, especially suitable for multimodal classification problems. It contains bird images and descriptive texts; each image has ten textual descriptions, and each instance is labeled with the bird species. The dataset provides training (5,994 observations) and test (5,794 observations) splits. It originates from the Caltech Vision Lab; the associated paper is "The Caltech‑UCSD Birds‑200‑2011 Dataset". Creators and contributors include Catherine Wah and Cristóbal Alcázar.
This corpus contains questions and answers collected from the website Insurance Library, the first open QA corpus in the insurance domain. Content is user‑submitted, with high‑quality answers provided by professionals possessing deep domain knowledge. The dataset is divided into two parts, "question‑answer corpus" and "question‑answer pair corpus", suitable for training machine‑learning models.
The Multi30k dataset is a multilingual English‑German image description dataset, containing training, validation, and test sets, and supporting multiple languages such as English, German, French, and Czech. The dataset provides detailed statistics such as the number of sentences, word count, and average words per sentence. Additionally, it offers download links for visual features and original images.
The MoleculeNet ESOL dataset is part of the MoleculeNet benchmark for predicting aqueous solubility. The target values are log‑transformed, expressed as log mol/L. The dataset contains 1,128 samples; scaffold split is recommended; evaluation metric is RMSE.
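A minimal sketch of the recommended evaluation metric follows: RMSE computed on log-solubility values (log mol/L). The prediction values here are placeholders, not output from any trained model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([-0.77, -3.30, -2.06])   # measured log solubility, log mol/L (illustrative)
y_pred = np.array([-0.95, -3.10, -1.80])   # hypothetical model predictions

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.3f} log mol/L")
```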
This project aims to generate a rich dataset from the PepBDB database for machine‑learning and computational‑biology research. The dataset processes peptide‑protein interaction data, extracts sequences, and adds various biochemical features, creating a tabular dataset suitable for Random Forest, XGBoost, and other analyses. Each row is labeled as binding residue (1) or non‑binding residue (0).
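A minimal sketch of the Random Forest use case follows, assuming the PepBDB-derived table has numeric biochemical feature columns plus a binary "label" column (1 = binding residue, 0 = non-binding); the file and column names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

df = pd.read_csv("pepbdb_residues.csv")        # hypothetical file name
X, y = df.drop(columns=["label"]), df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" compensates for the excess of non-binding residues.
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
print("F1 on held-out residues:", f1_score(y_test, clf.predict(X_test)))
```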
PHOENIX-2014 and PHOENIX-2014-T are large German Sign Language datasets developed by the Human Language Technology and Pattern Recognition Group at RWTH Aachen University in Germany. Both datasets are widely used in research; this repository provides PyTorch dataset wrappers for them so that they can be used easily in PyTorch models.
This dataset is a rearranged version of the original DPO‑En‑Zh‑20k dataset, split into 9,900 + 9,900 samples for training and 100 + 100 for testing. It contains fields such as language, prompt, rejected response (content and role), and chosen response (content and role), suitable for text generation and QA tasks in both Chinese and English.
DocVQA is a dataset for visual question answering on document images, containing 50,000 questions based on 12,767 images. It is split 80‑10‑10 into train, validation, and test sets (39,463 questions & 10,194 images for training, 5,349 questions & 1,286 images for validation, 5,188 questions & 1,287 images for testing). Document images originate from the UCSF Industry Documents Library and include printed, typed, and handwritten content such as letters, memos, notes, and reports.
Diabetes health indicators dataset, used to compare classic machine learning and quantum machine learning techniques for feature selection and classification.
These datasets are used for machine‑learning research and encompass sign language data from various countries and languages, including video and depth data, for sign language recognition and understanding studies.
The PEMS_SF dataset from the UCI Machine Learning Repository is a collection for training and testing machine learning models. It ships with separate training and test files for model development and validation, including PEMS_train and PEMS_test, among others.
The dataset comprises multiple feature fields such as cn_id, hateSpeech, counterSpeech, hsType, hsSubType, cnType, age, gender, and educationLevel. These fields represent various data types, including strings and floating‑point numbers. The dataset includes a split named **train** with 14,988 examples, totaling 4,432,994 bytes. Download size is 696,348 bytes. The configuration name is **default**, and the data files are located at `data/train-*`.
The database contains 76 attributes, but all published experiments have used 14 of them. Notably, the Cleveland dataset is currently the only one used by machine‑learning researchers. The target field indicates the presence of heart disease, with values ranging from 0 (none) to 4.
The dataset consists of curated chest X‑ray images designed to assist in detecting COVID‑19 and other respiratory diseases. It is divided into training and testing sets, each containing three primary categories: COVID‑19 positive, Normal, and Pneumonia. The dataset is intended for machine‑learning and deep‑learning applications, particularly convolutional neural network (CNN) based image classification tasks.
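A minimal loading sketch with torchvision follows, assuming the images are arranged in per-class folders ("train/COVID", "train/Normal", "train/Pneumonia"); the directory layout is an assumption, not documented structure.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # X-rays are often single-channel
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("chest_xray/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

print(train_set.classes)            # e.g. ['COVID', 'Normal', 'Pneumonia']
images, labels = next(iter(loader))
print(images.shape, labels[:8])
```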
This dataset focuses on the Chinese language environment and can be used as a benchmark for multi‑label learning, noise robustness, and semi‑supervised learning research.
This dataset is part of T2I-CompBench and contains text files for multiple configurations and splits, primarily for textual data. The dataset includes various configurations such as '3d_spatial_train', 'color_val', 'complex_train', 'shape_val', 'texture_train', each with specific features and splits. Features are mainly text data, and splits contain training and validation sets of varying sizes. The dataset serves as a comprehensive benchmark for text‑to‑image generation and is licensed under the MIT License. It is obtained from a GitHub repository and uploaded to the Hugging Face Hub.
The dataset contains three parts—generation, refinement, and quality—each with training, testing, and validation configurations, and specifies the corresponding data file paths.
MindBigData 2022 is a large-scale EEG signal dataset comprising three primary datasets and their sub-datasets. The data were collected using various EEG devices such as MindWave, EPOC1, Muse1, Insight1, etc., with detailed sampling rates and channel information. The dataset is split into 80% training and 20% testing, containing both labels and EEG recordings. Each sub-dataset has a specific device and sampling-rate configuration:

1. MindBigData MNIST of Brain Digits – four sub-datasets based on MindWave, EPOC1, Muse1, and Insight1;
2. MindBigData Imagenet of the Brain – two sub-datasets based on Insight1 EEG signals and spectrograms;
3. MindBigData Visual MNIST of Brain Digits – three sub-datasets based on Muse2, Cap64, and Cap64 Morlet devices.
A machine‑learning project for network intrusion detection that uses a Convolutional Neural Network (CNN) for model training and evaluation.
This dataset is used to train and evaluate machine‑learning models for power substation network cybersecurity, containing network capture data for IEC 61850 and IEC 104 protocols.
The MIT‑Adobe FiveK dataset is a publicly available collection containing 5,000 RAW images in DNG format, each retouched by five experts to produce 25,000 TIFF images (16‑bit per channel, ProPhoto RGB, lossless). The dataset also includes semantic information for each image. Created by MIT and Adobe Systems, Inc., it is intended to provide a diverse and challenging test set for image‑processing algorithms. Images cover a wide range of scenes—landscapes, portraits, still life, architecture—and exhibit varied lighting, color balance, and exposure conditions.
This dataset contains various health‑related features and a binary target variable indicating the presence or absence of diabetes. The dataset originates from the CDC and is used to build machine‑learning models to classify individuals as diabetic or not.
This repository contains multiple datasets, including TIFA, AIGCIQA2023, ECCV_Caption, and D3PO, for human feedback research on reward models. Each dataset comes with detailed download and setup instructions.
The BAM dataset consists of several subsets (e.g., `obj`, `scene`, `scene_only`, etc.) designed for model comparison and input‑dependency analysis. Each subset provides detailed training and validation splits along with specific usage descriptions.
Open Images is a dataset containing approximately 9 million image URLs annotated with labels spanning over 6,000 categories. The dataset is split into training and validation sets; each image may carry one or more labels, and label information is provided via CSV files.
A collection of machine‑learning training datasets provided by Data Science Dojo. Currently includes 43 datasets, categorized into classification‑clustering and regression tasks, with difficulty levels easy, medium, and hard. Each dataset folder contains a README.md that details basic information, feature description, data source, etc.
This is the first public dataset of real oil wells containing rare adverse events, which can serve as a benchmark dataset for developing machine learning techniques related to the inherent challenges of real-world data.
This dataset is part of the Bittensor Subnet 13 decentralized network and contains pre‑processed Reddit data. Network miners continuously update the data, providing a real‑time stream of Reddit content suitable for various analysis and machine‑learning tasks. The dataset includes fields such as text, label, data type, community name, datetime, username (encoded), and URL (encoded). The primary language is English, though multilingual content may be present. It is released under the MIT license and is subject to Reddit's terms of use.
This study addresses leaf‑disease classification tasks in plant pathology, employing multiple advanced deep‑learning models on the plant‑pathology‑2021 dataset.
The fish disease dataset is a collection of images depicting various diseases that affect fish. These images have been carefully selected and annotated for training and evaluating machine‑learning models to detect and classify fish diseases. The dataset covers a variety of fish species and diseases commonly found in aquaculture and natural environments.
The Stanford Human Preferences Dataset (SHP) comprises 385K Reddit user preference records across 18 distinct topical domains, intended for training RLHF reward models and NLG evaluation models. Each example consists of a Reddit post containing a question or instruction and a pair of top-level comments, where one comment is preferred by the Reddit community. Timestamp information is used when inferring preferences so that a higher score reflects greater helpfulness rather than earlier posting; the preferences capture helpfulness rather than harmlessness. The dataset includes training, validation, and test splits for 18 sub-forums, with each sub-forum stored as JSONL files.
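A minimal loading sketch with the `datasets` library follows; the Hub id "stanfordnlp/SHP" and the behavior of loading all sub-forums by default are assumptions, so check the dataset card before relying on them.

```python
from datasets import load_dataset

# Assumed Hub id; a data_dir argument may be needed to select a single sub-forum.
shp = load_dataset("stanfordnlp/SHP", split="train")

example = shp[0]
print(example.keys())  # inspect the post, the two candidate comments, and the preference label
```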
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: label
    dtype:
      class_label:
        names:
          '0': Alpha
          '1': Beta
          '2': Chi
          '3': Delta
          '4': Epsilon
          '5': Eta
          '6': Gamma
          '7': Iota
          '8': Kappa
          '9': Lambda
          '10': LunateSigma
          '11': Mu
          '12': Nu
          '13': Omega
          '14': Omicron
          '15': Phi
          '16': Pi
          '17': Psi
          '18': Rho
          '19': Tau
          '20': Theta
          '21': Upsilon
          '22': Xi
          '23': Zeta
  splits:
  - name: train
    num_bytes: 309609553.26
    num_examples: 205797
  download_size: 217254607
  dataset_size: 309609553.26
---

# Dataset Card for "AncientMNIST"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
Dataset for a machine learning project on Polish house price prediction, containing detailed information such as location, size, and floor.
A computer‑vision dataset for Chinese Mahjong tiles, containing various tile images and their labels for training and testing machine‑learning models.
Codecfake dataset and countermeasures for universal detection of deep‑fake audio. Because of Zenodo repository size limits, the dataset is split into multiple subsets, including training, development, and test sets.
This dataset was generated by Lilac for a HuggingFace Space. The original source dataset is wikitext. The configuration includes the namespace, name, source dataset name, configuration name, as well as the signal‑processing path and embedding method. Signal processing covers various signals such as near‑duplicate detection, PII detection, language detection, text statistics, sentiment analysis, code detection, and toxicity detection.
This project stores the dataset used for a flower‑recognition project based on a convolutional neural network. The original dataset is publicly available on Kaggle but is inconvenient to download; it is provided here for easy access. The data also serve as backend content for a web‑frontend that displays scrolling images.
The Stanford Dog Dataset contains approximately 20,000 images spanning 120 categories, with each image accompanied by corresponding annotations. The dataset is used to train convolutional neural network (CNN) classifiers; due to the limited data volume, transfer learning techniques are employed, utilizing pretrained models such as VGG16.
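A minimal transfer-learning sketch with a pretrained VGG16 backbone follows, assuming the Stanford Dogs images are arranged in per-class folders under "dogs/train"; the directory layout and training hyperparameters are illustrative, not the project's actual configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

train_ds = tf.keras.utils.image_dataset_from_directory(
    "dogs/train", image_size=(224, 224), batch_size=32
)

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained convolutional base

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(120, activation="softmax"),  # 120 dog breeds
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=3)
```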
The CICIDS2017 dataset is used for cybersecurity tasks and contains several days of network traffic data for malicious traffic detection. The data have been read, cleaned, merged, and a random‑forest model has been applied for classification.
The dataset is primarily intended for research in chemistry, biology, and medicine, containing three features: CID, SMILES, and SELFIES, which identify compounds, describe molecular structures, and provide self‑descriptive molecular representations, respectively. The dataset is split into training, validation, and test sets, comprising a large number of samples with a total size of 36.6 TB and a download size of 12.6 GB.
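The sketch below shows the relationship between the SMILES and SELFIES fields using the `selfies` package; the example molecule (ethanol) is illustrative and not drawn from the dataset.

```python
import selfies as sf

smiles = "CCO"                     # ethanol
selfies_str = sf.encoder(smiles)   # e.g. '[C][C][O]'
roundtrip = sf.decoder(selfies_str)

print(selfies_str, roundtrip)
```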
The dataset contains 2.4 million lumbar spine MRI studies focused on the vertebrae and intervertebral discs. Scans are accompanied by medical reports for diagnosing spinal conditions such as degenerative spinal disease, lumbar degenerative disease, and disc herniation. The dataset emphasizes sagittal T2-weighted imaging of the spinal canal and also includes axial views. It supports segmentation algorithms and classification models for accurate automatic segmentation and classification of spinal structures, and deep learning can be applied to assess spinal stenosis and detect degenerative changes. The data are suitable for machine-learning and medical-diagnostic tasks; all patients consented to data release, and the data are de-identified.
The dataset contains two features: instruction and output, both of type string. It is split into a training set with 682 samples and a test set with 293 samples. The total size of the dataset is 2,205,691 bytes, and the download size is 1,042,346 bytes. The dataset configuration includes a default configuration, specifying the paths for the training and test data.
MixInstruct is a dataset released for the LLM‑Blender project. It contains responses from 11 currently popular instruction‑following LLMs, including Stanford Alpaca, FastChat Vicuna, Dolly V2, StableLM, Open Assistant, Koala, Baize, Flan‑T5, ChatGLM, MOSS, and Mosaic MPT. The dataset is evaluated with automatic metrics (BLEU, ROUGE, BERTScore, BARTScore) and pairwise comparisons of 4,771 test samples performed by ChatGPT. The format is JSON, with fields for instruction, input, output, and candidate responses, each accompanied by detailed scores.
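The sketch below illustrates the record layout described above and how candidate responses might be ranked by one of the automatic metrics; the exact field and metric names in the released JSON may differ, so treat this structure as illustrative.

```python
import json

record = {
    "instruction": "Summarize the paragraph.",
    "input": "Large language models ...",
    "output": "A reference summary.",
    "candidates": [
        {"model": "alpaca", "text": "Summary A", "scores": {"bleu": 12.3, "rouge1": 0.41}},
        {"model": "vicuna", "text": "Summary B", "scores": {"bleu": 15.8, "rouge1": 0.47}},
    ],
}

# Rank candidate responses by one of the automatic metrics, e.g. ROUGE-1.
best = max(record["candidates"], key=lambda c: c["scores"]["rouge1"])
print(json.dumps(best, indent=2))
```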
---
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '100'
          '1': '101'
          '2': '102'
          '3': '103'
          '4': '104'
          '5': '106'
          '6': '107'
          '7': '108'
          '8': '109'
          '9': '110'
          '10': '112'
          '11': '113'
          '12': '114'
          '13': '115'
          '14': '116'
  - name: idx
    dtype: int32
  splits:
  - name: test
    num_bytes: 810970
    num_examples: 10000
  - name: train
    num_bytes: 4245677
    num_examples: 53360
  - name: validation
    num_bytes: 797922
    num_examples: 10000
  download_size: 4697191
  dataset_size: 5854569
---

# Dataset Card for "TNews-classification"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
This dataset is used for welding quality inspection and contains welding images with corresponding labels. Labels are divided into six categories to classify different welding quality issues. The dataset is primarily intended for model training and includes 45,058 training samples.
MASSW is a comprehensive text dataset summarizing multiple aspects of scientific workflows. It contains over 152,000 peer‑reviewed publications from 17 leading computer‑science conferences spanning the past 50 years. The dataset defines five core aspects of a scientific workflow: context, key idea, method, outcome, and projected impact, and systematically extracts and structures these aspects from each publication using large language models (LLMs). MASSW is large‑scale and high‑accuracy, verified through extensive checks and comparisons with human annotations and alternative methods. It supports various novel and benchmarkable machine‑learning tasks such as idea generation and outcome prediction, providing a benchmark for evaluating LLM agents in scientific research.
A dataset of chicken clucks intended to enable machine‑learning models to translate chicken sounds into human language and better understand poultry needs. It includes individual chicken calls and longer audio segments possibly containing multiple chickens, used to analyze chicken welfare.
This project uses the 'ai-hub2' dataset, primarily aimed at providing high‑quality training data to improve the YOLOv11 construction equipment detection system on construction sites. The dataset contains five categories: boring machine, concrete truck, crane, dump truck, and excavator, covering common heavy machinery on construction sites to support vehicle detection in complex environments.
The dataset contains image and trajectory information. Image features store image data, while trajectory features store trajectory data in string form. The dataset comprises a training set with 80,193 samples, total size 3,018,635,973.509 bytes. Download size is 2,977,091,891 bytes. Configuration name is "default", and training data files are located at "data/train-*".
This dataset, exported from roboflow.ai on 30 March 2022 at 15:11 GMT, contains 250 images of airplanes annotated in COCO format. No image augmentations were applied. The task is object detection with the label [planes]; the images are split into training, validation, and test sets.
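A minimal sketch for reading the COCO-format annotations that accompany this export follows; the file name "_annotations.coco.json" follows Roboflow's usual convention but is an assumption here.

```python
import json
from collections import defaultdict

with open("train/_annotations.coco.json", encoding="utf-8") as f:
    coco = json.load(f)

# Group bounding boxes by image id; COCO boxes are [x, y, width, height].
boxes_per_image = defaultdict(list)
for ann in coco["annotations"]:
    boxes_per_image[ann["image_id"]].append(ann["bbox"])

first = coco["images"][0]
print(first["file_name"], boxes_per_image[first["id"]])
```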
This dataset originates from the UCI Machine Learning Repository and contains data on daily sports activities. The dataset was collected using multiple sensors, capturing movement data from various body parts, and underwent complex preprocessing steps such as feature extraction and normalization, intended for training and testing machine learning models.
--- configs: - config_name: default data_files: - split: train path: data/train-* dataset_info: features: - name: input dtype: string - name: output dtype: string - name: instruction dtype: string - name: data_source dtype: string splits: - name: train num_bytes: 30776452 num_examples: 24926 download_size: 15565850 dataset_size: 30776452 language: - en size_categories: - 10K<n<100K --- # Open-Platypus This dataset is focused on improving LLM logical reasoning skills and was used to train the Platypus2 models. It is comprised of the following datasets, which were filtered using keyword search and then Sentence Transformers to remove questions with a similarity above 80%: | Dataset Name | License Type | |--------------------------------------------------------------|--------------| | [PRM800K](https://github.com/openai/prm800k) | MIT | | [MATH](https://github.com/hendrycks/math) | MIT | | [ScienceQA](https://github.com/lupantech/ScienceQA) | [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/) | | [SciBench](https://github.com/mandyyyyii/scibench) | MIT | | [ReClor](https://whyu.me/reclor/) | Non-commercial | | [TheoremQA](https://huggingface.co/datasets/wenhu/TheoremQA) | MIT | | [`nuprl/leetcode-solutions-python-testgen-gpt4`](https://huggingface.co/datasets/nuprl/leetcode-solutions-python-testgen-gpt4/viewer/nuprl--leetcode-solutions-python-testgen-gpt4/train?p=1) | None listed | | [`jondurbin/airoboros-gpt4-1.4.1`](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1) | other | | [`TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k`](https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k/viewer/TigerResearch--tigerbot-kaggle-leetcodesolutions-en-2k/train?p=2) | apache-2.0 | | [ARB](https://arb.duckai.org) | CC BY 4.0 | | [`timdettmers/openassistant-guanaco`](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) | apache-2.0 | ## Data Contamination Check We've removed approximately 200 questions that appear in the Hugging Face benchmark test sets. Please see our [paper](https://arxiv.org/abs/2308.07317) and [project webpage](https://platypus-llm.github.io) for additional information. ## Model Info Please see models at [`garage-bAInd`](https://huggingface.co/garage-bAInd). ## Training and filtering code Please see the [Platypus GitHub repo](https://github.com/arielnlee/Platypus). ## Citations ```bibtex @article{platypus2023, title={Platypus: Quick, Cheap, and Powerful Refinement of LLMs}, author={Ariel N. Lee and Cole J. 
Hunter and Nataniel Ruiz}, booktitle={arXiv preprint arxiv:2308.07317}, year={2023} } ``` ```bibtex @article{lightman2023lets, title={Let's Verify Step by Step}, author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl}, journal={preprint arXiv:2305.20050}, year={2023} } ``` ```bibtex @inproceedings{lu2022learn, title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering}, author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Ashwin Kalyan}, booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)}, year={2022} } ``` ```bibtex @misc{wang2023scibench, title={SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models}, author={Xiaoxuan Wang and Ziniu Hu and Pan Lu and Yanqiao Zhu and Jieyu Zhang and Satyen Subramaniam and Arjun R. Loomba and Shichang Zhang and Yizhou Sun and Wei Wang}, year={2023}, arXiv eprint 2307.10635 } ``` ```bibtex @inproceedings{yu2020reclor, author = {Yu, Weihao and Jiang, Zihang and Dong, Yanfei and Feng, Jiashi}, title = {ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning}, booktitle = {International Conference on Learning Representations (ICLR)}, month = {April}, year = {2020} } ``` ```bibtex @article{chen2023theoremqa, title={TheoremQA: A Theorem-driven Question Answering dataset}, author={Chen, Wenhu and Ming Yin, Max Ku, Elaine Wan, Xueguang Ma, Jianyu Xu, Tony Xia, Xinyi Wang, Pan Lu}, journal={preprint arXiv:2305.12524}, year={2023} } ``` ```bibtex @article{hendrycksmath2021, title={Measuring Mathematical Problem Solving With the MATH Dataset}, author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt}, journal={NeurIPS}, year={2021} } ``` ```bibtex @misc{sawada2023arb, title={ARB: Advanced Reasoning Benchmark for Large Language Models}, author={Tomohiro Sawada and Daniel Paleka and Alexander Havrilla and Pranav Tadepalli and Paula Vidas and Alexander Kranias and John J. Nay and Kshitij Gupta and Aran Komatsuzaki}, arXiv eprint 2307.13692, year={2023} } ```
This dataset contains various information on residential rental properties, such as the number of bedrooms, bathrooms, area, location, amenities, etc., for training and evaluating machine‑learning models that predict rent prices.
The dataset includes features such as index, observation images (right‑wrist cam, high cam, left‑wrist cam, low cam), velocity, state, effort, action, episode index, frame index, next‑state done flag, and timestamp. Each feature has a specific data type and length (e.g., video frames, float sequences). The dataset is primarily for training and contains 1,881 examples with a total size of 955,784 bytes.
The V‑D4RL dataset provides pixel‑based analogues of the D4RL benchmark tasks derived from the dm_control suite and extends two state‑of‑the‑art online pixel‑based continuous control algorithms, DrQ‑v2 and DreamerV2, to offline settings. It includes data of varying difficulty across multiple environments such as walker_walk, cheetah_run, and humanoid_walk, along with corresponding benchmarks and algorithm evaluations.
--- annotations_creators: - crowdsourced language_creators: - crowdsourced license: - cc0-1.0 multilinguality: - multilingual size_categories: ab: - 10K<n<100K ar: - 100K<n<1M as: - 1K<n<10K ast: - n<1K az: - n<1K ba: - 100K<n<1M bas: - 1K<n<10K be: - 100K<n<1M bg: - 1K<n<10K bn: - 100K<n<1M br: - 10K<n<100K ca: - 1M<n<10M ckb: - 100K<n<1M cnh: - 1K<n<10K cs: - 10K<n<100K cv: - 10K<n<100K cy: - 100K<n<1M da: - 1K<n<10K de: - 100K<n<1M dv: - 10K<n<100K el: - 10K<n<100K en: - 1M<n<10M eo: - 1M<n<10M es: - 1M<n<10M et: - 10K<n<100K eu: - 100K<n<1M fa: - 100K<n<1M fi: - 10K<n<100K fr: - 100K<n<1M fy-NL: - 10K<n<100K ga-IE: - 1K<n<10K gl: - 10K<n<100K gn: - 1K<n<10K ha: - 1K<n<10K hi: - 10K<n<100K hsb: - 1K<n<10K hu: - 10K<n<100K hy-AM: - 1K<n<10K ia: - 10K<n<100K id: - 10K<n<100K ig: - 1K<n<10K it: - 100K<n<1M ja: - 10K<n<100K ka: - 10K<n<100K kab: - 100K<n<1M kk: - 1K<n<10K kmr: - 10K<n<100K ky: - 10K<n<100K lg: - 100K<n<1M lt: - 10K<n<100K lv: - 1K<n<10K mdf: - n<1K mhr: - 100K<n<1M mk: - n<1K ml: - 1K<n<10K mn: - 10K<n<100K mr: - 10K<n<100K mrj: - 10K<n<100K mt: - 10K<n<100K myv: - 1K<n<10K nan-tw: - 10K<n<100K ne-NP: - n<1K nl: - 10K<n<100K nn-NO: - n<1K or: - 1K<n<10K pa-IN: - 1K<n<10K pl: - 100K<n<1M pt: - 100K<n<1M rm-sursilv: - 1K<n<10K rm-vallader: - 1K<n<10K ro: - 10K<n<100K ru: - 100K<n<1M rw: - 1M<n<10M sah: - 1K<n<10K sat: - n<1K sc: - 1K<n<10K sk: - 10K<n<100K skr: - 1K<n<10K sl: - 10K<n<100K sr: - 1K<n<10K sv-SE: - 10K<n<100K sw: - 100K<n<1M ta: - 100K<n<1M th: - 100K<n<1M ti: - n<1K tig: - n<1K tok: - 1K<n<10K tr: - 10K<n<100K tt: - 10K<n<100K tw: - n<1K ug: - 10K<n<100K uk: - 10K<n<100K ur: - 100K<n<1M uz: - 100K<n<1M vi: - 10K<n<100K vot: - n<1K yue: - 10K<n<100K zh-CN: - 100K<n<1M zh-HK: - 100K<n<1M zh-TW: - 100K<n<1M source_datasets: - extended|common_voice task_categories: - automatic-speech-recognition task_ids: [] paperswithcode_id: common-voice pretty_name: Common Voice Corpus 11.0 language_bcp47: - ab - ar - as - ast - az - ba - bas - be - bg - bn - br - ca - ckb - cnh - cs - cv - cy - da - de - dv - el - en - eo - es - et - eu - fa - fi - fr - fy-NL - ga-IE - gl - gn - ha - hi - hsb - hu - hy-AM - ia - id - ig - it - ja - ka - kab - kk - kmr - ky - lg - lt - lv - mdf - mhr - mk - ml - mn - mr - mrj - mt - myv - nan-tw - ne-NP - nl - nn-NO - or - pa-IN - pl - pt - rm-sursilv - rm-vallader - ro - ru - rw - sah - sat - sc - sk - skr - sl - sr - sv-SE - sw - ta - th - ti - tig - tok - tr - tt - tw - ug - uk - ur - uz - vi - vot - yue - zh-CN - zh-HK - zh-TW extra_gated_prompt: By clicking on “Access repository” below, you also agree to not attempt to determine the identity of speakers in the Common Voice dataset. 
--- # Dataset Card for Common Voice Corpus 11.0 ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [How to use](#how-to-use) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://commonvoice.mozilla.org/en/datasets - **Repository:** https://github.com/common-voice/common-voice - **Paper:** https://arxiv.org/abs/1912.06670 - **Leaderboard:** https://paperswithcode.com/dataset/common-voice - **Point of Contact:** [Anton Lozhkov](mailto:anton@huggingface.co) ### Dataset Summary The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 16413 validated hours in 100 languages, but more voices and languages are always added. Take a look at the [Languages](https://commonvoice.mozilla.org/en/languages) page to request a language or start contributing. ### Supported Tasks and Leaderboards The results for models trained on the Common Voice datasets are available via the [🤗 Autoevaluate Leaderboard](https://huggingface.co/spaces/autoevaluate/leaderboards?dataset=mozilla-foundation%2Fcommon_voice_11_0&only_verified=0&task=automatic-speech-recognition&config=ar&split=test&metric=wer) ### Languages ``` Abkhaz, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Kurmanji Kurdish, Kyrgyz, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Odia, Persian, Polish, Portuguese, Punjabi, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Sorbian, Upper, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh ``` ## How to use The `datasets` library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared in one call to your local drive by using the `load_dataset` function. 
For example, to download the Hindi config, simply specify the corresponding language config name (i.e., "hi" for Hindi): ```python from datasets import load_dataset cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train") ``` Using the datasets library, you can also stream the dataset on-the-fly by adding a `streaming=True` argument to the `load_dataset` function call. Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire dataset to disk. ```python from datasets import load_dataset cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train", streaming=True) print(next(iter(cv_11))) ``` *Bonus*: create a [PyTorch dataloader](https://huggingface.co/docs/datasets/use_with_pytorch) directly with your own datasets (local/streamed). ### Local ```python from datasets import load_dataset from torch.utils.data.sampler import BatchSampler, RandomSampler cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train") batch_sampler = BatchSampler(RandomSampler(cv_11), batch_size=32, drop_last=False) dataloader = DataLoader(cv_11, batch_sampler=batch_sampler) ``` ### Streaming ```python from datasets import load_dataset from torch.utils.data import DataLoader cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train") dataloader = DataLoader(cv_11, batch_size=32) ``` To find out more about loading and preparing audio datasets, head over to [hf.co/blog/audio-datasets](https://huggingface.co/blog/audio-datasets). ### Example scripts Train your own CTC or Seq2Seq Automatic Speech Recognition models on Common Voice 11 with `transformers` - [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition). ## Dataset Structure ### Data Instances A typical data point comprises the `path` to the audio file and its `sentence`. Additional fields include `accent`, `age`, `client_id`, `up_votes`, `down_votes`, `gender`, `locale` and `segment`. ```python { 'client_id': 'd59478fbc1ee646a28a3c652a119379939123784d99131b865a89f8b21c81f69276c48bd574b81267d9d1a77b83b43e6d475a6cfc79c232ddbca946ae9c7afc5', 'path': 'et/clips/common_voice_et_18318995.mp3', 'audio': { 'path': 'et/clips/common_voice_et_18318995.mp3', 'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32), 'sampling_rate': 48000 }, 'sentence': 'Tasub kokku saada inimestega, keda tunned juba ammust ajast saati.', 'up_votes': 2, 'down_votes': 0, 'age': 'twenties', 'gender': 'male', 'accent': '', 'locale': 'et', 'segment': '' } ``` ### Data Fields `client_id` (`string`): An id for which client (voice) made the recording `path` (`string`): The path to the audio file `audio` (`dict`): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`. 
`sentence` (`string`): The sentence the user was prompted to speak `up_votes` (`int64`): How many upvotes the audio file has received from reviewers `down_votes` (`int64`): How many downvotes the audio file has received from reviewers `age` (`string`): The age of the speaker (e.g. `teens`, `twenties`, `fifties`) `gender` (`string`): The gender of the speaker `accent` (`string`): Accent of the speaker `locale` (`string`): The locale of the speaker `segment` (`string`): Usually an empty field ### Data Splits The speech material has been subdivided into portions for dev, train, test, validated, invalidated, reported and other. The validated data is data that has been validated with reviewers and received upvotes that the data is of high quality. The invalidated data is data has been invalidated by reviewers and received downvotes indicating that the data is of low quality. The reported data is data that has been reported, for different reasons. The other data is data that has not yet been reviewed. The dev, test, train are all data that has been reviewed, deemed of high quality and split into dev, test and train. ## Data Preprocessing Recommended by Hugging Face The following are data preprocessing steps advised by the Hugging Face team. They are accompanied by an example code snippet that shows how to put them to practice. Many examples in this dataset have trailing quotations marks, e.g _“the cat sat on the mat.“_. These trailing quotation marks do not change the actual meaning of the sentence, and it is near impossible to infer whether a sentence is a quotation or not a quotation from audio data alone. In these cases, it is advised to strip the quotation marks, leaving: _the cat sat on the mat_. In addition, the majority of training sentences end in punctuation ( . or ? or ! ), whereas just a small proportion do not. In the dev set, **almost all** sentences end in punctuation. Thus, it is recommended to append a full-stop ( . ) to the end of the small number of training examples that do not end in punctuation. ```python from datasets import load_dataset ds = load_dataset("mozilla-foundation/common_voice_11_0", "en", use_auth_token=True) def prepare_dataset(batch): """Function to preprocess the dataset with the .map method""" transcription = batch["sentence"] if transcription.startswith('"') and transcription.endswith('"'): # we can remove trailing quotation marks as they do not affect the transcription transcription = transcription[1:-1] if transcription[-1] not in [".", "?", "!"]: # append a full-stop to sentences that do not end in punctuation transcription = transcription + "." batch["sentence"] = transcription return batch ds = ds.map(prepare_dataset, desc="preprocess dataset") ``` ## Dataset Creation ### Curation Rationale [Needs More Information] ### Source Data #### Initial Data Collection and Normalization [Needs More Information] #### Who are the source language producers? [Needs More Information] ### Annotations #### Annotation process [Needs More Information] #### Who are the annotators? [Needs More Information] ### Personal and Sensitive Information The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset. ## Considerations for Using the Data ### Social Impact of Dataset The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset. 
### Discussion of Biases [More Information Needed] ### Other Known Limitations [More Information Needed] ## Additional Information ### Dataset Curators [More Information Needed] ### Licensing Information Public Domain, [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) ### Citation Information ``` @inproceedings{commonvoice:2020, author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.}, title = {Common Voice: A Massively-Multilingual Speech Corpus}, booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)}, pages = {4211--4215}, year = 2020 } ```
The dataset consists of multiple features such as 'source', 'chosen', 'chosen_rating', 'chosen_model', 'rejected', 'rejected_rating', and 'rejected_model'. 'chosen' and 'rejected' are list-type fields containing sub-features 'content' and 'role'. The dataset is split into 'train' (15,204 samples) and 'test' (200 samples). Total download size is 79,362,069 bytes; total dataset size is 152,534,966.0 bytes.
Retrieval-Augmented Generation (RAG) Dataset 12000 is an English dataset specifically designed to optimize RAG models. Built by Neural Bridge AI, it contains 12,000 entries, each comprising three fields: context, question, and answer. Context data originates from Falcon RefinedWeb, while questions and answers are generated by GPT-4. The dataset is split into a training set (9,600 samples) and a test set (2,400 samples) and is released under the Apache 2.0 license.
The dataset consists of chess‑board states and associated move sequences extracted from games downloaded from lichess.org. Each game is parsed into multiple records; each record starts with a FEN string followed by 1‑10 SAN moves. The data are intended for training the ChessRoberta model and have not been filtered, so they may not be optimal for high‑performance chess modelling.
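A minimal parsing sketch follows, assuming each record is a FEN string followed by SAN moves; the python-chess library is used to validate that the moves are legal from that position, and the example record is illustrative.

```python
import chess

record = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 e4 e5 Nf3"
fields = record.split()
fen, san_moves = " ".join(fields[:6]), fields[6:]   # a FEN string has six fields

board = chess.Board(fen)
for san in san_moves:
    board.push_san(san)   # raises a ValueError subclass if the move is illegal

print(board.fen())        # position after the recorded moves
```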
Imagewoof is a subset of ImageNet containing ten dog‑breed categories, designed to provide a challenging image‑classification benchmark. Created by Jeremy Howard, it is released under the Apache‑2.0 license. The dataset includes images and corresponding labels, with a defined train/validation split.
---
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: fingerprint
    dtype: 'null'
  - name: seed
    dtype: string
  - name: sha1
    dtype: string
  - name: id
    dtype: int64
  - name: concepts
    sequence: string
  splits:
  - name: train
    num_bytes: 1546511078
    num_examples: 239121
  download_size: 335965438
  dataset_size: 1546511078
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---