This dataset is for supervised fine‑tuning (SFT) and direct preference optimization (DPO), available in English and Chinese versions. It is based on the four MBTI dimensions, each with two opposing attributes: energy (Extraversion E – Introversion I), information (Sensing S – Intuition N), decision (Thinking T – Feeling F), and execution (Judging J – Perceiving P). The dataset follows the Alpaca format, containing instruction, input, and output. Users can select the appropriate file for SFT or DPO based on the MBTI type.
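As a rough illustration of the Alpaca layout described above, the sketch below loads one per-type SFT file and prints a record; the file name `en_INTJ_sft.json` is hypothetical and not part of the dataset's documented structure.

```python
import json

# A minimal sketch, assuming each MBTI type ships as an Alpaca-format JSON file;
# the file name "en_INTJ_sft.json" is hypothetical.
with open("en_INTJ_sft.json", encoding="utf-8") as f:
    records = json.load(f)

# Each record follows the Alpaca schema: instruction, input, output.
example = records[0]
print(example["instruction"])
print(example["input"])
print(example["output"])
```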
The dataset contains user action types, timestamps, and final goals, split into training and testing sets. Each set includes three files: action type, action time, and goal.
This dataset is used for stroke analysis and prediction, employing machine learning models together with resampling techniques such as SMOTEENN to address class imbalance and improve prediction accuracy.
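A minimal sketch of the SMOTEENN workflow with imbalanced-learn follows, assuming a table whose feature columns are already numeric and whose binary target column is named "stroke"; the file and column names are hypothetical.

```python
import pandas as pd
from imblearn.combine import SMOTEENN
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("stroke.csv")                 # hypothetical file name
X, y = df.drop(columns=["stroke"]), df["stroke"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Resample only the training split so the test set keeps its natural class imbalance.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))
```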
The FungiTastic dataset was created jointly by the University of West Bohemia, INRIA and the Czech Technical University in Prague. It comprises approximately 350,000 records and over 650,000 fungal photographs with detailed metadata. The dataset stems from twenty years of continuous data collection and supports various machine learning tasks such as closed‑set and open‑set classification. Rich metadata includes timestamps, camera settings, geographic coordinates, satellite imagery, and biological taxonomy. It is widely used for image classification problems in biology, particularly for fungal identification.
The dataset contains a feature named `signal` of type float32 and a feature named `label_id` of type int32. It is split into training, validation, and test sets with 9,900, 539, and 1,100 samples respectively. Total download size is 35,524,532 bytes; dataset size is 12,000,560 bytes.
This dataset contains data from [openai/prm800k](https://github.com/openai/prm800k). It is divided into two phases (phase1 and phase2), each with train and test splits. Features include labeler, timestamp, question, etc.; detailed feature types are described in the README.
The dataset contains 5,000 phishing URLs and 5,000 legitimate URLs for training machine learning models to predict phishing websites. Features of URLs and site content such as domain, IP, URL length, etc., were extracted, resulting in a dataset with 18 features.
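The sketch below illustrates the kind of URL-level features described above (domain, IP usage, URL length); it is not a reproduction of the dataset's exact 18-feature set, and the helper and feature names are chosen for illustration only.

```python
import re
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Extract a few simple lexical features from a URL."""
    parsed = urlparse(url)
    host = parsed.netloc.split(":")[0]
    return {
        "url_length": len(url),
        "domain": host,
        "uses_ip": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "num_dots": url.count("."),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
    }

print(url_features("http://192.168.0.1/login@secure-update.example"))
```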
The dataset includes three primary features—question, chain‑of‑thought, and answer—and is split into training, validation, and test sets containing 385,620, 500, and 1,319 samples respectively. The download size is 50,052,843 bytes and the total size is 91,978,048 bytes.
The MIL‑QUALAIR dataset was constructed for predicting urban air pollution in the Milan metropolitan area. It includes Sentinel‑5P satellite observations, meteorological conditions, terrain features, and ground‑station measurements. The dataset spans 2018‑2023 and supports prediction of five major pollutants: PM10, PM2.5, NO2, O3, and SO2. It was compiled by the LINKS Foundation, funded by the UP2030 project, and released under the MIT license.
CC6204‑Hackaton‑CUB200 is a multimodal dataset for image‑classification and text‑classification tasks, especially suitable for multimodal classification problems. It contains bird images and descriptive texts; each image has ten textual descriptions, and each instance is labeled with the bird species. The dataset provides training (5,994 observations) and test (5,794 observations) splits. It originates from the Caltech Vision Lab; the associated paper is "The Caltech‑UCSD Birds‑200‑2011 Dataset". Creators and contributors include Catherine Wah and Cristóbal Alcázar.
This corpus contains questions and answers collected from the website Insurance Library, the first open QA corpus in the insurance domain. Content is user‑submitted, with high‑quality answers provided by professionals possessing deep domain knowledge. The dataset is divided into two parts, "question‑answer corpus" and "question‑answer pair corpus", suitable for training machine‑learning models.
The Multi30k dataset is a multilingual English‑German image description dataset, containing training, validation, and test sets, and supporting multiple languages such as English, German, French, and Czech. The dataset provides detailed statistics such as the number of sentences, word count, and average words per sentence. Additionally, it offers download links for visual features and original images.
The MoleculeNet ESOL dataset is part of the MoleculeNet benchmark for predicting aqueous solubility. The target values are log‑transformed, expressed as log mol/L. The dataset contains 1,128 samples; scaffold split is recommended; evaluation metric is RMSE.
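A minimal sketch of the recommended evaluation metric follows: RMSE computed on log-solubility values (log mol/L). The prediction values here are placeholders, not output from any trained model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([-0.77, -3.30, -2.06])   # measured log solubility, log mol/L (illustrative)
y_pred = np.array([-0.95, -3.10, -1.80])   # hypothetical model predictions

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.3f} log mol/L")
```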
This project aims to generate a rich dataset from the PepBDB database for machine‑learning and computational‑biology research. The dataset processes peptide‑protein interaction data, extracts sequences, and adds various biochemical features, creating a tabular dataset suitable for Random Forest, XGBoost, and other analyses. Each row is labeled as binding residue (1) or non‑binding residue (0).
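A minimal sketch of the Random Forest use case follows, assuming the PepBDB-derived table has numeric biochemical feature columns plus a binary "label" column (1 = binding residue, 0 = non-binding); the file and column names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

df = pd.read_csv("pepbdb_residues.csv")        # hypothetical file name
X, y = df.drop(columns=["label"]), df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" compensates for the excess of non-binding residues.
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
print("F1 on held-out residues:", f1_score(y_test, clf.predict(X_test)))
```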
PHOENIX-2014 and PHOENIX-2014-T are large German Sign Language datasets developed by the Human Language Technology and Pattern Recognition Group at RWTH Aachen University in Germany. Both datasets are widely used in research; this repository provides PyTorch dataset wrappers for them so that they can be used easily in PyTorch models.
This dataset is a rearranged version of the original DPO‑En‑Zh‑20k dataset, split into 9,900 + 9,900 samples for training and 100 + 100 for testing. It contains fields such as language, prompt, rejected response (content and role), and chosen response (content and role), suitable for text generation and QA tasks in both Chinese and English.
DocVQA is a dataset for visual question answering on document images, containing 50,000 questions based on 12,767 images. It is split 80‑10‑10 into train, validation, and test sets (39,463 questions & 10,194 images for training, 5,349 questions & 1,286 images for validation, 5,188 questions & 1,287 images for testing). Document images originate from the UCSF Industry Documents Library and include printed, typed, and handwritten content such as letters, memos, notes, and reports.
Diabetes health indicators dataset, used to compare classic machine learning and quantum machine learning techniques for feature selection and classification.
These datasets are used for machine‑learning research and encompass sign language data from various countries and languages, including video and depth data, for sign language recognition and understanding studies.
The PEMS_SF dataset from the UCI Machine Learning Repository is a collection for training and testing machine learning models. It ships with separate training and test files for model development and validation, including PEMS_train and PEMS_test, among others.
The dataset comprises multiple feature fields such as cn_id, hateSpeech, counterSpeech, hsType, hsSubType, cnType, age, gender, and educationLevel. These fields represent various data types, including strings and floating‑point numbers. The dataset includes a split named **train** with 14,988 examples, totaling 4,432,994 bytes. Download size is 696,348 bytes. The configuration name is **default**, and the data files are located at `data/train-*`.
The database contains 76 attributes, but all published experiments have used 14 of them. Notably, the Cleveland dataset is currently the only one used by machine‑learning researchers. The target field indicates the presence of heart disease, with values ranging from 0 (none) to 4.
The dataset consists of curated chest X‑ray images designed to assist in detecting COVID‑19 and other respiratory diseases. It is divided into training and testing sets, each containing three primary categories: COVID‑19 positive, Normal, and Pneumonia. The dataset is intended for machine‑learning and deep‑learning applications, particularly convolutional neural network (CNN) based image classification tasks.
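A minimal loading sketch with torchvision follows, assuming the images are arranged in per-class folders ("train/COVID", "train/Normal", "train/Pneumonia"); the directory layout is an assumption, not documented structure.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # X-rays are often single-channel
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("chest_xray/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

print(train_set.classes)            # e.g. ['COVID', 'Normal', 'Pneumonia']
images, labels = next(iter(loader))
print(images.shape, labels[:8])
```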
This dataset focuses on the Chinese language environment and can be used as a benchmark for multi‑label learning, noise robustness, and semi‑supervised learning research.
This dataset is part of T2I-CompBench and contains text files for multiple configurations and splits, primarily for textual data. The dataset includes various configurations such as '3d_spatial_train', 'color_val', 'complex_train', 'shape_val', 'texture_train', each with specific features and splits. Features are mainly text data, and splits contain training and validation sets of varying sizes. The dataset serves as a comprehensive benchmark for text‑to‑image generation and is licensed under the MIT License. It is obtained from a GitHub repository and uploaded to the Hugging Face Hub.
The dataset contains three parts—generation, refinement, and quality—each with training, testing, and validation configurations, and specifies the corresponding data file paths.
MindBigData 2022 is a large-scale EEG signal dataset comprising three primary datasets and their sub-datasets. The data were collected using various EEG devices such as MindWave, EPOC1, Muse1, Insight1, etc., with detailed sampling rates and channel information. The dataset is split into 80% training and 20% testing, containing both labels and EEG recordings. Each sub-dataset has a specific device and sampling-rate configuration:

1. MindBigData MNIST of Brain Digits – four sub-datasets based on MindWave, EPOC1, Muse1, and Insight1;
2. MindBigData Imagenet of the Brain – two sub-datasets based on Insight1 EEG signals and spectrograms;
3. MindBigData Visual MNIST of Brain Digits – three sub-datasets based on Muse2, Cap64, and Cap64 Morlet devices.
A machine‑learning project for network intrusion detection that uses a Convolutional Neural Network (CNN) for model training and evaluation.
This dataset is used to train and evaluate machine‑learning models for power substation network cybersecurity, containing network capture data for IEC 61850 and IEC 104 protocols.
The MIT‑Adobe FiveK dataset is a publicly available collection containing 5,000 RAW images in DNG format, each retouched by five experts to produce 25,000 TIFF images (16‑bit per channel, ProPhoto RGB, lossless). The dataset also includes semantic information for each image. Created by MIT and Adobe Systems, Inc., it is intended to provide a diverse and challenging test set for image‑processing algorithms. Images cover a wide range of scenes—landscapes, portraits, still life, architecture—and exhibit varied lighting, color balance, and exposure conditions.
This dataset contains various health‑related features and a binary target variable indicating the presence or absence of diabetes. The dataset originates from the CDC and is used to build machine‑learning models to classify individuals as diabetic or not.
This repository contains multiple datasets, including TIFA, AIGCIQA2023, ECCV_Caption, and D3PO, for human feedback research on reward models. Each dataset comes with detailed download and setup instructions.
The BAM dataset consists of several subsets (e.g., `obj`, `scene`, `scene_only`, etc.) designed for model comparison and input‑dependency analysis. Each subset provides detailed training and validation splits along with specific usage descriptions.
Open Images is a dataset containing approximately 9 million image URLs annotated with labels spanning over 6,000 categories. The dataset is split into training and validation sets; each image may carry one or more labels, and label information is provided via CSV files.
A collection of machine‑learning training datasets provided by Data Science Dojo. Currently includes 43 datasets, categorized into classification‑clustering and regression tasks, with difficulty levels easy, medium, and hard. Each dataset folder contains a README.md that details basic information, feature description, data source, etc.
This is the first public dataset of real oil wells containing rare adverse events, which can serve as a benchmark dataset for developing machine learning techniques related to the inherent challenges of real-world data.
This dataset is part of the Bittensor Subnet 13 decentralized network and contains pre‑processed Reddit data. Network miners continuously update the data, providing a real‑time stream of Reddit content suitable for various analysis and machine‑learning tasks. The dataset includes fields such as text, label, data type, community name, datetime, username (encoded), and URL (encoded). The primary language is English, though multilingual content may be present. It is released under the MIT license and is subject to Reddit's terms of use.
This study addresses leaf‑disease classification tasks in plant pathology, employing multiple advanced deep‑learning models on the plant‑pathology‑2021 dataset.
The fish disease dataset is a collection of images depicting various diseases that affect fish. These images have been carefully selected and annotated for training and evaluating machine‑learning models to detect and classify fish diseases. The dataset covers a variety of fish species and diseases commonly found in aquaculture and natural environments.
The Stanford Human Preferences Dataset (SHP) comprises 385K Reddit user preference records across 18 distinct topical domains, intended for training RLHF reward models and NLG evaluation models. Each example consists of a Reddit post containing a question or instruction and a pair of top-level comments, where one comment is preferred by the Reddit community. Timestamp information is used when inferring preferences so that a higher score reflects greater helpfulness rather than earlier posting; the preferences capture helpfulness rather than harmlessness. The dataset includes training, validation, and test splits for 18 sub-forums, with each sub-forum stored as JSONL files.
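A minimal loading sketch with the `datasets` library follows; the Hub id "stanfordnlp/SHP" and the behavior of loading all sub-forums by default are assumptions, so check the dataset card before relying on them.

```python
from datasets import load_dataset

# Assumed Hub id; a data_dir argument may be needed to select a single sub-forum.
shp = load_dataset("stanfordnlp/SHP", split="train")

example = shp[0]
print(example.keys())  # inspect the post, the two candidate comments, and the preference label
```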
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: label
    dtype:
      class_label:
        names:
          '0': Alpha
          '1': Beta
          '2': Chi
          '3': Delta
          '4': Epsilon
          '5': Eta
          '6': Gamma
          '7': Iota
          '8': Kappa
          '9': Lambda
          '10': LunateSigma
          '11': Mu
          '12': Nu
          '13': Omega
          '14': Omicron
          '15': Phi
          '16': Pi
          '17': Psi
          '18': Rho
          '19': Tau
          '20': Theta
          '21': Upsilon
          '22': Xi
          '23': Zeta
  splits:
  - name: train
    num_bytes: 309609553.26
    num_examples: 205797
  download_size: 217254607
  dataset_size: 309609553.26
---

# Dataset Card for "AncientMNIST"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
Dataset for a machine learning project on Polish house price prediction, containing detailed information such as location, size, and floor.
A computer‑vision dataset for Chinese Mahjong tiles, containing various tile images and their labels for training and testing machine‑learning models.
Codecfake dataset and countermeasures for universal detection of deep‑fake audio. Because of Zenodo repository size limits, the dataset is split into multiple subsets, including training, development, and test sets.
This dataset was generated by Lilac for a HuggingFace Space. The original source dataset is wikitext. The configuration includes the namespace, name, source dataset name, configuration name, as well as the signal‑processing path and embedding method. Signal processing covers various signals such as near‑duplicate detection, PII detection, language detection, text statistics, sentiment analysis, code detection, and toxicity detection.
This project stores the dataset used for a flower‑recognition project based on a convolutional neural network. The original dataset is publicly available on Kaggle but is inconvenient to download; it is provided here for easy access. The data also serve as backend content for a web‑frontend that displays scrolling images.
The Stanford Dog Dataset contains approximately 20,000 images spanning 120 categories, with each image accompanied by corresponding annotations. The dataset is used to train convolutional neural network (CNN) classifiers; due to the limited data volume, transfer learning techniques are employed, utilizing pretrained models such as VGG16.
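A minimal transfer-learning sketch with a pretrained VGG16 backbone follows, assuming the Stanford Dogs images are arranged in per-class folders under "dogs/train"; the directory layout and training hyperparameters are illustrative, not the project's actual configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

train_ds = tf.keras.utils.image_dataset_from_directory(
    "dogs/train", image_size=(224, 224), batch_size=32
)

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained convolutional base

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(120, activation="softmax"),  # 120 dog breeds
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=3)
```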
The CICIDS2017 dataset is used for cybersecurity tasks and contains several days of network traffic data for malicious traffic detection. The data have been read, cleaned, merged, and a random‑forest model has been applied for classification.
The dataset is primarily intended for research in chemistry, biology, and medicine, containing three features: CID, SMILES, and SELFIES, which identify compounds, describe molecular structures, and provide self‑descriptive molecular representations, respectively. The dataset is split into training, validation, and test sets, comprising a large number of samples with a total size of 36.6 TB and a download size of 12.6 GB.
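The sketch below shows the relationship between the SMILES and SELFIES fields using the `selfies` package; the example molecule (ethanol) is illustrative and not drawn from the dataset.

```python
import selfies as sf

smiles = "CCO"                     # ethanol
selfies_str = sf.encoder(smiles)   # e.g. '[C][C][O]'
roundtrip = sf.decoder(selfies_str)

print(selfies_str, roundtrip)
```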
The dataset contains 2.4 million lumbar spine MRI studies focused on the vertebrae and intervertebral discs. Scans are accompanied by medical reports for diagnosing spinal conditions such as degenerative spinal disease, lumbar degenerative disease, and disc herniation. The dataset emphasizes sagittal T2-weighted imaging of the spinal canal and also includes axial views. It supports segmentation algorithms and classification models for accurate automatic segmentation and classification of spinal structures, and deep learning can be applied to assess spinal stenosis and detect degenerative changes. The data are suitable for machine-learning and medical-diagnostic tasks; all patients consented to data release, and the data are de-identified.
The dataset contains two features: instruction and output, both of type string. It is split into a training set with 682 samples and a test set with 293 samples. The total size of the dataset is 2,205,691 bytes, and the download size is 1,042,346 bytes. The dataset configuration includes a default configuration, specifying the paths for the training and test data.
MixInstruct is a dataset released for the LLM‑Blender project. It contains responses from 11 currently popular instruction‑following LLMs, including Stanford Alpaca, FastChat Vicuna, Dolly V2, StableLM, Open Assistant, Koala, Baize, Flan‑T5, ChatGLM, MOSS, and Mosaic MPT. The dataset is evaluated with automatic metrics (BLEU, ROUGE, BERTScore, BARTScore) and pairwise comparisons of 4,771 test samples performed by ChatGPT. The format is JSON, with fields for instruction, input, output, and candidate responses, each accompanied by detailed scores.
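The sketch below illustrates the record layout described above and how candidate responses might be ranked by one of the automatic metrics; the exact field and metric names in the released JSON may differ, so treat this structure as illustrative.

```python
import json

record = {
    "instruction": "Summarize the paragraph.",
    "input": "Large language models ...",
    "output": "A reference summary.",
    "candidates": [
        {"model": "alpaca", "text": "Summary A", "scores": {"bleu": 12.3, "rouge1": 0.41}},
        {"model": "vicuna", "text": "Summary B", "scores": {"bleu": 15.8, "rouge1": 0.47}},
    ],
}

# Rank candidate responses by one of the automatic metrics, e.g. ROUGE-1.
best = max(record["candidates"], key=lambda c: c["scores"]["rouge1"])
print(json.dumps(best, indent=2))
```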
---
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '100'
          '1': '101'
          '2': '102'
          '3': '103'
          '4': '104'
          '5': '106'
          '6': '107'
          '7': '108'
          '8': '109'
          '9': '110'
          '10': '112'
          '11': '113'
          '12': '114'
          '13': '115'
          '14': '116'
  - name: idx
    dtype: int32
  splits:
  - name: test
    num_bytes: 810970
    num_examples: 10000
  - name: train
    num_bytes: 4245677
    num_examples: 53360
  - name: validation
    num_bytes: 797922
    num_examples: 10000
  download_size: 4697191
  dataset_size: 5854569
---

# Dataset Card for "TNews-classification"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
This dataset is used for welding quality inspection and contains welding images with corresponding labels. Labels are divided into six categories to classify different welding quality issues. The dataset is primarily intended for model training and includes 45,058 training samples.
MASSW is a comprehensive text dataset summarizing multiple aspects of scientific workflows. It contains over 152,000 peer‑reviewed publications from 17 leading computer‑science conferences spanning the past 50 years. The dataset defines five core aspects of a scientific workflow: context, key idea, method, outcome, and projected impact, and systematically extracts and structures these aspects from each publication using large language models (LLMs). MASSW is large‑scale and high‑accuracy, verified through extensive checks and comparisons with human annotations and alternative methods. It supports various novel and benchmarkable machine‑learning tasks such as idea generation and outcome prediction, providing a benchmark for evaluating LLM agents in scientific research.
A dataset of chicken clucks intended to enable machine‑learning models to translate chicken sounds into human language and better understand poultry needs. It includes individual chicken calls and longer audio segments possibly containing multiple chickens, used to analyze chicken welfare.
This project uses the 'ai-hub2' dataset, primarily aimed at providing high‑quality training data to improve the YOLOv11 construction equipment detection system on construction sites. The dataset contains five categories: boring machine, concrete truck, crane, dump truck, and excavator, covering common heavy machinery on construction sites to support vehicle detection in complex environments.
The dataset contains image and trajectory information. Image features store image data, while trajectory features store trajectory data in string form. The dataset comprises a training set with 80,193 samples, total size 3,018,635,973.509 bytes. Download size is 2,977,091,891 bytes. Configuration name is "default", and training data files are located at "data/train-*".
This dataset, exported from roboflow.ai on 30 March 2022 at 15:11 GMT, contains 250 images of airplanes annotated in COCO format. No image augmentations were applied. The task is object detection with the label [planes]; the images are split into training, validation, and test sets.
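A minimal sketch for reading the COCO-format annotations that accompany this export follows; the file name "_annotations.coco.json" follows Roboflow's usual convention but is an assumption here.

```python
import json
from collections import defaultdict

with open("train/_annotations.coco.json", encoding="utf-8") as f:
    coco = json.load(f)

# Group bounding boxes by image id; COCO boxes are [x, y, width, height].
boxes_per_image = defaultdict(list)
for ann in coco["annotations"]:
    boxes_per_image[ann["image_id"]].append(ann["bbox"])

first = coco["images"][0]
print(first["file_name"], boxes_per_image[first["id"]])
```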
This dataset originates from the UCI Machine Learning Repository and contains data on daily sports activities. The dataset was collected using multiple sensors, capturing movement data from various body parts, and underwent complex preprocessing steps such as feature extraction and normalization, intended for training and testing machine learning models.
--- configs: - config_name: default data_files: - split: train path: data/train-* dataset_info: features: - name: input dtype: string - name: output dtype: string - name: instruction dtype: string - name: data_source dtype: string splits: - name: train num_bytes: 30776452 num_examples: 24926 download_size: 15565850 dataset_size: 30776452 language: - en size_categories: - 10K<n<100K --- # Open-Platypus This dataset is focused on improving LLM logical reasoning skills and was used to train the Platypus2 models. It is comprised of the following datasets, which were filtered using keyword search and then Sentence Transformers to remove questions with a similarity above 80%: | Dataset Name | License Type | |--------------------------------------------------------------|--------------| | [PRM800K](https://github.com/openai/prm800k) | MIT | | [MATH](https://github.com/hendrycks/math) | MIT | | [ScienceQA](https://github.com/lupantech/ScienceQA) | [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/) | | [SciBench](https://github.com/mandyyyyii/scibench) | MIT | | [ReClor](https://whyu.me/reclor/) | Non-commercial | | [TheoremQA](https://huggingface.co/datasets/wenhu/TheoremQA) | MIT | | [`nuprl/leetcode-solutions-python-testgen-gpt4`](https://huggingface.co/datasets/nuprl/leetcode-solutions-python-testgen-gpt4/viewer/nuprl--leetcode-solutions-python-testgen-gpt4/train?p=1) | None listed | | [`jondurbin/airoboros-gpt4-1.4.1`](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1) | other | | [`TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k`](https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k/viewer/TigerResearch--tigerbot-kaggle-leetcodesolutions-en-2k/train?p=2) | apache-2.0 | | [ARB](https://arb.duckai.org) | CC BY 4.0 | | [`timdettmers/openassistant-guanaco`](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) | apache-2.0 | ## Data Contamination Check We've removed approximately 200 questions that appear in the Hugging Face benchmark test sets. Please see our [paper](https://arxiv.org/abs/2308.07317) and [project webpage](https://platypus-llm.github.io) for additional information. ## Model Info Please see models at [`garage-bAInd`](https://huggingface.co/garage-bAInd). ## Training and filtering code Please see the [Platypus GitHub repo](https://github.com/arielnlee/Platypus). ## Citations ```bibtex @article{platypus2023, title={Platypus: Quick, Cheap, and Powerful Refinement of LLMs}, author={Ariel N. Lee and Cole J. 
Hunter and Nataniel Ruiz}, booktitle={arXiv preprint arxiv:2308.07317}, year={2023} } ``` ```bibtex @article{lightman2023lets, title={Let's Verify Step by Step}, author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl}, journal={preprint arXiv:2305.20050}, year={2023} } ``` ```bibtex @inproceedings{lu2022learn, title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering}, author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Ashwin Kalyan}, booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)}, year={2022} } ``` ```bibtex @misc{wang2023scibench, title={SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models}, author={Xiaoxuan Wang and Ziniu Hu and Pan Lu and Yanqiao Zhu and Jieyu Zhang and Satyen Subramaniam and Arjun R. Loomba and Shichang Zhang and Yizhou Sun and Wei Wang}, year={2023}, arXiv eprint 2307.10635 } ``` ```bibtex @inproceedings{yu2020reclor, author = {Yu, Weihao and Jiang, Zihang and Dong, Yanfei and Feng, Jiashi}, title = {ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning}, booktitle = {International Conference on Learning Representations (ICLR)}, month = {April}, year = {2020} } ``` ```bibtex @article{chen2023theoremqa, title={TheoremQA: A Theorem-driven Question Answering dataset}, author={Chen, Wenhu and Ming Yin, Max Ku, Elaine Wan, Xueguang Ma, Jianyu Xu, Tony Xia, Xinyi Wang, Pan Lu}, journal={preprint arXiv:2305.12524}, year={2023} } ``` ```bibtex @article{hendrycksmath2021, title={Measuring Mathematical Problem Solving With the MATH Dataset}, author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt}, journal={NeurIPS}, year={2021} } ``` ```bibtex @misc{sawada2023arb, title={ARB: Advanced Reasoning Benchmark for Large Language Models}, author={Tomohiro Sawada and Daniel Paleka and Alexander Havrilla and Pranav Tadepalli and Paula Vidas and Alexander Kranias and John J. Nay and Kshitij Gupta and Aran Komatsuzaki}, arXiv eprint 2307.13692, year={2023} } ```
This dataset contains various information on residential rental properties, such as the number of bedrooms, bathrooms, area, location, amenities, etc., for training and evaluating machine‑learning models that predict rent prices.
The dataset includes features such as index, observation images (right‑wrist cam, high cam, left‑wrist cam, low cam), velocity, state, effort, action, episode index, frame index, next‑state done flag, and timestamp. Each feature has a specific data type and length (e.g., video frames, float sequences). The dataset is primarily for training and contains 1,881 examples with a total size of 955,784 bytes.
The V‑D4RL dataset provides pixel‑based analogues of the D4RL benchmark tasks derived from the dm_control suite and extends two state‑of‑the‑art online pixel‑based continuous control algorithms, DrQ‑v2 and DreamerV2, to offline settings. It includes data of varying difficulty across multiple environments such as walker_walk, cheetah_run, and humanoid_walk, along with corresponding benchmarks and algorithm evaluations.
--- annotations_creators: - crowdsourced language_creators: - crowdsourced license: - cc0-1.0 multilinguality: - multilingual size_categories: ab: - 10K<n<100K ar: - 100K<n<1M as: - 1K<n<10K ast: - n<1K az: - n<1K ba: - 100K<n<1M bas: - 1K<n<10K be: - 100K<n<1M bg: - 1K<n<10K bn: - 100K<n<1M br: - 10K<n<100K ca: - 1M<n<10M ckb: - 100K<n<1M cnh: - 1K<n<10K cs: - 10K<n<100K cv: - 10K<n<100K cy: - 100K<n<1M da: - 1K<n<10K de: - 100K<n<1M dv: - 10K<n<100K el: - 10K<n<100K en: - 1M<n<10M eo: - 1M<n<10M es: - 1M<n<10M et: - 10K<n<100K eu: - 100K<n<1M fa: - 100K<n<1M fi: - 10K<n<100K fr: - 100K<n<1M fy-NL: - 10K<n<100K ga-IE: - 1K<n<10K gl: - 10K<n<100K gn: - 1K<n<10K ha: - 1K<n<10K hi: - 10K<n<100K hsb: - 1K<n<10K hu: - 10K<n<100K hy-AM: - 1K<n<10K ia: - 10K<n<100K id: - 10K<n<100K ig: - 1K<n<10K it: - 100K<n<1M ja: - 10K<n<100K ka: - 10K<n<100K kab: - 100K<n<1M kk: - 1K<n<10K kmr: - 10K<n<100K ky: - 10K<n<100K lg: - 100K<n<1M lt: - 10K<n<100K lv: - 1K<n<10K mdf: - n<1K mhr: - 100K<n<1M mk: - n<1K ml: - 1K<n<10K mn: - 10K<n<100K mr: - 10K<n<100K mrj: - 10K<n<100K mt: - 10K<n<100K myv: - 1K<n<10K nan-tw: - 10K<n<100K ne-NP: - n<1K nl: - 10K<n<100K nn-NO: - n<1K or: - 1K<n<10K pa-IN: - 1K<n<10K pl: - 100K<n<1M pt: - 100K<n<1M rm-sursilv: - 1K<n<10K rm-vallader: - 1K<n<10K ro: - 10K<n<100K ru: - 100K<n<1M rw: - 1M<n<10M sah: - 1K<n<10K sat: - n<1K sc: - 1K<n<10K sk: - 10K<n<100K skr: - 1K<n<10K sl: - 10K<n<100K sr: - 1K<n<10K sv-SE: - 10K<n<100K sw: - 100K<n<1M ta: - 100K<n<1M th: - 100K<n<1M ti: - n<1K tig: - n<1K tok: - 1K<n<10K tr: - 10K<n<100K tt: - 10K<n<100K tw: - n<1K ug: - 10K<n<100K uk: - 10K<n<100K ur: - 100K<n<1M uz: - 100K<n<1M vi: - 10K<n<100K vot: - n<1K yue: - 10K<n<100K zh-CN: - 100K<n<1M zh-HK: - 100K<n<1M zh-TW: - 100K<n<1M source_datasets: - extended|common_voice task_categories: - automatic-speech-recognition task_ids: [] paperswithcode_id: common-voice pretty_name: Common Voice Corpus 11.0 language_bcp47: - ab - ar - as - ast - az - ba - bas - be - bg - bn - br - ca - ckb - cnh - cs - cv - cy - da - de - dv - el - en - eo - es - et - eu - fa - fi - fr - fy-NL - ga-IE - gl - gn - ha - hi - hsb - hu - hy-AM - ia - id - ig - it - ja - ka - kab - kk - kmr - ky - lg - lt - lv - mdf - mhr - mk - ml - mn - mr - mrj - mt - myv - nan-tw - ne-NP - nl - nn-NO - or - pa-IN - pl - pt - rm-sursilv - rm-vallader - ro - ru - rw - sah - sat - sc - sk - skr - sl - sr - sv-SE - sw - ta - th - ti - tig - tok - tr - tt - tw - ug - uk - ur - uz - vi - vot - yue - zh-CN - zh-HK - zh-TW extra_gated_prompt: By clicking on “Access repository” below, you also agree to not attempt to determine the identity of speakers in the Common Voice dataset. 
--- # Dataset Card for Common Voice Corpus 11.0 ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [How to use](#how-to-use) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://commonvoice.mozilla.org/en/datasets - **Repository:** https://github.com/common-voice/common-voice - **Paper:** https://arxiv.org/abs/1912.06670 - **Leaderboard:** https://paperswithcode.com/dataset/common-voice - **Point of Contact:** [Anton Lozhkov](mailto:anton@huggingface.co) ### Dataset Summary The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 16413 validated hours in 100 languages, but more voices and languages are always added. Take a look at the [Languages](https://commonvoice.mozilla.org/en/languages) page to request a language or start contributing. ### Supported Tasks and Leaderboards The results for models trained on the Common Voice datasets are available via the [🤗 Autoevaluate Leaderboard](https://huggingface.co/spaces/autoevaluate/leaderboards?dataset=mozilla-foundation%2Fcommon_voice_11_0&only_verified=0&task=automatic-speech-recognition&config=ar&split=test&metric=wer) ### Languages ``` Abkhaz, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Kurmanji Kurdish, Kyrgyz, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Odia, Persian, Polish, Portuguese, Punjabi, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Sorbian, Upper, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh ``` ## How to use The `datasets` library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared in one call to your local drive by using the `load_dataset` function. 
For example, to download the Hindi config, simply specify the corresponding language config name (i.e., "hi" for Hindi): ```python from datasets import load_dataset cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train") ``` Using the datasets library, you can also stream the dataset on-the-fly by adding a `streaming=True` argument to the `load_dataset` function call. Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire dataset to disk. ```python from datasets import load_dataset cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train", streaming=True) print(next(iter(cv_11))) ``` *Bonus*: create a [PyTorch dataloader](https://huggingface.co/docs/datasets/use_with_pytorch) directly with your own datasets (local/streamed). ### Local ```python from datasets import load_dataset from torch.utils.data.sampler import BatchSampler, RandomSampler cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train") batch_sampler = BatchSampler(RandomSampler(cv_11), batch_size=32, drop_last=False) dataloader = DataLoader(cv_11, batch_sampler=batch_sampler) ``` ### Streaming ```python from datasets import load_dataset from torch.utils.data import DataLoader cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train") dataloader = DataLoader(cv_11, batch_size=32) ``` To find out more about loading and preparing audio datasets, head over to [hf.co/blog/audio-datasets](https://huggingface.co/blog/audio-datasets). ### Example scripts Train your own CTC or Seq2Seq Automatic Speech Recognition models on Common Voice 11 with `transformers` - [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition). ## Dataset Structure ### Data Instances A typical data point comprises the `path` to the audio file and its `sentence`. Additional fields include `accent`, `age`, `client_id`, `up_votes`, `down_votes`, `gender`, `locale` and `segment`. ```python { 'client_id': 'd59478fbc1ee646a28a3c652a119379939123784d99131b865a89f8b21c81f69276c48bd574b81267d9d1a77b83b43e6d475a6cfc79c232ddbca946ae9c7afc5', 'path': 'et/clips/common_voice_et_18318995.mp3', 'audio': { 'path': 'et/clips/common_voice_et_18318995.mp3', 'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32), 'sampling_rate': 48000 }, 'sentence': 'Tasub kokku saada inimestega, keda tunned juba ammust ajast saati.', 'up_votes': 2, 'down_votes': 0, 'age': 'twenties', 'gender': 'male', 'accent': '', 'locale': 'et', 'segment': '' } ``` ### Data Fields `client_id` (`string`): An id for which client (voice) made the recording `path` (`string`): The path to the audio file `audio` (`dict`): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`. 
`sentence` (`string`): The sentence the user was prompted to speak `up_votes` (`int64`): How many upvotes the audio file has received from reviewers `down_votes` (`int64`): How many downvotes the audio file has received from reviewers `age` (`string`): The age of the speaker (e.g. `teens`, `twenties`, `fifties`) `gender` (`string`): The gender of the speaker `accent` (`string`): Accent of the speaker `locale` (`string`): The locale of the speaker `segment` (`string`): Usually an empty field ### Data Splits The speech material has been subdivided into portions for dev, train, test, validated, invalidated, reported and other. The validated data is data that has been validated with reviewers and received upvotes that the data is of high quality. The invalidated data is data has been invalidated by reviewers and received downvotes indicating that the data is of low quality. The reported data is data that has been reported, for different reasons. The other data is data that has not yet been reviewed. The dev, test, train are all data that has been reviewed, deemed of high quality and split into dev, test and train. ## Data Preprocessing Recommended by Hugging Face The following are data preprocessing steps advised by the Hugging Face team. They are accompanied by an example code snippet that shows how to put them to practice. Many examples in this dataset have trailing quotations marks, e.g _“the cat sat on the mat.“_. These trailing quotation marks do not change the actual meaning of the sentence, and it is near impossible to infer whether a sentence is a quotation or not a quotation from audio data alone. In these cases, it is advised to strip the quotation marks, leaving: _the cat sat on the mat_. In addition, the majority of training sentences end in punctuation ( . or ? or ! ), whereas just a small proportion do not. In the dev set, **almost all** sentences end in punctuation. Thus, it is recommended to append a full-stop ( . ) to the end of the small number of training examples that do not end in punctuation. ```python from datasets import load_dataset ds = load_dataset("mozilla-foundation/common_voice_11_0", "en", use_auth_token=True) def prepare_dataset(batch): """Function to preprocess the dataset with the .map method""" transcription = batch["sentence"] if transcription.startswith('"') and transcription.endswith('"'): # we can remove trailing quotation marks as they do not affect the transcription transcription = transcription[1:-1] if transcription[-1] not in [".", "?", "!"]: # append a full-stop to sentences that do not end in punctuation transcription = transcription + "." batch["sentence"] = transcription return batch ds = ds.map(prepare_dataset, desc="preprocess dataset") ``` ## Dataset Creation ### Curation Rationale [Needs More Information] ### Source Data #### Initial Data Collection and Normalization [Needs More Information] #### Who are the source language producers? [Needs More Information] ### Annotations #### Annotation process [Needs More Information] #### Who are the annotators? [Needs More Information] ### Personal and Sensitive Information The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset. ## Considerations for Using the Data ### Social Impact of Dataset The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset. 
### Discussion of Biases [More Information Needed] ### Other Known Limitations [More Information Needed] ## Additional Information ### Dataset Curators [More Information Needed] ### Licensing Information Public Domain, [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) ### Citation Information ``` @inproceedings{commonvoice:2020, author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.}, title = {Common Voice: A Massively-Multilingual Speech Corpus}, booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)}, pages = {4211--4215}, year = 2020 } ```
The dataset consists of multiple features such as 'source', 'chosen', 'chosen_rating', 'chosen_model', 'rejected', 'rejected_rating', and 'rejected_model'. 'chosen' and 'rejected' are list-type fields containing sub-features 'content' and 'role'. The dataset is split into 'train' (15,204 samples) and 'test' (200 samples). Total download size is 79,362,069 bytes; total dataset size is 152,534,966.0 bytes.
Retrieval-Augmented Generation (RAG) Dataset 12000 is an English dataset specifically designed to optimize RAG models. Built by Neural Bridge AI, it contains 12,000 entries, each comprising three fields: context, question, and answer. Context data originates from Falcon RefinedWeb, while questions and answers are generated by GPT-4. The dataset is split into a training set (9,600 samples) and a test set (2,400 samples) and is released under the Apache 2.0 license.
The dataset consists of chess‑board states and associated move sequences extracted from games downloaded from lichess.org. Each game is parsed into multiple records; each record starts with a FEN string followed by 1‑10 SAN moves. The data are intended for training the ChessRoberta model and have not been filtered, so they may not be optimal for high‑performance chess modelling.
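A minimal parsing sketch follows, assuming each record is a FEN string followed by SAN moves; the python-chess library is used to validate that the moves are legal from that position, and the example record is illustrative.

```python
import chess

record = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 e4 e5 Nf3"
fields = record.split()
fen, san_moves = " ".join(fields[:6]), fields[6:]   # a FEN string has six fields

board = chess.Board(fen)
for san in san_moves:
    board.push_san(san)   # raises a ValueError subclass if the move is illegal

print(board.fen())        # position after the recorded moves
```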
Imagewoof is a subset of ImageNet containing ten dog‑breed categories, designed to provide a challenging image‑classification benchmark. Created by Jeremy Howard, it is released under the Apache‑2.0 license. The dataset includes images and corresponding labels, with a defined train/validation split.
---
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: fingerprint
    dtype: 'null'
  - name: seed
    dtype: string
  - name: sha1
    dtype: string
  - name: id
    dtype: int64
  - name: concepts
    sequence: string
  splits:
  - name: train
    num_bytes: 1546511078
    num_examples: 239121
  download_size: 335965438
  dataset_size: 1546511078
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---