Explore high-quality datasets for your AI and machine learning projects.
The dataset contains over 65,000 photos of more than 5,000 individuals from 40 countries, providing a valuable resource for exploring and developing authentication solutions. It is especially suitable for biometric verification, notably facial recognition in financial services. Each individual is represented by 13 selfie images and 2 ID photos captured with various devices and resolutions. The dataset is intended to support the development of more robust re-identification algorithms and to enhance security measures across applications.
The dataset contains vibration signals from bearings in nine different health states under four different operating conditions. It is publicly available for validating rolling‑bearing diagnostic algorithms.
The dataset was crawled from https://m.bqgui.cc, taking up to 25 chapters per book and yielding 12,740 entries. After three rounds of cleaning, each entry contains the book title, summary, and novel text. Titles are of high quality, summaries are of low usability, and the novel texts have had some ads and symbols removed but still contain low-quality content.
The Chameleon dataset is designed for side‑channel analysis and contains real power‑trace recordings collected from a 32‑bit RISC‑V system‑on‑chip that implements four masking countermeasures (dynamic frequency scaling, random delay, morphing, and chaffing). The traces capture interleaved execution of AES encryption operations with general‑purpose applications. The dataset is divided into four sub‑datasets, each corresponding to one countermeasure, and each sub‑dataset is further split into 16 files based on the value of the first byte of the encryption key. It supports research on segmented methods and side‑channel analysis techniques, especially for devices employing masking countermeasures.
This dataset is for supervised fine‑tuning (SFT) and direct preference optimization (DPO), available in English and Chinese versions. It is based on the four MBTI dimensions, each with two opposing attributes: energy (Extraversion E – Introversion I), information (Sensing S – Intuition N), decision (Thinking T – Feeling F), and execution (Judging J – Perceiving P). The dataset follows the Alpaca format, containing instruction, input, and output. Users can select the appropriate file for SFT or DPO based on the MBTI type.
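For reference, a single Alpaca-format record looks like the sketch below; the field values are illustrative placeholders, not taken from the dataset.

```python
# Illustrative Alpaca-format record for MBTI-conditioned SFT.
# Field names follow the Alpaca convention named above; the values are made up.
mbti_sft_example = {
    "instruction": "Answer the user's question in the voice of an INTJ personality.",
    "input": "How would you plan a weekend trip?",
    "output": "I would start from the trip's objective, then derive a precise itinerary...",
}
```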
This dataset provides real noise images for denoising research, containing 40 different scenes captured by 5 leading‑brand cameras, totaling 100 regions of size 512 × 512, including noisy images and corresponding ground‑truth images.
smoltalk‑chinese is a Chinese fine‑tuning dataset referenced from the SmolTalk dataset, designed to provide high‑quality synthetic data for training large language models (LLMs). The dataset consists entirely of synthetic data, covering more than 700,000 entries, and is composed of multiple parts including tasks referenced from magpie‑ultra, other SmolTalk tasks, simulated daily‑life dialogues, and mathematics problems from the Chinese version of Math23K. The generation process follows strict standards to ensure data quality and diversity. Experiments show that models fine‑tuned on smoltalk‑chinese achieve significant advantages on multiple metrics.
The UW‑Bench dataset, created by the School of Microelectronics and Communication Engineering at Chongqing University, focuses on urban flood detection, containing 7,677 annotated images sourced from surveillance cameras and handheld devices. The dataset was collected under various adverse conditions such as low light, strong reflections, and clear water, aiming to improve model generalization in real‑world applications. Manual annotation ensures data quality, suitable for enhancing accuracy and efficiency of urban flood detection.
This public dataset contains two BigQuery tables; the table used is `citibike_trips`, which holds over 58 million records. The `tripduration` field gives the duration of each bike rental in seconds; the other fields serve as potential features.
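A minimal query sketch with the google-cloud-bigquery client is shown below; the fully qualified table path `bigquery-public-data.new_york_citibike.citibike_trips` is the table's standard public location and is assumed here.

```python
# Sketch: average rental duration over the public citibike_trips table.
# Assumes GCP credentials are configured and the standard public table path.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT AVG(tripduration) AS avg_seconds
    FROM `bigquery-public-data.new_york_citibike.citibike_trips`
    WHERE tripduration IS NOT NULL
"""
for row in client.query(sql).result():
    print(f"Average trip duration: {row.avg_seconds:.0f} s")
```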
PHDF-Dataset is a pedestrian head detection dataset based on a fisheye camera, containing 2,201 images annotated with 13,887 pedestrian heads. It was released by the Shien-Ming Wu School of Intelligent Engineering at South China University of Technology for non-commercial research purposes.
The WHU-Hi dataset (Wuhan UAV-borne Hyperspectral Images) was collected and shared by the RSIDEA research group at Wuhan University, serving as a benchmark for precise crop classification and hyperspectral image classification research. It comprises three independent UAV-borne hyperspectral datasets: WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu, all captured over agricultural areas in Hubei Province, China. The data were acquired using a Headwall Nano-Hyperspec sensor mounted on UAV platforms, providing high-spatial-resolution hyperspectral (H2) images. Pre-processing includes radiometric calibration and geometric correction performed with the HyperSpec software supplied by the instrument vendor. Each dataset includes detailed acquisition metadata (time, weather, sensor, flight altitude, image size, number of bands, spatial resolution) and sample counts for various crop classes.
The Real-Vul dataset was developed by the School of Computer Science at the University of Waterloo to provide a comprehensive dataset for evaluating deep-learning models in real-world software vulnerability detection. It contains 5,528 C/C++ function samples drawn from diverse software projects such as the Chromium browser and the Linux operating system. The dataset was created using a time-based split strategy so that training and testing data reflect a realistic chronological separation. Real-Vul is primarily intended for assessing and improving the practical performance of existing vulnerability detection models, especially in complex and varied real-world software environments.
The LEVIR‑CC dataset is a large dataset for remote‑sensing image change caption generation, specifically supporting research on change captioning using dual‑branch Transformers.
We provide a database of hyperspectral images (HSIs) supporting our research. The images cover a variety of real‑world materials and objects. The database is publicly released to the research community. Detailed information can be found in our manuscript.
MTA (Multi‑Camera Track Auto) is a large multi‑target multi‑camera tracking dataset, containing over 2,800 person identities captured by 6 cameras, each video exceeding 100 minutes. The dataset spans both daytime and nighttime periods.
The SECOM dataset presents a rare-event scenario with highly imbalanced output classes. It consists of 1,567 observations and 590 variables; each record represents a single production entity with associated measurement features. The `secom_labels.data` file provides pass/fail labels (-1 = pass, 1 = fail) and timestamps for each test point.
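A hedged pandas sketch for loading the two files is shown below; the whitespace-separated layout of the UCI distribution is assumed.

```python
# Sketch: load SECOM features and labels, assuming the UCI whitespace-separated layout.
import pandas as pd

X = pd.read_csv("secom.data", sep=r"\s+", header=None)              # expect 1567 x 590
labels = pd.read_csv("secom_labels.data", sep=r"\s+", header=None)  # column 0: -1 = pass, 1 = fail
print(X.shape)
print(labels[0].value_counts())  # highly imbalanced: failures are rare
```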
This dataset unifies public datasets for insulator detection and fault classification in power lines, providing the merged data and the code used for merging.
CMM‑Math is a Chinese multimodal mathematics dataset containing over 28,000 high‑quality samples covering 12 grades from primary school to high school. It includes diverse question types such as multiple‑choice and fill‑in‑the‑blank, with detailed solutions. Some questions involve visual context, making the dataset more challenging. The dataset is split into a training set (22,000+ samples) and an evaluation set (5,000+ samples).
The dataset contains user action types, timestamps, and final goals, split into training and testing sets. Each set includes three files: action type, action time, and goal.
An emoji database containing each emoji's group, subgroup, code points, hash, status, rendered emoji, short name, description, and aliases.
LLaVA‑Interleave Bench is a comprehensive multi‑image dataset collected from public datasets or generated via the GPT‑4V API. The dataset aims to evaluate the interleaved multi‑image reasoning capability of large multimodal models. It was collected in April 2024 and released in June 2024. Its primary use is for research on large multimodal models and chatbots, targeting researchers and enthusiasts in computer vision, natural language processing, machine learning, and AI.
The CBSD68 dataset is a color version of the BSD68 benchmark for image denoising, containing original .jpg files converted to lossless .png format and augmented with different levels of Gaussian white noise.
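Since the corruption model is additive Gaussian white noise, noisy copies like those shipped with the dataset can be reproduced along the lines of the sketch below; the sigma value is a common benchmark level, not confirmed by the source.

```python
# Sketch: corrupt a clean CBSD68-style image with additive Gaussian white noise.
# sigma is on the 0-255 intensity scale; 25 is a common benchmark choice (assumed).
import numpy as np
from PIL import Image

def add_gaussian_noise(path: str, sigma: float = 25.0) -> Image.Image:
    clean = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    noisy = clean + np.random.normal(0.0, sigma, size=clean.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

add_gaussian_noise("0001.png", sigma=25.0).save("0001_sigma25.png")  # placeholder filename
```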
The dataset includes multiple features such as question ID, dialogue content from two models, winner label, judge, turn count, anonymity flag, language, timestamp, and OpenAI moderation results. Each dialogue records content and role, turn number, and anonymity status. Toxicity detection results from two large models are also captured, including binary flags and probabilities. The dataset is provided as a training split with 33,000 samples.
The dataset comprises four sub-datasets:
- **songs**: information about Springsteen songs or related tracks, including albums and lyrics;
- **concerts**: records of all Springsteen concerts since 1973, detailing venues, locations, and dates;
- **setlists**: song lists performed at each concert;
- **tours**: information about the tours associated with each concert.
The dataset, named geoQuery‑tableQA, is intended for table‑question answering tasks. It includes fields such as `query`, `question`, `answer`, `table_names`, `tables`, `source`, and `target`. The dataset is split into training (530 samples), validation (49 samples), and test (253 samples) sets, with a total size of 14,896,902 bytes and a download size of 1,988,975 bytes.
The dataset includes images and wind-speed data. The image data type is `image`; the wind-speed data type is `float32`. It also provides a classification label with categories TD, TS, STS, TY, STY, and SuperTY, corresponding to class IDs 0–5. The dataset consists of a single training split with 472 samples, a total size of 109,541,339 bytes, and a download size of 109,547,565 bytes.
We scraped posts that appeared on the Weibo hot list from 2022‑11‑25 to 2023‑03‑08 (only posts from the day they trended) together with their associated comments.
This dataset is used to analyze stroke, employing machine learning models and resampling techniques such as SMOTEENN to improve prediction accuracy and address dataset imbalance.
The dataset contains 100 images of various fruits and vegetables captured under controlled lighting conditions using a Living Optics camera. Data types include RGB images, sparse spectral samples, and instance segmentation masks. The dataset includes over 430,000 spectral samples, of which more than 85,000 belong to one of 19 categories. Additionally, 13 labeled images are provided as a validation set along with some unlabeled demonstration videos. The dataset is primarily used for image segmentation and classification tasks.
This dataset is used for intent recognition tasks, containing user queries and their corresponding correct intents. The dataset is divided into training and validation sets, used respectively for model training and performance evaluation.
M4‑Instruct is a multi‑image dataset collected in April 2024 from public datasets and the GPT‑4V API, intended for training large multimodal models. It is used for research on large multimodal models and chatbots, targeting audiences in computer vision, natural language processing, machine learning, and AI.
SHOT7M2 is a synthetic, hierarchical, compositional basketball behavior dataset with 7.2 million frames that demonstrate hierarchical organization of basketball actions. Based on the animation model of Starke et al., it uses neural state machines to predict future character poses. The dataset contains 4,000 clips, each with 1,800 frames, performed by a single agent executing various basketball motions. Each clip is labeled with one of four activity types: casual play, intense play, dribbling training, or no play. It provides 26‑keypoint skeletal poses and combinations of 4 activity types, 12 actions, and 14 basic motions.
FIT‑RS is a large‑scale fine‑grained instruction‑tuning dataset containing 1,800,851 high‑quality instruction samples, designed to enhance the fine‑grained understanding capabilities of Remote Sensing Large Multimodal Models (RSLMMs).
The VRAI dataset supports vehicle re‑identification research in aerial imagery. With the rapid growth of unmanned aerial vehicles (UAVs), UAV‑based visual applications have attracted increasing attention from both industry and academia. However, public UAV vehicle re‑ID datasets are scarce, limiting research despite potential applications such as long‑term tracking and visual object retrieval. VRAI addresses this gap by providing a large‑scale, publicly available collection of UAV‑captured vehicle images with comprehensive annotations.
The dataset was reprocessed by the authors based on portions of the Le2i and SisFall datasets. Video clips of activities of daily living (ADL) and fall actions were converted into images using the OpenCV library, after which a human segmentation neural network was applied to obtain images containing only the human foreground, followed by binarization and morphological erosion and dilation. The dataset is divided into two parts, each comprising five categories: Blank (no person), Fall, Like-fall (unstable balance), Lie, and Stand.
Lego part image dataset generated from 3‑D models, comprising 10 part types with 6,000 unique rendered images per part (total 60,000 images). Additionally, 60 photographic images per class are provided (total 600 photos).
This dataset supports the development and evaluation of Weapon Engagement Zone (WEZ) prediction models, generated through high‑fidelity air‑combat simulations. The dataset captures various scenarios and conditions representing engagements between shooter aircraft and targets.
DocBank is a new large‑scale dataset constructed using weak supervision, designed to provide integrated text and layout information for downstream tasks. Currently, the DocBank dataset contains 500,000 pages of documents, with 400,000 for training, 50,000 for validation, and 50,000 for testing. Annotations are machine‑generated, the language is English, and the dataset is monolingual. Fields include image, token, bounding box, color, font, and label.
The dataset contains photos of miners taken in various mines. Each image is annotated with bounding‑box detection of miners and an attribute indicating whether the miner is sitting or standing. It is suitable for computer‑vision and safety‑assessment applications and is valuable for researchers, employers, and policymakers in the mining sector. The dataset structure includes raw images, bounding‑box annotations, and XML files containing box coordinates and labels.
The FungiTastic dataset was created jointly by the University of West Bohemia, INRIA and the Czech Technical University in Prague. It comprises approximately 350,000 records and over 650,000 fungal photographs with detailed metadata. The dataset stems from twenty years of continuous data collection and supports various machine learning tasks such as closed‑set and open‑set classification. Rich metadata includes timestamps, camera settings, geographic coordinates, satellite imagery, and biological taxonomy. It is widely used for image classification problems in biology, particularly for fungal identification.
The dataset contains small‑size pathology images with corresponding labels indicating the presence of tumor tissue. Images are 96 × 96 pixels and were provided as part of a Kaggle competition.
This dataset contains speech contributions from the Common Voice community on the web platform; all contributions are included regardless of validation status. The dataset is released roughly every six months and includes audio files and associated metadata such as age, gender, accent, etc.
---
license: mit
---

<p align="center">
  <h1>G-buffer Objaverse</h1>
</p>

G-buffer Objaverse: High-Quality Rendering Dataset of Objaverse.

[Chao Xu](mailto:eric.xc@alibaba-inc.com), [Yuan Dong](mailto:yuandong15@fudan.edu.cn), [Qi Zuo](mailto:muyuan.zq@alibaba-inc.com), [Junfei Zhang](mailto:miracle.zjf@alibaba-inc.com), [Xiaodan Ye](mailto:doris.yxd@alibaba-inc.com), [Wenbo Geng](mailto:rengui.gwb@alibaba-inc.com), [Yuxiang Zhang](mailto:yuxiangzhang.zyx@alibaba-inc.com), [Xiaodong Gu](https://scholar.google.com.hk/citations?user=aJPO514AAAAJ&hl=zh-CN&oi=ao), [Lingteng Qiu](https://lingtengqiu.github.io/), [Zhengyi Zhao](mailto:bushe.zzy@alibaba-inc.com), [Qing Ran](mailto:ranqing.rq@alibaba-inc.com), [Jiayi Jiang](mailto:jiayi.jjy@alibaba-inc.com), [Zilong Dong](https://scholar.google.com/citations?user=GHOQKCwAAAAJ&hl=zh-CN&oi=ao), [Liefeng Bo](https://scholar.google.com/citations?user=FJwtMf0AAAAJ&hl=zh-CN)

## [Project page](https://aigc3d.github.io/gobjaverse/)
## [Github](https://github.com/modelscope/richdreamer/tree/main/dataset/gobjaverse)
## [YouTube](https://www.youtube.com/watch?v=PWweS-EPbJo)
## [RichDreamer](https://aigc3d.github.io/richdreamer/)
## [ND-Diffusion Model](https://github.com/modelscope/normal-depth-diffusion)

## TODO

- [ ] Release objaverse-xl alignment rendering data

## News

- We have released a compressed version of the datasets; check the downloading tips! (01.14, 2024 UTC)
- Thanks to [JunzheJosephZhu](https://github.com/JunzheJosephZhu) for improving the robustness of the downloading scripts. You can now restart the download script from the break point. (01.12, 2024 UTC)
- Release 10-category annotation of the Objaverse subset (01.06, 2024 UTC)
- Release G-buffer Objaverse rendering dataset (01.06, 2024 UTC)

## Download

- Download the gobjaverse ***(6.5T)*** rendering dataset using the following scripts.

```bash
# download the gobjaverse_280k index file
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/gobjaverse_280k.json

# Example: python ./scripts/data/download_gobjaverse_280k.py ./gobjaverse_280k ./gobjaverse_280k.json 10
python ./download_gobjaverse_280k.py /path/to/savedata /path/to/gobjaverse_280k.json nthreads(eg. 10)

# Or, if the network is not so good, we have provided a compressed version with each object as a tar file.
# To download the compressed version (only 260k tar files):
python ./download_objaverse_280k_tar.py /path/to/savedata /path/to/gobjaverse_280k.json nthreads(eg. 10)

# download the gobjaverse_280k/gobjaverse index to objaverse
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/gobjaverse_280k_index_to_objaverse.json
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/gobjaverse_index_to_objaverse.json

# download the Cap3D text-caption file
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/text_captions_cap3d.json
```

- The 10 general categories are Human-Shape (41,557), Animals (28,882), Daily-Used (220,222), Furnitures (19,284), Buildings&&Outdoor (116,545), Transportations (20,075), Plants (7,195), Food (5,314), Electronics (13,252) and Poor-quality (107,001).
- Download the category annotation using the following scripts.

```bash
# download the category annotation
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/category_annotation.json

# If you want to download a specific category in gobjaverse280k:
# Step 1: download the index file of the specified category.
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/gobjaverse_280k_split/gobjaverse_280k_{category_name}.json # category_name: Human-Shape, ...
# Step 2: download using the script.
# Example: python ./scripts/data/download_gobjaverse_280k.py ./gobjaverse_280k_Human-Shape ./gobjaverse_280k_Human-Shape.json 10
python ./download_gobjaverse_280k.py /path/to/savedata /path/to/gobjaverse_280k_{category_name}.json nthreads(eg. 10)
```

## Folder Structure

- The structure of the gobjaverse rendering dataset:

```
|-- ROOT
    |-- dictionary_id
        |-- instance_id
            |-- campos_512_v4
                |-- 00000
                    |-- 00000.json        # Camera Information
                    |-- 00000.png         # RGB
                    |-- 00000_albedo.png  # Albedo
                    |-- 00000_hdr.exr     # HDR
                    |-- 00000_mr.png      # Metalness and Roughness
                    |-- 00000_nd.exr      # Normal and Depth
                |-- ...
```

### Coordinate System

#### Normal Coordinate System

The 3D coordinate system definition is very complex; it is difficult for us to say which camera system is used. Fortunately, what we want is to map the world normals of the rendering system to the Normal-Bae system, as the following figure illustrates:

*(figure: mapping from world-space normals to the Normal-Bae system)*

where the U-axis and V-axis denote the width-axis and height-axis in image space, respectively, and xyz is the Normal-Bae camera-view coordinate system. Note that the public rendering system for Objaverse is a Blender-based system:

*(figure: Blender-based coordinate system)*

However, our rendering system is defined in a **Unity-based system**:

*(figure: Unity-based coordinate system)*

*A question is: how do we plug in Blender's coordinate system directly without introducing a new coordinate system?* A possible solution is to keep the world-to-camera transfer matrix as in the Blender setting, *transferring the Unity-based system to the Blender-based system*.

We provide example code to visualize the coordinate mapping.

```bash
# example of coordinate experiments
## download datasets
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/Lingtengqiu/render_data_examples.zip
unzip render_data_examples.zip

## visualize the Blender-based system, and warp world-space normals to the Normal-Bae system.
python ./process_blender_dataset.py

## visualize our system, and warp world-space normals to the Normal-Bae system.
python ./process_unity_dataset.py
```

#### Depth Warping

We provide an example demonstrating how to obtain the intrinsic matrix K and warp a reference image to the target image based on the reference depth map.

```bash
# build quick-zbuff code
mkdir -p ./lib/build
g++ -shared -fpic -o ./lib/build/zbuff.so ./lib/zbuff.cpp

# a demo for depth-based warping
# python ./depth_warp_example.py $REFVIEW $TARGETVIEW
python3 ./depth_warp_example.py 0 3
```

## Citation

```
@article{qiu2023richdreamer,
    title={RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D},
    author={Lingteng Qiu and Guanying Chen and Xiaodong Gu and Qi Zuo and Mutian Xu and Yushuang Wu and Weihao Yuan and Zilong Dong and Liefeng Bo and Xiaoguang Han},
    year={2023},
    journal = {arXiv preprint arXiv:2311.16918}
}
```

```
@article{objaverse,
    title={Objaverse: A Universe of Annotated 3D Objects},
    author={Matt Deitke and Dustin Schwenk and Jordi Salvador and Luca Weihs and Oscar Michel and Eli VanderBilt and Ludwig Schmidt and Kiana Ehsani and Aniruddha Kembhavi and Ali Farhadi},
    journal={arXiv preprint arXiv:2212.08051},
    year={2022}
}
```
The BETA dataset from Tsinghua University is a large‑scale benchmark database designed to support research on SSVEP‑BCI applications.
The RAiD dataset is an indoor–outdoor cross-scenario re-identification dataset collected at Winston Chung Hall, UC Riverside. It contains images from four cameras (two indoor and two outdoor), comprising 6,920 images of 43 individuals. Detailed annotations include images, foreground masks, camera IDs, and person IDs.
The dataset contains a feature named `signal` of type float32 and a feature named `label_id` of type int32. It is split into training, validation, and test sets with 9,900, 539, and 1,100 samples respectively. Total download size is 35,524,532 bytes; dataset size is 12,000,560 bytes.
This dataset contains 5,000 randomly selected movie records from IMDB, with 28 attributes for each record.
---
license: other
license_name: conceptual-12m
license_link: LICENSE
task_categories:
- image-to-text
size_categories:
- 10M<n<100M
---

# Dataset Card for Conceptual Captions 12M (CC12M)

## Dataset Description

- **Repository:** [Conceptual 12M repository](https://github.com/google-research-datasets/conceptual-12m)
- **Paper:** [Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts](https://arxiv.org/abs/2102.08981)
- **Point of Contact:** [Conceptual Captions e-mail](mailto:conceptual-captions@google.com)

### Dataset Summary

Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M).

### Usage

This instance of Conceptual Captions is in [webdataset](https://github.com/webdataset/webdataset/commits/main) .tar format. It can be used with the webdataset library or upcoming releases of Hugging Face `datasets`.

...More detail TBD

### Data Splits

This dataset was downloaded using img2dataset. Images were resized on download if the shortest edge was > 512, to shortest edge = 512.

#### Train

- `cc12m-train-*.tar`
- Downloaded on 2021/18/22
- 2176 shards, 10968539 samples

## Additional Information

### Dataset Curators

Soravit Changpinyo, Piyush Sharma, Nan Ding and Radu Soricut.

### Licensing Information

The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

### Citation Information

```bibtex
@inproceedings{changpinyo2021cc12m,
  title = {{Conceptual 12M}: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
  author = {Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle = {CVPR},
  year = {2021},
}
```
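Given the webdataset format noted in the Usage section, a minimal streaming sketch follows; the shard naming matches the `cc12m-train-*.tar` pattern above, and the `.jpg`/`.txt` keys inside each sample are assumptions based on typical img2dataset output.

```python
# Sketch: stream CC12M image-text pairs with the webdataset library.
# Shard pattern and per-sample keys (.jpg, .txt) are assumptions.
import webdataset as wds

dataset = (
    wds.WebDataset("cc12m-train-{0000..2175}.tar")  # local or remote shard URLs
    .decode("pil")               # decode images to PIL.Image
    .to_tuple("jpg", "txt")      # (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, caption[:60])
    break
```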
iHarmony4 is the first large‑scale image harmonization public benchmark dataset, containing four sub‑datasets: HCOCO, HAdobe5k, HFlickr, and Hday2night. Each sub‑dataset includes synthetic composite images, foreground masks of the composite images, and the corresponding real images.
This dataset contains data from [openai/prm800k](https://github.com/openai/prm800k). It is divided into two phases (phase1 and phase2), each with train and test splits. Features include labeler, timestamp, question, etc.; detailed feature types are described in the README.
The dataset comprises five features: title, author, text, identifier, and category. It provides a single training split containing 47,829 samples, with a total size of 12,701,752 bytes and a download size of 9,836,711 bytes.
WebVi3D is a multi‑view image dataset containing 320 M frames extracted from 16 M video clips, used for training See3D models. The dataset expands training data by automatically filtering video clips with inconsistent multi‑view information or insufficient observations, yielding a high‑quality, diverse multi‑view image collection.
This is a multimodal remote-sensing dataset covering Hunan Province in 2017, comprising Sentinel-2, Sentinel-1, and SRTM data. It contains 400 training images (256 × 256) and 50 validation and test images. The terrain ruggedness index (TRI) in the training set is computed from SRTM via GDAL and can be used for knowledge reconstruction.
The NYC Yellow Taxi trip records capture pickup and drop‑off dates and times, pickup and drop‑off locations, trip distance, itemized fare details, rate type, payment type, and the number of passengers reported by the driver.
---
annotations_creators:
- expert-generated
language_creators:
- other
language:
- en
license:
- cc-by-sa-3.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|natural_questions
task_categories:
- question-answering
task_ids:
- open-domain-qa
pretty_name: NQ-Open
dataset_info:
  config_name: nq_open
  features:
  - name: question
    dtype: string
  - name: answer
    sequence: string
  splits:
  - name: train
    num_bytes: 6651236
    num_examples: 87925
  - name: validation
    num_bytes: 313829
    num_examples: 3610
  download_size: 4678245
  dataset_size: 6965065
configs:
- config_name: nq_open
  data_files:
  - split: train
    path: nq_open/train-*
  - split: validation
    path: nq_open/validation-*
  default: true
---

# Dataset Card for nq_open

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://efficientqa.github.io/
- **Repository:** https://github.com/google-research-datasets/natural-questions/tree/master/nq_open
- **Paper:** https://www.aclweb.org/anthology/P19-1612.pdf
- **Leaderboard:** https://ai.google.com/research/NaturalQuestions/efficientqa
- **Point of Contact:** [Mailing List](efficientqa@googlegroups.com)

### Dataset Summary

The NQ-Open task, introduced by Lee et al. 2019, is an open-domain question answering benchmark that is derived from Natural Questions. The goal is to predict an English answer string for an input English question. All questions can be answered using the contents of English Wikipedia.

### Supported Tasks and Leaderboards

Open-Domain Question Answering. EfficientQA Leaderboard: https://ai.google.com/research/NaturalQuestions/efficientqa

### Languages

English (`en`)

## Dataset Structure

### Data Instances

```
{
    "question": "names of the metropolitan municipalities in south africa",
    "answer": [
        "Mangaung Metropolitan Municipality",
        "Nelson Mandela Bay Metropolitan Municipality",
        "eThekwini Metropolitan Municipality",
        "City of Tshwane Metropolitan Municipality",
        "City of Johannesburg Metropolitan Municipality",
        "Buffalo City Metropolitan Municipality",
        "City of Ekurhuleni Metropolitan Municipality"
    ]
}
```

### Data Fields

- `question` - Input open-domain question.
- `answer` - List of possible answers to the question.

### Data Splits

- Train: 87925
- Validation: 3610

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

Natural Questions contains questions from aggregated queries to Google Search (Kwiatkowski et al., 2019). To gather an open version of this dataset, we only keep questions with short answers and discard the given evidence document. Answers with many tokens often resemble extractive snippets rather than canonical answers, so we discard answers with more than 5 tokens.

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

Evaluating on this diverse set of question-answer pairs is crucial, because all existing datasets have inherent biases that are problematic for open-domain QA systems with learned retrieval. In the Natural Questions dataset the question askers do not already know the answer. This accurately reflects a distribution of genuine information-seeking questions. However, annotators must separately find correct answers, which requires assistance from automatic tools and can introduce a moderate bias towards results from the tool.

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

All of the Natural Questions data is released under the [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

### Citation Information

```
@article{doi:10.1162/tacl\_a\_00276,
    author = {Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav},
    title = {Natural Questions: A Benchmark for Question Answering Research},
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {7},
    pages = {453-466},
    year = {2019},
    doi = {10.1162/tacl\_a\_00276},
    URL = {https://doi.org/10.1162/tacl_a_00276},
    eprint = {https://doi.org/10.1162/tacl_a_00276},
    abstract = {We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotated sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.}
}

@inproceedings{lee-etal-2019-latent,
    title = "Latent Retrieval for Weakly Supervised Open Domain Question Answering",
    author = "Lee, Kenton and Chang, Ming-Wei and Toutanova, Kristina",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1612",
    doi = "10.18653/v1/P19-1612",
    pages = "6086--6096",
    abstract = "Recent work on open domain question answering (QA) assumes strong supervision of the supporting evidence and/or assumes a blackbox information retrieval (IR) system to retrieve evidence candidates. We argue that both are suboptimal, since gold evidence is not always available, and QA is fundamentally different from IR. We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system. In this setting, evidence retrieval from all of Wikipedia is treated as a latent variable. Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets. On datasets where the questioner already knows the answer, a traditional IR system such as BM25 is sufficient. On datasets where a user is genuinely seeking an answer, we show that learned retrieval is crucial, outperforming BM25 by up to 19 points in exact match.",
}
```

### Contributions

Thanks to [@Nilanshrajput](https://github.com/Nilanshrajput) for adding this dataset.
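As a usage note, the card's config and splits load directly with Hugging Face `datasets`; a minimal sketch:

```python
# Sketch: load NQ-Open and inspect one validation example.
from datasets import load_dataset

nq = load_dataset("nq_open")   # splits: train (87,925) and validation (3,610)
sample = nq["validation"][0]
print(sample["question"])      # input question string
print(sample["answer"])        # list of acceptable answer strings
```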
ASAG2024 is a comprehensive short‑answer grading benchmark dataset created by the Zurich University of Applied Sciences. It comprises seven commonly used short‑answer grading datasets, totaling 19,000 question‑answer‑score triples across multiple subjects and education levels. Scores are normalized between 0 and 1 to facilitate comparison across datasets. The creation involved integrating data from multiple sources and standardizing them. The dataset is primarily used to evaluate and compare the performance of automated grading systems, aiming to address automation and generalizability challenges in short‑answer grading.
The SpamAssassin public email corpus is a collection of email messages assembled by members of the SpamAssassin project, suitable for testing spam‑filtering systems. The dataset contains various email samples divided into spam and ham categories, with further sub‑groups such as hard_ham, spam_2, spam, easy_ham, and easy_ham_2. Structure includes fields like label, group, text, and raw; only a training split is provided.
The dataset used in this paper is a curated subset of GRAZPEDWRI‑DX, intended for fine‑grained recognition of wrist pathologies. It includes image data for training, validation, and testing sets.
UNSW-NB15 is a comprehensive network intrusion detection dataset for academic research. It was compared statistically with the KDD99 dataset and was released by Nour Moustafa and Jill Slay; the associated publications must be cited when the dataset is used.
ActPlan‑1K is a multimodal planning benchmark jointly created by the Hong Kong University of Science and Technology and the University of California, San Diego. It evaluates vision‑language models' program planning abilities in domestic activities. The dataset includes 153 activities and 1,187 instances, each comprising a natural‑language task description and multiple environment images captured from the iGibson2 simulator. The creation process combined ChatGPT and iGibson2, converting BDDL activity definitions into natural‑language descriptions and collecting environment images. ActPlan‑1K is primarily used to assess the program planning capabilities of vision‑language models in multimodal tasks, especially for home activities and counterfactual scenarios.
The HaGRID sample dataset contains 127,331 images at 384p resolution, sampled from the original HaGRID dataset (552,992 images at 1080p, total size 716 GB). It was created to enable usage on free tiers of Google Colab and Kaggle Notebooks. The dataset is primarily intended for object‑detection tasks and includes annotations for multiple hand‑gesture categories such as call, no_gesture, like, etc. Annotation data comprise bounding‑box coordinates and class labels.
The MV-Video dataset is a large‑scale multi‑view video dataset consisting of 53 K rendered animated 3D objects. It is used to train the Animate3D model ([Animate3D: Animating Any 3D Model with Multi‑view Video Diffusion](https://animate3d.github.io/)).
The BIOSCAN_1M insect dataset provides information about insects. Each record includes four primary attributes: a DNA barcode sequence, a barcode index number (BIN), a taxonomic rank annotation, and an RGB image. The DNA barcode sequence gives the nucleotide arrangement; the BIN serves as an alternative to Linnaean names, providing a gene-centered taxonomy; the taxonomic rank annotation classifies organisms hierarchically based on evolutionary relationships; and the RGB images are raw images drawn from the 16 most densely sampled insect orders. The dataset also illustrates class distribution and class imbalance, which are inherent characteristics of insect communities.
This dataset records microplastic presence in drinking water. Each row represents a water‑sample record, containing microplastic material and type, color, water source type (tap or bottled), and the sampling location's latitude and longitude. The dataset focuses on polyethylene (PE) material for predicting PE levels across different geographic locations.
The multimodal distracted driving detection dataset (2016‑2018) and the earlier dataset (2014‑2016) include detailed information such as the number of drivers and average driving time.
The AdvancedEdit dataset was created via a novel data construction pipeline, featuring high visual quality, complex instructions, and good background consistency at a large scale.
HOI-dataset is a depth‑map hand‑part segmentation dataset for hand–object interaction, providing download links for training and validation sets.
The dataset records body‑fat percentage, age, weight, height, and ten body‑circumference measurements (e.g., waist) for 252 male subjects. Body‑fat, a health indicator, is accurately estimated via underwater weighing. By applying multivariate regression, body‑fat can be predicted using only a scale and a measuring tape, providing a convenient method for estimating male body‑fat.
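A short sketch of that regression idea follows; the file name and column names are placeholders (assumed), since the distribution's exact headers are not given here.

```python
# Sketch: predict body-fat percentage from scale-and-tape measurements
# with ordinary least squares. File and column names are assumed placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("bodyfat.csv")            # placeholder filename
X = df[["abdomen", "weight", "height"]]    # assumed column names
y = df["bodyfat"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(f"held-out R^2: {model.score(X_test, y_test):.2f}")
```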
The Observed Antibody Space dataset is a dataset in the biology and protein domain, offered in paired and unpaired configurations. The paired configuration contains sequence information for both heavy and light chains, while the unpaired configuration contains sequences for a single chain only. The dataset size is between 10M and 100M, making it suitable for large-scale data processing.
GLENDA (Gynecologic Laparoscopy Endometriosis Dataset) comprises over 350 annotated images of endometriosis lesions from more than 100 gynecologic laparoscopic surgeries, plus over 13,000 unannotated non‑pathological images from more than 20 surgeries. It is designed for automated content analysis, particularly for endometriosis detection. The dataset offers binary and multi‑class classification configurations, each with detailed feature descriptions and split information. Annotations were produced by medical experts specializing in endometriosis treatment. Use is restricted to scientific research and prohibited for commercial purposes.
The dataset comprises mental‑health counseling dialogues. Primary features include `Context`, `Response`, and `conversations`. Each conversation entry contains a `from` field (sender) and a `value` field (content). The training split contains 3,512 samples with a total size of 9,356,552 bytes.
SupplyGraph is a benchmark dataset for supply chain planning, especially suited for applications of Graph Neural Networks (GNNs). The dataset originates from a leading Fast-Moving Consumer Goods (FMCG) company in Bangladesh and includes time‑series data as node features for sales forecasting, production planning, and factory problem identification. Researchers can apply GNNs to a variety of supply‑chain problems, advancing analysis and planning in the field.
This dataset was automatically generated during the evaluation of model CultriX/MonaTrix‑v4‑7B‑DPO. It comprises 63 configurations, each mapping to a specific evaluation task. Each run creates a split named after its timestamp; the `train` split always points to the latest results. An additional `results` configuration aggregates outcomes from all runs for metric computation on the Open LLM Leaderboard.
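Such auto-generated detail datasets typically load as sketched below; the repository ID and config name follow the usual Open LLM Leaderboard convention and are assumptions, not confirmed by this description.

```python
# Sketch: load the latest results of one evaluation task.
# Repo ID and config name are assumed from the leaderboard's naming convention.
from datasets import load_dataset

details = load_dataset(
    "open-llm-leaderboard/details_CultriX__MonaTrix-v4-7B-DPO",  # assumed repo ID
    "harness_gsm8k_5",                                           # assumed config name
    split="train",  # per the description, 'train' always points to the latest run
)
print(details[0])
```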
This dataset is primarily designed for code‑related question‑answer tasks and includes features such as labNo, taskNo, questioner, question, code, startLine, endLine, questionType, answer, src, code_processed, id, raw_code, raw_comment, comment, and q_code. The dataset is split into a training set containing 35,360 samples, with a total size of 46,842,820 bytes and a download size of 17,749,500 bytes.
The Ego-QA-19k dataset contains 19k video-question-answer pairs in first-person-view scenarios. Dataset creation involved two stages: first, video subtitles were concatenated chronologically to generate video descriptions, and GPT-4 then generated 20 questions per video; second, questions containing specific cue words were filtered out, and graduate-level native English speakers verified the authenticity of each question and the video length required to answer it.
The dataset contains play‑by‑play NBA game data and shot‑detail information from the 1996/97 season through the 2023/24 season, sourced from stats.nba.com, data.nba.com, and pbpstats.com.
FRACTAL is a benchmark dataset for 3D point‑cloud semantic segmentation, comprising 100,000 point clouds covering a 250 km² area across five regions in France. Originating from the Lidar HD project, it balances rare classes through an efficient sampling strategy and includes challenging landscapes. Each point cloud spans 50 × 50 m with high point density (average 37 points/m²). The dataset provides seven semantic classes and is colorized using high‑resolution aerial imagery. It is split into training (80,000), validation (10,000), and test (10,000) subsets, with metadata documenting acquisition years and distribution details.
The dataset consists of six interrelated tables containing key information about customers, transactions, branches, and merchants: the customers, genders, cities, transactions, branches, and merchants tables.
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: conversation_id
    dtype: int64
  - name: embedding
    sequence: float64
  - name: cluster
    dtype: int64
  splits:
  - name: train
    num_bytes: 145562012
    num_examples: 14666
  download_size: 42803368
  dataset_size: 145562012
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for "med_alpaca_standardized_cluster_8"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
The dataset contains 25,000 color images divided into five classes, each class comprising 5,000 images. All images are 768 × 768 pixels in JPEG format.
MetaMathQA is a dataset enhanced from the training sets of GSM8K and MATH, without using any test‑set data. Each original question can be found in `meta-math/MetaMathQA`; these items originate from GSM8K or MATH training sets.
SnapGarden v0.6 is a dataset containing 1,000 images of 25 plant species; each image has been augmented five times and is accompanied by a question‑answer style description, intended to help AI learn how to describe plants. The dataset is split into training, validation, and test sets, suitable for image captioning, plant recognition, and educational content development. It is released under the MIT license, emphasizing respect for original image owners and copyright.
The P2ANET dataset is a large‑scale benchmark for dense action detection from table‑tennis broadcast videos. It consists of two parts: a raw dataset and a processed dataset, collected in two batches (v1 and v2). Video data were captured with an RGB monocular camera, and labels were obtained via manual annotation.
The dataset contains four features: image, original_index, landmark, and mask. The image feature is stored as an image format, original_index is an integer, landmark is a sequence of integers, and mask is null. The dataset is divided into two parts: base_transforms (69,426 samples) and random_aug_transforms (26,435 samples). Total download size is 8,177,644,392 bytes and total dataset size is 8,315,251,492.07 bytes.
The Brazilian E‑Commerce Public Dataset by Olist contains order information from 2016‑2018 across multiple marketplaces in Brazil, with 100,000 orders. Features allow multi‑dimensional analysis of orders, including status, price, payment, shipping performance, customer location, product attributes, and customer reviews. A geographic dataset with latitude‑longitude coordinates linked to Brazilian postal codes is also provided.
The `raw` folder contains the original AMASS dataset files. The `amass_joints_h36m_60.pkl` file is a preprocessed version that can be generated by running the `convert_amass.py` script. The `train.tar.bz2` file is the converted dataset; simply extract it to the appropriate location for use.
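A small sketch of that extract-and-load step, with placeholder paths; the pickle's internal structure is whatever `convert_amass.py` produced and is not documented here.

```python
# Sketch: extract the converted archive and load the preprocessed joints file.
import pickle
import tarfile

with tarfile.open("train.tar.bz2", "r:bz2") as tar:
    tar.extractall("data/amass")  # placeholder destination

with open("amass_joints_h36m_60.pkl", "rb") as f:
    joints = pickle.load(f)
print(type(joints))               # structure depends on convert_amass.py
```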
This dataset is based on the Danbooru2018 dataset for anime character recognition, containing 1 million images and 70,000 characters. The dataset has been processed to generate 1 million head images and their corresponding character labels. The character label distribution is long-tailed, with an average of 13.85 images per label.
The GLH-Bridge dataset is a large‑scale dataset for bridge detection in large‑size very high resolution (VHR) remote sensing images. More details can be found in the paper "Learning to Holistically Detect Bridges from Large‑Size VHR Remote Sensing Imagery" (TPAMI 2024).
This dataset contains programming contest‑related problems and test case information. Each sample includes problem title, problem content, platform, problem ID, contest ID, contest date, starter code, difficulty, public test cases, private test cases, and metadata. The dataset is provided as a single test set with 713 samples, total size 3.79 GB. The configuration name is `default`, and data files are located at `data/test-*`.
The IST-3 CT head scan dataset was created by the Clinical Brain Sciences Centre at the University of Edinburgh, containing 10,659 CT series for research on intracranial arterial calcification segmentation. The dataset originates from the third International Stroke Trial (IST-3), involving non‑contrast CT scans of 3,035 acute ischemic stroke patients. During creation, registration to a template and quality control ensured validity and accuracy. The dataset primarily supports deep‑learning methods for stroke risk assessment, especially automatic quantification of intracranial arterial calcifications.