Explore high-quality datasets for your AI and machine learning projects.
The Real‑Vul dataset was developed by the School of Computer Science at the University of Waterloo to provide a comprehensive benchmark for evaluating deep‑learning models on real‑world software vulnerability detection. It contains 5,528 C/C++ function samples drawn from diverse software projects, including the Chromium browser and the Linux operating system. The dataset uses a chronological (time‑based) split so that training data predates test data, mirroring how models are deployed in practice. Real‑Vul is primarily intended for assessing and improving the practical performance of existing vulnerability detection models, especially in complex, varied real‑world software environments.
CC6204‑Hackaton‑CUB200 is a multimodal dataset for image‑ and text‑classification tasks, well suited to multimodal classification problems. It contains bird images paired with descriptive texts: each image has ten textual descriptions, and each instance is labeled with its bird species. The dataset provides training (5,994 observations) and test (5,794 observations) splits. It originates from the Caltech Vision Lab; the associated paper is "The Caltech‑UCSD Birds‑200‑2011 Dataset". Creators and contributors include Catherine Wah and Cristóbal Alcázar.
The BlessemFlood21 dataset was created by the Fraunhofer Institute for Image Processing and other institutions, focusing on high‑resolution RGB images of non‑coastal flood scenes. It contains 4,623 images, each 512×512 pixels, captured by a drone after the 2021 Erftstadt‑Blessem flood event. Detailed water masks were generated using semi‑supervised human‑in‑the‑loop techniques, primarily for training and testing deep‑learning models to support flood detection and emergency response.
The Draper VDISC dataset is a source‑code vulnerability detection dataset containing 1.27 million functions mined from open‑source software, each annotated with potential vulnerabilities via static analysis. The data are split into training, validation, and test sets in an 80:10:10 ratio, stored in HDF5 format. Each function's source code is stored as a variable‑length UTF‑8 string and includes five binary vulnerability labels corresponding to four common CWEs (CWE‑120, CWE‑119, CWE‑469, CWE‑476) and an “other CWE”. The dataset is sponsored by the U.S. Air Force Research Laboratory as part of the DARPA MUSE program.
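Since VDISC ships as HDF5 with variable‑length UTF‑8 function sources and five binary label columns, a split can be read with `h5py`. The key names below (`functionSource`, `CWE-120`, etc.) follow the layout described above but are assumptions; confirm them with `f.keys()` on the real files. This sketch demonstrates the reader on a tiny synthetic file that mirrors that assumed layout:

```python
import os
import tempfile

import h5py
import numpy as np

# Assumed VDISC-style key names; verify against the real files with f.keys().
LABEL_KEYS = ["CWE-120", "CWE-119", "CWE-469", "CWE-476", "CWE-other"]

def read_split(path):
    """Load variable-length UTF-8 sources and the five binary CWE labels."""
    with h5py.File(path, "r") as f:
        sources = [s.decode("utf-8") for s in f["functionSource"][:]]
        # Stack the five per-CWE columns into an (n_functions, 5) bool matrix.
        labels = np.stack([f[k][:] for k in LABEL_KEYS], axis=1).astype(bool)
    return sources, labels

# Demo on a tiny synthetic file that mirrors the assumed layout.
path = os.path.join(tempfile.mkdtemp(), "vdisc_demo.h5")
with h5py.File(path, "w") as f:
    str_dt = h5py.string_dtype(encoding="utf-8")
    f.create_dataset("functionSource", data=["int main() { return 0; }"], dtype=str_dt)
    for k in LABEL_KEYS:
        f.create_dataset(k, data=np.array([k == "CWE-476"]))

sources, labels = read_split(path)
print(len(sources), labels.shape)  # 1 (1, 5)
```

Stacking the five label datasets into one boolean matrix makes multi-label training targets straightforward, since a single function can carry more than one CWE flag.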
In this study, we used the "Odontoai" dataset to train and improve a YOLOv8‑seg model for efficient segmentation of dental radiographs. The dataset includes 52 distinct tooth categories (tooth‑11 through tooth‑85, following FDI tooth numbering), with each image annotated by professional dentists. Standardized and verified annotations ensure high accuracy and consistency. The images cover diverse angles, lighting conditions, and backgrounds, enhancing model generalization. This high‑quality dataset enables the YOLOv8‑seg model to accurately identify and segment various tooth structures, supporting advanced dental diagnostics.
This dataset is specifically designed for welding quality inspection, covering three categories: "Bad Weld" (defective welds caused by poor process, e.g., porosity, cracks, or lack of fusion), "Defect" (subtle imperfections such as surface irregularities or uneven weld width), and "Good Weld" (standard-compliant samples that serve as positive examples).
The msmarco-document-v2/trec-dl-2019 dataset, provided by the ir-datasets package, focuses on text retrieval tasks. It contains 200 queries and 13,940 relevance judgments (qrels) for evaluating document retrieval systems. Example usage includes loading and processing the data with HuggingFace's datasets library in Python.
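Besides the HuggingFace route mentioned above, a common alternative is the `ir_datasets` Python package itself. The sketch below assumes that package is installed; because `ir_datasets.load()` downloads data on first use, the actual loading is kept inside a function, and the offline demo exercises only a small pure-Python helper that groups qrels by query:

```python
from collections import defaultdict

def qrels_by_query(qrels):
    """Group (query_id, doc_id, relevance) judgments into {query_id: {doc_id: rel}}."""
    grouped = defaultdict(dict)
    for query_id, doc_id, relevance in qrels:
        grouped[query_id][doc_id] = relevance
    return dict(grouped)

def load_trec_dl_2019():
    """Fetch the 200 queries and 13,940 qrels; downloads data on first use."""
    import ir_datasets  # pip install ir-datasets
    ds = ir_datasets.load("msmarco-document-v2/trec-dl-2019")
    queries = {q.query_id: q.text for q in ds.queries_iter()}
    qrels = [(j.query_id, j.doc_id, j.relevance) for j in ds.qrels_iter()]
    return queries, qrels_by_query(qrels)

# Offline demo of the grouping helper on synthetic judgments.
demo = qrels_by_query([("q1", "d1", 3), ("q1", "d2", 0), ("q2", "d9", 1)])
print(demo["q1"])  # {'d1': 3, 'd2': 0}
```

The per-query dictionaries produced here are the shape most evaluation tools (e.g., nDCG or MAP implementations) expect for relevance judgments.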
The UCF‑101 dataset is a widely used benchmark for video action recognition. It contains 13,320 videos across 101 action categories, totaling about 7.2 GB. Videos have a resolution of 320×240 pixels and durations ranging from 1 to 30 seconds. The videos were originally collected from YouTube and manually annotated; a ZIP version is provided in place of the original RAR distribution for easier access. The dataset is suitable for research on video‑based action recognition, such as training and evaluating deep‑learning models.
Caltech‑101 dataset: VGG from scratch plus ResNet. This Jupyter notebook demonstrates CNN‑based image classification using a VGG model built from scratch and a pretrained ResNet model, with F1 score and accuracy for performance evaluation. Dataset source: https://www.tensorflow.org/datasets/catalog/caltech101?hl=es-419
The RichHF‑18K dataset contains the extensive human‑feedback labels we collected for our CVPR 2024 paper, along with the original filenames of the labeled images. It includes subjective scores (e.g., aesthetic ratings), human‑annotated heatmaps (e.g., regions of pixel‑level distortion), and misalignment marks in the textual prompts. The dataset consists of 17,760 examples in TensorFlow Example format, comprising 15,810 training examples, 995 development examples, and 955 test examples.
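Because the records are serialized `tf.train.Example` protos, they can be parsed with TensorFlow's feature-spec API. The feature names below (`filename`, `aesthetics_score`) are illustrative assumptions, not RichHF‑18K's actual schema; inspect one record to learn the real field names. The sketch builds a synthetic record and parses it back:

```python
import tensorflow as tf

# Feature names here are assumptions for illustration only; check one real
# RichHF-18K record for the actual schema before reusing this spec.
def make_example(filename: str, score: float) -> bytes:
    """Serialize a minimal tf.train.Example with two assumed features."""
    return tf.train.Example(features=tf.train.Features(feature={
        "filename": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[filename.encode("utf-8")])),
        "aesthetics_score": tf.train.Feature(
            float_list=tf.train.FloatList(value=[score])),
    })).SerializeToString()

FEATURE_SPEC = {
    "filename": tf.io.FixedLenFeature([], tf.string),
    "aesthetics_score": tf.io.FixedLenFeature([], tf.float32),
}

record = make_example("img_0001.jpg", 0.75)
parsed = tf.io.parse_single_example(record, FEATURE_SPEC)
print(parsed["filename"].numpy().decode("utf-8"), float(parsed["aesthetics_score"]))
# img_0001.jpg 0.75
```

In practice the same `FEATURE_SPEC` would be passed to `tf.data.TFRecordDataset(...).map(...)` to stream the training split.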
The MalImg dataset supports malware image‑classification research: malware binaries are converted into grayscale images so that image‑based deep‑learning techniques can classify malware families more efficiently.
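The conversion behind MalImg-style classification reads a binary one byte per pixel into a fixed-width 2D array. A minimal numpy sketch (the fixed width and zero-padding of the last row are common conventions, not a requirement of the dataset):

```python
import numpy as np

def bytes_to_grayscale(raw: bytes, width: int = 256) -> np.ndarray:
    """Map a binary's bytes onto a 2D uint8 image, one byte per pixel,
    zero-padding the final row so every row has the same width."""
    buf = np.frombuffer(raw, dtype=np.uint8)
    rows = -(-len(buf) // width)  # ceiling division
    padded = np.zeros(rows * width, dtype=np.uint8)
    padded[: len(buf)] = buf
    return padded.reshape(rows, width)

# Demo: 1280 synthetic bytes at width 64 give a 20x64 grayscale image.
img = bytes_to_grayscale(bytes(range(256)) * 5, width=64)
print(img.shape)  # (20, 64)
```

The resulting array can be saved as a PNG or fed directly to a CNN; distinct malware families tend to produce visually distinct byte textures, which is what the classifier learns.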
The TartanAir dataset is an image dataset for visual SLAM and deep learning tasks, containing images and depth information across various environments, suitable for training and testing image feature matching algorithms.
A dataset for defect detection, used with YOLOv8 and enhanced variants that incorporate Coordinate Attention and Swin Transformer modules.
The BraTS 2020 dataset is used for brain tumor segmentation projects on multimodal MRI scans. It aims to accurately segment three tumor sub‑regions: GD‑enhancing tumor (ET), peritumoral edema (ED), and necrotic and non‑enhancing tumor core (NCR/NET). By developing automated segmentation methods with deep learning, it seeks to help medical professionals analyze brain tumor MRI scans more efficiently and accurately, improving diagnosis, treatment planning, and monitoring.
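BraTS label maps conventionally encode the sub-regions as 1 = NCR/NET, 2 = ED, and 4 = ET, and evaluation is run on three nested composite regions derived from them. A small numpy sketch of that standard region construction (the label convention is the widely used BraTS one; verify it against the specific release you download):

```python
import numpy as np

# Conventional BraTS labels: 0 = background, 1 = NCR/NET, 2 = ED, 4 = ET.
def brats_regions(seg: np.ndarray) -> dict:
    """Derive the three nested evaluation regions from a label map:
    enhancing tumor (ET), tumor core (TC = ET + NCR/NET),
    and whole tumor (WT = TC + ED)."""
    et = seg == 4
    tc = et | (seg == 1)
    wt = tc | (seg == 2)
    return {"ET": et, "TC": tc, "WT": wt}

# Demo on a 2x2 toy label map containing one voxel of each label.
seg = np.array([[0, 1], [2, 4]])
regions = brats_regions(seg)
print(regions["WT"].sum(), regions["TC"].sum(), regions["ET"].sum())  # 3 2 1
```

Dice scores for a segmentation model are then computed per region (ET, TC, WT) rather than per raw label, matching how BraTS results are reported.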