JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

Selfie-with-ID

Authentication
Face Recognition

The dataset contains over 65,000 photos of more than 5,000 individuals from 40 countries, making it a useful resource for developing and evaluating authentication solutions. It is especially suited to biometric verification, notably facial recognition in financial services. For each individual there are 13 selfies and 2 ID photos, captured with a variety of devices and at various resolutions. The dataset is intended to support more robust re‑identification algorithms and stronger security measures across applications.

huggingface

ffhq-256_training_faces

Face Recognition
Computer Vision

The dataset has four features: image, original_index, landmark, and mask. The image feature is of image type, original_index is an integer, landmark is a sequence of integers, and mask is null. The dataset has two splits: base_transforms (69,426 samples) and random_aug_transforms (26,435 samples). The total download size is 8,177,644,392 bytes and the total dataset size is 8,315,251,492.07 bytes.

huggingface
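
The split counts and byte sizes quoted for ffhq-256_training_faces are easier to reason about as totals and proportions. The following is a minimal sketch that only does arithmetic on the figures from the listing; the constant and function names are illustrative and not part of the dataset.

```python
# Figures copied from the listing; all names here are illustrative.
SPLITS = {
    "base_transforms": 69_426,
    "random_aug_transforms": 26_435,
}
DOWNLOAD_BYTES = 8_177_644_392
DATASET_BYTES = 8_315_251_492.07


def split_fractions(splits):
    """Return each split's share of the total sample count."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


total_samples = sum(SPLITS.values())
fractions = split_fractions(SPLITS)

print(f"total samples: {total_samples:,}")
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
print(f"download: {to_gb(DOWNLOAD_BYTES):.2f} GB, "
      f"on disk: {to_gb(DATASET_BYTES):.2f} GB")
```

The decimal gigabyte convention is a choice made for this sketch; divide by 2**30 instead if binary gibibytes are wanted.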

Flickr-Faces-HQ (FFHQ)

Face Recognition
Generative Adversarial Networks

Flickr‑Faces‑HQ (FFHQ) is a high‑quality face image dataset originally created as a benchmark for Generative Adversarial Networks (GANs). The dataset contains 70,000 high‑quality PNG images at a resolution of 1024×1024, featuring significant variation in age, race, and background, as well as accessories such as glasses, sunglasses, and hats. Images were scraped from Flickr, inheriting its biases, and were automatically aligned and cropped using dlib. Only images with appropriate licenses were collected, and various automatic filters and Amazon Mechanical Turk were employed to remove occasional statues, paintings, or non‑photographic content.

github

FaceCaption-15M

Face Recognition
Natural Language Processing

FaceCaption‑15M is a large‑scale, diverse, high‑quality dataset of facial images and their natural‑language descriptions, containing over 15 million facial image‑description pairs, intended to promote research on face‑centric tasks. The dataset construction includes image collection, facial attribute annotation, facial description generation, and statistical analysis.

huggingface

jxie/celeba-hq

Face Recognition
Gender Classification

The dataset contains images and labels. The image feature is of image type, and the label is a binary classification with two classes: female and male. The dataset is split into a training set of 28,000 samples and a validation set of 2,000 samples. Total download size is 2,762,725,456 bytes and total size is 2,763,112,879 bytes. Data file paths are `train-*` and `validation-*`.

huggingface
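
As a rough illustration of how the jxie/celeba-hq description maps onto code, the sketch below records the two splits and the two label classes, and wraps loading in a helper. Several things are assumptions rather than facts from the listing: that the listing title is also the Hugging Face Hub repository ID, that the class indices follow the order in which the classes are named (0 for female, 1 for male), and that the third-party `datasets` package is installed. The helper names are hypothetical.

```python
# Assumption: class indices follow the order given in the listing.
LABEL_NAMES = ["female", "male"]

# Split sizes as stated in the listing.
SPLIT_SIZES = {"train": 28_000, "validation": 2_000}


def label_name(label_id):
    """Map an integer class index to its class name."""
    return LABEL_NAMES[label_id]


def validation_fraction(split_sizes):
    """Share of all samples that belongs to the validation split."""
    return split_sizes["validation"] / sum(split_sizes.values())


def stream_split(split="train"):
    """Lazily iterate over one split instead of downloading it in full.

    Needs the third-party `datasets` package and network access.
    Using the listing title as the repository ID is an assumption.
    """
    # Imported here so the helpers above work without the package.
    from datasets import load_dataset

    return load_dataset("jxie/celeba-hq", split=split, streaming=True)


print(f"validation share: {validation_fraction(SPLIT_SIZES):.1%}")
```

If the assumptions hold, `next(iter(stream_split("train")))` should return a single record whose label can be passed to `label_name`. Streaming avoids fetching the roughly 2.76 GB download just to inspect a few samples.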

student/FFHQ

Face Recognition
Image Processing

The FFHQ (Flickr‑Faces‑HQ) dataset comprises 70,000 high‑quality PNG images at 1024 × 1024 resolution, featuring diverse ages, ethnicities, backgrounds, and accessories (glasses, hats, etc.). Images were sourced from Flickr under permissive licenses, automatically aligned and cropped using dlib, and filtered to remove non‑photos. The dataset supports research in generative adversarial networks and related fields.

huggingface

pca-face-dataset

Face Recognition
Principal Component Analysis

This dataset contains representative images generated via Principal Component Analysis for face recognition.

github
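
The listing does not say how the pca-face-dataset images were produced, so the following is only a general sketch of what applying Principal Component Analysis to face images involves: flatten each image to a vector, subtract the mean image, and find the direction of greatest variance. It uses power iteration in pure Python on made-up four-pixel "images"; the function names and toy data are hypothetical, and real work would normally use a numerical library.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def first_principal_component(images, iterations=100, seed=0):
    """Estimate the leading principal component by power iteration.

    `images` is a list of equally sized, flattened images.
    Returns (mean_image, component).
    """
    mean = [sum(pixels) / len(images) for pixels in zip(*images)]
    centered = [[p - m for p, m in zip(img, mean)] for img in images]

    rng = random.Random(seed)
    vec = normalize([rng.uniform(-1.0, 1.0) for _ in mean])

    for _ in range(iterations):
        # Apply X^T X to vec without building the covariance matrix.
        scores = [sum(c * v for c, v in zip(row, vec)) for row in centered]
        product = [sum(s * row[j] for s, row in zip(scores, centered))
                   for j in range(len(mean))]
        if not any(product):
            break  # no variance left to follow (e.g. identical images)
        vec = normalize(product)
    return mean, vec


# Toy data: only the first two pixels vary, and they vary together.
TOY_IMAGES = [
    [0.0, 0.0, 5.0, 5.0],
    [2.0, 2.0, 5.0, 5.0],
    [4.0, 4.0, 5.0, 5.0],
    [6.0, 6.0, 5.0, 5.0],
]

mean, component = first_principal_component(TOY_IMAGES)
print("mean image:", mean)
print("first component:", [round(x, 4) for x in component])
```

Reshaping the mean and the component back to the image dimensions gives the kind of "representative image" often called an eigenface. The sign of the component is arbitrary, so compare absolute values when checking results.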

UADFV, EBV, Deepfake-TIMIT, DFFD, Wild Deepfake, Celeb-DF (v1), Celeb-DF (v2), DFDC, DeeperForensics, FaceForensics++, DFGC, FFIW-10K, ForgeryNet

Deepfake Detection
Face Recognition

This entry lists multiple deepfake‑related datasets, each with its own uses and characteristics. For example, UADFV targets detection of inconsistent head poses, and EBV targets exposure of AI‑generated fake‑face videos by detecting eye blinking.

github

CASIA-SURF, CASIA-SURF-CeFA, CASIA-SURF-HiFiMask, CASIA-SURF-SuHiFiMask

Face Recognition
Anti‑spoofing

This entry covers a family of large multimodal benchmark datasets for face anti‑spoofing: CASIA‑SURF, CASIA‑SURF‑CeFA, CASIA‑SURF‑HiFiMask, and CASIA‑SURF‑SuHiFiMask. Together they support face anti‑spoofing research across various modalities, as well as cross‑ethnicity analyses.

github

IMDb-Face, MegaFace

Face Recognition
Image Dataset

The IMDb‑Face dataset is used for face recognition and contains facial images gathered from IMDb. MegaFace is a large‑scale face recognition benchmark comprising multiple subsets for various recognition tasks.

github

Synthetic Faces High Quality (SFHQ) dataset

Face Recognition
Image Processing

The dataset comprises approximately 425,000 carefully selected high‑quality synthetic face images at 1024 × 1024 resolution. The faces were generated by transforming source material such as paintings, sketches, 3D models, and the output of text‑to‑image generators into realistic faces. The dataset also includes facial landmarks (an extended set of 110 points) and semantic segmentation masks for face parsing.

github

facescrub-dataset

Face Recognition
Computer Vision

The dataset contains 47,500 colour face images, each 50 × 50 pixels, sourced from FaceScrub. It is intended for training and validation. The faces were extracted using OpenCV HOG face detection and have not been manually cleaned.

github
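
Because the facescrub-dataset images are small and uniform in size, the whole set could plausibly be held in memory as a single array. The sketch below estimates that footprint from the figures in the listing. It assumes that "colour" means three channels and that pixels are stored as 8-bit integers, neither of which the listing states; the names are illustrative.

```python
NUM_IMAGES = 47_500  # from the listing
HEIGHT = 50          # from the listing
WIDTH = 50           # from the listing
CHANNELS = 3         # assumption: "colour" means 3-channel RGB


def array_bytes(num_images, height, width, channels, bytes_per_value=1):
    """Size in bytes of a dense (N, H, W, C) array of fixed-width values."""
    return num_images * height * width * channels * bytes_per_value


# Assumption: 1 byte per value (uint8) as stored, 4 bytes (float32) once
# the images are converted for training.
as_uint8 = array_bytes(NUM_IMAGES, HEIGHT, WIDTH, CHANNELS, 1)
as_float32 = array_bytes(NUM_IMAGES, HEIGHT, WIDTH, CHANNELS, 4)

print(f"array shape: ({NUM_IMAGES}, {HEIGHT}, {WIDTH}, {CHANNELS})")
print(f"uint8:   {as_uint8 / 1e6:.2f} MB")
print(f"float32: {as_float32 / 1e6:.2f} MB")
```

Under these assumptions the raw pixel data fits comfortably in memory on most machines, so loading it in one pass is a reasonable design choice.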