FaceCaption-15M
FaceCaption‑15M is a large‑scale, diverse, high‑quality dataset of facial images and their natural‑language descriptions, containing over 15 million facial image‑description pairs, intended to promote research on face‑centric tasks. The dataset construction includes image collection, facial attribute annotation, facial description generation, and statistical analysis.
Description
FaceCaption‑15M Dataset Overview
Basic Information
- License: CC‑BY‑4.0
- Language: English
- Size: 10 M < n < 100 M
- Task Types: Image‑to‑Text, Text‑to‑Image
- Tags: Computer Vision, Face, Dataset
Dataset Description
FaceCaption‑15M is a large‑scale, diverse, high‑quality dataset pairing facial images with natural‑language descriptions. It contains over 15 million facial image‑description pairs, which its authors describe as the largest dataset of its kind.
Update Log
- 24/07/17: Released the FLIP model.
- 24/07/06: Updated citation information.
- 24/07/05: Released FaceCaption‑15M‑V1.
Dataset Versions
- FaceCaption‑15M‑V1: Includes URL, face bounding box, laion_caption, face_caption, etc.
- Upcoming: HumanCaption V2 with facial description, short description, and detailed description.
Usage
# Using the Datasets library:
from datasets import load_dataset
ds = load_dataset("OpenFace-CQUPT/FaceCaption-15M")
# Using pandas:
import pandas as pd
df = pd.read_parquet("hf://datasets/OpenFace-CQUPT/FaceCaption-15M/FaceCaption-v1.parquet")
Construction Process
1.1 Facial Image Collection
- Image Collection: Images are sourced from the LAION‑Face dataset, which originally contains over 50 million image‑text pairs.
- Face Segmentation: The RetinaFace model is used to select approximately 37 million images containing faces from LAION‑Face. The detected faces are then cropped, aligned, and filtered again, resulting in about 23 million high‑quality facial images.
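The cropping step can be illustrated with a minimal sketch. It assumes a bounding box given as (x1, y1, x2, y2) in pixels; the 20% margin, the 64-pixel minimum side, and the function names are hypothetical choices for illustration, not values or code from the dataset authors.

```python
from typing import Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def expand_and_clamp(box: Box, width: int, height: int,
                     margin: float = 0.2) -> Tuple[int, int, int, int]:
    """Grow a face box by `margin` on each side and clamp it to the image."""
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * margin
    pad_y = (y2 - y1) * margin
    left = max(0, int(round(x1 - pad_x)))
    top = max(0, int(round(y1 - pad_y)))
    right = min(width, int(round(x2 + pad_x)))
    bottom = min(height, int(round(y2 + pad_y)))
    return left, top, right, bottom


def is_large_enough(crop_box: Tuple[int, int, int, int],
                    min_side: int = 64) -> bool:
    """Hypothetical quality filter: reject crops with a side below `min_side`."""
    left, top, right, bottom = crop_box
    return (right - left) >= min_side and (bottom - top) >= min_side


crop = expand_and_clamp((100, 100, 200, 200), width=640, height=480)
print(crop, is_large_enough(crop))
```

The resulting tuple uses the (left, top, right, bottom) order that common image libraries accept for cropping. Alignment, which the pipeline also performs, is not covered here.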
1.2 Facial Attribute Annotation
- Attribute Design: 40 visual attributes are defined for facial description.
- Automatic Annotation: Open‑source algorithms generate automatic annotations; only attributes with prediction probability > 0.85 are kept, and samples must have at least five valid predicted attributes, yielding a final dataset of 15 million samples.
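The filtering rule above can be sketched as follows. The sketch assumes each sample's predictions are available as a mapping from attribute name to the probability of the predicted label; the attribute names and function names are illustrative, and only the 0.85 threshold and the minimum of five attributes come from the description.

```python
from typing import Dict, Optional

PROB_THRESHOLD = 0.85       # from the description: keep attributes with p > 0.85
MIN_VALID_ATTRIBUTES = 5    # from the description: at least five valid attributes


def keep_confident(predictions: Dict[str, float],
                   threshold: float = PROB_THRESHOLD) -> Dict[str, float]:
    """Drop every attribute whose probability is not strictly above the threshold."""
    return {name: p for name, p in predictions.items() if p > threshold}


def filter_sample(predictions: Dict[str, float]) -> Optional[Dict[str, float]]:
    """Return the confident attributes, or None if the sample has too few."""
    valid = keep_confident(predictions)
    return valid if len(valid) >= MIN_VALID_ATTRIBUTES else None


# Hypothetical predictions for one image.
sample = {
    "smiling": 0.97, "wavy_hair": 0.91, "young": 0.88,
    "eyeglasses": 0.99, "wearing_hat": 0.86, "big_nose": 0.40,
}
print(filter_sample(sample))
```

A sample that returns None would be discarded, which is how the pool shrinks from about 23 million images to 15 million under this reading of the rule.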
1.3 Facial Description Generation
- Raw Text Generation: Attribute annotations are fed into a handcrafted grammar template to produce raw text.
- Rewritten Text: The raw text is input to a large language model (LLM) to generate natural, diverse, and accurate descriptions.
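As a rough illustration of the template step, the sketch below turns a few attribute phrases into raw text. The template wording, the split into "has" and "wears" groups, and the function names are all hypothetical; the authors' actual grammar template is not reproduced here, and the LLM rewriting step is not shown.

```python
from typing import List


def join_phrases(phrases: List[str]) -> str:
    """Join phrases into an English list: 'a', 'a and b', 'a, b, and c'."""
    if not phrases:
        return ""
    if len(phrases) == 1:
        return phrases[0]
    if len(phrases) == 2:
        return f"{phrases[0]} and {phrases[1]}"
    return ", ".join(phrases[:-1]) + f", and {phrases[-1]}"


def raw_text(subject: str, has: List[str], wears: List[str]) -> str:
    """Fill a fixed two-sentence template; empty groups are skipped."""
    sentences = []
    if has:
        sentences.append(f"{subject} has {join_phrases(has)}.")
    if wears:
        sentences.append(f"{subject} is wearing {join_phrases(wears)}.")
    return " ".join(sentences)


print(raw_text("This woman",
               ["wavy hair", "arched eyebrows", "high cheekbones"],
               ["lipstick"]))
```

Template output of this kind is grammatical but repetitive, which is the motivation for passing it through an LLM to obtain more natural and varied descriptions.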
1.4 Statistical Analysis
- Comparison with Other Datasets: Includes sample count, average resolution, annotation count, average word count, etc.
- Image Quality Scores: Evaluated using BRISQUE and CLIPIQA.
- Text Distribution: Includes category distribution, sentence length distribution, 4‑gram distribution, etc.
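The text statistics can be approximated with a short sketch. It assumes lowercase whitespace tokenization, which may differ from the tokenizer used in the original analysis; the function names and example captions are illustrative only.

```python
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Assumed tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def ngram_counts(captions: Iterable[str], n: int = 4) -> Counter:
    """Count every run of n consecutive tokens across all captions."""
    counts: Counter = Counter()
    for caption in captions:
        tokens = tokenize(caption)
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts


def length_distribution(captions: Iterable[str]) -> Counter:
    """Map caption length (in tokens) to the number of captions of that length."""
    return Counter(len(tokenize(c)) for c in captions)


captions = ["A young woman with wavy hair", "A young woman with a smile"]
print(ngram_counts(captions).most_common(1))
print(length_distribution(captions))
```

Captions shorter than n tokens contribute no n-grams, so they affect only the length distribution.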
Limitations & Discussion
- Dataset Bias: Some bias may have been introduced during cleaning and construction; ongoing updates aim to minimise bias.
- Legal Compliance: The dataset follows LAION's open‑source release model, distributing the original image links, cleaned text descriptions, and facial coordinates.
- Privacy Protection: Individuals can request removal of personal images.
Contact
- Email: 2018211556@stu.cqupt.edu.cn or dw_dai@163.com
License Information
FaceCaption‑15M is released under the Creative Commons Attribution 4.0 International License (CC‑BY 4.0) for research and educational purposes only.
Citation
@misc{dai202415mmultimodalfacialimagetext,
title={15M Multimodal Facial Image‑Text Dataset},
author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
year={2024},
eprint={2407.08515},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.08515}
}
Source
Organization: huggingface
Created: 7/3/2024