FaceCaption‑15M Dataset Overview

Basic Information

License: CC‑BY‑4.0
Language: English
Size: 10 M < n < 100 M
Task Types: Image‑to‑Text, Text‑to‑Image
Tags: Computer Vision, Face, Dataset

Dataset Description

FaceCaption‑15M is a large‑scale, diverse, high‑quality dataset of facial images paired with natural‑language descriptions. It contains over 15 million image‑description pairs, making it, according to its authors, the largest dataset of its kind.

Update Log

2024‑07‑17: Released the FLIP model.
2024‑07‑06: Updated citation information.
2024‑07‑05: Released FaceCaption‑15M‑V1.

Dataset Versions

FaceCaption‑15M‑V1: Includes URL, face bounding box, laion_caption, face_caption, etc.
Upcoming: HumanCaption V2 with facial description, short description, and detailed description.

Usage

# Using the Datasets library:
from datasets import load_dataset
ds = load_dataset("OpenFace-CQUPT/FaceCaption-15M")

# Using pandas (reading hf:// paths requires the huggingface_hub package):
import pandas as pd
df = pd.read_parquet("hf://datasets/OpenFace-CQUPT/FaceCaption-15M/FaceCaption-v1.parquet")

Construction Process

1.1 Facial Image Collection

Image Collection: Images are sourced from the LAION‑Face dataset, which originally contains over 50 million image‑text pairs.
Face Segmentation: The RetinaFace model selects approximately 37 million images containing faces from LAION‑Face. These are then cropped, aligned, and filtered, leaving about 23 million high‑quality facial images.
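
As a rough illustration of the cropping step, the sketch below expands a detected face box by a relative margin and clamps it to the image bounds. The (x1, y1, x2, y2) pixel layout, the margin value, and the function name are assumptions made for this example; the source does not state the cropping parameters actually used.

```python
def expand_and_clamp(box, width, height, margin=0.2):
    """Expand an (x1, y1, x2, y2) box by a relative margin, then clamp it.

    The box layout and the default margin are illustrative assumptions,
    not the dataset's documented format.
    """
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * margin  # horizontal padding in pixels
    dy = (y2 - y1) * margin  # vertical padding in pixels
    return (
        max(0, int(x1 - dx)),
        max(0, int(y1 - dy)),
        min(width, int(x2 + dx)),
        min(height, int(y2 + dy)),
    )
```

The returned tuple can be passed to any image library's crop routine that accepts left, top, right, bottom coordinates.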

1.2 Facial Attribute Annotation

Attribute Design: 40 visual attributes are defined for facial description.
Automatic Annotation: Open‑source algorithms annotate each image automatically. Only attributes predicted with probability greater than 0.85 are kept, and a sample is retained only if it has at least five such attributes. This yields the final dataset of 15 million samples.
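
The filtering rule above can be sketched as follows. The 0.85 threshold and the five-attribute minimum come from the text; the function names and the dictionary layout (attribute name mapped to predicted probability) are assumptions for illustration only.

```python
THRESHOLD = 0.85  # minimum prediction probability, as stated in the text
MIN_ATTRS = 5     # minimum number of confident attributes per sample

def confident_attributes(predictions, threshold=THRESHOLD):
    """Keep only attributes whose probability is strictly above the threshold."""
    return {name: p for name, p in predictions.items() if p > threshold}

def keep_sample(predictions, threshold=THRESHOLD, min_attrs=MIN_ATTRS):
    """Return True if the sample has at least `min_attrs` confident attributes."""
    return len(confident_attributes(predictions, threshold)) >= min_attrs
```

For example, a sample with six predicted attributes of which only four exceed 0.85 would be dropped under this rule.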

1.3 Facial Description Generation

Raw Text Generation: Attribute annotations are fed into a handcrafted grammar template to produce raw text.
Rewritten Text: The raw text is input to a large language model (LLM) to generate natural, diverse, and accurate descriptions.
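
A minimal sketch of the template step, assuming a single hypothetical sentence pattern; the actual grammar template is not given in the source, and the attribute phrases below are made up for illustration. The LLM rewriting step is not shown.

```python
def join_phrases(phrases):
    """Join phrases as 'a, b and c' (empty string for an empty list)."""
    if not phrases:
        return ""
    if len(phrases) == 1:
        return phrases[0]
    return ", ".join(phrases[:-1]) + " and " + phrases[-1]

def raw_text(subject, features):
    """Fill a hypothetical template: 'This is a <subject> with <features>.'"""
    if not features:
        return f"This is a {subject}."
    return f"This is a {subject} with {join_phrases(features)}."
```

Calling raw_text("young woman", ["black hair", "a smile", "earrings"]) produces the kind of stiff, repetitive sentence that the rewriting step is meant to turn into more natural text.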

1.4 Statistical Analysis

Comparison with Other Datasets: Includes sample count, average resolution, annotation count, average word count, etc.
Image Quality Scores: Evaluated using BRISQUE and CLIPIQA.
Text Distribution: Includes category distribution, sentence length distribution, 4‑gram distribution, etc.
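
The text statistics listed above could be computed along these lines. This is a sketch that assumes whitespace tokenization and lowercasing, which may differ from the tokenization the authors used; the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n=4):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def text_stats(captions, n=4):
    """Count sentence lengths (in tokens) and n-gram frequencies."""
    lengths = Counter()
    grams = Counter()
    for caption in captions:
        tokens = caption.lower().split()  # naive whitespace tokenization
        lengths[len(tokens)] += 1
        grams.update(ngrams(tokens, n))
    return lengths, grams
```

Both results are Counter objects, so grams.most_common(k) gives the k most frequent 4-grams directly.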

Limitations & Discussion

Dataset Bias: Some bias may have been introduced during cleaning and construction; ongoing updates aim to minimise bias.
Legal Compliance: Follows LAION's open‑source release model, providing original image links, cleaned text descriptions, and facial coordinates.
Privacy Protection: Individuals can request removal of personal images.

Contact

Email: 2018211556@stu.cqupt.edu.cn or dw_dai@163.com

License Information

FaceCaption‑15M is released under the Creative Commons Attribution 4.0 International License (CC‑BY 4.0) for research and educational purposes only.

Citation

@misc{dai202415mmultimodalfacialimagetext,
      title={15M Multimodal Facial Image‑Text Dataset},
      author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
      year={2024},
      eprint={2407.08515},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.08515}
}
