FaceCaption-15M

FaceCaption‑15M is a large‑scale, diverse, high‑quality dataset of facial images and their natural‑language descriptions, containing over 15 million facial image‑description pairs, intended to promote research on face‑centric tasks. The dataset construction includes image collection, facial attribute annotation, facial description generation, and statistical analysis.

Updated 7/5/2024
huggingface

Description

FaceCaption‑15M Dataset Overview

Basic Information

  • License: CC‑BY‑4.0
  • Language: English
  • Size: 10M < n < 100M
  • Task Types: Image‑to‑Text, Text‑to‑Image
  • Tags: Computer Vision, Face, Dataset

Dataset Description

FaceCaption‑15M is a large‑scale, diverse, high‑quality dataset of facial images paired with natural‑language descriptions. It contains over 15 million image‑description pairs, which its authors describe as the largest dataset of its kind.

Update Log

  • 2024‑07‑17: Released the FLIP model.
  • 2024‑07‑06: Updated citation information.
  • 2024‑07‑05: Released FaceCaption‑15M‑V1.

Dataset Versions

  • FaceCaption‑15M‑V1: Includes URL, face bounding box, laion_caption, face_caption, etc.
  • Upcoming: HumanCaption V2 with facial description, short description, and detailed description.

Usage

# Using the Datasets library:
from datasets import load_dataset

ds = load_dataset("OpenFace-CQUPT/FaceCaption-15M")

# Using pandas (reading hf:// paths requires the huggingface_hub package):
import pandas as pd

df = pd.read_parquet("hf://datasets/OpenFace-CQUPT/FaceCaption-15M/FaceCaption-v1.parquet")

Construction Process

1.1 Facial Image Collection

  • Image Collection: Images are sourced from the LAION‑Face dataset, which originally contains over 50 million image‑text pairs.
  • Face Segmentation: The RetinaFace model is run over LAION‑Face to locate faces, yielding approximately 37 million face images. These are then cropped, aligned, and filtered, leaving about 23 million high‑quality facial images.
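
The cropping step can be illustrated with a minimal sketch. It assumes a detector has already returned a face box as (x1, y1, x2, y2) pixel coordinates. The function names, the 20% margin, and the 64-pixel minimum side are hypothetical choices for illustration; the source does not state the actual cropping or filtering criteria.

```python
def expand_and_clamp_box(box, image_size, margin=0.2):
    """Grow a face box by `margin` of its size on each side, clamped to the image.

    box:        (x1, y1, x2, y2) in pixels (assumed format)
    image_size: (width, height) in pixels
    margin:     hypothetical padding ratio, not taken from the dataset paper
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    pad_x = (x2 - x1) * margin
    pad_y = (y2 - y1) * margin
    left = max(0, int(round(x1 - pad_x)))
    top = max(0, int(round(y1 - pad_y)))
    right = min(width, int(round(x2 + pad_x)))
    bottom = min(height, int(round(y2 + pad_y)))
    return left, top, right, bottom


def is_large_enough(box, min_side=64):
    """Hypothetical quality filter: reject crops whose shorter side is too small."""
    left, top, right, bottom = box
    return (right - left) >= min_side and (bottom - top) >= min_side


crop = expand_and_clamp_box((100, 100, 200, 200), image_size=(210, 400))
print(crop, is_large_enough(crop))
```

The returned tuple uses the (left, top, right, bottom) order that common imaging libraries accept for cropping, so it could be passed on to an actual crop call.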

1.2 Facial Attribute Annotation

  • Attribute Design: 40 visual attributes are defined for facial description.
  • Automatic Annotation: Open‑source algorithms generate automatic annotations; only attributes with prediction probability > 0.85 are kept, and samples must have at least five valid predicted attributes, yielding a final dataset of 15 million samples.
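
The filtering rule above can be sketched in a few lines. The 0.85 threshold and the minimum of five valid attributes come from the description; the function names, the dictionary input format, and the attribute names in the example are assumptions made for illustration.

```python
PROB_THRESHOLD = 0.85       # keep attributes predicted with probability > 0.85
MIN_VALID_ATTRIBUTES = 5    # a sample needs at least five such attributes


def confident_attributes(predictions, threshold=PROB_THRESHOLD):
    """Return only the attributes whose predicted probability exceeds the threshold.

    predictions: dict mapping attribute name -> probability (assumed format)
    """
    return {name: p for name, p in predictions.items() if p > threshold}


def keep_sample(predictions, threshold=PROB_THRESHOLD, min_valid=MIN_VALID_ATTRIBUTES):
    """A sample is kept when enough attributes pass the confidence threshold."""
    return len(confident_attributes(predictions, threshold)) >= min_valid


# Hypothetical attribute names and scores, for illustration only.
sample = {
    "black_hair": 0.97, "smiling": 0.91, "young": 0.88,
    "wavy_hair": 0.86, "eyeglasses": 0.93, "beard": 0.40,
}
print(keep_sample(sample))  # five attributes exceed 0.85, so the sample is kept
```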

1.3 Facial Description Generation

  • Raw Text Generation: Attribute annotations are fed into a handcrafted grammar template to produce raw text.
  • Rewritten Text: The raw text is input to a large language model (LLM) to generate natural, diverse, and accurate descriptions.
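
As a rough illustration of the first step, the sketch below turns a list of attribute labels into one templated sentence. The template wording, the function name, and the attribute names are hypothetical; the authors' actual grammar template is not reproduced here, and the LLM rewriting step is not shown.

```python
def attributes_to_raw_text(attributes, subject="This person"):
    """Fill a simple hand-written template with attribute labels.

    attributes: list of labels such as "Black_Hair" (hypothetical names)
    Returns an empty string when there is nothing to describe.
    """
    if not attributes:
        return ""
    phrases = [label.replace("_", " ").lower() for label in attributes]
    if len(phrases) == 1:
        listed = phrases[0]
    else:
        listed = ", ".join(phrases[:-1]) + " and " + phrases[-1]
    return f"{subject} has {listed}."


raw = attributes_to_raw_text(["Black_Hair", "Bushy_Eyebrows", "High_Cheekbones"])
print(raw)
```

Templated text of this kind tends to be repetitive and stilted, which is the motivation for passing it through an LLM to obtain more natural and varied descriptions.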

1.4 Statistical Analysis

  • Comparison with Other Datasets: Includes sample count, average resolution, annotation count, average word count, etc.
  • Image Quality Scores: Evaluated using BRISQUE and CLIPIQA.
  • Text Distribution: Includes category distribution, sentence length distribution, 4‑gram distribution, etc.
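
The text statistics listed above can be approximated with the standard library. This is a minimal sketch assuming lowercase whitespace tokenization, which may differ from the tokenization the authors used; the function names and example captions are illustrative only.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def ngram_counts(captions, n=4):
    """Count every contiguous n-gram across all captions."""
    counts = Counter()
    for caption in captions:
        tokens = tokenize(caption)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def length_distribution(captions):
    """Map caption length in tokens to the number of captions with that length."""
    return Counter(len(tokenize(caption)) for caption in captions)


captions = [
    "A young woman with long hair",
    "A young woman with a smile",
]
print(ngram_counts(captions).most_common(1))
print(length_distribution(captions))
```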

Limitations & Discussion

  • Dataset Bias: Some bias may have been introduced during cleaning and construction; ongoing updates aim to minimise bias.
  • Legal Compliance: Follows LAION's open‑source release model, providing original image links, cleaned text descriptions, and facial coordinates.
  • Privacy Protection: Individuals can request removal of personal images.

License Information

FaceCaption‑15M is released under the Creative Commons Attribution 4.0 International License (CC‑BY 4.0) for research and educational purposes only.

Citation

@misc{dai202415mmultimodalfacialimagetext,
      title={15M Multimodal Facial Image‑Text Dataset},
      author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
      year={2024},
      eprint={2407.08515},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.08515}
}

Topics

Face Recognition
Natural Language Processing

Source

Organization: huggingface

Created: 7/3/2024
