
Danbooru2018 Anime Character Recognition Dataset

This dataset for anime character recognition is derived from the Danbooru2018 dataset. After processing, it contains about 1 million head images (0.97M after all filtering steps), each with a single character label, covering roughly 70,000 characters. The character label distribution is long‑tailed, with an average of 13.85 images per label.

Source: GitHub
Created: Jul 2, 2019
Updated: May 18, 2024

Danbooru 2018 Anime Character Recognition Dataset Overview

Dataset Description

  • Dataset Name: Danbooru 2018 Anime Character Recognition Dataset
  • Dataset Source: Based on the Danbooru 2018 dataset.
  • Dataset Content: Contains about 1,000,000 head images (0.97M after all filtering steps), each labeled with one of roughly 70,000 character labels.
  • Dataset Purpose: Used for training and evaluating anime character recognition algorithms.

Data Processing Method

  • Label Filtering: Keep only character category labels.
  • Image Filtering: Retain images that contain only a single character label.
  • Head Detection: Run a head detection model on each image to extract head bounding boxes.
  • Multi‑Head Filtering: Remove images in which more than one head bounding box is detected.
  • Final Data Volume: 0.97M images, 70k labels.
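
The filtering steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's pipeline: the `Post` record, its field names, the `"character"` category string, and the box layout are all hypothetical, and dropping images with zero detected heads is an extra assumption (the text only mentions removing images with multiple heads).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed box layout (x1, y1, x2, y2); the real format is not given in the text.
Box = Tuple[float, float, float, float]


@dataclass
class Post:
    """Hypothetical record for one source image."""
    filename: str
    tags: Dict[str, str]   # tag name -> tag category, e.g. "character" or "general"
    head_boxes: List[Box]  # output of some head detector (not specified here)


def filter_posts(posts: List[Post]) -> List[Tuple[str, str, Box]]:
    """Return (filename, character label, head box) for posts that pass all filters."""
    kept = []
    for post in posts:
        # Label filtering: keep only tags in the character category.
        characters = [name for name, cat in post.tags.items() if cat == "character"]
        # Image filtering: keep images with exactly one character label.
        if len(characters) != 1:
            continue
        # Multi-head filtering: drop images with more than one detected head.
        # Assumption: images with no detected head are dropped as well.
        if len(post.head_boxes) != 1:
            continue
        kept.append((post.filename, characters[0], post.head_boxes[0]))
    return kept
```

Requiring exactly one label and exactly one head means each kept crop can be paired with its label without any further matching step.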

Data Distribution and Visualization

  • Label‑Image Count Distribution: The number of images per label is plotted for the top 100 labels only.
  • Top 20 Popular Labels: Include hatsune_miku, hakurei_reimu, etc.
  • Distribution Characteristics: Long‑tail distribution, average 13.85 images per label.
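
As a rough illustration of how such statistics can be computed, the sketch below counts images per label and takes the mean. The function name and the `top_k` parameter are hypothetical, and the sample labels are invented. The reported totals are consistent with the stated mean: 0.97M images over 70k labels gives roughly 13.9 images per label, with the small gap from 13.85 explained by rounding of the totals.

```python
from collections import Counter
from typing import List, Tuple


def label_distribution(labels: List[str], top_k: int = 100) -> Tuple[List[Tuple[str, int]], float]:
    """Return the top_k most frequent labels and the mean number of images per label."""
    counts = Counter(labels)
    # Mean images per label = total images / number of distinct labels.
    mean_per_label = len(labels) / len(counts) if counts else 0.0
    return counts.most_common(top_k), mean_per_label


# Invented toy data with a long tail: two frequent labels, two rare ones.
sample = ["hatsune_miku"] * 5 + ["hakurei_reimu"] * 3 + ["rare_a", "rare_b"]
top, mean = label_distribution(sample, top_k=2)
```

In a long‑tailed distribution the mean sits well above the typical label's count, so the median or a per‑label histogram is usually a useful companion figure.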

Dataset Usage

  • Core Data File: faces.tsv, containing filename, label ID, and head detection results.
  • Label Text File: tagIds.tsv, providing text for each label ID.
  • Face Image Download: Pre‑processed face image archive can be downloaded via rsync.
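
A minimal sketch of reading the two files, assuming both are headerless tab‑separated text. The column order for faces.tsv (filename, label ID, then numeric detection values) and for tagIds.tsv (label ID, then label text) is an assumption based on the description above, so check it against the README before relying on it. The sample rows in the usage lines are invented.

```python
import csv
import io
from typing import Dict, List, Tuple


def load_tag_names(tsv_text: str) -> Dict[int, str]:
    """Map label ID -> label text. Assumed columns: label ID, label text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {int(row[0]): row[1] for row in reader if row}


def load_faces(tsv_text: str) -> List[Tuple[str, int, List[float]]]:
    """Parse rows as (filename, label ID, detection values).

    Assumed columns: filename, label ID, then any number of numeric
    head detection fields (their meaning is not specified in the text).
    """
    rows = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue
        filename, label_id, *detection = row
        rows.append((filename, int(label_id), [float(v) for v in detection]))
    return rows


# Invented sample content, only to show the call pattern.
tags = load_tag_names("7\thatsune_miku\n9\thakurei_reimu\n")
faces = load_faces("0001.jpg\t7\t0.1\t0.2\t0.5\t0.6\n")
label_texts = [tags[label_id] for _, label_id, _ in faces]
```

For the real files, pass the file contents (for example from `open(path, encoding="utf-8").read()`) in place of the inline strings.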

Citation Information

  • Dataset Author: Yan Wang
  • Release Date: July 2019
  • Citation Format: Please refer to the README for the BibTeX format.

Baseline Model

  • Model Description: ResNet18 combined with ArcFace loss, achieving 37.3% accuracy.
  • Data Split: Training, validation, and test split files are provided.
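
For readers unfamiliar with ArcFace, the sketch below shows the core idea in plain Python: an angular margin is added to the angle between an embedding and its target class center before scaling and applying softmax cross‑entropy. It is not the baseline's code. The `scale` and `margin` defaults are common choices from the ArcFace paper and are assumptions here, and the sketch omits the handling that full implementations add for angles where theta plus the margin exceeds pi.

```python
import math
from typing import List


def arcface_logits(cosines: List[float], target: int,
                   scale: float = 64.0, margin: float = 0.5) -> List[float]:
    """Apply the additive angular margin to the target class only.

    cosines[j] is the cosine similarity between the normalized embedding
    and the normalized weight vector of class j.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if j == target:
            theta = math.acos(c)
            c = math.cos(theta + margin)  # penalize the target angle
        logits.append(scale * c)
    return logits


def cross_entropy(logits: List[float], target: int) -> float:
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.1]
loss_with_margin = cross_entropy(arcface_logits(cosines, target=0), 0)
loss_without_margin = cross_entropy(arcface_logits(cosines, target=0, margin=0.0), 0)
```

The margin makes the loss larger for the same embedding, which pushes the network to separate classes by a wider angle than plain softmax would.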

Open Issues

  • Test Set Validation: Test set requires manual verification.
  • Face Alignment: Further optimization of face alignment is needed.