High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

e621_newest

This dataset is the latest supplemental data from e621.net and contains the site's newest images and videos in gif, jpg, png, swf, and webm formats. Because of its content, it is not suitable for all audiences. It consists of 346,765 records with IDs ranging from 117,744 to 5,083,602, and was last updated on 2024-10-01. The dataset is primarily in English and Japanese and is applicable to image classification, zero-shot image classification, and text-to-image generation tasks. The content covers art and anime and comes with detailed tags describing the images and videos.

huggingface

ACCD

Anime

AI Training

This project aims to promote Chinese AI character creation. It continuously collects typical anime character dialogue data (ACCD) in a public repository, where it can be used for AI character training or for studying literary writing.

github

subsplease_animes

Anime

Data Analysis

This is an integrated anime database that combines data from subsplease, MyAnimeList, and Nyaa.si. Users can find the most popular anime and the titles with reliable torrent magnet links. The database is updated daily and includes 770 anime titles with a total of 11,137 episodes. Each entry carries detailed information: ID, title, type, episode count, status, rating, Nyaa search link, magnet links, seed count, download count, and last update time.
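As a rough illustration of the fields listed above, one entry could be modeled as follows. This is only a sketch: the field names, types, and the `most_seeded` helper are hypothetical and are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnimeRecord:
    """One entry, mirroring the fields named in the description.

    All names and types here are assumptions for illustration only.
    """
    anime_id: int
    title: str
    kind: str                 # the "type" field, e.g. TV or movie
    episode_count: int
    status: str
    rating: float
    nyaa_search_url: str
    magnet_links: List[str] = field(default_factory=list)
    seed_count: int = 0
    download_count: int = 0
    last_updated: str = ""    # timestamp kept as text in this sketch


def most_seeded(records: List[AnimeRecord], limit: int = 10) -> List[AnimeRecord]:
    """Return entries that have at least one magnet link, highest seed count first."""
    with_links = [r for r in records if r.magnet_links]
    return sorted(with_links, key=lambda r: r.seed_count, reverse=True)[:limit]
```

Filtering on `magnet_links` before sorting is one possible reading of "reliable" links; a real analysis might also weigh the download count or the last update time.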

huggingface

japanese-anime-speech-v2

Automatic Speech Recognition

Anime

japanese-anime-speech-v2 is an audio-text dataset designed for training automatic speech recognition (ASR) models. It contains 300,506 audio clips and their transcriptions, sourced from visual novels. The goal is to improve how well ASR models (e.g., OpenAI's Whisper) transcribe dialogue from anime and similar Japanese media. Audio is in MP3 format, sampled at 16 kHz, with an average clip length of 5.5 seconds. This is the first release of the japanese-anime-speech-v2 series; compared with the previous version, audio quality has been adjusted and NSFW content is not filtered out. The voices are predominantly female, and the vocabulary centers on love, relationships, and fantasy, so the data may not fully reflect real-world speech patterns. Future plans include separating safe and NSFW content, improving text formatting, and expanding data sources.
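The clip count and average length quoted above give a rough sense of the corpus size. The sketch below works through that arithmetic; because 5.5 seconds is an average, the result is an estimate and not a published figure, and the function names are illustrative.

```python
# Figures quoted in the dataset description.
CLIP_COUNT = 300_506
MEAN_CLIP_SECONDS = 5.5      # average length, so totals are estimates
SAMPLE_RATE_HZ = 16_000


def estimated_total_hours(clips: int, mean_seconds: float) -> float:
    """Approximate total audio duration in hours."""
    return clips * mean_seconds / 3600


def samples_per_clip(mean_seconds: float, sample_rate_hz: int) -> int:
    """Decoded samples in a clip of average length."""
    return round(mean_seconds * sample_rate_hz)


if __name__ == "__main__":
    hours = estimated_total_hours(CLIP_COUNT, MEAN_CLIP_SECONDS)
    print(f"Estimated total audio: {hours:.1f} hours")
    print(f"Samples per average clip: {samples_per_clip(MEAN_CLIP_SECONDS, SAMPLE_RATE_HZ)}")
```

Under these numbers the corpus comes to roughly 459 hours of audio, with about 88,000 decoded samples in a clip of average length.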

huggingface
