japanese-anime-speech-v2

japanese-anime-speech-v2 is an audio-text dataset for training automatic speech recognition (ASR) models. It contains 292,637 audio clips and their transcriptions, sourced from visual novels. The goal is to improve ASR models (e.g., OpenAI's Whisper) at transcribing dialogue from anime and similar Japanese media. Audio is in MP3 format (128 kbps), sampled at 16 kHz; safe-content clips average 5.3 seconds. Compared with V1, the dataset is about four times larger, the audio bitrate has been lowered for storage efficiency, and safe and non-safe (NSFW) content are provided as separate splits. The speech is predominantly female voices, and the vocabulary centers on love, relationships, and fantasy, so it may not fully reflect real-world speech patterns. Future plans are to keep expanding the data sources.

Updated 6/30/2024

Description

Japanese Anime Speech Dataset V2

Overview

japanese-anime-speech-v2 is an audio‑text dataset for training automatic speech recognition models. The dataset contains 292,637 audio‑text pairs sourced from various visual novels.

Dataset Information

Number of audio‑text pairs: 292,637
Safe content audio duration: 397.54 hours (86.8%)
Non‑safe content audio duration: 52.36 hours (13.2%)
Average safe content audio length: 5.3 seconds
Data source: Visual novels
Audio format: mp3 (128 kbps)
Latest version: V2 – 29 June 2024

Dataset Characteristics

Audio features:
- Sample rate: 16,000 Hz
Text features:
- Data type: string
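
As a rough illustration of these features, the sketch below checks one decoded example against the stated sample rate and derives its duration. The record layout (an `audio` dict with `array` and `sampling_rate`, plus a `transcription` string) follows the usual Hugging Face `datasets` audio convention; the column names are assumptions and are not confirmed by this card.

```python
EXPECTED_SAMPLE_RATE = 16_000  # Hz, as stated under "Audio features"


def clip_duration_seconds(example: dict) -> float:
    """Return the clip length in seconds, checking the sample rate first.

    Assumes `example["audio"]` holds `array` (decoded samples) and
    `sampling_rate`; these field names are an assumption.
    """
    audio = example["audio"]
    if audio["sampling_rate"] != EXPECTED_SAMPLE_RATE:
        raise ValueError(f"unexpected sample rate: {audio['sampling_rate']}")
    if not isinstance(example["transcription"], str):
        raise TypeError("transcription should be a string")
    return len(audio["array"]) / EXPECTED_SAMPLE_RATE


# Made-up example: 2 seconds of silence with a placeholder transcription.
fake_example = {
    "audio": {"array": [0.0] * 32_000, "sampling_rate": 16_000},
    "transcription": "こんにちは",
}
print(clip_duration_seconds(fake_example))  # 2.0
```

A check like this is cheap to run on the first few records before starting a training job, since Whisper-style models also expect 16 kHz input.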

Dataset Splits

Safe content (sfw):
- Bytes: 19,174,765,803.112
- Samples: 271,788
Non‑safe content (nsfw):
- Bytes: 2,864,808,426.209
- Samples: 20,849
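
The split sizes can be combined with the durations listed under Dataset Information as a quick consistency check. The sketch below is plain arithmetic on the figures quoted in this card; the non-safe average it derives is not a figure the card itself states.

```python
SECONDS_PER_HOUR = 3600

# Figures quoted in this card.
splits = {
    "sfw": {"hours": 397.54, "samples": 271_788},
    "nsfw": {"hours": 52.36, "samples": 20_849},
}


def mean_clip_seconds(hours: float, samples: int) -> float:
    """Average clip length implied by a total duration and a clip count."""
    return hours * SECONDS_PER_HOUR / samples


total_samples = sum(s["samples"] for s in splits.values())
total_hours = sum(s["hours"] for s in splits.values())

print(total_samples)          # 292637
print(round(total_hours, 2))  # 449.9
for name, s in splits.items():
    print(name, round(mean_clip_seconds(s["hours"], s["samples"]), 1))
```

The safe split works out to about 5.3 seconds per clip, which matches the stated average; the non-safe clips come out longer, at roughly 9.0 seconds each.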

Dataset Size

Download size: 24,379,492,733 bytes
Dataset size: 22,039,574,229.321 bytes
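
For planning storage, the byte counts can be converted to more readable units and the split sizes checked against the total. The sketch below uses `Decimal` to keep the fractional byte values exact; reporting decimal gigabytes (10^9 bytes) is a choice made here, not a unit the card specifies.

```python
from decimal import Decimal

# Byte counts as quoted in this card.
SFW_BYTES = Decimal("19174765803.112")
NSFW_BYTES = Decimal("2864808426.209")
DATASET_BYTES = Decimal("22039574229.321")
DOWNLOAD_BYTES = Decimal("24379492733")


def to_gb(n_bytes: Decimal) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return float(n_bytes / Decimal(10**9))


# The two splits should add up to the reported dataset size.
assert SFW_BYTES + NSFW_BYTES == DATASET_BYTES

print(f"download: {to_gb(DOWNLOAD_BYTES):.2f} GB")  # download: 24.38 GB
print(f"dataset:  {to_gb(DATASET_BYTES):.2f} GB")   # dataset:  22.04 GB
```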

Configuration

Default configuration:
- Safe content file path: data/sfw-*
- Non‑safe content file path: data/nsfw-*
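
A minimal loading sketch, assuming the dataset is hosted on the Hugging Face Hub and read with the `datasets` library. The repository ID is not given in this card, so it is left as a parameter, and the shard filename in the example is hypothetical. The split names `sfw` and `nsfw` are taken from the file patterns above.

```python
from fnmatch import fnmatch
from typing import Optional

# File patterns from the default configuration.
SPLIT_PATTERNS = {"sfw": "data/sfw-*", "nsfw": "data/nsfw-*"}


def split_for_file(path: str) -> Optional[str]:
    """Return the split whose pattern matches a shard path, if any."""
    for split, pattern in SPLIT_PATTERNS.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_split(repo_id: str, split: str = "sfw"):
    """Stream one split; requires `pip install datasets` and network access."""
    if split not in SPLIT_PATTERNS:
        raise ValueError(f"unknown split: {split!r}")
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset(repo_id, split=split, streaming=True)


# Hypothetical shard name, used only to illustrate the pattern matching.
print(split_for_file("data/sfw-00000-of-00040.parquet"))  # sfw
```

Streaming avoids downloading the full archive (about 24 GB) before iterating, and defaulting to `split="sfw"` keeps non-safe clips out unless they are requested explicitly.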

Version Changes

Changes from V1 to V2:
- Substantial increase in size, from 73,004 to 292,637 audio‑text pairs
- Audio format changed from mp3 (192 kbps) to mp3 (128 kbps) for storage efficiency
- Separate splits for safe and non‑safe content
- Normalized repeated characters
- Removed audio lines without dialogue
- Removed low‑quality audio lines
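
The card does not say how repeated characters were normalized. As one possible approach, the sketch below collapses any run of the same character that exceeds a chosen limit. The function name, the regular expression, and the limit of two are all assumptions for illustration, not the dataset author's method.

```python
import re


def collapse_repeats(text: str, max_run: int = 2) -> str:
    """Shorten runs of one repeated character to at most `max_run` copies."""
    if max_run < 1:
        raise ValueError("max_run must be at least 1")
    # (.) captures one character; \1{max_run,} matches max_run or more repeats,
    # so the whole match is a run of at least max_run + 1 identical characters.
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


print(collapse_repeats("あああああっ！！！！"))  # ああっ！！
```

Keeping two copies rather than one preserves some of the emphasis that elongated sounds carry in dialogue, while still bounding the length of the transcription.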

Biases and Limitations

The dataset is primarily sourced from visual novels, leading to a gender bias toward female voices and vocabulary focused on love, relationships, and fantasy, which may not reflect real‑world speech patterns.
The audio is high quality and may not match everyday recording conditions.
The dataset contains non-safe content, so it is not suitable for every application.
Transcriptions are unformatted and uncleaned, which may affect text quality.

Future Plans

Continue expanding the dataset with more sources.

Use and Citation

The dataset is open for commercial and non‑commercial use.
Citation is not mandatory, but providing a hyperlink to the dataset is encouraged when used in derived works.

Topics

Automatic Speech Recognition

Anime

Source

Organization: huggingface

Created: 6/26/2024
