japanese-anime-speech-v2

japanese-anime-speech-v2 is an audio-text dataset designed to train automatic speech recognition (ASR) models. It contains 292,637 audio clips and their transcriptions, sourced from visual novels. The goal is to improve ASR models (e.g., OpenAI's Whisper) at transcribing dialogue from anime and similar Japanese media. Audio is in MP3 format (128 kbps), sampled at 16 kHz, with an average clip length of about 5.5 seconds. Compared with V1, the dataset is roughly four times larger, the MP3 bitrate has been lowered from 192 kbps to 128 kbps, and safe and NSFW content are provided as separate splits rather than mixed together. Voices are predominantly female and the vocabulary centers on love, relationships, and fantasy, so the data may not fully reflect real-world speech patterns. Future plans include improving text formatting and expanding the data sources.

Updated 6/30/2024
huggingface

Description

Japanese Anime Speech Dataset V2

Overview

japanese-anime-speech-v2 is an audio‑text dataset for training automatic speech recognition models. The dataset contains 292,637 audio‑text pairs sourced from various visual novels.

Dataset Information

  • Number of audio‑text pairs: 292,637
  • Safe content audio duration: 397.54 hours (86.8%)
  • Non‑safe content audio duration: 52.36 hours (13.2%)
  • Average safe content audio length: 5.3 seconds
  • Data source: Visual novels
  • Audio format: mp3 (128 kbps)
  • Latest version: V2 – 29 June 2024
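The average clip length follows from the durations above divided by the sample counts listed under Dataset Splits. A minimal sketch of that arithmetic, using only figures from this listing (the variable and function names are illustrative):

```python
# Durations (hours) and sample counts copied from this listing.
HOURS = {"sfw": 397.54, "nsfw": 52.36}
SAMPLES = {"sfw": 271_788, "nsfw": 20_849}


def average_seconds(split):
    """Mean clip length in seconds for one split."""
    return HOURS[split] * 3600 / SAMPLES[split]


def overall_average_seconds():
    """Mean clip length in seconds across both splits."""
    return sum(HOURS.values()) * 3600 / sum(SAMPLES.values())


if __name__ == "__main__":
    for split in HOURS:
        print(f"{split}: {average_seconds(split):.2f} s")
    print(f"overall: {overall_average_seconds():.2f} s")
```

Rounded to one decimal place, the safe-content average reproduces the 5.3 seconds stated above.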

Dataset Characteristics

  • Audio features:
    • Sample rate: 16,000 Hz
  • Text features:
    • Data type: string

Dataset Splits

  • Safe content (sfw):
    • Bytes: 19,174,765,803.112
    • Samples: 271,788
  • Non‑safe content (nsfw):
    • Bytes: 2,864,808,426.209
    • Samples: 20,849

Dataset Size

  • Download size: 24,379,492,733 bytes
  • Dataset size: 22,039,574,229.321 bytes
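The split figures can be cross-checked against the reported totals. The sketch below sums the per-split sample counts and byte sizes and compares them with the pair count and dataset size given in this listing; `Decimal` is used so the fractional byte values add exactly. All names are illustrative.

```python
from decimal import Decimal

# Per-split figures copied from the Dataset Splits section.
SPLITS = {
    "sfw": {"samples": 271_788, "bytes": Decimal("19174765803.112")},
    "nsfw": {"samples": 20_849, "bytes": Decimal("2864808426.209")},
}

# Totals reported elsewhere in the listing.
REPORTED_PAIRS = 292_637
REPORTED_DATASET_BYTES = Decimal("22039574229.321")

total_samples = sum(s["samples"] for s in SPLITS.values())
total_bytes = sum(s["bytes"] for s in SPLITS.values())


def sample_share(split):
    """Fraction of all audio-text pairs that belong to the given split."""
    return SPLITS[split]["samples"] / total_samples


if __name__ == "__main__":
    print("pairs match:", total_samples == REPORTED_PAIRS)
    print("bytes match:", total_bytes == REPORTED_DATASET_BYTES)
    for name in SPLITS:
        print(f"{name}: {sample_share(name):.1%} of pairs")
```

Both sums agree with the reported totals, so the two splits account for the whole dataset.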

Configuration

  • Default configuration:
    • Safe content file path: data/sfw-*
    • Non‑safe content file path: data/nsfw-*
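A minimal sketch of how this configuration might be used, assuming the dataset is loaded with the Hugging Face `datasets` library. The repository id `joujiboi/japanese-anime-speech-v2` is an assumption (this listing does not give the namespace), and the shard file names in the usage note are hypothetical.

```python
from fnmatch import fnmatch

# Glob patterns copied from the default configuration above.
DATA_FILES = {"sfw": "data/sfw-*", "nsfw": "data/nsfw-*"}


def split_for_path(path):
    """Return the split whose glob matches a file path, or None."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_split(split="sfw", repo_id="joujiboi/japanese-anime-speech-v2"):
    """Stream one split without downloading the full ~24 GB archive.

    The default repo_id is an assumption, not taken from this listing.
    Requires the third-party `datasets` package.
    """
    if split not in DATA_FILES:
        raise ValueError(f"unknown split: {split!r}")
    from datasets import load_dataset  # imported lazily; third-party

    return load_dataset(repo_id, split=split, streaming=True)
```

For example, `split_for_path("data/sfw-00000-of-00040.parquet")` returns `"sfw"`, and `load_split("sfw")` keeps non-safe content out of a training run by reading only the safe split.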

Version Changes

  • Changes from V1 to V2:
    • Substantial increase in size, from 73,004 to 292,637 audio‑text pairs
    • Audio format changed from mp3 (192 kbps) to mp3 (128 kbps) for storage efficiency
    • Separate splits for safe and non‑safe content
    • Normalized repeated characters
    • Removed audio lines without dialogue
    • Removed low‑quality audio lines

Biases and Limitations

  • The dataset is primarily sourced from visual novels, leading to a gender bias toward female voices and vocabulary focused on love, relationships, and fantasy, which may not reflect real‑world speech patterns.
  • The audio is clean and high quality, so it may not match everyday recording conditions.
  • The dataset contains non-safe content, so it is not suitable for every application.
  • Transcriptions are unformatted and uncleaned, which may affect text quality.

Future Plans

  • Continue expanding the dataset with more sources.

Use and Citation

  • The dataset is open for commercial and non‑commercial use.
  • Citation is not mandatory, but providing a hyperlink to the dataset is encouraged when used in derived works.

Topics

Automatic Speech Recognition
Anime

Source

Organization: huggingface

Created: 6/26/2024
