JESC
The JESC dataset is a Japanese‑English subtitle corpus created by Stanford University, Google Brain, and Rakuten Institute of Technology. Sourced from movie and TV subtitles on the web, it is one of the largest free EN‑JA corpora and focuses on conversational language. It contains 2.8 million sentence pairs covering everyday language, slang, instructions, and narratives. The corpus is licensed under CC‑BY‑4.0, is described as including pre‑processed, tokenized train/dev/test splits, and is primarily intended for translation tasks.
Description
Dataset Card JESC
Dataset Overview
JESC is a Japanese‑English bilingual corpus extracted from subtitles. It was created jointly by Stanford University, Google Brain, and Rakuten Institute of Technology by crawling and aligning movie and TV subtitles from the web. JESC is one of the largest free EN‑JA corpora, covering the spoken domain.
Dataset Features
- Languages: English (en), Japanese (ja)
- License: CC‑BY‑4.0
- Task Category: Translation
- Dataset Information:
  - Features:
    - translation:
      - en: string
      - ja: string
  - Splits:
    - train:
      - bytes: 249,255,464
      - samples: 2,801,388
  - Download size: 175,157,050 bytes
  - Dataset size: 249,255,464 bytes
  - Configuration:
    - default:
      - data files:
        - train: data/train-*
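The size figures above can be sanity-checked with a little arithmetic. The following is a minimal sketch, assuming the listed sizes are byte counts; the function and constant names are illustrative and not part of the dataset.

```python
# Figures copied from the dataset metadata (assumed to be byte counts).
TRAIN_BYTES = 249_255_464
TRAIN_SAMPLES = 2_801_388
DOWNLOAD_BYTES = 175_157_050


def average_bytes_per_pair(total_bytes: int, samples: int) -> float:
    """Mean stored size of one sentence pair, in bytes."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return total_bytes / samples


def download_ratio(download_bytes: int, dataset_bytes: int) -> float:
    """Download size expressed as a fraction of the dataset size."""
    if dataset_bytes <= 0:
        raise ValueError("dataset_bytes must be positive")
    return download_bytes / dataset_bytes


avg_pair_bytes = average_bytes_per_pair(TRAIN_BYTES, TRAIN_SAMPLES)
ratio = download_ratio(DOWNLOAD_BYTES, TRAIN_BYTES)

print(f"average bytes per pair: {avg_pair_bytes:.1f}")
print(f"download / dataset size: {ratio:.2f}")
```

Each pair averages roughly 89 bytes, which is consistent with short subtitle lines, and the download is about 70% of the dataset size.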
Data Sample
```json
{ "en": "you are back, arent you, harold?", "ja": "あなたは戻ったのね、ハロルド?" }
```
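A record like the sample above might be parsed as follows. This is a sketch only: it assumes the record is serialized as valid JSON with quoted keys and values, and it accepts both the flat layout shown in the sample and a layout nested under a `translation` key, as the feature list suggests. `SentencePair` and `parse_pair` are hypothetical names introduced for illustration.

```python
import json
from typing import NamedTuple


class SentencePair(NamedTuple):
    """One aligned English/Japanese subtitle pair."""
    en: str
    ja: str


def parse_pair(raw: str) -> SentencePair:
    """Parse one JSON record into a SentencePair."""
    record = json.loads(raw)
    # Assumption: the two strings sit either at the top level (as in the
    # sample) or under a "translation" key (as in the feature list).
    fields = record.get("translation", record)
    en, ja = fields["en"], fields["ja"]
    if not isinstance(en, str) or not isinstance(ja, str):
        raise TypeError("both 'en' and 'ja' must be strings")
    return SentencePair(en=en, ja=ja)


sample = (
    '{"en": "you are back, arent you, harold?", '
    '"ja": "あなたは戻ったのね、ハロルド?"}'
)
pair = parse_pair(sample)
print(pair.en)
print(pair.ja)
```

Returning a named tuple keeps the two sides of a pair together while still allowing `en, ja = pair` unpacking in a training loop.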
Dataset Content
- Large corpus of 2.8 million sentence pairs.
- Covers colloquial speech, slang, instructional text, and narrative text, all domains that are scarce in existing Japanese‑English MT resources.
- Includes pre‑processed tokenized train/dev/test splits.
- Provides code for crawling additional data and handling MT datasets.
Data Splits
Only the train split is provided here (2,801,388 sentence pairs); the configuration lists no separate dev or test files.
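Because only a train split is available, held-out data has to be carved out by the user. One possible approach, sketched below, assigns each pair to a split from a hash of its text, so the assignment is reproducible without storing indices. This is not a procedure defined by the dataset: the function name `assign_split` and the default fractions are assumptions chosen for illustration.

```python
import hashlib


def assign_split(en: str, ja: str,
                 dev_fraction: float = 0.001,
                 test_fraction: float = 0.001) -> str:
    """Deterministically map a sentence pair to 'train', 'dev', or 'test'."""
    if dev_fraction < 0 or test_fraction < 0 or dev_fraction + test_fraction > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    # Hash the pair so the same text always lands in the same split.
    digest = hashlib.sha256(f"{en}\t{ja}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + dev_fraction:
        return "dev"
    return "train"


# Example with synthetic pairs and larger fractions so every split is populated.
counts = {"train": 0, "dev": 0, "test": 0}
for i in range(10_000):
    counts[assign_split(f"line {i}", f"行 {i}", 0.1, 0.1)] += 1
print(counts)
```

Hashing the text rather than the row position means exact duplicate pairs, which are common in subtitle data, always fall into the same split and cannot leak from train into dev or test.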
License Information
The data are released under the Creative Commons Attribution 4.0 (CC‑BY‑4.0) license, as listed in the dataset metadata.
Citation Information
```bibtex
@ARTICLE{pryzant_jesc_2018,
  author   = {{Pryzant}, R. and {Chung}, Y. and {Jurafsky}, D. and {Britz}, D.},
  title    = "{JESC: Japanese-English Subtitle Corpus}",
  journal  = {Language Resources and Evaluation Conference (LREC)},
  keywords = {Computer Science - Computation and Language},
  year     = 2018
}
```
Source
Organization: huggingface
Created: 8/24/2024