JESC

The JESC dataset is a Japanese‑English subtitle corpus created by Stanford University, Google Brain, and Rakuten Institute of Technology. Sourced from movie and TV subtitles on the web, it is one of the largest free EN‑JA corpora, focusing on conversational language. It contains 2.8 million sentence pairs covering everyday language, slang, instructions, and narratives. Licensed under CC‑BY‑4.0, it includes pre‑processed data with tokenized train/dev/test splits, primarily intended for translation tasks.

Updated 8/27/2024

huggingface

Description

Dataset Card JESC

Dataset Overview

JESC is a Japanese‑English bilingual corpus extracted from subtitles. It was created jointly by Stanford University, Google Brain, and Rakuten Institute of Technology by crawling and aligning movie and TV subtitles from the web. JESC is one of the largest free EN‑JA corpora, covering the spoken domain.

Dataset Features

Languages: English (en), Japanese (ja)
License: CC‑BY‑4.0
Task Category: Translation
Dataset Information:
- Features:
  - translation:
    - en: string type
    - ja: string type
- Splits:
  - train:
    - bytes: 249,255,464
    - samples: 2,801,388
- Download size: 175,157,050 bytes
- Dataset size: 249,255,464 bytes
- Configuration:
  - default:
    - data files:
      - train: data/train-*
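
The size figures above can be turned into two quick sanity checks: the average storage per sentence pair, and how much smaller the download is than the loaded data. The sketch below is only arithmetic on the numbers listed in this card. The function names are illustrative, and the figures are assumed to be byte counts.

```python
# Figures copied from the dataset information above (assumed to be bytes).
DATASET_SIZE_BYTES = 249_255_464   # size of the train split
DOWNLOAD_SIZE_BYTES = 175_157_050  # size of the files to download
NUM_SAMPLES = 2_801_388            # sentence pairs in the train split


def bytes_per_sample(total_bytes: int, samples: int) -> float:
    """Average number of bytes stored per sentence pair."""
    return total_bytes / samples


def download_ratio(download_bytes: int, dataset_bytes: int) -> float:
    """Download size as a fraction of the dataset size."""
    return download_bytes / dataset_bytes


avg = bytes_per_sample(DATASET_SIZE_BYTES, NUM_SAMPLES)
ratio = download_ratio(DOWNLOAD_SIZE_BYTES, DATASET_SIZE_BYTES)
print(f"about {avg:.1f} bytes per pair; download is {ratio:.1%} of the dataset size")
```

An average just under 90 bytes per pair (English and Japanese text together) is consistent with short, subtitle-length sentences.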

Data Sample

{ "en": "you are back, arent you, harold?", "ja": "あなたは戻ったのね、ハロルド?" }
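
The sample shows `en` and `ja` at the top level, while the feature listing places them under a `translation` field. The sketch below is one way to read a record in either layout. `extract_pair` is a hypothetical helper written for this illustration and is not part of any JESC tooling.

```python
import json
from typing import Tuple


def extract_pair(record: dict) -> Tuple[str, str]:
    """Return (en, ja) from a record.

    Accepts the nested layout {"translation": {"en": ..., "ja": ...}}
    as well as the flat layout {"en": ..., "ja": ...}.
    """
    pair = record.get("translation", record)
    en, ja = pair["en"], pair["ja"]
    if not isinstance(en, str) or not isinstance(ja, str):
        raise TypeError("both 'en' and 'ja' must be strings")
    return en, ja


# The data sample from this card, written as one JSON line.
sample_line = '{"en": "you are back, arent you, harold?", "ja": "あなたは戻ったのね、ハロルド?"}'
en, ja = extract_pair(json.loads(sample_line))
print(en)
print(ja)
```

A missing key raises `KeyError`, which makes malformed records easy to spot during preprocessing.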

Dataset Content

Large corpus of 2.8 million sentence pairs.
Covers colloquial speech, slang, instructional text, and narrative translation—domains scarce in existing Japanese‑English MT resources.
Includes pre‑processed tokenized train/dev/test splits.
Provides code for crawling additional data and handling MT datasets.

Data Splits

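Only the train split is provided in this Hugging Face version (2,801,388 sentence pairs). The tokenized dev and test splits mentioned under Dataset Content are not included.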
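
With only a train split available, held-out data for evaluation has to be carved out by the user. The following is a minimal sketch of a seeded random split, assuming the pairs fit in memory as a list. `split_pairs`, the toy data, and the split sizes are all hypothetical, and the resulting dev/test sets will not match those of the original JESC release.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_pairs(
    pairs: Sequence[T], dev_size: int, test_size: int, seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle indices with a fixed seed and return (train, dev, test)."""
    if dev_size < 0 or test_size < 0:
        raise ValueError("split sizes must be non-negative")
    if dev_size + test_size > len(pairs):
        raise ValueError("held-out sets are larger than the data")
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)  # local RNG: same seed, same split
    dev_idx = indices[:dev_size]
    test_idx = indices[dev_size:dev_size + test_size]
    train_idx = indices[dev_size + test_size:]
    return (
        [pairs[i] for i in train_idx],
        [pairs[i] for i in dev_idx],
        [pairs[i] for i in test_idx],
    )


# Toy stand-in for the real sentence pairs.
toy_pairs = [(f"en {i}", f"ja {i}") for i in range(10)]
train, dev, test = split_pairs(toy_pairs, dev_size=2, test_size=2, seed=13)
print(len(train), len(dev), len(test))
```

When the data is loaded with the Hugging Face `datasets` library, the built-in `Dataset.train_test_split` method offers the same kind of seeded split without materializing a Python list.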

License Information

The data are released under a Creative Commons license (CC‑BY‑4.0, as listed under Dataset Features).

Citation Information

@ARTICLE{pryzant_jesc_2018,
  author = {{Pryzant}, R. and {Chung}, Y. and {Jurafsky}, D. and {Britz}, D.},
  title = "{JESC: Japanese‑English Subtitle Corpus}",
  journal = {Language Resources and Evaluation Conference (LREC)},
  keywords = {Computer Science - Computation and Language},
  year = 2018
}

Topics

Machine Translation

Japanese‑English Corpus

Source

Organization: huggingface

Created: 8/24/2024
