
JESC

JESC is a Japanese‑English subtitle corpus created by Stanford University, Google Brain, and Rakuten Institute of Technology. Built from movie and TV subtitles found on the web, it is one of the largest freely available EN‑JA corpora and focuses on conversational language. Its 2.8 million sentence pairs cover everyday speech, slang, instructions, and narratives. It is licensed under CC‑BY‑4.0 and is intended primarily for translation tasks. The corpus is described as shipping pre‑processed, tokenized train/dev/test splits, although this Hugging Face copy exposes a single train split.

Source: Hugging Face
Created: Aug 24, 2024
Updated: Aug 27, 2024
Overview

Dataset description and usage context

Dataset Card JESC

Dataset Overview

JESC is a Japanese‑English bilingual corpus extracted from subtitles. It was created jointly by Stanford University, Google Brain, and Rakuten Institute of Technology by crawling and aligning movie and TV subtitles from the web. JESC is one of the largest free EN‑JA corpora, covering the spoken domain.

Dataset Features

  • Languages: English (en), Japanese (ja)
  • License: CC‑BY‑4.0
  • Task Category: Translation
  • Dataset Information:
    • Features:
      • translation:
        • en: string type
        • ja: string type
    • Splits:
      • train:
        • bytes: 249,255,464
        • samples: 2,801,388
    • Download size: 175,157,050 bytes
    • Dataset size: 249,255,464 bytes
    • Configuration:
      • default:
        • data files:
          • train: data/train-*
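As a quick sanity check on the figures listed above, the following sketch derives the average size of a sentence pair and the ratio of download size to dataset size. The function name `summarize_sizes` is hypothetical, and the numbers are copied from the listing rather than measured.

```python
# Figures copied from the card's split and size listing (all in bytes,
# except the sample count).
TRAIN_BYTES = 249_255_464
TRAIN_SAMPLES = 2_801_388
DOWNLOAD_BYTES = 175_157_050


def summarize_sizes(dataset_bytes, samples, download_bytes):
    """Return average bytes per pair and the download-to-dataset size ratio."""
    return {
        "bytes_per_pair": dataset_bytes / samples,
        "download_ratio": download_bytes / dataset_bytes,
    }


stats = summarize_sizes(TRAIN_BYTES, TRAIN_SAMPLES, DOWNLOAD_BYTES)
print(f"{stats['bytes_per_pair']:.1f} bytes per sentence pair on average")
print(f"download is {stats['download_ratio']:.0%} of the dataset size")
```

The small per-pair average is consistent with short subtitle lines, which matters when choosing batch sizes or maximum sequence lengths.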

Data Sample

{ "en": "you are back, arent you, harold?", "ja": "あなたは戻ったのね、ハロルド?" }
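The feature listing nests `en` and `ja` under a `translation` field, while the sample shows them at the top level. A minimal sketch of a reader that tolerates both layouts follows; the helper name `to_pair` is hypothetical, and the nested JSON line is an assumed serialization of the sample, not a record copied from the data files.

```python
import json


def to_pair(record):
    """Return (en, ja) from a record in either the nested or the flat layout."""
    # Use the "translation" sub-dict when present, else the record itself.
    fields = record.get("translation", record)
    return fields["en"], fields["ja"]


# The card's sample, written as strict JSON in the nested layout (assumed).
line = (
    '{"translation": {"en": "you are back, arent you, harold?", '
    '"ja": "あなたは戻ったのね、ハロルド?"}}'
)
en, ja = to_pair(json.loads(line))
print(en, "|", ja)
```

Note that the English side is lowercased and lacks the apostrophe in "arent", so downstream tokenizers should not assume normalized punctuation.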

Dataset Content

  1. Large corpus of 2.8 million sentence pairs.
  2. Covers colloquial speech, slang, instructional text, and narrative translation—domains scarce in existing Japanese‑English MT resources.
  3. The original release includes pre‑processed, tokenized train/dev/test splits.
  4. Provides code for crawling additional data and handling MT datasets.

Data Splits

This copy provides only the train split (2,801,388 sentence pairs); no dev or test split is included.
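With a single train split, evaluation needs a held-out portion carved out by the user. One possible approach, sketched here with a hypothetical helper `split_indices` and arbitrary hold-out sizes, is a seeded shuffle of row indices. This is not the corpus authors' official split.

```python
import random


def split_indices(n, dev_size, test_size, seed=0):
    """Shuffle range(n) with a fixed seed and slice off dev and test indices."""
    if dev_size + test_size > n:
        raise ValueError("dev_size + test_size exceeds the number of samples")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    dev = indices[:dev_size]
    test = indices[dev_size:dev_size + test_size]
    train = indices[dev_size + test_size:]
    return train, dev, test


# Small n for illustration; the card lists 2,801,388 train samples.
train_idx, dev_idx, test_idx = split_indices(10_000, dev_size=200, test_size=200)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Because subtitles repeat short lines often, a random split can place identical pairs on both sides; deduplicating before splitting is worth considering.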

License Information

The data are released under a Creative Commons (CC) license.

Citation Information

@ARTICLE{pryzant_jesc_2018,
  author   = {{Pryzant}, R. and {Chung}, Y. and {Jurafsky}, D. and {Britz}, D.},
  title    = "{JESC: Japanese-English Subtitle Corpus}",
  journal  = {Language Resources and Evaluation Conference (LREC)},
  keywords = {Computer Science - Computation and Language},
  year     = 2018
}
