JParaCrawl

JParaCrawl is the largest publicly available English‑Japanese parallel corpus created by NTT. It is constructed via large‑scale web crawling and automatic alignment of parallel sentences. This release provides a single train split of 3,669,859 examples (1,084,069,907 bytes), each of which is an English‑Japanese sentence pair. The dataset is distributed under its own license.

Source
huggingface
Created
Aug 24, 2024
Updated
Aug 25, 2024

Dataset Card for JParaCrawl

Dataset Overview

JParaCrawl is the largest publicly available English‑Japanese parallel corpus created by NTT. It is constructed via large‑scale web crawling and automatic alignment of parallel sentences.

Dataset Information

Features

  • translation
    • en: type string
    • ja: type string
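
The feature layout above suggests that each record nests the two strings under a `translation` key. A minimal sketch of pulling out the sentence pair follows; the helper name `extract_pair` and the "Hello." sample are illustrative assumptions, not part of the dataset, and the flat fallback is included only because the Data Example on this card shows `en` and `ja` at the top level.

```python
from typing import Mapping, Tuple


def extract_pair(record: Mapping) -> Tuple[str, str]:
    """Return (english, japanese) from one record.

    Accepts the nested layout listed under Features
    ({"translation": {"en": ..., "ja": ...}}) and, as a fallback,
    a flat {"en": ..., "ja": ...} mapping.
    """
    # Use the inner dict when present, otherwise treat the record itself as the pair.
    inner = record.get("translation", record)
    return inner["en"], inner["ja"]


# Illustrative records (not taken from the dataset).
nested = {"translation": {"en": "Hello.", "ja": "こんにちは。"}}
flat = {"en": "Hello.", "ja": "こんにちは。"}

print(extract_pair(nested))
print(extract_pair(flat))
```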

Data Split

  • train
    • Bytes: 1,084,069,907 (about 1.08 GB)
    • Samples: 3,669,859

Download and Dataset Size

  • Download Size: 603,669,921 bytes (about 604 MB)
  • Dataset Size: 1,084,069,907 bytes (about 1.08 GB)
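
For a rough sense of scale, the figures listed on this card can be combined into an average record size and a download-to-dataset ratio. This is plain arithmetic on the reported numbers (the constant names are illustrative); it works out to roughly 295 bytes per sentence pair, with the download a little over half the size of the prepared dataset.

```python
# Figures copied from the dataset card.
TRAIN_BYTES = 1_084_069_907
TRAIN_SAMPLES = 3_669_859
DOWNLOAD_BYTES = 603_669_921

# Average stored size of one English-Japanese pair.
avg_bytes_per_pair = TRAIN_BYTES / TRAIN_SAMPLES

# How large the download is relative to the prepared dataset.
download_ratio = DOWNLOAD_BYTES / TRAIN_BYTES

print(f"average bytes per pair: {avg_bytes_per_pair:.1f}")
print(f"download / dataset size: {download_ratio:.2f}")
```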

Configuration

  • default
    • Data Files:
      • train: data/train-*

How to Use

from datasets import load_dataset
dataset = load_dataset("Hoshikuzu/JParaCrawl")

If downloading the full dataset takes too long, you can use streaming, which reads records on demand:

from datasets import load_dataset
dataset = load_dataset("Hoshikuzu/JParaCrawl", streaming=True)
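
With streaming, a common first step is to inspect a handful of records without downloading the whole corpus. The sketch below assumes the `split="train"` argument is passed so that `load_dataset` returns a single iterable rather than a dictionary of splits; `peek` and `load_streaming_train` are hypothetical helper names.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n items of any iterable, reading no more than n."""
    return list(islice(records, n))


def load_streaming_train():
    """Open the train split in streaming mode (requires network access)."""
    # Imported lazily so peek() can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("Hoshikuzu/JParaCrawl", split="train", streaming=True)


# Usage (needs network access and the `datasets` package):
#   for record in peek(load_streaming_train(), n=3):
#       print(record)
```

Because `peek` stops after `n` items, only the beginning of the first data file needs to be fetched.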

Data Example

{
  "en": "Of course, we’ll keep the important stuff, but we’ll try to sell as much as possible of the stuff we don’t need. afterwards I feel like we can save money by reducing things and making life related patterns too.",
  "ja": "もちろん大切なものは取っておきますが、なくても困らないものはなるべく売るようにします。 さいごに ものを減らして、生活関連もパターン化することでお金は貯まる気がしています。"
}

License Information

JParaCrawl is distributed under its own license. See details at https://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/.

Data Split

Only the train split is provided.
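
Since no validation or test split ships with the dataset, users who need held-out data have to carve it out themselves. One possible approach, sketched below as an assumption rather than anything the dataset authors prescribe, is to assign each pair to a split by hashing its text, which stays stable across runs and works in streaming mode. The function name `assign_split` and the 1% default are illustrative.

```python
import hashlib


def assign_split(en: str, ja: str, dev_percent: int = 1) -> str:
    """Deterministically map a sentence pair to "train" or "dev".

    The pair is hashed and placed in one of 100 buckets; buckets below
    dev_percent go to "dev", the rest to "train".
    """
    digest = hashlib.sha256(f"{en}\t{ja}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "dev" if bucket < dev_percent else "train"


# Illustrative pair (not taken from the dataset).
print(assign_split("Hello.", "こんにちは。"))
```

When the dataset is loaded fully into memory instead, the `datasets` library's `Dataset.train_test_split` method is another option.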

Citation

@inproceedings{morishita-etal-2020-jparacrawl,
    title = "{JP}ara{C}rawl: A Large Scale Web-Based {E}nglish-{J}apanese Parallel Corpus",
    author = "Morishita, Makoto  and
      Suzuki, Jun  and
      Nagata, Masaaki",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.443",
    pages = "3603--3609",
    ISBN = "979-10-95546-34-4",
}