JParaCrawl
JParaCrawl is the largest publicly available English-Japanese parallel corpus created by NTT. It is constructed via large-scale web crawling and automatic alignment of parallel sentences. This copy provides a single train split of 3,669,859 examples (1,084,069,907 bytes), and each example is an English-Japanese sentence pair. The dataset is distributed under its own license.
Dataset Card for JParaCrawl
Dataset Information
Features
- translation
  - en: type string
  - ja: type string
Data Split
- train
  - Bytes: 1084069907
  - Samples: 3669859
Download and Dataset Size
- Download Size: 603669921 bytes
- Dataset Size: 1084069907 bytes
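The figures above can be turned into a few derived numbers for planning disk and bandwidth use. The following is a minimal sketch, assuming the sizes are reported in bytes; the function and variable names are illustrative and not part of the dataset.

```python
# Figures copied from the dataset card; sizes are assumed to be in bytes.
DATASET_SIZE_BYTES = 1_084_069_907
DOWNLOAD_SIZE_BYTES = 603_669_921
NUM_SAMPLES = 3_669_859


def summarize_sizes(dataset_bytes, download_bytes, num_samples):
    """Return derived size statistics as a dict."""
    return {
        "dataset_gib": dataset_bytes / 2**30,
        "download_mib": download_bytes / 2**20,
        "avg_bytes_per_sample": dataset_bytes / num_samples,
        "download_ratio": download_bytes / dataset_bytes,
    }


stats = summarize_sizes(DATASET_SIZE_BYTES, DOWNLOAD_SIZE_BYTES, NUM_SAMPLES)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under that assumption, each sentence pair averages roughly 295 bytes, and the download is a little over half the size of the prepared dataset.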
Configuration
- default
  - Data Files:
    - train: data/train-*
How to Use
from datasets import load_dataset
dataset = load_dataset("Hoshikuzu/JParaCrawl")
If loading takes too long, you can use streaming:
from datasets import load_dataset
dataset = load_dataset("Hoshikuzu/JParaCrawl", streaming=True)
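A streamed dataset is consumed by iterating over it rather than by indexing. The sketch below shows one way to peek at the first few records; `take_first` and `preview_jparacrawl` are hypothetical helper names, and passing `split="train"` is an assumption made here so that `load_dataset` returns a single iterable instead of a dictionary of splits.

```python
from itertools import islice


def take_first(records, n):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(records, n))


def preview_jparacrawl(n=3):
    """Stream the train split and return its first n records (needs network access)."""
    # Imported here so take_first stays usable without the datasets package.
    from datasets import load_dataset

    stream = load_dataset("Hoshikuzu/JParaCrawl", split="train", streaming=True)
    return take_first(stream, n)


# Example call (requires network access and the datasets package):
#     for record in preview_jparacrawl():
#         print(record)
```

Because `islice` stops after `n` items, only the beginning of the stream is read.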
Data Example
{
"en": "Of course, we’ll keep the important stuff, but we’ll try to sell as much as possible of the stuff we don’t need. afterwards I feel like we can save money by reducing things and making life related patterns too.",
"ja": "もちろん大切なものは取っておきますが、なくても困らないものはなるべく売るようにします。 さいごに ものを減らして、生活関連もパターン化することでお金は貯まる気がしています。"
}
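The Features section lists `en` and `ja` under a `translation` field, while the example above prints them at the top level. The card does not say which layout a loaded record uses, so the sketch below accepts both. `get_pair` is a hypothetical helper, and the short sample sentences are made up for illustration rather than taken from the corpus.

```python
def get_pair(record):
    """Return (english, japanese) from a record in either nested or flat layout."""
    # Nested layout, as listed under Features: {"translation": {"en": ..., "ja": ...}}
    # Flat layout, as printed in the data example: {"en": ..., "ja": ...}
    pair = record.get("translation", record)
    return pair["en"], pair["ja"]


# Made-up records, one per layout.
nested = {"translation": {"en": "Hello.", "ja": "こんにちは。"}}
flat = {"en": "Hello.", "ja": "こんにちは。"}

print(get_pair(nested))
print(get_pair(flat))
```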
License Information
JParaCrawl is distributed under its own license. See details at https://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/.
Data Split
Only the train split is provided.
Citation
@inproceedings{morishita-etal-2020-jparacrawl,
title = "{JP}ara{C}rawl: A Large Scale Web-Based {E}nglish-{J}apanese Parallel Corpus",
author = "Morishita, Makoto and
Suzuki, Jun and
Nagata, Masaaki",
booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.443",
pages = "3603--3609",
ISBN = "979-10-95546-34-4",
}