JParaCrawl
JParaCrawl is the largest publicly available English-Japanese parallel corpus, created by NTT. It is built by large-scale web crawling followed by automatic alignment of parallel sentences. This copy provides a single training split of 3,669,859 examples (1,084,069,907 bytes). Each data instance contains an English-Japanese sentence pair. The dataset is distributed under its own license.
Description
Dataset Card for JParaCrawl
Dataset Overview
JParaCrawl is the largest publicly available English‑Japanese parallel corpus created by NTT. It is constructed via large‑scale web crawling and automatic alignment of parallel sentences.
Dataset Information
Features
- translation
  - en: string
  - ja: string
Data Split
- train
  - Bytes: 1084069907
  - Samples: 3669859
Download and Dataset Size
- Download Size: 603669921 bytes
- Dataset Size: 1084069907 bytes
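As a quick sanity check on the figures above, the following sketch derives the average size of one example and the ratio of download size to prepared dataset size. It uses only the numbers listed on this card; the variable names are illustrative, and it assumes both sizes are byte counts.

```python
# Figures copied from this card (assumed to be byte counts).
DATASET_BYTES = 1084069907
DOWNLOAD_BYTES = 603669921
NUM_EXAMPLES = 3669859

# Average storage used by one English-Japanese pair.
avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES

# Fraction of the prepared dataset size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"average bytes per example: {avg_bytes_per_example:.1f}")
print(f"download / dataset size:   {download_ratio:.3f}")
```

The average comes to roughly 295 bytes per sentence pair, and the download is a little over half the size of the prepared dataset.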
Configuration
- default
  - Data Files:
    - train: data/train-*
How to Use
from datasets import load_dataset
dataset = load_dataset("Hoshikuzu/JParaCrawl")
If loading takes too long, you can use streaming:
from datasets import load_dataset
dataset = load_dataset("Hoshikuzu/JParaCrawl", streaming=True)
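With streaming enabled, examples are fetched lazily, so a common pattern is to inspect only the first few records. The sketch below assumes that passing split="train" yields a single iterable dataset rather than a dictionary of splits. `take_first` and `preview_jparacrawl` are hypothetical helper names, not part of the datasets library.

```python
from itertools import islice


def take_first(stream, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(stream, n))


def preview_jparacrawl(n=3):
    """Stream the train split and return its first n records (needs network access)."""
    # Imported here so take_first stays usable without the datasets package.
    from datasets import load_dataset

    stream = load_dataset("Hoshikuzu/JParaCrawl", split="train", streaming=True)
    return take_first(stream, n)
```

Calling preview_jparacrawl() returns a short list of records that can be printed or inspected without downloading the whole corpus first.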
Data Example
{
"en": "Of course, we’ll keep the important stuff, but we’ll try to sell as much as possible of the stuff we don’t need. afterwards I feel like we can save money by reducing things and making life related patterns too.",
"ja": "もちろん大切なものは取っておきますが、なくても困らないものはなるべく売るようにします。 さいごに ものを減らして、生活関連もパターン化することでお金は貯まる気がしています。"
}
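The feature list describes a translation field holding en and ja strings, while the example above shows the two strings directly. A minimal sketch that reads a sentence pair from either layout follows. `get_pair` is a hypothetical helper, the sample sentences are made up for illustration, and which layout the loader actually returns is an assumption to check against the data.

```python
def get_pair(record):
    """Return (english, japanese) from a nested or flat record."""
    # Nested layout: {"translation": {"en": ..., "ja": ...}}
    # Flat layout:   {"en": ..., "ja": ...}
    pair = record.get("translation", record)
    return pair["en"], pair["ja"]


# Illustrative records, not taken from the corpus.
nested = {"translation": {"en": "Hello.", "ja": "こんにちは。"}}
flat = {"en": "Hello.", "ja": "こんにちは。"}

print(get_pair(nested))
print(get_pair(flat))
```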
License Information
JParaCrawl is distributed under its own license. See details at https://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/.
Data Split
Only the train split is provided.
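Because only a train split is provided, any validation data has to be carved out by the user. One possible approach, sketched below, assigns each example to a split by hashing its English text, which is deterministic and also works on a streamed dataset. The function name, the 1% default, and the choice of MD5 are illustrative assumptions, not something this card prescribes.

```python
import hashlib


def assign_split(text, valid_fraction=0.01):
    """Deterministically map a sentence to "train" or "validation"."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    # Interpret the first 4 bytes as an integer and scale it to [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "validation" if bucket < valid_fraction else "train"


# Synthetic sentences, used only to show how the assignment behaves.
sentences = [f"sentence {i}" for i in range(10000)]
labels = [assign_split(s) for s in sentences]
print(labels.count("validation"), "of", len(labels), "assigned to validation")
```

Identical English sentences always land in the same split, which helps keep exact duplicates from appearing on both sides.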
Citation
@inproceedings{morishita-etal-2020-jparacrawl,
title = "{JP}ara{C}rawl: A Large Scale Web-Based {E}nglish-{J}apanese Parallel Corpus",
author = "Morishita, Makoto and
Suzuki, Jun and
Nagata, Masaaki",
booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.443",
pages = "3603--3609",
ISBN = "979-10-95546-34-4",
}
Source
Organization: huggingface
Created: 8/24/2024