JParaCrawl

JParaCrawl is the largest publicly available English‑Japanese parallel corpus, created by NTT. It is built by large‑scale web crawling followed by automatic alignment of parallel sentences. This copy provides a single train split of 3,669,859 examples (1,084,069,907 bytes). Each data instance contains one English‑Japanese sentence pair, stored as two strings under the keys en and ja. The dataset is distributed under its own license.

Source

huggingface

Created

Aug 24, 2024

Updated

Aug 25, 2024

Dataset Card for JParaCrawl

Dataset Overview

JParaCrawl is the largest publicly available English‑Japanese parallel corpus created by NTT. It is constructed via large‑scale web crawling and automatic alignment of parallel sentences.

Dataset Information

Features

translation
- en: type string
- ja: type string
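
As a minimal sketch of how this feature layout might be read, the helper below pulls the two strings out of one record. It assumes each record nests the pair under a `translation` key, as the feature list suggests; `get_pair` and the sample record are hypothetical names and data introduced only for illustration.

```python
from typing import Mapping, Tuple


def get_pair(record: Mapping) -> Tuple[str, str]:
    """Return (english, japanese) from one record.

    Assumes the nested layout listed under Features:
    {"translation": {"en": "...", "ja": "..."}}.
    """
    translation = record["translation"]
    return translation["en"], translation["ja"]


# Hypothetical record shaped like the feature list above (not taken from the corpus).
sample = {"translation": {"en": "Good morning.", "ja": "おはようございます。"}}
en, ja = get_pair(sample)
print(en)
print(ja)
```

If the records turn out to be flat (`en` and `ja` at the top level), the helper would need to index the record directly instead.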

Data Split

train
- Bytes: 1084069907
- Samples: 3669859

Download and Dataset Size

Download Size: 603669921 bytes
Dataset Size: 1084069907 bytes
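
To put these figures in perspective, the short calculation below derives the average size of one sentence pair and the ratio of download size to dataset size. It uses only the numbers listed on this card; the variable names are illustrative. The result is roughly 295 bytes per pair, with the download at a little over half of the dataset size.

```python
# Figures copied from the dataset card.
TRAIN_BYTES = 1_084_069_907
TRAIN_SAMPLES = 3_669_859
DOWNLOAD_BYTES = 603_669_921

# Average size of one English-Japanese pair in the train split.
avg_bytes_per_pair = TRAIN_BYTES / TRAIN_SAMPLES

# Fraction of the dataset size that actually has to be downloaded.
download_ratio = DOWNLOAD_BYTES / TRAIN_BYTES

print(f"average bytes per sentence pair: {avg_bytes_per_pair:.1f}")
print(f"download size / dataset size: {download_ratio:.1%}")
```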

Configuration

default
- Data Files:
  - train: data/train-*

How to Use

# Requires the Hugging Face `datasets` package.
from datasets import load_dataset
dataset = load_dataset("Hoshikuzu/JParaCrawl")

If loading takes too long, you can use streaming, which reads examples lazily instead of downloading the whole dataset first:

from datasets import load_dataset
dataset = load_dataset("Hoshikuzu/JParaCrawl", streaming=True)
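
A streamed dataset is consumed by iteration, so a quick look at the first few records might be sketched as below. `preview` and `preview_jparacrawl` are hypothetical helpers, and the code assumes records nest the pair under `translation`, as the feature list suggests. The demonstration at the bottom runs offline on stand-in records; only `preview_jparacrawl` touches the network.

```python
from itertools import islice
from typing import Iterable, List, Mapping


def preview(records: Iterable[Mapping], n: int = 3) -> List[Mapping]:
    """Take the first n records without consuming the rest of the iterable."""
    return list(islice(records, n))


def preview_jparacrawl(n: int = 3) -> List[Mapping]:
    """Stream the train split and return its first n records (needs network access)."""
    # Imported lazily so that preview() can be used without the datasets package.
    from datasets import load_dataset

    dataset = load_dataset("Hoshikuzu/JParaCrawl", split="train", streaming=True)
    return preview(dataset, n)


# Offline demonstration: hypothetical records standing in for the stream.
fake_stream = (
    {"translation": {"en": f"sentence {i}", "ja": f"文 {i}"}} for i in range(1000)
)
for record in preview(fake_stream, 2):
    print(record["translation"]["en"], "|", record["translation"]["ja"])
```

Using `islice` stops after `n` records, so only the beginning of the stream is fetched rather than the full train split.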

Data Example

{
  "en": "Of course, we’ll keep the important stuff, but we’ll try to sell as much as possible of the stuff we don’t need. afterwards I feel like we can save money by reducing things and making life related patterns too.",
  "ja": "もちろん大切なものは取っておきますが、なくても困らないものはなるべく売るようにします。 さいごに ものを減らして、生活関連もパターン化することでお金は貯まる気がしています。"
}

License Information

JParaCrawl is distributed under its own license. See details at https://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/.

Data Split

Only the train split is provided.
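
Because no validation or test split ships with the dataset, users who need held-out data have to carve it out themselves. The sketch below shows one possible approach, not something the card prescribes: a hash of each sentence pair decides its split, so the assignment is deterministic and also works when the data is streamed. `assign_split`, the 1% default, and the sample pairs are all assumptions made for illustration.

```python
import hashlib


def assign_split(english: str, japanese: str, valid_percent: int = 1) -> str:
    """Deterministically map a sentence pair to 'train' or 'validation'."""
    key = (english + "\t" + japanese).encode("utf-8")
    # The hash is stable across runs and machines, unlike Python's built-in hash().
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "validation" if bucket < valid_percent else "train"


# Hypothetical pairs standing in for real corpus records.
pairs = [(f"sentence {i}", f"文 {i}") for i in range(10_000)]
counts = {"train": 0, "validation": 0}
for en, ja in pairs:
    counts[assign_split(en, ja)] += 1
print(counts)
```

Identical pairs always land in the same split, which keeps exact duplicates from leaking between training and validation data.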

Citation

@inproceedings{morishita-etal-2020-jparacrawl,
    title = "{JP}ara{C}rawl: A Large Scale Web-Based {E}nglish-{J}apanese Parallel Corpus",
    author = "Morishita, Makoto  and
      Suzuki, Jun  and
      Nagata, Masaaki",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.443",
    pages = "3603--3609",
    ISBN = "979-10-95546-34-4",
}
