JParaCrawl

JParaCrawl is the largest publicly available English‑Japanese parallel corpus, created by NTT. It is built by large‑scale web crawling and automatic alignment of parallel sentences. This copy provides a single train split of 3,669,859 sentence pairs (1,084,069,907 bytes), and each data instance is one English‑Japanese sentence pair. The dataset is distributed under its own license.

Updated 8/25/2024
huggingface

Description

Dataset Card for JParaCrawl

Dataset Overview

JParaCrawl is the largest publicly available English‑Japanese parallel corpus created by NTT. It is constructed via large‑scale web crawling and automatic alignment of parallel sentences.

Dataset Information

Features

  • translation
    • en: type string
    • ja: type string

Data Split

  • train
    • Bytes: 1084069907
    • Samples: 3669859

Download and Dataset Size

  • Download Size: 603669921 bytes
  • Dataset Size: 1084069907 bytes
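
As a rough sanity check on the figures above, the short calculation below derives the average size of one sentence pair and the ratio of download size to prepared dataset size. The function and constant names are illustrative only, and treating Download Size as a byte count is an assumption, since the card does not state its unit.

```python
# Figures copied from the dataset card above.
TRAIN_BYTES = 1_084_069_907      # "Bytes" / "Dataset Size"
TRAIN_SAMPLES = 3_669_859        # "Samples"
DOWNLOAD_BYTES = 603_669_921     # "Download Size" (assumed to be bytes)


def bytes_per_sample(total_bytes: int, samples: int) -> float:
    """Average stored size of one English-Japanese pair, in bytes."""
    return total_bytes / samples


def download_ratio(download_bytes: int, dataset_bytes: int) -> float:
    """Fraction of the prepared dataset size that has to be downloaded."""
    return download_bytes / dataset_bytes


avg = bytes_per_sample(TRAIN_BYTES, TRAIN_SAMPLES)
ratio = download_ratio(DOWNLOAD_BYTES, TRAIN_BYTES)
print(f"average bytes per pair: {avg:.1f}")
print(f"download / dataset size: {ratio:.2f}")
```

Under these assumptions a pair averages roughly 295 bytes, and the download is a little over half the size of the prepared dataset.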

Configuration

  • default
    • Data Files:
      • train: data/train-*

How to Use

# Download and prepare the full dataset (only the train split exists).
from datasets import load_dataset
dataset = load_dataset("Hoshikuzu/JParaCrawl")

If loading takes too long, you can use streaming:

from datasets import load_dataset
# streaming=True reads examples lazily instead of downloading everything first.
dataset = load_dataset("Hoshikuzu/JParaCrawl", streaming=True)
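
With streaming enabled, a split is iterated lazily rather than indexed, so a common pattern is to take only the first few records. The sketch below shows this with `itertools.islice`. Here `take` is a hypothetical helper, and `fake_stream` is a stand-in generator with made-up records so that the snippet runs offline; in practice one would pass `dataset["train"]` from the streaming call above.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def take(records: Iterable[dict], n: int) -> List[dict]:
    """Return at most the first n records without consuming the rest."""
    return list(islice(records, n))


def fake_stream() -> Iterator[dict]:
    """Stand-in for a streamed split; yields made-up records lazily."""
    i = 0
    while True:  # unbounded on purpose, like a very large corpus
        yield {"en": f"sentence {i}", "ja": f"文 {i}"}
        i += 1


first_three = take(fake_stream(), 3)
for record in first_three:
    print(record["en"], "|", record["ja"])
```

Because `islice` stops after `n` items, the rest of the stream is never read, which is the point of streaming a corpus of this size.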

Data Example

{
  "en": "Of course, we’ll keep the important stuff, but we’ll try to sell as much as possible of the stuff we don’t need. afterwards I feel like we can save money by reducing things and making life related patterns too.",
  "ja": "もちろん大切なものは取っておきますが、なくても困らないものはなるべく売るようにします。 さいごに ものを減らして、生活関連もパターン化することでお金は貯まる気がしています。"
}
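
The Features list above nests `en` and `ja` under a `translation` field, while this example shows them at the top level. A minimal sketch of a helper that accepts either layout follows. `get_pair` is a hypothetical name, the sample sentences are placeholders rather than corpus data, and which layout the loaded records actually use is an assumption to verify against the data.

```python
from typing import Mapping, Tuple


def get_pair(record: Mapping) -> Tuple[str, str]:
    """Return (english, japanese) from a nested or flat record."""
    # Nested layout, as in the Features list:
    #   {"translation": {"en": ..., "ja": ...}}
    # Flat layout, as in the data example:
    #   {"en": ..., "ja": ...}
    inner = record.get("translation", record)
    return inner["en"], inner["ja"]


# Placeholder records, one per layout.
nested = {"translation": {"en": "Hello.", "ja": "こんにちは。"}}
flat = {"en": "Hello.", "ja": "こんにちは。"}

en, ja = get_pair(nested)
print(en, ja)
print(get_pair(flat) == get_pair(nested))
```

Normalizing records through one accessor like this keeps downstream code independent of the storage layout.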

License Information

JParaCrawl is distributed under its own license. See details at https://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/.

Data Split

Only the train split is provided.

Citation

@inproceedings{morishita-etal-2020-jparacrawl,
    title = "{JP}ara{C}rawl: A Large Scale Web-Based {E}nglish-{J}apanese Parallel Corpus",
    author = "Morishita, Makoto  and
      Suzuki, Jun  and
      Nagata, Masaaki",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.443",
    pages = "3603--3609",
    ISBN = "979-10-95546-34-4",
}

Topics

Machine Translation
Natural Language Processing

Source

Organization: huggingface

Created: 8/24/2024
