High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

TEDtalk-en-ja

This dataset contains Japanese‑English translation pairs extracted from the Multitarget TED Talks (MTTT) dataset, which is built from TED talks. The underlying data originates from WIT³ and is used in the IWSLT machine translation evaluation campaign. The dataset has a single training split of 158,535 examples, each pairing an English sentence with a Japanese sentence. It is released under the CC BY‑NC‑ND 4.0 license, and users must acknowledge TED's contribution.

Source: Hugging Face
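
The record layout described above (one English sentence paired with one Japanese sentence) can be sketched in Python as follows. This is a minimal illustration, not the dataset's documented schema: the field names `en` and `ja`, the `TranslationPair` class, the `iter_pairs` helper, and the two sample records are all assumptions made up for the example, so check the dataset card for the actual column names before relying on them.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class TranslationPair:
    """One English/Japanese sentence pair (hypothetical container)."""
    en: str
    ja: str


def iter_pairs(
    records: Iterable[dict],
    en_key: str = "en",  # assumed field name, not confirmed by the listing
    ja_key: str = "ja",  # assumed field name, not confirmed by the listing
) -> Iterator[TranslationPair]:
    """Yield cleaned pairs, skipping records where either side is empty."""
    for record in records:
        en = str(record.get(en_key) or "").strip()
        ja = str(record.get(ja_key) or "").strip()
        if en and ja:
            yield TranslationPair(en=en, ja=ja)


# Invented sample records for illustration only; not taken from the dataset.
sample_records = [
    {"en": "Thank you very much.", "ja": "どうもありがとうございます。"},
    {"en": "   ", "ja": "英語側が空のレコード"},
]

pairs = list(iter_pairs(sample_records))
for pair in pairs:
    print(f"{pair.en}\t{pair.ja}")
```

Filtering out pairs with an empty side is a common preprocessing step for parallel corpora. Whether this dataset actually contains such records is not stated in the listing.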