High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

JESC

JESC (Japanese-English Subtitle Corpus) is a parallel corpus created by Stanford University, Google Brain, and Rakuten Institute of Technology. Built from movie and TV subtitles found on the web, it is one of the largest freely available EN-JA corpora and focuses on conversational language. Its 2.8 million sentence pairs cover everyday speech, slang, instructions, and narrative. It is licensed under CC-BY-4.0 and includes pre-processed data with tokenized train/dev/test splits, intended primarily for translation tasks.
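
As a rough illustration of working with sentence pairs like these, the sketch below reads tab-separated English/Japanese lines into tuples. The one-pair-per-line, tab-separated layout is an assumption and not something this listing states. The `read_pairs` helper and the sample text are hypothetical and are not taken from the corpus.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_pairs(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (english, japanese) tuples from tab-separated lines.

    Assumed layout (not confirmed by the listing): one pair per line,
    English first, then a tab, then Japanese.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip lines that do not hold exactly two fields
        yield parts[0], parts[1]


# Made-up sample standing in for a split file; not real corpus content.
sample = io.StringIO(
    "hello there\tこんにちは\n"
    "\n"
    "malformed line without a tab\n"
    "thank you\tありがとう\n"
)

pairs = list(read_pairs(sample))
for english, japanese in pairs:
    print(english, "|", japanese)
```

For a real split, the same function could take a file opened with `open(path, encoding="utf-8")` in place of the in-memory sample.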

huggingface
