LTRC Hindi-Telugu Parallel Corpus

We provide a Hindi‑Telugu parallel corpus covering several technical domains (natural sciences, computer science, law, and healthcare) as well as the general domain. The corpus contains 700 K parallel sentences: 535 K were created through extraction, alignment, manual translation, and iterative back‑translation with post‑editing, and 165 K were collected from the public domain. We report comparative evaluations of the corpus’s representativeness and diversity. The corpus is pre‑processed for machine translation; we trained a neural MT system and report state‑of‑the‑art baselines on several domains and benchmarks. This defines a new task of domain‑specific MT for low‑resource language pairs such as Hindi‑Telugu. The 535 K curated corpus is freely available for non‑commercial research and is, to our knowledge, the largest carefully curated, publicly available Hindi‑Telugu domain parallel corpus.

Updated 10/22/2024

Description

The LTRC Hindi‑Telugu Parallel Corpus

Dataset Overview

Title: The LTRC Hindi‑Telugu Parallel Corpus
Authors: Vandan Mujadia, Dipti Sharma
Publishing Institution: European Language Resources Association
Release Date: June 2022
Conference: Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022)
Location: Marseille, France

Dataset Content

Language Pair: Hindi‑Telugu
Domains: Natural Sciences, Computer Science, Law, Healthcare, and General Domain
Scale: 700 K parallel sentences (535 K created via multiple methods, 165 K from public domain)
Creation Methods: Extraction, alignment, manual translation, iterative back‑translation with post‑editing
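
As an illustration of how a consumer of such a corpus might sanity-check aligned sentence pairs, the sketch below pairs Hindi and Telugu lines and keeps only pairs whose text is mostly in the expected script. It is a minimal sketch, not the authors' pipeline: the one-sentence-per-line layout, the helper names (script_ratio, read_pairs, keep_pair), and the 0.5 threshold are all assumptions made for this example. Only the Unicode block ranges for Devanagari and Telugu are standard.

```python
# Illustrative sketch only: file layout, helper names, and the threshold are
# assumptions for this example, not part of the corpus release.

DEVANAGARI = (0x0900, 0x097F)  # Unicode block used to write Hindi
TELUGU = (0x0C00, 0x0C7F)      # Unicode block used to write Telugu


def script_ratio(text, block):
    """Fraction of alphabetic characters in `text` that lie inside `block`.

    str.isalpha() skips combining vowel signs and the virama, so only base
    letters are counted. Returns 0.0 when the text has no letters at all.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    low, high = block
    inside = sum(1 for ch in letters if low <= ord(ch) <= high)
    return inside / len(letters)


def read_pairs(hindi_lines, telugu_lines):
    """Pair line i of the Hindi side with line i of the Telugu side."""
    hindi_lines = list(hindi_lines)
    telugu_lines = list(telugu_lines)
    if len(hindi_lines) != len(telugu_lines):
        raise ValueError("the two sides must have the same number of lines")
    return [(h.strip(), t.strip()) for h, t in zip(hindi_lines, telugu_lines)]


def keep_pair(hindi, telugu, threshold=0.5):
    """Keep a pair only if each side is mostly written in its own script."""
    return (script_ratio(hindi, DEVANAGARI) >= threshold
            and script_ratio(telugu, TELUGU) >= threshold)


if __name__ == "__main__":
    sample = read_pairs(["नमस्ते\n", "hello\n"], ["నమస్కారం\n", "నమస్కారం\n"])
    kept = [pair for pair in sample if keep_pair(*pair)]
    print(len(sample), "pairs read,", len(kept), "kept")
```

The same functions accept open file handles (opened with encoding="utf-8") in place of the in-memory lists, assuming the downloaded corpus is stored as two line-aligned text files.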

Dataset Uses

Pre‑processing: Pre‑processed for machine translation
Task: Defines a new domain‑specific MT task for low‑resource language pairs such as Hindi‑Telugu

Dataset Characteristics

Representativeness & Diversity: Comparative evaluation performed
Availability: Free for non‑commercial research
Scale Claim: Largest, carefully curated, publicly available Hindi‑Telugu domain parallel corpus to date

Dataset Source

Development Institution: LTRC, IIIT‑Hyderabad
Funding: MeitY, Government of India
Project: ILMT Hindi‑Telugu Pilot

Topics

Machine Translation

Low‑Resource Languages

Source

Organization: github

Created: 10/22/2024
