LTRC Hindi-Telugu Parallel Corpus
We provide a Hindi‑Telugu parallel corpus covering technical domains (natural sciences, computer science, law, and healthcare) as well as the general domain. The corpus contains 700 K parallel sentences: 535 K were created through extraction and alignment, manual translation, and iterative back‑translation with post‑editing, and 165 K were collected from sources in the public domain. We report a comparative evaluation of the corpus’s representativeness and diversity. The corpus is pre‑processed for machine translation; we trained a neural MT system on it and report state‑of‑the‑art baseline results on several domains and on available benchmarks. With this, we define a new domain‑specific MT task for low‑resource language pairs such as Hindi‑Telugu. The 535 K curated portion is freely available for non‑commercial research and is, to our knowledge, the largest carefully curated, publicly available Hindi‑Telugu domain parallel corpus.
Description
The LTRC Hindi‑Telugu Parallel Corpus
Dataset Overview
- Title: The LTRC Hindi‑Telugu Parallel Corpus
- Authors: Vandan Mujadia, Dipti Sharma
- Publishing Institution: European Language Resources Association
- Release Date: June 2022
- Conference: Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Location: Marseille, France
Dataset Content
- Language Pair: Hindi‑Telugu
- Domains: Natural Sciences, Computer Science, Law, Healthcare, and General Domain
- Scale: 700 K parallel sentences (535 K created via multiple methods, 165 K from public domain)
- Creation Methods: Extraction, alignment, manual translation, iterative back‑translation with post‑editing
Dataset Uses
- Pre‑processing: Corpus is pre‑processed for machine translation and was used to train a neural MT baseline
- Task: Defines a new domain MT task for low‑resource language pairs (Hindi‑Telugu)
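A corpus prepared for machine translation is usually consumed as aligned sentence pairs. The following is a minimal sketch of that step, not the authors' tooling: it assumes the data is available as two line-aligned sequences (one Hindi sentence and one Telugu sentence per line), which this listing does not confirm. The function name `read_parallel` and the sample sentences are hypothetical and are not taken from the corpus.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_parallel(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned sequences.

    Assumption: line i on the source side translates line i on the target side.
    """
    for n, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            # One side ran out of lines first, so the alignment cannot be trusted.
            raise ValueError(f"sides are not line-aligned: one side ends before line {n}")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # skip pairs where either side is empty
            yield src, tgt


# Illustrative in-memory data (not from the corpus); open file handles work the same way.
hindi = ["यह एक किताब है।\n", "\n"]
telugu = ["ఇది ఒక పుస్తకం.\n", "\n"]
pairs = list(read_parallel(hindi, telugu))
print(len(pairs))
```

Passing any iterables of lines, including open file objects, keeps the function independent of how the corpus is actually distributed. The strict alignment check is a design choice: a silent `zip` would truncate to the shorter side and hide a misaligned download.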
Dataset Characteristics
- Representativeness & Diversity: Comparative evaluation performed
- Availability: The 535 K curated portion is freely available for non‑commercial research
- Scale Claim: Largest, carefully curated, publicly available Hindi‑Telugu domain parallel corpus to date
Dataset Source
- Development Institution: LTRC, IIIT‑Hyderabad
- Funding: MeitY (Ministry of Electronics and Information Technology), Government of India
- Project: ILMT Hindi‑Telugu Pilot
Source
Organization: GitHub
Created: 10/22/2024