JUHE API Marketplace

Dataset Catalog

Browse trusted datasets for evaluation, enrichment, and production use.

Category: Low‑Resource Languages

LTRC Hindi-Telugu Parallel Corpus

Machine Translation | Low‑Resource Languages

We provide a Hindi‑Telugu parallel corpus spanning several technical domains (natural sciences, computer science, law, and healthcare) as well as the general domain. The corpus contains 700K parallel sentences: 535K were created through extraction, alignment, manual translation, and iterative back‑translation with post‑editing, and 165K were collected from the public domain. We report comparative evaluations of the corpus's representativeness and diversity. The corpus is pre‑processed for machine translation; we trained a neural MT system and report state‑of‑the‑art baselines on several domains and benchmarks. This defines a new task: domain‑specific MT for low‑resource language pairs such as Hindi‑Telugu. The 535K curated corpus is freely available for non‑commercial research and is, to our knowledge, the largest carefully curated, publicly available Hindi‑Telugu domain parallel corpus.

Source: GitHub | Updated Oct 22, 2024 | 141 views
Copyright © 2026 JUHEDATA HK LIMITED - All rights reserved