Vietnamese Corpus

The Vietnamese Corpus Project provides an organized collection of Vietnamese text resources covering multiple thematic domains. The corpus can be used for natural language processing, machine translation, text analysis, and other research involving Vietnamese. Documents are classified by topic for easy access, and the project also integrates Vietnamese Wikipedia dictionary resources for looking up definitions of, and background information on, Vietnamese terms.

Updated 8/19/2024

Description

Vietnamese Text Corpus

Project Overview

The Vietnamese Text Corpus Project aims to deliver an organized collection of Vietnamese textual resources across multiple thematic domains. The corpus can be used for natural language processing (NLP), machine translation, text analysis, and other research involving Vietnamese. Documents are categorized by topic for convenient access and utilization.

The project also incorporates Vietnamese Wikipedia dictionary resources, allowing users to easily look up definitions and background information for Vietnamese vocabulary.

Category Directory

Text documents in the corpus are grouped by topic. The categories and their document counts are:

  • Chính trị Xã hội (Politics & Society) – 6,567 documents covering Vietnamese politics, social phenomena, and related issues.
  • Đời sống (Life) – 4,195 documents covering aspects of daily life such as family, education, and culture.
  • Kinh doanh (Business) – 4,276 documents focusing on business, economy, and finance topics.
  • Pháp luật (Law) – 6,656 documents covering laws, regulations, judicial cases, etc.
  • Sức khỏe (Health) – 4,417 documents covering medicine and public health.
  • Thế giới (World) – 5,716 documents discussing international news, global issues, diplomatic affairs, etc.
  • Thể thao (Sports) – 5,667 documents covering sports news, event reports, athlete information, etc.
  • Văn hóa (Culture) – 5,250 documents covering arts, literature, traditional culture, etc.
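
As an illustration of how the category directory might be used for text classification, here is a minimal Python sketch. The document counts are taken from the list above; everything else is an assumption. In particular, the on-disk layout (one folder per category containing UTF-8 `.txt` files) is hypothetical, and the names `load_corpus` and `category_shares` are illustrative helpers, not part of the project.

```python
from pathlib import Path

# Document counts per category, copied from the category list above.
CATEGORY_COUNTS = {
    "Chính trị Xã hội": 6567,
    "Đời sống": 4195,
    "Kinh doanh": 4276,
    "Pháp luật": 6656,
    "Sức khỏe": 4417,
    "Thế giới": 5716,
    "Thể thao": 5667,
    "Văn hóa": 5250,
}


def category_shares(counts):
    """Return the fraction of the corpus that falls in each category."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def load_corpus(root):
    """Return (category, text) pairs for every document under root.

    ASSUMPTION: documents are stored as root/<category>/*.txt in UTF-8.
    The real repository layout may differ; adjust the glob accordingly.
    """
    samples = []
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue  # skip stray files such as a README
        for doc in sorted(category_dir.glob("*.txt")):
            # errors="replace" keeps loading even if a file is mis-encoded
            text = doc.read_text(encoding="utf-8", errors="replace")
            samples.append((category_dir.name, text))
    return samples


if __name__ == "__main__":
    print(sum(CATEGORY_COUNTS.values()))  # 42744 documents in total
    for name, share in category_shares(CATEGORY_COUNTS).items():
        print(f"{name}: {share:.1%}")
```

The counts add up to 42,744 documents. The largest category (Pháp luật, about 15.6%) is roughly 1.6 times the size of the smallest (Đời sống, about 9.8%), so the classes are only mildly imbalanced; a stratified train/test split is usually enough to keep per-category proportions stable.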

Wikipedia Dictionary

The project integrates a Vietnamese dictionary extracted from Wikipedia.

Topics

Natural Language Processing
Vietnamese Language Research

Source

Organization: GitHub

Created: 8/19/2024
