indiejoseph/cc100-yue
A subset of the CC100 corpus filtered to contain only Cantonese content. It is intended to support natural language processing tasks such as text classification, sentiment analysis, named entity recognition, and machine translation. The filtering process follows an article by ToastyNews.
Description
Dataset Card "cc100-yue"
Dataset Information
Features
- Name: text
- Data Type: string
Split
- Name: train
- Bytes: 32135136
- Samples: 176047
Download Size
- Size: 23579906 bytes
Dataset Size
- Size: 32135136 bytes
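As a rough sanity check on the figures above, the average sample size and the download-to-dataset size ratio can be derived directly from the card. The sketch below uses only the three numbers listed; the estimate of characters per sample assumes that most characters are 3-byte UTF-8 CJK characters, which is an assumption of this example and not something the card states.

```python
# Figures copied from the dataset card.
NUM_BYTES = 32_135_136       # dataset size (train split), in bytes
NUM_SAMPLES = 176_047        # number of samples in the train split
DOWNLOAD_BYTES = 23_579_906  # download size, in bytes

# Average size of one sample in bytes.
avg_bytes_per_sample = NUM_BYTES / NUM_SAMPLES

# Fraction of the in-memory dataset size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / NUM_BYTES

# Assumption (not from the card): most characters are 3-byte UTF-8 CJK.
approx_chars_per_sample = avg_bytes_per_sample / 3

print(f"average sample size: {avg_bytes_per_sample:.1f} bytes")
print(f"download / dataset size: {download_ratio:.1%}")
print(f"approx. characters per sample: {approx_chars_per_sample:.0f}")
```

Under these assumptions a sample averages roughly 180 bytes, or on the order of 60 characters, so the entries are short passages and not full documents.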
Configuration
- Configuration Name: default
- Data Files:
  - Split: train
  - Path: data/train-*
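The configuration above can be read as: the default configuration has a single train split, built from every file matching data/train-*. A minimal sketch of how this might be used follows. It assumes the Hugging Face `datasets` library is installed and that the repository ID is the one in the page title. The shard filename in the usage note is hypothetical, and `fnmatch` only approximates the glob matching the Hub performs.

```python
from fnmatch import fnmatch

REPO_ID = "indiejoseph/cc100-yue"       # taken from the page title
DATA_FILES = {"train": "data/train-*"}  # split -> path pattern, from the card


def matches_split(path: str, split: str = "train") -> bool:
    """Return True if a repository file path belongs to the given split."""
    return fnmatch(path, DATA_FILES[split])


def load_train():
    """Load the train split from the Hub (requires network access)."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train")
```

For example, `matches_split("data/train-00000-of-00001.parquet")` returns True for this hypothetical shard name, and `load_train()[0]["text"]` would return the first sample's text, since text is the only feature listed.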
Dataset Description
The Filtered Cantonese Dataset is a subset of the CC100 corpus, filtered to contain only Cantonese content. This dataset aims to facilitate various natural language processing tasks such as text classification, sentiment analysis, named entity recognition, and machine translation.
Filtering Process
The filtering process follows the article "Building a Hong Kongese Language Identifier" by ToastyNews.
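The card does not describe the filter itself, so the following is only an illustration of the general idea of separating written Cantonese from Standard Written Chinese. It is a simplified character-marker heuristic, not the method from the ToastyNews article. The marker sets, the `min_hits` threshold, and the function name are all hypothetical choices made for this sketch.

```python
# Hypothetical marker sets chosen for illustration; not taken from the article.
# Characters that are common in written Cantonese:
CANTONESE_MARKERS = set("嘅咗哋嚟喺冇乜嘢啲噉佢睇")
# Characters that are common in Standard Written Chinese:
STANDARD_MARKERS = set("的了們是在沒什這那他看")


def looks_cantonese(text: str, min_hits: int = 2) -> bool:
    """Crude check: enough Cantonese markers, and more of them than standard ones."""
    yue_hits = sum(ch in CANTONESE_MARKERS for ch in text)
    swc_hits = sum(ch in STANDARD_MARKERS for ch in text)
    return yue_hits >= min_hits and yue_hits > swc_hits


samples = ["佢哋喺度食緊嘢", "他們在這裡吃東西"]
kept = [s for s in samples if looks_cantonese(s)]
```

Here only the first sample is kept. A real language identifier would need to handle mixed text, quotations, and short inputs, which this sketch ignores.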
Source
Organization: hugging_face
Created: Unknown