indiejoseph/cc100-yue
The filtered Cantonese dataset is a subset of the CC100 corpus, containing only Cantonese content after filtering. It is intended to support various natural language processing tasks such as text classification, sentiment analysis, named entity recognition, and machine translation. The filtering process references an article by ToastyNews.
Dataset Card "cc100-yue"
Dataset Information
Features
- Name: text
- Data Type: string
Split
- Name: train
- Bytes: 32135136
- Samples: 176047
Download Size
- Size: 23579906 bytes
Dataset Size
- Size: 32135136 bytes
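As a rough sanity check on the figures above, the split statistics imply an average sample size and a download-to-dataset ratio. The sketch below only does arithmetic on the numbers listed in this card; the constant and variable names are illustrative and not part of the dataset.

```python
# Figures copied from the card; names are illustrative assumptions.
NUM_BYTES = 32_135_136       # train split size in bytes
NUM_SAMPLES = 176_047        # train split sample count
DOWNLOAD_BYTES = 23_579_906  # download size in bytes

# Mean size of one sample, and how much smaller the download is
# than the prepared dataset.
avg_bytes_per_sample = NUM_BYTES / NUM_SAMPLES
download_ratio = DOWNLOAD_BYTES / NUM_BYTES

print(f"average sample size: {avg_bytes_per_sample:.1f} bytes")
print(f"download / dataset size: {download_ratio:.2f}")
```

This works out to about 182.5 bytes per sample and a ratio of about 0.73, which suggests the samples are short passages rather than full documents.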
Configuration
- Configuration Name: default
- Data Files:
  - Split: train
  - Path: data/train-*
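Given the single `default` configuration and `train` split listed above, loading could look like the following minimal sketch. It assumes the Hugging Face `datasets` library is installed (`pip install datasets`) and that the repository name matches the card title; `load_train_split` and `preview` are hypothetical helper names introduced here for illustration.

```python
from itertools import islice
from typing import Iterable, List

REPO_ID = "indiejoseph/cc100-yue"  # assumed from the card title


def load_train_split(streaming: bool = True):
    """Load the 'train' split; streaming avoids downloading everything up front."""
    # Imported lazily so that preview() below stays usable without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=streaming)


def preview(rows: Iterable[dict], n: int = 3) -> List[str]:
    """Return the 'text' field (the card's only feature) of the first n rows."""
    return [row["text"] for row in islice(rows, n)]
```

A typical call would be `preview(load_train_split())`, which needs network access to fetch the data.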
Dataset Description
The Filtered Cantonese Dataset is a subset of the CC100 corpus, filtered to contain only Cantonese content. This dataset aims to facilitate various natural language processing tasks such as text classification, sentiment analysis, named entity recognition, and machine translation.
Filtering Process
The filtering process follows the article "Building a Hong Kongese Language Identifier" by ToastyNews.
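The card does not spell out the filtering steps, so the sketch below is not the article's method. It is a simple stand-in heuristic that illustrates the general idea of separating written Cantonese from standard written Chinese by counting characteristic characters. Both marker sets and the function name `looks_cantonese` are assumptions chosen for illustration.

```python
# Hypothetical marker sets, for illustration only; a real filter would
# need a far more careful list or a trained language identifier.
CANTONESE_MARKERS = set("嘅咗唔喺哋嘢冇啲佢嚟咁乜嗰")
STANDARD_CHINESE_MARKERS = set("的了是在們这這那他她")


def looks_cantonese(text: str, min_hits: int = 1) -> bool:
    """Return True if Cantonese markers appear and outnumber standard ones."""
    canto = sum(ch in CANTONESE_MARKERS for ch in text)
    standard = sum(ch in STANDARD_CHINESE_MARKERS for ch in text)
    return canto >= min_hits and canto > standard


samples = ["我哋今日唔去喇", "我們今天不去了"]
kept = [s for s in samples if looks_cantonese(s)]
print(kept)
```

A character-count heuristic like this is cheap to run over a large corpus but misclassifies mixed or very short text, which is one reason a trained identifier may be preferred.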