indiejoseph/cc100-yue

A Cantonese-only subset of the CC100 corpus, obtained by language filtering. It is intended to support natural language processing tasks such as text classification, sentiment analysis, named entity recognition, and machine translation. The filtering process follows an article by ToastyNews.

Source
hugging_face
Created
Nov 28, 2025
Updated
Oct 17, 2023
Overview

Dataset description and usage context

Dataset Card "cc100-yue"

Dataset Information

Features

  • Name: text
  • Data Type: string

Split

  • Name: train
  • Bytes: 32135136
  • Samples: 176047

Download Size

  • Size: 23579906 bytes

Dataset Size

  • Size: 32135136 bytes
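
As a rough sanity check on the figures above, the average sample size and the ratio of download size to dataset size can be derived directly. This is a minimal sketch that assumes all three sizes are byte counts, as is usual for Hugging Face dataset cards; the variable names are illustrative only.

```python
# Figures copied from the dataset card; sizes are assumed to be in bytes.
NUM_SAMPLES = 176_047
DATASET_BYTES = 32_135_136
DOWNLOAD_BYTES = 23_579_906

# Average size of one text sample in the train split.
avg_bytes_per_sample = DATASET_BYTES / NUM_SAMPLES

# Fraction of the in-memory dataset size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"average bytes per sample: {avg_bytes_per_sample:.1f}")
print(f"download size / dataset size: {download_ratio:.1%}")
```

Under these assumptions a sample averages a little over 180 bytes, and the download is a bit under three quarters of the dataset size.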

Configuration

  • Configuration Name: default
  • Data Files:
    • Split: train
    • Path: data/train-*
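
Given the configuration above, the train split can most likely be loaded with the Hugging Face `datasets` library. The sketch below assumes that `datasets` is installed and that the repository is reachable over the network; the helper name `load_cc100_yue` is hypothetical and not part of the dataset card.

```python
REPO_ID = "indiejoseph/cc100-yue"  # repository name shown on this page
CONFIG_NAME = "default"            # configuration listed on the card
SPLIT = "train"                    # the only split listed on the card


def load_cc100_yue():
    """Download (or reuse a cached copy of) the train split.

    Hypothetical helper; requires `pip install datasets` and network access.
    """
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG_NAME, split=SPLIT)
```

Calling `ds = load_cc100_yue()` should return a dataset whose rows expose the single `text` feature, for example `ds[0]["text"]`. If the card is current, `len(ds)` is expected to match the sample count listed under Split.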

Dataset Description

The Filtered Cantonese Dataset is a subset of the CC100 corpus, filtered to contain only Cantonese content. This dataset aims to facilitate various natural language processing tasks such as text classification, sentiment analysis, named entity recognition, and machine translation.

Filtering Process

The filtering process follows the article "Building a Hong Kongese Language Identifier" by ToastyNews.
