indiejoseph/cc100-yue

This filtered Cantonese dataset is a subset of the CC100 corpus that retains only Cantonese content. It is intended to support natural language processing tasks such as text classification, sentiment analysis, named entity recognition, and machine translation. The filtering process follows an article by ToastyNews.

Updated 10/17/2023
hugging_face

Description

Dataset Card "cc100-yue"

Dataset Information

Features

  • Name: text
  • Data Type: string

Split

  • Name: train
  • Bytes: 32135136
  • Samples: 176047

Download Size

  • Size: 23579906 bytes

Dataset Size

  • Size: 32135136 bytes
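
As a quick sanity check on the figures listed above, the average record size and the ratio of download size to dataset size can be derived directly from the numbers on this card. This is a minimal sketch: the variable names are illustrative, and the sizes are assumed to be byte counts, as is conventional for Hugging Face dataset metadata.

```python
# Figures copied from the dataset card; sizes are assumed to be in bytes.
NUM_SAMPLES = 176_047
DATASET_BYTES = 32_135_136
DOWNLOAD_BYTES = 23_579_906

# Mean size of one text record in the train split.
avg_bytes_per_sample = DATASET_BYTES / NUM_SAMPLES

# Fraction of the in-memory dataset size that is actually downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"Average sample size: {avg_bytes_per_sample:.1f} bytes")
print(f"Download / dataset size: {download_ratio:.1%}")
```

Under these assumptions the average works out to roughly 183 bytes per sample, and the download is about 73% of the stated dataset size.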

Configuration

  • Configuration Name: default
  • Data Files:
    • Split: train
    • Path: data/train-*
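
Given the configuration above, the dataset could plausibly be loaded with the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the repository is publicly reachable under the name shown on this card; the helper name `load_cc100_yue` is hypothetical.

```python
DATASET_ID = "indiejoseph/cc100-yue"  # repository name from this card
SPLIT = "train"                       # the only split listed on this card


def load_cc100_yue():
    """Load the train split. Requires `pip install datasets` and network access."""
    # Imported lazily so that defining this helper does not require the dependency.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=SPLIT)
```

Calling `load_cc100_yue()` should return a dataset whose records each have a single string field named `text`, matching the feature list on this card.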

Dataset Description

The Filtered Cantonese Dataset is a subset of the CC100 corpus, filtered to contain only Cantonese content. This dataset aims to facilitate various natural language processing tasks such as text classification, sentiment analysis, named entity recognition, and machine translation.

Filtering Process

The filtering process follows the article "Building a Hong Kongese Language Identifier" by ToastyNews.
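
The article's exact method is not reproduced on this card. Purely to illustrate what a Cantonese filter does, the sketch below uses a much simpler stand-in technique: it flags a line as Cantonese when it contains enough characters that are common in written Cantonese but rare in Standard Written Chinese. The marker set, the threshold, and the function name are all assumptions chosen for illustration, not details taken from the article or from this dataset.

```python
# Hypothetical marker set: characters frequent in written Cantonese
# and uncommon in Standard Written Chinese. Not taken from the article.
CANTONESE_MARKERS = set("嘅咗喺嚟啲嗰咁冇佢哋嘢睇")


def looks_cantonese(text: str, min_markers: int = 2) -> bool:
    """Return True if `text` contains at least `min_markers` marker characters."""
    count = sum(1 for ch in text if ch in CANTONESE_MARKERS)
    return count >= min_markers


lines = [
    "我哋今日去咗邊度食嘢？",  # Cantonese: contains 哋, 咗, 嘢
    "我们今天去了哪里吃饭？",  # Standard Written Chinese: no markers
]

# Keep only the lines the heuristic classifies as Cantonese.
filtered = [line for line in lines if looks_cantonese(line)]
print(filtered)
```

A heuristic like this is easy to audit but will misclassify short or mixed-language lines, which is one reason a trained language identifier may be preferable in practice.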

Topics

Cantonese
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
