indiejoseph/cc100-yue
Natural Language Processing | Cantonese
This dataset is a filtered subset of the CC100 corpus that retains only Cantonese content. It is intended to support natural language processing tasks such as text classification, sentiment analysis, named entity recognition, and machine translation. The filtering process references an article by ToastyNews.
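The description does not spell out the filter itself. As a rough illustration of how Cantonese lines might be separated from other Chinese text, here is a minimal sketch that scores each line by the share of characters typical of written Cantonese. The marker set, the threshold, and the function names are all assumptions made for this example; they are not taken from the dataset or from the ToastyNews article.

```python
# Hypothetical marker set: characters common in written Cantonese but rare in
# Standard Written Chinese. Illustrative only; not the dataset's actual list.
CANTONESE_MARKERS = frozenset("嘅咗喺唔冇哋嘢佢啲咁嗰乜")


def cantonese_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are markers."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if ch in CANTONESE_MARKERS)
    return hits / len(chars)


def looks_cantonese(text: str, threshold: float = 0.05) -> bool:
    """Treat a line as Cantonese when its marker ratio reaches the threshold.

    The 0.05 default is an arbitrary assumption chosen for this sketch.
    """
    return cantonese_ratio(text) >= threshold


def filter_cantonese(lines, threshold: float = 0.05):
    """Keep only the lines that pass the marker-ratio check."""
    return [line for line in lines if looks_cantonese(line, threshold)]


if __name__ == "__main__":
    sample = [
        "佢哋喺度食緊嘢",      # colloquial written Cantonese
        "他們正在這裡吃東西",  # Standard Written Chinese
        "Hello world",
    ]
    for line in filter_cantonese(sample):
        print(line)
```

A character-ratio heuristic like this is cheap to run over a large corpus, but it will miss Cantonese lines that happen to contain no marker characters, so a real pipeline would likely combine it with other signals.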
Source: hugging_face | Updated: Oct 17, 2023 | 280 views | Linked