Cleaning the Chinese portion of the Common Crawl corpus yielded a high-quality, 100 GB Chinese pre-training corpus. The dataset can be used directly for pre-training, language modeling, or language generation tasks. A small vocabulary built specifically for Simplified Chinese NLP tasks has also been released.
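The description does not say how the Common Crawl text was cleaned. As a rough illustration of one kind of filter such a pipeline might include, the sketch below keeps only lines that consist mostly of Chinese characters. The function names, the 0.5 ratio threshold, and the 10-character minimum are all hypothetical choices made for this example, not details of the dataset's actual cleaning procedure.

```python
def chinese_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that fall in the
    CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def keep_line(text: str, min_ratio: float = 0.5, min_chars: int = 10) -> bool:
    """Keep a line only if it is long enough and mostly Chinese.

    Both thresholds are illustrative assumptions, not values taken
    from the dataset's documentation.
    """
    stripped = text.strip()
    return len(stripped) >= min_chars and chinese_ratio(stripped) >= min_ratio


if __name__ == "__main__":
    samples = [
        "今天天气很好，我们一起去公园散步吧。",
        "Click here to subscribe to our newsletter",
        "你好",
    ]
    for line in samples:
        print(keep_line(line), line)
```

A character-ratio check like this is cheap and language-agnostic, but it does not distinguish Simplified from Traditional Chinese, and it does nothing about deduplication or content quality, which a real cleaning pipeline would need to handle separately.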