Chinese-web-novel
The dataset contains up to 25 chapters per book, crawled from https://m.bqgui.cc, for a total of 12,740 entries. After three rounds of cleaning, each entry consists of a book title, a summary, and the novel text. Titles are high quality and summaries have low usability. The novel text has had some ads and symbols removed but still contains low‑quality content.
Updated 10/16/2024
huggingface
Description
Dataset Overview
Basic Information
- License: Apache 2.0
- Language: Chinese
- Tags:
  - Art
  - Not suitable for all audiences
- Data Volume: 10K < n < 100K
Data Source
- Source website: https://m.bqgui.cc
- Number of entries: 12,740
- Scope: Up to 25 chapters per book
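The 25‑chapter cap can be illustrated with a minimal sketch. This is not the code from crawl.ipynb, which is not shown here. The function `crawl_book`, the `fetch` callback, the worker count, and the URL paths are all hypothetical, and `fake_fetch` stands in for a real HTTP request so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

MAX_CHAPTERS_PER_BOOK = 25  # cap stated in the dataset description


def crawl_book(
    chapter_urls: List[str],
    fetch: Callable[[str], str],
    max_workers: int = 8,  # assumed value, not taken from the dataset card
) -> List[str]:
    """Fetch at most MAX_CHAPTERS_PER_BOOK chapters, preserving chapter order."""
    capped = chapter_urls[:MAX_CHAPTERS_PER_BOOK]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, even if fetches finish out of order.
        return list(pool.map(fetch, capped))


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP request; no network access happens here.
    return f"text of {url}"


# Hypothetical chapter paths for a book with 40 chapters.
urls = [f"book/1/chapter/{i}" for i in range(1, 41)]
chapters = crawl_book(urls, fake_fetch)
print(len(chapters))  # 25
```

Passing the fetch function in as a parameter keeps the cap and ordering logic testable on its own, separate from whatever HTTP client the real crawler uses.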
Data Quality
- Book titles: Highest text quality, no ads
- Summaries: Low usability
- Novel text: Some symbols and ads filtered, but low‑quality content may still remain
Data Processing
- Crawling: Multi‑threaded crawling, see crawl.ipynb
- Cleaning: Primarily regex and string operations, see clean.ipynb
- Future plans: Aim to use LLMs or other tools for further cleaning
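A regex and string based cleaning pass of the kind described above might look like the following sketch. The actual rules live in clean.ipynb and are not reproduced here, so every pattern below, the `clean_text` function, and the sample text are assumptions chosen for illustration only.

```python
import re

# Hypothetical patterns; the real cleaning rules are in clean.ipynb.
AD_PATTERNS = [
    re.compile(r"https?://\S+"),           # bare URLs
    re.compile(r"www\.\S+"),               # URLs without a scheme
    re.compile(r"请收藏本站[^\n。]*。?"),    # "please bookmark this site" banners
]
REPEATED_SYMBOLS = re.compile(r"([*=\-_~#])\1{2,}")  # runs such as ***** or -----


def clean_text(text: str) -> str:
    """Remove ad-like fragments and symbol runs, then drop empty lines."""
    for pattern in AD_PATTERNS:
        text = pattern.sub("", text)
    text = REPEATED_SYMBOLS.sub("", text)
    # Plain string operations: trim each line and discard the ones left empty.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = "第一章 开始\n*****\n他推开门。请收藏本站:https://m.example.com\n\n\n夜色很深。"
print(clean_text(sample))
```

Rules like these only catch patterns that were anticipated in advance, which is consistent with the note that low‑quality content may remain after cleaning.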
Topics
Web Novels
Text Data
Source
Organization: huggingface
Created: 10/16/2024