Chinese-web-novel

The dataset was built by crawling up to 25 chapters per book from https://m.bqgui.cc, yielding 12,740 entries. After three rounds of cleaning, each entry contains the book title, a summary, and the novel text. Titles are high quality, summaries have low usability, and the novel text has had some ads and symbols removed but still contains low‑quality content.

Updated 10/16/2024

Description

Dataset Overview

Basic Information

License: Apache 2.0
Language: Chinese
Tags:
- Art
- Not suitable for all audiences
Data Volume: 10K < n < 100K

Data Source

Source website: https://m.bqgui.cc
Number of entries: 12,740
Scope: Up to 25 chapters per book
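
The 25-chapter cap combined with multi-threaded crawling might look roughly like the sketch below. It is not the code in crawl.ipynb: the function names, the `fetch` callable, and the worker count are all hypothetical, and the network call is injected so the example runs without touching the source website.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

MAX_CHAPTERS = 25  # cap stated in the dataset description


def crawl_book(chapter_urls: List[str], fetch: Callable[[str], str]) -> List[str]:
    """Fetch at most MAX_CHAPTERS chapters of one book, in order."""
    return [fetch(url) for url in chapter_urls[:MAX_CHAPTERS]]


def crawl_books(
    books: Dict[str, List[str]],
    fetch: Callable[[str], str],
    workers: int = 8,  # assumed value, not taken from the source
) -> Dict[str, List[str]]:
    """Crawl several books in parallel, one book per worker thread."""
    titles = list(books)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with titles.
        results = pool.map(lambda title: crawl_book(books[title], fetch), titles)
        return dict(zip(titles, results))


if __name__ == "__main__":
    # A stand-in fetch function; a real crawler would issue an HTTP request here.
    demo = {"book-a": [f"chapter-{i}" for i in range(40)], "book-b": ["chapter-0"]}
    crawled = crawl_books(demo, fetch=lambda url: f"text of {url}")
    print({title: len(chapters) for title, chapters in crawled.items()})
```

Parallelising per book rather than per chapter keeps each book's chapters in order without extra bookkeeping; a real crawler would also need retries and rate limiting.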

Data Quality

Book titles: Highest text quality, no ads
Summaries: Low usability
Novel text: Some symbols and ads filtered, but low‑quality content may still remain

Data Processing

Crawling: Multi‑threaded crawling, see crawl.ipynb
Cleaning: Primarily regex and string operations, see clean.ipynb
Future plans: Aim to use LLMs or other tools for further cleaning
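
As an illustration of what a regex-and-string cleaning pass can look like, here is a minimal sketch. The patterns and the `clean_text` name are assumptions made for this example; the actual rules are in clean.ipynb and are not reproduced here.

```python
import re

# Hypothetical patterns for illustration; the real rules live in clean.ipynb.
AD_PATTERNS = [
    # URLs, restricted to ASCII so adjacent Chinese text is not swallowed.
    re.compile(r"https?://[A-Za-z0-9./_\-?=&%]+"),
    # Bare "www." domains left without a scheme.
    re.compile(r"www\.[A-Za-z0-9.\-/]+"),
]
# Runs of three or more decorative symbols, e.g. "*****" or "-----".
SYMBOL_RUN = re.compile(r"[*#=~_\-]{3,}")


def clean_text(text: str) -> str:
    """Remove URL-like ads and symbol runs, then drop empty lines."""
    for pattern in AD_PATTERNS:
        text = pattern.sub("", text)
    text = SYMBOL_RUN.sub("", text)
    # Plain string operations: trim each line and discard blank ones.
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    raw = "第一章\n\n***---***\n请访问 https://m.bqgui.cc/abc 阅读\n  正文  "
    print(clean_text(raw))
```

Rules of this kind only catch ads that follow a recognisable surface pattern, which is consistent with the note that low‑quality content may remain after cleaning.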

Topics

Web Novels

Text Data

Source

Organization: huggingface

Created: 10/16/2024
