
Chinese-web-novel

The dataset was built by crawling up to 25 chapters per book from https://m.bqgui.cc, yielding 12,740 entries. After three rounds of cleaning, each entry contains the book title, summary, and novel text. Titles are high quality, summaries have low usability, and the novel text has had some ads and symbols removed but still contains low‑quality content.

Source: huggingface
Created: Oct 16, 2024
Updated: Oct 16, 2024

Dataset Overview

Basic Information

  • License: Apache 2.0
  • Language: Chinese
  • Tags:
    • Art
    • Not suitable for all audiences
  • Data Volume: 10K < n < 100K

Data Source

  • Source website: https://m.bqgui.cc
  • Number of entries: 12,740
  • Scope: Up to 25 chapters per book

Data Quality

  • Book titles: Highest text quality, no ads
  • Summaries: Low usability
  • Novel text: Some symbols and ads filtered, but low‑quality content may still remain

Data Processing

  • Crawling: Multi‑threaded crawling, see crawl.ipynb
  • Cleaning: Primarily regex and string operations, see clean.ipynb
  • Future plans: Aim to use LLMs or other tools for further cleaning
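
A regex-and-string cleaning pass of the kind described above might look like the sketch below. The actual rules are in clean.ipynb and are not reproduced here; every pattern and the function name `clean_text` are assumptions chosen for illustration.

```python
import re

# Hypothetical patterns; the dataset's real cleaning rules are not shown in this card.
# The URL pattern is limited to ASCII URL characters so that Chinese text
# directly following a link is not swallowed.
URL = re.compile(r"(?:https?://|www\.)[A-Za-z0-9./_%?=&#-]+")
SYMBOL_RUN = re.compile(r"[*#=~_-]{3,}")   # decorative runs such as ***** or -----
SPACES = re.compile(r"[ \t\u3000]+")       # ASCII and full-width spaces


def clean_text(text: str) -> str:
    """Strip URLs and symbol runs, collapse spaces, and drop empty lines."""
    text = URL.sub("", text)
    text = SYMBOL_RUN.sub("", text)
    lines = [SPACES.sub(" ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Rules like these remove only surface noise, which is consistent with the note that low‑quality content may remain after cleaning.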