Dataset asset | Open Source Community | Social Science Research | Chinese Text Analysis

CASSum

A Chinese long‑text dataset of original articles paired with their abstracts, drawn primarily from social science academic papers. The data were sourced from the websites of five institutes under the Chinese Academy of Social Sciences and cleaned by deduplication, removal of foreign‑language passages, blank lines, and excess whitespace, and other preprocessing steps.

Source: GitHub
Created: Aug 17, 2023
Updated: Aug 17, 2023

CASSum Dataset Overview

Description

CASSum is a Chinese long‑text dataset containing the full texts of social science academic papers together with their abstracts. The data originate from the websites of five bodies under the Chinese Academy of Social Sciences: the Law Institute, the Historical Research Institute, the Philosophy Institute, the Literature Network, and the Industrial Economics Research Institute.

Processing Steps

  1. Deduplication
  2. Removal of foreign‑language passages
  3. Elimination of blank lines and excess whitespace
  4. Removal of headings such as "Content Summary"
  5. Discard abstracts shorter than 20 characters
  6. Discard full texts shorter than 200 characters
  7. Remove entries where the abstract‑to‑text length ratio is above 0.15
  8. Exclude incomplete full texts
  9. Manual review of abnormal abstract‑to‑text ratios
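
Some of the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the maintainers' actual pipeline: the function names are hypothetical, duplicates are assumed to mean exact matches after whitespace cleanup, and "characters" are assumed to be counted with `len()`. Only steps 1, 3, 5, and 6 are implemented; `length_ratio` merely computes the value that step 7 compares against 0.15.

```python
import re

MIN_SUMMARY_CHARS = 20   # step 5 threshold
MIN_TEXT_CHARS = 200     # step 6 threshold


def normalize_whitespace(s: str) -> str:
    """Step 3: drop blank lines and collapse runs of spaces, tabs, and full-width spaces."""
    lines = (re.sub(r"[ \t\u3000]+", " ", line).strip() for line in s.splitlines())
    return "\n".join(line for line in lines if line)


def length_ratio(record: dict) -> float:
    """Abstract length divided by full-text length, in characters (assumes non-empty text)."""
    return len(record["summary"]) / len(record["text"])


def clean(records):
    """Yield records that survive whitespace cleanup, deduplication, and both length thresholds."""
    seen = set()
    for rec in records:
        text = normalize_whitespace(rec["text"])
        summary = normalize_whitespace(rec["summary"])
        key = (text, summary)
        if key in seen:  # step 1: exact duplicates only (an assumed criterion)
            continue
        seen.add(key)
        if len(summary) < MIN_SUMMARY_CHARS or len(text) < MIN_TEXT_CHARS:
            continue  # steps 5 and 6
        yield {**rec, "text": text, "summary": summary}
```

Foreign‑language removal, heading removal, completeness checks, and manual review (steps 2, 4, 8, and 9) depend on details the description does not give, so they are left out of the sketch.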

Statistics

  • Samples: 3,061
  • Average full‑text length: 10,746.70 characters
  • Average abstract length: 205.27 characters
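
Figures of this kind could be recomputed from loaded records roughly as follows. The sketch assumes each record is a dict with `text` and `summary` keys and that length means `len()` of the Python string (Unicode code points); the description does not say how characters were counted, so small differences from the published numbers are possible. The name `describe` is hypothetical.

```python
from statistics import mean


def describe(records):
    """Return the sample count and average lengths, rounded to two decimals."""
    records = list(records)  # materialize so the data can be traversed more than once
    return {
        "samples": len(records),
        "avg_text_chars": round(mean(len(r["text"]) for r in records), 2),
        "avg_summary_chars": round(mean(len(r["summary"]) for r in records), 2),
    }
```

`statistics.mean` raises `StatisticsError` on empty input, so the function expects at least one record.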

Sample Format

The dataset file dataset.jl follows JSON‑Lines format, with each line containing:

  • url: source link
  • text: full article text
  • summary: abstract

Example (pretty‑printed here for readability; in dataset.jl each record occupies a single line):

{
    "url": "http://iolaw.cssn.cn/zxzp/202212/t20221208_5569568.shtml",
    "text": "Full article content...",
    "summary": "Abstract content..."
}
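
A file in this format can be read one record at a time with the standard library. The following is a minimal sketch; the helper name `read_jsonl` is hypothetical and UTF‑8 encoding is an assumption, since the description does not state the file's encoding.

```python
import json
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON-Lines file."""
    with open(path, encoding="utf-8") as f:  # UTF-8 is assumed, not confirmed by the source
        for line in f:
            line = line.strip()
            if line:  # tolerate stray blank lines
                yield json.loads(line)
```

Typical use would be `for record in read_jsonl("dataset.jl"): ...`, reading `record["url"]`, `record["text"]`, and `record["summary"]`. Streaming line by line avoids holding every full text in memory at once.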