CASSum

CASSum is a Chinese long-text dataset that pairs original articles with their abstracts, drawn primarily from social science academic papers. The data were collected from the websites of five institutes under the Chinese Academy of Social Sciences and cleaned by deduplication, removal of foreign-language passages, blank lines, and excess whitespace, and other preprocessing steps.

Source

GitHub

Created

Aug 17, 2023

Updated

Aug 17, 2023

CASSum Dataset Overview

Description

CASSum is a Chinese long‑text dataset containing full texts of academic papers in the social sciences and their abstracts. The data originate from five departments of the Chinese Academy of Social Sciences: Law Institute, Historical Research Institute, Philosophy Institute, Literature Network, and Industrial Economics Research Institute.

Processing Steps

Deduplicate entries
Remove foreign-language passages
Remove blank lines and excess whitespace
Remove headings such as "Content Summary"
Discard abstracts shorter than 20 characters
Discard full texts shorter than 200 characters
Remove entries where the abstract-to-text length ratio exceeds 0.15
Exclude incomplete full texts
Manually review entries with abnormal abstract-to-text length ratios
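
The mechanical part of these steps can be approximated in a few lines. The following is a minimal sketch, not the dataset authors' actual pipeline: the function names, the exact-match deduplication on the cleaned full text, and the whitespace rules are all assumptions. Only the two length thresholds (20 and 200 characters) come from the list above. Foreign-language removal, heading removal, the ratio filter, and manual review are left out.

```python
import re

MIN_SUMMARY_CHARS = 20   # threshold taken from the processing steps
MIN_TEXT_CHARS = 200     # threshold taken from the processing steps


def normalize_whitespace(s: str) -> str:
    """Collapse runs of spaces/tabs/ideographic spaces and drop blank lines."""
    lines = (re.sub(r"[ \t\u3000]+", " ", line).strip() for line in s.splitlines())
    return "\n".join(line for line in lines if line)


def clean_records(records):
    """Yield cleaned records that pass the length filters, skipping duplicates.

    Hypothetical helper: deduplication here is an exact match on the
    cleaned full text, which is an assumption about how duplicates are defined.
    """
    seen = set()
    for rec in records:
        text = normalize_whitespace(rec["text"])
        summary = normalize_whitespace(rec["summary"])
        # len() counts Unicode code points, so each Chinese character counts as 1.
        if len(summary) < MIN_SUMMARY_CHARS or len(text) < MIN_TEXT_CHARS:
            continue
        if text in seen:
            continue
        seen.add(text)
        yield {"url": rec["url"], "text": text, "summary": summary}
```

Writing `clean_records` as a generator keeps memory use bounded by the set of seen texts rather than by the whole output list.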

Statistics

Samples: 3,061
Average full‑text length: 10,746.70 characters
Average abstract length: 205.27 characters
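
Figures of this kind can be recomputed from the records themselves. The sketch below assumes each record is a dict with `text` and `summary` keys and that "length" means the number of Unicode characters; `length_stats` is a hypothetical helper name, not part of the dataset.

```python
from statistics import mean


def length_stats(records):
    """Return the sample count and average text/summary lengths in characters."""
    records = list(records)  # materialize so the records can be traversed twice
    return {
        "samples": len(records),
        "avg_text_len": mean(len(r["text"]) for r in records),
        "avg_summary_len": mean(len(r["summary"]) for r in records),
    }
```

Whether the published averages were computed on the cleaned or the raw text is not stated, so small differences from the numbers above would not be surprising.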

Sample Format

The dataset file dataset.jl is in JSON Lines format: each line is one JSON object with the following fields:

url: source link
text: full article text
summary: abstract

Example (pretty-printed here for readability; in the file each record occupies a single line):

{
    "url": "http://iolaw.cssn.cn/zxzp/202212/t20221208_5569568.shtml",
    "text": "Full article content...",
    "summary": "Abstract content..."
}
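
A file in this format can be read one record at a time with the standard library. This is a minimal sketch: `read_jsonl` is a hypothetical helper name, and UTF-8 encoding is an assumption (it is the usual choice for JSON Lines, but the source does not state it).

```python
import json


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:  # encoding is an assumption
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)
```

For example, `for rec in read_jsonl("dataset.jl"): print(rec["url"], len(rec["text"]))` would list each source link with the length of its full text, without loading the whole file into memory.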
