CASSum
A Chinese long-text dataset of original articles paired with their abstracts, drawn primarily from academic papers in the social sciences. The data were collected from the websites of five institutes under the Chinese Academy of Social Sciences and cleaned by deduplication, removal of foreign-language passages, blank lines, and excess whitespace, and other preprocessing steps.
CASSum Dataset Overview
Description
CASSum is a Chinese long‑text dataset containing full texts of academic papers in the social sciences and their abstracts. The data originate from five departments of the Chinese Academy of Social Sciences: Law Institute, Historical Research Institute, Philosophy Institute, Literature Network, and Industrial Economics Research Institute.
Processing Steps
- Deduplication
- Removal of foreign‑language passages
- Elimination of blank lines and excess whitespace
- Removal of headings such as "Content Summary"
- Discard entries whose abstract is shorter than 20 characters
- Discard entries whose full text is shorter than 200 characters
- Remove entries where the abstract-to-text length ratio exceeds 0.15
- Exclude incomplete full texts
- Manual review of abnormal abstract‑to‑text ratios
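The mechanical steps in this list can be sketched in Python. This is a minimal illustration, not the pipeline the dataset authors used: the function names `normalize_whitespace` and `clean_records` are hypothetical, deduplication is assumed to be exact matching on the cleaned full text, and "characters" is assumed to mean Unicode code points as counted by `len()`. Foreign-language removal, heading removal, the ratio filter, the completeness check, and manual review are left out.

```python
import re

MIN_SUMMARY_CHARS = 20   # threshold taken from the list above
MIN_TEXT_CHARS = 200     # threshold taken from the list above


def normalize_whitespace(s: str) -> str:
    """Drop blank lines and collapse runs of spaces/tabs inside each line."""
    # \u3000 is the full-width space that is common in Chinese text.
    lines = (re.sub(r"[ \t\u3000]+", " ", line).strip() for line in s.splitlines())
    return "\n".join(line for line in lines if line)


def clean_records(records):
    """Yield cleaned records, skipping too-short entries and duplicates."""
    seen = set()
    for rec in records:
        text = normalize_whitespace(rec["text"])
        summary = normalize_whitespace(rec["summary"])
        if len(summary) < MIN_SUMMARY_CHARS or len(text) < MIN_TEXT_CHARS:
            continue
        if text in seen:  # assumption: exact-match dedup on the cleaned text
            continue
        seen.add(text)
        yield {"url": rec["url"], "text": text, "summary": summary}
```

Writing the filter as a generator keeps memory use low apart from the `seen` set, which holds one full text per retained record.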
Statistics
- Samples: 3,061
- Average full‑text length: 10,746.70 characters
- Average abstract length: 205.27 characters
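Figures of this kind can be recomputed from the records themselves. The sketch below assumes each record is a dict with `text` and `summary` keys and that lengths are counted with `len()`, i.e. in Unicode code points; the helper name `length_stats` is hypothetical, and the counting method behind the published numbers is not stated in the source.

```python
from statistics import mean


def length_stats(records):
    """Return the sample count and mean text/summary lengths in characters."""
    records = list(records)  # allow generators; the data is traversed twice
    return {
        "samples": len(records),
        "avg_text_chars": round(mean(len(r["text"]) for r in records), 2),
        "avg_summary_chars": round(mean(len(r["summary"]) for r in records), 2),
    }
```

`statistics.mean` raises `StatisticsError` on empty input, so an empty dataset fails loudly rather than reporting zeros.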
Sample Format
The dataset file dataset.jl follows JSON‑Lines format, with each line containing:
- url: source link
- text: full article text
- summary: abstract
Example:
{
"url": "http://iolaw.cssn.cn/zxzp/202212/t20221208_5569568.shtml",
"text": "Full article content...",
"summary": "Abstract content..."
}
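A file in this format can be read one record at a time with the standard library. This is a minimal sketch that assumes each record sits on a single line of `dataset.jl` (the example above being wrapped only for readability) and that the file is UTF-8 encoded; the helper name `read_jsonl` is hypothetical.

```python
import json


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate stray blank lines
                yield json.loads(line)
```

For example, `for rec in read_jsonl("dataset.jl"): print(rec["url"], len(rec["text"]))` would list each source link with the length of its full text.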