CASSum
Chinese long‑text dataset consisting of original articles and their abstracts, primarily from social science academic papers. The data were sourced from the websites of five institutes under the Chinese Academy of Social Sciences and cleaned by deduplication, removal of foreign‑language passages, removal of blank lines and excess whitespace, and other preprocessing steps.
CASSum Dataset Overview
Description
CASSum is a Chinese long‑text dataset containing full texts of academic papers in the social sciences and their abstracts. The data originate from five departments of the Chinese Academy of Social Sciences: Law Institute, Historical Research Institute, Philosophy Institute, Literature Network, and Industrial Economics Research Institute.
Processing Steps
- Deduplication
- Removal of foreign‑language passages
- Elimination of blank lines and excess whitespace
- Removal of headings such as "Content Summary"
- Discard abstracts shorter than 20 characters
- Discard full texts shorter than 200 characters
- Remove entries where the abstract‑to‑text length ratio is above 0.15
- Exclude incomplete full texts
- Manual review of abnormal abstract‑to‑text ratios
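Several of the automatic steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the dataset authors' pipeline: the names `normalize_whitespace`, `keep_record`, and `clean` are hypothetical, lengths are counted with `len()` on Python strings (Unicode code points), and deduplication is assumed to be an exact match on the cleaned full text. The direction of the ratio filter is also an assumption: the sketch drops records whose abstract is longer than 0.15 times the full text, which is the reading consistent with the reported averages (205.27 / 10,746.70 ≈ 0.019).

```python
import re

MIN_SUMMARY_CHARS = 20   # abstracts shorter than this are discarded
MIN_TEXT_CHARS = 200     # full texts shorter than this are discarded
MAX_RATIO = 0.15         # assumed: drop records whose ratio is ABOVE this value


def normalize_whitespace(s: str) -> str:
    """Collapse runs of spaces, tabs, and full-width spaces; drop blank lines."""
    lines = (re.sub(r"[ \t\u3000]+", " ", line).strip() for line in s.splitlines())
    return "\n".join(line for line in lines if line)


def keep_record(text: str, summary: str) -> bool:
    """Apply the length thresholds and the abstract-to-text ratio filter."""
    if len(summary) < MIN_SUMMARY_CHARS or len(text) < MIN_TEXT_CHARS:
        return False
    return len(summary) / len(text) <= MAX_RATIO


def clean(records):
    """Normalize whitespace, drop exact duplicates, then apply the filters."""
    seen = set()
    kept = []
    for rec in records:
        text = normalize_whitespace(rec["text"])
        summary = normalize_whitespace(rec["summary"])
        if text in seen:  # assumed: exact-match deduplication on full text
            continue
        seen.add(text)
        if keep_record(text, summary):
            kept.append({"url": rec["url"], "text": text, "summary": summary})
    return kept
```

Removal of foreign‑language passages, detection of incomplete full texts, and the manual review step are not covered here, since the source does not describe how they were done.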
Statistics
- Samples: 3,061
- Average full‑text length: 10,746.70 characters
- Average abstract length: 205.27 characters
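Figures like these can be recomputed from the loaded records. The sketch below assumes each record is a dict with `text` and `summary` keys and that length means the number of Unicode code points as returned by `len()`; whether the authors counted characters the same way is not stated. The function name `length_stats` is hypothetical.

```python
from statistics import mean


def length_stats(records):
    """Return the sample count and average lengths, rounded to two decimals."""
    text_lengths = [len(r["text"]) for r in records]
    summary_lengths = [len(r["summary"]) for r in records]
    return {
        "samples": len(records),
        "avg_text_chars": round(mean(text_lengths), 2),
        "avg_summary_chars": round(mean(summary_lengths), 2),
    }
```

Note that `statistics.mean` raises `StatisticsError` on an empty list, so the function expects at least one record.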
Sample Format
The dataset file dataset.jl follows JSON‑Lines format, with each line containing:
- url: source link
- text: full article text
- summary: abstract
Example:
{
"url": "http://iolaw.cssn.cn/zxzp/202212/t20221208_5569568.shtml",
"text": "Full article content...",
"summary": "Abstract content..."
}
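A file in this format can be read with Python's standard `json` module. The example record is shown across several lines for readability; in a JSON Lines file each record occupies a single line. The sketch below is one possible reader, not an official loader: the function name `parse_jsonl` is hypothetical, and UTF-8 encoding of `dataset.jl` is an assumption.

```python
import json

REQUIRED_KEYS = {"url", "text", "summary"}


def parse_jsonl(lines):
    """Yield one dict per non-empty line, checking that the three fields exist."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        yield record


# Usage (assumes dataset.jl is in the working directory and UTF-8 encoded):
#     with open("dataset.jl", encoding="utf-8") as f:
#         records = list(parse_jsonl(f))
```

Taking an iterable of lines rather than a path keeps the parser usable with open files, in-memory lists, or streamed responses.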
Source
Organization: github
Created: 8/17/2023