CASSum

CASSum is a Chinese long‑text dataset of full articles paired with their abstracts, drawn primarily from social science academic papers. The data were collected from the websites of five institutes under the Chinese Academy of Social Sciences and cleaned by deduplication, removal of foreign‑language passages, removal of blank lines and excess whitespace, and other preprocessing steps.

Updated 8/17/2023

CASSum Dataset Overview

Description

CASSum is a Chinese long‑text dataset that pairs the full texts of social science academic papers with their abstracts. The data originate from the websites of five departments of the Chinese Academy of Social Sciences: Law Institute, Historical Research Institute, Philosophy Institute, Literature Network, and Industrial Economics Research Institute.

Processing Steps

  1. Deduplication
  2. Removal of foreign‑language passages
  3. Elimination of blank lines and excess whitespace
  4. Removal of headings such as "Content Summary"
  5. Discard abstracts shorter than 20 characters
  6. Discard full texts shorter than 200 characters
  7. Remove entries where the abstract‑to‑text length ratio is above 0.15
  8. Exclude incomplete full texts
  9. Manual review of abnormal abstract‑to‑text ratios
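
The length-based steps (3, 5, 6, and 7) can be sketched as below. This is a minimal illustration, not the authors' actual cleaning script: the function and constant names are hypothetical, lengths are counted with Python's len (Unicode code points), and the ratio test drops records whose abstract‑to‑text ratio is above 0.15. That direction is an assumption, chosen because the reported averages (about 205 versus 10,747 characters) imply a typical ratio near 0.02.

```python
import re

# Thresholds taken from the processing steps; names are hypothetical.
MIN_SUMMARY_CHARS = 20   # step 5
MIN_TEXT_CHARS = 200     # step 6
MAX_RATIO = 0.15         # step 7 (comparison direction is an assumption)


def normalize_whitespace(s):
    """Collapse runs of spaces/tabs/ideographic spaces and drop blank lines (step 3)."""
    lines = (re.sub(r"[ \t\u3000]+", " ", line).strip() for line in s.splitlines())
    return "\n".join(line for line in lines if line)


def keep_record(record):
    """Return True if a record passes the length and ratio filters."""
    text = normalize_whitespace(record["text"])
    summary = normalize_whitespace(record["summary"])
    if len(summary) < MIN_SUMMARY_CHARS:
        return False
    if len(text) < MIN_TEXT_CHARS:
        return False
    # Assumed rule: an abstract longer than 15% of the article is treated as abnormal.
    return len(summary) / len(text) <= MAX_RATIO
```

Steps that need judgment or language detection (deduplication, foreign‑language removal, completeness checks, manual review) are deliberately left out of this sketch.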

Statistics

  • Samples: 3,061
  • Average full‑text length: 10,746.70 characters
  • Average abstract length: 205.27 characters
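
Figures like these could be recomputed from the loaded records roughly as follows. This is a sketch under assumptions: length_stats is a hypothetical helper, each record is a dict with text and summary keys as described under Sample Format, and "characters" is taken to mean Python string length, which the source does not specify.

```python
from statistics import mean


def length_stats(records):
    """Compute sample count and average lengths (in characters) for a list of records."""
    return {
        "samples": len(records),
        "avg_text_chars": round(mean(len(r["text"]) for r in records), 2),
        "avg_summary_chars": round(mean(len(r["summary"]) for r in records), 2),
    }
```

If the counting convention matches the one used by the dataset authors, running this over the full file should reproduce the numbers above; a mismatch would suggest a different definition of length.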

Sample Format

The dataset file dataset.jl is in JSON Lines format: each line is a single JSON object with the following fields:

  • url: source link
  • text: full article text
  • summary: abstract

Example record (pretty‑printed here for readability; in the file each record occupies a single line):

{
    "url": "http://iolaw.cssn.cn/zxzp/202212/t20221208_5569568.shtml",
    "text": "Full article content...",
    "summary": "Abstract content..."
}
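
A file in this format can be read one record at a time with the standard library. The sketch below is illustrative only: iter_records is a hypothetical helper, and UTF‑8 encoding in the usage note is an assumption, since the source does not state the file's encoding.

```python
import json


def iter_records(lines):
    """Yield one dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate stray blank lines
            yield json.loads(line)
```

Typical usage would be to open dataset.jl with open("dataset.jl", encoding="utf-8") and pass the file object to iter_records, then read record["url"], record["text"], and record["summary"] from each result. Iterating lazily avoids holding every full text in memory at once.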

Topics

Chinese Text Analysis
Social Science Research

Source

Organization: github

Created: 8/17/2023
