launch/gov_report

The GovReport dataset comprises reports and their abstracts authored by U.S. government research agencies, including the Congressional Research Service and the Government Accountability Office. Compared with other long-document summarization datasets, GovReport has longer documents and longer summaries, so more of the source must be read to cover the key points of each summary. It ships in three configurations that correspond to different data formats: plain_text (default), plain_text_with_recommendations, and structure. The language is English, the size category is 10K to 100K examples, and the license is CC BY 4.0.

Source

hugging_face

Created

Nov 28, 2025

Updated

Nov 9, 2022

Overview

Dataset Name

GovReport

Dataset Summary

Content Source: The dataset contains reports and their corresponding abstracts written by U.S. government research agencies such as the Congressional Research Service and the Government Accountability Office.
Characteristics: Compared with other long-document summarization datasets, GovReport has longer documents and longer summaries, so more of the source must be read to cover the key points of a summary.

Version Information

Version: 1.0.1 (default) removes extra spaces from the text; 1.0.0 is the original dataset used in the paper.

Supported Tasks

Task: Summarization

Language

Language: English

Dataset Structure

Configurations:
- plain_text (default): Original text‑to‑text summarization setup used in the paper.
- plain_text_with_recommendations: Includes "GAO recommendations" in the text‑to‑text setup.
- structure: Contains partially structured data.
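
As a minimal sketch of how one of the three configurations might be selected, the helper below validates a configuration name and then defers to the Hugging Face `datasets` library. It assumes the dataset is published under the `launch/gov_report` identifier and that `datasets` is installed. `CONFIGS` and `load_govreport` are hypothetical names introduced here for illustration. Some library versions may need extra arguments for this dataset, which has not been verified here.

```python
# Configuration names as listed in this card.
CONFIGS = ("plain_text", "plain_text_with_recommendations", "structure")


def load_govreport(config="plain_text", split="train"):
    """Load one GovReport configuration (hypothetical helper, not a library API)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check above works without the library installed.
    from datasets import load_dataset

    # Assumption: the dataset identifier is "launch/gov_report".
    return load_dataset("launch/gov_report", config, split=split)
```

Calling `load_govreport("structure", split="validation")` would request the structured variant of the validation split; the call needs network access on first use.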

Data Fields

plain_text and plain_text_with_recommendations:
- id: string
- document: string
- summary: string

structure:
- id: string
- document_sections: dict of lists (title, paragraphs, depth)
- summary_sections: dict of lists (title, paragraphs)
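
To illustrate how a record from the structure configuration might be turned back into plain text, here is a small sketch. It assumes that `title`, `paragraphs`, and `depth` are parallel lists with one entry per section, that each `paragraphs` entry is a single string, and that `depth` is a 1-based nesting level. These are assumptions about the layout, not confirmed details. `flatten_sections` is a hypothetical helper and the sample record is invented.

```python
def flatten_sections(sections):
    """Render parallel title/paragraphs/depth lists as indented plain text."""
    titles = sections["title"]
    # summary_sections has no depth field, so treat every section as top level.
    depths = sections.get("depth") or [1] * len(titles)
    lines = []
    for title, paragraphs, depth in zip(titles, sections["paragraphs"], depths):
        indent = "  " * (depth - 1)
        if title:
            lines.append(f"{indent}{title}")
        if paragraphs:
            lines.append(f"{indent}{paragraphs}")
    return "\n".join(lines)


# Toy record, invented for illustration; not taken from the dataset.
example = {
    "title": ["Background", "Funding"],
    "paragraphs": ["Agencies report annually.", "Funds are appropriated by Congress."],
    "depth": [1, 2],
}
text = flatten_sections(example)
```

The same function accepts a `summary_sections`-style dict, because a missing `depth` falls back to top-level sections.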

Splits

Training: 17,519
Validation: 974
Test: 973
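
The split sizes above can be checked with a few lines of arithmetic. The sketch below sums the three counts and prints each split's share. The counts are copied from this card; the variable names are illustrative.

```python
# Split sizes as listed in this card.
SPLITS = {"train": 17519, "validation": 974, "test": 973}

total = sum(SPLITS.values())
shares = {name: count / total for name, count in SPLITS.items()}

for name, count in SPLITS.items():
    print(f"{name:<10} {count:>6,} {shares[name]:6.1%}")
print(f"{'total':<10} {total:>6,}")
```

The total is 19,466 examples, which falls in the 10K to 100K size category, and the split is roughly 90/5/5.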

License

License: CC BY 4.0

Citation

@inproceedings{huang-etal-2021-efficient,
    title = "Efficient Attentions for Long Document Summarization",
    author = "Huang, Luyang and Cao, Shuyang and Parulian, Nikolaus and Ji, Heng and Wang, Lu",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.112",
    doi = "10.18653/v1/2021.naacl-main.112",
    pages = "1419--1436",
    abstract = "The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder‑decoder attention with head‑wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self‑attentions. Combined with Hepos, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state‑of‑the‑art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.",
}
