Dataset asset, Open Source Community, Government Research, Public Policy
launch/gov_report
The GovReport dataset comprises reports and their summaries written by U.S. government research agencies, including the Congressional Research Service and the Government Accountability Office. Compared with other long‑document summarization datasets, GovReport has longer documents and longer summaries, so a model must read more context to cover the salient content. It provides three configurations that differ in data format: plain_text (default), plain_text_with_recommendations, and structure. The language is English; the dataset falls in the 10K to 100K examples size category; the license is CC BY 4.0.
Source
hugging_face
Created
Nov 28, 2025
Updated
Nov 9, 2022
Overview
Dataset description and usage context
Dataset Overview
Dataset Name
- GovReport
Dataset Summary
- Content Source: The dataset contains reports and corresponding abstracts written by government research agencies such as the Congressional Research Service and the Government Accountability Office.
- Characteristics: Compared with other long‑document summarization datasets, GovReport has longer documents and longer summaries, so more context must be read to cover the salient content to be summarized.
Version Information
- Version: 1.0.1 (default) is the original data with extra spaces removed; 1.0.0 is the original dataset used in the paper.
Supported Tasks
- Task: Summarization
Language
- Language: English
Dataset Structure
- Configurations:
  - plain_text (default): Original text‑to‑text summarization setup used in the paper.
  - plain_text_with_recommendations: Includes "GAO recommendations" in the text‑to‑text setup.
  - structure: Contains partially structured data.
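As a minimal sketch of how one of the three configurations above might be selected, the snippet below wraps `datasets.load_dataset` with a name check. The repository id `launch/gov_report` is taken from this page; the helper name `load_gov_report` and the constant `VALID_CONFIGS` are hypothetical. Whether the call succeeds may depend on the installed `datasets` version and on network access, so treat it as an assumption to verify rather than a guaranteed recipe.

```python
# Configuration names as listed on this page.
VALID_CONFIGS = ("plain_text", "plain_text_with_recommendations", "structure")


def load_gov_report(config="plain_text", split="train"):
    """Load one GovReport configuration (hypothetical helper, not part of any library)."""
    if config not in VALID_CONFIGS:
        raise ValueError(
            f"unknown config {config!r}; expected one of {VALID_CONFIGS}"
        )
    # Imported lazily so the name check above works without the library installed.
    from datasets import load_dataset

    # Downloads from the Hugging Face Hub; requires network access.
    return load_dataset("launch/gov_report", config, split=split)
```

A call such as `load_gov_report("structure", split="validation")` would then request the partially structured variant, while a misspelled configuration name fails early with a `ValueError`.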
Data Fields
- plain_text & plain_text_with_recommendations:
  - id: string
  - document: string
  - summary: string
- structure:
  - id: string
  - document_sections: dict with title, paragraphs, depth lists
  - summary_sections: dict with title, paragraphs lists
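The structure fields above can be turned back into flat text for a text‑to‑text model. The sketch below assumes that `title`, `paragraphs`, and `depth` are parallel lists with one entry per section, and that each `paragraphs` entry is a single string; both are assumptions about the layout, not confirmed by this page. The helper `flatten_sections` and the toy record are invented for illustration and are not taken from the dataset.

```python
def flatten_sections(sections):
    """Join parallel title/paragraph lists into one text, marking titles by depth."""
    titles = sections["title"]
    paragraphs = sections["paragraphs"]
    # summary_sections has no depth list, so default every section to depth 1.
    depths = sections.get("depth", [1] * len(titles))
    parts = []
    for title, body, depth in zip(titles, paragraphs, depths):
        if title:
            parts.append("#" * depth + " " + title)
        if body:
            parts.append(body)
    return "\n".join(parts)


# Toy record, invented for illustration; not real GovReport data.
toy = {
    "id": "example-0",
    "document_sections": {
        "title": ["Background", "Funding"],
        "paragraphs": ["The program began in 2010.", "Funding rose each year."],
        "depth": [1, 2],
    },
    "summary_sections": {
        "title": ["What GAO Found"],
        "paragraphs": ["Costs increased."],
    },
}

document = flatten_sections(toy["document_sections"])
summary = flatten_sections(toy["summary_sections"])
```

Marking section depth in the flattened text is one design choice among several; dropping titles entirely would be closer to the plain_text configurations.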
Splits
- Training: 17,519
- Validation: 974
- Test: 973
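For a quick sense of how the examples are distributed, the split counts listed above can be summed and converted to shares. The counts come from this page; the variable names are illustrative.

```python
# Split sizes as listed above.
splits = {"train": 17519, "validation": 974, "test": 973}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    # Show each split's count and its share of the whole dataset.
    print(f"{name}: {count} ({shares[name]:.1%})")
print(f"total: {total}")
```

The total of 19,466 examples is consistent with the 10K to 100K size category, and the split is roughly 90% training, 5% validation, and 5% test.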
License
- License: CC BY 4.0
Citation
@inproceedings{huang-etal-2021-efficient,
title = "Efficient Attentions for Long Document Summarization",
author = "Huang, Luyang and Cao, Shuyang and Parulian, Nikolaus and Ji, Heng and Wang, Lu",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.112",
doi = "10.18653/v1/2021.naacl-main.112",
pages = "1419--1436",
abstract = "The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder‑decoder attention with head‑wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self‑attentions. Combined with Hepos, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state‑of‑the‑art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.",
}