launch/gov_report
The GovReport dataset comprises reports and their abstracts authored by U.S. government research agencies such as the Congressional Research Service and the Government Accountability Office. Compared with other long‑document summarization datasets, GovReport features longer documents and abstracts, so a summarizer must read more context to cover the key points. It provides three configurations, each a different data format: plain_text (default), plain_text_with_recommendations, and structure. The language is English, the dataset falls in the 10K to 100K size category, and the license is CC BY 4.0.
Description
Dataset Overview
Dataset Name
- GovReport
Dataset Summary
- Content Source: The dataset contains reports and corresponding abstracts written by government research agencies such as the Congressional Research Service and the Government Accountability Office.
- Characteristics: Compared with other long‑document summarization datasets, GovReport has longer documents and abstracts, so more of the source must be read in context to cover the salient content of the summary.
Version Information
- Version: 1.0.1 (default) removes extra spaces; 1.0.0 is the original dataset used in the paper.
Supported Tasks
- Task: Summarization
Language
- Language: English
Dataset Structure
- Configurations:
  - plain_text (default): Original text‑to‑text summarization setup used in the paper.
  - plain_text_with_recommendations: Includes "GAO recommendations" in the text‑to‑text setup.
  - structure: Contains partially structured data.
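A minimal sketch of selecting one of the configurations listed above, assuming the Hugging Face `datasets` library is installed and the dataset is still loadable under the repository id in this card's title. The helper name `load_gov_report` and the split names `train`, `validation`, and `test` are assumptions for illustration, not something the card confirms.

```python
# Configuration names as listed in this card.
CONFIGS = ("plain_text", "plain_text_with_recommendations", "structure")


def load_gov_report(config="plain_text", split="train"):
    """Load one split of one GovReport configuration (needs network access)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check above works without the library.
    from datasets import load_dataset

    # Assumption: the hub repository id matches this card's title.
    return load_dataset("launch/gov_report", config, split=split)
```

Calling `load_gov_report("structure", split="validation")` would then request the partially structured variant. Depending on the installed `datasets` version, additional arguments may be needed to load this repository.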
Data Fields
- plain_text & plain_text_with_recommendations:
  - id: string
  - document: string
  - summary: string
- structure:
  - id: string
  - document_sections: dict with title, paragraphs, depth lists
  - summary_sections: dict with title, paragraphs lists
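To illustrate how the `structure` fields relate to the plain-text ones, the sketch below flattens a `document_sections` value into a single outline-style string. It assumes the three lists are parallel (one entry per section), that `paragraphs` holds one string per section, and that `depth` is an integer nesting level. The function name and the sample record are hypothetical and not taken from the dataset.

```python
def flatten_sections(sections):
    """Render parallel title/paragraphs/depth lists as one outline-style string."""
    lines = []
    for title, paragraphs, depth in zip(
        sections["title"], sections["paragraphs"], sections["depth"]
    ):
        if title:
            # Assumption: depth is a nesting level; clamp to at least one marker.
            lines.append("#" * max(depth, 1) + " " + title)
        if paragraphs:
            lines.append(paragraphs)
    return "\n".join(lines)


# Hypothetical record, invented for illustration only.
example = {
    "title": ["Background", "Funding"],
    "paragraphs": ["Text of the first section.", "Text of a subsection."],
    "depth": [1, 2],
}
print(flatten_sections(example))
```

The same function would apply to `summary_sections` only after supplying a `depth` list, since that field lists just `title` and `paragraphs`.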
Splits
- Training: 17,519
- Validation: 974
- Test: 973
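As a quick check on the split sizes above, this short sketch computes the total number of examples and each split's share. The dictionary keys are illustrative labels, not confirmed split identifiers.

```python
# Split sizes as listed in this card.
splits = {"train": 17519, "validation": 974, "test": 973}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(f"total: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The counts sum to 19,466 examples, which works out to roughly a 90/5/5 split.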
License
- License: CC BY 4.0
Citation
@inproceedings{huang-etal-2021-efficient,
title = "Efficient Attentions for Long Document Summarization",
author = "Huang, Luyang and Cao, Shuyang and Parulian, Nikolaus and Ji, Heng and Wang, Lu",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.112",
doi = "10.18653/v1/2021.naacl-main.112",
pages = "1419--1436",
abstract = "The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder‑decoder attention with head‑wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self‑attentions. Combined with Hepos, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state‑of‑the‑art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.",
}
Source
Organization: hugging_face
Created: Unknown