launch/gov_report

The GovReport dataset comprises reports and their abstracts authored by U.S. governmental research agencies such as the Congressional Research Service and the Government Accountability Office. Compared with other long‑document summarization datasets, GovReport features longer documents and abstracts, so more of the source must be read to cover the key summary points. It provides three configurations, each corresponding to a different data format: plain_text (default), plain_text_with_recommendations, and structure. The language is English, the dataset falls in the 10K to 100K size category, and the license is CC BY 4.0.

Updated 11/9/2022
Description

Dataset Overview

Dataset Name

  • GovReport

Dataset Summary

  • Content Source: The dataset contains reports and corresponding abstracts written by government research agencies such as the Congressional Research Service and the Government Accountability Office.
  • Characteristics: Compared with other long‑document summarization datasets, GovReport has longer documents and summaries, so more of the source must be read to cover the salient content of each summary.

Version Information

  • Version: 1.0.1 (default) has extra spaces removed; 1.0.0 is the original dataset used in the paper.

Supported Tasks

  • Task: Summarization

Language

  • Language: English

Dataset Structure

  • Configurations:
    • plain_text (default): Original text‑to‑text summarization setup used in the paper.
    • plain_text_with_recommendations: Includes "GAO recommendations" in the text‑to‑text setup.
    • structure: Contains partially structured data.
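
A minimal sketch of selecting one of these configurations, assuming the Hugging Face `datasets` library is installed and that the dataset is published under the id `launch/gov_report` shown in the page title. The helper name `load_gov_report` is hypothetical, and the split name passed to it is assumed to follow the usual `train` / `validation` / `test` convention. Depending on the installed library version, extra arguments may be needed to load the data.

```python
# Configuration names as listed above.
CONFIGS = ("plain_text", "plain_text_with_recommendations", "structure")


def load_gov_report(config="plain_text", split="train"):
    """Hypothetical helper: load one GovReport configuration and split."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check above works without the library.
    from datasets import load_dataset  # third-party: pip install datasets

    # Assumption: the dataset id is the one shown in the page title.
    return load_dataset("launch/gov_report", config, split=split)
```

Calling `load_gov_report("structure", split="validation")` would then request the partially structured variant. The first call downloads data and needs network access.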

Data Fields

  • plain_text & plain_text_with_recommendations:
    • id: string
    • document: string
    • summary: string
  • structure:
    • id: string
    • document_sections: dict of lists: title, paragraphs, depth
    • summary_sections: dict of lists: title, paragraphs
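
To illustrate how the `structure` fields might fit together, here is a small sketch that renders section titles as an indented outline. It assumes that `title` and `depth` are parallel lists with one entry per section and that `depth` is an integer nesting level starting at 1; neither detail is confirmed by this page. The `sample` record is invented for illustration and is not taken from the dataset.

```python
def outline(document_sections):
    """Render section titles as an indented outline using their depth."""
    titles = document_sections["title"]
    depths = document_sections["depth"]
    if len(titles) != len(depths):
        raise ValueError("title and depth must have the same length")
    lines = []
    for title, depth in zip(titles, depths):
        # Assumption: depth 1 is a top-level section.
        indent = "  " * max(depth - 1, 0)
        lines.append(f"{indent}{title}")
    return "\n".join(lines)


# Invented record for illustration only; not real dataset content.
sample = {
    "id": "example-0",
    "document_sections": {
        "title": ["Background", "Program Funding", "Findings"],
        "paragraphs": ["First section text.", "Nested section text.", "Last section text."],
        "depth": [1, 2, 1],
    },
    "summary_sections": {
        "title": ["What GAO Found"],
        "paragraphs": ["Summary text."],
    },
}

print(outline(sample["document_sections"]))
```

With the invented record, "Program Funding" is printed indented under "Background" because its depth is 2.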

Splits

  • Training: 17,519
  • Validation: 974
  • Test: 973
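
As a quick check on the split sizes above, the following sketch computes the total and each split's share. The dictionary keys are labels chosen for this example.

```python
# Split sizes as listed above.
splits = {"training": 17519, "validation": 974, "test": 973}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(f"total: {total}")
for name, share in shares.items():
    print(f"{name}: {splits[name]} examples ({share:.1%})")
```

The total is 19,466 examples, which works out to roughly a 90/5/5 split between training, validation, and test.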

License

  • License: CC BY 4.0

Citation

@inproceedings{huang-etal-2021-efficient,
    title = "Efficient Attentions for Long Document Summarization",
    author = "Huang, Luyang and Cao, Shuyang and Parulian, Nikolaus and Ji, Heng and Wang, Lu",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.112",
    doi = "10.18653/v1/2021.naacl-main.112",
    pages = "1419--1436",
    abstract = "The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder‑decoder attention with head‑wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self‑attentions. Combined with Hepos, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state‑of‑the‑art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.",
}

Topics

Government Research
Public Policy

Source

Organization: hugging_face

Created: Unknown
