
REFINESUMM

The REFINESUMM dataset is an integrated benchmark designed for training and evaluating vision‑language models on image‑text multimodal summarization. It comprises triples of text, associated images, and summaries derived from Wikipedia articles and their accompanying images. The summaries are automatically generated by the multimodal large language model LLaVA‑v1.6‑Mistral‑7B, which has been self‑refined for this task.

Updated 10/2/2024

Description

REFINESUMM: Self‑Refining Multimodal Language Model Generates Multimodal Summaries Dataset

Dataset Overview

  • Name: REFINESUMM
  • Type: Multimodal summarization dataset
  • Goal: Train and evaluate vision‑language models for image‑text multimodal summarization tasks
  • Content: Triples of text, related images, and summaries based on Wikipedia articles and their images
  • Generation Model: Summaries are automatically generated by the multimodal large language model LLaVA‑v1.6‑Mistral‑7B and improved through a self‑refinement process
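The self‑refinement process can be sketched as a generate‑critique‑revise loop. The function names, placeholder bodies, and stopping criterion below are illustrative assumptions, not the paper's actual implementation; in practice each step would be a prompt to an MLLM such as LLaVA‑v1.6‑Mistral‑7B.

```python
# Illustrative sketch of a self-refinement loop for multimodal summarization.
# generate(), critique(), and revise() are toy stand-ins for MLLM calls.

def generate(article: str, image: str) -> str:
    # In practice: prompt the MLLM with the article text and the image.
    return f"Summary of {article!r} with {image!r}"

def critique(summary: str) -> str:
    # In practice: ask the MLLM to flag factual or coverage issues.
    # Returns an empty string when it finds nothing to fix.
    return "" if "refined" in summary else "missing detail"

def revise(summary: str, feedback: str) -> str:
    # In practice: prompt the MLLM to rewrite the summary using the feedback.
    return summary + " (refined)"

def self_refine(article: str, image: str, max_rounds: int = 3) -> str:
    summary = generate(article, image)
    for _ in range(max_rounds):
        feedback = critique(summary)
        if not feedback:  # stop once the critic finds no issues
            break
        summary = revise(summary, feedback)
    return summary
```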

Dataset Download

Data Loading

  • Steps:
    1. Download the test split of WikiWeb2M:
      wget https://storage.googleapis.com/gresearch/wit/wikiweb2m/wikiweb2m-test.tfrecord.gz
      
    2. Place the downloaded file in the data/ directory.
    3. In update_data_from_wikiweb2m.py, set the split (e.g., train, val, test) on line 12.
    4. Run the following command:
      python update_data_from_wikiweb2m.py
      
    5. The dataset will be saved in data/ with columns txt (article), img (image), and summary (summary).
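Once the script has run, the saved data can be consumed as (article, image, summary) triples. The snippet below is a minimal sketch using a toy in‑memory DataFrame with the three columns named in step 5; the on‑disk format produced by `update_data_from_wikiweb2m.py` is not specified here, so the loading step is left out.

```python
import pandas as pd

# Toy stand-in for the file produced in data/; the real output of
# update_data_from_wikiweb2m.py uses the same three columns.
df = pd.DataFrame({
    "txt": ["Full article text ..."],
    "img": ["path-or-bytes-of-image"],
    "summary": ["Model-generated multimodal summary"],
})

# Iterate over (article, image, summary) triples for training or evaluation.
for row in df.itertuples(index=False):
    article, image, summary = row.txt, row.img, row.summary
```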

Citation

  • BibTeX:
    @inproceedings{patil-etal-2024-refinesumm,
        title = "{REFINESUMM}: Self-Refining {MLLM} for Generating a Multimodal Summarization Dataset",
        author = "Patil, Vaidehi  and
          Ribeiro, Leonardo  and
          Liu, Mengwen  and
          Bansal, Mohit  and
          Dreyer, Markus",
        editor = "Ku, Lun-Wei  and
          Martins, Andre  and
          Srikumar, Vivek",
        booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
        month = aug,
        year = "2024",
        address = "Bangkok, Thailand",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.acl-long.743",
        pages = "13773--13786",
        abstract = "Multimodal Large Language Models (MLLMs) excel at synthesizing key information from diverse sources..."
    }
    



Topics

Multimodal Summarization
Vision‑Language Models

Source

Organization: GitHub

Created: 9/23/2024
