
REFINESUMM

The REFINESUMM dataset is an integrated benchmark designed for training and evaluating vision‑language models on image‑text multimodal summarization. It comprises triples of text, associated images, and summaries derived from Wikipedia articles and their accompanying images. The summaries are automatically generated by the multimodal large language model LLaVA‑v1.6‑Mistral‑7B, which has been self‑refined for this task.

Updated 10/2/2024

Description

REFINESUMM: Self‑Refining Multimodal Language Model Generates Multimodal Summaries Dataset

Dataset Overview

  • Name: REFINESUMM
  • Type: Multimodal summarization dataset
  • Goal: Train and evaluate vision‑language models for image‑text multimodal summarization tasks
  • Content: Triples of text, related images, and summaries based on Wikipedia articles and their images
  • Generation Model: Summaries are automatically generated by the multimodal large language model LLaVA‑v1.6‑Mistral‑7B and improved through a self‑refinement process
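The self‑refinement process can be sketched as a generate‑critique‑revise loop. The function names, placeholder bodies, and stopping criterion below are illustrative assumptions, not the paper's actual implementation; in practice each step would be a prompt to an MLLM such as LLaVA‑v1.6‑Mistral‑7B.

```python
# Illustrative sketch of a self-refinement loop for multimodal summarization.
# generate(), critique(), and revise() are toy stand-ins for MLLM calls.

def generate(article: str, image: str) -> str:
    # In practice: prompt the MLLM with the article text and the image.
    return f"Summary of {article!r} with {image!r}"

def critique(summary: str) -> str:
    # In practice: ask the MLLM to flag factual or coverage issues.
    # Returns an empty string when it finds nothing to fix.
    return "" if "refined" in summary else "missing detail"

def revise(summary: str, feedback: str) -> str:
    # In practice: prompt the MLLM to rewrite the summary using the feedback.
    return summary + " (refined)"

def self_refine(article: str, image: str, max_rounds: int = 3) -> str:
    summary = generate(article, image)
    for _ in range(max_rounds):
        feedback = critique(summary)
        if not feedback:  # stop once the critic finds no issues
            break
        summary = revise(summary, feedback)
    return summary
```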

Dataset Download

Data Loading

  • Steps:
    1. Download the test split of WikiWeb2M:
      wget https://storage.googleapis.com/gresearch/wit/wikiweb2m/wikiweb2m-test.tfrecord.gz
      
    2. Place the downloaded file in the data/ directory.
    3. In update_data_from_wikiweb2m.py, set the split (e.g., train, val, test) on line 12.
    4. Run the following command:
      python update_data_from_wikiweb2m.py
      
    5. The dataset will be saved in data/ with columns txt (article), img (image), and summary (summary).
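Once the script has run, the saved data can be consumed as (article, image, summary) triples. The snippet below is a minimal sketch using a toy in‑memory DataFrame with the three columns named in step 5; the on‑disk format produced by `update_data_from_wikiweb2m.py` is not specified here, so the loading step is left out.

```python
import pandas as pd

# Toy stand-in for the file produced in data/; the real output of
# update_data_from_wikiweb2m.py uses the same three columns.
df = pd.DataFrame({
    "txt": ["Full article text ..."],
    "img": ["path-or-bytes-of-image"],
    "summary": ["Model-generated multimodal summary"],
})

# Iterate over (article, image, summary) triples for training or evaluation.
for row in df.itertuples(index=False):
    article, image, summary = row.txt, row.img, row.summary
```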

Citation

  • BibTeX:
    @inproceedings{patil-etal-2024-refinesumm,
        title = "{REFINESUMM}: Self-Refining {MLLM} for Generating a Multimodal Summarization Dataset",
        author = "Patil, Vaidehi  and
          Ribeiro, Leonardo  and
          Liu, Mengwen  and
          Bansal, Mohit  and
          Dreyer, Markus",
        editor = "Ku, Lun-Wei  and
          Martins, Andre  and
          Srikumar, Vivek",
        booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
        month = aug,
        year = "2024",
        address = "Bangkok, Thailand",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.acl-long.743",
        pages = "13773--13786",
        abstract = "Multimodal Large Language Models (MLLMs) excel at synthesizing key information from diverse sources..."
    }
    



Topics

Multimodal Summarization
Vision‑Language Models

Source

Organization: GitHub

Created: 9/23/2024
