How2
Description
How2 is a multimodal dataset comprising roughly 80,000 instructional videos (~2,000 hours) with corresponding English subtitles and summaries. Approximately 300 hours of the videos were crowd-translated into Portuguese and used in the JSALT 2018 workshop. The training data is split into a 300-hour portion and a 2,000-hour portion; only the former supports Portuguese machine translation. The full 2,000-hour set can be used for speech recognition, speech summarization, text summarization, and their multimodal extensions.
The following packages related to How2 have been released to reproduce our results and encourage further research:
- ASR (300h): 300 hours of audio fbank + pitch features in Kaldi scp/ark format (see the loading sketch after this list).
- E2E Summarization + ASR (2000h): 2,000 hours of audio fbank + pitch features, transcriptions, and summaries in Kaldi scp/ark format.
- Visual features: Video motion features for machine translation and automatic speech recognition, provided as NumPy arrays (also covered in the sketch below).
- English Transcript: English transcripts for How2.
- Portuguese Machine Translations: Crowd‑sourced Portuguese texts.
- English Abstractive Summaries: Summary texts.
- Visual features for Summarization: Video motion features for summarization, provided as NumPy arrays.
- Object Grounding Features: Object grounding test and development sets.
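For orientation, the sketch below shows one way to read these releases in Python: the kaldiio package for the Kaldi scp/ark features and plain NumPy for the visual features. The file paths are placeholders, not the actual layout of the released packages.

```python
# Minimal loading sketch; the paths below are hypothetical, not the
# actual layout of the released How2 packages.
import kaldiio  # pip install kaldiio
import numpy as np

# Kaldi scp/ark features (e.g., the 300h ASR fbank + pitch release).
# load_scp returns a lazy dict-like object mapping utterance IDs to
# (num_frames, feat_dim) NumPy arrays.
feats = kaldiio.load_scp("how2/train/feats.scp")  # hypothetical path
for utt_id in feats:
    fbank_pitch = feats[utt_id]
    print(utt_id, fbank_pitch.shape)
    break  # inspect only the first utterance

# Visual (video motion) features are released as NumPy arrays; a
# per-video .npy file is assumed here.
video_feats = np.load("how2/video_features/some_video.npy")  # hypothetical path
print(video_feats.shape)
```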
When using the dataset, cite the following paper:
@inproceedings{sanabria18how2,
  title = {{How2:} A Large-scale Dataset For Multimodal Language Understanding},
  author = {Sanabria, Ramon and Caglayan, Ozan and Palaskar, Shruti and Elliott, Desmond and Barrault, Loïc and Specia, Lucia and Metze, Florian},
  booktitle = {Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL)},
  year = {2018},
  organization = {NeurIPS},
  url = {http://arxiv.org/abs/1811.00347}
}
How2 has also been used for end-to-end speech summarization, with 43-dimensional fbank + pitch features released to support this application. Relevant research can be found in the ESPnet recipe and the associated paper. When conducting speech summarization research with this dataset, cite the following work:
@inproceedings{Sharma2022,
  author = {Sharma, Roshan and Palaskar, Shruti and Black, Alan W and Metze, Florian},
  booktitle = {ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title = {End-to-End Speech Summarization Using Restricted Self-Attention},
  year = {2022},
  pages = {8072-8076},
  doi = {10.1109/ICASSP43922.2022.9747320}
}
License information for each video can be found in the .info.json file accompanying the video. All videos are provided under the standard YouTube License. Unless otherwise noted, the repository contents are licensed under Creative Commons BY-SA 4.0 (for data) and/or the BSD-2-Clause License (for software).
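As a quick illustration, the snippet below reads the license field from a video's .info.json. The "license" key follows youtube-dl-style metadata and is an assumption here; check the released files for the exact schema.

```python
# Sketch: inspect the license recorded in a video's .info.json.
# The "license" key is an assumption (youtube-dl-style metadata);
# verify against the actual released files.
import json

with open("some_video.info.json") as f:  # hypothetical filename
    info = json.load(f)

print(info.get("license", "no license field present"))
```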