How2
Description
How2 is a multimodal dataset comprising roughly 80,000 instructional videos (~2,000 hours) with corresponding English subtitles and summaries. Approximately 300 hours of the videos were crowd-translated into Portuguese and used in the JSALT 2018 workshop. The training data is split into a 300-hour portion and a 2,000-hour portion; only the former supports Portuguese machine translation. The full 2,000-hour set can be used for speech recognition, speech summarization, text summarization, and their multimodal extensions.
The following packages related to How2 have been released to reproduce our results and encourage further research:
- ASR (300h): 300 hours of audio fbank + pitch features in Kaldi scp/ark format (see the loading sketch after this list).
- E2E Summarization + ASR (2000h): 2,000 hours of audio fbank + pitch features, transcriptions, and summaries in Kaldi scp/ark format.
- Visual features: Video motion features for machine translation and automatic speech recognition, provided as NumPy arrays (also covered in the sketch below).
- English Transcript: English transcripts for How2.
- Portuguese Machine Translations: Crowd‑sourced Portuguese texts.
- English Abstractive Summaries: Summary texts.
- Visual features for Summarization: Video motion features for summarization, provided as NumPy arrays.
- Object Grounding Features: Object grounding test and development sets.
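For orientation, the sketch below shows one way to read these releases in Python: the kaldiio package for the Kaldi scp/ark features and plain NumPy for the visual features. The file paths are placeholders, not the actual layout of the released packages.

```python
# Minimal loading sketch; the paths below are hypothetical, not the
# actual layout of the released How2 packages.
import kaldiio  # pip install kaldiio
import numpy as np

# Kaldi scp/ark features (e.g., the 300h ASR fbank + pitch release).
# load_scp returns a lazy dict-like object mapping utterance IDs to
# (num_frames, feat_dim) NumPy arrays.
feats = kaldiio.load_scp("how2/train/feats.scp")  # hypothetical path
for utt_id in feats:
    fbank_pitch = feats[utt_id]
    print(utt_id, fbank_pitch.shape)
    break  # inspect only the first utterance

# Visual (video motion) features are released as NumPy arrays; a
# per-video .npy file is assumed here.
video_feats = np.load("how2/video_features/some_video.npy")  # hypothetical path
print(video_feats.shape)
```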
When using the dataset, cite the following paper:
@inproceedings{sanabria18how2,
  title = {{How2:} A Large-scale Dataset For Multimodal Language Understanding},
  author = {Sanabria, Ramon and Caglayan, Ozan and Palaskar, Shruti and Elliott, Desmond and Barrault, Loïc and Specia, Lucia and Metze, Florian},
  booktitle = {Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL)},
  year = {2018},
  organization = {NeurIPS},
  url = {http://arxiv.org/abs/1811.00347}
}
How2 has also been used for end-to-end speech summarization, with 43-dimensional fbank + pitch features released to support this application. Relevant research can be found in the ESPnet recipe and the associated paper. When conducting speech summarization research with this dataset, cite the following work:
@inproceedings{Sharma2022,
  author = {Sharma, Roshan and Palaskar, Shruti and Black, Alan W and Metze, Florian},
  booktitle = {ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title = {End-to-End Speech Summarization Using Restricted Self-Attention},
  year = {2022},
  pages = {8072-8076},
  doi = {10.1109/ICASSP43922.2022.9747320}
}
License information for each video can be found in the .info.json file accompanying the video. All videos are provided under the standard YouTube License. Unless otherwise noted, the repository contents are licensed under Creative Commons BY-SA 4.0 (for data) and/or the BSD-2-Clause License (for software).
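As a quick illustration, the snippet below reads the license field from a video's .info.json. The "license" key follows youtube-dl-style metadata and is an assumption here; check the released files for the exact schema.

```python
# Sketch: inspect the license recorded in a video's .info.json.
# The "license" key is an assumption (youtube-dl-style metadata);
# verify against the actual released files.
import json

with open("some_video.info.json") as f:  # hypothetical filename
    info = json.load(f)

print(info.get("license", "no license field present"))
```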