Back to datasets
Dataset assetOpen Source CommunityWeb TechnologyCode Generation

Web2Code

The Web2Code dataset was created by MBZUAI to improve multimodal large language models' (MLLMs) capabilities in web understanding and HTML code generation. It comprises 11.797 million web instruction‑response pairs, including webpage images, HTML code, and structured questions and answers. The dataset was constructed using GPT‑3.5 and GPT‑4 for data cleaning and new data generation. Web2Code is primarily used for web content generation and task automation, addressing the shortcomings of existing MLLMs in handling web screenshots and generating HTML code.

Source
arXiv
Created
Jun 29, 2024
Updated
Jun 29, 2024
Signals
306 views
Availability
Linked source ready
Overview

Dataset description and usage context

Web2Code Dataset Overview

Basic Information

Latest Updates

  • 2024/06/27: Paper and project page released.

Evaluation Suite

  • Web‑Code Generation Benchmark: Provides environment setup, screenshot generation, and evaluation guidelines.
  • Web Understanding Benchmark: Offers setup, output generation, and evaluation instructions.

Acknowledgements

  • LLaVA: Built upon its codebase.
  • WebSRC, WebSight, Pix2Code: High‑quality web and HTML‑code related datasets.

Citation

@article{web2code2024,
  title={Web2Code: A Large‑scale Webpage‑to‑Code Dataset and Evaluation Framework for Multimodal LLMs},
  author={Yun, Sukmin and Lin, Haokun and Thushara, Rusiru and Bhat, Mohammad Qazim and Wang, Yongxin and Jiang, Zutao and Deng, Mingkai and Wang, Jinhong and Tao, Tianhua and Li, Junbo and Li, Haonan and Nakov, Preslav and Baldwin, Timothy and Liu, Zhengzhong and Xing, Eric P. and Liang, Xiaodan and Shen, Zhiqiang},
  journal={arXiv preprint arXiv:2406.20098},
  year={2024}
}

License

  • Data License: CC BY 4.0 (non‑commercial use only).
  • Usage Note: Data is for research purposes and must not be used commercially.
Need downstream help?

Pair the dataset with AI analysis and content workflows.

Once the source passes your review, move straight into summarization, transformation, report drafting, or presentation generation with the JuheAI toolchain.

Explore AI studio