Web2Code

The Web2Code dataset was created by MBZUAI to improve the web understanding and HTML code generation capabilities of multimodal large language models (MLLMs). It comprises 1,179.7K (about 1.18 million) web instruction‑response pairs, which include webpage images, HTML code, and structured question‑answer pairs. GPT‑3.5 and GPT‑4 were used during construction, both to clean existing data and to generate new data. Web2Code is primarily intended for web content generation and task automation, and it targets a known weakness of existing MLLMs: interpreting webpage screenshots and producing the corresponding HTML code.
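To make the idea of an instruction‑response pair concrete, here is a minimal sketch of what one record might look like in Python. The class name `Web2CodeRecord`, its field names, and the sample values are assumptions made for illustration; this page does not document the dataset's actual schema. The helper only lists the opening tags in an HTML response as a quick sanity check.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Web2CodeRecord:
    """Hypothetical shape of one instruction-response pair (field names are assumptions)."""
    image_path: str   # path to the webpage screenshot
    instruction: str  # prompt or question about the page
    response: str     # HTML code or a textual answer


class TagCollector(HTMLParser):
    """Collects opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def opening_tags(html: str) -> list[str]:
    """Return the opening tag names found in an HTML string."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


# Illustrative record; the path and text are made up for this example.
record = Web2CodeRecord(
    image_path="screenshots/example.png",
    instruction="Generate the HTML code for this webpage.",
    response="<html><body><h1>Hello</h1></body></html>",
)
tags = opening_tags(record.response)
```

A check like this could serve as a cheap filter before training or evaluation, although the real dataset files may be organized quite differently.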

Updated 6/29/2024
arXiv

Description

Web2Code Dataset Overview

Basic Information

Latest Updates

  • 2024/06/27: Paper and project page released.

Evaluation Suite

  • Web‑Code Generation Benchmark: Provides environment setup, screenshot generation, and evaluation guidelines.
  • Web Understanding Benchmark: Offers setup, output generation, and evaluation instructions.
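As a rough illustration of the evaluation step in a web understanding benchmark, the sketch below scores model outputs against reference answers by exact match after light normalization. It assumes the answers are short yes/no strings, which this page does not specify, and the function names are hypothetical rather than part of the Web2Code evaluation suite.

```python
def normalize(answer: str) -> str:
    """Lowercase, trim whitespace, and drop a trailing period."""
    return answer.strip().lower().rstrip(".")


def yes_no_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Made-up answers: two of the three predictions agree with the references.
score = yes_no_accuracy(["Yes", "no.", "yes"], ["yes", "no", "no"])
```

The official benchmark instructions should take precedence over this sketch wherever they define a different answer format or metric.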

Acknowledgements

  • LLaVA: Built upon its codebase.
  • WebSRC, WebSight, Pix2Code: High‑quality web and HTML‑code related datasets.

Citation

@article{web2code2024,
  title={Web2Code: A Large‑scale Webpage‑to‑Code Dataset and Evaluation Framework for Multimodal LLMs},
  author={Yun, Sukmin and Lin, Haokun and Thushara, Rusiru and Bhat, Mohammad Qazim and Wang, Yongxin and Jiang, Zutao and Deng, Mingkai and Wang, Jinhong and Tao, Tianhua and Li, Junbo and Li, Haonan and Nakov, Preslav and Baldwin, Timothy and Liu, Zhengzhong and Xing, Eric P. and Liang, Xiaodan and Shen, Zhiqiang},
  journal={arXiv preprint arXiv:2406.20098},
  year={2024}
}

License

  • Data License: CC BY 4.0, subject to the non‑commercial restriction in the usage note.
  • Usage Note: The data is intended for research purposes only and must not be used commercially.

Topics

Web Technology
Code Generation

Source

Organization: arXiv

Created: 6/29/2024