COYO-700M
COYO‑700M is a large‑scale dataset of about 747 million image‑text pairs, each accompanied by metadata attributes, intended for training a variety of models. It is built by collecting alt‑text from HTML documents together with the images that text describes, with the aim of supporting the training of large foundation models and complementing existing datasets.
Description
Dataset Overview
Dataset Name
COYO‑700M
Dataset Content
- 747M image‑text pairs, each with multiple metadata attributes that can be used when selecting training data.
- Data collected from alt‑text in HTML documents and their associated images.
Data Collection Process
- Approximately 10 billion alt‑text and image pairs were gathered from CommonCrawl data spanning October 2020 to August 2021.
- Image‑ and text‑level filtering removed non‑informative pairs at minimal cost.
Data Filtering
Image‑Level
- Includes all formats decodable by the Pillow library (JPEG, WEBP, PNG, BMP, etc.).
- Removes images smaller than 5 KB.
- Removes images with aspect ratio > 3.0.
- Removes images with minimum side < 200 px.
- Removes images whose OpenNSFW2 or GantMan/NSFW score exceeds 0.5.
- Removes duplicate images based on pHash.
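The size, shape, and NSFW rules above can be sketched as a single predicate. This is a minimal illustration, not the dataset authors' pipeline: the `ImageInfo` record and its `num_bytes` field are hypothetical, 1 KB is assumed to be 1024 bytes, and aspect ratio is assumed to mean long side divided by short side. Pillow decoding and pHash deduplication are left out.

```python
from dataclasses import dataclass


@dataclass
class ImageInfo:
    # Hypothetical record for illustration; not the dataset's own schema.
    num_bytes: int
    width: int
    height: int
    nsfw_score_opennsfw2: float
    nsfw_score_gantman: float


def keep_image(img: ImageInfo) -> bool:
    """Return True if the image passes the listed size, shape, and NSFW rules."""
    if img.num_bytes < 5 * 1024:  # smaller than 5 KB (assuming 1 KB = 1024 bytes)
        return False
    short_side = min(img.width, img.height)
    long_side = max(img.width, img.height)
    if short_side < 200:  # minimum side below 200 px
        return False
    if long_side / short_side > 3.0:  # aspect ratio assumed to be long/short
        return False
    if img.nsfw_score_opennsfw2 > 0.5 or img.nsfw_score_gantman > 0.5:
        return False
    return True
```

Checking the minimum side before the aspect ratio also guarantees the division never sees a zero denominator.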
Text‑Level
- Collects English text only.
- Normalizes text formatting and discards texts shorter than 5 characters.
- Discards texts without noun forms.
- Discards texts with word count < 3 or > 256.
- Discards texts appearing more than 10 times.
- Discards texts containing NSFW vocabulary.
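A rough sketch of the length, word-count, and repetition rules above follows. It assumes that "cleaning" means collapsing whitespace and that words are whitespace-separated tokens; both are assumptions, and the function names are hypothetical. Language detection, the noun check, and the NSFW vocabulary check need external tools and are omitted.

```python
from collections import Counter


def normalize(text: str) -> str:
    # Assumed cleaning step: collapse whitespace runs and trim the ends.
    return " ".join(text.split())


def filter_texts(texts, max_repeats=10):
    """Keep texts that pass the length, word-count, and repetition rules."""
    cleaned = [normalize(t) for t in texts]
    counts = Counter(cleaned)  # how often each cleaned text occurs
    kept = []
    for t in cleaned:
        if len(t) < 5:  # shorter than 5 characters
            continue
        n_words = len(t.split())
        if n_words < 3 or n_words > 256:
            continue
        if counts[t] > max_repeats:  # appears more than 10 times
            continue
        kept.append(t)
    return kept
```

Counting occurrences before filtering means a frequently repeated caption is dropped everywhere, rather than kept for its first ten appearances.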
Image‑Text‑Level
- Removes duplicate samples based on (image_phash, text).
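Deduplicating on the (image_phash, text) pair can be sketched with a set of seen keys. The example assumes each sample is a dict carrying those two fields and that the first occurrence is the one kept; the source does not say which copy survives.

```python
def dedupe_pairs(samples):
    """Drop samples whose (image_phash, text) key has already been seen."""
    seen = set()
    unique = []
    for sample in samples:
        key = (sample["image_phash"], sample["text"])
        if key in seen:
            continue  # same image hash and same text: treat as a duplicate
        seen.add(key)
        unique.append(sample)
    return unique
```

Two samples sharing a pHash but carrying different text are both kept, since only the full pair is treated as the identity.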
Dataset Preview
The dataset includes image URLs, text, image dimensions, image pHash, text length, word count, token counts under BERT and GPT models, etc.
Dataset Statistics
- 746,972,269 image‑text pairs.
- 656,114,783 unique URLs.
- 579,679,137 unique image pHashes.
- 566,253,888 unique texts.
Metadata Attributes
- id: Unique 64‑bit integer ID.
- url: Image URL.
- text: Alt text.
- width, height: Image dimensions.
- image_phash: Perceptual hash of the image.
- text_length, word_count: Text length and word count.
- num_tokens_bert, num_tokens_gpt: Token counts of the text under BERT and GPT tokenizers.
- num_faces: Number of faces detected in the image.
- clip_similarity_vitb32, clip_similarity_vitl14: Image‑text similarity scores using OpenAI CLIP models.
- nsfw_score_opennsfw2, nsfw_score_gantman: NSFW scores for the image.
- watermark_score: Probability of a watermark in the image.
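As an illustration of how these attributes might be used, the sketch below selects rows by thresholding a few of the score fields. The rows are assumed to be plain dicts keyed by the attribute names listed above; the function name and the threshold values are hypothetical placeholders, not values recommended by the dataset.

```python
def select_rows(rows, min_clip=0.3, max_watermark=0.5, max_nsfw=0.5):
    """Keep rows that meet illustrative CLIP, watermark, and NSFW thresholds."""
    selected = []
    for row in rows:
        if row["clip_similarity_vitb32"] < min_clip:
            continue  # weak image-text agreement under this placeholder cutoff
        if row["watermark_score"] > max_watermark:
            continue  # likely watermarked
        worst_nsfw = max(row["nsfw_score_opennsfw2"], row["nsfw_score_gantman"])
        if worst_nsfw > max_nsfw:
            continue
        selected.append(row)
    return selected
```

In practice the thresholds would be tuned against the target task; the defaults here only show the shape of the filter.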
Source
Organization: github
Created: 8/31/2022