
COYO-700M

COYO‑700M is a large‑scale dataset of 747 million image‑text pairs, each accompanied by metadata attributes. It is built by collecting alt‑text from HTML documents together with the images the alt‑text describes. It is intended for training large foundation models and for complementing existing image‑text datasets.

Source

github

Created

Aug 31, 2022

Updated

Nov 30, 2022

Overview

Dataset description and usage context

Dataset Overview

Dataset Name

COYO‑700M

Dataset Content

747M image‑text pairs, each carrying multiple metadata attributes that can be used to select or weight samples during training.
Pairs consist of alt‑text collected from HTML documents and the image each alt‑text is attached to.

Data Collection Process

Approximately 10 billion alt‑text and image pairs were gathered from CommonCrawl HTML documents dated October 2020 to August 2021.
Image‑level and text‑level filtering then removed uninformative pairs at minimal cost.

Data Filtering

Image‑Level

Includes all formats decodable by the Pillow library (JPEG, WEBP, PNG, BMP, etc.).
Removes images smaller than 5 KB.
Removes images with aspect ratio > 3.0.
Removes images with minimum side < 200 px.
Removes images whose OpenNSFW2 or GantMan/NSFW score exceeds 0.5.
Removes duplicate images based on pHash.
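
The size, aspect‑ratio, minimum‑side and NSFW rules above can be sketched as a single predicate. This is a minimal illustration, not the dataset authors' code: the `ImageInfo` record and its field names are hypothetical, "5 KB" is assumed to mean 5 * 1024 bytes, and aspect ratio is assumed to be longer side divided by shorter side. Decoding with Pillow and pHash deduplication are left out.

```python
from dataclasses import dataclass

MIN_BYTES = 5 * 1024       # assumption: "5 KB" read as 5 * 1024 bytes
MAX_ASPECT_RATIO = 3.0
MIN_SIDE_PX = 200
MAX_NSFW_SCORE = 0.5


@dataclass
class ImageInfo:
    """Hypothetical per-image record; field names are illustrative."""
    num_bytes: int
    width: int
    height: int
    nsfw_score_opennsfw2: float
    nsfw_score_gantman: float


def passes_image_filters(img: ImageInfo) -> bool:
    """Return True if the image survives the image-level rules listed above."""
    if img.num_bytes < MIN_BYTES:
        return False
    short_side, long_side = sorted((img.width, img.height))
    if short_side < MIN_SIDE_PX:
        return False
    # Assumption: aspect ratio is the longer side divided by the shorter side.
    if long_side / short_side > MAX_ASPECT_RATIO:
        return False
    # Either NSFW detector exceeding the threshold rejects the image.
    if max(img.nsfw_score_opennsfw2, img.nsfw_score_gantman) > MAX_NSFW_SCORE:
        return False
    return True
```

Checking the minimum side before the aspect ratio guarantees the division never sees a zero‑length side.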

Text‑Level

Collects English text only.
Cleans format, discarding texts shorter than 5 characters.
Discards texts without noun forms.
Discards texts with word count < 3 or > 256.
Discards texts appearing more than 10 times.
Discards texts containing NSFW vocabulary.
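
A rough sketch of the length, word‑count, frequency and vocabulary rules above follows. It makes several assumptions: "cleans format" is taken to mean collapsing whitespace and trimming, words are split on whitespace, and `nsfw_words` is a hypothetical caller‑supplied word set. Language detection and the noun check need an NLP library and are omitted here.

```python
import re
from collections import Counter

MIN_CHARS = 5
MIN_WORDS = 3
MAX_WORDS = 256
MAX_OCCURRENCES = 10


def clean_text(text: str) -> str:
    # Assumption: format cleaning = collapse whitespace runs and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def filter_texts(texts, nsfw_words=frozenset()):
    """Return the cleaned texts that survive the text-level rules listed above."""
    cleaned = [clean_text(t) for t in texts]
    counts = Counter(cleaned)  # how often each cleaned text occurs in the input
    kept = []
    for text in cleaned:
        if len(text) < MIN_CHARS:
            continue
        words = text.split()
        if len(words) < MIN_WORDS or len(words) > MAX_WORDS:
            continue
        if counts[text] > MAX_OCCURRENCES:
            continue
        if any(word.lower() in nsfw_words for word in words):
            continue
        kept.append(text)
    return kept
```

The frequency rule needs counts over the whole collection, so the sketch counts first and filters in a second pass.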

Image‑Text‑Level

Removes duplicate samples based on (image_phash, text).
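
Deduplication on the (image_phash, text) key can be illustrated with a set of already‑seen keys. This is only a sketch: samples are assumed to be dictionaries carrying the `image_phash` and `text` fields, and the first occurrence is the one kept, which the source does not specify.

```python
def dedupe_pairs(samples):
    """Keep the first sample for each (image_phash, text) key, preserving order."""
    seen = set()
    unique = []
    for sample in samples:
        key = (sample["image_phash"], sample["text"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(sample)
    return unique
```

Because the key combines both fields, the same image may still appear with different texts, and the same text with different images.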

Dataset Preview

Each record includes the image URL, the alt text, image dimensions, the image pHash, text length, word count, token counts under BERT and GPT tokenizers, and further attributes listed under Metadata Attributes.

Dataset Statistics

746,972,269 image‑text pairs.
656,114,783 unique URLs.
579,679,137 unique image pHashes.
566,253,888 unique texts.
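
To put these counts in proportion, the short calculation below expresses each one as a share of the total number of pairs. It uses only the figures listed above; the dictionary keys are illustrative names, not dataset fields.

```python
# Counts copied from the statistics above.
stats = {
    "pairs": 746_972_269,
    "unique_urls": 656_114_783,
    "unique_image_phashes": 579_679_137,
    "unique_texts": 566_253_888,
}

total = stats["pairs"]
shares = {name: count / total for name, count in stats.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of all pairs")
```

Each unique count is smaller than the pair count because deduplication is applied to the combined (image_phash, text) key, not to either field alone.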

Metadata Attributes

id: Unique 64‑bit integer ID.
url: Image URL.
text: Alt text.
width, height: Image dimensions.
image_phash: Perceptual hash of the image.
text_length, word_count: Text length and word count.
num_tokens_bert, num_tokens_gpt: Token counts for BERT and GPT.
num_faces: Number of faces detected in the image.
clip_similarity_vitb32, clip_similarity_vitl14: Image‑text similarity scores using OpenAI CLIP models.
nsfw_score_opennsfw2, nsfw_score_gantman: NSFW scores for the image.
watermark_score: Probability of a watermark in the image.
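
One way these attributes might be used is to select a cleaner subset before training. The sketch below assumes rows are dictionaries keyed by the attribute names above. The threshold values are hypothetical placeholders chosen for illustration, not recommendations from the dataset authors.

```python
def select_rows(rows, min_clip_similarity=0.3, max_watermark_score=0.5):
    """Yield rows that meet illustrative quality thresholds.

    Both default thresholds are hypothetical; tune them for your own use case.
    """
    for row in rows:
        if row["clip_similarity_vitb32"] < min_clip_similarity:
            continue  # image and text are weakly related according to CLIP
        if row["watermark_score"] > max_watermark_score:
            continue  # image is likely watermarked
        yield row


# Usage with two made-up rows:
example_rows = [
    {"id": 1, "clip_similarity_vitb32": 0.35, "watermark_score": 0.1},
    {"id": 2, "clip_similarity_vitb32": 0.12, "watermark_score": 0.1},
]
selected_ids = [row["id"] for row in select_rows(example_rows)]
```

Writing the selector as a generator lets it run over a stream of rows without loading the full metadata into memory.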
