pixparse/cc12m-wds

---
license: other
license_name: conceptual-12m
license_link: LICENSE
task_categories:
- image-to-text
size_categories:
- 10M<n<100M
---

# Dataset Card for Conceptual Captions 12M (CC12M)

## Dataset Description

- **Repository:** [Conceptual 12M repository](https://github.com/google-research-datasets/conceptual-12m)
- **Paper:** [Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts](https://arxiv.org/abs/2102.08981)
- **Point of Contact:** [Conceptual Captions e-mail](mailto:conceptual-captions@google.com)

### Dataset Summary

Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M).

### Usage

This instance of Conceptual Captions is in [webdataset](https://github.com/webdataset/webdataset/commits/main) .tar format. It can be used with the webdataset library or upcoming releases of Hugging Face `datasets`.

...More Detail TBD

### Data Splits

This dataset was downloaded using img2dataset. Images were resized on download: if the shortest edge was > 512, the image was scaled to shortest edge = 512.

#### Train

* `cc12m-train-*.tar`
* Downloaded on 2021/18/22
* 2176 shards, 10968539 samples

## Additional Information

### Dataset Curators

Soravit Changpinyo, Piyush Sharma, Nan Ding and Radu Soricut.

### Licensing Information

The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
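
### Example: Resize Rule

The resize rule described under Data Splits (shortest edge > 512 is scaled to shortest edge = 512) can be sketched as a small calculation. This is a minimal illustration and not img2dataset's actual code: the function name `resized_dimensions`, the assumption that aspect ratio is preserved, and the rounding behavior are hypothetical, since the card does not state which img2dataset options were used.

```python
from typing import Tuple


def resized_dimensions(width: int, height: int, target: int = 512) -> Tuple[int, int]:
    """Return (width, height) after applying the card's resize rule.

    Assumption: aspect ratio is preserved and results are rounded to the
    nearest pixel. Images whose shortest edge is <= target are unchanged.
    """
    shortest = min(width, height)
    if shortest <= target:
        # Small images are never upscaled.
        return width, height
    scale = target / shortest
    # Scale both edges by the same factor so the shortest edge becomes `target`.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    for size in [(1024, 2048), (400, 300), (1000, 600)]:
        print(size, "->", resized_dimensions(*size))
```

Under these assumptions, images in the shards should have a shortest edge of at most 512 pixels, which may be useful when choosing crop or batch sizes.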

### Citation Information

```bibtex
@inproceedings{changpinyo2021cc12m,
  title = {{Conceptual 12M}: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
  author = {Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle = {CVPR},
  year = {2021},
}
```

Updated 12/15/2023



Topics

Image-Text Pairing

Vision-Language Pretraining

Source

Organization: hugging_face

Created: Unknown
