Social-Media-Dataset

This dataset is built from over 1 million tweets crawled from Twitter. After filtering and processing, only multimodal text‑image posts are retained; emojis and the text embedded in images are then extracted, yielding a dataset with four modalities.

Source

github

Created

Nov 7, 2024

Updated

Nov 7, 2024

Overview

Social Media Dataset

Dataset Overview

Data Source: Crawled over 1 million Twitter posts.
Data Filtering:
- Used a pretrained VGG19 model to filter out non‑emoji images; 95% of data were initially screened.
- Manual filtering retained text‑image multimodal data, removing approximately 40% of the data.
Data Processing:
- Extracted emojis from text using regular expressions.
- Extracted the text embedded in emoji packs using PaddleOCR, followed by manual correction.
Data Modalities: Four in total (text, images, emojis, and text embedded in images).
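
The regular-expression step for emojis can be sketched as follows. This is a minimal illustration only, not the authors' code: the pattern `EMOJI_PATTERN`, the helper `split_emojis`, and the choice of Unicode ranges are all assumptions, and the ranges are deliberately non-exhaustive.

```python
import re
from typing import List, Tuple

# Assumed, non-exhaustive Unicode ranges covering many common emojis.
# The dataset's actual pattern is not published; this is a stand-in.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]"
)


def split_emojis(text: str) -> Tuple[str, List[str]]:
    """Return the text with emojis removed, plus the emojis in order."""
    emojis = EMOJI_PATTERN.findall(text)
    cleaned = EMOJI_PATTERN.sub("", text)
    # Collapse the whitespace left behind where emojis were removed.
    cleaned = " ".join(cleaned.split())
    return cleaned, emojis


if __name__ == "__main__":
    tweet = "Great day \U0001F600\U0001F389 at the beach \u2600"
    text, emojis = split_emojis(tweet)
    print(text)         # Great day at the beach
    print(len(emojis))  # 3
```

Keeping the emojis as a separate list, rather than discarding them, is what lets them serve as their own modality alongside the cleaned text. Multi-codepoint sequences (skin-tone modifiers, zero-width-joiner combinations, flags) are not handled by this simple character class and would need a dedicated emoji library or a fuller pattern.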

Dataset Status

The dataset will be open‑sourced after the paper is accepted.
