Dataset asset, Open Source Community, Multimodal Data, Social Media
Social-Media-Dataset
This dataset is built from over 1 million tweets crawled from Twitter. After filtering and processing, only posts containing both text and an image are retained. Emojis are extracted from the post text and embedded text is extracted from the images, yielding a dataset with four modalities.
Source
GitHub
Created
Nov 7, 2024
Updated
Nov 7, 2024
Social Media Dataset
Dataset Overview
- Data Source: Crawled over 1 million Twitter posts.
- Data Filtering:
  - Used a pretrained VGG19 model to filter out non-emoji images; 95% of the data were screened in this initial pass.
  - Manual filtering then kept only posts with both text and an image, removing approximately 40% of the data.
- Data Processing:
  - Extracted emojis from the post text using regular expressions.
  - Extracted the text embedded in emoji-pack images using PaddleOCR, followed by manual correction.
- Data Modalities: Four modalities: text, image, emoji, and text embedded in images.
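
The regular-expression emoji extraction step could look roughly like the sketch below. This is an illustration only, not the authors' code: the function name `split_emojis` and the chosen Unicode ranges are assumptions, and the ranges cover common emoji blocks rather than the full Unicode emoji set (for example, flags and variation selectors are not handled).

```python
import re

# Assumed character ranges covering common emoji blocks (not exhaustive).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U0001FA70-\U0001FAFF"  # symbols and pictographs extended-A
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]"
)


def split_emojis(text):
    """Return (text_without_emojis, list_of_emojis) for one post.

    Hypothetical helper: separates the emoji modality from the plain text.
    """
    emojis = EMOJI_PATTERN.findall(text)
    stripped = EMOJI_PATTERN.sub("", text)
    # Collapse the whitespace left behind where emojis were removed.
    stripped = " ".join(stripped.split())
    return stripped, emojis


if __name__ == "__main__":
    clean, found = split_emojis("Great game \U0001F602\U0001F525 tonight")
    print(clean)  # text modality
    print(found)  # emoji modality
```

Keeping the emojis as a separate list, rather than discarding them, matches the description of emoji as its own modality alongside the cleaned text.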
Dataset Status
- The dataset will be open‑sourced after the paper is accepted.