Tags: Open Source Community, Multimodal Data, Social Media

Social-Media-Dataset

This dataset is built from over 1 million tweets crawled from Twitter. After filtering and processing, it retains posts that pair text with an image; emojis are extracted from the text and embedded text is extracted from the images, giving four modalities: text, images, emojis, and embedded text.

Source: GitHub
Created: Nov 7, 2024
Updated: Nov 7, 2024

Dataset Overview

  • Data Source: Crawled over 1 million Twitter posts.
  • Data Filtering:
    • A pretrained VGG19 model was used to filter out non‑emoji images; this initial automated pass screened 95% of the data.
    • Manual filtering then retained only posts that combine text and an image, removing approximately 40% of the data.
  • Data Processing:
    • Extracted emojis from text using regular expressions.
    • Extracted the text embedded in emoji‑pack images using PaddleOCR, followed by manual correction.
  • Data Modalities: Four modalities (text, image, emoji, and text embedded in images).
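
The emoji extraction step listed above can be illustrated with a minimal Python sketch. The source does not publish its pattern, so the Unicode ranges in `EMOJI_PATTERN` and the helper name `split_emojis` are assumptions chosen for illustration. They cover common emoji blocks but not every sequence (for example, flags and skin-tone modifiers are not handled).

```python
import re
from typing import List, Tuple

# Assumed Unicode ranges covering common emoji blocks.
# This is an illustrative pattern, not the one used by the dataset authors.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U0001FA70-\U0001FAFF"  # symbols and pictographs extended-A
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]"
)


def split_emojis(text: str) -> Tuple[str, List[str]]:
    """Return the text with emojis removed and the emojis in order of appearance."""
    emojis = EMOJI_PATTERN.findall(text)
    cleaned = EMOJI_PATTERN.sub("", text)
    # Collapse the whitespace left behind where emojis were removed.
    cleaned = " ".join(cleaned.split())
    return cleaned, emojis
```

Calling `split_emojis` on a tweet yields the plain text and the emoji list as two separate fields, which matches the idea of storing emojis as a modality of their own.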

Dataset Status

  • The dataset will be open‑sourced after the paper is accepted.