Social-Media-Dataset

This dataset is built from over 1 million tweets crawled from Twitter. After filtering and processing, it retains only text‑image multimodal posts; emojis and the text embedded in images are extracted as well, giving a dataset with four modalities.

Updated 11/7/2024
github

Description

Social Media Dataset

Dataset Overview

  • Data Source: Crawled over 1 million Twitter posts.
  • Data Filtering:
    • Used a pretrained VGG19 model to filter out non‑emoji images; 95% of the data was screened in this initial pass.
    • Manual filtering retained text‑image multimodal data, removing approximately 40% of the data.
  • Data Processing:
    • Extracted emojis from text using regular expressions.
    • Extracted the text embedded in emoji‑pack images with PaddleOCR, followed by manual correction.
  • Data Modalities: Four modalities, which from the steps above appear to be text, images, emojis, and embedded image text.
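
The emoji extraction step listed under Data Processing can be sketched roughly as follows. This is only an illustration: the authors' actual regular expression is not published, so the Unicode ranges in `EMOJI_PATTERN` and the helper name `split_text_and_emojis` are assumptions, and the ranges cover common emoji blocks rather than the full Unicode emoji set.

```python
import re
from typing import List, Tuple

# Assumed character ranges for common emoji blocks; the dataset authors'
# real pattern is not given in the description.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]"
)


def split_text_and_emojis(tweet: str) -> Tuple[str, List[str]]:
    """Return the tweet text without emojis, plus the emojis in order."""
    emojis = EMOJI_PATTERN.findall(tweet)
    # Remove the emojis, then collapse any whitespace left behind.
    text = " ".join(EMOJI_PATTERN.sub("", tweet).split())
    return text, emojis


if __name__ == "__main__":
    text, emojis = split_text_and_emojis(
        "Good morning \U0001F600 coffee time \u2615\U0001F680"
    )
    print(text)
    print(emojis)
```

Keeping the emojis as a separate ordered list, rather than discarding them, is what lets them serve as their own modality alongside the cleaned text. Multi-codepoint emojis (skin-tone modifiers, ZWJ sequences, flags) would need a more complete pattern than this sketch uses.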

Dataset Status

  • The dataset will be open‑sourced after the paper is accepted.

Topics

Social Media
Multimodal Data

Source

Organization: github

Created: 11/7/2024
