Social-Media-Dataset
This dataset is built from over 1 million tweets crawled from Twitter. Filtering retains the multimodal text‑image posts, and processing then extracts emojis and embedded text, yielding a dataset with four modalities.
Updated 11/7/2024
Description
Social Media Dataset
Dataset Overview
- Data Source: Crawled over 1 million Twitter posts.
- Data Filtering:
  - A pretrained VGG19 model was used to filter out non‑emoji images; 95% of the data were initially screened.
  - Manual filtering then retained the text‑image multimodal data, removing approximately 40% of the data.
- Data Processing:
  - Emojis were extracted from the text using regular expressions.
  - Text embedded in emoji packs was obtained with PaddleOCR, followed by manual correction.
- Data Modalities: Four modalities (text, images, emojis, and embedded text).
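The emoji-extraction step listed above can be sketched with Python's standard `re` module. This is only an illustration: the dataset's actual regular expression is not published, so the Unicode ranges in `EMOJI_PATTERN` and the helper name `split_emojis` are assumptions, and the ranges are not exhaustive (for example, flags and skin-tone sequences are not handled).

```python
import re

# Assumed, non-exhaustive set of Unicode emoji ranges (not the authors' pattern).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]"
)


def split_emojis(text):
    """Return (text with emojis removed, list of emojis found in order)."""
    emojis = EMOJI_PATTERN.findall(text)
    cleaned = EMOJI_PATTERN.sub("", text)
    # Collapse the extra whitespace left behind by removed emojis.
    cleaned = " ".join(cleaned.split())
    return cleaned, emojis


if __name__ == "__main__":
    tweet = "Great day \U0001F600 at the beach \U0001F30A"
    plain, found = split_emojis(tweet)
    print(plain)
    print(found)
```

Keeping the emojis as a separate list, rather than discarding them, matches the idea of treating emojis as their own modality alongside the cleaned text.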
Dataset Status
- The dataset will be open‑sourced after the paper is accepted.
Topics
Social Media
Multimodal Data
Source
Organization: github
Created: 11/7/2024