Explore high-quality datasets for your AI and machine learning projects.
We scraped posts that appeared on the Weibo hot list from 2022‑11‑25 to 2023‑03‑08 (only posts from the day they trended) together with their associated comments.
This is an Arabic dataset for authoritative user search on Twitter. The dataset provides the top five users retrieved by the BM25 lexical retrieval model, where the query is a rumor text and the document collection consists of user documents. Each user document is constructed by concatenating the translated profile name and description, along with all translated Twitter list names and descriptions.
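The BM25 retrieval step described above can be sketched in pure Python. The user documents, query, and parameter values (k1 = 1.5, b = 0.75) below are illustrative assumptions, not the dataset's actual contents or configuration:

```python
import math
from collections import Counter

# Hypothetical user documents: each concatenates a profile name, description,
# and list names/descriptions, as in the dataset's document construction.
docs = [
    "health news fact checker verified journalist".split(),
    "sports fan club football updates".split(),
    "breaking news debunked verification claims".split(),
]
query = "rumor verification news".split()  # a rumor text serves as the query

k1, b = 1.5, 0.75          # common BM25 defaults (assumed, not from the dataset)
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))  # document frequency per term

def bm25(query, doc):
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in tf:
            continue
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf[t] * (k1 + 1) / (
            tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        )
    return score

# Rank user documents by BM25 score; the dataset keeps the top five per query.
ranked = sorted(range(N), key=lambda i: bm25(query, docs[i]), reverse=True)
```

In the released dataset this ranking is applied over the full user-document collection, with the top five users stored per rumor query.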
The dataset contains Twitter tweet information, including tweet ID, content, username, user ID, creation time, URL, and like count. It comprises a single training split of 613 samples, totaling 278,470 bytes.
Intentonomy is a dataset of 14,455 images created jointly by Cornell University and Facebook AI to understand and analyze human intent behind social‑media images. The images span everyday scenarios and are manually annotated with 28 intent categories using a psychology‑based taxonomy. Labels were collected via a novel “purpose game” on Amazon Mechanical Turk. The dataset supports tasks such as fake‑news detection and improving vision systems’ understanding of human intent.
This is a Chinese stance‑prediction dataset specifically designed for detecting stance in Chinese micro‑blogs. The data originate from the NLPCC‑ICCPOL 2016 shared task, aiming to identify stance toward five target topics given annotated data. Each instance contains a unique ID, target, text, and stance label (against, favor, or none). The dataset was annotated by Chinese students, ensuring consistency and reliability. It contains only Chinese data and is released under a CC‑BY‑4.0 license.
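A hypothetical instance in the schema described above; the ID, target, and text values are invented for illustration, and only the field layout follows the description:

```python
# One stance-prediction instance: unique ID, target topic, micro-blog text,
# and a stance label drawn from {AGAINST, FAVOR, NONE}.
instance = {
    "id": "0001",                 # unique instance ID (hypothetical)
    "target": "iPhone SE",        # one of the five target topics (hypothetical)
    "text": "这款手机性价比很高",   # micro-blog text (hypothetical)
    "stance": "FAVOR",            # against / favor / none
}

valid_labels = {"AGAINST", "FAVOR", "NONE"}
```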
The dataset is collected from the Xiaohongshu platform and focuses on user-generated posts and comments discussing AI-generated content (AIGC). It spans categories such as advertising, automotive, fashion, food, literature, printing, sports, and technology. Each record includes fields for user ID, content, timestamp, like count, and a sentiment label, enabling analysis of public opinion toward AIGC.
This dataset is part of the Bittensor Subnet 13 decentralized network and contains pre‑processed Reddit data. Network miners continuously update the data, providing a real‑time stream of Reddit content suitable for various analysis and machine‑learning tasks. The dataset includes fields such as text, label, data type, community name, datetime, username (encoded), and URL (encoded). The primary language is English, though multilingual content may be present. It is released under the MIT license and is subject to Reddit's terms of use.
This is a multi-label Chinese affective-computing dataset built from social-media users. It integrates users' personality traits with six emotions and their micro-emotions, each annotated with an intensity level. The dataset aims to advance machine recognition of complex human emotions and to support research in psychology, education, marketing, finance, and politics.
A Sina Weibo comment dataset collected specifically for cyberbullying detection. Comments are labeled as bullying if they contain any of the following: gender discrimination, racial or regional insults, profanity or humiliation, factual distortion, expressions of violence, attacks on appearance or family members, repetitive negative comments, calls for others to join an attack, or the imposition of unwanted or insulting nicknames.
This dataset is part of Bittensor Subnet 13, a decentralized network, and contains pre‑processed Reddit data. The data is continuously updated by network miners, providing a real‑time Reddit content stream suitable for various analysis and machine‑learning tasks. The dataset includes Reddit posts and comments with fields such as text, label, data type, community name, timestamp, anonymized username, and anonymized URL. While primarily English, it may contain multilingual content. Released under an MIT license and subject to Reddit's terms of use, users should be aware of potential biases, data quality variation, and temporal bias.
TwitterFollowGraph is a bipartite directed graph comprising user (consumer) nodes and author (producer) nodes, where edges represent a user's "follow" interaction with an author. Each edge is assigned to a predefined time chunk, denoted by a consecutive ordinal that respects the temporal order of interactions. TwitterFollowGraph contains a total of 261 million edges and 15.5 million vertices, with a maximum degree of 900,000 and a minimum degree of 5. The data format is shown in the table below:

| user_index | author_index | time_chunk |
|------------|--------------|------------|
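A minimal sketch of consuming edges in the (user_index, author_index, time_chunk) format and grouping them by time chunk; the edge values below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical follow edges in (user_index, author_index, time_chunk) form.
edges = [
    (0, 100, 0),
    (0, 101, 0),
    (1, 100, 1),
    (2, 102, 1),
]

# Group follow edges by time chunk; sorting on the chunk ordinal preserves
# the temporal order of interactions described above.
by_chunk = defaultdict(list)
for user, author, chunk in sorted(edges, key=lambda e: e[2]):
    by_chunk[chunk].append((user, author))

# Out-degree of each consumer (user) node, i.e. how many authors they follow.
degree = defaultdict(int)
for user, _author, _chunk in edges:
    degree[user] += 1
```

At the dataset's full scale (261M edges), the same grouping would typically be done with a streaming or chunked reader rather than an in-memory list.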