reddit_dataset_149

This dataset is part of Bittensor Subnet 13, a decentralized network, and contains preprocessed Reddit data. Network miners update the data continuously, providing a real-time stream of Reddit content suitable for various analysis and machine-learning tasks. The dataset includes Reddit posts and comments with fields such as text, label, data type, community name, timestamp, encoded username, and encoded URL. It is primarily English but may contain multilingual content. The dataset is released under the MIT license and is subject to Reddit's terms of use. Users should be aware of potential biases, variable data quality, and temporal bias.

Updated 11/30/2024
huggingface

Description

Bittensor Subnet 13 Reddit Dataset

Dataset Description

  • Repository: LadyMia/reddit_dataset_149
  • Subnet: Bittensor Subnet 13
  • Miner Hotkey: 5ER93P7YrerwowGELtpnnkqoK7poR1Q8mca3f84k7b3nig3D

Dataset Summary

This dataset is part of the Bittensor Subnet 13 decentralized network, containing preprocessed Reddit data. The data is continuously updated by network miners, providing a real-time stream of Reddit content for various analytical and machine learning tasks.

Supported Tasks

  • Sentiment Analysis
  • Topic Modeling
  • Community Analysis
  • Content Categorization

Languages

Primary language: Mostly English, but the data can be multilingual because it is collected in a decentralized way.

Dataset Structure

Data Instances

Each instance represents a single Reddit post or comment with the following fields:

Data Fields

  • text (string): The main content of the Reddit post or comment.
  • label (string): Sentiment or topic category of the content.
  • dataType (string): Indicates whether the entry is a post or a comment.
  • communityName (string): The name of the subreddit where the content was posted.
  • datetime (string): The date when the post or comment was published.
  • username_encoded (string): An encoded version of the username to maintain user privacy.
  • url_encoded (string): An encoded version of any URLs included in the content.
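
As a minimal sketch of how the field list above might be checked in practice, the following Python snippet tests whether a record carries all seven documented fields as strings. The helper name `missing_or_invalid_fields` and the sample row are hypothetical placeholders for illustration; they are not values taken from the dataset.

```python
# Field names as listed in the dataset card; each is documented as a string.
EXPECTED_FIELDS = (
    "text", "label", "dataType", "communityName",
    "datetime", "username_encoded", "url_encoded",
)


def missing_or_invalid_fields(record: dict) -> list:
    """Return the expected field names that are absent or not strings."""
    return [
        name for name in EXPECTED_FIELDS
        if not isinstance(record.get(name), str)
    ]


# Hypothetical placeholder row (values invented for illustration only).
sample = {
    "text": "placeholder text",
    "label": "placeholder label",
    "dataType": "comment",
    "communityName": "r/AskReddit",
    "datetime": "2024-11-23",
    "username_encoded": "placeholder",
    "url_encoded": "",
}

print(missing_or_invalid_fields(sample))                 # []
print(missing_or_invalid_fields({"text": "only text"}))  # the six other fields
```

Real rows may contain null values in some fields, in which case this check would flag them; whether that should count as an error depends on the task.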

Data Splits

This dataset is continuously updated and does not have fixed splits. Users should create their own splits based on their requirements and the data's timestamps.
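
One way to build such a split is a simple time cutoff: records before the cutoff go to training, the rest to evaluation. The sketch below assumes the `datetime` field holds an ISO 8601 string (either a date or a full timestamp ending in `Z`), which is an assumption based on the timestamps shown in the statistics. The function names, the cutoff, and the sample rows are hypothetical.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 string, treating values without a zone as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def split_by_time(records, cutoff: str):
    """Return (train, test): records before the cutoff are train, the rest test."""
    cutoff_dt = parse_timestamp(cutoff)
    train, test = [], []
    for record in records:
        if parse_timestamp(record["datetime"]) < cutoff_dt:
            train.append(record)
        else:
            test.append(record)
    return train, test


# Hypothetical placeholder rows (invented for illustration only).
rows = [
    {"text": "example a", "datetime": "2024-11-24"},
    {"text": "example b", "datetime": "2024-11-29T12:00:00Z"},
]
train, test = split_by_time(rows, "2024-11-28T00:00:00Z")
print(len(train), len(test))
```

A time-based cutoff avoids leaking later content into the training portion, which matters for a stream that keeps growing.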

Dataset Creation

Source Data

Data is collected from public posts and comments on Reddit, adhering to the platform's terms of service and API usage guidelines.

Personal and Sensitive Information

All usernames and URLs are encoded to protect user privacy. The dataset does not intentionally include personal or sensitive information.

Considerations for Using the Data

Social Impact and Biases

Users should be aware of potential biases inherent in Reddit data, including demographic and content biases. This dataset reflects the content and opinions expressed on Reddit and should not be considered a representative sample of the general population.

Limitations

  • Data quality may vary due to the nature of social media sources.
  • The dataset may contain noise, spam, or irrelevant content typical of social media platforms.
  • Temporal biases may exist due to real-time collection methods.
  • The dataset is limited to public subreddits and does not include private or restricted communities.

Additional Information

Licensing Information

The dataset is released under the MIT license. Use of this dataset is also subject to Reddit's Terms of Use.

Citation Information

If you use this dataset in your research, please cite it as follows:

@misc{LadyMia2024datauniversereddit_dataset_149,
  title={The Data Universe Datasets: The finest collection of social media data the web has to offer},
  author={LadyMia},
  year={2024},
  url={https://huggingface.co/datasets/LadyMia/reddit_dataset_149},
}

Contributions

To report issues or contribute to the dataset, please contact the miner or use the Bittensor Subnet 13 governance mechanisms.

Dataset Statistics

  • Total Instances: 37221287
  • Date Range: 2024-11-23T00:00:00Z to 2024-11-30T00:00:00Z
  • Last Updated: 2024-11-30T08:35:12Z

Data Distribution

  • Posts: 6.09%
  • Comments: 93.91%

Top 10 Subreddits

Rank  Topic                  Total Count  Percentage
1     r/AskReddit            327580       0.88%
2     r/CFB                  191250       0.51%
3     r/AITAH                181540       0.49%
4     r/nfl                  167411       0.45%
5     r/politics             119413       0.32%
6     r/Pixelary             116224       0.31%
7     r/NoStupidQuestions    102496       0.28%
8     r/teenagers            99724        0.27%
9     r/repost               88206        0.24%
10    r/mildlyinfuriating    79357        0.21%
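
The percentages in the Top 10 Subreddits list appear to be each subreddit's count divided by the Total Instances figure (37221287). The short sketch below recomputes them from the published counts; the variable and function names are illustrative only.

```python
TOTAL_INSTANCES = 37221287  # from Dataset Statistics

# Counts copied from the Top 10 Subreddits list.
top_subreddits = {
    "r/AskReddit": 327580,
    "r/CFB": 191250,
    "r/AITAH": 181540,
    "r/nfl": 167411,
    "r/politics": 119413,
    "r/Pixelary": 116224,
    "r/NoStupidQuestions": 102496,
    "r/teenagers": 99724,
    "r/repost": 88206,
    "r/mildlyinfuriating": 79357,
}


def share_percent(count: int, total: int = TOTAL_INSTANCES) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


for name, count in top_subreddits.items():
    print(f"{name}: {share_percent(count)}%")

# Combined share of the ten largest subreddits.
top10_share = share_percent(sum(top_subreddits.values()))
print(f"Top 10 combined: {top10_share}%")
```

The ten largest subreddits together account for only about 4% of all instances, so the content is spread across a long tail of communities.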

Update History

Date                  New Instances  Total Instances
2024-11-23T08:11:02Z  763287         763287
2024-11-26T20:17:50Z  18353497       19116784
2024-11-30T08:35:12Z  18104503       37221287
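
The update history is cumulative: each Total Instances value should equal the running sum of the New Instances values up to that date. A small sketch to confirm this from the figures above (variable names are illustrative):

```python
from itertools import accumulate

# (date, new instances, total instances) as listed in the update history.
updates = [
    ("2024-11-23T08:11:02Z", 763287, 763287),
    ("2024-11-26T20:17:50Z", 18353497, 19116784),
    ("2024-11-30T08:35:12Z", 18104503, 37221287),
]

# Running sum of the "new instances" column.
running_totals = list(accumulate(new for _, new, _ in updates))

for (date, new, total), running in zip(updates, running_totals):
    print(date, new, total, "consistent" if running == total else "mismatch")
```

The final running total matches the Total Instances figure reported under Dataset Statistics.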

Topics

Social Media Analysis
Text Mining

Source

Organization: huggingface

Created: 11/23/2024
