reddit_dataset_198

This dataset is part of the Bittensor Subnet 13 decentralized network and contains pre‑processed Reddit data. Network miners continuously update the data, providing a real‑time stream of Reddit content suitable for various analysis and machine‑learning tasks. The dataset includes fields such as text, label, data type, community name, datetime, username (encoded), and URL (encoded). The primary language is English, though multilingual content may be present. It is released under the MIT license and is subject to Reddit's terms of use.

Updated 11/29/2024
huggingface

Description

Bittensor Subnet 13 Reddit Dataset

Dataset Description

  • Repository: LadyMia/reddit_dataset_198
  • Subnet: Bittensor Subnet 13
  • Miner Hotkey: 5GBMaEW5jv73t27HEq6f1y2Nu2ZjMu5Mi9W9uoxKe22KTqQ7

Overview

This dataset is part of the Bittensor Subnet 13 decentralized network and contains pre‑processed Reddit data. Network miners continuously update the data, providing a real‑time stream of Reddit content suitable for various analysis and machine‑learning tasks.

Supported Tasks

  • Sentiment analysis
  • Topic modeling
  • Community analysis
  • Content classification

Language

Primary language: English; due to the decentralized creation process, multilingual content may also be present.

Structure

Data Instances

Each instance represents a Reddit post or comment and includes the following fields:

Fields

  • text (string): Main content of the Reddit post or comment.
  • label (string): Sentiment or topic category of the content.
  • dataType (string): Indicates whether the entry is a post or a comment.
  • communityName (string): Name of the subreddit where the content was posted.
  • datetime (string): Date of posting or commenting.
  • username_encoded (string): Encoded version of the username to protect privacy.
  • url_encoded (string): Encoded version of any URLs contained in the content.
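The field list above can be written down as a typed record, which makes it easy to check that a row has the documented shape before using it. This is only a sketch: `RedditRecord`, `missing_fields`, and the sample row are hypothetical names and made-up values, not part of the dataset or of any library, and the check assumes every documented field is present as a string.

```python
from typing import TypedDict


class RedditRecord(TypedDict):
    """Hypothetical typed view of one row, following the documented fields."""
    text: str
    label: str
    dataType: str
    communityName: str
    datetime: str
    username_encoded: str
    url_encoded: str


# Field names in the order they are documented on the card.
REQUIRED_FIELDS = tuple(RedditRecord.__annotations__)


def missing_fields(record: dict) -> list[str]:
    """Return the documented fields that are absent or not strings."""
    return [name for name in REQUIRED_FIELDS
            if not isinstance(record.get(name), str)]


# Made-up sample row for illustration; the values are not real data.
sample = {
    "text": "Example comment body",
    "label": "example-label",
    "dataType": "comment",
    "communityName": "r/AskReddit",
    "datetime": "2024-11-22",
    "username_encoded": "encoded-username-placeholder",
    "url_encoded": "encoded-url-placeholder",
}

print(missing_fields(sample))          # no fields missing for the sample row
print(missing_fields({"text": "hi"}))  # lists the six fields that are absent
```

Rows could be obtained, for example, by streaming the repository with the Hugging Face `datasets` library; that step is left out here because it needs network access.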

Splits

The dataset is continuously updated and does not have fixed splits. Users should create their own splits based on timestamps or other criteria.
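One way to build a timestamp-based split is to pick a cutoff and send everything before it to the training set. The sketch below assumes the `datetime` field holds an ISO 8601 string (either a full timestamp or a date only) and treats values without an offset as UTC; the actual format should be checked against the data. `parse_timestamp`, `split_by_time`, and the sample rows are hypothetical.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 string; values without an offset are treated as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def split_by_time(records, cutoff: str):
    """Return (train, test): rows strictly before the cutoff go to train."""
    cutoff_dt = parse_timestamp(cutoff)
    train, test = [], []
    for record in records:
        is_before = parse_timestamp(record["datetime"]) < cutoff_dt
        (train if is_before else test).append(record)
    return train, test


# Made-up rows for illustration only.
records = [
    {"text": "a", "datetime": "2024-11-22T10:00:00Z"},
    {"text": "b", "datetime": "2024-11-26"},
    {"text": "c", "datetime": "2024-11-28T23:59:59Z"},
]

train, test = split_by_time(records, "2024-11-27T00:00:00Z")
print(len(train), len(test))  # 2 1
```

Because the dataset keeps growing, a fixed cutoff keeps the test set stable only if rows older than the cutoff are never added later; that is an assumption worth verifying before comparing results across snapshots.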

Creation

Source Data

Data were collected from publicly available Reddit posts and comments, adhering to the platform's terms of service and API usage guidelines.

Personal and Sensitive Information

All usernames and URLs have been encoded to protect user privacy. The dataset does not contain personal or sensitive information.

Usage Considerations

Social Impact and Bias

Users should be aware of potential biases in Reddit data, including demographic and content biases. The dataset reflects the content and viewpoints expressed on Reddit and should not be considered representative of the general population.

Limitations

  • Data quality may vary due to the nature of the source.
  • The dataset may contain noise, spam, or irrelevant content typical of social‑media platforms.
  • Temporal bias may exist because of the real‑time collection method.
  • Only public subreddits are included; private or restricted communities are excluded.

Additional Information

License

The dataset is released under the MIT license. Use of the dataset must also comply with Reddit's terms of service.

Citation

If you use this dataset in research, please cite:

@misc{LadyMia2024datauniversereddit_dataset_198,
        title={The Data Universe Datasets: The finest collection of social media data the web has to offer},
        author={LadyMia},
        year={2024},
        url={https://huggingface.co/datasets/LadyMia/reddit_dataset_198},
}

Contribution

To report issues or contribute to the dataset, contact a miner or use the Bittensor Subnet 13 governance mechanism.

Statistics

  • Total Instances: 37 827 699
  • Date Range: 2024‑11‑22T00:00:00Z to 2024‑11‑29T00:00:00Z
  • Last Update: 2024‑11‑29T09:34:43Z

Distribution

  • Posts: 6.08 %
  • Comments: 93.92 %

Top 10 Subreddits

Rank  Subreddit            Count    Percentage
1     r/AskReddit          334 985  0.89 %
2     r/AITAH              173 050  0.46 %
3     r/nfl                156 456  0.41 %
4     r/CFB                140 714  0.37 %
5     r/Pixelary           124 676  0.33 %
6     r/politics           124 597  0.33 %
7     r/teenagers          102 132  0.27 %
8     r/NoStupidQuestions  101 970  0.27 %
9     r/repost              93 989  0.25 %
10    r/Cricket             86 638  0.23 %
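The percentages in this ranking appear to be each subreddit's count divided by the total number of instances. A ranking of the same kind could be recomputed from the `communityName` field roughly as follows; `top_communities` and the sample rows are hypothetical, and the final lines check the r/AskReddit share against the totals reported under Statistics.

```python
from collections import Counter


def top_communities(records, n=10):
    """Return (communityName, count, percentage) for the n most frequent communities."""
    counts = Counter(record["communityName"] for record in records)
    total = sum(counts.values())
    return [(name, count, 100.0 * count / total)
            for name, count in counts.most_common(n)]


# Made-up rows for illustration only.
sample = [
    {"communityName": "r/AskReddit"},
    {"communityName": "r/nfl"},
    {"communityName": "r/AskReddit"},
    {"communityName": "r/AskReddit"},
]
for name, count, pct in top_communities(sample):
    print(f"{name}: {count} ({pct:.2f} %)")

# Check with the reported figures: 334 985 of 37 827 699 instances.
share = 100.0 * 334_985 / 37_827_699
print(f"{share:.2f} %")  # 0.89 %
```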

Update History

Date                  New Instances  Total Instances
2024‑11‑22T08:38:31Z        753 675          753 675
2024‑11‑25T21:19:02Z     18 771 869       19 525 544
2024‑11‑29T09:34:43Z     18 302 155       37 827 699
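Each total in the update history should equal the running sum of the new instances up to that update. A short consistency check with the reported figures, written as a sketch (the variable names are illustrative):

```python
from itertools import accumulate

# Figures copied from the update history, in chronological order.
new_instances = [753_675, 18_771_869, 18_302_155]
reported_totals = [753_675, 19_525_544, 37_827_699]

# Running sum of new instances after each update.
running = list(accumulate(new_instances))

print(running == reported_totals)  # True
```

The last running total, 37 827 699, also matches the Total Instances figure under Statistics.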

Topics

Social Media Analysis
Machine Learning

Source

Organization: huggingface

Created: 11/22/2024
