reddit_dataset_198
This dataset is part of the Bittensor Subnet 13 decentralized network and contains pre‑processed Reddit data. Network miners continuously update the data, providing a real‑time stream of Reddit content suitable for various analysis and machine‑learning tasks. The dataset includes fields such as text, label, data type, community name, datetime, username (encoded), and URL (encoded). The primary language is English, though multilingual content may be present. It is released under the MIT license and is subject to Reddit's terms of use.
Description
Bittensor Subnet 13 Reddit Dataset
Dataset Description
- Repository: LadyMia/reddit_dataset_198
- Subnet: Bittensor Subnet 13
- Miner Hotkey: 5GBMaEW5jv73t27HEq6f1y2Nu2ZjMu5Mi9W9uoxKe22KTqQ7
Supported Tasks
- Sentiment analysis
- Topic modeling
- Community analysis
- Content classification
Language
Primary language: English; due to the decentralized creation process, multilingual content may also be present.
Structure
Data Instances
Each instance represents a Reddit post or comment and includes the following fields:
Fields
- text (string): Main content of the Reddit post or comment.
- label (string): Sentiment or topic category of the content.
- dataType (string): Indicates whether the entry is a post or a comment.
- communityName (string): Name of the subreddit where the content was posted.
- datetime (string): Date of posting or commenting.
- username_encoded (string): Encoded version of the username to protect privacy.
- url_encoded (string): Encoded version of any URLs contained in the content.
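The field list can be sketched as a typed record, which makes it easy to check that a row carries every documented field. This is a minimal illustration, not part of the dataset tooling: `RedditRecord`, `missing_fields`, and every value in `sample` are hypothetical, and the exact strings stored in `dataType` and `datetime` are assumptions.

```python
from typing import TypedDict


class RedditRecord(TypedDict):
    """Shape of one instance, following the documented field list."""
    text: str
    label: str
    dataType: str
    communityName: str
    datetime: str
    username_encoded: str
    url_encoded: str


# Field names in the order they are documented.
EXPECTED_FIELDS = tuple(RedditRecord.__annotations__)


def missing_fields(record: dict) -> list:
    """Return the documented field names that are absent from `record`."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# Hypothetical record: every value below is invented for illustration.
sample = {
    "text": "Example comment body",
    "label": "example-label",
    "dataType": "comment",
    "communityName": "r/AskReddit",
    "datetime": "2024-11-25",
    "username_encoded": "<encoded-username>",
    "url_encoded": "<encoded-url>",
}

print(missing_fields(sample))  # → []
```

A check like this is useful before downstream processing, since rows come from a decentralized collection process.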
Splits
The dataset is continuously updated and does not have fixed splits. Users should create their own splits based on timestamps or other criteria.
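One way to build a timestamp-based split is to pick a cutoff and send earlier records to training and later ones to evaluation. The sketch below assumes the `datetime` field holds an ISO 8601 string (the card only says it is a string); `parse_utc`, `split_by_time`, the cutoff, and the sample records are hypothetical.

```python
from datetime import datetime, timezone


def parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 string into a timezone-aware UTC datetime.

    Assumption: timestamps are ISO 8601. A trailing "Z" is rewritten
    because datetime.fromisoformat rejects it before Python 3.11.
    """
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Treat naive values (for example a bare date) as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def split_by_time(records, cutoff: str):
    """Return (train, test): records strictly before the cutoff, and the rest."""
    cutoff_dt = parse_utc(cutoff)
    train, test = [], []
    for record in records:
        target = train if parse_utc(record["datetime"]) < cutoff_dt else test
        target.append(record)
    return train, test


# Hypothetical records; only the fields needed for the split are shown.
records = [
    {"text": "a", "datetime": "2024-11-22T08:00:00Z"},
    {"text": "b", "datetime": "2024-11-27T12:00:00Z"},
    {"text": "c", "datetime": "2024-11-28T23:59:59Z"},
]

train, test = split_by_time(records, cutoff="2024-11-27T00:00:00Z")
print([r["text"] for r in train], [r["text"] for r in test])  # → ['a'] ['b', 'c']
```

Splitting on time rather than at random keeps later content out of the training set, which matters for a dataset that grows with each update.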
Creation
Source Data
Data were collected from publicly available Reddit posts and comments, adhering to the platform's terms of service and API usage guidelines.
Personal and Sensitive Information
All usernames and URLs have been encoded to protect user privacy. The dataset does not contain personal or sensitive information.
Usage Considerations
Social Impact and Bias
Users should be aware of potential biases in Reddit data, including demographic and content biases. The dataset reflects the content and viewpoints expressed on Reddit and should not be considered representative of the general population.
Limitations
- Data quality may vary due to the nature of the source.
- The dataset may contain noise, spam, or irrelevant content typical of social‑media platforms.
- Temporal bias may exist because of the real‑time collection method; the current snapshot covers only 2024‑11‑22 to 2024‑11‑29.
- Only public subreddits are included; private or restricted communities are excluded.
Additional Information
License
The dataset is released under the MIT license. Use of the dataset must also comply with Reddit's terms of service.
Citation
If you use this dataset in research, please cite:
@misc{LadyMia2024datauniversereddit_dataset_198,
title={The Data Universe Datasets: The finest collection of social media data the web has to offer},
author={LadyMia},
year={2024},
url={https://huggingface.co/datasets/LadyMia/reddit_dataset_198},
}
Contribution
To report issues or contribute to the dataset, contact a miner or use the Bittensor Subnet 13 governance mechanism.
Statistics
- Total Instances: 37 827 699
- Date Range: 2024‑11‑22T00:00:00Z to 2024‑11‑29T00:00:00Z
- Last Update: 2024‑11‑29T09:34:43Z
Distribution
- Posts: 6.08 %
- Comments: 93.92 %
Top 10 Subreddits
| Rank | Subreddit | Count | Percentage |
|---|---|---|---|
| 1 | r/AskReddit | 334 985 | 0.89 % |
| 2 | r/AITAH | 173 050 | 0.46 % |
| 3 | r/nfl | 156 456 | 0.41 % |
| 4 | r/CFB | 140 714 | 0.37 % |
| 5 | r/Pixelary | 124 676 | 0.33 % |
| 6 | r/politics | 124 597 | 0.33 % |
| 7 | r/teenagers | 102 132 | 0.27 % |
| 8 | r/NoStupidQuestions | 101 970 | 0.27 % |
| 9 | r/repost | 93 989 | 0.25 % |
| 10 | r/Cricket | 86 638 | 0.23 % |
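The Percentage column is each subreddit's count divided by the total instance count. The short sketch below recomputes it from the figures in this card; `share_percent` is a hypothetical helper, and rounding to two decimals is an assumption that happens to match the table.

```python
TOTAL_INSTANCES = 37_827_699  # "Total Instances" reported in this card

# Counts copied from the Top 10 Subreddits table.
top_subreddits = {
    "r/AskReddit": 334_985,
    "r/AITAH": 173_050,
    "r/nfl": 156_456,
    "r/CFB": 140_714,
    "r/Pixelary": 124_676,
    "r/politics": 124_597,
    "r/teenagers": 102_132,
    "r/NoStupidQuestions": 101_970,
    "r/repost": 93_989,
    "r/Cricket": 86_638,
}


def share_percent(count: int, total: int = TOTAL_INSTANCES) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


for name, count in top_subreddits.items():
    print(f"{name}: {share_percent(count)} %")

# Combined share of the ten largest subreddits.
top10_total = sum(top_subreddits.values())
print(top10_total, share_percent(top10_total))
```

The ten largest subreddits together account for under 4 % of all instances, so the data is spread across a long tail of communities.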
Update History
| Date | New Instances | Total Instances |
|---|---|---|
| 2024‑11‑22T08:38:31Z | 753 675 | 753 675 |
| 2024‑11‑25T21:19:02Z | 18 771 869 | 19 525 544 |
| 2024‑11‑29T09:34:43Z | 18 302 155 | 37 827 699 |
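The Total Instances column is the running sum of New Instances. A quick consistency check, using only the figures in the table above (`running_totals` is a hypothetical helper):

```python
from itertools import accumulate

# (timestamp, new instances, reported total) copied from the Update History table.
update_history = [
    ("2024-11-22T08:38:31Z", 753_675, 753_675),
    ("2024-11-25T21:19:02Z", 18_771_869, 19_525_544),
    ("2024-11-29T09:34:43Z", 18_302_155, 37_827_699),
]


def running_totals(history):
    """Return the cumulative sum of the 'new instances' column."""
    return list(accumulate(new for _, new, _ in history))


computed = running_totals(update_history)
reported = [total for _, _, total in update_history]

print(computed == reported)  # → True
```

The same check can be rerun after each update to confirm that the reported total still matches the sum of the individual batches.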
Source
Organization: huggingface
Created: 11/22/2024