FEMNIST

FEMNIST is an image classification dataset with 62 classes (10 digits, 26 lowercase letters, 26 uppercase letters). Images are 28×28 pixels (optionally upscaled to 128×128 pixels), and the data covers 3,500 users. The dataset is derived by partitioning EMNIST so that each user's data consists of characters written by a single author.

Updated 7/7/2022

Description

Dataset Overview

1. FEMNIST

Type: Image dataset
Details: Contains 62 classes (10 digits, 26 lowercase letters, 26 uppercase letters), image size 28×28 pixels, optionally adjustable to 128×128 pixels, with a total of 3,500 users.
Task: Image classification
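
The per-author partitioning behind FEMNIST can be illustrated with a minimal sketch. It assumes each sample arrives as a (writer_id, image, label) triple; the function name, the writer IDs, and the string stand-ins for images are hypothetical and not part of any official loader.

```python
from collections import defaultdict

def partition_by_writer(samples):
    """Group (writer_id, image, label) triples into one shard per writer.

    Hypothetical helper: each shard plays the role of one federated "user".
    """
    shards = defaultdict(list)
    for writer_id, image, label in samples:
        shards[writer_id].append((image, label))
    return dict(shards)

# Toy data: strings stand in for 28x28 images, labels are class indices (0-61).
samples = [
    ("w001", "img_a", 7),
    ("w002", "img_b", 3),
    ("w001", "img_c", 12),
]
shards = partition_by_writer(samples)
```

Each key in shards then corresponds to one user whose data was written by a single author.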

2. Sentiment140

Type: Text dataset containing tweets
Details: 660,120 users
Task: Sentiment analysis
Format: Includes training and test sets; data stored in JSON format, each user’s data includes tweet content and sentiment label.
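
Reading per-user JSON data of this kind might look like the sketch below. The exact schema is not given here, so the key names ("users", "user_data", "x", "y") and the label encoding are assumptions chosen for illustration; adjust them to match the actual files.

```python
import json

# Assumed layout: a list of user IDs plus, per user, parallel lists of
# tweet texts ("x") and sentiment labels ("y"). All names are hypothetical.
raw = """
{
  "users": ["u1", "u2"],
  "user_data": {
    "u1": {"x": ["great day", "so tired"], "y": [1, 0]},
    "u2": {"x": ["love this"], "y": [1]}
  }
}
"""

def load_user_data(text):
    """Return {user_id: [(tweet, label), ...]} from the assumed JSON layout."""
    data = json.loads(text)
    return {
        user: list(zip(data["user_data"][user]["x"], data["user_data"][user]["y"]))
        for user in data["users"]
    }

per_user = load_user_data(raw)
```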

3. Shakespeare

Type: Text dataset containing Shakespeare dialogues
Details: 1,129 users (reduced to 660 users)
Task: Next‑character prediction
Format: Text format containing dialogue content.
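
As an illustration of the next-character prediction task, the sketch below turns a piece of dialogue text into (context, next character) training pairs. The function name and the context length are assumptions for the example, not values prescribed by the dataset.

```python
def next_char_pairs(text, seq_len):
    """Slide a window over text, pairing each seq_len-character context
    with the single character that follows it."""
    return [
        (text[i:i + seq_len], text[i + seq_len])
        for i in range(len(text) - seq_len)
    ]

# Toy example with a 4-character context window.
pairs = next_char_pairs("to be or", 4)
```

A model for this task would receive the context string as input and be trained to predict the paired character.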

4. CelebA

Type: Image dataset
Details: 9,343 users (excluding celebrities with fewer than 5 images)
Task: Image classification (smiling vs. non‑smiling)
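
The exclusion of celebrities with fewer than 5 images can be sketched as a simple count-and-filter step. The mapping from image file name to celebrity ID and the helper name below are hypothetical; only the threshold of 5 comes from the description above.

```python
from collections import Counter

def keep_active_users(image_owners, min_images=5):
    """Keep only images whose owner has at least min_images images.

    image_owners maps an image name to a celebrity (user) ID.
    """
    counts = Counter(image_owners.values())
    return {
        image: owner
        for image, owner in image_owners.items()
        if counts[owner] >= min_images
    }

# Toy data: celebrity "c1" has 5 images, "c2" has only 2 and is dropped.
image_owners = {f"c1_{i}.jpg": "c1" for i in range(5)}
image_owners.update({"c2_0.jpg": "c2", "c2_1.jpg": "c2"})
filtered = keep_active_users(image_owners)
```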

5. Synthetic Dataset

Type: Synthetic dataset
Details: Users can customize the number of devices, number of classes, dimensions, etc.
Task: Classification
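
A minimal sketch of a configurable synthetic generator is shown below. It is not the generator shipped with the dataset: the sampling scheme (Gaussian features, one random linear labeling model per device) and every parameter name are assumptions used only to show how the number of devices, classes, and dimensions could be exposed.

```python
import random

def make_synthetic(num_devices, num_classes, dim, samples_per_device, seed=0):
    """Generate a toy federated classification dataset.

    Each device draws its own random linear model, so label distributions
    differ from device to device.
    """
    rng = random.Random(seed)
    devices = {}
    for d in range(num_devices):
        # One weight vector per class, specific to this device.
        weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_classes)]
        xs, ys = [], []
        for _ in range(samples_per_device):
            x = [rng.gauss(0, 1) for _ in range(dim)]
            scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
            xs.append(x)
            ys.append(scores.index(max(scores)))  # label = highest-scoring class
        devices[f"device_{d}"] = {"x": xs, "y": ys}
    return devices

data = make_synthetic(num_devices=3, num_classes=4, dim=5, samples_per_device=10)
```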

6. Reddit

Type: Text dataset containing Reddit comments
Details: 1,660,820 users, total of 56,587,343 comments
Task: Next‑word prediction

7. CIFAR 10 / CIFAR 100

Type: Image classification dataset
Details: Each dataset contains 60,000 color images of size 32×32 pixels, distributed across 10 classes (CIFAR 10) or 100 classes (CIFAR 100), with a 50,000/10,000 training/testing split.
Task: Image classification

8. FedVision - Street Dataset

Type: Real‑world object detection dataset
Details: Contains 5 or 20 devices, 956 samples, 7 classes
Task: Object detection
Format: Includes image data and training labels, stored in JSON format.

9. EMNIST

Type: Extended MNIST dataset containing English letters and digits
Details: Divided into 6 subsets: By_Class, By_Merge, Balanced, Digits, Letters, and MNIST
Task: Classification

10. MovieLens

Type: Structured dataset
Details: Contains user ratings for movies and movie attributes; ratings are on a 5‑point scale
Task: Recommendation system
Format: Includes ratings.dat, users.dat, and movies.dat
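
Parsing the ratings file might look like the sketch below. It assumes the MovieLens 1M convention of one record per line in the form UserID::MovieID::Rating::Timestamp; if the copy you download uses a different delimiter or column order, the split logic needs to change. The function name is hypothetical.

```python
def parse_ratings(lines):
    """Parse 'UserID::MovieID::Rating::Timestamp' lines into integer tuples."""
    ratings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, movie_id, rating, timestamp = line.split("::")
        ratings.append((int(user_id), int(movie_id), int(rating), int(timestamp)))
    return ratings

# Two sample records in the assumed format.
sample = ["1::1193::5::978300760", "1::661::3::978302109"]
ratings = parse_ratings(sample)
```

With a real file, the same function can be fed an open file handle, for example parse_ratings(open("ratings.dat", encoding="latin-1")).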

11. Credit

Type: Structured dataset
Details: Contains user attributes such as gender, education level, etc. Credit 1 includes 150,000 samples with 10 attributes; Credit 2 includes 30,000 samples with 25 attributes
Task: Classification (predict whether a user will default on repayment)

12. ModelNet

Type: Image classification dataset
Details: Contains 12,311 3D models from 40 categories, captured from various viewpoints
Task: Image classification
Processing: Requires converting the CAD models to images using the open‑source software Blender

13. PersonaChat

Type: Dialogue dataset
Details: Naturally non‑i.i.d.; partitioned by assigned persona into 17,568 clients

14. KWS

Type: Speech command dataset
Task: Limited‑vocabulary speech recognition

15. Flickr

Type: Personalized image aesthetic dataset
Task: Personalized image classification

Topics

Image Classification

Character Recognition

Source

Organization: github

Created: 11/17/2020
