Dataset asset | Open Source Community | Natural Language Processing | Text Classification

abdiharyadi/eli5-id-preprocessed-tokenized-filtered

The dataset provides three features per example: input_ids, attention_mask, and labels, each stored as a sequence of integers. The train split contains 443,918 examples, and the metadata reports its size as 1,004,301,613.693275 bytes (about 1.0 GB). The download size is 235,069,151 bytes (about 235 MB).

Source

hugging_face

Created

Nov 28, 2025

Updated

May 23, 2024

Overview

Dataset description and usage context

Dataset Information

Features

input_ids: sequence of type int32
attention_mask: sequence of type int8
labels: sequence of type int64
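
The feature types above can be turned into a small sanity check for a single record. This is a minimal sketch in plain Python, not part of the dataset's own tooling: the helper name check_record and the sample record are hypothetical, and the rule that input_ids and attention_mask have equal length is an assumption based on common tokenizer output rather than something the card states.

```python
# Inclusive value ranges for the integer types named in the feature list.
INT_RANGES = {
    "int8": (-2**7, 2**7 - 1),
    "int32": (-2**31, 2**31 - 1),
    "int64": (-2**63, 2**63 - 1),
}

# Feature name -> element type, copied from the dataset card.
SCHEMA = {"input_ids": "int32", "attention_mask": "int8", "labels": "int64"}


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, dtype in SCHEMA.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
            continue
        lo, hi = INT_RANGES[dtype]
        for value in record[name]:
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not lo <= value <= hi:
                problems.append(f"{name}: {value!r} does not fit {dtype}")
                break  # one report per feature is enough
    # Assumption (not stated on the card): the mask has one entry per input token.
    if "input_ids" in record and "attention_mask" in record:
        if len(record["input_ids"]) != len(record["attention_mask"]):
            problems.append("input_ids and attention_mask differ in length")
    return problems


# Made-up record for illustration only; these are not real rows from the dataset.
example = {
    "input_ids": [101, 2054, 2003, 102],
    "attention_mask": [1, 1, 1, 1],
    "labels": [7, 8, 1],
}
print(check_record(example))
```

The labels sequence is deliberately not compared against input_ids in length, since the card does not say whether the two are aligned.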

Data Splits

train:
- Bytes: 1,004,301,613.693275
- Samples: 443,918

Data Size

Download Size: 235,069,151 bytes
Dataset Size: 1,004,301,613.693275 bytes
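
The size figures above imply roughly 2.3 KB per example and a dataset about 4.3 times larger than its download. The arithmetic can be sketched as follows; the function name size_summary is hypothetical, and the three constants are copied from this page.

```python
# Figures copied from the dataset card.
DATASET_BYTES = 1_004_301_613.693275
DOWNLOAD_BYTES = 235_069_151
NUM_EXAMPLES = 443_918


def size_summary(dataset_bytes, download_bytes, num_examples):
    """Derive per-example and download-vs-dataset figures from the raw sizes."""
    return {
        "bytes_per_example": dataset_bytes / num_examples,
        "dataset_to_download_ratio": dataset_bytes / download_bytes,
        "dataset_gb": dataset_bytes / 1e9,    # decimal gigabytes
        "download_mb": download_bytes / 1e6,  # decimal megabytes
    }


summary = size_summary(DATASET_BYTES, DOWNLOAD_BYTES, NUM_EXAMPLES)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

The ratio is only a rough guide to local disk usage, since the on-disk format after download may differ from the one the reported dataset size assumes.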

Configuration

config_name: default
data_files:
  - split: train
    path: data/train-*
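
The configuration above maps the train split to every file matching data/train-*. A minimal sketch of how that might be used from Python follows, assuming the Hugging Face datasets library is installed and the repository is publicly reachable. The helper names and shard filenames are hypothetical, and fnmatch only approximates the glob matching the Hub performs.

```python
from fnmatch import fnmatch

REPO_ID = "abdiharyadi/eli5-id-preprocessed-tokenized-filtered"
TRAIN_PATTERN = "data/train-*"  # path pattern from the configuration


def belongs_to_train(filename):
    """Approximate check of whether a repository file matches the train pattern."""
    return fnmatch(filename, TRAIN_PATTERN)


def load_kwargs(streaming=True):
    """Keyword arguments for datasets.load_dataset matching the default config."""
    return {
        "path": REPO_ID,
        "name": "default",
        "split": "train",
        "streaming": streaming,  # stream to avoid downloading everything up front
    }


def load_train(streaming=True):
    """Load the train split; needs network access and `pip install datasets`."""
    from datasets import load_dataset  # imported lazily so the helpers work without it
    return load_dataset(**load_kwargs(streaming))


# Hypothetical shard names, used only to illustrate the pattern.
print(belongs_to_train("data/train-00000-of-00003.parquet"))
print(belongs_to_train("data/test-00000-of-00001.parquet"))
```

Calling load_train() and reading the first item with next(iter(...)) should yield a dictionary with the input_ids, attention_mask, and labels keys listed under Features. Streaming is chosen here so that a quick inspection does not require the full 235 MB download.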
