Dataset asset | Open Source Community | Natural Language Processing | Text Classification

abdiharyadi/eli5-id-preprocessed-tokenized-filtered

The dataset provides three features per example: input_ids, attention_mask, and labels, each stored as an integer sequence. The train split contains 443,918 examples with a reported size of 1,004,301,613.693275 bytes (about 1.0 GB). The download size is 235,069,151 bytes (about 235 MB).

Source: hugging_face
Created: Nov 28, 2025
Updated: May 23, 2024
Signals: 87 views
Availability: Linked source ready
Overview

Dataset description and usage context

Dataset Information

Features

  • input_ids: sequence of type int32
  • attention_mask: sequence of type int8
  • labels: sequence of type int64
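
The declared dtypes above bound the values each feature can hold. A minimal sketch of a per-record sanity check follows, using only the Python standard library. The `check_record` helper and the example record are hypothetical and not part of the dataset. The length check assumes attention_mask aligns one-to-one with input_ids, which is the usual tokenizer output but is not stated on this page.

```python
# Integer ranges implied by the dtypes listed in the Features section.
INT_RANGES = {
    "input_ids": (-2**31, 2**31 - 1),     # int32
    "attention_mask": (-2**7, 2**7 - 1),  # int8
    "labels": (-2**63, 2**63 - 1),        # int64
}


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, (lo, hi) in INT_RANGES.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
            continue
        for value in record[name]:
            if not isinstance(value, int) or not lo <= value <= hi:
                problems.append(f"{name}: value {value!r} outside declared range")
                break
    # Assumption: the mask has one entry per input token.
    if len(record.get("input_ids", [])) != len(record.get("attention_mask", [])):
        problems.append("input_ids and attention_mask differ in length")
    return problems


# Hypothetical record; the token ids are made up for illustration.
example = {
    "input_ids": [101, 2054, 102],
    "attention_mask": [1, 1, 1],
    "labels": [-100, 2054, 102],
}
print(check_record(example))  # prints []
```

The `-100` in the example labels follows the common ignore-index convention in training code. Whether this dataset uses it is an assumption.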

Data Splits

  • train:
    • Bytes: 1,004,301,613.693275
    • Samples: 443,918

Data Size

  • Download Size: 235,069,151 bytes
  • Dataset Size: 1,004,301,613.693275 bytes
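
The reported sizes give two quick derived figures: the average size of one example and how much smaller the download is than the prepared dataset. The sketch below only does arithmetic on the numbers listed above. The variable names are illustrative.

```python
# Figures copied from the Data Splits and Data Size sections.
DATASET_BYTES = 1_004_301_613.693275
DOWNLOAD_BYTES = 235_069_151
NUM_EXAMPLES = 443_918

# Average prepared size of a single example, in bytes.
avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES

# Fraction of the prepared size that has to be downloaded.
download_fraction = DOWNLOAD_BYTES / DATASET_BYTES

print(f"{avg_bytes_per_example:.0f} bytes per example on average")
print(f"download is {download_fraction:.1%} of the prepared size")
```

This works out to roughly 2.3 KB per example, with the download at about 23% of the prepared size.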

Configuration

  • config_name: default
    • data_files:
      • split: train
      • path: data/train-*
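
Given the default config and its single train split, loading might look like the sketch below. It assumes the Hugging Face `datasets` library is installed and that the repository is publicly reachable under the ID shown at the top of this page. The `load_train_split` wrapper is a hypothetical helper, not part of the dataset.

```python
REPO_ID = "abdiharyadi/eli5-id-preprocessed-tokenized-filtered"


def load_train_split(streaming=True):
    """Load the train split of the default config.

    With streaming=True, examples are fetched lazily instead of
    downloading all shards up front.
    """
    # Imported here so the module can be read without the dependency installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=streaming)
```

Calling `next(iter(load_train_split()))` should then yield one record with the `input_ids`, `attention_mask`, and `labels` keys. Streaming is chosen as the default here because the full download is about 235 MB.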