Dataset asset · Open Source Community · Gender Studies · NLP Bias Analysis

LabHC/bias_in_bios

The Bias in Bios dataset was created by De-Arteaga et al. in 2019 to study bias in NLP models. It contains textual biographies used for occupation prediction, with binary gender as the sensitive attribute. Ravfogel et al. introduced a slightly smaller version in 2020 due to the loss of 5,557 biographies. The dataset is split into a training set (257,478 samples), a test set (99,069 samples) and a development set (39,642 samples). Classification labels cover 28 occupations, each with a numerical label and a proportion. Sensitive attribute labels are Male (label 0, 53.9%) and Female (label 1, 46.1%).

Source: hugging_face
Created: Nov 28, 2025
Updated: Sep 10, 2023
Overview

Dataset description and usage context

Bias in Bios Dataset Overview

Basic Information

  • License: MIT
  • Task Category: Text Classification
  • Language: English

Dataset Features

  • Feature List:
    • hard_text: string, the biography text
    • profession: 64‑bit integer, occupation label
    • gender: 64‑bit integer, gender label
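
The three fields above can be written down as a typed record. The sketch below is illustrative only: `BioRecord`, `is_valid_record` and `load_bias_in_bios` are hypothetical helper names, and the split names passed to the loader (`"train"`, `"test"`, `"dev"`) are an assumption about how the Hub repository names its splits.

```python
from typing import TypedDict


class BioRecord(TypedDict):
    hard_text: str    # the biography text
    profession: int   # occupation label, 0 to 27
    gender: int       # gender label, 0 = Male, 1 = Female


def is_valid_record(row: dict) -> bool:
    """Check that a row has the three documented fields with plausible values."""
    return (
        isinstance(row.get("hard_text"), str)
        and isinstance(row.get("profession"), int)
        and 0 <= row["profession"] <= 27
        and row.get("gender") in (0, 1)
    )


def load_bias_in_bios(split: str = "train"):
    """Load one split from the Hugging Face Hub.

    Needs the `datasets` package and network access. The split name is an
    assumption; check the repository if "dev" is not accepted.
    """
    from datasets import load_dataset  # lazy import: the helpers above work without it

    return load_dataset("LabHC/bias_in_bios", split=split)
```

A typical call would be `train = load_bias_in_bios("train")`, followed by `is_valid_record(train[0])` as a quick sanity check on the first row.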

Dataset Splits

  • Training Set:
    • Bytes: 107487885
    • Samples: 257478
  • Test Set:
    • Bytes: 41312256
    • Samples: 99069
  • Development Set:
    • Bytes: 16504417
    • Samples: 39642

Dataset Size

  • Download Size: 99808338 bytes
  • Total Size: 165304558 bytes
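
The split and size figures above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch using the numbers listed on this page; the dictionary keys are shorthand chosen here, not official split names.

```python
# Figures copied from the Dataset Splits and Dataset Size sections.
SPLIT_SAMPLES = {"train": 257478, "test": 99069, "dev": 39642}
SPLIT_BYTES = {"train": 107487885, "test": 41312256, "dev": 16504417}
DOWNLOAD_SIZE = 99808338
TOTAL_SIZE = 165304558

total_samples = sum(SPLIT_SAMPLES.values())

# Share of each split in percent, rounded to one decimal place.
split_share = {
    name: round(100 * n / total_samples, 1) for name, n in SPLIT_SAMPLES.items()
}

# The per-split byte counts should add up to the reported total size.
bytes_match = sum(SPLIT_BYTES.values()) == TOTAL_SIZE

# Average stored size of one example, and download size relative to total size.
avg_bytes = {name: SPLIT_BYTES[name] / SPLIT_SAMPLES[name] for name in SPLIT_SAMPLES}
download_ratio = DOWNLOAD_SIZE / TOTAL_SIZE

print(total_samples, split_share, bytes_match)
```

The splits come out at roughly 65% / 25% / 10% of 396,189 samples, and the three byte counts sum exactly to the reported total size.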

Classification Labels

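| Occupation | Numerical Label | Proportion (%) |
|---|---|---|
| accountant | 0 | 1.42 |
| architect | 1 | 2.55 |
| attorney | 2 | 8.22 |
| chiropractor | 3 | 0.67 |
| comedian | 4 | 0.71 |
| composer | 5 | 1.41 |
| dentist | 6 | 3.68 |
| dietitian | 7 | 1.0 |
| dj | 8 | 0.38 |
| filmmaker | 9 | 1.77 |
| interior_designer | 10 | 0.37 |
| journalist | 11 | 5.03 |
| model | 12 | 1.89 |
| nurse | 13 | 4.78 |
| painter | 14 | 1.95 |
| paralegal | 15 | 0.45 |
| pastor | 16 | 0.64 |
| personal_trainer | 17 | 0.36 |
| photographer | 18 | 6.13 |
| physician | 19 | 10.35 |
| poet | 20 | 1.77 |
| professor | 21 | 29.8 |
| psychologist | 22 | 4.64 |
| rapper | 23 | 0.35 |
| software_engineer | 24 | 1.74 |
| surgeon | 25 | 3.43 |
| teacher | 26 | 4.09 |
| yoga_teacher | 27 | 0.42 |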
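
Because the occupation labels are plain integers and the classes are heavily imbalanced (professor is close to 30%, several occupations are under 0.5%), a lookup table and a set of class weights are often useful. The sketch below copies the names and proportions from the table above; the weighting rule (inverse frequency, the same idea as the "balanced" heuristic in scikit-learn) is an illustrative choice, not something this dataset prescribes.

```python
# Index in the list = numerical label from the table above.
PROFESSIONS = [
    "accountant", "architect", "attorney", "chiropractor", "comedian",
    "composer", "dentist", "dietitian", "dj", "filmmaker",
    "interior_designer", "journalist", "model", "nurse", "painter",
    "paralegal", "pastor", "personal_trainer", "photographer", "physician",
    "poet", "professor", "psychologist", "rapper", "software_engineer",
    "surgeon", "teacher", "yoga_teacher",
]

# Proportion of each occupation in percent, same order as PROFESSIONS.
PROPORTION_PCT = [
    1.42, 2.55, 8.22, 0.67, 0.71, 1.41, 3.68, 1.0, 0.38, 1.77,
    0.37, 5.03, 1.89, 4.78, 1.95, 0.45, 0.64, 0.36, 6.13, 10.35,
    1.77, 29.8, 4.64, 0.35, 1.74, 3.43, 4.09, 0.42,
]

# Reverse lookup: occupation name -> numerical label.
PROFESSION_TO_ID = {name: i for i, name in enumerate(PROFESSIONS)}

# Inverse-frequency weights: weight = 1 / (n_classes * class_fraction).
# A class at the uniform share (1/28) gets weight 1; rarer classes get more.
n_classes = len(PROFESSIONS)
CLASS_WEIGHTS = [100.0 / (n_classes * pct) for pct in PROPORTION_PCT]

print(PROFESSIONS[21], round(CLASS_WEIGHTS[21], 3))
```

Whether to reweight at all depends on the experiment: for bias analysis it can be preferable to keep the natural class distribution and report per-occupation metrics instead.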

Sensitive Attribute

| Gender | Numerical Label | Proportion (%) |
|---|---|---|
| Male | 0 | 53.9 |
| Female | 1 | 46.1 |
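
With occupation as the target and gender as the sensitive attribute, one common way to quantify bias in a classifier is the per-occupation gap in true positive rate between the two gender groups. The sketch below is a minimal pure-Python version under that assumption; `tpr_gender_gap` is a hypothetical helper, and the tiny input lists are made-up toy values, not data from the dataset.

```python
from collections import defaultdict


def tpr_gender_gap(y_true, y_pred, gender):
    """Return {occupation: TPR(female) - TPR(male)} for each occupation label.

    gender uses the encoding from the table above: 0 = Male, 1 = Female.
    Occupations that lack examples for one of the genders are skipped.
    """
    totals = defaultdict(lambda: [0, 0])  # occupation -> example counts [male, female]
    hits = defaultdict(lambda: [0, 0])    # occupation -> correct predictions [male, female]
    for t, p, g in zip(y_true, y_pred, gender):
        totals[t][g] += 1
        hits[t][g] += int(t == p)

    gaps = {}
    for occ, (n_male, n_female) in totals.items():
        if n_male == 0 or n_female == 0:
            continue  # the gap is undefined without both groups
        gaps[occ] = hits[occ][1] / n_female - hits[occ][0] / n_male
    return gaps


# Toy example: labels 13 (nurse) and 25 (surgeon), two people per gender each.
y_true = [13, 13, 13, 13, 25, 25, 25, 25]
gender = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [13, 13, 13, 21, 25, 21, 25, 25]
gaps = tpr_gender_gap(y_true, y_pred, gender)
print(gaps)
```

In this toy case the gap is positive for nurse (the classifier is more accurate on women) and negative for surgeon (more accurate on men); a gap near zero for every occupation would indicate similar recall across the two groups.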