LabHC/bias_in_bios

The Bias in Bios dataset was created by De-Arteaga et al. in 2019 to study bias in NLP models. It contains textual biographies used for occupation prediction, with binary gender as the sensitive attribute. In 2020, Ravfogel et al. introduced a slightly smaller version, after 5,557 biographies were lost. The dataset is split into a training set (257,478 samples), a test set (99,069 samples) and a development set (39,642 samples). The classification labels cover 28 occupations, each with a numerical label and a proportion. The sensitive attribute labels are Male (label 0, 53.9%) and Female (label 1, 46.1%).

Source

Hugging Face

Created

Nov 28, 2025

Updated

Sep 10, 2023

Overview

Dataset description and usage context

Bias in Bios Dataset Overview

Basic Information

License: MIT
Task Category: Text Classification
Language: English

Dataset Features

Feature List:
- hard_text: string, the biography text
- profession: 64‑bit integer, occupation label
- gender: 64‑bit integer, gender label
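
The feature list above can be turned into a small schema check. This is a minimal sketch, assuming the data is fetched with the Hugging Face `datasets` library under the identifier `LabHC/bias_in_bios`. The split names `train`, `test` and `dev` are an assumption not stated on this page, and the `check_record` and `load_split` helpers and the sample record are hypothetical.

```python
# Expected schema, taken from the feature list: field name -> Python type.
EXPECTED_FEATURES = {"hard_text": str, "profession": int, "gender": int}


def check_record(record):
    """Return True if the record has exactly the expected fields and types."""
    if set(record) != set(EXPECTED_FEATURES):
        return False
    return all(
        isinstance(record[name], kind) for name, kind in EXPECTED_FEATURES.items()
    )


def load_split(split="train"):
    """Load one split from the Hugging Face Hub.

    Requires `pip install datasets` and network access. The split names
    "train", "test" and "dev" are assumed, not confirmed by this page.
    """
    from datasets import load_dataset  # lazy import: helpers above work offline

    return load_dataset("LabHC/bias_in_bios", split=split)


# Hypothetical record for illustration only (not taken from the dataset).
example = {
    "hard_text": "She teaches mathematics at a public school.",
    "profession": 26,
    "gender": 1,
}
print(check_record(example))
```

Calling `load_split("train")[0]` and passing the result to `check_record` would be one way to confirm that the downloaded data matches the listed features.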

Dataset Splits

Training Set:
- Bytes: 107487885
- Samples: 257478
Test Set:
- Bytes: 41312256
- Samples: 99069
Development Set:
- Bytes: 16504417
- Samples: 39642

Dataset Size

Download Size: 99808338 bytes
Total Size: 165304558 bytes
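
The split figures above can be cross-checked with a few lines of arithmetic. The sketch below copies the sample and byte counts from the listing, then computes the total sample count and each split's share. The dictionary keys (`train`, `test`, `dev`) and the `split_share` helper are illustrative names, not part of the dataset.

```python
# Sample and byte counts copied from the split listing above.
SPLITS = {
    "train": {"samples": 257478, "bytes": 107487885},
    "test": {"samples": 99069, "bytes": 41312256},
    "dev": {"samples": 39642, "bytes": 16504417},
}

total_samples = sum(s["samples"] for s in SPLITS.values())
total_bytes = sum(s["bytes"] for s in SPLITS.values())


def split_share(name):
    """Percentage of all samples that fall in the named split."""
    return 100.0 * SPLITS[name]["samples"] / total_samples


print(total_samples)  # 396189
print(total_bytes)    # 165304558, equal to the listed Total Size
for name in SPLITS:
    print(f"{name}: {split_share(name):.2f}%")
```

The per-split byte counts add up to the listed Total Size, and the split is roughly 65% training, 25% test and 10% development.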

Classification Labels

Occupation	Numerical Label	Proportion (%)
accountant	0	1.42
architect	1	2.55
attorney	2	8.22
chiropractor	3	0.67
comedian	4	0.71
composer	5	1.41
dentist	6	3.68
dietitian	7	1.00
dj	8	0.38
filmmaker	9	1.77
interior_designer	10	0.37
journalist	11	5.03
model	12	1.89
nurse	13	4.78
painter	14	1.95
paralegal	15	0.45
pastor	16	0.64
personal_trainer	17	0.36
photographer	18	6.13
physician	19	10.35
poet	20	1.77
professor	21	29.80
psychologist	22	4.64
rapper	23	0.35
software_engineer	24	1.74
surgeon	25	3.43
teacher	26	4.09
yoga_teacher	27	0.42
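
The table above maps directly onto a lookup structure for decoding the `profession` field. The following sketch copies the names and proportions from the table. The constant and function names are illustrative choices, not part of the dataset.

```python
# Occupation names in label order, copied from the table above
# (list index == numerical label).
OCCUPATIONS = [
    "accountant", "architect", "attorney", "chiropractor", "comedian", "composer",
    "dentist", "dietitian", "dj", "filmmaker", "interior_designer", "journalist",
    "model", "nurse", "painter", "paralegal", "pastor", "personal_trainer",
    "photographer", "physician", "poet", "professor", "psychologist", "rapper",
    "software_engineer", "surgeon", "teacher", "yoga_teacher",
]

# Proportions in percent, in the same order as OCCUPATIONS.
PROPORTIONS = [
    1.42, 2.55, 8.22, 0.67, 0.71, 1.41,
    3.68, 1.0, 0.38, 1.77, 0.37, 5.03,
    1.89, 4.78, 1.95, 0.45, 0.64, 0.36,
    6.13, 10.35, 1.77, 29.8, 4.64, 0.35,
    1.74, 3.43, 4.09, 0.42,
]

NAME_TO_ID = {name: label for label, name in enumerate(OCCUPATIONS)}


def occupation_name(label):
    """Decode a numerical profession label into its occupation name."""
    return OCCUPATIONS[label]


# Index of the most frequent occupation according to the table.
majority_label = max(range(len(OCCUPATIONS)), key=lambda i: PROPORTIONS[i])
print(occupation_name(majority_label), PROPORTIONS[majority_label])
```

The classes are heavily imbalanced: a classifier that always predicts the most frequent occupation would reach about 29.8% accuracy, assuming the listed proportions hold for the split being evaluated. Macro-averaged metrics are therefore often a more informative complement to plain accuracy.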

Sensitive Attribute

Gender	Numerical Label	Proportion (%)
Male	0	53.9
Female	1	46.1
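
Because gender is encoded as 0 or 1, the share of female biographies within each occupation is simply the mean of the `gender` field per `profession` label. The sketch below shows one way to compute it. The function name and the toy records are hypothetical and are not drawn from the dataset.

```python
from collections import Counter

# Gender encoding copied from the table above.
GENDER_NAMES = {0: "Male", 1: "Female"}


def female_share_by_profession(records):
    """Map each profession label to the fraction of its records with gender == 1."""
    totals, females = Counter(), Counter()
    for rec in records:
        totals[rec["profession"]] += 1
        females[rec["profession"]] += rec["gender"]  # gender is 0 or 1
    return {label: females[label] / totals[label] for label in totals}


# Hypothetical toy records for illustration only.
toy = [
    {"profession": 13, "gender": 1},
    {"profession": 13, "gender": 1},
    {"profession": 13, "gender": 0},
    {"profession": 25, "gender": 0},
]
shares = female_share_by_profession(toy)
for label, share in sorted(shares.items()):
    print(label, f"{share:.2f}")
```

Applied to a real split, occupations whose share lies far from the overall 46.1% are the ones where a classifier is most likely to pick up gender as a shortcut.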
