LabHC/bias_in_bios

The Bias in Bios dataset was created by De-Arteaga et al. in 2019 to study bias in NLP models. It contains textual biographies for occupation prediction, with binary gender as the sensitive attribute. Ravfogel et al. introduced a slightly smaller version in 2020 because 5,557 biographies had been lost. The dataset is split into a training set (257,478 samples), a test set (99,069 samples) and a development set (39,642 samples). Classification labels comprise 28 occupations, each listed with its numerical label and its proportion of the data. Sensitive attribute labels are Male (label 0, 53.9%) and Female (label 1, 46.1%).

Updated 9/10/2023
hugging_face

Description

Bias in Bios Dataset Overview

Basic Information

  • License: MIT
  • Task Category: Text Classification
  • Language: English

Dataset Features

  • Feature List:
    • hard_text: string, the biography text
    • profession: 64‑bit integer, occupation label
    • gender: 64‑bit integer, gender label
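
The three fields above can be checked against a record before it is used for training. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository id is `LabHC/bias_in_bios` as in the page title. The helper names `load_bias_in_bios` and `validate_record` are hypothetical, the loader is only defined and never called here, and the example record is made up for illustration.

```python
from typing import Any, Dict

# Expected Python types for each feature listed above
# (64-bit integers arrive as plain Python ints).
EXPECTED_FIELDS = {"hard_text": str, "profession": int, "gender": int}


def load_bias_in_bios(split: str = "train"):
    """Load one split from the Hugging Face Hub (needs network access).

    Assumption: the `datasets` package is installed and the repository id
    matches the page title. The split name is also an assumption.
    """
    from datasets import load_dataset  # lazy import so the rest runs offline

    return load_dataset("LabHC/bias_in_bios", split=split)


def validate_record(record: Dict[str, Any]) -> bool:
    """Return True if the record has exactly the expected fields and types."""
    if set(record) != set(EXPECTED_FIELDS):
        return False
    return all(isinstance(record[name], typ) for name, typ in EXPECTED_FIELDS.items())


# Hypothetical record, invented for illustration (not taken from the dataset).
example = {
    "hard_text": "She teaches physics at a university.",
    "profession": 21,
    "gender": 1,
}
print(validate_record(example))
```

Keeping the network call inside a function means the schema check can be run and tested offline.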

Dataset Splits

  • Training Set:
    • Bytes: 107487885
    • Samples: 257478
  • Test Set:
    • Bytes: 41312256
    • Samples: 99069
  • Development Set:
    • Bytes: 16504417
    • Samples: 39642

Dataset Size

  • Download Size: 99808338 bytes
  • Total Size: 165304558 bytes
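
The split and size figures above can be cross-checked with a little arithmetic. This is a minimal sketch using only the numbers listed on this page; the dictionary keys `train`, `test` and `dev` are labels chosen for illustration and are not confirmed as the dataset's split names.

```python
# Byte and sample counts copied from the lists above.
SPLITS = {
    "train": {"bytes": 107487885, "samples": 257478},
    "test": {"bytes": 41312256, "samples": 99069},
    "dev": {"bytes": 16504417, "samples": 39642},
}
LISTED_TOTAL_BYTES = 165304558
LISTED_DOWNLOAD_BYTES = 99808338

total_samples = sum(info["samples"] for info in SPLITS.values())
total_bytes = sum(info["bytes"] for info in SPLITS.values())

# Share of all samples held by each split.
for name, info in SPLITS.items():
    share = 100 * info["samples"] / total_samples
    print(f"{name}: {info['samples']} samples ({share:.1f}%)")

# The per-split byte counts should add up to the listed total size.
print("bytes match listed total:", total_bytes == LISTED_TOTAL_BYTES)
print(f"download / total size: {LISTED_DOWNLOAD_BYTES / LISTED_TOTAL_BYTES:.2f}")
```

With these figures the splits come out at roughly 65% training, 25% test and 10% development, and the three byte counts sum to the listed total size.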

Classification Labels

Occupation | Numerical Label | Proportion (%)
accountant | 0 | 1.42
architect | 1 | 2.55
attorney | 2 | 8.22
chiropractor | 3 | 0.67
comedian | 4 | 0.71
composer | 5 | 1.41
dentist | 6 | 3.68
dietitian | 7 | 1.0
dj | 8 | 0.38
filmmaker | 9 | 1.77
interior_designer | 10 | 0.37
journalist | 11 | 5.03
model | 12 | 1.89
nurse | 13 | 4.78
painter | 14 | 1.95
paralegal | 15 | 0.45
pastor | 16 | 0.64
personal_trainer | 17 | 0.36
photographer | 18 | 6.13
physician | 19 | 10.35
poet | 20 | 1.77
professor | 21 | 29.8
psychologist | 22 | 4.64
rapper | 23 | 0.35
software_engineer | 24 | 1.74
surgeon | 25 | 3.43
teacher | 26 | 4.09
yoga_teacher | 27 | 0.42
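
The occupation labels above can be turned into a lookup for decoding model predictions. This is a minimal sketch built only from the table on this page; the names `OCCUPATIONS`, `OCCUPATION_TO_ID` and `decode_profession` are hypothetical helpers, not part of the dataset.

```python
# Occupation names in label order: the list index is the numerical label.
OCCUPATIONS = [
    "accountant", "architect", "attorney", "chiropractor", "comedian",
    "composer", "dentist", "dietitian", "dj", "filmmaker",
    "interior_designer", "journalist", "model", "nurse", "painter",
    "paralegal", "pastor", "personal_trainer", "photographer", "physician",
    "poet", "professor", "psychologist", "rapper", "software_engineer",
    "surgeon", "teacher", "yoga_teacher",
]

# Reverse mapping from occupation name to numerical label.
OCCUPATION_TO_ID = {name: label for label, name in enumerate(OCCUPATIONS)}


def decode_profession(label: int) -> str:
    """Map a numerical `profession` label back to its occupation name."""
    if not 0 <= label < len(OCCUPATIONS):
        raise ValueError(f"unknown profession label: {label}")
    return OCCUPATIONS[label]


print(decode_profession(21))             # professor
print(OCCUPATION_TO_ID["yoga_teacher"])  # 27
```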

Sensitive Attribute

Gender | Numerical Label | Proportion (%)
Male | 0 | 53.9
Female | 1 | 46.1
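
Proportions like those above can be recomputed from a column of `gender` labels. This is a minimal sketch: the helper `gender_proportions` is hypothetical, and the toy label list is invented for illustration rather than drawn from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable

# Label meanings as listed above.
GENDER_NAMES = {0: "Male", 1: "Female"}


def gender_proportions(labels: Iterable[int]) -> Dict[str, float]:
    """Return the percentage of each gender label in `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels given")
    return {name: 100 * counts.get(label, 0) / total
            for label, name in GENDER_NAMES.items()}


# Toy input, invented for illustration: three label-0 rows, two label-1 rows.
toy_labels = [0, 0, 1, 0, 1]
print(gender_proportions(toy_labels))  # {'Male': 60.0, 'Female': 40.0}
```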


Topics

NLP Bias Analysis
Gender Studies

Source

Organization: hugging_face

Created: Unknown
