
Adult Data Set

This is a widely used dataset for predicting whether income exceeds $50K per year based on provided census data.

Source: GitHub
Created: Nov 27, 2018
Updated: Jan 31, 2019
Dataset Overview

Data Source and Purpose

  • Source: UCI Machine Learning Repository
  • Name: Adult Data Set
  • Purpose: Predict whether an individual's annual income exceeds $50K/year based on provided census data.

Data Processing Steps

  1. Data Import: Use the pandas read_csv function to read the "adult.data.txt" file.
  2. Feature Labels: Assign labels to the file’s features, including Age, Workclass, fnlwgt, Education, Education_Num, Martial_Status, Occupation, Relationship, Race, Sex, Capital_Gain, Capital_Loss, Hours_per_week, Country, Target.
  3. String Data Conversion: Convert string variables such as Sex to numeric values (e.g., Female → 0, Male → 1).
  4. Missing Value Handling: Use the pandas fillna method to fill NaN values.
  5. Feature and Target Selection: Select the feature matrix and the target via data[features].values and data["Target"], using the column labels assigned in step 2.
  6. Dataset Size Display: Show the number of records with X.shape[0].
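
The steps above can be sketched roughly as follows. This is a minimal illustration, not the original notebook: the helper name load_adult, the choice of feature columns, the "Unknown" fill value, the treatment of "?" as a missing-value marker, and the numeric encoding of Target are all assumptions, and the three sample rows are made up to mimic the file's layout. Column labels are reproduced as listed in step 2.

```python
import io

import pandas as pd

# Column labels as listed in step 2 (spelling kept as given there).
COLUMNS = [
    "Age", "Workclass", "fnlwgt", "Education", "Education_Num",
    "Martial_Status", "Occupation", "Relationship", "Race", "Sex",
    "Capital_Gain", "Capital_Loss", "Hours_per_week", "Country", "Target",
]

# Made-up rows in the same comma-plus-space layout as the raw file.
SAMPLE = """\
41, Private, 120000, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K
29, ?, 98000, Bachelors, 13, Never-married, ?, Not-in-family, Black, Female, 0, 0, 38, United-States, <=50K
53, Self-emp-inc, 150000, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Female, 5000, 0, 50, ?, >50K
"""


def load_adult(source):
    """Read the raw file, label the columns, and encode Sex and Target."""
    data = pd.read_csv(
        source,
        header=None,            # the raw file has no header row
        names=COLUMNS,          # step 2: assign feature labels
        skipinitialspace=True,  # fields are separated by ", "
        na_values="?",          # assumption: "?" marks a missing value
    )
    # Step 3: map string categories to integers.
    data["Sex"] = data["Sex"].map({"Female": 0, "Male": 1})
    data["Target"] = data["Target"].map({"<=50K": 0, ">50K": 1})
    # Step 4: fill missing values (the fill value is an assumption).
    for col in ["Workclass", "Occupation", "Country"]:
        data[col] = data[col].fillna("Unknown")
    return data


# Pass "adult.data.txt" instead of the in-memory sample to read the real file.
data = load_adult(io.StringIO(SAMPLE))

# Step 5: the feature subset is illustrative, not the source's selection.
features = ["Age", "Education_Num", "Sex", "Capital_Gain",
            "Capital_Loss", "Hours_per_week"]
X = data[features].values
y = data["Target"].values

# Step 6: number of records.
print(X.shape[0])  # 3 for the sample above
```

Only numeric columns are selected here so that X can be passed directly to a scikit-learn estimator; the remaining string columns would need their own encoding before use.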

Model Training and Evaluation

  • Initial Model: Train a logistic regression model, adjusting hyperparameters such as C, the inverse of the regularization strength.
  • Optimization Algorithm: Apply GridSearchCV for hyperparameter tuning, testing different penalty (l1, l2) and C values (0.01, 0.1, 1, 10, 100, 1000) to find the optimal model.
  • Model Comparison: Compare traditional logistic regression, GridSearchCV‑optimized logistic regression, and k‑nearest neighbors algorithm.
  • Evaluation Results: Present a table of precision, recall, f1‑score, and support for each model, e.g., GridSearchCV‑optimized logistic regression results:
    • Under 50k: precision=0.87, recall=0.93, f1-score=0.90, support=7407
    • Over 50k: precision=0.72, recall=0.58, f1-score=0.64, support=2362
    • Average/Total: precision=0.84, recall=0.85, f1-score=0.84, support=9769
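
One way to set up the comparison described above is sketched here with scikit-learn. It runs on synthetic data from make_classification rather than the census file, so its numbers will not match the reported table. The liblinear solver (chosen because it supports both l1 and l2 penalties), the 5-fold cross-validation, the 70/30 split, and n_neighbors=5 are assumptions; only the penalty and C grid comes from the text. Recent scikit-learn releases may emit a deprecation warning for the penalty parameter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed census features (not the real data).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Grid from the text: both penalties and six C values.
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100, 1000]}

models = {
    "logistic regression": LogisticRegression(C=1.0, solver="liblinear"),
    "grid-searched logistic regression": GridSearchCV(
        LogisticRegression(solver="liblinear"),  # supports l1 and l2
        param_grid,
        cv=5,  # assumption: the source does not state the number of folds
    ),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Per-class precision, recall, f1-score, and support.
    reports[name] = classification_report(
        y_test, model.predict(X_test),
        target_names=["Under 50k", "Over 50k"],
    )
    print(name)
    print(reports[name])

# Parameter combination selected by the grid search.
best_params = models["grid-searched logistic regression"].best_params_
print(best_params)
```

GridSearchCV refits the best parameter combination on the full training split by default, so the fitted search object can be used for prediction like any other estimator.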

Conclusion

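After these preprocessing steps, the dataset is used to train and evaluate logistic regression and k-nearest neighbors models. According to the source, tuning the penalty and C parameters with GridSearchCV improves performance over the initial logistic regression: the tuned model reaches an average f1-score of 0.84, with recall of 0.93 for the under-50K class and 0.58 for the over-50K class.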
