Adult Data Set

Dataset Overview

Data Source and Purpose

Source: UCI Machine Learning Repository
Name: Adult Data Set
Purpose: Predict whether an individual's annual income exceeds $50K/year based on provided census data.

Data Processing Steps

Data Import: Use the pandas read_csv function to read the "adult.data.txt" file.
Feature Labels: Assign labels to the file’s features, including Age, Workclass, fnlwgt, Education, Education_Num, Martial_Status, Occupation, Relationship, Race, Sex, Capital_Gain, Capital_Loss, Hours_per_week, Country, Target.
String Data Conversion: Convert string variables such as Sex to numeric values (e.g., Female → 0, Male → 1).
Missing Value Handling: Use the pandas fillna function to fill NaN values.
Feature and Target Selection: Select the feature matrix and target via data[features].values and data["Target"].
Dataset Size Display: Show the number of records with X.shape[0].
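
The steps above can be sketched roughly as follows. This is a minimal illustration, not the original notebook: the sample rows are made up so the code runs without the real file, and the "?" missing-value marker, the fill values, the Target encoding, and the contents of the features list are all assumptions.

```python
import io

import pandas as pd

# Column labels as listed above (spelling kept as in the source).
COLUMNS = [
    "Age", "Workclass", "fnlwgt", "Education", "Education_Num",
    "Martial_Status", "Occupation", "Relationship", "Race", "Sex",
    "Capital_Gain", "Capital_Loss", "Hours_per_week", "Country", "Target",
]

# Made-up rows in a "value, value, ..." layout, used only so the sketch is runnable.
SAMPLE = """\
39, Private, 100000, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
45, ?, 200000, HS-grad, 9, Married-civ-spouse, ?, Wife, Black, Female, 0, 0, 35, United-States, >50K
28, Self-emp-inc, 150000, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Male, 5000, 0, 50, ?, >50K
"""


def load_adult(source):
    """Read the raw data (a path or file-like object) and preprocess it."""
    data = pd.read_csv(
        source,
        names=COLUMNS,
        skipinitialspace=True,  # drop the space that follows each comma
        na_values="?",          # assumption: "?" marks a missing value
    )
    # String-to-number conversion (the Target encoding is an assumption).
    data["Sex"] = data["Sex"].map({"Female": 0, "Male": 1})
    data["Target"] = data["Target"].map({"<=50K": 0, ">50K": 1})
    # Fill NaN values; both fill strategies below are assumptions.
    numeric = data.select_dtypes("number").columns
    other = data.columns.difference(numeric)
    data[numeric] = data[numeric].fillna(data[numeric].median())
    data[other] = data[other].fillna("Unknown")
    return data


# With the real file this would be load_adult("adult.data.txt").
data = load_adult(io.StringIO(SAMPLE))

# Hypothetical feature choice; the source does not list which columns it uses.
features = ["Age", "Education_Num", "Sex", "Capital_Gain", "Capital_Loss", "Hours_per_week"]
X = data[features].values
y = data["Target"].values

print("Number of records:", X.shape[0])
```

Passing names to read_csv makes pandas treat the first line as data rather than as a header row, which suits a file that ships without column names.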

Model Training and Evaluation

Initial Model: Train a logistic regression model, adjusting hyperparameters such as the C value.
Optimization Algorithm: Apply GridSearchCV for hyperparameter tuning, testing different penalty (l1, l2) and C values (0.01, 0.1, 1, 10, 100, 1000) to find the optimal model.
Model Comparison: Compare traditional logistic regression, GridSearchCV‑optimized logistic regression, and k‑nearest neighbors algorithm.
Evaluation Results: Present a table of precision, recall, f1‑score, and support for each model, e.g., GridSearchCV‑optimized logistic regression results:
- Under 50k: precision=0.87, recall=0.93, f1-score=0.90, support=7407
- Over 50k: precision=0.72, recall=0.58, f1-score=0.64, support=2362
- Average/Total: precision=0.84, recall=0.85, f1-score=0.84, support=9769
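
A sketch of how the three models could be trained and compared with scikit-learn is shown below. It runs on synthetic stand-in data from make_classification, so its numbers will not match the results above. The 70/30 split, the liblinear solver (chosen here because it supports both the l1 and l2 penalties), cv=5, and n_neighbors=5 are assumptions; only the penalty and C grids come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with the real feature matrix X and target y.
X, y = make_classification(
    n_samples=600, n_features=6, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0  # split ratio is an assumption
)

# Hyperparameter grid taken from the text.
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100, 1000]}

models = {
    "logistic regression": LogisticRegression(C=1.0, solver="liblinear"),
    "logistic regression + GridSearchCV": GridSearchCV(
        LogisticRegression(solver="liblinear"), param_grid, cv=5
    ),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model and collect a precision/recall/f1-score/support report.
reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    reports[name] = classification_report(
        y_test, model.predict(X_test), target_names=["Under 50k", "Over 50k"]
    )
    print(name)
    print(reports[name])

# The grid search keeps the best combination found by cross-validation.
best_params = models["logistic regression + GridSearchCV"].best_params_
print("Best parameters:", best_params)
```

After fitting, GridSearchCV refits the best combination on the full training set by default, so calling predict on it uses the tuned model directly.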

Conclusion

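After preprocessing, the dataset is used to train and evaluate logistic regression and k-nearest neighbors models. Tuning the logistic regression hyperparameters (penalty and C) with GridSearchCV is reported to improve performance over the initial model, with the tuned model reaching an average f1-score of 0.84 on 9769 test records.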
