Adult Data Set
This is a widely used dataset for predicting whether income exceeds $50K per year based on provided census data.
- Source: GitHub
- Created: Nov 27, 2018
- Updated: Jan 31, 2019
Dataset Overview
Data Source and Purpose
- Source: UCI Machine Learning Repository
- Name: Adult Data Set
- Purpose: Predict whether an individual's annual income exceeds $50K/year based on provided census data.
Data Processing Steps
- Data Import: Use the pandas `read_csv` function to read the "adult.data.txt" file.
- Feature Labels: Assign labels to the file’s features, including Age, Workclass, fnlwgt, Education, Education_Num, Martial_Status, Occupation, Relationship, Race, Sex, Capital_Gain, Capital_Loss, Hours_per_week, Country, Target.
- String Data Conversion: Convert string variables such as gender to numeric values (e.g., Female → 0, Male → 1).
- Missing Value Handling: Use the `fillna` function to fill NaN values.
- Feature and Target Selection: Select features and target via `data[features].values` and `data["target"]`.
- Dataset Size Display: Show the number of records with `X.shape[0]`.
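The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the original notebook: the four sample rows are made up to mimic the file layout, the use of "?" as the missing-value marker, the fill strategy (median for numeric columns, "Unknown" for text columns), the chosen `features` list, and the helper name `load_adult` are all assumptions. The target column is addressed by its assigned label, Target.

```python
import io

import pandas as pd

# Feature labels as listed in the description above.
COLUMNS = [
    "Age", "Workclass", "fnlwgt", "Education", "Education_Num",
    "Martial_Status", "Occupation", "Relationship", "Race", "Sex",
    "Capital_Gain", "Capital_Loss", "Hours_per_week", "Country", "Target",
]

# Made-up rows in the comma-separated layout assumed for "adult.data.txt".
# "?" is assumed to mark a missing value.
SAMPLE_TEXT = """\
41, Private, 120000, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
35, ?, 98000, HS-grad, 9, Divorced, ?, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K
52, Self-emp-inc, 210000, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 50, United-States, >50K
29, Private, 150000, Some-college, 10, Married-civ-spouse, Tech-support, Wife, Asian-Pac-Islander, Female, 0, 0, 38, ?, >50K
"""


def load_adult(source):
    """Hypothetical helper: read an adult.data-style file and clean it."""
    data = pd.read_csv(
        source, names=COLUMNS, skipinitialspace=True, na_values="?"
    )
    # String-to-numeric conversion (Female -> 0, Male -> 1).
    data["Sex"] = data["Sex"].map({"Female": 0, "Male": 1})
    data["Target"] = data["Target"].map({"<=50K": 0, ">50K": 1})
    # Fill NaN values: median for numeric columns, "Unknown" for the rest.
    numeric = data.select_dtypes(include="number").columns
    data[numeric] = data[numeric].fillna(data[numeric].median())
    other = data.columns.difference(numeric)
    data[other] = data[other].fillna("Unknown")
    return data


data = load_adult(io.StringIO(SAMPLE_TEXT))

# Illustrative feature subset (an assumption, not the source's exact list).
features = ["Age", "Education_Num", "Sex", "Capital_Gain",
            "Capital_Loss", "Hours_per_week"]
X = data[features].values
y = data["Target"].values

print(X.shape[0])  # number of records
```

To run this against the real file, pass its path to `load_adult` in place of the in-memory sample.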
Model Training and Evaluation
- Initial Model: Train a logistic regression model, adjusting hyperparameters such as the `C` value.
- Optimization Algorithm: Apply GridSearchCV for hyperparameter tuning, testing different `penalty` (l1, l2) and `C` values (0.01, 0.1, 1, 10, 100, 1000) to find the optimal model.
- Model Comparison: Compare traditional logistic regression, GridSearchCV-optimized logistic regression, and the k-nearest neighbors algorithm.
- Evaluation Results: Present a table of precision, recall, f1-score, and support for each model, e.g., GridSearchCV-optimized logistic regression results:
  - Under 50k: precision=0.87, recall=0.93, f1-score=0.90, support=7407
  - Over 50k: precision=0.72, recall=0.58, f1-score=0.64, support=2362
  - Average/Total: precision=0.84, recall=0.85, f1-score=0.84, support=9769
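The training and comparison workflow described above might look something like the sketch below, using scikit-learn. It is an illustration under assumptions rather than the source's code: synthetic data from `make_classification` stands in for the preprocessed census features, and the 30% test split, 5-fold cross-validation, `liblinear` solver (chosen because it accepts both l1 and l2 penalties), and default k-nearest neighbors settings are all choices made here. Its printed metrics will not match the figures listed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the preprocessed data (assumption).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Grid from the description: penalty in (l1, l2), C from 0.01 to 1000.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1, 10, 100, 1000],
}

models = {
    "logistic regression": LogisticRegression(solver="liblinear"),
    "grid-searched logistic regression": GridSearchCV(
        LogisticRegression(solver="liblinear"), param_grid, cv=5
    ),
    "k-nearest neighbors": KNeighborsClassifier(),
}

# Fit each model and collect a precision/recall/f1/support report.
reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    reports[name] = classification_report(
        y_test, model.predict(X_test),
        target_names=["Under 50k", "Over 50k"],
    )
    print(name)
    print(reports[name])

best_params = models["grid-searched logistic regression"].best_params_
print("best parameters:", best_params)
```

`GridSearchCV` refits the best parameter combination on the full training split by default, so the fitted search object can be used for prediction like any other estimator.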
Conclusion
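After preprocessing, the dataset is used to train and evaluate logistic regression and k-nearest neighbors models. According to the source, tuning the penalty and C parameters with GridSearchCV improves on the initial logistic regression model, and the tuned model reaches an average f1-score of 0.84. Metrics for the other models are not listed here.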