
Adult Data Set

This is a widely used dataset for predicting whether income exceeds $50K per year based on provided census data.

Updated 1/31/2019
github

Description

Dataset Overview

Data Source and Purpose

  • Source: UCI Machine Learning Repository
  • Name: Adult Data Set
  • Purpose: Predict whether an individual's annual income exceeds $50K/year based on provided census data.

Data Processing Steps

  1. Data Import: Use the pandas read_csv function to read the "adult.data.txt" file.
  2. Feature Labels: Assign labels to the file’s features, including Age, Workclass, fnlwgt, Education, Education_Num, Martial_Status, Occupation, Relationship, Race, Sex, Capital_Gain, Capital_Loss, Hours_per_week, Country, Target.
  3. String Data Conversion: Convert string variables such as Sex to numeric values (e.g., Female → 0, Male → 1).
  4. Missing Value Handling: Use the fillna function to fill NaN values.
  5. Feature and Target Selection: Select features and target via data[features].values and data["target"].
  6. Dataset Size Display: Show the number of records with X.shape[0].
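
The steps above can be sketched roughly as follows. This is a minimal illustration, not the original notebook: the three sample rows are illustrative stand-ins for "adult.data.txt", and the "?" missing-value marker, the fill values (median for numeric columns, "Unknown" for text columns), and the chosen feature subset are all assumptions, since the description does not specify them. The sketch uses the "Target" label from the feature list for the target column.

```python
import io
import pandas as pd

# Column labels as listed in the steps above (spelling kept as in the source).
COLUMNS = [
    "Age", "Workclass", "fnlwgt", "Education", "Education_Num",
    "Martial_Status", "Occupation", "Relationship", "Race", "Sex",
    "Capital_Gain", "Capital_Loss", "Hours_per_week", "Country", "Target",
]

# Illustrative rows in the file's comma-separated layout (assumption);
# "?" stands for a missing value.
SAMPLE = """\
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
52, ?, 120000, HS-grad, 9, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 20, United-States, >50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
"""


def load_adult(source):
    """Steps 1-2: read the file and assign the feature labels."""
    return pd.read_csv(
        source,
        names=COLUMNS,
        skipinitialspace=True,  # fields are separated by ", "
        na_values="?",          # assumption: "?" marks missing values
    )


def preprocess(data):
    """Steps 3-4: encode strings as numbers and fill NaN values."""
    data = data.copy()
    data["Sex"] = data["Sex"].map({"Female": 0, "Male": 1})
    data["Target"] = data["Target"].map({"<=50K": 0, ">50K": 1})
    numeric = data.select_dtypes("number").columns
    text = data.columns.difference(numeric)
    # Fill strategy is an assumption: median for numbers, "Unknown" for text.
    data[numeric] = data[numeric].fillna(data[numeric].median())
    data[text] = data[text].fillna("Unknown")
    return data


# Pass the path to "adult.data.txt" instead of the in-memory sample to use the real file.
data = preprocess(load_adult(io.StringIO(SAMPLE)))

# Step 5: select features and target (this feature subset is hypothetical).
features = ["Age", "Education_Num", "Sex", "Capital_Gain", "Capital_Loss", "Hours_per_week"]
X = data[features].values
y = data["Target"].values

# Step 6: number of records.
print(X.shape[0])
```

Only numeric columns are selected as features here; including text columns such as Workclass or Occupation would require an additional encoding step that the description does not cover.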

Model Training and Evaluation

  • Initial Model: Train a logistic regression model, adjusting hyperparameters such as the C value.
  • Optimization Algorithm: Apply GridSearchCV for hyperparameter tuning, testing different penalty (l1, l2) and C values (0.01, 0.1, 1, 10, 100, 1000) to find the optimal model.
  • Model Comparison: Compare traditional logistic regression, GridSearchCV‑optimized logistic regression, and k‑nearest neighbors algorithm.
  • Evaluation Results: Present a table of precision, recall, f1‑score, and support for each model, e.g., GridSearchCV‑optimized logistic regression results:
    • Under 50k: precision=0.87, recall=0.93, f1-score=0.90, support=7407
    • Over 50k: precision=0.72, recall=0.58, f1-score=0.64, support=2362
    • Average/Total: precision=0.84, recall=0.85, f1-score=0.84, support=9769
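
The comparison described above might look something like the sketch below. It is an illustration under stated assumptions rather than the original code: synthetic data from make_classification stands in for the preprocessed Adult features, and the liblinear solver (chosen because it accepts both l1 and l2 penalties), the 5-fold cross-validation, the 30% test split, and n_neighbors=5 are assumptions not given in the description. Only the penalty and C grids come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the preprocessed Adult data (assumption).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Grid taken from the description: two penalties and six C values.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1, 10, 100, 1000],
}

# liblinear supports both l1 and l2; cv=5 is an assumption.
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)

models = {
    "Logistic regression": LogisticRegression(solver="liblinear", C=1.0),
    "GridSearchCV logistic regression": search,
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model and print precision, recall, f1-score, and support per class.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(
        y_test, model.predict(X_test),
        target_names=["Under 50k", "Over 50k"],
    ))

# Best penalty / C combination found by the search.
print(search.best_params_)
```

Because the data here is synthetic, the printed scores will not match the figures reported above; the sketch only shows how such a table can be produced.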

Conclusion

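After preprocessing, the dataset is used to train and evaluate logistic regression models. The source reports that tuning penalty and C with GridSearchCV improves on the initial model; the tuned model reaches an average f1-score of 0.84, although recall for the over-50K class (0.58) remains well below that of the under-50K class (0.93).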

Topics

Revenue Forecasting
Data Analysis

Source

Organization: github

Created: 11/27/2018
