
Diabetes Binary Health Indicators BRFSS2015

This dataset contains various health‑related features and a binary target variable indicating the presence or absence of diabetes. The dataset originates from the CDC and is used to build machine‑learning models to classify individuals as diabetic or not.

Source
github
Created
Jul 22, 2024
Updated
Aug 8, 2024
Overview

Dataset description and usage context

Diabetes Binary Classification Project Documentation

Introduction

The project aims to build machine‑learning models based on multiple health indicators to classify individuals as diabetic or not. The dataset used is “Diabetes Binary Health Indicators BRFSS 2015”, from the U.S. Centers for Disease Control and Prevention (CDC).

Dataset Loading

The dataset is loaded with the following code:

import pandas as pd
df = pd.read_csv("diabetes_binary_health_indicators_BRFSS2015.csv")

Data Exploration

Profiling Report

A profiling report provides a comprehensive overview of the dataset, including distributions, missing values, and correlations (the pandas-profiling package has since been renamed ydata-profiling):

from ydata_profiling import ProfileReport  # formerly pandas_profiling

profile = ProfileReport(df, title="Profiling Report")
profile.to_file("analysis_report.html")

Basic Exploration

The basic exploration covers the first few rows, column names, a statistical summary, DataFrame info, missing-value counts, duplicated rows, unique-value counts, and the correlation matrix:

print("First few rows of the dataset:")
print(df.head())

print("Columns in the dataset:")
print(df.columns)

print("Statistical summary of the dataset:")
print(df.describe().T)

print("Information about the dataset:")
df.info()  # info() prints directly and returns None, so wrapping it in print() would print "None"

print("Number of missing values in each column:")
print(df.isnull().sum())

print("Number of duplicated rows in the dataset:")
print(df.duplicated().sum())

print("Number of unique values in each column:")
print(df.nunique())

print("Correlation matrix:")
print(df.corr(numeric_only=True))

Visual Exploration

Visual exploration includes a correlation heatmap, class distribution plot for the binary diabetes target, and correlation plots with the target:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16,10))
sns.heatmap(df.corr(), annot=True)
plt.show()

sns.countplot(x='Diabetes_binary', data=df)
plt.title("Class Distribution of Diabetes_binary")
plt.show()

plt.figure(figsize=(12, 8))
df.corr()['Diabetes_binary'].sort_values().plot(kind='bar')
plt.title("Correlation with Diabetes_binary")
plt.show()

Data Preprocessing

Handling Missing Values and Duplicates

The dataset contains no missing values but has duplicate rows that need to be removed.
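
The duplicate-removal step can be sketched with pandas' drop_duplicates; the small frame below is a stand-in for the real df loaded earlier:

```python
import pandas as pd

# Stand-in frame with one duplicated row (in the project, df is the BRFSS data)
df = pd.DataFrame({"BMI": [28.0, 28.0, 31.5], "HighBP": [1, 1, 0]})

n_dupes = int(df.duplicated().sum())  # rows identical to an earlier row
df = df.drop_duplicates().reset_index(drop=True)
print(f"Removed {n_dupes} duplicate rows; {len(df)} rows remain.")
```

Resetting the index after dropping rows keeps downstream positional indexing consistent.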

Data Splitting

The dataset is split into training and testing sets:

from sklearn.model_selection import train_test_split
X = df.drop(columns='Diabetes_binary')
y = df['Diabetes_binary']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify preserves the class ratio in both splits
)

Data Scaling

Various scalers are applied:

  • StandardScaler
  • MinMaxScaler
  • RobustScaler
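
A sketch of how the three scalers could be compared side by side; the random matrix stands in for the BRFSS feature matrix, and each scaler is fit on the training split only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

rng = np.random.default_rng(42)
X_train = rng.normal(loc=50, scale=10, size=(100, 3))  # stand-in for the training features

scaled = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # Fit on the training split only; the same fitted scaler would transform X_test
    scaled[type(scaler).__name__] = scaler.fit_transform(X_train)

for name, X_s in scaled.items():
    print(f"{name}: min={X_s.min():.2f}, max={X_s.max():.2f}")
```

StandardScaler centers each column to zero mean and unit variance, MinMaxScaler maps each column to [0, 1], and RobustScaler uses medians and quartiles, which makes it less sensitive to outliers.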

Handling Imbalanced Data

SMOTE is used to address class imbalance:

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

Model Building

Multiple classification models are built, including:

  • Logistic Regression
  • RandomForestClassifier
  • GradientBoostingClassifier
  • KNeighborsClassifier
  • GaussianNB
  • DecisionTreeClassifier
  • XGBClassifier
  • CatBoostClassifier
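
The comparison across classifiers can be sketched as a loop over estimators that share the fit/predict interface; synthetic data stands in for the BRFSS features, and XGBClassifier and CatBoostClassifier would slot into the same dictionary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in binary-classification data in place of the BRFSS feature matrix
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                               # train on the training split
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: accuracy={scores[name]:.3f}")
```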

Example Pipeline with Logistic Regression

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Scaling and classification combined in one estimator, so the scaler
# is refit on each cross-validation fold without leaking test data
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))  # higher max_iter avoids convergence warnings
])

param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__penalty': ['l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train_res, y_train_res)

Model Evaluation

Evaluation metrics include:

  • Accuracy
  • Precision
  • Recall
  • F1 Score

Example Evaluation Code

from sklearn.metrics import classification_report, confusion_matrix
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Findings and Learnings

  1. Data Quality: The dataset contains many duplicate rows that need to be removed.
  2. Feature Importance: Features such as BMI, HighBP, and Age show high correlation with diabetes.
  3. Class Imbalance: The target variable is imbalanced; techniques like SMOTE are essential.
  4. Model Performance: Ensemble models such as Random Forest and Gradient Boosting outperform simple models like Logistic Regression and Naïve Bayes.
  5. Hyperparameter Tuning: GridSearchCV effectively tunes hyperparameters and improves performance.

Conclusion

The project successfully classifies individuals as diabetic or not using multiple machine‑learning models. Ensemble methods perform best, and addressing class imbalance is crucial for improving model performance.
