Diabetes Binary Health Indicators BRFSS2015
This dataset contains various health‑related features and a binary target variable indicating the presence or absence of diabetes. The dataset originates from the CDC and is used to build machine‑learning models to classify individuals as diabetic or not.
Diabetes Binary Classification Project Documentation
Introduction
The project aims to build machine‑learning models based on multiple health indicators to classify individuals as diabetic or not. The dataset used is “Diabetes Binary Health Indicators BRFSS 2015”, from the U.S. Centers for Disease Control and Prevention (CDC).
Dataset Loading
The dataset is loaded with the following code:
import pandas as pd
df = pd.read_csv("diabetes_binary_health_indicators_BRFSS2015.csv")
Data Exploration
Profiling Report
A profiling report provides a comprehensive overview of the dataset, including distributions, missing values, and correlations:
from ydata_profiling import ProfileReport  # pandas_profiling has been renamed to ydata-profiling
profile = ProfileReport(df, title="Profiling Report")
profile.to_file("analysis_report.html")
Basic Exploration
The basic exploration covers the first few rows, column names, a statistical summary, dataframe info, missing values, duplicated rows, unique-value counts, and the correlation matrix:
print("First few rows of the dataset:")
print(df.head())
print("Columns in the dataset:")
print(df.columns)
print("Statistical summary of the dataset:")
print(df.describe().T)
print("Information about the dataset:")
df.info()  # info() prints directly and returns None, so it should not be wrapped in print()
print("Number of missing values in each column:")
print(df.isnull().sum())
print("Number of duplicated rows in the dataset:")
print(df.duplicated().sum())
print("Number of unique values in each column:")
print(df.nunique())
print("Correlation matrix:")
print(df.corr(numeric_only=True))
Visual Exploration
Visual exploration includes a correlation heatmap, class distribution plot for the binary diabetes target, and correlation plots with the target:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(16,10))
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
sns.countplot(x='Diabetes_binary', data=df)
plt.title("Class Distribution of Diabetes_binary")
plt.show()
plt.figure(figsize=(12, 8))
df.corr(numeric_only=True)['Diabetes_binary'].sort_values().plot(kind='bar')
plt.title("Correlation with Diabetes_binary")
plt.show()
Data Preprocessing
Handling Missing Values and Duplicates
The dataset contains no missing values but has duplicate rows that need to be removed.
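Duplicate removal is a one-liner in pandas. A minimal sketch (the small frame below stands in for the loaded BRFSS dataframe; the actual duplicate count depends on the file):

```python
import pandas as pd

# Toy frame standing in for the loaded BRFSS data
df = pd.DataFrame({"HighBP": [1, 1, 0], "Diabetes_binary": [0, 0, 1]})

n_dupes = df.duplicated().sum()  # count exact duplicate rows
print(f"Removing {n_dupes} duplicate rows")
df = df.drop_duplicates().reset_index(drop=True)  # keep the first occurrence
```

Resetting the index after dropping rows keeps row labels contiguous for later positional operations.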
Data Splitting
The dataset is split into training and testing sets:
from sklearn.model_selection import train_test_split
X = df.drop(columns='Diabetes_binary')
y = df['Diabetes_binary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)  # stratify preserves the class ratio in both splits
Data Scaling
Three scalers are evaluated to compare their effect on model performance:
- StandardScaler
- MinMaxScaler
- RobustScaler
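Each scaler should be fit on the training split only, then used to transform both splits. A sketch with a toy matrix (in the project, the input would be X_train):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy training matrix with features on very different scales
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # Fit on training data only to avoid leaking test-set statistics
    X_scaled = scaler.fit_transform(X_train)
    print(type(scaler).__name__, X_scaled.mean(axis=0).round(3))
```

StandardScaler centers each feature to zero mean and unit variance, MinMaxScaler maps it to [0, 1], and RobustScaler uses the median and interquartile range, making it less sensitive to outliers.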
Handling Imbalanced Data
SMOTE is used to address class imbalance:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
Model Building
Multiple classification models are built, including:
- Logistic Regression
- RandomForestClassifier
- GradientBoostingClassifier
- KNeighborsClassifier
- GaussianNB
- DecisionTreeClassifier
- XGBClassifier
- CatBoostClassifier
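A common pattern is to train the candidates in one loop and compare a single metric. The sketch below uses synthetic data and only the scikit-learn models from the list (XGBClassifier and CatBoostClassifier require the separate xgboost and catboost packages):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the BRFSS features
X, y = make_classification(n_samples=500, weights=[0.85], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: F1 = {f1_score(y_te, model.predict(X_te)):.3f}")
```

Keeping the models in a dictionary makes it straightforward to add or remove candidates without duplicating the fit/score logic.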
Example Pipeline with Logistic Regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression(max_iter=1000))
])
from sklearn.model_selection import GridSearchCV
param_grid = {
'classifier__C': [0.1, 1, 10],
'classifier__penalty': ['l2']
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train_res, y_train_res)
Model Evaluation
Evaluation metrics include:
- Accuracy
- Precision
- Recall
- F1 Score
Example Evaluation Code
from sklearn.metrics import classification_report, confusion_matrix
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
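Each of the four metrics can also be computed individually. A sketch on hand-made label arrays (here TP=2, FP=1, FN=1, TN=2):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total = 4/6
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/3
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
```

For an imbalanced target like Diabetes_binary, precision, recall, and F1 are more informative than accuracy alone, since a model predicting the majority class everywhere still scores high accuracy.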
Findings and Learnings
- Data Quality: The dataset contains many duplicate rows that need to be removed.
- Feature Importance: Features such as BMI, HighBP, and Age show high correlation with diabetes.
- Class Imbalance: The target variable is imbalanced; techniques like SMOTE are essential.
- Model Performance: Ensemble models such as Random Forest and Gradient Boosting outperform simple models like Logistic Regression and Naïve Bayes.
- Hyperparameter Tuning: GridSearchCV effectively tunes hyperparameters and improves performance.
Conclusion
The project successfully classifies individuals as diabetic or not using multiple machine‑learning models. Ensemble methods perform best, and addressing class imbalance is crucial for improving model performance.