Dataset asset, Open Source Community, Machine Learning, Stroke
Cereberal-Stroke-Analysis
This dataset is used for stroke prediction with machine learning models, using resampling techniques such as SMOTEENN to address class imbalance and improve prediction accuracy.
Source
github
Created
Dec 12, 2023
Updated
Dec 12, 2023
Overview
Dataset description and usage context
Overview of Dataset Processing Workflow
Data Loading and Import
- Import pandas, numpy, seaborn, matplotlib.pyplot, and other libraries, then read the CSV file into a DataFrame (df).
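The loading step could look roughly like the sketch below. The inline sample text, its column names, and its values are hypothetical stand-ins so the snippet runs on its own; with the real file, the path to the CSV would be passed to pd.read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; columns and values are illustrative only.
SAMPLE_CSV = """age,gender,bmi,stroke
67,Male,36.6,1
61,Female,,0
80,Male,32.5,0
"""

# With the real dataset this would be pd.read_csv("<path to the CSV file>").
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(df.shape)
```

The plotting libraries (seaborn, matplotlib.pyplot) are imported the same way when charts are needed.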
Exploratory Data Analysis (EDA)
- Perform basic data exploration with head() and describe().
- Check and count missing values using isnull().sum().
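A minimal sketch of these exploration calls, run on a small made-up DataFrame (the columns and values are assumptions for illustration, not the dataset's actual contents):

```python
import numpy as np
import pandas as pd

# Illustrative data only; the real DataFrame comes from the CSV file.
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "stroke": [1, 0, 0, 0],
})

print(df.head())       # first rows of the table
print(df.describe())   # summary statistics for numeric columns

# Count missing values per column.
missing_counts = df.isnull().sum()
print(missing_counts)
```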
Handling Categorical Variables
- Apply pd.get_dummies() for one-hot encoding of categorical variables.
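As a rough illustration of one-hot encoding, the sketch below encodes a single hypothetical categorical column; which columns are categorical in the real dataset is not stated here, so the column name is an assumption.

```python
import pandas as pd

# Illustrative data; "gender" is a hypothetical categorical column.
df = pd.DataFrame({
    "age": [67, 61, 80],
    "gender": ["Male", "Female", "Male"],
})

# Each category becomes its own indicator column, prefixed with the original name.
encoded = pd.get_dummies(df, columns=["gender"])
print(encoded.columns.tolist())
```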
Handling Missing Values
- Fill missing values using the KNNImputer algorithm.
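The imputation step might be sketched as follows with scikit-learn's KNNImputer. The toy data and the choice of n_neighbors=2 are assumptions; the setting actually used for this dataset is not given.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative numeric data with one missing bmi value.
df = pd.DataFrame({
    "age": [20.0, 22.0, 60.0, 62.0],
    "bmi": [21.0, np.nan, 30.0, 32.0],
})

# Each missing entry is replaced by the mean of that feature over the
# nearest rows, with distance measured on the features that are present.
imputer = KNNImputer(n_neighbors=2)  # n_neighbors is an assumed value
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

KNNImputer works on numeric arrays, so categorical columns need to be encoded before this step.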
Feature Scaling and Train‑Test Split
- Perform feature scaling with MinMaxScaler.
- Split the dataset into training and testing sets.
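One possible way to combine the two steps is sketched below on synthetic data. Two choices here are assumptions rather than the workflow's documented settings: the split is stratified with a 20% test share, and the scaler is fitted on the training portion only so that test-set statistics do not leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in data: 90 negatives, 10 positives.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Stratify so both splits keep the same class ratio (assumed settings).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data, then apply the same mapping to the test data.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.min(), X_train_scaled.max())
```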
Model Selection and Evaluation
- Conduct preliminary testing with models such as KNeighborsClassifier, GaussianNB, DecisionTreeClassifier, and RandomForestClassifier.
- Generate a classification report to evaluate model performance on the imbalanced dataset.
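A minimal sketch of this baseline comparison, assuming default hyperparameters and a synthetic imbalanced dataset from make_classification in place of the real stroke data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly 5% positives (an assumed ratio).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # zero_division=0 avoids warnings when a model predicts no positives.
    reports[name] = classification_report(y_test, y_pred, zero_division=0)
    print(name)
    print(reports[name])
```

On imbalanced data, the per-class recall for the positive class is usually more informative than overall accuracy, since a model can score high accuracy while missing most positive cases.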
Data Resampling
- Apply SMOTE for oversampling.
- Perform random undersampling to balance class distribution.
- Use SMOTEENN to combine oversampling and undersampling.
Post‑Resampling Model Evaluation
- Retrain and evaluate models on the oversampled, undersampled, and combined sampled datasets.
Conclusion
- Resampling techniques, SMOTEENN in particular, substantially improve the models’ ability to identify positive stroke cases.