Synthesized Medical Cost Personal Dataset
This dataset contains 1,339 health‑insurance records and is intended for predicting individual medical charges. Features include age (numeric), sex (categorical), BMI (numeric), number of children (numeric), smoker status (categorical), and region (categorical); the target variable is charges (numeric). The data are synthetic: they were generated to retain 95% similarity to the original dataset while addressing privacy concerns.
Description
Medical Insurance Charges Bayesian Analysis Dataset
Overview
This dataset is intended to support the development of health‑insurance products by using historical records to estimate individual medical costs. Such estimates feed pricing models, insurance planning, and portfolio management. The main goal is to predict insurance charges accurately from the predictor variables listed under Dataset Details.
Dataset Details
- Number of Records: 1,339 health‑insurance entries
- Target Variable: Individual medical charges ("Charges")
- Features:
  - Age (numeric): Age of the policyholder
  - Sex (categorical): Gender of the policyholder (male or female)
  - BMI (numeric): Body Mass Index (kg/m²)
  - Children (numeric): Number of children covered
  - Smoker (categorical): Whether the policyholder smokes
  - Region (categorical): Beneficiary’s U.S. region (Northeast / Southeast / Southwest / Northwest)
  - Charges (numeric): Individual medical cost billed by the health‑insurance company
Note: Synthetic data were used to preserve privacy while maintaining 95% similarity to the original dataset, with the aim of complying with GDPR and related regulations.
Required Libraries
The following R libraries are used for analysis:
packages <- c("dplyr", "MCMCpack", "coda", "R2OpenBUGS", "mixAK", "brms")
invisible(lapply(packages, library, character.only = TRUE))
Pre‑processing
The original dataset underwent several modifications, such as converting categorical variables to binary indicators and reformatting certain fields to facilitate analysis.
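The indicator conversion described above could look roughly like the following sketch. The exact steps used for this dataset are not documented here, so the lowercase column names, the level labels, and the toy rows are all assumptions made for illustration.

```r
# Toy rows using the documented feature names; values are illustrative,
# not taken from the dataset.
insurance <- data.frame(
  age      = c(19, 45, 33, 60),
  sex      = c("female", "male", "male", "female"),
  bmi      = c(27.9, 30.1, 22.7, 25.8),
  children = c(0, 2, 0, 1),
  smoker   = c("yes", "no", "no", "yes"),
  region   = c("southwest", "southeast", "northwest", "northeast"),
  stringsAsFactors = FALSE
)

# Two-level variables become single 0/1 indicators.
insurance$sex_male   <- as.integer(insurance$sex == "male")
insurance$smoker_yes <- as.integer(insurance$smoker == "yes")

# Four-level region: expand into dummy columns and drop the intercept column,
# which leaves the first level ("northeast") as the reference category.
insurance$region <- factor(
  insurance$region,
  levels = c("northeast", "northwest", "southeast", "southwest")
)
region_dummies <- model.matrix(~ region, data = insurance)[, -1, drop = FALSE]

encoded <- cbind(
  insurance[, c("age", "bmi", "children", "sex_male", "smoker_yes")],
  region_dummies
)
print(encoded)
```

Dropping one region level avoids perfect collinearity with the intercept in the regression models that follow.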
Models Employed
Linear Models
- Markov Chain Monte Carlo (MCMC) sampling
- OpenBUGS
- Frequentist Generalized Linear Model (GLM)
- Bayesian GLM
- Normal‑mixture MCMC
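As a rough illustration of how the frequentist and Bayesian linear fits listed above relate, the sketch below fits a Gaussian GLM with base R and, if MCMCpack is installed, a Bayesian linear regression with `MCMCregress`. The data are simulated stand-ins with arbitrary coefficients, and the model formula, predictor names, and sampler settings are assumptions rather than the specification used in the accompanying analysis.

```r
set.seed(1)

# Simulated stand-in data (NOT the real dataset); coefficients are arbitrary.
n <- 200
sim <- data.frame(
  age        = sample(18:64, n, replace = TRUE),
  bmi        = rnorm(n, mean = 30, sd = 6),
  smoker_yes = rbinom(n, size = 1, prob = 0.2)
)
sim$charges <- 2000 + 250 * sim$age + 300 * sim$bmi +
  20000 * sim$smoker_yes + rnorm(n, sd = 3000)

# Frequentist GLM: Gaussian family with identity link, equivalent to least squares.
fit_glm <- glm(charges ~ age + bmi + smoker_yes, data = sim, family = gaussian())
print(coef(fit_glm))

# Bayesian counterpart via MCMCpack, run only when the package is available.
if (requireNamespace("MCMCpack", quietly = TRUE)) {
  post <- MCMCpack::MCMCregress(
    charges ~ age + bmi + smoker_yes,
    data = sim, burnin = 1000, mcmc = 5000, seed = 1
  )
  # Posterior summaries for the regression coefficients and the error variance.
  print(summary(post))
}
```

With the default flat prior on the coefficients, the posterior means should sit close to the GLM estimates, which makes this a convenient sanity check before moving to more informative priors.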
Non‑linear Models
- MCMC sampling
- Bayesian Generalized Additive Model (GAM)
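To show what a non-linear term adds, the sketch below compares a purely linear fit with one that uses natural cubic spline terms. This is a frequentist stand-in built on the `splines` package that ships with R, not the Bayesian GAM itself. The simulated data, the curved BMI effect, and the choice of `df = 3` are assumptions for illustration only.

```r
library(splines)  # part of the standard R distribution

set.seed(2)

# Simulated stand-in data with a deliberately curved BMI effect (illustrative only).
n <- 300
sim <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = runif(n, min = 16, max = 50)
)
sim$charges <- 3000 + 200 * sim$age + 15 * (sim$bmi - 30)^2 + rnorm(n, sd = 1500)

# Linear fit versus a fit with natural cubic spline terms for age and BMI.
fit_linear <- lm(charges ~ age + bmi, data = sim)
fit_smooth <- lm(charges ~ ns(age, df = 3) + ns(bmi, df = 3), data = sim)

# A lower AIC for the spline fit suggests the curvature is worth modelling.
print(AIC(fit_linear, fit_smooth))

# A Bayesian GAM along the same lines could be written with brms smooth terms
# (not run here; it requires a working Stan toolchain):
# brms::brm(charges ~ s(age) + s(bmi), data = sim, family = gaussian())
```

The spline basis plays the role that smooth terms play in a GAM: each predictor's effect is allowed to bend, while the model stays additive across predictors.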
Conclusion
Detailed conclusions and results are presented in the accompanying file, including figures and additional relevant information.
Source
Organization: github
Created: 9/3/2024