Synthesized Medical Cost Personal Dataset

This dataset contains 1,339 health‑insurance records and is intended for predicting individual medical charges. The features are age (numeric), sex (categorical), BMI (numeric), number of children (numeric), smoker status (categorical), and region (categorical); the target variable is charges (numeric). The data are synthetic, generated to be 95% similar to the original dataset in order to address privacy concerns.

Updated 9/3/2024
github

Description

Medical Insurance Charges Bayesian Analysis Dataset

Overview

This dataset supports the design of health‑insurance products by using historical records to estimate individual medical costs. Such estimates inform pricing models, insurance planning, and portfolio management. The main goal is to predict insurance charges accurately from a set of predictor variables.

Dataset Details

  • Number of Records: 1,339 health‑insurance entries
  • Target Variable: Individual medical charges ("Charges")
  • Features:
    • Age (numeric): Age of the policyholder
    • Sex (categorical): Gender of the policyholder (male or female)
    • BMI (numeric): Body Mass Index (kg/m²)
    • Children (numeric): Number of children covered
    • Smoker (categorical): Whether the policyholder smokes
    • Region (categorical): Beneficiary’s U.S. region (Northeast / Southeast / Southwest / Northwest)
    • Charges (numeric, target): Individual medical cost billed by the health‑insurance company

Note: Synthetic data were used to preserve privacy while maintaining a 95% similarity to the original dataset, ensuring compliance with GDPR and related regulations.
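
As a rough illustration of the schema described above, the following R sketch builds a small data frame with one column per listed variable and checks the column types. The rows are made-up values, not records from the dataset, and the lowercase column names and level spellings are assumptions; the actual file may label them differently.

```r
# Hypothetical toy rows that follow the schema above; the values are
# invented for illustration and are NOT taken from the dataset.
insurance <- data.frame(
  age      = c(23L, 41L, 58L, 35L),
  sex      = factor(c("female", "male", "male", "female")),
  bmi      = c(24.5, 31.2, 27.8, 22.1),
  children = c(0L, 2L, 1L, 3L),
  smoker   = factor(c("no", "yes", "no", "no")),
  region   = factor(
    c("northeast", "southeast", "southwest", "northwest"),
    levels = c("northeast", "southeast", "southwest", "northwest")
  ),
  charges  = c(2500, 38000, 12000, 6100)
)

# Check that each column has the type implied by the feature list
# (column names here are assumed, not confirmed by the source).
expected <- c(age = "integer", sex = "factor", bmi = "numeric",
              children = "integer", smoker = "factor",
              region = "factor", charges = "numeric")
actual    <- vapply(insurance, function(col) class(col)[1], character(1))
schema_ok <- identical(actual[names(expected)], expected)

str(insurance)
```

A check like `schema_ok` can be run right after reading the real file to catch columns that were parsed with an unexpected type.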

Required Libraries

The following R libraries are used for analysis:

c("dplyr", "MCMCpack", "coda", "R2OpenBUGS", "mixAK", "brms")

Pre‑processing

The original dataset underwent several modifications, such as converting categorical variables to binary indicators and reformatting certain fields to facilitate analysis.
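
The exact pre‑processing steps are not given here, so the following is only a minimal sketch of what converting categorical variables to binary indicators could look like in base R. The column names, level labels, and the choice of "northeast" as the reference category are assumptions for illustration.

```r
# Hypothetical toy data; column names and level labels are assumptions.
raw <- data.frame(
  sex    = c("female", "male", "male", "female"),
  smoker = c("no", "yes", "no", "no"),
  region = c("northeast", "southeast", "southwest", "northwest"),
  stringsAsFactors = FALSE
)

# Two-level variables become a single 0/1 column each.
prepared <- data.frame(
  male   = as.integer(raw$sex == "male"),
  smoker = as.integer(raw$smoker == "yes")
)

# A four-level variable becomes three indicators; the first level
# ("northeast" here) is the reference category absorbed by the intercept.
region_f <- factor(
  raw$region,
  levels = c("northeast", "southeast", "southwest", "northwest")
)
region_dummies <- model.matrix(~ region_f)[, -1, drop = FALSE]
colnames(region_dummies) <- sub("^region_f", "region_",
                                colnames(region_dummies))

prepared <- cbind(prepared, region_dummies)
print(prepared)
```

Dropping one level per categorical variable avoids perfect collinearity with the intercept, which matters for the linear models listed in the next section.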

Models Employed

Linear Models

  • Markov Chain Monte Carlo (MCMC) sampling
  • OpenBUGS
  • Frequentist Generalized Linear Model (GLM)
  • Bayesian GLM
  • Normal‑mixture MCMC
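
To make the contrast between the frequentist and MCMC approaches concrete, here is a minimal base‑R sketch on simulated data. It fits a Gaussian GLM and then runs a hand‑written Gibbs sampler for the same linear model. The data, coefficients, and prior (flat on the coefficients, 1/σ² on the variance) are assumptions chosen for simplicity; this is not the analysis from the accompanying file, which lists packages such as MCMCpack (whose `MCMCregress` function covers a similar model) rather than a hand‑coded sampler.

```r
set.seed(1)

# Simulated stand-in data (NOT the real dataset): one numeric predictor
# and one binary indicator, with made-up coefficients.
n  <- 200
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.3)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

# Frequentist fit: Gaussian GLM with identity link (equivalent to OLS).
freq_fit <- glm(y ~ x1 + x2, family = gaussian())

# Bayesian fit by Gibbs sampling, assuming a flat prior on beta and a
# 1/sigma^2 prior on the variance.
X        <- cbind(1, x1, x2)
XtX_inv  <- solve(crossprod(X))
beta_hat <- drop(XtX_inv %*% crossprod(X, y))
L        <- t(chol(XtX_inv))  # lower factor: L %*% t(L) equals XtX_inv

n_iter <- 3000
burn   <- 500
draws  <- matrix(NA_real_, n_iter, ncol(X) + 1,
                 dimnames = list(NULL,
                                 c("(Intercept)", "x1", "x2", "sigma2")))
beta <- beta_hat
for (i in seq_len(n_iter)) {
  # sigma^2 | beta, y ~ Inverse-Gamma(n / 2, RSS / 2)
  rss    <- sum((y - X %*% beta)^2)
  sigma2 <- 1 / rgamma(1, shape = n / 2, rate = rss / 2)
  # beta | sigma^2, y ~ Normal(beta_hat, sigma^2 * (X'X)^-1)
  beta   <- beta_hat + sqrt(sigma2) * drop(L %*% rnorm(ncol(X)))
  draws[i, ] <- c(beta, sigma2)
}

# Posterior means after discarding the burn-in draws.
post_mean <- colMeans(draws[-seq_len(burn), ])

print(rbind(glm = coef(freq_fit), gibbs = post_mean[1:3]))
```

Under this prior the posterior mean of the coefficients coincides with the least‑squares estimate, so the two rows should agree up to Monte Carlo error; informative priors would pull them apart.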

Non‑linear Models

  • MCMC sampling
  • Bayesian Generalized Additive Model (GAM)
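
The idea behind an additive model is to replace a straight‑line term with a smooth function of the predictor. The sketch below illustrates that idea with a frequentist stand‑in: a natural cubic spline basis inside `lm`, using the `splines` package that ships with R. It is not a Bayesian GAM, and the simulated `age` and `charges` values are assumptions for illustration only; packages such as brms accept smooth terms of the form `s(x)` for a Bayesian version.

```r
library(splines)  # part of the standard R distribution

set.seed(2)

# Simulated stand-in data (NOT the real dataset): a curved relationship
# between a predictor and the response, with made-up parameters.
n       <- 300
age     <- runif(n, 18, 64)
charges <- 5 + 0.01 * (age - 40)^2 + rnorm(n, sd = 0.5)

# Straight-line fit versus a smooth fit on a natural spline basis.
linear_fit <- lm(charges ~ age)
smooth_fit <- lm(charges ~ ns(age, df = 4))

# Lower AIC indicates the better trade-off between fit and complexity.
aic <- c(linear = AIC(linear_fit), spline = AIC(smooth_fit))
print(aic)

# Fitted curve at a few ages.
grid <- data.frame(age = seq(20, 60, by = 10))
grid$fitted <- predict(smooth_fit, newdata = grid)
print(grid)
```

When the true relationship is curved, as in this simulation, the spline fit should show a clearly lower AIC than the straight line.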

Conclusion

Detailed results and conclusions, including figures and supporting information, are presented in the accompanying file.

Topics

Medical Cost Prediction
Synthetic Data

Source

Organization: github

Created: 9/3/2024