Medical Cost Analysis

Sep 10, 2021


Visitors to America are often cautioned to purchase a health insurance plan prior to their visit. Often this encouragement is accompanied by stern warnings, even horror stories, about healthcare costs in the USA. People talk about outrageous bills for uninsured visitors who find themselves in trouble. Despite its mixed reputation, healthcare in the United States performs exceptionally well in many regards. For instance, it has the best outcomes in the world for surviving a heart attack or stroke, though it does not do well when it comes to diabetes and asthma.

Waiting times, a concern in many countries with advanced health care systems, are less of a problem in the United States. Preventative health care spending is only marginally lower than that of other industrialized nations. Overall, the standard of healthcare in the United States is very high, but that may be of little comfort when a traveler is faced with an astonishing bill.

Exploratory Data Analysis

Observations:

  1. Charges increase as age increases
  2. Smokers cost much more than non-smokers
  3. Smokers with high BMI cost more (almost double the charges)
  4. Smokers with low BMI cost less than non-smokers with high BMI

Distribution of numeric data


We use MinMaxScaler to scale both the input features and the target.

I also scale the target because it makes it much easier to judge whether the Root Mean Square Error (RMSE), Mean Squared Error (MSE), Mean Absolute Error (MAE), and R2 score are large or small. For example, once the target lies in [0, 1], an RMSE larger than 1 means the model performs worse than a naive prediction, since any constant guess within that range has an RMSE of at most 1.
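As a minimal sketch of the scaling step, the snippet below applies scikit-learn's `MinMaxScaler` to a feature matrix and a target column, then maps the scaled target back to its original units. The arrays `X` and `y` hold made-up toy values, not rows from the dataset, and using two separate scaler objects is an assumption about how the step might be organized.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the real features and target (values are made up).
X = np.array([[19.0, 27.9], [33.0, 22.7], [62.0, 26.3], [45.0, 36.1]])
y = np.array([[16000.0], [22000.0], [28000.0], [8000.0]])  # 2-D column, as the scaler expects

# Separate scalers, so the target can be inverse-transformed on its own.
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()

X_scaled = x_scaler.fit_transform(X)  # each feature column now spans [0, 1]
y_scaled = y_scaler.fit_transform(y)  # target now spans [0, 1]

# Values on the scaled target can be mapped back to the original units.
y_restored = y_scaler.inverse_transform(y_scaled)

print(X_scaled.min(axis=0), X_scaled.max(axis=0))
print(y_scaled.ravel())
```

When a train/test split is involved, fitting the scalers on the training portion only and reusing them on the test portion avoids leaking test-set information into the preprocessing.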

((1070, 11), (268, 11), (1070, 1), (268, 1))
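The shapes above add up to 1338 rows with 11 feature columns. One way such shapes could arise is an 80/20 split with scikit-learn's `train_test_split`; the sketch below checks this on placeholder arrays of the right size. The `test_size=0.2` and `random_state=42` values are assumptions, since the split call itself is not shown.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same dimensions as the data (contents are irrelevant here).
X = np.zeros((1338, 11))
y = np.zeros((1338, 1))

# Assumed 80/20 split; the actual parameters used are not shown in the post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

shapes = (X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print(shapes)
```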

Visualize the predictions

From the above graph, it seems like Gradient Boosting Regressor has the best results. Let’s verify that below.

Indeed, Gradient Boosting Regressor gives the best result (i.e., it has the lowest error). Its RMSE is close to zero, which means the model performs well, provided there is no overfitting. Therefore, the next step is to check whether overfitting occurs and to fix it if it does.
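For reference, the four metrics used in this comparison can be computed directly from their definitions. The helper below is a small NumPy sketch; the function name `regression_metrics` and the example values are hypothetical and only illustrate the arithmetic.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, MSE, MAE and R2 for two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    errors = y_true - y_pred

    mse = float(np.mean(errors ** 2))        # mean squared error
    rmse = float(np.sqrt(mse))               # square root of the MSE
    mae = float(np.mean(np.abs(errors)))     # mean absolute error
    ss_res = float(np.sum(errors ** 2))      # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot               # coefficient of determination
    return {"rmse": rmse, "mse": mse, "mae": mae, "r2": r2}


# Made-up scaled targets and predictions, just to exercise the function.
metrics = regression_metrics([0.0, 0.5, 1.0], [0.1, 0.5, 0.8])
print(metrics)
```

Because MSE is simply RMSE squared, the two always rank models in the same order; MAE is less sensitive to a few large errors.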

Plot the learning curves to check if there is overfitting

It’s overfitting: the test set has a much higher RMSE than the training set. A quick way to spot this is the big gap between the training curve and the test curve.
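A learning curve of this kind can be sketched with scikit-learn's `learning_curve`, which refits the model on growing subsets of the training data and scores each fit. The example below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` instead of the insurance data, 5-fold cross-validation instead of a single held-out test set, and default model settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (an assumption; not the data used in this post).
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    GradientBoostingRegressor(random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# The scorer returns negated RMSE, so flip the sign and average over the folds.
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_rmse, val_rmse):
    print(f"n={int(n):4d}  train RMSE={tr:8.3f}  validation RMSE={va:8.3f}  gap={va - tr:8.3f}")
```

A validation RMSE that stays well above the training RMSE across all sizes is the gap described above.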

We notice that the gap narrows as the training set grows. This means that the model may perform better if we feed it more data. However, we don’t have any more data. Therefore, we need to tune the hyperparameters of Gradient Boosting Regressor to reduce overfitting.

The hyperparameters used to tune:

  1. Depth of each tree (max_depth)
  2. Number of trees (n_estimators)
  3. Learning rate (learning_rate)
  4. Sub sample (subsample)

Tune the Gradient Boosting Regressor Model

1. Depth of each tree (max_depth)

Let’s choose the depth of each tree (max_depth) as 2

2. Number of trees (n_estimators)

Let’s choose the number of trees (n_estimators) as 20

3. Learning rate (learning_rate)

Let’s choose a learning rate of 0.02

4. Subsample (subsample)

Let’s choose subsample = 0.03
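Each of the four choices above can be explored the same way: vary one hyperparameter, refit, and compare training and test RMSE for every candidate value. The sketch below does this for `max_depth`. The `sweep` helper, the candidate values, and the synthetic data are all hypothetical and stand in for the real setup, which is not shown.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the data used in this post).
X, y = make_regression(n_samples=400, n_features=5, noise=25.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)


def sweep(param_name, values):
    """Fit one model per candidate value; return {value: (train_rmse, test_rmse)}."""
    results = {}
    for value in values:
        model = GradientBoostingRegressor(random_state=1, **{param_name: value})
        model.fit(X_train, y_train)
        train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
        test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
        results[value] = (float(train_rmse), float(test_rmse))
    return results


depth_results = sweep("max_depth", [1, 2, 3, 4, 5, 6])
for depth, (train_rmse, test_rmse) in depth_results.items():
    print(f"max_depth={depth}  train RMSE={train_rmse:.2f}  test RMSE={test_rmse:.2f}")
```

The same helper accepts `"n_estimators"`, `"learning_rate"`, or `"subsample"` with a matching list of candidate values. A reasonable choice is the value beyond which the training RMSE keeps falling while the test RMSE stops improving.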

Hyperparameter Tuning Summary:

  1. Depth of each tree (max_depth): 2

  2. Number of trees (n_estimators): 20

  3. Learning rate (learning_rate, eta): 0.02

  4. Sub sample (subsample): 0.03
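The summary maps directly onto the constructor arguments of scikit-learn's `GradientBoostingRegressor`. The sketch below builds a model with these four values and fits it on placeholder data; the synthetic data, the split, and `random_state=0` are assumptions for illustration, so the printed errors are not comparable to the results in this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# The four values from the summary above.
tuned_params = {
    "max_depth": 2,
    "n_estimators": 20,
    "learning_rate": 0.02,
    "subsample": 0.03,
}

# Synthetic stand-in data with the same dimensions as the real data (an assumption).
X, y = make_regression(n_samples=1338, n_features=11, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingRegressor(random_state=0, **tuned_params)
model.fit(X_train, y_train)

train_rmse = float(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
test_rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
print(f"train RMSE={train_rmse:.3f}  test RMSE={test_rmse:.3f}")
```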

Train and predict again after tuning the hyperparameters

rmse: 0.06894183401662704

As we can see, the two lines are now close to each other. This means that the model is no longer overfitting.

Gradient Boosting Regressor After Tuning
rmse: 0.08117722274077202
mse: 0.006589741491904914
mae: 0.04506965929243093
r2: 0.8334048238166812