In this project I attempted to predict insurance charges from a client’s age, sex, BMI, smoking status, and region of residence. I started with simple linear models, then added polynomial and interaction terms, comparing models by R squared and MSE. Smoking status was the most significant predictor of charges, and the interactions of smoking with age and BMI gave the best metrics, suggesting that smoking compounds the effects of increasing age and BMI on charges. However, my final model had non-constant variance in its residuals (heteroskedasticity), meaning substantial effects were left unexplained.
Part 1: Data Exploration
Read in the dataset, and display some summaries of the data.
import pandas as pd

df = pd.read_csv("https://www.dropbox.com/s/bocjjyo1ehr5auz/insurance_costs_1.csv?dl=1")
df.head()
   age     sex     bmi smoker     region      charges
0   19  female  27.900    yes  southwest  16884.92400
1   33    male  22.705     no  northwest  21984.47061
2   32    male  28.880     no  northwest   3866.85520
3   31  female  25.740     no  southeast   3756.62160
4   60  female  25.840     no  northwest  28923.13692
df.describe()
              age         bmi       charges
count  431.000000  431.000000    431.000000
mean    37.960557   30.768898  12297.098118
std     16.363909    6.111362  11876.527128
min     18.000000   15.960000   1131.506600
25%     22.000000   26.357500   2710.444575
50%     34.000000   30.590000   9866.304850
75%     55.000000   35.272500  14510.872600
max     64.000000   49.060000  55135.402090
df.isnull().sum()
age 0
sex 0
bmi 0
smoker 0
region 0
charges 0
dtype: int64
Fix any concerns you have about the data.
I’m concerned about the categorical variables so I’m going to dummify them. This converts categories like ‘sex’ and ‘smoker’ into numeric 0 or 1 columns that can be used in regression.
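As a minimal sketch of what dummification produces, here is `pd.get_dummies` applied to a tiny made-up frame (not the full dataset); `drop_first=True` is an optional extra that keeps one 0/1 column per category and avoids redundant, perfectly collinear columns:

```python
import pandas as pd

# Tiny made-up example frame, standing in for the real insurance data
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Each category becomes a 0/1 column; drop_first=True drops one
# level per variable so the dummies are not collinear
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())  # ['sex_male', 'smoker_yes']
```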
Make up to three plots comparing the response variable (charges) to one of the predictor variables. Briefly discuss each plot.
df.plot.scatter(x="age", y="charges")
There seems to be a correlation between age and higher charges. There are distinct groups of charges which could be explained by sex or smoking status, but all groups still increase with age. This suggests age is a useful predictor but there may be interaction effects with other variables.
df.plot.scatter(x="bmi", y="charges")
There seems to be a weak correlation between higher BMI and charges. This is surprising because I would have expected to see an exponential relationship since health issues tend to compound with higher BMIs.
df.boxplot(column='charges', by='smoker')
Smokers have higher charges. No surprise, smoking is a major risk factor. The clear separation between groups suggests smoker status will be a strong predictor in our models.
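The separation visible in the boxplot can be put into numbers with a group-wise summary. This sketch uses a tiny made-up frame in place of the real data; the same `groupby` call on `df` would quantify the actual gap:

```python
import pandas as pd

# Tiny made-up sample (not the real dataset), just to show the aggregation
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no"],
    "charges": [32000.0, 4000.0, 9000.0, 21000.0, 6000.0],
})

# Median charges per smoker group
print(toy.groupby("smoker")["charges"].median())
```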
Part Two: Simple Linear Models
Construct a simple linear model to predict the insurance charges from the beneficiary’s age. Discuss the model fit, and interpret the coefficient estimates.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

model_age = LinearRegression()
X = df[["age"]]
y = df["charges"]
model_age.fit(X, y)
model_age.coef_
The model fit is very poor with a low R squared. The slope of 228 means charges increase by $228 per year of age. The intercept of $3,611 represents predicted charges at age zero, which could reflect baseline healthcare costs like birth or early childhood care.
Make a model that also incorporates the variable sex. Report your results.
from sklearn.linear_model import LinearRegression

# Dummify sex so it can enter the regression as 0/1 columns
df_dummy = pd.get_dummies(df[["age", "sex"]])
model_age_sex = LinearRegression()
model_age_sex.fit(df_dummy, df["charges"])
pred_age_sex = model_age_sex.predict(df_dummy)
model_age_sex.coef_
Which model (Q2 or Q3) do you think better fits the data? Justify your answer by calculating the MSE for each model, and also by comparing R-squared values.
from sklearn.metrics import mean_squared_error, r2_score

mse_age_sex = mean_squared_error(df["charges"], pred_age_sex)
mse_age_sex
The Q3 model (age and sex) is better because its R squared is higher and its MSE is lower, though the improvement over age alone is modest. This makes sense, since the added variable helps separate the data into groups with different typical charges.
Part Three: Multiple Linear Models
Fit a model that uses age and bmi as predictors. (Do not include an interaction term, age*bmi, between these two.) Report your results. How does the MSE compare to the model in Part Two Q1? How does the R-squared compare?
X = df[["age", "bmi"]]
y = df["charges"]
model = LinearRegression()
model.fit(X, y)
model.intercept_
The MSE decreases only slightly and the R squared increases only slightly compared to Part Two Q1 (note that on the same data these two metrics must move together, since R squared is one minus MSE over the variance of charges). Adding BMI as a predictor doesn’t improve the model much, suggesting BMI alone isn’t a strong predictor of charges.
Perhaps the relationships are not linear. Fit a model that uses age and age^2 as predictors. How do the MSE and R-squared compare to the model in P2 Q1?
R squared for P2 Q1 was 0.099 and MSE was 123,792,439. The nonlinear model with age and age squared has nearly identical metrics, so both work equally well. Adding the squared term doesn’t capture any additional pattern in the data.
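The age + age² fit can be sketched with `PolynomialFeatures`. The data below is a synthetic stand-in (the real notebook fits `df`), so the printed metrics are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the age/charges columns (not the real data)
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200).reshape(-1, 1)
charges = 3600 + 228 * age.ravel() + rng.normal(0, 5000, size=200)

# Expand age into [age, age^2]; include_bias=False leaves the
# intercept to LinearRegression itself
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(age)
model = LinearRegression().fit(X_poly, charges)

pred = model.predict(X_poly)
print(r2_score(charges, pred), mean_squared_error(charges, pred))
```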
Fit a polynomial model of degree 4. How do the MSE and R-squared compare to the model in P2 Q1?
R squared for P2 Q1 (age on charges) was 0.099 with an MSE of 123,792,439. The degree 4 polynomial’s R squared is slightly better at 0.108, with a correspondingly small improvement in MSE, but the gain is too small to justify the extra complexity of the higher-degree model.
Fit a polynomial model of degree 12. How do the MSE and R-squared compare to the model in P2 Q1?
R squared stays basically the same and the MSE does not improve. This is not a better fit: the degree 12 polynomial just adds wiggles that chase noise in the training data without capturing any additional systematic effect of age, and such an overfit model would generalize even worse to new data.
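On the same data, R squared and training MSE are tied together (R² = 1 − MSE/Var(y)), so any degree that raises R squared must lower training MSE. A quick sweep over degrees makes the diminishing returns visible; this uses synthetic stand-in data again, so the exact numbers are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for age/charges: roughly linear plus noise
rng = np.random.default_rng(1)
age = rng.uniform(18, 64, 150).reshape(-1, 1)
charges = 3600 + 228 * age.ravel() + rng.normal(0, 4000, 150)

for degree in (1, 2, 4, 12):
    X = PolynomialFeatures(degree, include_bias=False).fit_transform(age)
    model = LinearRegression().fit(X, charges)
    pred = model.predict(X)
    # Training MSE can only go down (or stay flat) as degree grows
    print(degree, round(r2_score(charges, pred), 3),
          round(mean_squared_error(charges, pred), 1))
```

Seeing training R squared plateau while degree climbs is the telltale sign that the extra terms are fitting noise rather than signal.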
According to the MSE and R-squared, which is the best model? Do you agree that this is indeed the “best” model? Why or why not?
I don’t agree that any single model is definitively ‘best.’ The metrics are similar across all of them, and different use cases may favor different models. When performance is comparable, the simpler model is usually preferred because it is easier to interpret and less prone to overfitting.
Plot the predictions from your model in Q4 as a line plot on top of the scatterplot of your original data.
The best model uses age, BMI, and smoker as predictors, with both quantitative variables having interaction terms with smoker. This allows the effect of age and BMI on charges to differ between smokers and non smokers, which matches the pattern we saw in our exploratory plots.
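A sketch of how those interaction columns can be built by hand on a made-up mini frame (the real model is fit on `df`, and column names like `age_x_smoker` are my own labels):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny made-up frame standing in for the real dataset
toy = pd.DataFrame({
    "age": [19, 33, 32, 31, 60, 45],
    "bmi": [27.9, 22.7, 28.9, 25.7, 25.8, 31.0],
    "smoker": ["yes", "no", "no", "no", "no", "yes"],
    "charges": [16884.9, 2198.4, 3866.9, 3756.6, 8923.1, 21000.0],
})

toy["smoker_yes"] = (toy["smoker"] == "yes").astype(int)
# Interaction columns let the age and bmi slopes differ for smokers:
# for non-smokers these columns are 0, so only the base slopes apply
toy["age_x_smoker"] = toy["age"] * toy["smoker_yes"]
toy["bmi_x_smoker"] = toy["bmi"] * toy["smoker_yes"]

X = toy[["age", "bmi", "smoker_yes", "age_x_smoker", "bmi_x_smoker"]]
model = LinearRegression().fit(X, toy["charges"])
print(model.coef_)
```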
Below I made a residual plot to check whether the errors my model makes are random. The plot shows some issues. There is a curved pattern in the lower left, which means the model is missing a nonlinear relationship. The model’s errors are also far more dispersed for high-charge clients (non-constant variance). Since most of the larger residuals are positive, the model chronically underestimates the highest charges, which is not acceptable for stakeholders.
import matplotlib.pyplot as plt

y_pred_abc_full = model_abc_full.predict(X_abc_full)
residuals_abc_full = y - y_pred_abc_full
plt.scatter(y_pred_abc_full, residuals_abc_full)
plt.show()