import pandas as pd
ha = pd.read_csv("C:/Users/spink/OneDrive/Desktop/Machine Learning/Data/heart_attack.csv")Heart Attack Risk Assessment
In this project I built models to predict heart attack risk using patient health data. I compared KNN, Logistic Regression, and Decision Tree models using cross validation. Logistic Regression performed the best, especially for sensitivity, meaning it was the best at not missing patients who are actually at risk. This matters a lot in healthcare because missing a heart attack case is a big deal. After validating on new data, Logistic Regression was still the top performer. The most important predictors ended up being chest pain type and max heart rate.
ha.head()
| | age | sex | cp | trtbps | chol | restecg | thalach | output |
|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 0 | 150 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 1 | 187 | 1 |
| 2 | 56 | 1 | 1 | 120 | 236 | 1 | 178 | 1 |
| 3 | 57 | 0 | 0 | 120 | 354 | 1 | 163 | 1 |
| 4 | 57 | 1 | 0 | 140 | 192 | 1 | 148 | 1 |
ha.isna().sum()
age 0
sex 0
cp 0
trtbps 0
chol 0
restecg 0
thalach 0
output 0
dtype: int64
ha.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 273 non-null int64
1 sex 273 non-null int64
2 cp 273 non-null int64
3 trtbps 273 non-null int64
4 chol 273 non-null int64
5 restecg 273 non-null int64
6 thalach 273 non-null int64
7 output 273 non-null int64
dtypes: int64(8)
memory usage: 17.2 KB
ha.describe()
| | age | sex | cp | trtbps | chol | restecg | thalach | output |
|---|---|---|---|---|---|---|---|---|
| count | 273.000000 | 273.000000 | 273.000000 | 273.000000 | 273.000000 | 273.000000 | 273.000000 | 273.000000 |
| mean | 54.347985 | 0.673993 | 0.974359 | 132.098901 | 246.860806 | 0.538462 | 149.446886 | 0.534799 |
| std | 9.163134 | 0.469611 | 1.030456 | 17.700358 | 52.569726 | 0.528059 | 23.240707 | 0.499704 |
| min | 29.000000 | 0.000000 | 0.000000 | 94.000000 | 126.000000 | 0.000000 | 71.000000 | 0.000000 |
| 25% | 47.000000 | 0.000000 | 0.000000 | 120.000000 | 211.000000 | 0.000000 | 133.000000 | 0.000000 |
| 50% | 56.000000 | 1.000000 | 1.000000 | 130.000000 | 240.000000 | 1.000000 | 152.000000 | 1.000000 |
| 75% | 61.000000 | 1.000000 | 2.000000 | 140.000000 | 275.000000 | 1.000000 | 166.000000 | 1.000000 |
| max | 77.000000 | 1.000000 | 3.000000 | 200.000000 | 564.000000 | 2.000000 | 202.000000 | 1.000000 |
Part One: Fitting Models
Q1: KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X = ha.drop('output', axis=1)
y = ha['output']
knn_pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier()
)
k_nums = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11, 13, 15]}
gscv_knn = GridSearchCV(knn_pipeline, k_nums, cv=10, scoring="roc_auc")
gscv_knn.fit(X, y)
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('kneighborsclassifier',
                                        KNeighborsClassifier())]),
             param_grid={'kneighborsclassifier__n_neighbors': [3, 5, 7, 9, 11,
                                                               13, 15]},
             scoring='roc_auc')
gscv_knn.best_params_
{'kneighborsclassifier__n_neighbors': 15}
gscv_knn.best_score_
np.float64(0.8289224664224664)
knn_pipeline_15 = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=15)
)
# Store the fold scores in their own variable so the pipeline isn't overwritten.
knn_acc_scores = cross_val_score(knn_pipeline_15, X, y, cv=5, scoring="accuracy")
np.mean(knn_acc_scores)
np.float64(0.7581818181818182)
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
knn_best = gscv_knn.best_estimator_
y_pred_knn = cross_val_predict(knn_best, X, y, cv=10)
cm_knn = confusion_matrix(y, y_pred_knn)
cm_knn
array([[ 86,  41],
       [ 27, 119]])
The best k value was 15. I listed a range of k values that balances the risk of overfitting at small values against underfitting at large ones, then used grid search to cross-validate each candidate. The corresponding ROC AUC was about 0.83. Out of all predictions, we had 86 true negatives, 41 false positives, 27 false negatives, and 119 true positives. How good or bad that is depends on the requirements of the person asking.
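To make those counts easier to read, here is a minimal sketch that turns the confusion matrix above into sensitivity and specificity, assuming sklearn's layout (true classes on rows, predictions on columns):
# Minimal sketch: unpack the confusion matrix above into named counts.
# For a binary problem, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm_knn.ravel()
sensitivity = tp / (tp + fn)  # share of truly at-risk patients we caught
specificity = tn / (tn + fp)  # share of healthy patients correctly cleared
print(f"sensitivity: {sensitivity:.3f}, specificity: {specificity:.3f}")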
Q2: Logistic Regression
from sklearn.linear_model import LogisticRegression
log_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000)
)
log_nums = {"logisticregression__C": [0.01,0.1,1,10,100]}
gscv_log = GridSearchCV(log_pipline,log_nums, cv=10,scoring="roc_auc")
gscv_log.fit(X, y)
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('logisticregression',
                                        LogisticRegression(max_iter=1000))]),
             param_grid={'logisticregression__C': [0.01, 0.1, 1, 10, 100]},
             scoring='roc_auc')
gscv_log.best_params_
{'logisticregression__C': 0.01}
gscv_log.best_score_
np.float64(0.8532600732600732)
log_best = gscv_log.best_estimator_
y_pred_log = cross_val_predict(log_best, X, y, cv=10)
cm_log = confusion_matrix(y, y_pred_log)
cm_log
array([[ 84,  43],
       [ 19, 127]])
log_reg = log_best.named_steps["logisticregression"]
coef = log_reg.coef_[0]
coeff_names = X.columns
log_coef_df = pd.DataFrame({
    "feature": coeff_names,
    "coef": coef
})
log_coef_df
| | feature | coef |
|---|---|---|
| 0 | age | -0.136729 |
| 1 | sex | -0.252337 |
| 2 | cp | 0.302322 |
| 3 | trtbps | -0.097634 |
| 4 | chol | -0.054690 |
| 5 | restecg | 0.086628 |
| 6 | thalach | 0.280505 |
The best value of the penalty parameter C is 0.01. Since C is the inverse of the regularization strength, small values apply a stronger penalty and large values a weaker one; I listed a range spanning both extremes and ran grid search over the pipeline to tune it. The corresponding ROC AUC was about 0.85. Overall, we had 84 true negatives, 43 false positives, 19 false negatives, and 127 true positives. How good or bad that is depends on the requirements of the person asking. The coefficient with the most impact on the predicted likelihood of a heart attack appears to be chest pain type (cp), followed by maximum heart rate (thalach).
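To back up that ranking, a small sketch reusing log_coef_df from above; since the features were standardized, comparing absolute coefficient sizes is a reasonable way to order them:
# Sketch: rank the standardized coefficients by absolute size.
log_coef_df.assign(abs_coef=log_coef_df["coef"].abs()).sort_values(
    "abs_coef", ascending=False
)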
Q3: Decision Tree
import numpy as np
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=0)
# min_impurity_decrease could also be tuned here
param_grid_tree = {
    "max_depth": [2, 3, 4, 5, 6, 8, None],
    "min_samples_leaf": [1, 5, 10]
}
gscv_tree = GridSearchCV(tree, param_grid_tree, cv=10, scoring="roc_auc")
gscv_tree.fit(X, y)
gscv_tree.best_params_
{'max_depth': 3, 'min_samples_leaf': 10}
gscv_tree.best_score_
np.float64(0.8238858363858365)
dt_best = gscv_tree.best_estimator_
y_pred_dt = cross_val_predict(dt_best, X, y, cv=10)
cm_tree = confusion_matrix(y, y_pred_dt)
cm_tree
array([[ 99,  28],
       [ 35, 111]])
dt_best = gscv_tree.best_estimator_
tree_coef_df = pd.DataFrame({
    "feature": X.columns,
    "importance": dt_best.feature_importances_
})
tree_coef_df
| | feature | importance |
|---|---|---|
| 0 | age | 0.156569 |
| 1 | sex | 0.102721 |
| 2 | cp | 0.584983 |
| 3 | trtbps | 0.038793 |
| 4 | chol | 0.000000 |
| 5 | restecg | 0.000000 |
| 6 | thalach | 0.116935 |
The best Decision Tree parameters are 'max_depth': 3 and 'min_samples_leaf': 10, which means the data is split through at most three levels of conditions and each leaf must contain at least ten observations. I listed a range of parameter values and tuned them with grid search. The corresponding ROC AUC for the best parameters was about 0.82. Out of all predictions, we had 99 true negatives, 28 false positives, 35 false negatives, and 111 true positives. How good or bad that is depends on the requirements of the person asking. The feature with the most impact on the predicted likelihood of a heart attack is chest pain type, followed by age.
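Because the tree is only three levels deep, its rules can be printed outright; a quick sketch using sklearn's export_text (dt_best and X come from the cells above):
# Sketch: print the fitted tree's split rules to see the max_depth=3 structure.
from sklearn.tree import export_text
print(export_text(dt_best, feature_names=list(X.columns)))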
Q4: Interpretation
Chest pain type was by far the most important predictor in both models, which makes sense because it is the most direct symptom: if your chest is in unusual pain, the odds that you're having a heart attack are much higher than with an indirect signal like being a bit overweight. Maximum heart rate (thalach) and age were the second most important predictors in the Logistic Regression and the Decision Tree, respectively.
Q5: ROC Curve
KNN
import pandas as pd
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict
y_prob_knn_best = cross_val_predict(knn_best,X, y, cv=10, method="predict_proba")
fpr_knn, tpr_knn, _ = roc_curve(y, y_prob_knn_best[:, 1])
df_knn_roc = pd.DataFrame({
    "fpr": fpr_knn,
    "tpr": tpr_knn
})
from plotnine import ggplot, aes, geom_line, geom_abline
p_knn_roc_plot = (
    ggplot(df_knn_roc, aes(x="fpr", y="tpr"))
    + geom_line()
    + geom_abline()
)
p_knn_roc_plot
Logistic Regression
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict
y_prob_log_best = cross_val_predict(log_best,X,y,cv=10,method="predict_proba")
fpr_log, tpr_log, _ = roc_curve(y, y_prob_log_best[:, 1])
df_log_roc = pd.DataFrame({
    "fpr": fpr_log,
    "tpr": tpr_log
})
from plotnine import ggplot, aes, geom_line, geom_abline
p_log_roc_plot = (
    ggplot(df_log_roc, aes(x="fpr", y="tpr"))
    + geom_line()
    + geom_abline()
)
p_log_roc_plot
Decision Tree
y_prob_dt_best = cross_val_predict(dt_best, X, y,cv=10, method="predict_proba")
fpr_dt, tpr_dt, _ = roc_curve(y, y_prob_dt_best[:, 1])
df_dt_roc = pd.DataFrame({
    "fpr": fpr_dt,
    "tpr": tpr_dt
})
p_tree_roc_plot = (
    ggplot(df_dt_roc, aes(x="fpr", y="tpr"))
    + geom_line()
    + geom_abline()
)
p_tree_roc_plot
Part Two: Metrics
# Recall computed on the positive labels is sensitivity (true positive rate).
is_positive = (y == 1)
sens_knn = cross_val_score(knn_best, X, is_positive, cv=10, scoring="recall").mean()
sens_log = cross_val_score(log_best, X, is_positive, cv=10, scoring="recall").mean()
sens_dt = cross_val_score(dt_best, X, is_positive, cv=10, scoring="recall").mean()
sens_knn
np.float64(0.8147619047619047)
sens_log
np.float64(0.8695238095238096)
sens_dt
np.float64(0.76)
prec_knn = cross_val_score(knn_best, X, is_positive, cv=10, scoring="precision").mean()
prec_log = cross_val_score(log_best, X, is_positive, cv=10, scoring="precision").mean()
prec_dt = cross_val_score(dt_best, X, is_positive, cv=10, scoring="precision").mean()
prec_knn
np.float64(0.7446428571428572)
prec_log
np.float64(0.7522940647244052)
prec_dt
np.float64(0.7949490950226245)
# Recall computed on the negative labels is specificity (true negative rate).
is_negative = (y == 0)
spec_knn = cross_val_score(knn_best, X, is_negative, cv=10, scoring="recall").mean()
spec_log = cross_val_score(log_best, X, is_negative, cv=10, scoring="recall").mean()
spec_dt = cross_val_score(dt_best, X, is_negative, cv=10, scoring="recall").mean()
spec_knn
np.float64(0.6775641025641026)
spec_log
np.float64(0.6628205128205129)
spec_dt
np.float64(0.7493589743589744)
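To compare the three models at a glance, a small sketch that collects the cross-validated scores above into one table (metrics_df is a name I'm introducing here):
# Sketch: gather the cross-validated metrics into a single comparison table.
metrics_df = pd.DataFrame({
    "model": ["KNN", "Logistic Regression", "Decision Tree"],
    "sensitivity": [sens_knn, sens_log, sens_dt],
    "precision": [prec_knn, prec_log, prec_dt],
    "specificity": [spec_knn, spec_log, spec_dt],
})
metrics_df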
Part Three: Discussion
Q1
The metrics would be accuracy, precision, recall (sensitivity), and ROC AUC. Accuracy measures overall correctness, precision focuses on avoiding false positives, recall measures how many true cases we catch so we don't miss patients, and ROC AUC shows how well the model separates the classes. Logistic Regression was the best model on these metrics, so that is the model I would recommend, and I would expect scores similar to what I got in Part Two.
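As a sketch of how all four metrics could be computed in one pass, sklearn's cross_validate accepts a list of scorer names (shown for log_best; with the same deterministic 10-fold split these should agree with Part Two's separate runs):
# Sketch: score accuracy, precision, recall, and ROC AUC in one CV pass.
from sklearn.model_selection import cross_validate
scores = cross_validate(log_best, X, y, cv=10,
                        scoring=["accuracy", "precision", "recall", "roc_auc"])
{name: scores[f"test_{name}"].mean()
 for name in ["accuracy", "precision", "recall", "roc_auc"]}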
Q2
Logistic Regression is the best model for the hospital's needs. I recommend it because it had the best recall score, which means the fewest false negatives. Some coefficients I might also pay attention to include cp (chest pain type) and thalach (maximum heart rate), which were the two most impactful explanatory variables in the Logistic Regression. I would expect a score similar to what I got in Part Two.
Q3
ROC AUC would be a good metric here. I would use the Decision Tree because its splits are much easier to explain to non-technical hospital employees, and unlike KNN it still provides interpretable feature importances. I would expect a score similar to what I got in Part Two.
Q4
I would use ROC AUC to compare KNN and Logistic Regression and pick the one with the better curve; on the cross-validated scores, Logistic Regression (about 0.85) edges out KNN (about 0.83). I would expect a score similar to what I got in Part Two.
Part Four: Validation
import pandas as pd
from sklearn.metrics import confusion_matrix, roc_auc_score, precision_score, recall_score
ha_validation = pd.read_csv("https://www.dropbox.com/s/jkwqdiyx6o6oad0/heart_attack_validation.csv?dl=1")ha_validation.head()| age | sex | cp | trtbps | chol | restecg | thalach | output | |
|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 0 | 1 | 130 | 204 | 0 | 172 | 1 |
| 1 | 64 | 1 | 3 | 110 | 211 | 0 | 144 | 1 |
| 2 | 59 | 1 | 0 | 135 | 234 | 1 | 161 | 1 |
| 3 | 42 | 1 | 0 | 140 | 226 | 1 | 178 | 1 |
| 4 | 40 | 1 | 3 | 140 | 199 | 1 | 178 | 1 |
KNN
X_val = ha_validation.drop("output", axis=1)
y_val = ha_validation["output"]y_pred_knn = knn_best.predict(X_val)
y_prob_knn = knn_best.predict_proba(X_val)[:, 1]cm_knn = confusion_matrix(y_val, y_pred_knn)
cm_knnarray([[10, 1],
[ 5, 14]])
roc_auc_knn = roc_auc_score(y_val, y_prob_knn)
roc_auc_knn
np.float64(0.9401913875598086)
precision_knn = precision_score(y_val, y_pred_knn)
precision_knn
0.9333333333333333
recall_knn = recall_score(y_val, y_pred_knn)
recall_knn
0.7368421052631579
Logistic Regression
y_pred_log = log_best.predict(X_val)
y_prob_log = log_best.predict_proba(X_val)[:, 1]
cm_log = confusion_matrix(y_val, y_pred_log)
cm_log
array([[10,  1],
       [ 4, 15]])
roc_auc_log = roc_auc_score(y_val, y_prob_log)
roc_auc_log
np.float64(0.937799043062201)
precision_log = precision_score(y_val, y_pred_log)
precision_log
0.9375
recall_log = recall_score(y_val, y_pred_log)
recall_log
0.7894736842105263
Decision Tree Model
y_pred_dt = dt_best.predict(X_val)
y_prob_dt = dt_best.predict_proba(X_val)[:, 1]
cm_dt = confusion_matrix(y_val, y_pred_dt)
cm_dt
array([[ 9,  2],
       [ 7, 12]])
roc_auc_dt = roc_auc_score(y_val, y_prob_dt)
roc_auc_dt
np.float64(0.8325358851674641)
precision_dt = precision_score(y_val, y_pred_dt)
precision_dt
0.8571428571428571
recall_dt = recall_score(y_val, y_pred_dt)
recall_dt
0.631578947368421
Comparison (cross-validation metrics first, then validation metrics)
prec_knn, sens_knn, prec_log, sens_log, prec_dt, sens_dt
(np.float64(0.7446428571428572),
np.float64(0.8147619047619047),
np.float64(0.7522940647244052),
np.float64(0.8695238095238096),
np.float64(0.7949490950226245),
np.float64(0.76))
precision_knn, recall_knn, precision_log, recall_log, precision_dt, recall_dt
(0.9333333333333333,
0.7368421052631579,
0.9375,
0.7894736842105263,
0.8571428571428571,
0.631578947368421)
In general, the models either did better on the new data or only slightly worse: precision improved for all three on the validation set, while recall dropped. Before deciding on a pipeline, note that the Decision Tree has both the lowest validation precision and the lowest validation recall of the three, which argues against it, and if we went with KNN we would want to confirm that k = 15 still holds up on new data.
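A sketch that lines the two sets of numbers up side by side (comparison_df is a name introduced here), making the precision gain and recall drop easier to see:
# Sketch: cross-validated estimates vs. validation-set results per model.
comparison_df = pd.DataFrame({
    "model": ["KNN", "Logistic Regression", "Decision Tree"],
    "cv_precision": [prec_knn, prec_log, prec_dt],
    "val_precision": [precision_knn, precision_log, precision_dt],
    "cv_recall": [sens_knn, sens_log, sens_dt],
    "val_recall": [recall_knn, recall_log, recall_dt],
})
comparison_df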
Part 5: Cohen’s Kappa
from sklearn.metrics import cohen_kappa_score
kappa_knn = cohen_kappa_score(y_val, y_pred_knn)
kappa_knn
np.float64(0.6)
kappa_log = cohen_kappa_score(y_val, y_pred_log)
kappa_log
np.float64(0.660633484162896)
kappa_dt = cohen_kappa_score(y_val, y_pred_dt)
kappa_dt
np.float64(0.4104803493449781)
I would use Cohen's kappa if there were signs that the model was finding it too easy to predict the data, for example because the classes were imbalanced and a trivial prediction would still score well. Logistic Regression has the highest kappa, and the metrics from the earlier parts also pointed to Logistic Regression being the best model, so our conclusions don't change meaningfully under Cohen's kappa. This suggests the data is reasonably balanced and the models aren't just exploiting chance agreement.
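To show what kappa adjusts for, here is a sketch that recomputes it by hand from the logistic regression validation confusion matrix, using kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (accuracy) and p_e is the agreement expected by chance from the row and column marginals:
# Sketch: Cohen's kappa by hand; should match cohen_kappa_score(y_val, y_pred_log).
cm = cm_log                                           # validation confusion matrix above
n = cm.sum()
p_o = np.trace(cm) / n                                # observed agreement (accuracy)
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement from marginals
(p_o - p_e) / (1 - p_e)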