import pandas as pd
ha = pd.read_csv("C:/Users/spink/OneDrive/Desktop/Machine Learning/Data/heart_attack.csv")Heart Attack Risk Assessment
In this project I built models to predict heart attack risk using patient health data. I compared KNN, Logistic Regression, and Decision Tree models using cross validation. Logistic Regression performed the best, especially for sensitivity, meaning it was the best at not missing patients who are actually at risk. This matters a lot in healthcare because missing a heart attack case is a big deal. After validating on new data, Logistic Regression was still the top performer. The most important predictors ended up being chest pain type and max heart rate.
ha.head()
| | age | sex | cp | trtbps | chol | restecg | thalach | output |
|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 0 | 150 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 1 | 187 | 1 |
| 2 | 56 | 1 | 1 | 120 | 236 | 1 | 178 | 1 |
| 3 | 57 | 0 | 0 | 120 | 354 | 1 | 163 | 1 |
| 4 | 57 | 1 | 0 | 140 | 192 | 1 | 148 | 1 |
ha.isna().sum()
age 0
sex 0
cp 0
trtbps 0
chol 0
restecg 0
thalach 0
output 0
dtype: int64
ha.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 273 non-null int64
1 sex 273 non-null int64
2 cp 273 non-null int64
3 trtbps 273 non-null int64
4 chol 273 non-null int64
5 restecg 273 non-null int64
6 thalach 273 non-null int64
7 output 273 non-null int64
dtypes: int64(8)
memory usage: 17.2 KB
ha.describe()
| | age | sex | cp | trtbps | chol | restecg | thalach | output |
|---|---|---|---|---|---|---|---|---|
| count | 273.000000 | 273.000000 | 273.000000 | 273.000000 | 273.000000 | 273.000000 | 273.000000 | 273.000000 |
| mean | 54.347985 | 0.673993 | 0.974359 | 132.098901 | 246.860806 | 0.538462 | 149.446886 | 0.534799 |
| std | 9.163134 | 0.469611 | 1.030456 | 17.700358 | 52.569726 | 0.528059 | 23.240707 | 0.499704 |
| min | 29.000000 | 0.000000 | 0.000000 | 94.000000 | 126.000000 | 0.000000 | 71.000000 | 0.000000 |
| 25% | 47.000000 | 0.000000 | 0.000000 | 120.000000 | 211.000000 | 0.000000 | 133.000000 | 0.000000 |
| 50% | 56.000000 | 1.000000 | 1.000000 | 130.000000 | 240.000000 | 1.000000 | 152.000000 | 1.000000 |
| 75% | 61.000000 | 1.000000 | 2.000000 | 140.000000 | 275.000000 | 1.000000 | 166.000000 | 1.000000 |
| max | 77.000000 | 1.000000 | 3.000000 | 200.000000 | 564.000000 | 2.000000 | 202.000000 | 1.000000 |
Part One: Fitting Models
Q1: KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X = ha.drop('output', axis=1)
y = ha['output']
knn_pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier()
)
k_nums = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11, 13, 15]}
gscv_knn = GridSearchCV(knn_pipeline, k_nums, cv=10, scoring="roc_auc")
gscv_knn.fit(X, y)
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('kneighborsclassifier',
                                        KNeighborsClassifier())]),
             param_grid={'kneighborsclassifier__n_neighbors': [3, 5, 7, 9, 11,
                                                               13, 15]},
             scoring='roc_auc')
gscv_knn.best_params_
{'kneighborsclassifier__n_neighbors': 15}
gscv_knn.best_score_
np.float64(0.8289224664224664)
knn_pipeline_15 = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=15)
)
# Store the fold scores in their own variable so the pipeline isn't overwritten.
knn_acc_scores = cross_val_score(knn_pipeline_15, X, y, cv=5, scoring="accuracy")
np.mean(knn_acc_scores)
np.float64(0.7581818181818182)
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
knn_best = gscv_knn.best_estimator_
y_pred_knn = cross_val_predict(knn_best, X, y, cv=10)
cm_knn = confusion_matrix(y, y_pred_knn)
cm_knn
array([[ 86,  41],
       [ 27, 119]])
The best k value was 15. I listed a range of k values that balances the risk of overfitting at small values against underfitting at large ones, then used grid search to cross-validate each candidate. The corresponding ROC AUC was about 0.83. Out of all predictions, we had 86 true negatives, 41 false positives, 27 false negatives, and 119 true positives. How good or bad that is depends on the requirements of the person asking.
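To make those counts easier to read, here is a minimal sketch that turns the confusion matrix above into sensitivity and specificity, assuming sklearn's layout (true classes on rows, predictions on columns):
# Minimal sketch: unpack the confusion matrix above into named counts.
# For a binary problem, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm_knn.ravel()
sensitivity = tp / (tp + fn)  # share of truly at-risk patients we caught
specificity = tn / (tn + fp)  # share of healthy patients correctly cleared
print(f"sensitivity: {sensitivity:.3f}, specificity: {specificity:.3f}")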
Q2: Logistic Regression
from sklearn.linear_model import LogisticRegression
log_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000)
)
log_nums = {"logisticregression__C": [0.01,0.1,1,10,100]}
gscv_log = GridSearchCV(log_pipline,log_nums, cv=10,scoring="roc_auc")
gscv_log.fit(X, y)
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('logisticregression',
                                        LogisticRegression(max_iter=1000))]),
             param_grid={'logisticregression__C': [0.01, 0.1, 1, 10, 100]},
             scoring='roc_auc')
gscv_log.best_params_
{'logisticregression__C': 0.01}
gscv_log.best_score_
np.float64(0.8532600732600732)
log_best = gscv_log.best_estimator_
y_pred_log = cross_val_predict(log_best, X, y, cv=10)
cm_log = confusion_matrix(y, y_pred_log)
cm_log
array([[ 84,  43],
       [ 19, 127]])
log_reg = log_best.named_steps["logisticregression"]
coef = log_reg.coef_[0]
coeff_names = X.columns
log_coef_df = pd.DataFrame({
    "feature": coeff_names,
    "coef": coef
})
log_coef_df
| | feature | coef |
|---|---|---|
| 0 | age | -0.136729 |
| 1 | sex | -0.252337 |
| 2 | cp | 0.302322 |
| 3 | trtbps | -0.097634 |
| 4 | chol | -0.054690 |
| 5 | restecg | 0.086628 |
| 6 | thalach | 0.280505 |
The best value of the penalty parameter C is 0.01. Since C is the inverse of the regularization strength, small values apply a stronger penalty and large values a weaker one; I listed a range spanning both extremes and ran grid search over the pipeline to tune it. The corresponding ROC AUC was about 0.85. Overall, we had 84 true negatives, 43 false positives, 19 false negatives, and 127 true positives. How good or bad that is depends on the requirements of the person asking. The coefficient with the most impact on the predicted likelihood of a heart attack appears to be chest pain type (cp), followed by maximum heart rate (thalach).
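To back up that ranking, a small sketch reusing log_coef_df from above; since the features were standardized, comparing absolute coefficient sizes is a reasonable way to order them:
# Sketch: rank the standardized coefficients by absolute size.
log_coef_df.assign(abs_coef=log_coef_df["coef"].abs()).sort_values(
    "abs_coef", ascending=False
)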
Q3: Decision Tree
import numpy as np
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=0)
# min_impurity_decrease could also be tuned here
param_grid_tree = {
    "max_depth": [2, 3, 4, 5, 6, 8, None],
    "min_samples_leaf": [1, 5, 10]
}
gscv_tree = GridSearchCV(tree, param_grid_tree, cv=10, scoring="roc_auc")
gscv_tree.fit(X, y)
gscv_tree.best_params_
{'max_depth': 3, 'min_samples_leaf': 10}
gscv_tree.best_score_
np.float64(0.8238858363858365)
dt_best = gscv_tree.best_estimator_
y_pred_dt = cross_val_predict(dt_best, X, y, cv=10)
cm_tree = confusion_matrix(y, y_pred_dt)
cm_tree
array([[ 99,  28],
       [ 35, 111]])
dt_best = gscv_tree.best_estimator_
tree_coef_df = pd.DataFrame({
    "feature": X.columns,
    "importance": dt_best.feature_importances_
})
tree_coef_df
| | feature | importance |
|---|---|---|
| 0 | age | 0.156569 |
| 1 | sex | 0.102721 |
| 2 | cp | 0.584983 |
| 3 | trtbps | 0.038793 |
| 4 | chol | 0.000000 |
| 5 | restecg | 0.000000 |
| 6 | thalach | 0.116935 |
The best Decision Tree parameters are 'max_depth': 3 and 'min_samples_leaf': 10, which means the data is split through at most three levels of conditions and each leaf must contain at least ten observations. I listed a range of parameter values and tuned them with grid search. The corresponding ROC AUC for the best parameters was about 0.82. Out of all predictions, we had 99 true negatives, 28 false positives, 35 false negatives, and 111 true positives. How good or bad that is depends on the requirements of the person asking. The feature with the most impact on the predicted likelihood of a heart attack is chest pain type, followed by age.
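Because the tree is only three levels deep, its rules can be printed outright; a quick sketch using sklearn's export_text (dt_best and X come from the cells above):
# Sketch: print the fitted tree's split rules to see the max_depth=3 structure.
from sklearn.tree import export_text
print(export_text(dt_best, feature_names=list(X.columns)))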
Q4: Interpretation
Chest pain type was by far the most important predictor in both models, which makes sense because it is the most direct symptom: if your chest is in unusual pain, the odds that you're having a heart attack are much higher than with an indirect signal like being a bit overweight. Maximum heart rate (thalach) and age were the second most important predictors in the Logistic Regression and the Decision Tree, respectively.
Q5: ROC Curve
KNN
import pandas as pd
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict
y_prob_knn_best = cross_val_predict(knn_best,X, y, cv=10, method="predict_proba")
fpr_knn, tpr_knn, _ = roc_curve(y, y_prob_knn_best[:, 1])
df_knn_roc = pd.DataFrame({
    "fpr": fpr_knn,
    "tpr": tpr_knn
})
from plotnine import ggplot, aes, geom_line, geom_abline
p_knn_roc_plot = (
    ggplot(df_knn_roc, aes(x="fpr", y="tpr"))
    + geom_line()
    + geom_abline()
)
p_knn_roc_plot
Logistic Regression
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict
y_prob_log_best = cross_val_predict(log_best,X,y,cv=10,method="predict_proba")
fpr_log, tpr_log, _ = roc_curve(y, y_prob_log_best[:, 1])
df_log_roc = pd.DataFrame({
    "fpr": fpr_log,
    "tpr": tpr_log
})
from plotnine import ggplot, aes, geom_line, geom_abline
p_log_roc_plot = (
    ggplot(df_log_roc, aes(x="fpr", y="tpr"))
    + geom_line()
    + geom_abline()
)
p_log_roc_plot
Decision Tree
y_prob_dt_best = cross_val_predict(dt_best, X, y,cv=10, method="predict_proba")
fpr_dt, tpr_dt, _ = roc_curve(y, y_prob_dt_best[:, 1])
df_dt_roc = pd.DataFrame({
    "fpr": fpr_dt,
    "tpr": tpr_dt
})
p_tree_roc_plot = (
    ggplot(df_dt_roc, aes(x="fpr", y="tpr"))
    + geom_line()
    + geom_abline()
)
p_tree_roc_plot
Part Two: Metrics
# Recall computed on the positive labels is sensitivity (true positive rate).
is_positive = (y == 1)
sens_knn = cross_val_score(knn_best, X, is_positive, cv=10, scoring="recall").mean()
sens_log = cross_val_score(log_best, X, is_positive, cv=10, scoring="recall").mean()
sens_dt = cross_val_score(dt_best, X, is_positive, cv=10, scoring="recall").mean()
sens_knn
np.float64(0.8147619047619047)
sens_log
np.float64(0.8695238095238096)
sens_dt
np.float64(0.76)
prec_knn = cross_val_score(knn_best, X, is_positive, cv=10, scoring="precision").mean()
prec_log = cross_val_score(log_best, X, is_positive, cv=10, scoring="precision").mean()
prec_dt = cross_val_score(dt_best, X, is_positive, cv=10, scoring="precision").mean()
prec_knn
np.float64(0.7446428571428572)
prec_log
np.float64(0.7522940647244052)
prec_dt
np.float64(0.7949490950226245)
# Recall computed on the negative labels is specificity (true negative rate).
is_negative = (y == 0)
spec_knn = cross_val_score(knn_best, X, is_negative, cv=10, scoring="recall").mean()
spec_log = cross_val_score(log_best, X, is_negative, cv=10, scoring="recall").mean()
spec_dt = cross_val_score(dt_best, X, is_negative, cv=10, scoring="recall").mean()
spec_knn
np.float64(0.6775641025641026)
spec_log
np.float64(0.6628205128205129)
spec_dt
np.float64(0.7493589743589744)
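To compare the three models at a glance, a small sketch that collects the cross-validated scores above into one table (metrics_df is a name I'm introducing here):
# Sketch: gather the cross-validated metrics into a single comparison table.
metrics_df = pd.DataFrame({
    "model": ["KNN", "Logistic Regression", "Decision Tree"],
    "sensitivity": [sens_knn, sens_log, sens_dt],
    "precision": [prec_knn, prec_log, prec_dt],
    "specificity": [spec_knn, spec_log, spec_dt],
})
metrics_df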
Part Three: Discussion
Q1
The metrics would be accuracy, precision, recall (sensitivity), and ROC AUC. Accuracy measures overall correctness, precision focuses on avoiding false positives, recall measures how many true cases we catch so we don't miss patients, and ROC AUC shows how well the model separates the classes. Logistic Regression was the best model on these metrics, so that is the model I would recommend, and I would expect scores similar to what I got in Part Two.
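As a sketch of how all four metrics could be computed in one pass, sklearn's cross_validate accepts a list of scorer names (shown for log_best; with the same deterministic 10-fold split these should agree with Part Two's separate runs):
# Sketch: score accuracy, precision, recall, and ROC AUC in one CV pass.
from sklearn.model_selection import cross_validate
scores = cross_validate(log_best, X, y, cv=10,
                        scoring=["accuracy", "precision", "recall", "roc_auc"])
{name: scores[f"test_{name}"].mean()
 for name in ["accuracy", "precision", "recall", "roc_auc"]}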
Q2
Logistic Regression is the best model for the hospital's needs. I recommend it because it had the best recall score, which means the fewest false negatives. Some coefficients I might also pay attention to include cp (chest pain type) and thalach (maximum heart rate), which were the two most impactful explanatory variables in the Logistic Regression. I would expect a score similar to what I got in Part Two.
Q3
ROC AUC would be a good metric here. I would use the Decision Tree because its splits are much easier to explain to non-technical hospital employees, and unlike KNN it still provides interpretable feature importances. I would expect a score similar to what I got in Part Two.
Q4
I would use ROC AUC to compare KNN and Logistic Regression and pick the one with the better curve; on the cross-validated scores, Logistic Regression (about 0.85) edges out KNN (about 0.83). I would expect a score similar to what I got in Part Two.
Part Four: Validation
import pandas as pd
from sklearn.metrics import confusion_matrix, roc_auc_score, precision_score, recall_score
ha_validation = pd.read_csv("https://www.dropbox.com/s/jkwqdiyx6o6oad0/heart_attack_validation.csv?dl=1")ha_validation.head()| age | sex | cp | trtbps | chol | restecg | thalach | output | |
|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 0 | 1 | 130 | 204 | 0 | 172 | 1 |
| 1 | 64 | 1 | 3 | 110 | 211 | 0 | 144 | 1 |
| 2 | 59 | 1 | 0 | 135 | 234 | 1 | 161 | 1 |
| 3 | 42 | 1 | 0 | 140 | 226 | 1 | 178 | 1 |
| 4 | 40 | 1 | 3 | 140 | 199 | 1 | 178 | 1 |
KNN
X_val = ha_validation.drop("output", axis=1)
y_val = ha_validation["output"]y_pred_knn = knn_best.predict(X_val)
y_prob_knn = knn_best.predict_proba(X_val)[:, 1]cm_knn = confusion_matrix(y_val, y_pred_knn)
cm_knnarray([[10, 1],
[ 5, 14]])
roc_auc_knn = roc_auc_score(y_val, y_prob_knn)
roc_auc_knn
np.float64(0.9401913875598086)
precision_knn = precision_score(y_val, y_pred_knn)
precision_knn
0.9333333333333333
recall_knn = recall_score(y_val, y_pred_knn)
recall_knn
0.7368421052631579
Logistic Regression
y_pred_log = log_best.predict(X_val)
y_prob_log = log_best.predict_proba(X_val)[:, 1]
cm_log = confusion_matrix(y_val, y_pred_log)
cm_log
array([[10,  1],
       [ 4, 15]])
roc_auc_log = roc_auc_score(y_val, y_prob_log)
roc_auc_log
np.float64(0.937799043062201)
precision_log = precision_score(y_val, y_pred_log)
precision_log
0.9375
recall_log = recall_score(y_val, y_pred_log)
recall_log
0.7894736842105263
Decision Tree Model
y_pred_dt = dt_best.predict(X_val)
y_prob_dt = dt_best.predict_proba(X_val)[:, 1]
cm_dt = confusion_matrix(y_val, y_pred_dt)
cm_dt
array([[ 9,  2],
       [ 7, 12]])
roc_auc_dt = roc_auc_score(y_val, y_prob_dt)
roc_auc_dt
np.float64(0.8325358851674641)
precision_dt = precision_score(y_val, y_pred_dt)
precision_dt
0.8571428571428571
recall_dt = recall_score(y_val, y_pred_dt)
recall_dt
0.631578947368421
Comparison (cross-validation metrics first, then validation metrics)
prec_knn, sens_knn, prec_log, sens_log, prec_dt, sens_dt
(np.float64(0.7446428571428572),
np.float64(0.8147619047619047),
np.float64(0.7522940647244052),
np.float64(0.8695238095238096),
np.float64(0.7949490950226245),
np.float64(0.76))
precision_knn, recall_knn, precision_log, recall_log, precision_dt, recall_dt
(0.9333333333333333,
0.7368421052631579,
0.9375,
0.7894736842105263,
0.8571428571428571,
0.631578947368421)
In general, the models either did better on the new data or only slightly worse: precision improved for all three on the validation set, while recall dropped. Before deciding on a pipeline, note that the Decision Tree has both the lowest validation precision and the lowest validation recall of the three, which argues against it, and if we went with KNN we would want to confirm that k = 15 still holds up on new data.
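A sketch that lines the two sets of numbers up side by side (comparison_df is a name introduced here), making the precision gain and recall drop easier to see:
# Sketch: cross-validated estimates vs. validation-set results per model.
comparison_df = pd.DataFrame({
    "model": ["KNN", "Logistic Regression", "Decision Tree"],
    "cv_precision": [prec_knn, prec_log, prec_dt],
    "val_precision": [precision_knn, precision_log, precision_dt],
    "cv_recall": [sens_knn, sens_log, sens_dt],
    "val_recall": [recall_knn, recall_log, recall_dt],
})
comparison_df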
Part 5: Cohen’s Kappa
from sklearn.metrics import cohen_kappa_score
kappa_knn = cohen_kappa_score(y_val, y_pred_knn)
kappa_knn
np.float64(0.6)
kappa_log = cohen_kappa_score(y_val, y_pred_log)
kappa_log
np.float64(0.660633484162896)
kappa_dt = cohen_kappa_score(y_val, y_pred_dt)
kappa_dt
np.float64(0.4104803493449781)
I would use Cohen's kappa if there were signs that the model was finding it too easy to predict the data, for example because the classes were imbalanced and a trivial prediction would still score well. Logistic Regression has the highest kappa, and the metrics from the earlier parts also pointed to Logistic Regression being the best model, so our conclusions don't change meaningfully under Cohen's kappa. This suggests the data is reasonably balanced and the models aren't just exploiting chance agreement.
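To show what kappa adjusts for, here is a sketch that recomputes it by hand from the logistic regression validation confusion matrix, using kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (accuracy) and p_e is the agreement expected by chance from the row and column marginals:
# Sketch: Cohen's kappa by hand; should match cohen_kappa_score(y_val, y_pred_log).
cm = cm_log                                           # validation confusion matrix above
n = cm.sum()
p_o = np.trace(cm) / n                                # observed agreement (accuracy)
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement from marginals
(p_o - p_e) / (1 - p_e)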