A Second Look into Heart Disease Prediction

Introduction
In the initial classification model I built for the heart disease dataset, a logistic regression model trained on all the features reached an accuracy score of 84.67%. That looks very high, but in a real-world setting accuracy is the wrong metric for evaluating this model because of the prediction goal. Accuracy alone doesn't tell the full story when working with a class-imbalanced dataset like this one, where there's a significant disparity between the number of positive and negative labels.

The model has high accuracy but performs poorly at predicting whether a person is at risk of coronary heart disease.
The goal of the prediction should be to capture more true positives, because false negatives are too costly: they could potentially cost people their lives by giving them false hope, while the cost of a follow-up test after a false-positive result is far lower than the cost of living in false hope.
Recall (i.e. how well the classifier identifies the actual positive cases) is the right metric to use for this prediction.
Definition of some terms
Four Outcomes of Binary Classification
- True positives: data points labelled as positive that are actually positive
- False positives: data points labelled as positive that are actually negative
- True negatives: data points labelled as negative that are actually negative
- False negatives: data points labelled as negative that are actually positive
Evaluation Metrics of a Classification Algorithm
- Accuracy: the ratio of correctly classified samples to the total number of samples.
- Recall: the ability of a classification model to identify all relevant instances
- Precision: the ability of a classification model to return only relevant instances
- F1 score: a single metric that combines recall and precision using the harmonic mean
- AUC score: the area under the ROC curve; it summarizes how well the model separates the positive and negative classes across all probability thresholds
A high recall means that an algorithm returned most of the relevant results and that is the metric I’ll be using here. You can read more about evaluation metrics and when to use them here and here.
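As a quick illustration (not from the original analysis, just a toy example), all of these metrics can be computed with scikit-learn from a set of true and predicted labels:
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, confusion_matrix)

# made-up labels purely for demonstration
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred))    # (TP + TN) / all predictions
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
# roc_auc_score takes predicted probabilities rather than hard labels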
Another thing I did was compare the performance of five different classification algorithms.
- Logistic Regression
- K Nearest Neighbours
- Decision Tree Classifier
- Random Forest Classifier
- Support Vector Classifier
Data Cleaning
I decided to drop all missing values this time.
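Assuming the dataset is already loaded into a pandas DataFrame called heart (the variable name is my assumption), dropping the missing values is a one-liner:
heart = heart.dropna()             # drop every row that contains at least one missing value
print(heart.isnull().sum().sum())  # 0 -> no missing values remain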
Feature Selection
Feature selection is the process of reducing the number of input variables when developing a predictive model, both to reduce the computational cost of modelling and, in some cases, to improve the performance of the model.
I used SelectKBest from scikit-learn to select the 10 best features in my dataset, and ended up with the following features (a sketch of this selection step follows the list):
Best features: sysBP, glucose, age, totChol, cigsPerDay, diaBP, prevalentHyp, diabetes, BPMeds, male
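Here is a minimal sketch of how that selection step could look, assuming heart is the cleaned DataFrame, TenYearCHD is the label column, and chi2 is used as the scoring function (all of which are my assumptions; f_classif is another common choice):
from sklearn.feature_selection import SelectKBest, chi2

X = heart.drop('TenYearCHD', axis=1)
y = heart['TenYearCHD']

# score every feature against the label and keep the 10 highest-scoring ones
selector = SelectKBest(score_func=chi2, k=10)
selector.fit(X, y)

best_features = X.columns[selector.get_support()]
print(list(best_features))
heart_new = heart[best_features]   # reduced feature set used in the scaling step below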
Feature Scaling
I used MinMaxScaler from scikit-learn to scale the model features, because some of the models I'd be using rely on distance measures between data points for classification.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
heart_scaled = pd.DataFrame(scaler.fit_transform(heart_new), columns=heart_new.columns)
I then split my data into train and test sets with StratifiedShuffleSplit, using a 20% test set.
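A minimal sketch of that split, assuming heart_scaled holds the scaled features and heart['TenYearCHD'] the labels (the variable names are assumptions):
from sklearn.model_selection import StratifiedShuffleSplit

X = heart_scaled.values
y = heart['TenYearCHD'].values

# one stratified split that preserves the class proportions in both sets
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]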
Steps Taken to Optimize the Recall Score
I resampled my data.
The reason for the low recall score is the enormous class imbalance:
heart['TenYearCHD'].value_counts()

0    3099
1     557
Name: TenYearCHD, dtype: int64
To fix this, you can either oversample the minority class or undersample the majority class in the training data. I used the resample method from scikit-learn; either approach gives a balanced training set.
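Here is a sketch of the upsampling case, assuming X_train and y_train come from the split above (downsampling is symmetric: resample the majority class down to the minority count with replace=False):
import pandas as pd
from sklearn.utils import resample

train = pd.DataFrame(X_train, columns=heart_scaled.columns)
train['TenYearCHD'] = y_train

majority = train[train['TenYearCHD'] == 0]
minority = train[train['TenYearCHD'] == 1]

# sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
upsampled = pd.concat([majority, minority_upsampled])

upsampled_X = upsampled.drop('TenYearCHD', axis=1)
upsampled_Y = upsampled['TenYearCHD']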


I used both the upsampled data and downsampled data to train my models and compared their performance.
Model Selection
I used GridSearchCV from scikit-learn for hyperparameter tuning of the classification models and finally used the best-performing parameters for classification.
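The exact parameter grids aren't shown in the post, so the grid below is a made-up example of what tuning the random forest could look like; scoring is set to recall because that is the metric being optimised (also my assumption):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# hypothetical search space; the real grids used in the project may differ
param_grid = {'n_estimators': [100, 200, 500],
              'max_depth': [None, 5, 10]}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring='recall', cv=5)
grid.fit(upsampled_X, upsampled_Y)

print(grid.best_params_)
clf = grid.best_estimator_   # best-performing random forest, reused below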
Working with the Upsampled Data
I trained all the selected models with the upsampled data features and label and then performed prediction with the test set.
lr.fit(upsampled_X, upsampled_Y)    # logistic regression
knn.fit(upsampled_X, upsampled_Y)   # k-nearest neighbours
svc.fit(upsampled_X, upsampled_Y)   # support vector classifier
tree.fit(upsampled_X, upsampled_Y)  # decision tree
clf.fit(upsampled_X, upsampled_Y)   # random forest
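The scores reported below can be reproduced with a small evaluation loop along these lines, assuming the fitted models above and the X_test/y_test split from earlier (names are my assumption):
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

models = {'Logistic Regression': lr, 'KNN': knn, 'Decision Tree': tree,
          'Random Forest': clf, 'SVC': svc}

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(name)
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('Recall:', recall_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))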
Model Performance
Accuracy
Logistic Regression: 0.6857923497267759
KNN: 0.7773224043715847
Decision Tree: 0.7459016393442623
Random Forest: 0.744535519125683
SVC: 0.6967213114754098
Recall
Logistic Regression: 0.6696428571428571
KNN: 0.19642857142857142
Decision Tree: 0.30357142857142855
Random Forest: 0.44642857142857145
SVC: 0.6071428571428571
Confusion Matrix
Logistic Regression: [[427 193]
[ 37 75]]
---------
KNN: [[547 73]
[ 90 22]]
---------
Decision Tree: [[512 108]
[ 78 34]]
---------
Random Forest: [[495 125]
[ 62 50]]
---------
SVC: [[442 178]
[ 44 68]]
Summary:
Logistic regression had the highest recall score, at 66.96%. It correctly predicted 75 of the 112 actual positives.
Working with the Downsampled Data
lr.fit(downsampled_X, downsampled_Y)    # logistic regression
knn.fit(downsampled_X, downsampled_Y)   # k-nearest neighbours
svc.fit(downsampled_X, downsampled_Y)   # support vector classifier
tree.fit(downsampled_X, downsampled_Y)  # decision tree
clf.fit(downsampled_X, downsampled_Y)   # random forest
Model Performance
Accuracy
Logistic Regression: 0.7308743169398907
KNN: 0.7213114754098361
Decision Tree: 0.7418032786885246
Random Forest: 0.6912568306010929
SVC: 0.7418032786885246
Recall
Logistic Regression: 0.6607142857142857
KNN: 0.5803571428571429
Decision Tree: 0.4017857142857143
Random Forest: 0.5178571428571429
SVC: 0.5803571428571429
Confusion Matrix
Logistic Regression: [[461 159]
[ 38 74]]
---------
KNN: [[463 157]
[ 47 65]]
---------
Decision Tree: [[498 122]
[ 67 45]]
---------
Random Forest: [[448 172]
[ 54 58]]
---------
SVC: [[478 142]
[ 47 65]]
Summary:
Logistic regression had the highest recall score, at 66.07%. It correctly predicted 74 of the 112 actual positives; this is just slightly lower than the recall score with the upsampled data.
These recall scores are not good enough but they are much better than the recall scores from the initial classification.
Another way of improving the recall score is by reducing the probability threshold.
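A minimal sketch of how a lower threshold can be applied, assuming lr is the fitted logistic regression model and X_test/y_test the test split (names again my assumption):
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, precision_score

threshold = 0.1
# predict_proba returns [P(class 0), P(class 1)] per row; column 1 is the CHD probability
probs = lr.predict_proba(X_test)[:, 1]
y_pred = np.where(probs >= threshold, 1, 0)

print(confusion_matrix(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))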
With a 0.1 threshold the confusion matrix is
[[  1 619]
 [  1 111]]
with 112 correct predictions and 1 false negative.
Recall: 0.9910714285714286
Precision: 0.15205479452054796
AUC: 0.4963421658986175
Accuracy: 0.15300546448087432
A recall score of 99% is awesome, but the precision score is just 0.15, meaning the vast majority of positive predictions are false alarms. This model would not do anyone any good.
With a 0.3 threshold the confusion matrix is
[[196 424]
 [ 10 102]]
with 298 correct predictions and 10 false negatives.
Recall: 0.9107142857142857
Precision: 0.19391634980988592
AUC: 0.6134216589861751
Accuracy: 0.40710382513661203
The recall score here is lower but still high, and precision, AUC, and accuracy have all improved.
With a 0.5 threshold the confusion matrix is
[[427 193]
 [ 37  75]]
with 502 correct predictions and 37 false negatives.
Recall: 0.6696428571428571
Precision: 0.2798507462686567
AUC: 0.679176267281106
Accuracy: 0.6857923497267759
The recall score falls as the probability threshold increases, while precision, accuracy, and AUC rise.
AUC measures how well the classification model separates the two classes. A score of 0.5 means the model does no better than random guessing, a score of 0.7–0.8 is considered acceptable, 0.8–0.9 is considered excellent, and more than 0.9 is considered outstanding.
This classification problem is complicated by the enormous class imbalance, which means there will always be trade-offs in prediction.
Another way to improve the recall score is by tuning the class_weight hyperparameter available in some classification models. This parameter penalizes mistakes on the minority class more heavily, pushing the model to get those predictions right. You can read more about it here.
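For example, logistic regression and random forests in scikit-learn both accept a class_weight argument; here is a sketch, assuming the original (imbalanced) training data is used:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# 'balanced' weights each class inversely to its frequency, so errors on the
# rare positive class are penalized more heavily during training
lr_weighted = LogisticRegression(class_weight='balanced', max_iter=1000)
rf_weighted = RandomForestClassifier(class_weight='balanced', random_state=42)

lr_weighted.fit(X_train, y_train)
rf_weighted.fit(X_train, y_train)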
The complete project can be found in this repository.