A Second Look into Heart Disease Prediction

Introduction
In the initial classification model I built for the heart disease dataset, a logistic regression model trained on all the features reached an accuracy score of 84.67%. That looks very high, but in a real-world setting accuracy is the wrong metric for evaluating this model because of the prediction goal. Accuracy alone doesn't tell the full story when working with a class-imbalanced dataset like this one, where there's a significant disparity between the number of positive and negative labels.

The model has high accuracy but performs poorly at predicting whether a person is at risk of coronary heart disease.
The goal of the prediction should be to capture more true positives, because false negatives are too costly: they could potentially cost people their lives by giving them false hope, while the cost of a follow-up test after a false-positive result is far lower than the cost of living in false hope.
Recall (i.e. how well the classifier identifies the actual positive cases) is the right metric to use for this prediction.
Definition of some terms
Four Outcomes of Binary Classification
- True positives: data points labelled as positive that are actually positive
- False positives: data points labelled as positive that are actually negative
- True negatives: data points labelled as negative that are actually negative
- False negatives: data points labelled as negative that are actually positive
Evaluation Metrics of a Classification Algorithm
- Accuracy: the ratio of correctly classified samples to the total number of samples.
- Recall: the ability of a classification model to identify all relevant instances
- Precision: the ability of a classification model to return only relevant instances
- F1 score: a single metric that combines recall and precision using the harmonic mean
- AUC score: the area under the ROC curve; it summarizes how well the model separates the positive and negative classes across all probability thresholds
A high recall means that an algorithm returned most of the relevant results and that is the metric I’ll be using here. You can read more about evaluation metrics and when to use them here and here.
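As a quick illustration (not from the original analysis, just a toy example), all of these metrics can be computed with scikit-learn from a set of true and predicted labels:
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, confusion_matrix)

# made-up labels purely for demonstration
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred))    # (TP + TN) / all predictions
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
# roc_auc_score takes predicted probabilities rather than hard labels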
Another thing I did was compare the performance of five different classification algorithms.
- Logistic Regression
- K Nearest Neighbours
- Decision Tree Classifier
- Random Forest Classifier
- Support Vector Classifier
Data Cleaning
I decided to drop all missing values this time.
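Assuming the dataset is already loaded into a pandas DataFrame called heart (the variable name is my assumption), dropping the missing values is a one-liner:
heart = heart.dropna()             # drop every row that contains at least one missing value
print(heart.isnull().sum().sum())  # 0 -> no missing values remain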
Feature Selection
Feature selection is the process of reducing the number of input variables when developing a predictive model, both to reduce the computational cost of modelling and, in some cases, to improve the performance of the model.
I used SelectKBest from scikit-learn to select the 10 best features in my dataset, and ended up with the following features (a sketch of this selection step follows the list):
Best features: sysBP, glucose, age, totChol, cigsPerDay, diaBP, prevalentHyp, diabetes, BPMeds, male
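Here is a minimal sketch of how that selection step could look, assuming heart is the cleaned DataFrame, TenYearCHD is the label column, and chi2 is used as the scoring function (all of which are my assumptions; f_classif is another common choice):
from sklearn.feature_selection import SelectKBest, chi2

X = heart.drop('TenYearCHD', axis=1)
y = heart['TenYearCHD']

# score every feature against the label and keep the 10 highest-scoring ones
selector = SelectKBest(score_func=chi2, k=10)
selector.fit(X, y)

best_features = X.columns[selector.get_support()]
print(list(best_features))
heart_new = heart[best_features]   # reduced feature set used in the scaling step below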
Feature Scaling
I used MinMaxScaler from scikit-learn to scale the model features, because some of the models I'd be using rely on distance measures between data points for classification.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
heart_scaled = pd.DataFrame(scaler.fit_transform(heart_new), columns=heart_new.columns)
I then split my data into train and test sets with StratifiedShuffleSplit, using a 20% test set.
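A minimal sketch of that split, assuming heart_scaled holds the scaled features and heart['TenYearCHD'] the labels (the variable names are assumptions):
from sklearn.model_selection import StratifiedShuffleSplit

X = heart_scaled.values
y = heart['TenYearCHD'].values

# one stratified split that preserves the class proportions in both sets
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]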
Steps Taken to Optimize the Recall Score
I resampled my data.
The reason for the low recall score is the enormous class imbalance:
heart['TenYearCHD'].value_counts()

0    3099
1     557
Name: TenYearCHD, dtype: int64
To fix this, you can either oversample the minority class or undersample the majority class in the training data. I used the resample method from scikit-learn; either approach gives a balanced training set.
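Here is a sketch of the upsampling case, assuming X_train and y_train come from the split above (downsampling is symmetric: resample the majority class down to the minority count with replace=False):
import pandas as pd
from sklearn.utils import resample

train = pd.DataFrame(X_train, columns=heart_scaled.columns)
train['TenYearCHD'] = y_train

majority = train[train['TenYearCHD'] == 0]
minority = train[train['TenYearCHD'] == 1]

# sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
upsampled = pd.concat([majority, minority_upsampled])

upsampled_X = upsampled.drop('TenYearCHD', axis=1)
upsampled_Y = upsampled['TenYearCHD']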


I used both the upsampled data and downsampled data to train my models and compared their performance.
Model Selection
I used GridSearchCV from scikit-learn for hyperparameter tuning of the classification models and finally used the best-performing parameters for classification.
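The exact parameter grids aren't shown in the post, so the grid below is a made-up example of what tuning the random forest could look like; scoring is set to recall because that is the metric being optimised (also my assumption):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# hypothetical search space; the real grids used in the project may differ
param_grid = {'n_estimators': [100, 200, 500],
              'max_depth': [None, 5, 10]}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring='recall', cv=5)
grid.fit(upsampled_X, upsampled_Y)

print(grid.best_params_)
clf = grid.best_estimator_   # best-performing random forest, reused below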
Working with the Upsampled Data
I trained all the selected models with the upsampled data features and label and then performed prediction with the test set.
lr.fit(upsampled_X, upsampled_Y)    # logistic regression
knn.fit(upsampled_X, upsampled_Y)   # k-nearest neighbours
svc.fit(upsampled_X, upsampled_Y)   # support vector classifier
tree.fit(upsampled_X, upsampled_Y)  # decision tree
clf.fit(upsampled_X, upsampled_Y)   # random forest
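The scores reported below can be reproduced with a small evaluation loop along these lines, assuming the fitted models above and the X_test/y_test split from earlier (names are my assumption):
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

models = {'Logistic Regression': lr, 'KNN': knn, 'Decision Tree': tree,
          'Random Forest': clf, 'SVC': svc}

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(name)
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('Recall:', recall_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))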
Model Performance
Accuracy
Logistic Regression: 0.6857923497267759
KNN: 0.7773224043715847
Decision Tree: 0.7459016393442623
Random Forest: 0.744535519125683
SVC: 0.6967213114754098
Recall
Logistic Regression: 0.6696428571428571
KNN: 0.19642857142857142
Decision Tree: 0.30357142857142855
Random Forest: 0.44642857142857145
SVC: 0.6071428571428571
Confusion Matrix
Logistic Regression: [[427 193]
[ 37 75]]
---------
KNN: [[547 73]
[ 90 22]]
---------
Decision Tree: [[512 108]
[ 78 34]]
---------
Random Forest: [[495 125]
[ 62 50]]
---------
SVC: [[442 178]
[ 44 68]]
Summary:
Logistic regression had the highest recall score, at 66.96%. It correctly predicted 75 of the 112 actual positives.
Working with the Downsampled Data
lr.fit(downsampled_X, downsampled_Y)    # logistic regression
knn.fit(downsampled_X, downsampled_Y)   # k-nearest neighbours
svc.fit(downsampled_X, downsampled_Y)   # support vector classifier
tree.fit(downsampled_X, downsampled_Y)  # decision tree
clf.fit(downsampled_X, downsampled_Y)   # random forest
Model Performance
Accuracy
Logistic Regression: 0.7308743169398907
KNN: 0.7213114754098361
Decision Tree: 0.7418032786885246
Random Forest: 0.6912568306010929
SVC: 0.7418032786885246
Recall
Logistic Regression: 0.6607142857142857
KNN: 0.5803571428571429
Decision Tree: 0.4017857142857143
Random Forest: 0.5178571428571429
SVC: 0.5803571428571429
Confusion Matrix
Logistic Regression: [[461 159]
[ 38 74]]
---------
KNN: [[463 157]
[ 47 65]]
---------
Decision Tree: [[498 122]
[ 67 45]]
---------
Random Forest: [[448 172]
[ 54 58]]
---------
SVC: [[478 142]
[ 47 65]]
Summary:
Logistic regression had the highest recall score, at 66.07%. It correctly predicted 74 of the 112 actual positives; this is just slightly lower than the recall score with the upsampled data.
These recall scores are not good enough but they are much better than the recall scores from the initial classification.
Another way of improving the recall score is by reducing the probability threshold.
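A minimal sketch of how a lower threshold can be applied, assuming lr is the fitted logistic regression model and X_test/y_test the test split (names again my assumption):
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, precision_score

threshold = 0.1
# predict_proba returns [P(class 0), P(class 1)] per row; column 1 is the CHD probability
probs = lr.predict_proba(X_test)[:, 1]
y_pred = np.where(probs >= threshold, 1, 0)

print(confusion_matrix(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))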
With a 0.1 threshold the confusion matrix is
[[  1 619]
 [  1 111]]
with 112 correct predictions and 1 false negative.
Recall: 0.9910714285714286
Precision: 0.15205479452054796
AUC: 0.4963421658986175
Accuracy: 0.15300546448087432
A recall score of 99% is awesome, but the precision score is just 0.15, meaning the vast majority of positive predictions are false alarms. This model would not do anyone any good.
With a 0.3 threshold the confusion matrix is
[[196 424]
 [ 10 102]]
with 298 correct predictions and 10 false negatives.
Recall: 0.9107142857142857
Precision: 0.19391634980988592
AUC: 0.6134216589861751
Accuracy: 0.40710382513661203
The recall score here is lower but still high, and precision, AUC, and accuracy have all improved.
With a 0.5 threshold the confusion matrix is
[[427 193]
 [ 37  75]]
with 502 correct predictions and 37 false negatives.
Recall: 0.6696428571428571
Precision: 0.2798507462686567
AUC: 0.679176267281106
Accuracy: 0.6857923497267759
The recall score falls as the probability threshold increases, while precision, accuracy, and AUC rise.
AUC measures how well the classification model separates the two classes. A score of 0.5 means the model does no better than random guessing, a score of 0.7–0.8 is considered acceptable, 0.8–0.9 is considered excellent, and more than 0.9 is considered outstanding.
This classification problem is complicated by the enormous class imbalance, which means there will always be trade-offs in prediction.
Another way to improve the recall score is by tuning the class_weight hyperparameter available in some classification models. This parameter penalizes mistakes on the minority class more heavily, pushing the model to get those predictions right. You can read more about it here.
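For example, logistic regression and random forests in scikit-learn both accept a class_weight argument; here is a sketch, assuming the original (imbalanced) training data is used:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# 'balanced' weights each class inversely to its frequency, so errors on the
# rare positive class are penalized more heavily during training
lr_weighted = LogisticRegression(class_weight='balanced', max_iter=1000)
rf_weighted = RandomForestClassifier(class_weight='balanced', random_state=42)

lr_weighted.fit(X_train, y_train)
rf_weighted.fit(X_train, y_train)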
The complete project can be found in this repository.