Heart Disease Prediction: A Logistic Regression Implementation with Python's scikit-learn

Temitope Bimbo Babatola
6 min read · Mar 26, 2020


Logistic Regression

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.
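Concretely, the model passes a linear combination of the input features through the logistic (sigmoid) function, which maps any real number to a probability; a minimal sketch:

```python
import math

def sigmoid(z):
    # Logistic function: maps any real z to a probability in (0, 1).
    return 1 / (1 + math.exp(-z))

# P(Y=1 | X) = sigmoid(b0 + b1*x1 + ... + bn*xn),
# where b0..bn are the coefficients the model learns from the data.
```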

Project Introduction

The World Health Organization estimates that 12 million deaths occur worldwide every year due to heart disease. Half of the deaths in the United States and other developed countries are due to cardiovascular disease. Early prognosis of cardiovascular disease can aid decisions on lifestyle changes in high-risk patients and in turn reduce complications. This project intends to pinpoint the most relevant risk factors of heart disease as well as predict the overall risk using logistic regression.

The project's Python notebook can be found in this GitHub repository.

Source of Data

The dataset is publicly available on the Kaggle website, and it is from a cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has a 10-year risk of future coronary heart disease (CHD).

Data

The dataset provides the patients’ information. It includes over 4,000 records and 15 attributes. Each attribute is a potential risk factor. There are demographic, behavioral and medical risk factors.

Demographic:

  • Sex: male or female (Nominal)
  • Age: age of the patient (Continuous — although the recorded ages have been truncated to whole numbers, the concept of age is continuous)

Behavioral

  • Current Smoker: whether or not the patient is a current smoker (Nominal)
  • Cigs Per Day: the average number of cigarettes the person smoked in one day (can be considered continuous, as one can smoke any number of cigarettes, even half a cigarette)

Medical( history)

  • BP Meds: whether or not the patient was on blood pressure medication (Nominal)
  • Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
  • Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
  • Diabetes: whether or not the patient had diabetes (Nominal)

Medical(current)

  • Tot Chol: total cholesterol level (Continuous)
  • Sys BP: systolic blood pressure (Continuous)
  • Dia BP: diastolic blood pressure (Continuous)
  • BMI: Body Mass Index (Continuous)
  • Heart Rate: heart rate (Continuous — in medical research, variables such as heart rate, though in fact discrete, are considered continuous because of the large number of possible values)
  • Glucose: glucose level (Continuous)

Predict variable (desired target)

  • TenYearCHD: 10-year risk of coronary heart disease (CHD) (binary: “1” means “Yes”, “0” means “No”)

Libraries used

  1. Pandas
  2. Matplotlib
  3. Seaborn
  4. Scikit-learn (sklearn):
  • LogisticRegression: Classification model
  • StratifiedShuffleSplit: splits the dataset into train and test sets while preserving class proportions; especially useful for unbalanced data.
  • confusion_matrix: tabulates a classifier's predictions against the actual labels to summarize its performance.
  • accuracy_score: computes the model's accuracy.
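For reference, the corresponding imports look like this (module paths are scikit-learn's standard ones):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn pieces used in this project
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import confusion_matrix, accuracy_score
```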

Data Exploration

First five rows of the dataset
An overview of the data types, number of rows (4,238), and number of columns (16)

Data cleaning: The dataset contains missing values in the education, cigsPerDay, BPMeds, totChol, BMI, heartRate, and glucose columns. Since logistic regression has no reasonable way to deal with missing values, I chose to replace them, using a few different methods:

Missing values in each column
Methods used for replacing the missing values
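As an illustration of the general approach, here is a sketch that fills continuous columns with the median and categorical columns with the mode; the column/method pairings are my assumptions, not necessarily the ones used in the notebook:

```python
import pandas as pd

def fill_missing(df):
    # Work on a copy so the original frame is untouched.
    filled = df.copy()
    # Continuous columns: replace NaN with the column median
    # (robust to the skew typical of medical measurements).
    for col in ['cigsPerDay', 'totChol', 'BMI', 'heartRate', 'glucose']:
        if col in filled:
            filled[col] = filled[col].fillna(filled[col].median())
    # Categorical/binary columns: replace NaN with the most frequent value.
    for col in ['education', 'BPMeds']:
        if col in filled:
            filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled
```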

Visualizations

Correlation coefficient: a statistical measure of the strength of the relationship between the relative movements of two variables. The values range between -1.0 and 1.0. I used a heatmap to visualize the correlation coefficient between the attributes of the dataset.
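A sketch of how such a heatmap can be produced with pandas and seaborn (figure size, color map, and output file name are my own choices):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

def corr_heatmap(df, path='corr_heatmap.png'):
    # Pairwise correlation coefficients between the numeric attributes.
    corr = df.corr()
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
    plt.savefig(path)
    plt.close()
    return corr
```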

Correlation Heatmap
Bar chart showing the distribution of TenYearCHD
Bar chart showing the relationship between gender and TenYearCHD
Bar chart showing the relationship between diabetes and TenYearCHD
Bar chart showing the relationship between age and TenYearCHD

Age is a continuous variable and so cannot be properly visualized on a bar chart, so I grouped it into bins, each bin representing 10 years.
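The binning step can be sketched with pandas' `cut`; the bin edges and sample ages below are illustrative, not the notebook's actual values:

```python
import pandas as pd

# A few illustrative ages (made up for the example).
ages = pd.Series([34, 41, 46, 52, 58, 63, 67])

# Group the continuous ages into 10-year bins so they can be
# shown on a bar chart.
age_bins = pd.cut(ages,
                  bins=[30, 40, 50, 60, 70],
                  labels=['30-40', '40-50', '50-60', '60-70'])
print(age_bins.value_counts().sort_index())
```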

Bar chart showing the relationship between cigsPerDay and TenYearCHD

Observations

  • The dataset is unbalanced: about 85% of the records do not have the risk of heart disease.
  • Men are more susceptible to heart disease than women.
  • Older people are more susceptible to heart disease, so age is a good predictor.
  • Cigarettes smoked per day might be a good predictor of the outcome variable.

Model implementation

I went ahead and built models with different train/test split proportions:

  • 80/20
  • 75/25
  • 70/30

to see how different proportions of data would affect the prediction accuracy.

Model with 80% train set and 20% test set

This function splits the dataset into 80% train data and 20% test data with stratified shuffle split, fits the logistic regression model on the train data, predicts on the test set, and calculates the accuracy score and confusion matrix; it returns the accuracy score and the confusion matrix of the model.
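A sketch of such a function, assuming the features and target are passed in as NumPy arrays (function and variable names are my own, not necessarily those in the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, confusion_matrix

def fit_and_score(X, y, test_size=0.2, seed=0):
    # Stratified shuffle split keeps the class ratio of the unbalanced
    # target the same in the train and test sets.
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size,
                                      random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y))
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Fit on the train split, predict on the held-out test split.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred), confusion_matrix(y_test, y_pred)
```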

I wrote two other functions that perform the 75/25 and 70/30 splits in the same way.

Predictions

The difference in the accuracy of the model across the different train/test splits is almost negligible, with only a ~0.2% difference.

I dropped the education feature and built another model to see if education has any effect on the model performance.

heart_fill_ed = heart_fill.drop(['education'], axis = 1)

The removal of the education column doesn’t have an effect on the model performance.

I did a little more model tweaking by dropping the total cholesterol, BMI, and glucose columns one at a time, and it didn't result in any significant change in model performance.

The Python notebook can be found in this GitHub repository.

Conclusion

  1. The model predicted with 84.9% accuracy. The model is more specific than sensitive: examining the confusion matrix shows that it did not do a very good job of predicting the positives, so there were a lot of false negative results. Overall, the model could be improved with more data.
  2. Education, total cholesterol, BMI, and glucose show no significant change in the odds of CHD; each causes a negligible change in odds of about 0.2%–0.5%.
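Sensitivity and specificity can be read directly off the confusion matrix; a sketch using an illustrative, made-up matrix rather than the article's actual numbers:

```python
import numpy as np

# Illustrative confusion matrix (made-up numbers, not the article's).
# scikit-learn's ordering is [[TN, FP], [FN, TP]]:
# rows = actual class, columns = predicted class.
cm = np.array([[700, 10],
               [115, 23]])
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)  # how well the positives are caught
specificity = tn / (tn + fp)  # how well the negatives are caught
```

With many more false negatives (fn) than true positives (tp), sensitivity is low while specificity stays high, which matches the pattern described above.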

Future

I will go on to use other classification models for this prediction task and see how well they do compared with logistic regression.

Acknowledgement

This is my final project for the SheCodeAfrica mentoring program (Cohort 1). I started the program as a mentee in January 2020, and over the past three months I have learned so much about data science from my mentor, Becky Mashaido.
