The goal of this lab is to train a model for the diagnosis of coronary artery disease.
The dataset is provided by the Cleveland Clinic Foundation for Heart Disease (more information). The dataset file to use is available here. Each row describes a patient. Below is a description of each column.
Column | Description | Feature Type | Data Type |
---|---|---|---|
Age | Age in years | Numerical | integer |
Sex | (1 = male; 0 = female) | Categorical | integer |
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical | integer |
Trestbpd | Resting blood pressure (in mm Hg on admission to the hospital) | Numerical | integer |
Chol | Serum cholesterol in mg/dl | Numerical | integer |
FBS | (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) | Categorical | integer |
RestECG | Resting electrocardiographic results (0, 1, 2) | Categorical | integer |
Thalach | Maximum heart rate achieved | Numerical | integer |
Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical | integer |
Oldpeak | ST depression induced by exercise relative to rest | Numerical | float |
Slope | The slope of the peak exercise ST segment | Numerical | integer |
CA | Number of major vessels (0-3) colored by fluoroscopy | Numerical | integer |
Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical | string |
Target | Diagnosis of heart disease (1 = true; 0 = false) | Classification | integer |
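
The feature types listed in the table suggest a different preprocessing step for each group of columns. The following is a minimal sketch, assuming the column names in the file match the table exactly (the actual file may use different casing or spelling) and that `Thal` holds text labels. The four rows are made up for illustration and stand in for the real dataset file; `heart_df`, `numerical_cols` and `categorical_cols` are illustrative names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only (NOT real patient data).
# In the lab, load the provided dataset file with pd.read_csv instead.
heart_df = pd.DataFrame(
    {
        "Age": [63, 41, 57, 52],
        "Sex": [1, 0, 1, 0],
        "CP": [1, 2, 4, 3],
        "Trestbpd": [145, 130, 140, 120],
        "Chol": [233, 204, 192, 250],
        "FBS": [1, 0, 0, 0],
        "RestECG": [2, 2, 0, 1],
        "Thalach": [150, 172, 148, 160],
        "Exang": [0, 0, 0, 1],
        "Oldpeak": [2.3, 1.4, 0.4, 0.0],
        "Slope": [3, 1, 2, 1],
        "CA": [0, 0, 0, 1],
        "Thal": ["fixed", "normal", "fixed", "reversible"],  # assumed labels
        "Target": [0, 0, 1, 0],
    }
)

# Column groups taken from the "Feature Type" column of the table.
numerical_cols = ["Age", "Trestbpd", "Chol", "Thalach", "Oldpeak", "Slope", "CA"]
categorical_cols = ["Sex", "CP", "FBS", "RestECG", "Exang", "Thal"]

# Separate the inputs from the target to predict.
x = heart_df.drop(columns="Target")
y = heart_df["Target"]

# Standardize numerical features and one-hot encode categorical ones.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numerical_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)
x_prepared = preprocessor.fit_transform(x)
print(x_prepared.shape)
```

One-hot encoding treats each category as an independent value, which avoids implying an order between, for example, chest pain types.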
You may use either a local or remote Python environment for this lab.
The easiest way to obtain a working Python setup is by using a cloud-based Jupyter notebook execution platform like Google Colaboratory, Paperspace or Kaggle Notebooks.
This lab is designed to introduce you to three essential libraries of the Python ecosystem for Machine Learning: NumPy, pandas and scikit-learn.
The following tutorials cover the basics you need to start using these tools in your projects.
- NumPy: the absolute basics for beginners
- 10 minutes (or maybe a bit more 😊) to pandas
If you're time-constrained, you may skip the following parts: Selection, Merge, Grouping, Reshaping and Time Series.
- Getting Started with scikit-learn
While studying these tutorials, make sure to run every code example yourself.
When done with the tutorials, take this test to check your understanding.
You may train any binary classification model on this task, for example a basic SGDClassifier implementing the logistic regression algorithm.
To implement the training process, you should take inspiration from the project workflow and classification performance lectures.
Try another model, for example a decision tree, and compare its performance with that of the first model.