Medicine has always adopted the latest technology to improve the quality of patient care. Today, machine learning can help doctors with diagnosis. Predictive models can save doctors precious time when assessing the risk of heart disease.
Problems to solve
How can we predict whether a patient is likely to have heart disease from known variables such as age, sex, blood pressure, electrocardiographic results, and maximum heart rate?
Can machine learning help, and how accurate can predictive models be at detecting such diseases?
Like humans, machines can learn to make predictions by analyzing past information (historical data). Machines can quickly identify patterns in a set of data and produce a mathematical formula (model) using the variables from this historical data.
To demonstrate the performance of the MyDataModels solution on this type of problem, we chose a specific case study.
This case study is based on real data from a public dataset* available on Kaggle.
The objective of this case study is to predict whether a patient is likely to have heart disease.
* See the Dataset information section for details on this dataset.
The graph below shows an extract of the public dataset.
Each row is a patient and each column (feature) is a variable that can be used in the model.
- Age: age in years
- Sex: (1 = male; 0 = female)
- Cp: chest pain type
- Trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- Chol: serum cholesterol in mg/dl
- Fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- Restecg: resting electrocardiographic results
- Thalach: maximum heart rate achieved
- Exang: exercise induced angina (1 = yes; 0 = no)
- Oldpeak: ST depression induced by exercise relative to rest
- Slope: the slope of the peak exercise ST segment
- Ca: number of major vessels (0-3) colored by fluoroscopy
- Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
The names and social security numbers of the patients were removed from the database and replaced with dummy values.
To create predictive models, MyDataModels offers an efficient solution called TADA.
To build a model, experts need to:
- upload the historical dataset into TADA,
- set what they want to predict (here “target”),
- select the other variables to use (all the other columns).
Below are the statistics of a model obtained with TADA in under a minute.
ACC = Accuracy
TPR = True Positive Rate
TNR = True Negative Rate
MCC = Matthews Correlation Coefficient
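The four statistics listed above can all be derived from the counts of true and false positives and negatives. The following is a minimal sketch of these standard formulas, not TADA's own implementation; the function name `binary_metrics` and the labels are made up for illustration and are not results from this case study.

```python
import math

def binary_metrics(y_true, y_pred):
    """Compute ACC, TPR, TNR and MCC for binary labels (1 = disease, 0 = no disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    acc = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of sick patients detected
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # share of healthy patients cleared
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "TPR": tpr, "TNR": tnr, "MCC": mcc}

# Invented labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Unlike accuracy, MCC uses all four counts, so it stays informative even when one class is much more frequent than the other.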
How to use this model?
Once a model has been built from historical data, professionals can easily make predictions using information from their new patients.
To use a model, domain experts need to create a new dataset in which each row is a patient they want a diagnosis for and each column is a variable used in the model.
Then they need to import this file into TADA (see screen below) and click on “Generate score”.
A new file is generated with an additional column (“prediction”) giving the predicted diagnosis for each patient: 1 or 0.
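As a rough sketch of the file handling around this step, the code below prepares a file of new patients and reads back the “prediction” column. It makes several assumptions: that the files are CSV, that column names are the lowercased variable names listed earlier, and that the patient values and the scored output shown here are invented. The helper names `write_scoring_file` and `read_predictions` are hypothetical; the scoring itself happens inside TADA and is not reproduced here.

```python
import csv
import io

# Assumed column names: the model variables from this article, lowercased.
FEATURES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

def write_scoring_file(patients, fileobj):
    """Write one row per patient, with exactly the variables used in the model."""
    writer = csv.DictWriter(fileobj, fieldnames=FEATURES)
    writer.writeheader()
    for patient in patients:
        missing = [name for name in FEATURES if name not in patient]
        if missing:
            raise ValueError(f"missing variables: {missing}")
        writer.writerow({name: patient[name] for name in FEATURES})

def read_predictions(fileobj):
    """Read the 'prediction' column (1 or 0) from a scored file."""
    return [int(row["prediction"]) for row in csv.DictReader(fileobj)]

# Invented patient, for illustration only.
new_patients = [{"age": 57, "sex": 1, "cp": 0, "trestbps": 140, "chol": 241,
                 "fbs": 0, "restecg": 1, "thalach": 123, "exang": 1,
                 "oldpeak": 0.2, "slope": 1, "ca": 0, "thal": 7}]
buffer = io.StringIO()
write_scoring_file(new_patients, buffer)
scoring_csv = buffer.getvalue()  # content to save and import into the tool

# Simulated scored file (assumed layout, only a few columns shown).
scored_csv = "age,sex,prediction\n57,1,1\n63,0,0\n"
predictions = read_predictions(io.StringIO(scored_csv))
print(predictions)
```

Checking for missing variables before writing the file catches the most likely mistake early: a new dataset whose columns do not match those used to build the model.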
Benefits of TADA
Doctors and medical staff are not data scientists. They may not have the machine learning or coding skills required to build predictive models. Most data handled by these professionals is Small Data, meaning that their historical data often covers a limited number of patients. Traditional machine learning tools work well with Big Data but do not perform well with Small Data.
MyDataModels allows domain experts to automatically build predictive models from Small Data. They can use their raw data: there is no need to normalize data or handle outliers, and no feature engineering is required. Thanks to this limited data preparation, the results above were obtained from this dataset in a few clicks and in less than a minute on a standard laptop.
MyDataModels brings a self-service solution for those who have Small Data and no data scientists.
A worldwide study on causes of death found that heart disease is the leading cause of death. It is anticipated that 23.6 million people will die from heart disease in 2030. The healthcare industry collects large amounts of heart disease data which, unfortunately, are not mined to discover hidden information for effective decision making.
In this heart disease prediction use case, the results obtained from MyDataModels’ predictive models are satisfactory, with an 82% accuracy rate, and could significantly improve this situation.
The medical world could make greater use of machine learning to detect diseases in general, and heart disease in particular, allowing doctors, who are not data experts, to spend less time on data analysis and more on patient care.
Dataset information
The original database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "Target" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).
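The presence/absence distinction described above amounts to collapsing the original 0 to 4 scale into a binary label. A minimal sketch, with a hypothetical helper name `binarize_target` and made-up raw values:

```python
def binarize_target(value):
    """Map the original 0-4 target to 1 (presence) or 0 (absence)."""
    if value not in (0, 1, 2, 3, 4):
        raise ValueError(f"unexpected target value: {value!r}")
    return 1 if value >= 1 else 0

# Invented raw target values, for illustration only.
raw_targets = [0, 2, 1, 0, 4, 3]
binary_targets = [binarize_target(v) for v in raw_targets]
print(binary_targets)
```

Rejecting values outside 0 to 4 avoids silently labeling a data-entry error as a diagnosis.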
- Number of features: 13
- Size of data: 303 samples
- Weight: Positive class: 54.6%, Negative class: 45.4%
- Data description: This dataset is used to predict heart disease
- Target: presence (1) or absence (0) of heart disease
- Score: Accuracy
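Because the score is accuracy, the class weights above give a useful point of comparison: a trivial model that always predicts the more frequent class is right as often as that class occurs. The short calculation below is an illustrative sketch using only the figures quoted in this article (54.6%, 45.4% and 82%); the variable names are hypothetical.

```python
# Class shares and reported accuracy, as quoted in this article.
positive_share = 0.546
negative_share = 0.454
reported_accuracy = 0.82

# Always predicting the majority class is correct for that share of patients.
baseline_accuracy = max(positive_share, negative_share)
gain = reported_accuracy - baseline_accuracy

print(f"Majority-class baseline: {baseline_accuracy:.1%}")
print(f"Gain over baseline: {gain * 100:.1f} percentage points")
```

Under these figures, the reported model improves on the trivial baseline by roughly 27 percentage points.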