Medicine has always adopted the latest technology to improve the quality of patient care. Today, machine learning can help doctors with diagnostics, and predictive models can save them precious time when predicting heart disease.
Problems to solve
- How can we predict whether a patient is likely to have heart disease from known variables such as age, sex, blood pressure, electrocardiographic results, maximum heart rate, etc.?
- Can machine learning help in this matter, and how accurate can predictive models be at detecting such diseases?
Benefits of TADA
Doctors and medical staff are not data scientists. They may not have the machine learning or coding skills required to build predictive models. Most data handled by these professionals is Small Data, meaning that their historical data often covers a limited number of patients. Traditional machine learning tools work well with Big Data but do not perform well with Small Data.
MyDataModels allows domain experts to build predictive models from Small Data automatically and without training. They can use their collected data directly, without normalization, outlier management or feature engineering. Thanks to this limited data preparation, the results for this specific dataset were obtained with a few clicks, in less than a minute, on a regular laptop.
MyDataModels brings a self-service solution for those who have Small Data and no data scientists.
Conclusion
A worldwide study on causes of death finds that heart disease is the leading cause of death and anticipates that 23.6 million people will die from it by 2030. The healthcare industry collects large amounts of heart disease data which, unfortunately, are not mined for the hidden information that would support effective decision making.
In this heart disease prediction use case, the results obtained from MyDataModels’ predictive models are satisfying, with an accuracy rate of about 82%, and can help improve this situation.
The medical world could use more machine learning to detect diseases in general, and heart disease in particular, allowing doctors, who are not data experts, to spend less time on data analysis and more on caring for their patients.
Case study
Solution
Automated Machine Learning solutions predict future outcomes from historical data. To predict a future result, you must provide your descriptive data along with the past results obtained.
TADA lets you easily create a relevant predictive model from your data and apply it to future data.
In this case, the descriptive data are the patients’ characteristics.
The goal is to predict whether a patient has heart disease or not, which is a binary task (1/0).
To generate a model, the steps are the following:
- Create your project and load your data as a CSV table (with data in rows and variables in columns).
- Select the variable you want to predict, called the Goal. In this case, the Goal is the variable "Target" (a visualization of the variable is provided).
- Select your data for the model generation. This step is called "Creating the Variable set" and allows you to manually select the descriptive variables you want to use. By default, they are all selected. TADA identifies the relevant descriptive variables by itself, which affects the calculation time required to create the model. The fewer variables selected, the faster the model creation.
- Create your model. At creation, default values are proposed for Name of models, Population and Iteration. You only need to validate the default values to start model generation. ‘Best practices’ are at your disposal to guide you in the choice of these parameters. Depending on the size of the descriptive data file, this step can take between a few seconds and ten minutes.
Once the model is created, you can see the results of the model using metrics and charts so you can judge its relevance.
Note:
To apply a model that you think is relevant, you can:
- Retrieve the associated mathematical formula and apply it (for instance in Excel)
- Retrieve the source code of the formula and use it yourself (available only with paid TADA offers). The source code is available in R, Java, C++ and soon Python.
- Use our "Predict" feature in the product: upload the file containing the data to be predicted, and you will get back a downloadable file containing your data together with the calculated predictions.
Dataset information
The screenshot below shows an extract of the public dataset.
Each row is a patient and each column is a variable which can be used in the model.
The original database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "Target" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
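As a minimal sketch of the presence/absence grouping described above, the original 0 to 4 score can be collapsed into a binary value. The function name `to_binary_target` and the sample values are illustrative assumptions, not part of TADA or of the dataset.

```python
def to_binary_target(raw_value: int) -> int:
    """Collapse the original 0-4 score into absence (0) or presence (1)."""
    if not 0 <= raw_value <= 4:
        raise ValueError(f"expected a value from 0 to 4, got {raw_value}")
    return 0 if raw_value == 0 else 1


# Illustrative values only, not rows from the dataset.
raw_targets = [0, 2, 1, 0, 4, 3]
binary_targets = [to_binary_target(v) for v in raw_targets]
print(binary_targets)  # [0, 1, 1, 0, 1, 1]
```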
Dataset information
- Task type: Binary Classification
- Number of variables: 13
- Number of rows: 303 samples
- Goal: Target, presence (1) or absence (0) of heart disease
- Weight: Positive class: 54.6%, Negative class: 45.4%
The variables are:
- Age: age in years
- Sex: (1 = male; 0 = female)
- Cp: chest pain type
- Trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- Chol: serum cholesterol in mg/dl
- Fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- Restecg: resting electrocardiographic results
- Thalach: maximum heart rate achieved
- Exang: exercise induced angina (1 = yes; 0 = no)
- Oldpeak: ST depression induced by exercise relative to rest
- Slope: the slope of the peak exercise ST segment
- Ca: number of major vessels (0-3) colored by fluoroscopy
- Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
- Target is our Goal
The names and social security numbers of the patients were removed from the database and replaced with dummy values.
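Before loading a CSV file, it can be useful to check that it has the expected columns and a binary Goal. The sketch below is one possible way to do this with the Python standard library. The helper `check_csv` is hypothetical, the column spelling is assumed to match the list above (your file may use different spelling or case), and the two sample rows are made up for illustration.

```python
import csv
import io

# Column names as listed above; the exact spelling and case in your file may differ.
EXPECTED_COLUMNS = [
    "Age", "Sex", "Cp", "Trestbps", "Chol", "Fbs", "Restecg",
    "Thalach", "Exang", "Oldpeak", "Slope", "Ca", "Thal", "Target",
]


def check_csv(handle, goal="Target"):
    """Return a list of problems found in the CSV (empty list if none)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in EXPECTED_COLUMNS if c not in header]
    if goal in header:
        # Line 1 is the header, so data rows start at line 2.
        for line_no, row in enumerate(reader, start=2):
            if row[goal] not in ("0", "1"):
                problems.append(
                    f"line {line_no}: {goal} is {row[goal]!r}, expected 0 or 1"
                )
    return problems


# Made-up rows for illustration, not taken from the dataset.
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "50,1,0,120,200,0,1,150,0,1.0,1,0,3,1\n"
    "60,0,1,130,250,0,0,140,1,2.0,2,1,7,2\n"
)
problems = check_csv(sample)
print(problems)
```

Here the second made-up row is reported because its Target is 2 rather than 0 or 1.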
Results
The results are available once the model has been generated.
They present the performance of the predictive model.
The type of predictive model and its associated metrics depend on the Goal (the variable to be predicted) and the values it takes.
The type of model is shown on the model results display.
Depending on the type of the Goal (in our case, the Goal is "Target"), one of three types of prediction is made:
- Binary classification: Discrete value taking only two values (yes / no for instance)
- Multiclass classification: Discrete value taking more than two values (for instance a machine status with values such as On, Risk of breakdown, Down, etc.)
- Regression: Continuous value that can take an infinite number of values (a temperature, a pressure, a turnover, the price of a house, etc.)
When the model is generated, and in line with standard Machine Learning practice, TADA divides your dataset into three parts:
- A training part, which represents 40% of your dataset and is used to train a number of candidate formulas,
- A validation part, which represents 30% of your dataset and is used to validate and select the best formulas found in the previous step,
- A test part, which represents the remaining 30% of your dataset and is used to test the formulas selected in the previous step. The performance of your model should mainly be evaluated on this partition, as is standard in Machine Learning, because its data were not used during training or validation and serve only to measure performance.
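The 40/30/30 partition can be sketched as follows. This is only an illustration of the idea, assuming a simple random shuffle. How TADA actually shuffles, stratifies or rounds the partitions is not stated here, so the sizes it produces may differ slightly. The function `split_dataset` and its parameters are hypothetical names.

```python
import random


def split_dataset(rows, train_frac=0.4, valid_frac=0.3, seed=0):
    """Shuffle rows and cut them into training, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains, about 30%
    return train, valid, test


rows = list(range(303))  # stand-ins for the 303 patient rows
train, valid, test = split_dataset(rows)
print(len(train), len(valid), len(test))  # 121 91 91
```

Every row lands in exactly one partition, so the test rows are never seen during training or validation.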
ACC (Accuracy) is the overall accuracy rate of the model, i.e. the percentage of predictions that are correct (here, 81.52% of predictions are correct).
TPR (True Positive Rate) is the accuracy rate on the positive class, i.e. the share of actual "yes/1" cases that are correctly predicted.
TNR (True Negative Rate) is the accuracy rate on the negative class, i.e. the share of actual "no/0" cases that are correctly predicted.
MCC (Matthews Correlation Coefficient) summarizes the overall quality of the predictions, that is, how well the model separates the two classes. It ranges from -1 to +1.
Confusion matrix
Here, the confusion matrix provides a visual way of interpreting the metrics.
In this case, TADA predicted 37 times that a patient has heart disease and was mistaken only 7 times (7 patients were wrongly flagged as having heart disease).
In parallel, TADA predicted 55 times that a patient has no heart disease and was wrong 10 times (10 cases of heart disease were missed).
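The four metrics can be recomputed from these confusion matrix counts. The sketch below assumes the reading given above: 37 positive predictions with 7 errors, and 55 negative predictions with 10 errors. With that reading the accuracy matches the reported 81.52%. The TPR, TNR and MCC values it prints are derived from those counts and are not figures reported in this case study.

```python
import math

# Counts read from the confusion matrix described above (test partition).
tp = 37 - 7    # predicted "heart disease" and correct
fp = 7         # predicted "heart disease" but wrong
tn = 55 - 10   # predicted "no heart disease" and correct
fn = 10        # predicted "no heart disease" but wrong (missed cases)

total = tp + fp + tn + fn
acc = (tp + tn) / total   # share of all predictions that are correct
tpr = tp / (tp + fn)      # share of actual positives that are found
tnr = tn / (tn + fp)      # share of actual negatives that are found
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"ACC = {acc:.2%}")  # ACC = 81.52%
print(f"TPR = {tpr:.2%}")  # TPR = 75.00%
print(f"TNR = {tnr:.2%}")  # TNR = 86.54%
print(f"MCC = {mcc:.3f}")  # MCC = 0.622
```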
Ready to use TADA?
Don't have data at hand?
No problem, sample data are available to make your trial as relevant as possible!
Try it now!