One of the main machine learning applications in healthcare is the identification and diagnosis of diseases that are considered hard to diagnose. These range from cancers, which are difficult to detect at an early stage, to many genetic diseases.
Problems to solve
How can we predict whether a patient is likely to have breast cancer?
How can we detect a risk from the characteristics of breast mass cell nuclei?
How can we help doctors improve their diagnoses?
Can machine learning help in these matters, and how accurate can predictive models be at detecting breast cancer?
Like humans, machines can learn to make predictions by analyzing past information (historical data). Machines can quickly identify patterns in a set of data and produce a mathematical formula (model) using the variables from this historical data.
To demonstrate the performance of MyDataModels’ solution on this type of problem, we chose a specific case study.
This case study is based on real data from a public dataset*, which is available from Kaggle or UCI.
The objective in this case study is to identify the benign or malignant class of the cell nuclei.
* see detailed information on this dataset in the Dataset information section
The graph below shows an extract of the public dataset.
Each row is a patient and each column (feature) is a variable.
Features used:
1. ID number
2. Diagnosis (M = malignant, B = benign)
3-32. Ten real-valued features are computed for each cell nucleus:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
To create predictive models, MyDataModels developed a high-performance solution called TADA.
To build a model, experts need to:
- upload the historical dataset into TADA,
- set what they want to predict (here “diagnostic”),
- select the other variables to use (all the other columns).
The statistics below are for a model obtained with TADA within a minute.
ACC = Accuracy
TPR = True Positive Rate
TNR = True Negative Rate
MCC = Matthews Correlation Coefficient
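As a rough illustration of how these four statistics relate to each other, the sketch below computes them from the counts of a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration; they are not TADA's implementation and not the results of the model discussed here.

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Compute ACC, TPR, TNR and MCC from confusion-matrix counts.

    tp/tn: correctly predicted positives/negatives
    fp/fn: wrongly predicted positives/negatives
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    # Guard against empty classes to avoid division by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "TPR": tpr, "TNR": tnr, "MCC": mcc}


# Hypothetical counts, for illustration only.
m = binary_metrics(tp=90, tn=50, fp=5, fn=5)
for name, value in m.items():
    print(f"{name} = {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when one class is more frequent than the other.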
How to use this model?
Once a model has been built from historical data, professionals can easily make predictions by using information from their current patients who need a diagnosis.
To use a model, domain experts need to create a new dataset with:
- as rows, all the patients they want a diagnosis for,
- as columns, all the variables used in the model.
Then, they import this file into TADA (see screen below) and click on “Generate score”.
A new file will be generated with a new column giving the predicted diagnosis for each patient: M = malignant or B = benign.
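The scoring file described above can be prepared with any tool. As a minimal sketch, the Python snippet below builds a CSV with one row per patient and one column per model variable, and refuses rows that lack a variable. The column names, the helper `build_scoring_csv` and the patient values are all hypothetical; the real columns must match those used when the model was built, and the import itself is done through the TADA interface.

```python
import csv
import io

# Hypothetical column names; use exactly the variables your model was built with.
MODEL_COLUMNS = ["radius_mean", "texture_mean", "smoothness_mean"]


def build_scoring_csv(patients, columns=MODEL_COLUMNS):
    """Return CSV text with one row per patient and one column per model variable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for index, patient in enumerate(patients):
        # Every variable used by the model must be present for every patient.
        missing = [c for c in columns if c not in patient]
        if missing:
            raise ValueError(f"patient row {index} is missing: {missing}")
        # Extra keys are dropped so the file contains only model variables.
        writer.writerow({c: patient[c] for c in columns})
    return buffer.getvalue()


# Made-up measurements for two new patients, for illustration only.
new_patients = [
    {"radius_mean": 14.2, "texture_mean": 19.1, "smoothness_mean": 0.096},
    {"radius_mean": 20.6, "texture_mean": 25.3, "smoothness_mean": 0.110},
]
scoring_csv = build_scoring_csv(new_patients)
print(scoring_csv)
```

Checking the columns before import catches a missing variable early, rather than after the file has been uploaded.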
Benefits of TADA
Doctors and medical staff could use predictive models to help them in their diagnosis.
However, they are not data scientists and may have neither the machine learning skills nor the coding experience required to build models. Moreover, most data handled by these professionals is Small Data: their historical data often covers a limited number of patients, certainly not hundreds of thousands (Big Data). Traditional machine learning tools work well with Big Data but do not perform as well with Small Data.
MyDataModels allows domain experts, in this case doctors and researchers, to build predictive models automatically from their collected data. No training is required, and they can use their collected data directly without needing to normalize it or handle outliers. No feature engineering is required. Thanks to this limited data preparation, the above results for this specific dataset were obtained in a few clicks and in less than a minute on a regular laptop.
MyDataModels brings a self-service solution for those who have Small Data and no data scientists.
Breast cancer is the most common cancer among women worldwide, accounting for 25% of all cancer cases, and it affected 2.1 million people in 2015. Early diagnosis significantly increases the chances of survival. However, research indicates that most experienced physicians can diagnose cancer with 79% accuracy, while machine learning techniques achieve 91% correct diagnosis.
In this breast cancer prediction use case, the results obtained from MyDataModels’ predictive models are satisfactory, with a 97% accuracy rate.
The medical world could make more use of machine learning to detect diseases in general and breast cancer in particular. This would allow doctors, who are not data experts, to spend less time on data analysis and more time on providing the right treatment to their patients faster.
Dataset information
The objective is to identify the benign or malignant class of the cell nuclei.
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].
Task: Binary Classification
Number of columns: 32 (ID, diagnosis and 30 real-valued features)
Size of data: 569 samples
Weight: Positive class (benign) 63%, Negative class (malignant) 37%
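These class weights also set a baseline for reading the accuracy figure. The short sketch below derives the percentages from the class counts and shows the accuracy a trivial model would reach by labelling every sample benign. The counts 357 and 212 are taken from the UCI documentation of this dataset, not from this article, so treat them as an assumption.

```python
# Class counts per the UCI documentation for this dataset (assumption:
# they are not stated in this article).
BENIGN, MALIGNANT = 357, 212

total = BENIGN + MALIGNANT
benign_share = BENIGN / total
malignant_share = MALIGNANT / total

# A trivial model that labels every sample benign is right exactly
# as often as benign samples occur.
baseline_accuracy = benign_share

print(f"samples: {total}")
print(f"benign: {benign_share:.0%}, malignant: {malignant_share:.0%}")
print(f"always-benign baseline accuracy: {baseline_accuracy:.0%}")
```

Any useful model has to beat this baseline clearly, which is one reason to report TPR, TNR and MCC alongside accuracy.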
Target: class Diagnosis (M = malignant, B = benign)
The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
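The field numbering above follows a regular pattern, which can be sketched in a few lines of Python. The sketch assumes the ten base features appear in the order given in the UCI documentation (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension) and that fields 1 and 2 hold the ID and the diagnosis; the helper `field_name` is hypothetical.

```python
# Base feature order as given in the UCI documentation (assumption).
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# Fields 3-12 hold the means, 13-22 the standard errors, 23-32 the "worst" values.
STATISTICS = ["mean", "SE", "worst"]


def field_name(field):
    """Return (statistic, base feature) for a 1-based field number in 3..32."""
    if not 3 <= field <= 32:
        raise ValueError("feature fields are numbered 3 to 32")
    stat_index, base_index = divmod(field - 3, len(BASE_FEATURES))
    return STATISTICS[stat_index], BASE_FEATURES[base_index]


for field in (3, 13, 23):
    statistic, feature = field_name(field)
    print(f"field {field}: {feature} ({statistic})")

# All 30 feature fields, in file order.
all_fields = [field_name(f) for f in range(3, 33)]
```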
All feature values are re-encoded with four significant digits.
Missing attribute values: none