## TCGA (The Cancer Genome Atlas): predictive modeling using small data

MyDataModels participated in an INRA (Institut National de la Recherche Agronomique) PREDICT workshop, where the provided dataset was an extract of the TCGA (The Cancer Genome Atlas) database. The dataset addressed a uterus cancer case. It contained four groups of variables, with no missing values.

The objective of this workshop was to demonstrate that the MyDataModels automated Machine Learning solution could be used, first, to find the top 25 variables impacting the target variable and, second, to produce a high-accuracy predictive model.

**The entire process took 3 days on a MacBook Air and was run by a person with no coding or machine learning skills.**

First, the provided CSV file was imported into the MDM workspace. The dataset was 64 MB in size and contained 90 samples (all women) with 53,000+ genomic variables. Next, a Goal was set specifying the VitalStatus variable (0 = negative uterus cancer, 1 = positive uterus cancer) as the analysis objective. The dataset held 21 positive cases (23%) and 69 negative cases (77%).
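The class proportions quoted above can be checked with a few lines of Python. This is a minimal sketch that rebuilds the label column from the reported counts only; the helper name `class_balance` is introduced here for illustration and is not part of the MDM tool.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: (count, rounded percentage)} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, round(100 * n / total)) for label, n in counts.items()}


# Label column rebuilt from the counts reported in the text (not the real data):
# 21 positive (1) and 69 negative (0) cases.
vital_status = [1] * 21 + [0] * 69
balance = class_balance(vital_status)

for label, (count, percent) in balance.items():
    print(f"VitalStatus={label}: {count} cases ({percent}%)")
```

With roughly one positive case for every three negative ones, accuracy alone is a weak yardstick, which is why the True Positive Rate is tracked later on.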

## Feature reduction process before modeling

The feature reduction process described below took 3 days.

The Variable Set Size parameter was set to 3 and the Cycles parameter to 100,000. Feature reduction was then run for approximately four successive iterations of 100,000 cycles.

At this point, the Significance/Utilization (S/U) scatter plot showed good differentiation of the variables under consideration:

Because the S/U scatter plot showed good differentiation, we saved the top 2000 variables identified in the reduction to a new variable set. This was done for performance reasons: with the variables well differentiated, the bulk of the least significant ones can safely be discarded, and there is no value in spending computational resources on ranking them. The goal was to identify and rank only the most important variables.

This new variable set of the top 2000 variables was then loaded for reduction in the Step 2 tab. The Reset Metrics button was used to discard the reduction metrics on this new variable set so that reduction could focus solely on the relative merits of these 2000 variables.

The Variable Set Size was again set to 3 and Cycles to a large number so that reduction would continue without intervention until manually halted. Reduction was run until the average of the Exposure Values listed in the variables table was approximately 25.

At this point the S/U scatter plot again indicated that the variables were well differentiated:

This iterative process was repeated to produce new variable sets of the top 1000, 500, 250, 100 and finally the top 25 variables.

Each iteration followed the process described above: select a subset from the previous iteration, load that subset for reduction, reset the metrics, set Variable Set Size to 3, and run reduction cycles until the S/U scatter plot shows good differentiation. A good practice is to use a naming convention that indicates each step of these iterations. The S/U scatter plots for these stages, as we performed them, are shown below.

**Top 1000:**

**Top 500:**

**Top 100:**
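The bookkeeping behind this staged narrowing (2000, 1000, 500, 250, 100, then 25 variables) can be sketched in plain Python. Everything in the block is an assumption made for illustration. The scoring function is a simple stand-in (absolute difference between class means), not the MDM Significance or Utilization metric. The data is synthetic. The names `rank_variables` and `reduce_in_stages` are hypothetical.

```python
import random


def rank_variables(columns, labels):
    """Rank variable names by a stand-in score: |mean(positives) - mean(negatives)|.

    This is NOT the MDM metric; any ranking function could be plugged in here.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    scores = {}
    for name, values in columns.items():
        mean_pos = sum(values[i] for i in pos) / len(pos)
        mean_neg = sum(values[i] for i in neg) / len(neg)
        scores[name] = abs(mean_pos - mean_neg)
    return sorted(scores, key=scores.get, reverse=True)


def reduce_in_stages(columns, labels, schedule):
    """Keep the top-k variables at each stage, re-scoring the survivors each time."""
    stages = {}
    current = dict(columns)
    for k in schedule:
        ranked = rank_variables(current, labels)  # metrics start fresh per stage
        keep = ranked[:k]
        current = {name: current[name] for name in keep}
        stages[f"top_{k}"] = keep  # naming convention records each step
    return stages


# Synthetic demo: 90 samples, 3000 noise variables, one informative variable.
random.seed(0)
labels = [1] * 21 + [0] * 69
columns = {f"var_{j}": [random.gauss(0, 1) for _ in labels] for j in range(3000)}
columns["var_7"] = [random.gauss(2.0 if y == 1 else 0.0, 1) for y in labels]

schedule = [2000, 1000, 500, 250, 100, 25]
stages = reduce_in_stages(columns, labels, schedule)
print({name: len(kept) for name, kept in stages.items()})
```

With a deterministic score like this one, the intermediate stages do not change the final ranking; the sketch only shows the select, reset, and re-run loop and the per-stage naming described above.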

## Results in less than 1 hour

This step took less than 1 hour.

After the final reduction on the top 100 variables, the top 25 variables were saved to a new variable set and sent to modeling in the Step 3 tab.

Within 12 attempts, we selected the best classifier according to what we considered the key performance criterion, True Positive Rate. It predicted VitalStatus (Model ID: 20180716T173718.030587) with 74% accuracy on a hold-out subset and, more importantly, with a True Positive Rate (Sensitivity) of 83%, that is, 1 error on 6 samples.

As the above table shows, the accuracy obtained on the test subset is 74%, with a True Positive Rate of 83%, a True Negative Rate of 71%, and an AUC of 90%.

We then scored the selected model on a 45-sample test dataset (11 positive cases and 34 negative cases) in which the VitalStatus variable was left empty. The model's predictions were compared with the actual outcomes: accuracy was 67%, True Positive Rate 91% (10 of 11 predicted correctly), and True Negative Rate 59%.
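The three figures for the 45-sample test set are consistent with one another, as a short calculation shows. The count of 10 correct positives out of 11 comes from the text. The count of 20 correct negatives out of 34 is an assumption inferred from the reported 59% True Negative Rate. The function name `classification_rates` is introduced here for illustration.

```python
def classification_rates(tp, fn, tn, fp):
    """Compute accuracy, true positive rate and true negative rate from counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
    }


# 45 scored samples: 11 positive, 34 negative.
# tp=10, fn=1 is stated in the text; tn=20, fp=14 is inferred from the 59% rate.
rates = classification_rates(tp=10, fn=1, tn=20, fp=14)
for name, value in rates.items():
    print(f"{name}: {value:.0%}")
```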

We believe we could further improve performance by using a stratified sampling method and ensemble learning.
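As an illustration of the stratified sampling idea, the sketch below splits sample indices so that the hold-out subset keeps roughly the same positive/negative ratio as the full dataset. It is a generic standard-library implementation under stated assumptions, not the sampling method used by the MDM tool. The 25% hold-out fraction and the name `stratified_split` are chosen for the example only.

```python
import random


def stratified_split(labels, test_fraction, seed=0):
    """Split indices into train/test so each class keeps about the same share."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # per-class hold-out size
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Labels rebuilt from the reported counts: 21 positive, 69 negative.
labels = [1] * 21 + [0] * 69
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)

positives_in_test = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), positives_in_test)
```

With only 21 positive cases, a purely random split can leave very few positives in the hold-out subset; splitting per class removes that source of variance.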

The selected model used only the following 4 variables, out of the top 25, and took the following compact functional form.

**4 variables used:**

- CNA_PDSS2
- RNASEQ_FAM157A
- MIRNA_hsa-mir-122
- MIRNA_hsa-mir-518f

**Model Functions:**

Each function computes a propensity score for one of the 2 cases; the difference between the two scores determines the prediction.
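The decision rule described above can be sketched as follows. The two scoring functions are hypothetical placeholders with generic inputs `a` and `b`; they are not the model's actual functions and carry no information about the four variables listed. How ties are resolved is not stated in the source, so assigning them to the negative class is an assumption.

```python
def predict(sample, score_negative, score_positive):
    """Return 1 if the positive-case score exceeds the negative-case score, else 0."""
    difference = score_positive(sample) - score_negative(sample)
    return 1 if difference > 0 else 0  # tie goes to class 0 (assumption)


# HYPOTHETICAL placeholder score functions, for illustration only.
def score_positive(sample):
    return sample["a"] - sample["b"]


def score_negative(sample):
    return sample["b"] - sample["a"]


first = predict({"a": 2.0, "b": 0.5}, score_negative, score_positive)
second = predict({"a": 0.1, "b": 0.9}, score_negative, score_positive)
print(first, second)
```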

## Summary

- Case study: model generation for uterus cancer prediction using historical data
- Historical dataset size: 64 MB, 90 samples, 53,161 variables
- Variables: 15,163 Methylation, 852 miRNA, 17,130 CNA, 20,016 RNAseq
- Equipment used: MacBook Air, 1.8 GHz Intel Core i5, 8 GB 1600 MHz DDR3
- Variable reduction process: 3 days
- Modeling process: 1 hour
- Model accuracy on the hold-out subset: 74%, with 83% true positive rate and 71% true negative rate