MyDataModels was used to model and perform feature selection on the NCBI microarray gene expression dataset GSE19429. This dataset contains Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray data representing gene expression levels for 200 samples: bone marrow tissue obtained from 183 patients with myelodysplastic syndromes (MDS) and 17 healthy controls. The microarray generates 54,675 attributes per sample. The dataset file was 103 megabytes in size.
Acquisition and Preparation of the dataset file
Preparation entailed downloading the series matrix text file, removing the metadata rows, and transposing the table so that rows represented samples and columns represented attributes. The healthy vs. MDS status was chosen as the response variable to analyze; its value was derived from the metadata and added as a column.
The rows were then shuffled so that the samples appeared in random order.
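The preparation steps above can be sketched in a few lines of Python. This is only an illustration, not the procedure MyDataModels used: the function name `prepare_series_matrix`, the `labels` mapping (sample ID to healthy/MDS, which in practice would be derived from the metadata rows), and the fixed `seed` are all assumptions. It relies on the GEO series matrix convention that metadata lines begin with `!` and that the data table is tab-separated, with probe IDs in the first column.

```python
import csv
import random


def prepare_series_matrix(text, labels, seed=0):
    """Turn series-matrix text into shuffled sample rows (illustrative sketch).

    `labels` is a hypothetical dict mapping sample ID -> response value.
    """
    # Drop metadata rows (they start with "!") and blank lines.
    data_lines = [ln for ln in text.splitlines()
                  if ln.strip() and not ln.startswith("!")]
    table = list(csv.reader(data_lines, delimiter="\t"))
    header, probe_rows = table[0], table[1:]
    sample_ids = header[1:]                      # first cell is "ID_REF"
    probe_ids = [row[0] for row in probe_rows]

    # Transpose: one output row per sample, one column per probe,
    # with the response value appended as the last column.
    rows = []
    for j, sample_id in enumerate(sample_ids, start=1):
        values = [row[j] for row in probe_rows]
        rows.append([sample_id] + values + [labels[sample_id]])

    # Shuffle so that sample order carries no information.
    random.Random(seed).shuffle(rows)
    return ["sample_id"] + probe_ids + ["response"], rows
```

The returned header and rows can then be written out with `csv.writer` as the file to analyze. Seeding the shuffle is a design choice made here so that the prepared file is reproducible.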
Results in less than 2 hours
The entire process, including dataset acquisition and preparation, feature reduction, ranking of the entire set of feature attributes, and generation of an explanatory model, took just under two hours. When evaluated against hold-out samples (samples withheld from the analysis process), the resulting explanatory model was 90% accurate, with a true negative rate of 100% and a true positive rate of 89.1%.
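For readers who want to see how these three figures relate, the sketch below computes them from a confusion matrix, treating MDS as the positive class. The counts are hypothetical: the article does not state the size or composition of the hold-out set, and these values were chosen only because they reproduce the reported rates.

```python
def holdout_metrics(tp, fn, tn, fp):
    """Return (accuracy, true positive rate, true negative rate)."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    tpr = tp / (tp + fn)   # share of MDS samples correctly identified
    tnr = tn / (tn + fp)   # share of healthy samples correctly identified
    return accuracy, tpr, tnr


# Hypothetical hold-out counts (an assumption, not from the source):
# 46 MDS samples with 41 correct, 4 healthy samples all correct.
acc, tpr, tnr = holdout_metrics(tp=41, fn=5, tn=4, fp=0)
print(f"accuracy={acc:.1%} TPR={tpr:.1%} TNR={tnr:.1%}")
# prints "accuracy=90.0% TPR=89.1% TNR=100.0%"
```

Because the classes are heavily imbalanced (183 MDS patients to 17 controls in the full dataset), overall accuracy is dominated by the true positive rate, which is why the two rates are worth reporting separately.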
With regard to the ranked feature set, the top 20 probes identified by the analysis process included 4 probes (231067_s_at, 241679_at, 210517_s_at, and 227530_at) that represent transcripts of the gravin/AKAP12 gene, whose relevance to MDS has been reported in other research:
- Low expression of the putative tumour suppressor gene gravin in chronic myeloid leukaemia, myelodysplastic syndromes and acute myeloid leukaemia
This is of note because MyDataModels placed 4 of the 5 probe IDs for AKAP12 in the top 20, which lends confidence that the analysis process is genuinely weighing the merit of all attributes rather than finding useful attributes by chance.
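The "not by chance" argument can be made concrete with a rough calculation. As a sketch, assume a null model in which the top 20 is a uniformly random draw from all 54,675 attributes; this model is an assumption introduced here, not something the analysis itself computes. The number of gravin probes landing in the top 20 then follows a hypergeometric distribution, and the function below (a hypothetical helper) gives the exact probability of seeing at least 4 of the 5.

```python
from fractions import Fraction
from math import comb


def prob_at_least(k, gene_probes, top_n, total):
    """P(at least k of `gene_probes` fall in a random `top_n` out of `total`)."""
    hits = sum(
        comb(gene_probes, i) * comb(total - gene_probes, top_n - i)
        for i in range(k, min(gene_probes, top_n) + 1)
    )
    return Fraction(hits, comb(total, top_n))


# 5 probes for the gene, top 20 of 54,675 attributes (figures from the text).
p = prob_at_least(4, 5, 20, 54675)
print(float(p))
```

Under this null model the probability comes out below one in a trillion, which supports the reading that the ranking is driven by the attributes' merit. Real probes for the same gene are correlated, so this is an order-of-magnitude illustration rather than a formal significance test.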
Also in the top 20 attributes were three probe IDs that represent transcripts of ARPP21 (220359_s_at, 1556599_s_at, and 231935_at). The relevance of ARPP21 to MDS was also found by the research described here.
Also identified by MyDataModels in the top 20 probes:
OR7A5 (208285_at) – Gene expression profiling of CD34+ cells in patients with the 5q- syndrome
SH2D4B (1563849_at) and KIAA0226L (previously named C13orf18, probe 44790_s_at) – Both found to be downregulated and differentially expressed in MDS patients:
PPP2R2C (228010_at) – Downregulated in MDS patients per:
CD19 (206398_s_at) – Found to be downregulated in MDS patients per:
P4HA1 (202733_at) – Found to have gene pathway aberrantly methylated in MDS HSCs:
Stem and progenitor cells in myelodysplastic syndromes show aberrant stage-specific expansion and harbor genetic and epigenetic alterations
TP53INP1 (225912_at) – Relevant mutation characteristics in MDS:
IRF4 (204562_at) – Relevance to MDS per:
The following probes/genes were also identified in the top 20 probe IDs, but no evidence of their relevance to MDS was found in other research or literature:
Gene: HMHB1, Probe: 208302_at
Gene: DUSP26, Probe: 219144_at
Gene: MME, Probe: 203434_s_at
Gene: P2RY14, Probe: 206637_at