Genomics Case Study


Biomarker discovery

MyDataModels was used to model and perform feature selection on the NCBI microarray gene expression dataset GSE19429. This dataset contains Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray data representing gene expression levels for 200 samples. The samples consisted of bone marrow tissue obtained from 183 patients with myelodysplastic syndromes (MDS) and 17 healthy controls. The microarray generated 54,675 attributes per sample. The dataset file size was 103 megabytes.

Acquisition and Preparation of the dataset file

Preparation entailed downloading the series matrix text file, removing the metadata rows, and transposing the table so that rows represented samples and columns represented attributes. The response variable value was derived from the metadata and added as a column.
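The preparation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling actually used for the case study: the function name `series_matrix_to_rows`, the sample IDs `GSM_A` and `GSM_B`, the toy values, and the `labels` mapping are all assumptions made for the example. It relies on the fact that metadata lines in a GEO series matrix file start with `!`.

```python
import csv
import io


def series_matrix_to_rows(text, labels):
    """Turn series-matrix text into one dict per sample (hypothetical helper).

    `labels` maps each sample ID to its response value, which in practice
    would be derived from the metadata rows.
    """
    # Metadata lines start with "!"; drop them along with blank lines.
    data_lines = [ln for ln in text.splitlines()
                  if ln.strip() and not ln.startswith("!")]
    reader = csv.reader(io.StringIO("\n".join(data_lines)), delimiter="\t")
    header = next(reader)          # "ID_REF" followed by the sample IDs
    sample_ids = header[1:]

    # Transpose: the file has one row per probe, we want one row per sample.
    rows = {sid: {"sample": sid} for sid in sample_ids}
    for record in reader:
        probe_id = record[0]
        for sid, value in zip(sample_ids, record[1:]):
            # Assumes every cell is numeric; real files may need
            # handling for missing values.
            rows[sid][probe_id] = float(value)

    # Add the response variable as an extra column.
    for sid in sample_ids:
        rows[sid]["label"] = labels[sid]
    return [rows[sid] for sid in sample_ids]


# Toy input with made-up sample IDs and expression values.
EXAMPLE = (
    '!Series_title\t"toy example"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM_A"\t"GSM_B"\n'
    '"1007_s_at"\t7.1\t6.9\n'
    '"1053_at"\t5.2\t5.8\n'
    "!series_matrix_table_end\n"
)

rows = series_matrix_to_rows(EXAMPLE, {"GSM_A": "MDS", "GSM_B": "healthy"})
for row in rows:
    print(row)
```

For a file of the size described here (54,675 probes by 200 samples), a dataframe library would likely be more convenient, but the logic is the same.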

Rows were then shuffled so that the samples appeared in random order. The healthy vs. MDS attribute was chosen as the response variable to analyze.
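As a rough illustration of the row randomization, the sketch below shuffles a list of sample rows with a fixed seed so the order is random but reproducible. The helper name `shuffle_rows`, the seed value, and the toy rows are assumptions for the example; the source does not say how the shuffle was performed.

```python
import random


def shuffle_rows(rows, seed=0):
    """Return a shuffled copy of `rows` (hypothetical helper).

    A dedicated Random instance with a fixed seed keeps the result
    reproducible without touching the global random state.
    """
    shuffled = list(rows)              # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy rows standing in for the prepared samples.
samples = [{"sample": f"S{i}", "label": "MDS" if i < 8 else "healthy"}
           for i in range(10)]
shuffled_samples = shuffle_rows(samples, seed=42)
print([row["sample"] for row in shuffled_samples])
```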

Results in less than 2 hours

The entire process, including dataset acquisition and preparation, feature reduction, ranking of the entire set of feature attributes, and generation of an explanatory model, took just under two hours. When evaluated against hold-out samples (samples not available for consideration in the analysis process), the resulting explanatory model was 90% accurate, with a true negative rate of 100% and a true positive rate of 89.1%.
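For readers less familiar with these metrics, the sketch below shows how accuracy, true positive rate, and true negative rate are computed from hold-out labels and predictions, treating MDS as the positive class. The function name `classification_rates` is an assumption, and the counts in the example are hypothetical: the source does not report the hold-out size or the confusion matrix. They were chosen only because they yield rates close to those quoted above.

```python
def classification_rates(actual, predicted, positive="MDS"):
    """Compute accuracy, TPR, and TNR (hypothetical helper).

    Assumes both classes occur in `actual`; otherwise a rate's
    denominator would be zero.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "true_positive_rate": tp / (tp + fn),   # share of MDS samples caught
        "true_negative_rate": tn / (tn + fp),   # share of healthy samples cleared
    }


# Hypothetical hold-out set: 46 MDS samples (41 classified correctly)
# and 4 healthy samples (all classified correctly).
actual = ["MDS"] * 46 + ["healthy"] * 4
predicted = ["MDS"] * 41 + ["healthy"] * 5 + ["healthy"] * 4

rates = classification_rates(actual, predicted)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With so few healthy controls in the dataset, the true negative rate rests on a handful of samples, so it is worth reading alongside the true positive rate rather than on its own.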

With regard to the ranked feature set, the top 20 probes identified by the analysis process included 4 probes (231067_s_at, 241679_at, 210517_s_at, and 227530_at) that represent transcripts for the gravin/AKAP12 gene, which other research has found relevant to MDS.

This is of note because MyDataModels placed 4 of the 5 probe IDs for AKAP12 in the top 20, which lends confidence that the analysis process is weighing the merit of all attributes rather than finding useful attributes by chance.

Also among the top 20 attributes were three probe IDs that represent transcripts for ARPP21 (220359_s_at, 1556599_s_at, and 231935_at). The relevance of ARPP21 to MDS has also been reported in other research.

Also identified by MyDataModels in the top 20 probes:
OR7A5 (208285_at) – Gene expression profiling of CD34+ cells in patients with the 5q- syndrome

SH2D4B (1563849_at) and KIAA0226L (previously named C13orf18, probe 44790_s_at) – Both found to be downregulated and differentially expressed in MDS patients per:

BMC Medical Genomics

PPP2R2C (228010_at) – Downregulated in MDS patients per:

Myelodysplastic syndrome hematopoietic stem cell

CD19 (206398_s_at) – Found to be downregulated in MDS patients per:

Diagnostic Potential of CD34+ Cell Antigen Expression in Myelodysplastic Syndromes

P4HA1 (202733_at) – Found to have gene pathway aberrantly methylated in MDS HSCs:

Stem and progenitor cells in myelodysplastic syndromes show aberrant stage-specific expansion and harbor genetic and epigenetic alterations

TP53INP1 (225912_at) – Relevant mutation characteristics in MDS:

Genetic Testing in the Myelodysplastic Syndromes: Molecular Insights Into Hematologic Diversity

IRF4 (204562_at) – Relevance to MDS per:


The following probes/genes were also identified in the top 20 probe IDs, but no relevance to MDS was found in other research or literature:

Probe: 1568611_at

Gene: HMHB1, Probe: 208302_at

Gene: DUSP26, Probe: 219144_at

Gene: MME, Probe: 203434_s_at

Gene: P2RY14, Probe: 206637_at
