In this work, we introduced iterative nonnegative matrix factorization (iNMF) based on randomly selected probe sets and applied it to stratify CRC samples in a two-step process, first into two main types and subsequently into five subtypes. In contrast to previous studies, this iterative process enables us to detect a hierarchical relationship between subtypes based on expression differences of varying strength. Because it is based on randomly selected probe sets, iNMF has the advantage of being unbiased with respect to prior knowledge about genes and pathways. The subtype signatures, which consist of differentially expressed probe sets, can be readily applied to hierarchically cluster independent CRC datasets in a two-step process, thereby assigning the samples to their respective subtypes.
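The iterative idea summarized above can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes a plain multiplicative-update NMF of rank 2, a consensus matrix accumulated over random probe subsets, and a bisection at every level (whereas the second step reported here yields three and two subtypes). All function names and parameter values (`n_runs`, `n_probes`, `depth`) are hypothetical choices made for illustration.

```python
import numpy as np

def nmf(V, rank, rng, n_iter=200, eps=1e-9):
    """Basic NMF, V ~ W @ H, using multiplicative updates (Frobenius loss)."""
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Remove the scale ambiguity: unit-sum columns in W, compensated in H.
    scale = W.sum(axis=0) + eps
    return W / scale, H * scale[:, None]

def consensus_split(V, rng, n_runs=20, n_probes=40):
    """Bisect samples (columns of V) via rank-2 NMF on random probe subsets."""
    n_samples = V.shape[1]
    consensus = np.zeros((n_samples, n_samples))
    for _ in range(n_runs):
        # Random probe subset: no prior knowledge about genes or pathways.
        size = min(n_probes, V.shape[0])
        probes = rng.choice(V.shape[0], size=size, replace=False)
        _, H = nmf(V[probes], 2, rng)
        labels = H.argmax(axis=0)
        # Count how often each pair of samples lands in the same cluster.
        consensus += labels[:, None] == labels[None, :]
    consensus /= n_runs
    # The sign of the leading eigenvector of the centered consensus matrix
    # gives a two-way partition (an assumption of this sketch).
    _, vecs = np.linalg.eigh(consensus - consensus.mean())
    return (vecs[:, -1] > 0).astype(int)

def iterative_split(V, rng, depth=2):
    """Apply consensus_split recursively; returns labels such as '1.2'."""
    names = np.array([""] * V.shape[1], dtype=object)

    def recurse(cols, prefix, level):
        if level == depth or len(cols) < 4:
            return
        labels = consensus_split(V[:, cols], rng)
        for g in (0, 1):
            sub = cols[labels == g]
            for c in sub:
                names[c] = prefix + str(g + 1)
            recurse(sub, prefix + str(g + 1) + ".", level + 1)

    recurse(np.arange(V.shape[1]), "", 0)
    return names
```

Here `V` is a nonnegative probe-by-sample expression matrix. The numbering of the resulting groups is arbitrary and does not correspond to the type and subtype names used in this paper.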
The presented CRC stratification was validated in two ways: by clustering independent CRC expression datasets using the identified signatures, and by applying the iNMF algorithm to an independent dataset. Both approaches resulted in highly similar subtypings. These results indicate that our method and stratification are robust and transferable to other datasets, and that the lists of differentially expressed probe sets are applicable to the stratification of independent expression datasets and robust against the confounding factors typically present in such datasets.
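As a rough illustration of how signatures can be applied to an independent sample in two steps, the sketch below first assigns a type and then a subtype within that type. It swaps in a nearest-centroid rule based on Pearson correlation for the hierarchical clustering used in this study, and the probe identifiers, centroid values, and function names are invented for the example.

```python
import numpy as np

def nearest_centroid(x, centroids):
    """Return the label whose centroid correlates best (Pearson) with x."""
    return max(centroids, key=lambda k: np.corrcoef(x, centroids[k])[0, 1])

def assign_two_step(sample, type_sig, subtype_sigs):
    """Assign a type first, then a subtype within that type.

    sample:       dict mapping probe id -> expression value
    type_sig:     (probe ids, {type label: centroid values})
    subtype_sigs: {type label: (probe ids, {subtype label: centroid values})}
    """
    probes, centroids = type_sig
    x = np.array([sample[p] for p in probes], dtype=float)
    tumor_type = nearest_centroid(x, centroids)

    # Second step: only the signature of the assigned type is consulted.
    probes, centroids = subtype_sigs[tumor_type]
    x = np.array([sample[p] for p in probes], dtype=float)
    return tumor_type, nearest_centroid(x, centroids)

# Hypothetical toy signatures; real ones would list probe set identifiers.
TYPE_SIG = (
    ["p1", "p2", "p3", "p4"],
    {"Type 1": [5, 5, 1, 1], "Type 2": [1, 1, 5, 5]},
)
SUBTYPE_SIGS = {
    "Type 1": (
        ["p5", "p6", "p7"],
        {"1.1": [5, 1, 1], "1.2": [1, 5, 1], "1.3": [1, 1, 5]},
    ),
    "Type 2": (["p8", "p9"], {"2.1": [5, 1], "2.2": [1, 5]}),
}
```

A call such as `assign_two_step(sample, TYPE_SIG, SUBTYPE_SIGS)` returns a `(type, subtype)` pair. Because the second step depends on the outcome of the first, an error at the type level propagates to the subtype level, which is one reason to check the stronger type-level separation first.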
The functional analyses of differentially expressed probe sets provided insight into differences in the activation of key signaling pathways between types and subtypes, as well as interesting starting points for further investigations. The first iteration revealed a mesenchymal type (Type 1) and a highly proliferative, epithelial type (Type 2). This difference between the epithelial and mesenchymal types is not correlated with the amount of infiltration by stromal cells, as tumor samples in all subtypes show similar percentages of stromal cells (Table 2). Further stratifying the mesenchymal type identified a subtype with signs of activation of MAPK, TGFβ, and calcium signaling (Subtype 1.1), a subtype with activation of immune system-related pathways (Subtype 1.2), and one with high expression of transporter genes (Subtype 1.3). The subdivision of the epithelial type revealed a subtype showing activation of immune system-related pathways (Subtype 2.1) and a subtype with high expression of genes on chromosomes 13q and 20q (Subtype 2.2).
Many of the pathways identified here as activated in specific subtypes were also shown to be targeted by recurrent alterations in a recent analysis by The Cancer Genome Atlas Network. In that analysis, most samples were found to harbor alterations leading to an activation of WNT signaling, which is in agreement with our finding that WNT is the only pathway analyzed that seems to be activated in both Types 1 and 2. Furthermore, receptor tyrosine kinase-RAS signaling was affected in a substantial number of tumors, and we identified classical MAPK signaling to be activated in Type 1 and specifically in Subtype 1.1. Recently, Seshagiri and colleagues analyzed next-generation sequencing data obtained from 70 primary human colon tumors and found frequent mutations in 356 candidate CRC genes previously identified in screens in mouse models of CRC [52, 53]. More than 8% of these genes are also contained in the signatures associated with the iNMF subtypes presented here. Clusterin, for example, is highly expressed in Type 1 and is known to regulate NF-κB activity and inhibit apoptosis. Type 2 tumors, on the other hand, show high expression of dachshund homolog 1, which inhibits TGFβ signaling through binding to SMAD4 and possibly contributes to the difference in TGFβ signaling between Types 1 and 2. This provides further evidence that the iNMF signatures and the differences in pathway activation between subtypes represent CRC-intrinsic features and contribute to a better understanding of them.
Subtype 1.2 is highly enriched for tumors showing MSI, which have been shown to contain substantial amounts of tumor-infiltrating lymphocytes. Although the average percentage of infiltrating inflammatory cells is comparable across subtypes (Table 2), Subtype 1.2 indeed shows the highest average, and this might have influenced the gene expression signatures. Unexpectedly, Subtype 1.2 is the only subtype that comprises more tumors from female than from male patients. It has previously been reported that the distribution of colorectal tumor locations differs between the sexes, e.g. that right-sided CRC is more common in women, and that pathological and molecular features of the tumors vary between locations. These variations might cause changes in gene expression that are detected by iNMF.
Aligning cell lines with tumor samples to enhance their utility as pre-clinical predictive models has proved challenging for many tumor types. We observed that the four cell line panels investigated here generally provided good coverage of the space of primary tumor samples, in contrast to a study by Auman and McLeod. Although the expression of the signature genes is less consistent in cell lines, replicates from different panels were stratified in a highly consistent fashion. Furthermore, specific biological characteristics agreed between tumor samples and cell lines assigned to the same subtype. The observation that expression patterns for the pathways investigated are not well conserved between cell lines and tumor samples might indicate that canonical pathways do not fully reflect mechanistic complexity. In addition, the non-natural culture conditions of cell lines might affect gene expression, which in turn might change the activation of pathways or the respective expression signal that can be detected. However, the successful alignment of CRC cell lines to the newly identified disease subtypes using the techniques described here reveals that the gene expression profiles that define the subtypes remain largely intact despite extended growth in vitro.
Analysis of two cell line datasets with treatment response data indicated that subtypes respond differently to targeted compounds. Type 2 cell lines are more sensitive to treatment with aurora kinase inhibitors. This is in agreement with the high expression of aurora kinase A in Type 2 tumor samples and suggests that genes included in the signatures might be good candidates for targeted treatment of specific CRC subpopulations. Additionally, pharmacological data for two independent cell line panels suggest that Subtype 1.2 cell lines are the most sensitive to inhibition of Src. These are interesting hypotheses for the treatment of the different CRC subtypes that warrant further investigation.
The comparison with published signatures showed that the five iNMF subtypes are detected neither by any of the existing signatures alone nor by their combination. For example, most tumors in Subtype 1.2 and many tumors in Subtype 2.1 have a high Oh B signature but differ in EMT status. Interestingly, Subtype 1.2 shows a significantly higher sensitivity than Subtype 2.1 to inhibition of proteins in the PI3K pathway (GSK3β, PI3K, and TOR). This suggests that the subtyping presented here allows a more fine-grained subdivision of CRC samples, which is likely to have greater utility for linking molecular features to pharmacology.