CorEx infers gene expression relationships from tumor RNA-seq profiles
CorEx can be used to construct a hierarchy of latent factors that “explain” correlated gene expression among samples with respect to subsets of genes (Fig. 2). Details of the algorithm and implementation can be found in the “Methods” section and the references therein. In the context of RNA-seq tumor profiles, the input to the algorithm is a training matrix consisting of z-scores for each gene among all the samples. The CorEx output is a set of m gene groups with the genes ordered within a group according to their estimated mutual information (MI) with respect to the group’s latent factor and a probabilistic assignment of samples to the k factor labels. Thus for each sample j, group i, and latent factor label k (here k∈{0,1,2}), CorEx provides an estimate of \(P(Y_{i}=k|X_{1}^{j},X_{2}^{j}...X_{n_{i}}^{j})\), where \(X_{g}^{j}\) denotes the expression of gene g in the j
th sample. Here the P(Y=k|X
1,X
2...X
n) for each sample is converted to a single score, f
ji that indicates sample j’s congruence with the factor i label values representing either very high or very low gene expression. This score provides an ordering of the samples with respect to gene expression. Heat maps of genes within groups ordered by MI versus samples ordered by f
ji show how groups with the greatest TC display strong correlations among many genes, with the clearest relationships being between genes with high group MI, whereas lower TC groups show noisier correlations and/or few genes (Fig. 2 and Additional file 1: Figure S1 of heat maps, and the Supplementary data files; Additional files 3 and 4).
Sharing of genes among groups is an integral part of the algorithm and this gene sharing is desirable for a couple of reasons. For one, it known that gene products often perform multiple roles within cells and therefore may participate in multiple interaction networks. For another, given that TC increases with the number of genes (all other things being equal), it is easier for the algorithm to find very small expression cohorts that participate in larger networks when they can be manifested as similar large network groups with different sub-clusters ranked most highly. In this application, genes tend to be associated with several groups on average, however typically only one to four at relatively high MI values.
Due to the complexity of the search space, the same gene groupings are not guaranteed to be found in different runs. However, comparison of gene lists between runs demonstrates that groups with high TC per gene are very likely to be reproduced. Additionally, we find that a Bayes shrinkage prior over the factor probabilities increases reproducibility of groups with low TC, at the cost of possibly suppressing some real but weak signals (Additional file 1: Figure S3). In this work we fix the number of latent factors in layer one to 200. Though this may seem like a relatively large number, the vast majority CorEx factors found appear to contain meaningful correlations. This is evident from a comparison of the group TC values to those resulting from training on a randomly permuted expression matrix (Additional file 1: Figure S4). The large TC values relative to the randomized data strongly suggest that the discovered gene groups are not merely the result of chance correlations in the data. As one moves down the list of groups according to total correlation, confidence in the clusters decreases though there appear to be still biologically meaningful clusters at the lower end, suggesting that training on a greater number of samples or taking into account additional biological information will be useful. The latter is investigated in the following sections.
CorEx gene groups are enriched for protein-protein interactions, GO, and KEGG annotations
A majority of the CorEx gene groups show enrichment for genes that function within specific biological pathways and/or networks as indicated by significant protein-protein interactions (PPI), Gene Ontology terms [21], and KEGG [22] pathway annotations (summarized in Fig. 3
a). Multiple groups are found related to the mitotic cell cycle and its regulation, extracellular matrix organization and interaction, chromatin modification and DNA packaging, electron transport, and regulation of map kinase and growth factor pathways, among others. Genes that code for proteins that participate in larger functional complexes such as ribosomes are also grouped together. Some of the particular PPI graphs are shown in Fig. 3 where it can be seen that CorEx finds networks of genes related to particular immune processes such as Type I interferon signaling, antigen processing and presentation, immune cell activation, migration, and aggregation. In fact, a plurality of found groups have predominantly immune system-related annotations. Overall, the number of groups enriched for annotations as well as their diversity and specificity support the usefulness of CorEx for parsing the information about differentially activated biological processes from the RNA-seq profiles at an unprecedented level of detail. (See Additional file 1: Figure S2 and Supplementary data for all interaction graphs and listing of terms.)
The CorEx groups often display especially clear PPI enrichments with relatively little noise, though all three types of annotations typically exceed threshold together. Groups with greater total correlation are more likely to yield significant database annotations. However, because groups with high TC tend to be larger, groups with somewhat lower TC tend have more specific functional annotations at low p-values. It can be seen that many groups even with quite low TC sometimes have associated annotations. Capturing these groups with relatively weak signals that nonetheless appear biologically coherent was part of the motivation for specifying such a large number of factors. The database annotations offer some guidance as to which may merit closer examination. While some enrichment of annotations within the gene groups is not surprising, CorEx substantially outperforms other methods such as hierarchical clustering, k-means, and principal component analysis in terms of the percentage of gene groups associated with significant biological annotations (Fig. 3
b).
Some of the power of CorEx in this context comes from lifting the restriction of considering only pairwise gene interactions, common in other approaches [23–25]. As a concrete example of the impact of this less restrictive inference method, one need only examine the pairwise relationships of the top ranked genes in one of the CorEx groups, pictured in Fig. 4. The colors of the scatter points indicate the CorEx label value for each tumor sample. In the figure it is clearly evident that the top genes have only weak pairwise expression correlations, yet the clouds of similarly hued points remain coherent across the various plots showing their higher dimensional correlation. The corresponding expression heat map for group 106 (Additional file 1: Figure S1) also shows clearly that, while any individual gene appears somewhat noisy, the trend across the whole set is easily discernible. The full set of genes from this group exhibit PPI enrichment as well as significant enrichment for the KEGG pathways: ‘pathways in cancer’, ‘Wnt signaling pathway’, ‘ proteoglycans in cancer’, ‘basal cell carcinoma’, and ‘Hippo signaling pathway’. CorEx detects this important set of expression relationships that would almost certainly be overlooked based upon pairwise correlations. Additionally, the sensitivity for groups that exhibit stronger pairwise correlation for the top genes can be increased when those signals are reinforced with signals from lower-ranked genes exhibiting weak pairwise correlations but high TC relative to the group as a whole.
The CorEx hierarchy of factors has biological significance
The hierarchy of CorEx latent factors appears to reflect biological organization on multiple levels. Though the leaves in the CorEx tree have a great diversity of associated GO terms, those that are shared between children and their common parent node display a trend toward increased enrichment of higher level terms in the L2 cluster (Additional file 1). Some of these higher level functional relationships are highlighted in Fig. 5. Clustering of Gene Ontology terms in the groups reveal layer two for immune signaling, extra-cellular matrix organization, immune cell activation, mitotic cell cycle, microtubule-related processes, and electron transport/chromosome organization. While in some cases this is due to redundancy of membership among the layer one gene groups (though the genes are ranked differently within the groups, the GO analysis here did not account for relative importance), in others there is a clear distinction in gene membership and function on the finer scale. For example, both groups 31 and 110 are incorporated into the large immune-related cluster near the center of Fig. 5
a. However, these groups have both different associated GO annotations and zero genes in common. Group 31 has gene ontology terms associated with immune cell activation (e.g. GO:45321 for genes EBI3, Cd79A, BATF, FCER1G, CCL5, EOMES, CD3D, Cd7, LCP1, ZNF683, CD2, ITK, THEMIS, B2M), adhesion (GO:7159 for genes EBI3, RAC2, BATF, CCL5, EOMES, CD3D, Cd7, LCP1, CD2, ITK, THEMIS, B2M), and migration (GO:2687 for genes SELL, RAC2, CCL4, CXCL13, CCL5, CXCL11, CXCL9, and CCL8). In contrast, group 110 primarily has terms associated with inflammatory response (GO:6954 for genes IL4R, MEFV, TGFB1, NLRP3, SIGLEC1, C5AR1, CD163, PIK3CG, NLRC4, TLR4, TNFRSF1B), apoptotic process (GO: 6915 for genes TGFB1, HGF, ABR, C5AR1, MKL1, NICAL1, MAP3K5, PIK3CG, INPP5D, NLRC4, TRAF1, TLR4, TNFRSF1B), and cell death (GO:43067 for genes STK10, HGF, ABR, SIGLEC1, C5AR1, MKL1, MICAL1, MAP3K5, PIK3CG, INPP5D, NLRC4, NCF2, TRAF1, TLR4, JAK3). Thus higher level clusters such as immune response are parsed at the first layer according to specific biological sub-networks.
A particularly striking example of the kind of information that can be gleaned from the layer two factors is shown in Fig. 5
b. One can see a layer two factor (node) that joins four clusters of ribosomal proteins. The fifth layer one cluster is comprised of a set of non-coding RNAs related to cancer metastasis through regulation of growth arrest and protein folding. Thus a cluster of non-coding regulatory factors is linked to the downstream interactions of proteins they influence. While the particular information highlighted here is already known, the presence of these meaningful relationships suggest that novel relationships can, in principle, be discovered by CorEx. This is more likely to happen when many thousands of samples are available to allow less common relationships and perhaps even tumor-specific rewiring of interactions to be seen.
We quantified the change in biological information content (IC) in the layers by calculating enriched GO term IC as a function of layer level. The results are shown in Fig. 6. Simply calculating the frequency distribution of IC for different layers shows a relatively high frequency of low information terms in the lower layer groups. However, if one considers only GO terms that appear as significantly enriched in both a child and its parent node, there is a clear trend toward more significant enrichment of low IC terms in the parent nodes. Increasing significance of low IC terms in the parent nodes indicate an aggregation of gene groups that reflect more specialized terms (hence functional annotation) in the children.
Patient stratification based upon combinations of latent factors yields differential survival associated with tumor-activated pathways
The continuous labels for the CorEx latent factors can be used as predictors of survival under a Cox proportional hazard model [26, 27]. A subset of the CorEx small to intermediate-sized gene groups was selected according to known biological significance (Gene Ontology enrichment) and used to stratify patients according to relative risk of death under a Cox proportional hazard model for the factor scores. We first constructed models for each factor individually. When this is done and patients are stratified roughly into thirds according to low, intermediate, and high relative risk of death, several gene expression groups appear to be associated with differential survival, as judged by the p-value for the difference between the two extremal Kaplan-Meier survival curves (Fig. 7
a and Additional file 1: Figure S5). False discovery rates (FDRs) were also estimated by randomly permuting the sample scores for each latent factor (shuffling each column) in the score matrix. The top single factor groups appear somewhat weak by this measure as the values are strongly limited by the small number of patient samples versus the number of potentially relevant networks. Using this analysis, the top latent factor group, related to immune chemokine and interferon signaling, yields an FDR level close to.3. Genes with large MI with respect to this group factor include among others CXCL10, CXCL11, CXCL9, CXCL13, ISG20, GBP (1-5), OASL, HERC5, STAT1, IFIT1, IFI35, TRIM22, and JAK2. Other groups that exhibit survival associations encompass additional immune system factors, fibroblast growth factor regulation, transforming growth factor beta signaling, map kinase regulation, Wnt signaling, Jak-STAT signaling, and cellular metabolism.
A perhaps even more compelling finding is that combinations of these statistically weak factors yield increased significance as pairs relative to a randomization test for paired factors, with 16/21 combinations yielding FDRs less than about.35 (Fig. 7
b). This suggests synergy of individual factors in terms of impact on survival. The pair of factors that show the greatest association with survival is immune system cytokine signaling combined with regulation of the fibroblast growth factor pathway. These two general features are hypothesized to be associated with long term survival in ovarian cancer from other studies, the latter due to its relationship to chemo resistance [28, 29]. The advantage here with respect to these previously known prognostic features is that each patient is assigned a relative risk according to activation of related pathways and therefore shows how they may combine to influence survival, indicating patients who might benefit from either single or simultaneous treatment with immune system modulation or FGF pathway inhibitors. Additionally, the fine parsing of overall immune system function by the sub-networks present in different groups has the potential to discriminate among different immunotherapies in terms of likely efficacy on a personalized basis. Overall, most of the top survival-associated combinations involve immune-related factors, emphasizing the primacy of the immunological milieu in influencing ovarian cancer survival rates in this cohort under the standard treatment protocols. The combination of a factor related to Wnt signaling and one for fibroblast growth factor signaling also appears predictive of survival in a subset of patients. Wnt signaling has also been associated with clinical prognosis in ovarian cancer [30]. Other factors and combinations of groups show narrower survival differentials. We expect that some of these play important roles for a relatively smaller proportion of patients and will gain significance once a greater number of tumor RNA-seq samples are available for analysis.
We performed survival analysis using PCA factors as well. While CorEx and PCA appear comparable in terms of the number of factors that can be used to stratify patients according to survival, the number and quality of biological enrichments for the CorEx factors is substantially greater than that for PCA (Fig. 3
b and Additional file 1: Figure S7).
It should be emphasized that in this retrospective analysis of a patient cohort that was treated prior to 2010, we are able to analyze the survival impact only with respect to standard carbo/taxol therapies as no targeted therapies were widely available or approved. Cell line and xenograft experiments or analysis of tissues from other clinical trials will likely bring to the fore other factor groupings that are associated with response to recently approved or experimental therapies, e.g. angiogenesis inhibitors, PARP inhibitors, and apoptosis signaling modulators to name a few [31–33].
CorEx discovers differential expression in ovarian tumors of EMT-related genes previously implicated in breast cancer stemness and metastasis
A survival differential at longer times only is seen for one CorEx gene expression factor (denoted G159). This suggests that this factor may influence survival in a way that is not related to initial carbo/taxol response per se. Detailed examination of the Gene Ontology annotations for this latent factor cluster revealed genes associated with microRNAs in cancer (6 genes), chromatin modification (8 genes), regulation of transcription from RNA pol II promoter (10 genes), stem cell differentiation (5 genes), and regulation of Wnt signaling (5 genes). Further, when this factor is combined in a survival model along with the top single factor within the same CorEx run (G103, immune cytokine signaling), the resulting stratification yields a more significant survival association than any other pair of factors from that run (Fig. 8
a).
Intrigued by the possibility of a factor related to something more general than response to a specific chemotherapy, we sought to validate its significance using the TCGA breast cancer data. The reasoning behind this was that while breast tumors exhibit some similarity on a molecular basis to ovarian tumors, there are also significant differences. Earlier detection and a greater variety of chemotherapeutic options means survival will depend largely upon factors not directly related to carbo/taxol response. We calculated factor scores for TCGA breast cancer tumor RNA-seq samples across the factors determined by the ovarian cancer training samples (i.e. we did not train a new set of latent factors, but mapped the breast cancer samples to the factors learned from ovarian tumors). Using these scores to fit a stratified model with respect to the G159 latent factor yields a survival differential for the breast cancer patients (Fig. 8
b). We subsequently discovered that the genes in this group have important functions within a network that was recently studied by Song et al. [34] in preclinical models of breast cancer, where it was shown that TET3 controls expression of DNMT3A, a DNA methyl transferase that exerts specific influence on the chromatin state related to stemness (Fig. 8
c). Further, the network is implicated in the progression of breast cancer via the epithelial-mesenchymal transition, invasion, and metastasis. Mesenchymal-like properties of tumor cells and EMT-associated features in general have been associated with poor prognoses in ovarian cancer [35], though this particular network has not been previously implicated. Importantly, this network was shown to be controlled by a microRNA acting on TET3, thus providing a novel target for drug development [34]. Observation of this mechanism in ovarian cancer tumors suggests not only a new biological driver for recurrence but also the possibility for selection of ovarian cancer patients who might benefit from any experimental miR-22 modulators that may be developed for breast cancer.
Some latent factors can be used to distinguish between normal and cancerous ovarian tissue
We trained a CorEx factor model on RNA-seq profiles from both normal and tumor ovarian tissue samples. In order to mitigate differences due to experimental variations, the expression values in this case were replaced with sample-relative percentiles [36]. The efficacy of this approach can be seen in group heat maps that demonstrate smooth variation in gene expression across tumor and normal samples (Additional file 1: Figure S8). While normal samples and similar tumors appear close to one another in the heat maps, there is still sometimes a qualitative difference that appears related to signal variance. Deeper exploration of the issues of cross-experiment comparison for RNA-seq is beyond the scope of this work.
The trained factors show clear distinctions in gene expression patterns between tumors and normal ovarian tissue and these appear related to a variety of biological processes including ones for development and regulation of cell differentiation, neurogenesis, apoptosis, sex differentiation, cell migration, and inflammatory response. Histograms of the factor scores for a few representative groups are shown in Fig. 9. These findings highlight some commonalities of pathway dysregulation among ovarian tumors. Further, several of the groups contain genes that can be targeted with existing drugs [37, 38]. This presents the possibility that repurposing of existing drugs based upon this sort of data may provide new therapeutic options for a great many ovarian cancer patients, though the implications within these particular gene expression cohorts need to be further elucidated.