Skip to main content
  • Research Article
  • Open access
  • Published:

Comprehensive discovery of subsample gene expression components by information explanation: therapeutic implications in cancer



De novo inference of clinically relevant gene function relationships from tumor RNA-seq remains a challenging task. Current methods typically either partition patient samples into a few subtypes or rely upon analysis of pairwise gene correlations that will miss some groups in noisy data. Leveraging higher dimensional information can be expected to increase the power to discern targetable pathways, but this is commonly thought to be an intractable computational problem.


In this work we adapt a recently developed machine learning algorithm for sensitive detection of complex gene relationships. The algorithm, CorEx, efficiently optimizes over multivariate mutual information and can be iteratively applied to generate a hierarchy of relatively independent latent factors. The learned latent factors are used to stratify patients for survival analysis with respect to both single factors and combinations. These analyses are performed and interpreted in the context of biological function annotations and protein network interactions that might be utilized to match patients to multiple therapies.


Analysis of ovarian tumor RNA-seq samples demonstrates the algorithm’s power to infer well over one hundred biologically interpretable gene cohorts, several times more than standard methods such as hierarchical clustering and k-means. The CorEx factor hierarchy is also informative, with related but distinct gene clusters grouped by upper nodes. Some latent factors correlate with patient survival, including one for a pathway connected with the epithelial-mesenchymal transition in breast cancer that is regulated by a microRNA that modulates epigenetics. Further, combinations of factors lead to a synergistic survival advantage in some cases.


In contrast to studies that attempt to partition patients into a small number of subtypes (typically 4 or fewer) for treatment purposes, our approach utilizes subgroup information for combinatoric transcriptional phenotyping. Considering only the 66 gene expression groups that are found to both have significant Gene Ontology enrichment and are small enough to indicate specific drug targets implies a computational phenotype for ovarian cancer that allows for 366 possible patient profiles, enabling truly personalized treatment. The findings here demonstrate a new technique that sheds light on the complexity of gene expression dependencies in tumors and could eventually enable the use of patient RNA-seq profiles for selection of personalized and effective cancer treatments.

Peer Review reports


In recent years, innovations in high-throughput sequence-based assays that allow for the rapid and cheap interrogation of the genomes and transcriptomes of individual tumors have given rise to fresh hope for significant progress toward understanding of complex cancer biology and effective treatments [1]. Inference of clinically relevant gene expression networks based upon RNA-seq profiles could provide crucial information in this context. The tumor transcriptome has the potential to provide a view of the functional network alterations that result from a variety of causes, including genetic mutations, copy number variations, epigenetic states, the tumor microenvironment, immunophenotype variations and even transient dynamic effects. In principle, machine learning could be used to leverage the rich information that gene expression provides about each individual to infer tumor-specific features that may guide treatment decisions. Unfortunately, several challenges limit the usefulness of existing methods. The data is high dimensional and the number of samples is small. For this reason, many methods focus on simple linear, pairwise relationships, limiting their generality. Clustering techniques typically use too broad a brush, pigeon-holing patients into a small number of clusters when understanding the interplay of multiple independent factors is crucial. Finally, interpretability of results is essential in this context in order to provide insight into biological mechanisms that could motivate new treatments.

To overcome these limitations, we apply a recently developed machine learning algorithm, CorEx [2, 3], to the analysis of RNA-seq transcriptomes. Multivariate and nonlinear dependencies in the data are captured by total correlation, a generalization of mutual information to multiple variables. Intuitively, CorEx uses an optimization scheme in order to efficiently infer latent variables that explain as much of the dependence in the data as possible. Each latent factor discovered by CorEx is a function of some possibly overlapping subset of the input variables, i.e. genes, with high total multivariate information, and reflects a relatively independent axis of variation in comparison to the other latent factors. Further, each latent factor defines a separate probabilistic clustering of samples, and thus provides a factor-specific stratification of tumors according to expression patterns of subgroups of genes across the tumor samples. We can apply CorEx again to these inferred latent factors to discover a hierarchy of dependencies among genes.

Ovarian cancer is the deadliest gynecologic malignancy due to its typically late stage at diagnosis, however it is responsive to a wide variety of therapies. In addition to the standard platinum/taxane combination for frontline, active second line therapies include pegylated doxorubicin, topoisomerase inhibitors, angiogenesis inhibitors, anti-metabolites and PARP inhibitors [4, 5]. Ovarian cancer also appears to be an immunogenic tumor for which checkpoint inhibitors, vaccines, and other immunotherapies may provide meaningful benefit [69]. Thus ovarian cancer tumors can be expected to display a rich variety of gene expression patterns related to outcomes under various treatments. On the other hand, current second line treatments are effective for less than 30% of patients, are not curative, and come with often serious cumulative toxicities. With the exception of DNA repair pathway mutations for PARP inhibitors, there are no genetic markers currently in use to predict therapeutic response. Thus new ways to match ovarian cancer patients to effective therapies based upon individual tumor biology will likely yield substantial gains in survival.

In this work we show how the use of CorEx to analyze ovarian cancer RNA-seq profiles uncovers relatively small cohorts of genes that exhibit coherent patterns of differential expression among tumors and that, in many cases, these cohorts are associated with biological pathways as indicated by database annotations of the genes. The lower levels of the hierarchy of inferred gene clusters provide a parsing of the upper layer nodes with respect to function. An example is a layer one node that groups together regulatory noncoding RNAs to regulated protein coding elements. Patient stratification with respect to individual gene groups shows differential survival associations in groups associated with known factors in chemo-responsiveness. Even more significantly, stratifications based upon combinations of group factors display synergistic survival associations. We highlight discovery of a factor containing genes related to epigenetic control of stem cell characteristics that have not been previously observed in ovarian tumors. Additionally, when breast cancer data not used for training is stratified relative to this latent factor, a survival association is also observed. Finally, we compare the group results in ovarian tumors to expression profiles in normal ovarian tissue and demonstrate substantial differences in some cases. The strong results presented here demonstrate a powerful tool for dissecting therapeutically relevant gene expression relationships in tumor RNA-seq, and suggest it may be usefully applied to infer complex correlation structure from other types of high-dimensional, noisy biological data [10].


CorEx Algorithm

The random variables, X 1,…,X n, written as X for short, represent measured gene expression values. Our goal is to learn a small number of latent factors, Y=Y 1,…,Y m, so that the X i’s are (nearly) independent after conditioning on Y. In other words, we are looking for latent factors that explain the relationships in the gene expression data. Each latent factor will depend on some subset of the genes, and it can take discrete values, Y j=1,…,c, that correspond to relative expression levels for that group of genes. We found that stratifying each latent factor into three groups was adequate for capturing relationships without causing problems with estimation. Therefore, our procedure defines a clustering of genes associated with each Y j and, for each group of genes, a stratification of patients into three groups. Furthermore, this clustering procedure can be iteratively applied to the latent factors, Y, themselves to produce a hierarchical clustering of the genes. We now review the technical approach for discovering the latent factors and clusters (Fig 1).

Fig. 1
figure 1

A simple hierarchical model constructed by CorEx. Variables in each layer are associated with latent factors in the next higher layer with a weight that is determined by optimizing over its relative contribution to the explanatory latent factors. Visually we set the thickness of links based on mutual information between variables and the thickness of nodes is proportional to the information contributed by that factor toward the objective, TC(X;Y)

We assume that the measured expression profile for one patient corresponds to a sample, x, drawn from some unknown distribution over the random variables (genes), X, which is written as p(X=x). Similarly, we can write the marginal probability for a single gene as p(X i=x i). Multivariate mutual information, or total correlation (TC), quantifies the amount of dependence among a set of variables [11]. Total correlation can be defined in terms of the Kullback-Leibler divergence, D KL, or in terms of H, the Shannon entropy [12] as follows.

$$ \begin{aligned} TC(X) &\equiv \sum_{i} H(X_{i}) - H(X)\\ &= D_{KL}\left(p(X=x) || \prod_{i} p(X_{i} = x_{i})\right) \end{aligned} $$

Intuitively, TC represents the distance between the true distribution, and the distribution under the null hypothesis that all variables are independent. TC is zero if and only if the variables are all independent1. If we could enumerate all of the underlying causes of dependence in the data, then the TC conditioned on these factors would be exactly zero.

The recently introduced method of Total Correlation Explanation (CorEx) is an information-theoretic optimization that reconstructs latent factors that explain as much of the dependence in the data as possible [2, 3]. The principle behind CorEx is to search for a small set of latent factors, Y=Y 1,…,Y m, so that the TC among the original variables is minimized after conditioning on Y. We refer to TC(X;Y)=TC(X)−TC(X|Y) as the correlation in X that is explained by Y and we seek to maximize this expression.

We would like to maximize TC(X;Y) over all ways of assigning discrete values to each Y j as a function of X. As mentioned, we restrict Y j’s to take three possible values so that, e.g., Y j=0,1, and 2 might correspond to relatively low, neutral, and high expression. For each sample of data, X=x l, with l=1,…,N indexing the samples, we define a probability distribution over possible values for each latent factor j, p(Y j=y j|X=x l). Surprisingly, the optimization of TC(X;Y) over all of these probability distributions is computationally efficient and has low sample complexity [3]. We now describe the iterative procedure for solving this optimization.

What makes our optimization tractable is that we can write the solution for p(Y j=y j|X=x) in an analytic form that defines an iterative update scheme that is guaranteed to converge to a local optimum. We begin at iteration t=0, with a random probability distribution, p t=0(Y j=y j|X=x l), over each latent factor and for each sample. The solution depends on two types of parameters. First, α i,j are weights between zero and one that determine how much Y j depends on X i. Second, p(X i=x i|Y j=c) is the marginal probability of observing X i=x i given an observed label Y j=c. We take this distribution to be a normal distribution with mean, μ i,j,c and variance, σ i,j,c. The parameters, α i,j,μ i,j,c,σ i,j,c, along with the class weights, can be updated at each time t according to the following rules:

$$\begin{array}{*{20}l} \alpha_{i,j}^{t+1} = I^{t}(X_{i};Y_{j}|Y_{1},\ldots,Y_{j-1})/I^{t}(X_{i};Y_{j}) \end{array} $$
$$\begin{array}{*{20}l} p^{t+1}(Y_{j}=c) =\frac{1}{N} \sum_{l} p^{t}\left(Y_{j}=c | X=x^{l}\right) \end{array} $$
$$\begin{array}{*{20}l} \mu_{i,j,c}^{t+1} = \frac{\frac{1}{N}\sum_{l} p^{t}\left(Y_{j}=c | X=x^{l}\right) x_{i}^{l}}{p^{t+1}(Y_{j}=c)} \end{array} $$
$$\begin{array}{*{20}l} \sigma_{i,j,c}^{t+1} = \sqrt{\frac{\frac{1}{N}\sum_{l} p^{t}\left(Y_{j}=c | X=x^{l}\right) \left(x_{i}^{l}-\mu_{i,j,c}^{t+1}\right)^{2}}{p^{t+1}(Y_{j}=c)}}\\ \end{array} $$

In practice, we alter the first rule to speed up computation and we devised a novel Bayesian estimate for the last two update rules that improves results on datasets with a small number of samples. Details of these revised updates are described in Additional file 1: Supplementary Methods.

The quality of the solution that is found by CorEx is quantified through the expression TC(X;Y). As we add more latent factors this objective increases before achieving a peak and then decreasing. Because of computational limitations, we used only 200 latent factors at the first layer even though more would have resulted in higher TC. At the higher levels, we tried different numbers of latent factors to optimize TC(X;Y) leading us to use 30 factors at layer 2 and 8 at layer 3. With this number of factors, each CorEx run took about two days on a single Amazon “r3.xlarge” node with 30 GB of RAM.

As a clustering method, CorEx has been favorably compared against standard methods like spectral clustering, k-means, and hierarchical clustering [2]. CorEx can also be viewed as a type of dimensionality reduction and in that capacity has been compared to PCA, independent component analysis, non-negative matrix factorization, and Isomap [2]. More generally, CorEx represents a new approach to unsupervised deep representation learning and, as such, is best compared to “neural network” approaches like autoencoders and restricted Boltzmann machines against which it has demonstrated the ability to recover much more interpretable structure [3].

Training matrix data

Gene level RNA-seq values (normalized to reads per kilobase per million or RPKM) for 420 high grade serous ovarian tumor samples and 780 breast cancer samples were downloaded from the TCGA data portal (The TCGA Research Network:, along with available clinical metadata for the samples. Somatic mutation data for 316 samples was obtained separately from the cBioPortal for Cancer Genomics [13, 14]. A training set of genes was selected by combining the RPKM data for the top 3000 genes by variance in expression with the RPKM values for the 2371 genes that were mutated in more than one sample. The gene RPKM values for the selected genes were then converted to z-scores to form the 420x5371 training matrix Additional file 2. 376 samples were associated with survival times in the clinical Biotab file and used for the Cox proportional hazard analyses. GTEx normal tissue data [15] for 39 samples was downloaded from the GTEx portal ( For the purposes of training on tumor and normal samples together, expression values were converted to percentiles within each sample to mitigate batch effects between the TCGA and GTex projects.

Database annotations

KEGG, GO, and PPI enrichments for the gene groups were obtained using the stringdb package in R [16, 17]. A maximum of 400 genes were used from each group. For groups from CorEx runs that utilized the Bayes shrinkage prior, only genes with a weighted mutual information greater than a threshold of 0.002 were retained for annotation enrichment purposes. FDR values with respect to the 200 groups (corresponding to the p-value cutoffs used in the Circos [18] plot in Fig. 2) were calculated by random selection from the set of 5371 genes used for training. The genes were alloted to groups with a size distribution identical to that of the CorEx discovered groups. In order to obtain summary GO terms for upper layer nodes in the CorEx factor hierarchy, the lists of GO terms from stringdb were submitted to the Revigo web server [19], which clusters GO terms to select a subset of distinct representative terms. GO term information content was approximated as −log(n d/n r) where n d denotes the number of descendants of the term in the gene ontology and n r is the number of descents of the root term.

Fig. 2
figure 2

Correlation Explanation algorithm applied to tumor RNA-seq. For training, CorEx is provided only a matrix of normalized gene expression values for the available tumor samples. The number of possible labels for each latent factor is specified, here set to three. In this application, we also set the number of layer one latent factors to 200. CorEx finds probabilistic assignments of genes to latent factors by maximizing the total correlation of the genes in groups, simultaneously minimizing dependence between latent factor groups. The factor labels from lower layers are used as input to upper layers in order to generate a hierarchical model. The output from CorEx is thus a hierarchical model, specified as a set of probabilities characterizing the association of genes or factors with latent factors at the next highest layer as well as probabilities that a given tumor sample’s expression pattern can be explained by each factor in a particular label state. The three probabilities for a given tumor sample and factor are can be usefully summarized by a single value that is the natural logarithm of the probability difference for the factor labels corresponding to extremal expression. The genes with high mutual information relative to latent factors show clear patterns of correlation when viewed on expression heat maps with tumor samples ordered by the summary latent factor score

Clustering comparisons

Hierachical and k-means clustering as well as PCA analysis were done using Cluster 3.0 [20]. Hierarchical clustering was performed using average linkage and a similarity metric of absolute correlation on centered data vectors. Discrete clusters were chosen by the cut method. PCA clusters were formed by including genes with coefficients greater then a threshold for each component such that the resulting clusters had an average number of genes (actual 78) comparable to CorEx groupings. 200 clusters were specified in all cases for comparison purposes.

Survival analysis

For survival analysis, the gene groups were filtered to retain only those with fewer than 100 correlated genes in each group above threshold significance and those associated with gene ontology terms with adjusted p-values less than a 1% FDR threshold. Additionally, fourteen groups were eliminated that were presumed to represent chromosomal location correlations rather than network interactions. This filtering resulted in 66 groups that were then analyzed for single factor survival associations under a coxph model in R. Patients were stratified according to relative risk under a predictive coxph model. Patients were assigned to one of three groups according to whether they were in the bottom 30%, top 30%, or middle 40% of risk of death. The top individual factors with differences in survival (determined by p-values between the two extremal strata as calculated by the survdiff function less than 0.05) were retained for combination analysis. The false discovery rate for this procedure was estimated by repeatedly selecting 66 factor groups (200 times), randomly shuffling their factor labels with respect to patient time of death and calculating p-values for both single and combination factors in the same fashion as before. False discovery rates were then calculated as the number of randomized groups or combinations exhibiting p-values less than a given threshold divided by the observed number of groups or combinations with p-values less than that threshold.


CorEx infers gene expression relationships from tumor RNA-seq profiles

CorEx can be used to construct a hierarchy of latent factors that “explain” correlated gene expression among samples with respect to subsets of genes (Fig. 2). Details of the algorithm and implementation can be found in the “Methods” section and the references therein. In the context of RNA-seq tumor profiles, the input to the algorithm is a training matrix consisting of z-scores for each gene among all the samples. The CorEx output is a set of m gene groups with the genes ordered within a group according to their estimated mutual information (MI) with respect to the group’s latent factor and a probabilistic assignment of samples to the k factor labels. Thus for each sample j, group i, and latent factor label k (here k{0,1,2}), CorEx provides an estimate of \(P(Y_{i}=k|X_{1}^{j},X_{2}^{j}...X_{n_{i}}^{j})\), where \(X_{g}^{j}\) denotes the expression of gene g in the j th sample. Here the P(Y=k|X 1,X 2...X n) for each sample is converted to a single score, f ji that indicates sample j’s congruence with the factor i label values representing either very high or very low gene expression. This score provides an ordering of the samples with respect to gene expression. Heat maps of genes within groups ordered by MI versus samples ordered by f ji show how groups with the greatest TC display strong correlations among many genes, with the clearest relationships being between genes with high group MI, whereas lower TC groups show noisier correlations and/or few genes (Fig. 2 and Additional file 1: Figure S1 of heat maps, and the Supplementary data files; Additional files 3 and 4).

Sharing of genes among groups is an integral part of the algorithm and this gene sharing is desirable for a couple of reasons. For one, it known that gene products often perform multiple roles within cells and therefore may participate in multiple interaction networks. For another, given that TC increases with the number of genes (all other things being equal), it is easier for the algorithm to find very small expression cohorts that participate in larger networks when they can be manifested as similar large network groups with different sub-clusters ranked most highly. In this application, genes tend to be associated with several groups on average, however typically only one to four at relatively high MI values.

Due to the complexity of the search space, the same gene groupings are not guaranteed to be found in different runs. However, comparison of gene lists between runs demonstrates that groups with high TC per gene are very likely to be reproduced. Additionally, we find that a Bayes shrinkage prior over the factor probabilities increases reproducibility of groups with low TC, at the cost of possibly suppressing some real but weak signals (Additional file 1: Figure S3). In this work we fix the number of latent factors in layer one to 200. Though this may seem like a relatively large number, the vast majority CorEx factors found appear to contain meaningful correlations. This is evident from a comparison of the group TC values to those resulting from training on a randomly permuted expression matrix (Additional file 1: Figure S4). The large TC values relative to the randomized data strongly suggest that the discovered gene groups are not merely the result of chance correlations in the data. As one moves down the list of groups according to total correlation, confidence in the clusters decreases though there appear to be still biologically meaningful clusters at the lower end, suggesting that training on a greater number of samples or taking into account additional biological information will be useful. The latter is investigated in the following sections.

CorEx gene groups are enriched for protein-protein interactions, GO, and KEGG annotations

A majority of the CorEx gene groups show enrichment for genes that function within specific biological pathways and/or networks as indicated by significant protein-protein interactions (PPI), Gene Ontology terms [21], and KEGG [22] pathway annotations (summarized in Fig. 3 a). Multiple groups are found related to the mitotic cell cycle and its regulation, extracellular matrix organization and interaction, chromatin modification and DNA packaging, electron transport, and regulation of map kinase and growth factor pathways, among others. Genes that code for proteins that participate in larger functional complexes such as ribosomes are also grouped together. Some of the particular PPI graphs are shown in Fig. 3 where it can be seen that CorEx finds networks of genes related to particular immune processes such as Type I interferon signaling, antigen processing and presentation, immune cell activation, migration, and aggregation. In fact, a plurality of found groups have predominantly immune system-related annotations. Overall, the number of groups enriched for annotations as well as their diversity and specificity support the usefulness of CorEx for parsing the information about differentially activated biological processes from the RNA-seq profiles at an unprecedented level of detail. (See Additional file 1: Figure S2 and Supplementary data for all interaction graphs and listing of terms.)

Fig. 3
figure 3

Gene Ontology, KEGG, and protein interaction enrichment for CorEx gene groups. a A circos plot indicates which latent factor groups of the 200 inferred by CorEx are enriched for genes on KEGG pathways (yellow), Gene Ontology terms (orange) and Protein-Protein interactions (blue). A tick mark of the corresponding color is drawn for groups exhibiting enrichment greater than a threshold. Further, the blue PPI tick marks have heights proportional to the relative enrichment (observed/expected). The group results appear in order of decreasing Total Correlation, with the number of genes in a group indicated by the green shading on the outer ring of the circos plot. (darker means a greater number of genes with a range from 4 to 1135) Cutoffs were initially chosen according to the presence of coherent biological signals, however randomization tests indicate corresponding FDR values of 1%, 5%, and 10% for GO, KEGG, and PPI, respectively. The vast majority of CorEx groups show annotation enrichment, typically for all three types. A sample of string network connections are shown on the outside, showing both a diversity of network function and relatively few apparently spuriously associated genes within the groups. b Annotation enrichments of CorEx groups compared to standard clustering methods including hierarchical clustering, k-means, and pca. Enrichment results are based upon 200 clusters for all methods

The CorEx groups often display especially clear PPI enrichments with relatively little noise, though all three types of annotations typically exceed threshold together. Groups with greater total correlation are more likely to yield significant database annotations. However, because groups with high TC tend to be larger, groups with somewhat lower TC tend have more specific functional annotations at low p-values. It can be seen that many groups even with quite low TC sometimes have associated annotations. Capturing these groups with relatively weak signals that nonetheless appear biologically coherent was part of the motivation for specifying such a large number of factors. The database annotations offer some guidance as to which may merit closer examination. While some enrichment of annotations within the gene groups is not surprising, CorEx substantially outperforms other methods such as hierarchical clustering, k-means, and principal component analysis in terms of the percentage of gene groups associated with significant biological annotations (Fig. 3 b).

Some of the power of CorEx in this context comes from lifting the restriction of considering only pairwise gene interactions, common in other approaches [2325]. As a concrete example of the impact of this less restrictive inference method, one need only examine the pairwise relationships of the top ranked genes in one of the CorEx groups, pictured in Fig. 4. The colors of the scatter points indicate the CorEx label value for each tumor sample. In the figure it is clearly evident that the top genes have only weak pairwise expression correlations, yet the clouds of similarly hued points remain coherent across the various plots showing their higher dimensional correlation. The corresponding expression heat map for group 106 (Additional file 1: Figure S1) also shows clearly that, while any individual gene appears somewhat noisy, the trend across the whole set is easily discernible. The full set of genes from this group exhibit PPI enrichment as well as significant enrichment for the KEGG pathways: ‘pathways in cancer’, ‘Wnt signaling pathway’, ‘ proteoglycans in cancer’, ‘basal cell carcinoma’, and ‘Hippo signaling pathway’. CorEx detects this important set of expression relationships that would almost certainly be overlooked based upon pairwise correlations. Additionally, the sensitivity for groups that exhibit stronger pairwise correlation for the top genes can be increased when those signals are reinforced with signals from lower-ranked genes exhibiting weak pairwise correlations but high TC relative to the group as a whole.

Fig. 4
figure 4

The use of multivariate mutual information increases the clustering power of CorEx. Pairwise correlations of normalized expression of individual samples for the top genes in group 106 appear very weak. The factor label values (indicated by color) show strong coherence across the different plots indicating how samples cluster more strongly in terms of total correlation

The CorEx hierarchy of factors has biological significance

The hierarchy of CorEx latent factors appears to reflect biological organization on multiple levels. Though the leaves in the CorEx tree have a great diversity of associated GO terms, those that are shared between children and their common parent node display a trend toward increased enrichment of higher level terms in the L2 cluster (Additional file 1). Some of these higher level functional relationships are highlighted in Fig. 5. Clustering of Gene Ontology terms in the groups reveal layer two for immune signaling, extra-cellular matrix organization, immune cell activation, mitotic cell cycle, microtubule-related processes, and electron transport/chromosome organization. While in some cases this is due to redundancy of membership among the layer one gene groups (though the genes are ranked differently within the groups, the GO analysis here did not account for relative importance), in others there is a clear distinction in gene membership and function on the finer scale. For example, both groups 31 and 110 are incorporated into the large immune-related cluster near the center of Fig. 5 a. However, these groups have both different associated GO annotations and zero genes in common. Group 31 has gene ontology terms associated with immune cell activation (e.g. GO:45321 for genes EBI3, Cd79A, BATF, FCER1G, CCL5, EOMES, CD3D, Cd7, LCP1, ZNF683, CD2, ITK, THEMIS, B2M), adhesion (GO:7159 for genes EBI3, RAC2, BATF, CCL5, EOMES, CD3D, Cd7, LCP1, CD2, ITK, THEMIS, B2M), and migration (GO:2687 for genes SELL, RAC2, CCL4, CXCL13, CCL5, CXCL11, CXCL9, and CCL8). In contrast, group 110 primarily has terms associated with inflammatory response (GO:6954 for genes IL4R, MEFV, TGFB1, NLRP3, SIGLEC1, C5AR1, CD163, PIK3CG, NLRC4, TLR4, TNFRSF1B), apoptotic process (GO: 6915 for genes TGFB1, HGF, ABR, C5AR1, MKL1, NICAL1, MAP3K5, PIK3CG, INPP5D, NLRC4, TRAF1, TLR4, TNFRSF1B), and cell death (GO:43067 for genes STK10, HGF, ABR, SIGLEC1, C5AR1, MKL1, MICAL1, MAP3K5, PIK3CG, INPP5D, NLRC4, NCF2, TRAF1, TLR4, JAK3). Thus higher level clusters such as immune response are parsed at the first layer according to specific biological sub-networks.

Fig. 5
figure 5

Gene function relationships in the latent factor hierarchy. a Hierarchical factor relationships are shown exhibiting a tendency to cluster layer one groups with similar general Gene Ontology enrichments. Lines are shown connecting the top X strongest relationships and the top 1000 genes. Thicker lines indicate higher CorEx relationship probabilities. Representative GO terms for hierarchically clustered groups only are shown in red. Though this figure highlights clustering at layer one and above, many nonclustered layer one factors have very significant annotations as well, detailed in Fig. 2 and the Additional file 1. b In addition to general relationships, specific causal relationships are also detected at the upper layers. Here an example is shown in which noncoding RNAs that are implicated in tumor growth and metastasis via regulation of ribosome activity are isolated in a layer zero group and joined at the higher level to ribosomal structure genes. Noncoding RNA functional annotations are taken from the NCBI Gene database (

A particularly striking example of the kind of information that can be gleaned from the layer two factors is shown in Fig. 5 b. One can see a layer two factor (node) that joins four clusters of ribosomal proteins. The fifth layer one cluster is comprised of a set of non-coding RNAs related to cancer metastasis through regulation of growth arrest and protein folding. Thus a cluster of non-coding regulatory factors is linked to the downstream interactions of proteins they influence. While the particular information highlighted here is already known, the presence of these meaningful relationships suggest that novel relationships can, in principle, be discovered by CorEx. This is more likely to happen when many thousands of samples are available to allow less common relationships and perhaps even tumor-specific rewiring of interactions to be seen.

We quantified the change in biological information content (IC) in the layers by calculating enriched GO term IC as a function of layer level. The results are shown in Fig. 6. Simply calculating the frequency distribution of IC for different layers shows a relatively high frequency of low information terms in the lower layer groups. However, if one considers only GO terms that appear as significantly enriched in both a child and its parent node, there is a clear trend toward more significant enrichment of low IC terms in the parent nodes. Increasing significance of low IC terms in the parent nodes indicate an aggregation of gene groups that reflect more specialized terms (hence functional annotation) in the children.

Fig. 6
figure 6

Comparison of term enrichment between parent and child clusters in the CorEx hierarchy. a The distributions of GO term information content (IC) for all terms enriched in factor-defined gene groups show a relative overabundance of low information (lower specificity) terms for factors in the lowest layer (L1). GO term IC is calculated as the negative logarithm of the ratio of the number of descendants of a given term in the Gene Otology to the number of descendants of the root. b Relative enrichment of GO terms that are shared by child and parent nodes is compared by plotting the difference of the (log) term enrichment p-values, with p1 referring to the p-value for the lower L1 layer and p2 referring to the p-value for the upper L2 layer. A clear trend for increased enrichment of low information terms in parent nodes (L2), as one expects in a GO-like hierarchy

Patient stratification based upon combinations of latent factors yields differential survival associated with tumor-activated pathways

The continuous labels for the CorEx latent factors can be used as predictors of survival under a Cox proportional hazard model [26, 27]. A subset of the CorEx small to intermediate-sized gene groups was selected according to known biological significance (Gene Ontology enrichment) and used to stratify patients according to relative risk of death under a Cox proportional hazard model for the factor scores. We first constructed models for each factor individually. When this is done and patients are stratified roughly into thirds according to low, intermediate, and high relative risk of death, several gene expression groups appear to be associated with differential survival, as judged by the p-value for the difference between the two extremal Kaplan-Meier survival curves (Fig. 7 a and Additional file 1: Figure S5). False discovery rates (FDRs) were also estimated by randomly permuting the sample scores for each latent factor (shuffling each column) in the score matrix. The top single factor groups appear somewhat weak by this measure as the values are strongly limited by the small number of patient samples versus the number of potentially relevant networks. Using this analysis, the top latent factor group, related to immune chemokine and interferon signaling, yields an FDR level close to.3. Genes with large MI with respect to this group factor include among others CXCL10, CXCL11, CXCL9, CXCL13, ISG20, GBP (1-5), OASL, HERC5, STAT1, IFIT1, IFI35, TRIM22, and JAK2. Other groups that exhibit survival associations encompass additional immune system factors, fibroblast growth factor regulation, transforming growth factor beta signaling, map kinase regulation, Wnt signaling, Jak-STAT signaling, and cellular metabolism.

Fig. 7
figure 7

Group survival association and combinations. a A Cox proportional hazard model is used to estimate relative risk for patients according to individual factor label values. Using this model, a relative risk of death can be calculated for each patient based upon her group label value. When patients are stratified into thirds according to relative risk, some factors show an association with survival indicated by the difference in Kaplan-Meier survival curves for the extremal groups (p-values shown in the graphs). The figure shows survival curves for two such factors, G56 and G173. The bottom 30%, middle 40%, and top 30% in terms of predicted survival under the coxph model are plotted in magenta, blue, and red, respectively. When these two factors are combined into a single risk model, the survival differential increases dramatically. b Single factor combinations with various Gene Ontology annotations can be combined and show synergy relative to survival association. The false discovery rate under a randomization test for six such pairs is below about 1/3 compared to only one single factor achieving that level. Note that it is not suggested that factors be selected based solely upon survival associations but rather factors that are interesting for other reasons, e.g. they contain druggable targets, can be investigated for an impact on survival both alone and in combinations

A perhaps even more compelling finding is that combinations of these statistically weak factors yield increased significance as pairs relative to a randomization test for paired factors, with 16/21 combinations yielding FDRs less than about.35 (Fig. 7 b). This suggests synergy of individual factors in terms of impact on survival. The pair of factors that show the greatest association with survival is immune system cytokine signaling combined with regulation of the fibroblast growth factor pathway. These two general features are hypothesized to be associated with long term survival in ovarian cancer from other studies, the latter due to its relationship to chemo resistance [28, 29]. The advantage here with respect to these previously known prognostic features is that each patient is assigned a relative risk according to activation of related pathways and therefore shows how they may combine to influence survival, indicating patients who might benefit from either single or simultaneous treatment with immune system modulation or FGF pathway inhibitors. Additionally, the fine parsing of overall immune system function by the sub-networks present in different groups has the potential to discriminate among different immunotherapies in terms of likely efficacy on a personalized basis. Overall, most of the top survival-associated combinations involve immune-related factors, emphasizing the primacy of the immunological milieu in influencing ovarian cancer survival rates in this cohort under the standard treatment protocols. The combination of a factor related to Wnt signaling and one for fibroblast growth factor signaling also appears predictive of survival in a subset of patients. Wnt signaling has also been associated with clinical prognosis in ovarian cancer [30]. Other factors and combinations of groups show narrower survival differentials. We expect that some of these play important roles for a relatively smaller proportion of patients and will gain significance once a greater number of tumor RNA-seq samples are available for analysis.

We performed survival analysis using PCA factors as well. While CorEx and PCA appear comparable in terms of the number of factors that can be used to stratify patients according to survival, the number and quality of biological enrichments for the CorEx factors is substantially greater than that for PCA (Fig. 3 b and Additional file 1: Figure S7).

It should be emphasized that in this retrospective analysis of a patient cohort that was treated prior to 2010, we are able to analyze the survival impact only with respect to standard carbo/taxol therapies as no targeted therapies were widely available or approved. Cell line and xenograft experiments or analysis of tissues from other clinical trials will likely bring to the fore other factor groupings that are associated with response to recently approved or experimental therapies, e.g. angiogenesis inhibitors, PARP inhibitors, and apoptosis signaling modulators to name a few [3133].

CorEx discovers differential expression in ovarian tumors of EMT-related genes previously implicated in breast cancer stemness and metastasis

A survival differential at longer times only is seen for one CorEx gene expression factor (denoted G159). This suggests that this factor may influence survival in a way that is not related to initial carbo/taxol response per se. Detailed examination of the Gene Ontology annotations for this latent factor cluster revealed genes associated with microRNAs in cancer (6 genes), chromatin modification (8 genes), regulation of transcription from RNA pol II promoter (10 genes), stem cell differentiation (5 genes), and regulation of Wnt signaling (5 genes). Further, when this factor is combined in a survival model along with the top single factor within the same CorEx run (G103, immune cytokine signaling), the resulting stratification yields a more significant survival association than any other pair of factors from that run (Fig. 8 a).

Fig. 8
figure 8

Discovery of new targets: regulatory microRNAs and EMT. a One factor was observed to show a bias toward a difference in long-term survival. Though its survival differential was not significant alone, when put into a model combined with the factor for immune chemokine signaling, the resulting survival association was substantially greater and exceeded that of every other tested pair for that run. b Breast cancer samples were mapped to CorEx factors learned from ovarian tumors and demonstrate a survival difference for this factor. c The group contains a small network related to microRNAs in cancer, chromatin modification, and stem cell differentiation. Work in breast cancer associates the network with a cellular EMT phenotype and increased aggressiveness and metastasis of breast tumors. EMT features have been shown to be influenced through regulation by TET3 of the demethylase DNMT3A and represent a potential for development of new therapies targeting the regulatory microRNA miR-22

Intrigued by the possibility of a factor related to something more general than response to a specific chemotherapy, we sought to validate its significance using the TCGA breast cancer data. The reasoning behind this was that while breast tumors exhibit some similarity on a molecular basis to ovarian tumors, there are also significant differences. Earlier detection and a greater variety of chemotherapeutic options means survival will depend largely upon factors not directly related to carbo/taxol response. We calculated factor scores for TCGA breast cancer tumor RNA-seq samples across the factors determined by the ovarian cancer training samples (i.e. we did not train a new set of latent factors, but mapped the breast cancer samples to the factors learned from ovarian tumors). Using these scores to fit a stratified model with respect to the G159 latent factor yields a survival differential for the breast cancer patients (Fig. 8 b). We subsequently discovered that the genes in this group have important functions within a network that was recently studied by Song et al. [34] in preclinical models of breast cancer, where it was shown that TET3 controls expression of DNMT3A, a DNA methyl transferase that exerts specific influence on the chromatin state related to stemness (Fig. 8 c). Further, the network is implicated in the progression of breast cancer via the epithelial-mesenchymal transition, invasion, and metastasis. Mesenchymal-like properties of tumor cells and EMT-associated features in general have been associated with poor prognoses in ovarian cancer [35], though this particular network has not been previously implicated. Importantly, this network was shown to be controlled by a microRNA acting on TET3, thus providing a novel target for drug development [34]. Observation of this mechanism in ovarian cancer tumors suggests not only a new biological driver for recurrence but also the possibility for selection of ovarian cancer patients who might benefit from any experimental miR-22 modulators that may be developed for breast cancer.

Some latent factors can be used to distinguish between normal and cancerous ovarian tissue

We trained a CorEx factor model on RNA-seq profiles from both normal and tumor ovarian tissue samples. In order to mitigate differences due to experimental variations, the expression values in this case were replaced with sample-relative percentiles [36]. The efficacy of this approach can be seen in group heat maps that demonstrate smooth variation in gene expression across tumor and normal samples (Additional file 1: Figure S8). While normal samples and similar tumors appear close to one another in the heat maps, there is still sometimes a qualitative difference that appears related to signal variance. Deeper exploration of the issues of cross-experiment comparison for RNA-seq is beyond the scope of this work.

The trained factors show clear distinctions in gene expression patterns between tumors and normal ovarian tissue and these appear related to a variety of biological processes including ones for development and regulation of cell differentiation, neurogenesis, apoptosis, sex differentiation, cell migration, and inflammatory response. Histograms of the factor scores for a few representative groups are shown in Fig. 9. These findings highlight some commonalities of pathway dysregulation among ovarian tumors. Further, several of the groups contain genes that can be targeted with existing drugs [37, 38]. This presents the possibility that repurposing of existing drugs based upon this sort of data may provide new therapeutic options for a great many ovarian cancer patients, though the implications within these particular gene expression cohorts need to be further elucidated.

Fig. 9
figure 9

Comparison to normal ovarian tissue suggests druggable targets. When normal ovarian samples are combined with ovarian tumors for training, many factors capture differences in gene expression patterns between them. The histograms show factor scores in cases with large differences that correspond to novel drug targets in ovarian cancer, some of which are listed below


In cancer and other diseases, high-throughput data combined with computational phenotyping i.e. the algorithmic distillation of observables that explain and allow prediction of the interaction of the course of disease and treatment, is poised to transform medicine as we know it. In this work we have presented a novel approach toward the goal of an RNA-seq -based computational phenotype. The method of correlation explanation leverages multivariate mutual information to infer complex hierarchical gene expression relationships directly from RNA-seq transcription levels, and the groups of genes thus obtained correspond to distinct cellular subnetworks as indicated by Gene Ontology annotations and known regulatory relationships. CorEx does this efficiently and without requiring any form of prior knowledge. The hierarchy of factors is shown to facilitate understanding and interpretability by relating networks at multiple levels. Though CorEx allows sharing of genes among groups, it also attempts to maximize overall independence between the groups in terms of expression patterns, with factors at higher levels being progressively less dependent. This leads to the ability to pick out groups from different upper level factors that may combine synergistically, and this is seen in the survival associations for some combinations of factors. In addition, tumor-specific assessment of differential transcriptional activation of the gene cohorts is intrinsic to the method. These results show how the observation of differential activity in these gene networks can have great clinical impact by enabling selection of patients most likely to benefit from particular therapies and/or combinations of therapies.

The use of RNA-seq, rather than microarray data, is essential for detailed inference of the transcriptional computational phenotype due to its superior dynamic range, sensitivity, and inherent whole transcriptome measurement capability compared to microarrays. Many past studies have focused on the use of microarray data, however RNA-seq should have substantial advantages for discovery of cancer-specific network interactions and this is supported by studies comparing the two [39, 40]. Some genes in the expression cohorts found here are not even measured in microarray experiments, often because their function is inadequately characterized and thus don’t justify inclusion of corresponding microarray probes. Also, uncommon transcript variants that arise from unstable cancer genomes cannot be detected with microarrays, resulting in misleading expression values. RNA-Seq data captures the effects of such alterations in ways that allow for their discovery and more accurate quantification [4143]. The results presented here from such a relatively small patient cohort are extremely encouraging and clearly show that it is possible to extract a great deal of information from just a few hundred whole exome tumor transcriptomes and makes a strong case for obtaining larger numbers of samples in this context.

Early work on the clinical implications of large scale gene expression data in cancer included direct correlation of gene expression and responses of cell lines to cancer agents [44]. More recently, gene co-expression analyses based upon pairwise correlation or mutual information have been applied to infer networks and regulatory modules [24, 4547], some of which are associated with cancer progression and prognosis. The number of modules found e.g. in breast cancer is markedly fewer (10-20 in the cited studies). Regulatory module inference in these cases often relies upon the application of graph theoretic techniques such as maximal clique identification to identify densely connected groups of genes, with special considerations to allow overlaps. CorEx, in contrast, constructs a graph with edges connecting multiple genes to latent factors rather than to one another and overlaps are intrinsic to the inference process. An alternative approach is presented in [48] wherein attractor metagenes are defined using mutual information Though relatively more restrictive in many ways and heuristic, this is similar in spirit to CorEx. Our approach is most directly related to the information-based clustering of [49]. CorEx represents a significant advance over that work since the lower bound on total correlation that CorEx uses for optimization makes multivariate clustering computationally tractable for large numbers of variables ([49] did not calculate TC for greater than pairwise correlations). As far as we are aware, our work with CorEx is the first successful use of full multivariate mutual information in the context of high throughput gene expression data. The principled approach of CorEx and also the addition of data-guided smoothing via the Bayes shrinkage prior allows for continuous improvement as more data is generated. These properties combined with favorable scaling suggest that CorEx can be widely adopted to analyze and interpret complex gene expression data in many contexts.

CorEx both provides a high level summary of the content of groups of differentially expressed genes and parses them out into particular groups for potential experimental and therapeutic purposes. Therefore it is an umbrella method that can encompass the particular findings of other studies on specific prognostic features of patient tumors. For instance, the original TCGA analysis of the ovarian tumor expression data used non-negative matrix factorization to identify a partitioning of genes and patients into a few subtypes [1]. Features of the four subtypes are clearly reflected in some of the higher level CorEx nodes such as that linking immune response factors. In contrast, here the functions broadly related to the putative subtypes have been parsed more finely on the lower level to reveal specific tumor-activated pathways and druggable targets. This allows one to make hypotheses regarding the interaction of functional gene modules in tumor development and progression. Further, the impact of expression modulation of such relatively smaller groups of genes may be more easily tested in cell lines and xenograft models.

There are many possible extensions and further applications of the methods described here. While mRNA provides a functional readout of many different causal processes, other types of salient information, e.g. somatic mutations, copy number variation, or chromatin marks related to gene activation or repression can be easily correlated with the gene expression groups at the higher level to directly link known biological causes to gene expression cohorts. This information can then point to specific mechanisms active in individual tumors or strengthen the RNA insights with the integration of complementary information. Though beyond the scope of this initial work, pan-cancer analysis is obviously of great interest, and CorEx’s favorable scaling behavior should allow for straightforward application in that context. Additionally, CorEx can be applied to other TCGA cancer types alone where one is likely to see different networks implicated.


We have demonstrated a new technique to detect and comprehend the significance of coordinate gene expression in tumor subsamples. Due to the use of multivariate information, the algorithm is exceptionally sensitive to groups displaying multiple weak interactions. This property enables the discovery of an unprecedented number of functional biological expression groupings. In ovarian cancer, where there is a particular need for tools to aid in the development of both novel and rational combination therapies, we have shown that this method of analysis discovers new targets and lends itself to the selection of therapeutic combinations for individuals.


1 “Total correlation” is a historical misnomer, since it is not a measure of correlation, but of dependence.



Correlation explanation


Deoxyribonucleic acid


Epithelial-mesenchymal transition


False discovery rate


Gene ontology


Genotype tissue expression


Information content


Kyoto encyclopedia of genes and genomes


Messenger ribonucleic acid


Mutual information


National center for biotechnology information


Poly ADP ribos polymerase


Principal component analysis


Protein-protein interaction


Random access memory


Reads per kilobase of transcript per million mapped reads


Ribonucleic acid


Total correlation


The Cancer Genome Atlas


  1. Network TCGAR. Integrated genomic analyses of ovarian carcinoma. Nature. 2011; 474(7353):609–15.

    Google Scholar 

  2. Ver Steeg G, Galstyan A. Discovering structure in high-dimensional data through correlation explanation In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, editors. Advances in Neural Information Processing Systems 27. Curran Associates, Inc: 2014. p. 577–85.

  3. Ver Steeg G, Galstyan A. Maximally informative hierarchical representations of high-dimensional data In: Lebanon G, Vishwanathan SVN, editors. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS). JMLR Inc.: 2015. p. 1004–12.

  4. Howlader N, Noone AM, Krapcho M, Garshell J, Miller D, Altekruse SF, Kosary CL, Yu M, Ruhl J, Tatalovich Z, Mariotto A, Lewis DR, Chen HS, Feuer EJ, Cronin KA. SEER Cancer Statistics Review, 1975–2012 In: Howlader N, Noone AM, Krapcho M, Garshell J, Miller D, Altekruse SF, Kosary CL, Yu M, Ruhl J, Tatalovich Z, Mariotto A, editors. Bethesda: National Cancer Institute: 2015.

  5. Foley OW, Rauh-Hain JA, del Carmen MG. Recurrent epithelial ovarian cancer: an update on treatment. Oncology (Williston Park). 2013; 27(4):288–94298.

    Google Scholar 

  6. Matsuzaki J, Gnjatic S, Mhawech-Fauceglia P, Beck A, Miller A, Tsuji T, Eppolito C, Qian F, Lele S, Shrikant P, Old LJ, Odunsi K. Tumor-infiltrating NY-ESO-1-specific CD8+ T cells are negatively regulated by LAG-3 and PD-1 in human ovarian cancer,. Proc Natl Acad Sci USA. 2010; 107(17):7875–880.

    CAS  PubMed  Google Scholar 

  7. Homicsko K, Coukos G. Targeting Programmed Cell Death 1 in Ovarian Cancer. J Clin Oncol Off J Am Soc Clin Oncol. 2015; 33(34):3987–989.

    CAS  Google Scholar 

  8. Preston CC, Goode EL, Hartmann LC, Kalli KR, Knutson KL. Immunity and immune suppression in human ovarian cancer. Immunotherapy. 2011; 3(4):539–56.

    PubMed  PubMed Central  Google Scholar 

  9. Cubillos-Ruiz JR, Silberman PC, Rutkowski MR, Chopra S, Perales-Puchalt A, Song M, Zhang S, Bettigole SE, Gupta D, Holcomb K, Ellenson LH, Caputo T, Lee AH, Conejo-Garcia JR, Glimcher LH. ER Stress Sensor XBP1 Controls Anti-tumor Immunity by Disrupting Dendritic Cell Homeostasis. Cell. 2015; 161(7):1527–38.

    CAS  PubMed  PubMed Central  Google Scholar 

  10. Madsen SK, Ver Steeg G, Mezher A, Jahanshad N, Nir TM, Hua X, Gutman BA, Galstyan A, Thompson PM. Information-theoretic characterization of blood panel predictors for brain atrophy and cognitive decline in the elderly. In: Biomedical Imaging (ISBI), 2015 IEEE 12th International Symposium on. IEEE: 2015. p. 980–4. doi:10.1109/ISBI.2015.7164035, ISSN:1945-7928,

  11. Watanabe S. Information theoretical analysis of multivariate correlation. IBM J Res Dev. 1960; 4(1):66–82.

    Google Scholar 

  12. Cover TM, Thomas JA. Elements of Information Theory. New York: Wiley-Interscience; 2006.

    Google Scholar 

  13. Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, Sun Y, Jacobsen A, Sinha R, Larsson E, Cerami E, Sander C, Schultz N. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science signaling. 2013; 6(269):1–1.

    Google Scholar 

  14. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, Jacobsen A, Byrne CJ, Heuer ML, Larsson E, Antipin Y, Reva B, Goldberg AP, Sander C, Schultz N. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer discovery. 2012; 2(5):401–4.

    PubMed  Google Scholar 

  15. Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, Foster B, Moser M, Karasik E, Gillard B, Ramsey K, Sullivan S, Bridge J, Magazine H, Syron J, Fleming J, Siminoff L, Traino H, Mosavel M, Barker L, Jewell S, Rohrer D, Maxim D, Filkins D, Harbach P, Cortadillo E, Berghuis B, Turner L, Hudson E, Feenstra K, Sobin L, Robb J, Branton P, Korzeniewski G, Shive C, Tabor D, Qi L, Groch K, Nampally S, Buia S, Zimmerman A, Smith A, Burges R, Robinson K, Valentino K, Bradbury D, Cosentino M, Diaz-Mayoral N, Kennedy M, Engel T, Williams P, Erickson K, Ardlie K, Winckler W, Getz G, DeLuca D, MacArthur D, Kellis M, Thomson A, Young T, Gelfand E, Donovan M, Meng Y, Grant G, Mash D, Marcus Y, Basile M, Liu J, Zhu J, Tu Z, Cox NJ, Nicolae DL, Gamazon ER, Im HK, Konkashbaev A, Pritchard J, Stevens M, Flutre T, Wen X, Dermitzakis ET, Lappalainen T, Guigo R, Monlong J, Sammeth M, Koller D, Battle A, Mostafavi S, McCarthy M, Rivas M, Maller J, Rusyn I, Nobel A, Wright F, Shabalin A, Feolo M, Sharopova N, Sturcke A, Paschal J, Anderson JM, Wilder EL, Derr LK, Green ED, Struewing JP, Temple G, Volpi S, Boyer JT, Thomson EJ, Guyer MS, Ng C, Abdallah A, Colantuoni D, Insel TR, Koester SE, Little AR, Bender PK, Lehner T, Yao Y, Compton CC, Vaught JB, Sawyer S, Lockhart NC, Demchok J, Moore HF. The Genotype-Tissue Expression (GTEx) project. Nature Genetics. 2013; 45(6):580–5.

    CAS  Google Scholar 

  16. Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, von Mering C, Jensen LJ. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Research. 2013; 41:D808–15.

    CAS  PubMed  Google Scholar 

  17. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, Bork P, von Mering C. STRING 8–a global view on proteins and their functional interactions in 630 organisms. Nucleic acids. 2009; 37(Database issue):412–6.

    Google Scholar 

  18. Krzywinski M, Schein J, Birol I, Connors J. Circos: an information aesthetic for comparative genomics. Genome Res. 2009; 19:1639–45.

    CAS  PubMed  PubMed Central  Google Scholar 

  19. Supek F, Bošnjak M, Škunca N, Šmuc T. REVIGO Summarizes and Visualizes Long Lists of Gene Ontology Terms. PloS ONE. 2011; 6(7):21800.

    Google Scholar 

  20. de Hoon MJL, Imoto S, Nolan J, S M. Open Source Clustering Software. Bioinformatics. 2004; 20(9):1453–4.

    CAS  PubMed  Google Scholar 

  21. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics. 2000; 25(1):25–9.

    CAS  PubMed  PubMed Central  Google Scholar 

  22. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids. 2000; 28(1):27–30.

    CAS  Google Scholar 

  23. van Dam S, Craig T, de Magalhăes JP. GeneFriends: a human RNA-seq-based gene and transcript co-expression database. Nucleic Acids. 2015; 43(Database issue):1124–32.

    Google Scholar 

  24. Yang Y, Han L, Yuan Y, Li J, Hei N, Liang H. Gene co-expression network analysis reveals common system-level properties of prognostic genes across cancer types. Nature Commun. 2014; 5:3231.

    Google Scholar 

  25. Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC. Detecting novel associations in large data sets. Science (New York). 2011; 334(6062):1518–24.

    CAS  Google Scholar 

  26. Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. New York: Springer; 2000.

    Google Scholar 

  27. R Core Team. R: A Language and Environment for Statistical Computing, Vienna, Austria. Vienna: R Foundation for Statistical Computing; 2015. Accessed 13 Feb 2017.

  28. Cole C, Lau S, Backen A, Clamp A, Rushton G, Dive C, Hodgkinson C, McVey R, Kitchener H, Jayson GC. Inhibition of FGFR2 and FGFR1 increases cisplatin sensitivity in ovarian cancer. Cancer biology & therapy. 2014; 10(5):495–504.

    Google Scholar 

  29. Zhang L, Conejo-Garcia JR, Katsaros D, Gimotty PA, Massobrio M, Regnani G, Makrigiannakis A, Gray H, Schlienger K, Liebman MN, Rubin SC, Coukos G. Intratumoral T Cells, Recurrence, and Survival in Epithelial Ovarian Cancer. New England J Med. 2003; 348(3):203–13.

    CAS  Google Scholar 

  30. Bodnar L, Stanczak A, Cierniak S, Smoter M, Cichowicz M, Kozlowski W, Szczylik C, Wieczorek M, Lamparska-Przybysz M. Wnt/ β-catenin pathway as a potential prognostic and predictive marker in patients with advanced ovarian cancer. J Ovarian Res. 2014; 7(1):16.

    PubMed  PubMed Central  Google Scholar 

  31. Glass K, Quackenbush J, Spentzos D, Haibe-Kains B, Yuan GC. A network model for angiogenesis in ovarian cancer. BMC bioinformatics. 2015; 16(1):10.

    Google Scholar 

  32. D’Andrea AD. Targeting DNA Repair in Cancer Therapy. The Oncologist. 2015; 20:S1–8.

  33. Janzen DM, Tiourin E, Salehi JA, Paik DY, Lu J, Pellegrini M, Memarzadeh S. An apoptosis-enhancing drug overcomes platinum resistance in a tumour-initiating subpopulation of ovarian cancer. Nature Commun. 2015; 6:7956.

    CAS  Google Scholar 

  34. Song SJ, Poliseno L, Song MS, Ala U, Webster K, Ng C, Beringer G, Brikbak NJ, Yuan X, Cantley LC, Richardson AL, Pandolfi PP. MicroRNA-antagonism regulates breast cancer stemness and metastasis via TET-family-dependent chromatin remodeling. Cell. 2013; 154(2):311–24.

    CAS  PubMed  PubMed Central  Google Scholar 

  35. Tothill RW, Tinker AV, George J, Brown R, Fox SB, Lade S, Johnson DS, Trivett MK, Etemadmoghadam D, Locandro B, Traficante N, Fereday S, Hung JA, Chiew YE, Haviv I, Australian Ovarian Cancer Study Group, Gertig D, deFazio A, Bowtell DDL. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res. 2008; 14(16):5198–208.

    CAS  PubMed  Google Scholar 

  36. Grossman RL, Heath AP, Ferretti V, Varmus HE, Lowy DR, Kibbe WA, Staudt LM. Toward a shared vision for cancer genomic data. New England J Med. 2016; 375(12):1109–12.

    Google Scholar 

  37. Wagner AH, Coffman AC, Ainscough BJ, Spies NC, Skidmore ZL, Campbell KM, Krysiak K, Pan D, McMichael JF, Eldred JM, Walker JR, Wilson RK, Mardis ER, Griffith M, Griffith OL. DGIdb 2.0: mining clinically relevant drug-gene interactions. Nucleic acids. 2016; 44(D1):1036–44.

    Google Scholar 

  38. Stelzer G, Rosen R, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, Iny ST, Nudel R, Lieder I, Mazor Y, Kaplan S, Dahary D, Warshawsky D, Guan-Golan Y, Kohn A, Rappaport N, Safran M, Lancet D. The GeneCards suite: From gene data mining to disease genome sequence analysis. Curr Protoc Bioinforma. 2016; 54(1):1–33.

    Google Scholar 

  39. Iancu OD, Kawane S, Bottomly D, Searles R, Hitzemann R, McWeeney S. Utilizing RNA-Seq data for de novo coexpression network inference. Bioinformatics (Oxford, England). 2012; 28(12):1592–7.

    CAS  Google Scholar 

  40. Cai Y, Fendler B, Atwal GS. Utilizing RNA-Seq Data for Cancer Network Inference. 2012.

  41. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Rev Gen. 2009; 10(1):57–63.

    CAS  Google Scholar 

  42. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008; 5(7):621–8.

    CAS  PubMed  Google Scholar 

  43. Peng L, Bian XW, Li DK, Xu C, Wang GM, Xia QY, Xiong Q. Large-scale RNA-Seq Transcriptome Analysis of 4043 Cancers and 548 Normal Tissue Controls across 12 TCGA Cancer Types. Scientific reports. 2015; 5:13413.

    CAS  PubMed  PubMed Central  Google Scholar 

  44. Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci USA. 2000; 97(22):12182–6.

    CAS  PubMed  Google Scholar 

  45. Song L, Langfelder P, Horvath S. Comparison of co-expression measures: mutual information, correlation, and model based indices. BMC bioinformatics. 2012; 13(1):328.

    CAS  PubMed  PubMed Central  Google Scholar 

  46. Shi Z, Derow CK, Zhang B. Co-expression module analysis reveals biological processes, genomic gain, and regulatory mechanisms associated with breast cancer progression. BMC systems biology. 2010; 4(1):74.

    PubMed  PubMed Central  Google Scholar 

  47. Zhang B, Horvath S. A General Framework for Weighted Gene Co-Expression Network Analysis. Stat Appl Genet Mol Biol. 2005; 4(1):17.

    Google Scholar 

  48. Cheng WY, Yang T-HO, Anastassiou D. Biomolecular Events in Cancer Revealed by Attractor Metagenes. PLoS Comput Biol. 2013; 9(2):1002920.

    Google Scholar 

  49. Slonim N, Atwal GS, Tkacik G, Bialek W. Information-based clustering. Proc Natl Acad Sci.2005;102(51).

    CAS  Google Scholar 

Download references


GV acknowledges support from AFOSR grant FA9550-12-1-0417 and DARPA grant W911NF-12-1-0034 for development of the general algorithm used. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Availability of data and materials

Supporting data is available in the Additional files. The algorithm implementation used in this work can be obtained from

Authors’ contributions

SP drafted the manuscript with assistance from GV. SP and GV conceived and designed the studies. GV developed the CorEx algorithm enhancements and ran it on the data. SP selected and prepared the training data and performed all analyses downstream of CorEx. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent to publish

No identifying details or images of individuals are included in this work.

Ethics approval and consent to participate

All human data analyzed in this study was originally obtained the TCGA Research Network and per their statement: “All specimens were obtained from patients with appropriate consent from the relevant institutional review board.”

Author information

Authors and Affiliations


Corresponding author

Correspondence to Shirley Pepke.

Additional files

Additional file 1

Supplementary methods details and text including a description of the Bayes prior and reproducibility across runs. (PDF 7511 kb)

Additional file 2

Expression matrix used for training. (XLSX 29 722 kb)

Additional file 3

Group gene membership. (XLSX 284 kb)

Additional file 4

Labels for each tumor sample. (XLSX 1443 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Pepke, S., Ver Steeg, G. Comprehensive discovery of subsample gene expression components by information explanation: therapeutic implications in cancer. BMC Med Genomics 10, 12 (2017).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: