Comprehensive discovery of subgroup gene expression components by information explanation: therapeutic implications in cancer

Background: De novo inference of clinically relevant gene function relationships from tumor RNA-seq remains a challenging task. Current methods typically either partition patient samples into a few subtypes or rely upon analysis of pairwise gene correlations that will miss some groups in noisy data. Leveraging higher-dimensional information can be expected to increase the power to discern targetable pathways, but this is commonly thought to be an intractable computational problem.

Methods: In this work we adapt a recently developed machine learning algorithm, CorEx, for sensitive detection of complex gene relationships. The algorithm efficiently optimizes over multivariate mutual information and can be iteratively applied to generate a hierarchy of relatively independent latent factors. The learned latent factors are used to stratify patients for survival analysis with respect to both single factors and combinations. These analyses are performed and interpreted in the context of biological function annotations and protein network interactions that might be utilized to match patients to multiple therapies.

Results: Analysis of ovarian tumor RNA-seq samples demonstrates the algorithm's power to infer well over one hundred biologically interpretable gene cohorts, several times more than standard methods such as hierarchical clustering and k-means. The CorEx factor hierarchy is also informative, with related but distinct gene clusters grouped by upper nodes. Some latent factors correlate with patient survival, including one for a pathway connected with the epithelial-mesenchymal transition in breast cancer that is regulated by a microRNA that modulates epigenetics. Further, combinations of factors lead to a synergistic survival advantage in some cases.

Conclusions: In contrast to studies that attempt to partition patients into a small number of subtypes (typically 4 or fewer) for treatment purposes, our approach utilizes subgroup information for combinatoric transcriptional phenotyping. Considering only the 66 gene expression groups that both show significant Gene Ontology enrichment and are small enough to indicate specific drug targets implies a computational phenotype for ovarian cancer that allows for $3^{66}$ possible patient profiles, enabling truly personalized treatment. The findings here demonstrate a new technique that sheds light on the complexity of gene expression dependencies in tumors and could eventually enable the use of patient RNA-seq profiles for selection of personalized and effective cancer treatments.

Electronic supplementary material: The online version of this article (doi:10.1186/s12920-017-0245-6) contains supplementary material, which is available to authorized users.


Supplementary Figures
Supplementary Figure: The rank-biased overlap can be used to identify equivalent groups from different runs.

Supplementary Text
Given the variability between runs of CorEx in such a high-dimensional space, some way to compare and reconcile the results of multiple runs is needed. We have found that comparison of the gene groupings using the rank-biased overlap (RBO) of [1] can be used to draw intuitively satisfying correspondences. The RBO is calculated as a single number that describes the similarity between two ranked lists. It is very flexible and has appealing asymptotic properties. It is possible to usefully aggregate results from various runs by declaring any two groups that exceed a threshold RBO to be equivalent. When this is the case, the group with greater total correlation (TC) is retained.
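As an illustration of this aggregation step, a minimal sketch follows. The truncated form of the RBO sum, the equivalence threshold of 0.6, and the data layout are assumptions for illustration rather than the authors' implementation.

```python
def rbo(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap between two ranked lists [1].

    At each depth d, the agreement A_d is the fraction of items shared
    by the top-d prefixes of the two lists; RBO is the geometrically
    weighted (weight p**(d-1)) average of the A_d. Truncating the
    infinite sum at the list length gives a lower bound on the full RBO.
    """
    depth = max(len(list_a), len(list_b))
    score = 0.0
    for d in range(1, depth + 1):
        agreement = len(set(list_a[:d]) & set(list_b[:d])) / d
        score += p ** (d - 1) * agreement
    return (1.0 - p) * score

def merge_runs(groups_a, groups_b, threshold=0.6):
    """Aggregate gene groups from two CorEx runs.

    Each group is a (ranked_gene_list, tc) pair. Groups whose RBO
    exceeds the threshold are declared equivalent, and the copy with
    greater TC is retained; the threshold value here is hypothetical.
    """
    merged = list(groups_a)
    for genes_b, tc_b in groups_b:
        match = None
        for idx, (genes_a, tc_a) in enumerate(merged):
            if rbo(genes_a, genes_b) > threshold:
                match = idx
                break
        if match is None:
            merged.append((genes_b, tc_b))       # no equivalent group found
        elif tc_b > merged[match][1]:
            merged[match] = (genes_b, tc_b)      # keep the higher-TC copy
    return merged
```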

Bayesian shrinkage estimates for the marginals
The remaining details of the learning scheme concern the estimation of parameters in terms of our samples of data, $x^l$, and the current estimate of the probabilistic labels, $p(Y = y \mid X = x)$. First of all, as we mentioned, $p(X_i = x_i \mid Y = c)$ is estimated as a normal with mean $\mu_{i,c}$ and variance $\sigma^2_{i,c}$. Let $x^l$ denote the $l$-th sample of data. Then an empirical estimate for $\mu_{i,c}$, weighted by the probabilistic labels, would be the following.

$$\hat\mu_{i,c} = \frac{\sum_{l} p(Y = c \mid X = x^l)\, x_i^l}{\sum_{l} p(Y = c \mid X = x^l)}$$
We could simply use this estimate and the corresponding one for the variance. However, if there are only a few samples with label Y = c, this estimate can become quite noisy.
Instead, we consider a Bayesian estimate based on James-Stein type shrinkage estimators [2,3]. We take as our Bayesian prior the hypothesis that the mean is the global sample mean,

$$\mu_0 = \frac{1}{N} \sum_{l=1}^{N} x_i^l.$$
A Bayesian estimate for the mean of a normal distribution has a simple form: a convex combination of the prior mean and the empirical estimate,

$$\hat\mu^\lambda = \lambda\, \mu_0 + (1 - \lambda)\, \hat\mu.$$
The main question is how to set the value of the "shrinkage parameter", λ. The idea behind shrinkage estimators is to analytically estimate the value of λ with the minimum risk. We derive our estimate here since the setting differs slightly from previous treatments [2,3].
For simplicity, we drop the subscripts and assume the true distribution has mean $\mu$, the prior mean is $\mu_0$, the empirical estimate is $\hat\mu$, and the standard deviation $\sigma$ is fixed and known. Let $p(x; \mu, \sigma)$ be a normal distribution parametrized by $\mu, \sigma$. The risk, $R$, is defined as the KL divergence between the true distribution and the estimated one,

$$R(\lambda) = D_{KL}\big( p(x; \mu, \sigma) \,\|\, p(x; \lambda \mu_0 + (1-\lambda)\hat\mu, \sigma) \big) = \frac{\big(\mu - \lambda \mu_0 - (1-\lambda)\hat\mu\big)^2}{2\sigma^2}.$$
We set λ to the value that minimizes the risk by taking the derivative and solving as usual. Writing the error as $(\mu - \hat\mu^\lambda) = (\mu - \hat\mu) + \lambda(\hat\mu - \mu_0)$ and taking expectations over the sampling of $\hat\mu$, it turns out that this leads to the following expression.

$$\lambda^* = \frac{\mathbb{E}\big[(\hat\mu - \mu)^2\big]}{(\hat\mu - \mu_0)^2} \equiv \frac{z^2}{(\hat\mu - \mu_0)^2}$$

We have $z^2 = \sigma^2 / N$, and we recognize that $z$ is just the standard error of the mean. Rather than estimate that analytically, we use a shuffle test to estimate how far we expect $\hat\mu$ to be from $\mu_0$ under the null hypothesis, and we call this $\tilde z$. For some random permutation of the samples, $\pi$, we have the following.

$$\hat\mu_\pi = \frac{\sum_{l} p(Y = c \mid X = x^{\pi(l)})\, x_i^l}{\sum_{l} p(Y = c \mid X = x^{\pi(l)})}$$

Then we take an expectation over several shuffles, $\tilde z^2 = \mathbb{E}_\pi\big[(\hat\mu_\pi - \mu_0)^2\big]$. For "aggressive" shrinkage, instead of shuffling the $x^l$, we sample them with replacement from the empirical distribution.
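A minimal sketch of the full shrinkage procedure, under the reconstruction above, might look as follows. The number of shuffles, the clipping of λ to [0, 1], and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_mean(x, w):
    """Label-probability-weighted empirical mean of one variable."""
    return np.sum(w * x) / np.sum(w)

def shrunken_mean(x, w, n_shuffles=20, aggressive=False):
    """James-Stein style shrinkage of the label-weighted mean toward the
    global mean, with the shrinkage strength set by a shuffle test.

    x : samples of one continuous variable, shape (N,)
    w : p(Y = c | x^l) for each sample, shape (N,)
    """
    mu0 = x.mean()                      # prior: global sample mean
    mu_hat = weighted_mean(x, w)        # soft-label empirical mean
    # Null-hypothesis spread: how far the weighted mean drifts from the
    # global mean when the labels carry no information about x.
    null_sq = []
    for _ in range(n_shuffles):
        if aggressive:
            x_null = rng.choice(x, size=x.size, replace=True)  # bootstrap
        else:
            x_null = rng.permutation(x)                        # shuffle
        null_sq.append((weighted_mean(x_null, w) - mu0) ** 2)
    z2 = np.mean(null_sq)               # estimate of z-tilde squared
    # lambda* = z^2 / (mu_hat - mu0)^2, clipped to [0, 1] as a safeguard
    lam = min(1.0, z2 / max((mu_hat - mu0) ** 2, 1e-12))
    return lam * mu0 + (1.0 - lam) * mu_hat
```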

Setting the weights
The form of the weights, $\alpha_{i,j}$, is directly determined by the optimization, which requires that $\alpha_{i,j} = I(X_i ; Y_j \mid Y_1, \ldots, Y_{j-1}) / I(X_i ; Y_j)$, for some ordering of the $Y_j$. We use a more tractable estimate for this weight [4].
For each sample, $x^l$, we first calculate the most likely label for each latent factor, $y_j^{*,l} = \arg\max_y p(Y_j = y \mid x^l)$. We define the prediction of $Y_j$ based on $X_i$ alone for sample $l$ as $\bar y_{i,j}^l = \arg\max_y p(Y_j = y \mid x_i^l)$. We define $C_{i,j}^l = 1$ if $\bar y_{i,j}^l = y_j^{*,l}$ and 0 otherwise. $C_{i,j}^l$ shows whether $X_i$ correctly predicts $Y_j$ for sample $l$.
Next, for each $i$, we sort the $j$'s in decreasing order of $C_{i,j} \equiv \sum_l C_{i,j}^l$. Then we set $\alpha_{i,j}$ to be the fraction of samples for which $C_{i,j}^l = 1$ and $C_{i,k}^l = 0, \forall k < j$ in this ordering. In other words, it is the fraction of times that $X_i$ correctly and uniquely predicted $Y_j$.
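In code, given a binary correctness array C indexed as (variable i, factor j, sample l), this weight assignment could be sketched as follows; the array layout is an assumption for illustration.

```python
import numpy as np

def alpha_weights(C):
    """Fraction of samples where X_i correctly and uniquely predicts Y_j.

    C : binary array, shape (n_vars, n_factors, n_samples);
        C[i, j, l] = 1 if X_i's prediction of Y_j matches the most
        likely label for sample l.
    """
    n_vars, n_factors, n_samples = C.shape
    alpha = np.zeros((n_vars, n_factors))
    for i in range(n_vars):
        # Sort factors by total number of correct predictions, descending.
        order = np.argsort(-C[i].sum(axis=1))
        claimed = np.zeros(n_samples, dtype=bool)  # samples already credited
        for j in order:
            correct = C[i, j].astype(bool)
            alpha[i, j] = (correct & ~claimed).mean()  # correct and unique
            claimed |= correct
    return alpha
```

Note that under this scheme each sample is credited to at most one factor, so the weights for each input variable sum over $j$ to at most one.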

Training higher layers of the hierarchy
At the first layer of the hierarchy, the input variables are continuous and the latent factors are discrete. The corresponding training procedure was described in the main text. Now for each patient, $l$, and for each latent factor, $Y_j$, we have a distribution $p(Y_j = y \mid x^l)$. We construct a new data matrix consisting of the most probable value for each latent factor and each patient,

$$\bar x_j^{\,l} = \arg\max_y p(Y_j = y \mid x^l).$$

This data matrix will now be treated as the input data, $X$, and used to train a new CorEx representation at the next layer. Note, however, that at this next layer, both the inputs and the latent factors are discrete. This leads to a simplification of the update equations presented in the main text. In particular, the marginal distribution, $p(x_i \mid y)$, was previously parametrized as a normal distribution. Now, however, we can directly estimate this marginal from the contingency table of counts of the different discrete events.
$$\hat p(X_i = x \mid Y = y) = \frac{\sum_l \delta_{x_i^l,\, x}\; p(Y = y \mid x^l)}{\sum_l p(Y = y \mid x^l)}$$

The symbol $\delta$ represents the discrete delta function and is 1 if its subscripts match and 0 otherwise. The update equation for $p(y \mid x)$ and the other details of the optimization are unchanged.
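As a sketch of these two steps, the argmax labels used as next-layer inputs and the soft-count estimate of the discrete marginals, under the reconstruction above:

```python
import numpy as np

def next_layer_inputs(p_y_given_x):
    """Most probable label per factor and patient.

    p_y_given_x : array, shape (n_factors, n_patients, n_labels),
                  entries p(Y_j = y | x^l).
    Returns the next layer's discrete input matrix,
    shape (n_factors, n_patients).
    """
    return p_y_given_x.argmax(axis=2)

def discrete_marginal(x_i, p_y, n_values):
    """Soft-count (contingency table) estimate of p(X_i = x | Y = y).

    x_i : discrete samples of one input variable, shape (n_samples,)
    p_y : p(Y = y | x^l) for one label value y, shape (n_samples,)
    """
    counts = np.zeros(n_values)
    for value in range(n_values):
        counts[value] = p_y[x_i == value].sum()  # sum of soft label weights
    return counts / counts.sum()
```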

PCA Survival Analysis
Principal component coordinates were used analogously to the CorEx continuous factor labels. The sample population was stratified approximately into thirds according to the coordinate values for each principal component. All survival analyses (e.g., for Supplementary Figure 7) then proceeded as was done for the CorEx factors.
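A minimal sketch of this tertile stratification, assuming the lifelines package for the subsequent log-rank comparison (the original analysis tooling is not specified in this excerpt):

```python
import numpy as np
from lifelines.statistics import multivariate_logrank_test

def tertile_groups(coords):
    """Split samples into thirds by a continuous coordinate
    (a principal component here, analogous to a CorEx factor label)."""
    lo, hi = np.percentile(coords, [100 / 3, 200 / 3])
    return np.digitize(coords, [lo, hi])  # 0 = low, 1 = middle, 2 = high

# Hypothetical usage: pc1 holds one PC's coordinates; times and events
# are the survival durations and event indicators for the same samples.
# groups = tertile_groups(pc1)
# result = multivariate_logrank_test(times, groups, events)
# print(result.p_value)
```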