We used a robust, non-parametric measure of correlation to develop a general statistical method for reusing publicly available microarray data. The purpose of the method is to generate clinically relevant hypotheses regarding human disease. It employs the widely used SAM software package to define "pathway signatures" from known molecular perturbations [4]. The term "pathway signature" refers to the specific pattern of differential gene expression that was observed upon experimental perturbation of a given pathway. It is represented by the log2 of the fold change for the list of differentially expressed genes in the perturbed versus the non-activated state. Extracting this subset of genes is necessary in order to pinpoint the genes at play in the specific pathway of interest, and to eliminate other factors such as cell type could dominate any correlative signal detected. Our method can then assess the degree of statistical similarity between gene expression profiles in human disease and these pathway signatures, which in turn can be used as a measure of pathway activation. We are then able to evaluate pathway activation as a risk factor for patient survival. We call this approach Expression-based Pathway Signature Analysis (EPSA). Figure 1 provides an overview of EPSA.
Application to existing, established datasets
EPSA correctly identifies pathway mutations in cancer data
Bild et al. trained Bayesian regression models on a training set derived from replicates of five different oncogenic mutations. These predictors were able to assess the relative probability of pathway deregulation in tumors and cell lines for these five affected pathways. The authors then demonstrated that pathway prediction for positive control samples tended to reflect the molecular basis for tumorigenesis. This tendency was indicated by probability scores derived from their method of regression analysis [1]. Positive controls were expression patterns from mammary tumors from mice carrying transgenic copies of the oncogenes used in the training set or, in the case of Rb null animals, knockouts of an inhibitor of E2F3 [5]. Figure 2A shows results for positive controls presented by Bild et al., and Figure 2B shows results for a similar analysis using EPSA correlation scores instead of a regression probability score. EPSA demonstrates a similarly high level of accuracy in identifying the actual underlying mutation in positive controls.
EPSA prognosis prediction outperforms that of Bild et al.'s more complex machine learning approach
Figure 3 shows Kaplan-Meier survival curves for an ovarian cancer patient cohort (n = 135). The curves are based on EPSA correlation scores for each of the five oncogenic mutations performed by Bild et al. (see Methods). They show the two extreme tertiles of patients: namely, the third of patients with the highest degree of correlation and the third with the lowest degree of correlation. p-values are provided for Cox proportional-hazards analysis across all patients (described in Methods). These results may be compared with those in Supplementary Figure 4 of the Bild et al. paper. Mutations in the Myc and E2F3 pathways are roughly comparable as factors influencing survival in both methods. Mutations of the Ras and β-catenin pathways had low p-values in the EPSA and Bild analyses, but neither were significant when corrected for multiple hypothesis testing. Importantly, results of the Src pathway using Bild's method were not significant when corrected for multiple hypothesis testing (p = 0.22), while the p-value generated using EPSA was highly significant at 0.002 after Bonferroni correction. One major finding of the paper by Bild et al. was that clustering patients into six groups based on all five mutations together gave a more significant differential prognosis. Here, too, EPSA differentiated survival to a statistically significant degree, (Kaplan-Meier p-value < 0.0093 [not shown]).
EPSA detects statistically lower correlation in disease profiles with Connectivity Map perturbagens
The Connectivity Map is a reference collection of expression profiles generated by stimulation of human cultured cells with a number of small molecules, along with an algorithmic method for data mining through pattern matching. One goal of the approach was to detect functional connections between disease and drug action. Of note, the authors showed that connections could be detected even across cell types. Of the thirty-two Connectivity Map perturbagens for which we were able to generate pathway signatures (see Methods), none showed a significantly higher degree of correlation with tumor expression profiles from the Bild et al. ovarian cancer cohort than randomly permuted signatures. However, twenty-one perturbagens showed significantly lower levels of correlation with disease data [see Additional file 1]. That is, disease tissue profiles tended to show lower correlation with the pathways activated by these perturbagens than would be expected by random chance.
EPSA predicts differential prognosis in cancer patients based on projected activation level of small-molecule pathways
For the twenty-one molecules with a lower degree of correlation with patient profiles from the ovarian cancer cohort, we performed survival analysis using the degree of correlation between the signature and the tumor expression profile as a factor influencing survival. After Bonferroni correction, five of the perturbagen signature correlation values were determined to be significant factors influencing survival. The directionality of the effect for two of them (LY294002 and Trichostatin Aa) was opposite to what would be intuitive or helpful therapeutically, but of particular interest were results for the remaining three compounds: monorden, 17-allylamino-geldanamycin (17AAG), and ikarugamycin (Cox proportional analysis p-values: 2 × 10-4, 3 × 10-6, and 3 × 10-8 respectively; Figure 4). For these drugs, lower correlation with the drug signature corresponded to a lower rate of survival. Higher correlation with the drug signature corresponded to a higher rate of survival, though paradoxically, this higher correlation is also closer to the average expected correlation between the patient samples and the pathway signature. These results suggest that, for any one of these three small molecules, the corresponding pathway may be dysregulated in ovarian cancer, and that treatment with the drug could induce a beneficial drug signature. Of course, additional experiments would be required to tease out the actual underlying molecular mechanisms, and whether this proposed explanation would in fact play out with actual stimulation of ovarian tumor cells using these known therapeutic molecules.
Application to a divergent species dataset
Pathway signature determination from compendium data
Based on clinical relevance and availability of outcome data, we chose B cells as a model cell on which to test our approach. Gene Expression Omnibus series GSE2050 provides time course data in which gene expression profiles from mouse B cells stimulated with 33 different ligands were compared to gene expression profiles of unstimulated B cells [6, 7]. We used this data as the training compendium for EPSA. We used SAM [4] to identify pathway signatures of differentially expressed genes for each of the 33 ligands in the compendium dataset. We then sought to confirm that these pathway signatures would show a statistically significant degree of correlation with profiles generated through activation of the same pathways in other experiments, and on other microarray platforms.
EPSA identifies correlation patterns in positive controls from mouse B cells
GEO dataset GSE1014 (no associated PubMed ID) was generated by stimulating mouse B cells with a subset of ligands that had also been used in GSE2050. We used it as a positive control to confirm that correlations between test set profiles and corresponding ligand signatures derived from the compendium were detectable and differentiable. The resulting correlations, averaged among test replicates, are shown in Additional file 2A. In three out of the four test cases, the ligand used to stimulate the cells in the test set was correctly re-identified as the most highly correlated compendium profile. One test stimulation, lipopolysaccharide (LPS), was misidentified as CD40 ligand (CD40L), but LPS was a close second. In summary, our method demonstrated good sensitivity for detecting similarity between test profiles in mouse B cells and corresponding profiles derived from the mouse B cell compendium data.
EPSA detects correlation patterns in positive controls from human B cells
Applying our method to human B cells introduced additional complexity, in that the genes from the compendium mouse arrays had to be translated to their human orthologs and then translated again to the probes used on the human array platform. This was accomplished with NCBI's HomoloGene mapping (see Methods for details). Our method was designed to account for the fact that many genes might not be translated because of platform differences or a lack of orthologs. Note that these datasets were generated in an independent laboratory. All four stimulations in human B cells included Anti-Ig (AIG), and correlation with the AIG pathway signature for all four samples was high [see Additional file 2B]. In addition, CD40L correlation was high for the first two samples, which included CD40L, but lower for the second two samples, which did not include it. With the exception of IL4 stimulation in one sample, which did not demonstrate high correlation with the IL4 compendium profile, these results show that our method has good (though not perfect) sensitivity for detecting similarities between profiles, even when comparing across platform and species, and at different time points. Potential reasons for not detecting IL4 include one-time experimental or biological variability, since the test case represented only a single microarray. Interestingly, the next sequentially labeled experiment in the series in theory applied only AIG, but this array did show a higher degree of correlation with IL4. Without replicates, human error in the form of mislabeling cannot be ruled out.
Application of EPSA to diffuse large B cell lymphoma on cDNA microarrays
Having confirmed a reasonable level of sensitivity for correlation detection in positive controls, we next applied EPSA to microarray data from human samples of diffuse large B cell lymphoma (DLBCL), a B cell malignancy. The goal was to determine which pathways, if any, are activated in this disease, and if their activation has implications for survival. Using a dataset comprising gene expression profiles from 240 DLBCL patients [8], a number of the compendium ligands showed positive correlation with disease profiles [see Additional file 3].
To assess the significance of these correlations, we computed a false discovery rate (FDR) through permutation analysis (described in Methods). We identified four ligands with significant q-values: AIG, CD40, CPG, and LPS, with q-values of < 0.001, 0.08, 0.05, and 0.11 respectively [see Additional file 4].
We next performed survival analysis on the cohort of DLBCL patients using the level of correlation with the different signatures as factors influencing survival. The original study used unsupervised hierarchical clustering to group patients into three subtypes of DLBCL: germinal center-like (GC), activated B cell-like (AB), and type III. Of these subtypes, GC patients tended to have the best survival (p < 0.0001). AB and type III patients had poorer prognoses, but were statistically indistinguishable from each other using this methodology (Figure 5A).
We performed Cox proportional-hazards analysis on the same data to evaluate the EPSA score for the four statistically significant ligands listed above as potential factors influencing survival. Cox proportional-hazards analysis showed that AIG correlation was not a significant factor for survival, while the other three were significant, with Bonferroni corrected p-values for LPS, CD40, and CPG of 0.004, 0.0008, and 0.004 respectively (Figure 5B, 5C, and 5D). Toll-like receptor 4 (TLR4), Toll-like receptor 9 (TLR9), and CD40 are the receptors to which LPS, CPG, and CD40 ligand respectively bind to activate cells. Therefore, statistically significant, positive correlations with a signature found in TLR4, TLR9, and CD40 pathway activation were each associated with poor prognosis in diffuse large B cell lymphoma. As discussed below, a number of previous studies have demonstrated results that are in keeping with these findings.
Even more interesting are the results when this analysis is performed on activated B cell-like and type III, the two disease subtypes that were indistinguishable in the original study: Figure 5A shows that the AB and type III subtypes are statistically indistinguishable (p = 0.45). We analyzed these subtypes only and grouped by level of correlation (high, medium, or low) with pathway signatures for LPS, CD40, or CPG as described above. The results were the Kaplan-Meier curves shown in Figure 5B, 5C, and 5D. The curves represent the two extreme tertiles of patients, or the highest versus lowest degree of correlation. Cox proportional-hazards analysis across this entire subset of patients yielded Bonferroni corrected p-values of 0.008, 0.001, and 0.004 respectively for LPS, CD40, and CPG. The distinction being made is therefore independent from that of the GC, AB, type III system established in the original study [8].