We applied HOCUS to the problem of detecting cancer subtypes using two very different data modalities – somatic mutations and 3D tumor imaging data. Clustering patient samples by their shared genomic events or related imaging features may reveal common disease etiology important for outcome assessment. Yet mutation and imaging data are sparse – sample pairs have few overlapping events. It is therefore problematic to use these data as features directly for clustering since similarities calculated from sparse spaces suffer in sensitivity and specificity [14]. Similarities based on the local neighborhood in a network can be more sensitive because this approach can capture samples having an indirect coincidence through other samples. We show that the use of HOCUS for either mutations or imaging data adds specificity as it produces inferred subtypes that are biologically– and clinically–relevant that were undetected by the approaches using lower-order metrics.
Community detection reveals cancer subtypes using somatic mutation data
The particular ways in which a tumor genome is altered creates a signature that reflects the type of cell and mutagens involved. Driving events involving specific genes are associated with certain cancer types. For instance, BCR-ABL fusions are characteristic of chronic myeloid leukemia. The question is whether the pattern of mutations within these cells-of-origin can further subdivide the patient samples into meaningful categories that may be associated with patient outcomes.
We applied HOCUS to mutation data for 3 TCGA cancers: high-grade serous ovarian cancer (OV), glioblastoma multiforme (GBM), and bladder urothelial carcinoma (BLCA). We computed Hamming similarity over the mutation data for n genes as:
$$ {S}^{(1)}:{s}^{(1)}\left( j, k\right)=\frac{1}{n}{\displaystyle {\sum}_{i=1}^n\; I\left({x}_{i, j},{x}_{i, k}\right),} $$
(2)
where j and k denote two different samples, and x
i,j
is the value of feature i for sample j. The Hamming distance counts the fraction of matching mutated genes for all sample pairs, resulting in an adjacency matrix of m × m samples. We retained for clustering all metrics that provided a non-redundant set of relations between samples not captured by lower-order metrics. For higher-order clustering, we raised the precomputed similarity matrix S to the d − 1 power, where d is the order of clustering. We then supplied this similarity matrix as the feature matrix for input to consensus clustering (see Methods). Figure 1 shows a conceptual example of this principle– as the order of clustering increases, cliques in the network emerge and form clusters.
To do this, we identified all k
th-order metrics and lower such that the (k + 1)st metric produced highly similar relative similarities to the kth metric as measured by a kernel alignment test [15], (see Additional file 2: Figure S2). We sought to determine if higher-order feature-based similarity measures (i.e. those based on mutations, images etc) had an enrichment for connecting patients with similar survival outcomes compared to using first-order feature-based measures.
For each tumor type, we clustered the patient samples based on either Pearson correlation, the first-order Hamming similarities, or non-redundant higher-order similarities. We used K-means consensus clustering [13], for K = 2 to 10, and calculated the degree to which the solutions separated patients with different outcomes as a measure of biological relevance. A Kaplan-Meier test was performed on each clustering solution and the significance (-log P-value, log-rank test) was recorded (Fig. 2).
We applied HOCUS to the TCGA GBM dataset containing 283 patients for which 14,910 mutations were found across 7,874 distinct genes, and found 3 distinct clusters. Survival differentiation has proven difficult to achieve in previous analyses of GBM datasets [16, 17], however the HOCUS results show some difference in survival between clusters. Note that Pearson correlation achieves greater difference in survival but must consider at least ten clusters, which is likely an over-partitioning of the samples for this data set. Patients in the best surviving cluster, cluster 1, had low EGFR and TTN mutation occurrence compared to that of patients in other clusters; TTN mutations are predominantly in cluster 3 and EGFR mutations distributed between clusters 2 and 3. All 14 of the IDH1 mutated tumors were in cluster 1, as were most (11 of 16) of the ATRX mutants. The cluster corresponds well with mRNA cluster 3 (LGr3) from the recent TCGA paper [18]. Thus, the HOCUS clustering using mutations seems to have been able to tease out a low grade diffuse subtype, defined by IDH1 mutation status seen in younger individuals, characterized by the absence of a 1p/19q codeletion and a lack of TERT expression and an overall better prognosis. Furthermore, all 65 samples in cluster 1 have a TP53 mutation (Additional file 3: Figure S3), whereas there are none in cluster 2 and only 14 (of 105 samples) in cluster 3.
We next applied HOCUS to the TCGA OV dataset containing 316 patients for which 14,810 mutations in 8,258 genes were reported by the TCGA analysis working group. For the OV dataset, the first-order solution found the greatest separation in survival between clustered groups. Higher-order metrics gave different solutions but were comparable with the first-order solution in separating out patient groups with differences in outcome. One of the main divisions of the samples shows a significant difference in overall mutation rate (Additional file 4: Figure S4). In addition to TP53 mutations, several genes that are characteristic of passenger mutations were also predominant in the highly mutated cluster including TTN, MUC16, and RYR2. Other mutations were significantly associated with these clusters, highlighted in Additional file 4: Figure S4. HOCUS OV clusters correlate with platinum resistance, which is a survival marker.
These findings were surprising given that the TCGA OV dataset has posed a significant challenge for analysts to identify meaningful genome-based distinctions between the patients [17, 19]. One of the most successful attempts to date was reported by Hofree et al. [20] in which patient samples were clustered based on a network diffusion transformation of the mutation data. To compare the two approaches we ran HOCUS using the TCGA OV data as filtered by Hofree et al., whereas we do not filter the mutation datasets. Our results indicated that comparable survival differences to the NBS approach could be obtained by using a different metric (e.g. Hamming distance used here) and higher-order HOCUS, eliminating the need to introduce prior knowledge (Additional file 5: Fig. S5). A similar result was obtained when applying HOCUS to a TCGA breast cohort (see Additional file 6: Section 0.0.4) in which both the first-order and second-order results revealed similar survival separation while producing different clustering solutions. Thus, since the first and second-order solutions for both OV and BRCA (Additional file 7: Figure S6, Additional file 8: Figure S7, Section 0.0.4) gave different clustering solutions but comparable outcome separation, it is possible that a solution combining first- and second-order solutions could produce a better outcome predictor for the patients. Furthermore, since HOCUS performed better on the OV dataset when hypermutated samples and hypomutated genes are excluded from analysis, it would be beneficial to experiment with more extensive data preprocessing.
We next applied HOCUS to the TCGA BLCA cohort of 394 patients for which 84,048 mutations were called based on exome sequencing, covering 15,553 distinct genes. HOCUS uncovered distinct clusters in BLCA. Compared to all HOCUS orders, 2nd–order has the largest separation in survival of the clustered patients. We note that, like the case for OV, the clusters are associated with the number of mutations per sample. Indeed, clustering by mutation rate alone yields comparable separation in patient outcomes (log-rank test, P <4−10 5; Additional file 6: Figure S8) as the HOCUS solution. Since mutation data was the only data used, we searched for genes with mutations that discriminate the patient clusters to understand the different underlying etiologies. Figure 3 shows the top 15 genes associated with each cluster using a χ
2 test of independence (due to overlap in the ‘top’ genes for each cluster, 31 genes are shown). Many of these genes were associated with several cancer types, for example LRP1B has been associated with thyroid, ovarian, renal, and brain cancers [21–23]. Other known oncogenes such as PIK3CA (χ
2 test, p-value 3.6 × 10−4) and TP53 (χ
2 test, p-value 3.9 × 10−8) are also significantly associated with the clusters. Interestingly, the highly mutated BLCA cluster has the best survival prognosis. On average, cluster 3 patients had a 1.7 to 2.2 times higher survival probability at the 5-year mark (0.56 compared to 0.33 for cluster 1 and 0.25 for cluster 2). The 95% confidence interval of the survival probability at 5-years for cluster 3 patients was (0.45, 0.66) compared to (0.12, 0.38) for cluster 2 patients and (0.22, 0.43) for cluster 1 patients.
In both cases, a higher rate of TP53 mutations was found (80% compared to the background rate of 50%), and a slightly higher rate of smokers was in the category. However, when we compared our clusters to the TCGA BLCA clusters, which were generated using mutation and copy number data with an integrated NMF approach, we found only a weak correspondence (Additional file 9: Figure S8(b), χ
2 test, p-value 0.128). Thus HOCUS finds a solution that is distinct from the TCGA–derived subtypes and one that has a comparable association with survival outcomes.
We applied HOCUS to the TCGA Pancancer-12 mutations data. Clusters were derived for 3,394 samples using 313 mutated genes (listed by recent TCGA studies as high–confidence driver mutations [24, 25]) from which similarities were calculated. We visualized the resulting clusters with the Tumor Map tool (https://tumormap.ucsc.edu/?p=Pancan12.SampleMap&li=5). Tumor Map projects the HOCUS 2nd-order sample–sample similarities onto a 2-dimensional map based on the Google maps framework. Tumor Maps built from mutation data using Pearson correlations failed to separate samples into distinct clusters (data not shown), whereas the HOCUS clusters correlated with many other genomic features and were able to find subtypes among the samples (Additional file 10: Figure S9). The subtypes are characterized by clusters of samples from many tissues of origin. Not surprisingly, the main difference in most of the clusters is whether they contain samples with either a TP53 or PIK3CA mutation. However, of note is that HOCUS found a pancancer cluster (dashed boxed in Additional file 10: Figure S9) with samples containing both TP53 and PIK3CA mutations. These tumors either represent cases in which these two frequently mutated genes are altered or consist of tumors with higher subclonal heterogeneity that contain distinct subclones of either TP53 or PIK3CA mutated cells. In either case, these tumors would be of particular interest as they harbor two major drivers of oncogenesis. An attribute enrichment analysis revealed that samples clustered into the TP53-PIK3CA group are composed of tumors in later stages on average than those outside the cluster, consistent with a longer and more complex evolutionary history for these samples.
In summary, when applied to tissue-specific cohorts, the mutation-based HOCUS subtyping reveals distinct sample groupings that appear to be as biologically significant, based on survival comparisons, as the original TCGA expression-derived subtypes, which have been shown to be highly correlated with histopathology calls. On the other hand, when applied to a diverse collection of tumor types that span multiple tissues, the HOCUS-based mutation clusters identify major divisions that are not primarily tissue based. Taken together, this suggests HOCUS could be used in a classification system of the tumors in which mutation data is incorporated, along with other data like copy number and gene expression. Rather than considering mutated genes independently in this process, as was recently done to build pancancer decision trees for this data [26], HOCUS clusters could be used instead, providing potential mutated gene combinations for division points like the TP53- PIK3CA subtype described here.
Community detection of subtypes using copy number data
To test the applicability of HOCUS to other data, we also applied the technique to the clustering of patients based on Hamming similarity of GISTIC2 discretized copy number data (see Methods). We applied HOCUS to TCGA prostate adenocarcinoma (PRAD) copy number data because prostate cancers are known to harbor significant copy number events over the evolution of the tumor including AR amplifications, TMPRSS2-ERG fusions, and even whole genome level events such as chromoplexy. We converted the continuous copy number data to ordinal by using the output of the Broad’s GISTIC2 pipeline [27] that provides gene-level associated copy number estimates. GISTIC2 scores indicate copy number aberrations, where 1 indicates low-level and 2 indicates high level amplifications, negative scores indicate the same but deletions rather than amplifications, and a score of 0 indicates no copy number alterations. For the TCGA PRAD cohort, survival rates are sufficiently high making patient survival time an inappropriate measure of disease subtype. HOCUS cluster 2 patients have higher Gleason scores, more lymph node invasion, and higher stage (Additional file 11: Table S1, Additional file 12: Figure S10), all of which are associated with disease aggression.
Community detection from magnetic resonance imaging data
We next applied the HOCUS method to the task of grouping patients with GBM based on the imaging of their tumors. Previous imaging studies using MRI have extracted location and anatomical features to characterize tumors. Recent imaging studies have utilized MR images to define patient subtypes for personalized treatment [4, 28, 29]. MR images are 3-dimensional and contain millions of pixels and human brains have variable size and shape, making it difficult to compare patients. By mapping the MR images to the MNI brain atlas (Montreal Neurological Institute 152 [11]), we were able to compare patient images, and using HOCUS we were able to find clinically relevant imaging subtypes.
We applied HOCUS clustering using the GBM voxel (3-dimensional pixel) data from the TCGA collection of 184 patients with first- and higher-order metrics to find community structures. MRI data are part of the TCGA GBM cohort, downloaded from the Cancer Imaging Archive and processed by Stanford University as described in Liu et al. [4]. To reduce noise and the size of the MR images, we first preprocessed the data by filtering to a set of informative voxels containing tumor in some, but not all, of the patients (Additional file 13: Figure S11). We removed subsequent analysis all non-informative voxels with tumor in fewer than 15 individuals (see Additional file 6: Sec. 0.0.1). We computed sample-to-sample similarities using the remaining voxels and performed higher order calculations and clustering as described above for the mutation data (e.g. Hamming distance and ConsensusClusterPlus were used). Cluster solutions revealed that the metrics converged by the fourth-order (Additional file 14: Table S2).
We sought to determine which metric based on the imaging data best matched up with the observed differences in patient outcomes. We defined an outcome-based similarity metric by computing all pairwise absolute differences between the survival time of every pair of patients, d
ij
= |T(i) − T(j)|, where T(i) is the survival time in days of patient i. These distances were converted to co-survival similarities via the linear transform \( {s}_{ij}=\frac{1}{m}\left(1-{d}_{ij}\right) \), where m = max{d
ij
} is the maximum absolute differences between any two patients. We then quantified the correlation between imaging-based similarities and co-survival similarities using a normalized version of the kernel alignment method of Cristianini et al. [15] that calculates a centered correlation between two full sample-by-sample similarity matrices. We repeated the kernel alignment comparison to co-survival for first-order and higher order HOCUS metrics (Additional file 14: Table S2).
To visualize the results of the kernel alignment comparisons, we plotted the probability densities (Fig. 4). The third- and fourth-order had the highest kernel similarity scores to co-survival as can be verified visually. We chose the third–order solution because fourth–order produced similar groupings to the third, and thus the added complexity of using a higher order was not justified in this case. Interestingly, second-order had a lower kernel similarity to co-survival than first-order, illustrating the benefit of searching several higher order metrics beyond 2nd–order. We note that second-order HOCUS stratifies MR images into tumor groups by anatomic location (Additional file 15: Figure S12(a)). On the other hand, third-order clusters were driven by a combination of location and volume. In addition, third-order produced a larger separation of survival in groups than location or volume alone.
Each clustering solution that is based on a different metric order identifies different characteristics in the MR images that are associated with survival prognosis. While first-order clusters, based on Hamming distance, align with tumor volume, and second-order with anatomic location (Additional file 15: Figure S12), third-order clustering captures aspects of both tumor volume and location (Fig. 5(c-d)). Each solution has statistically significant separation in survival (Fig. 5(a)), with third-order having the greatest separation in survival of image cluster groups. Patients with tumors in the frontal lobe and which are smaller in volume had significantly better survival than larger tumors in the lower rear portions of the brain.
Interestingly, the third-order solution pulled together patients that made up two separate poor surviving clusters in the second-order solution. To better understand the third-order subtypes revealed by the imaging data, we inspected the genetic pathways that distinguish the poorer surviving subtype from the others using RNASeq gene expression data available for 184 patients. We computed a differential expression score for each gene to indicate whether a gene’s expression level was higher or lower on average in the poorer surviving cluster (cluster 3) relative to the others using the Statistical Analysis of Microarrays technique [30]. We then connected any gene with an absolute differential expression higher than one standard deviation above the average of all genes. Finally, we retained pathway interactions connecting only those genes that were both in this set and plotted them with the Cytoscape viewer [31]. Several pathways involved in major growth and proliferation signaling were implicated from these networks as might be expected (Fig. 6). ERK (MAPK1) was found to be significantly overexpressed in cluster 3 tumors along with JUN-kinase (MAPK8). In addition, AKT1 and PLK1 were also found to be higher in cluster 3, both known to drive cell cycle progression.