We conducted a regularized elastic-net regression analysis of heterogeneous genomic data that encompassed multiple classes of genetic aberrations (i.e. genomic features) in colorectal cancer. We identified 158 genomic features associated with advanced clinical stage with 10-fold cross validation (Figure 3) and ranked these genes based on by their regression coefficient and level of support from multiple assay modalities (Table 3). Our integrative analysis approach provides a better picture about the distribution of genomic features associated with clinical parameters, which cannot be obtained by separately analyzing individual classes of genetic aberrations such as mutations alone (Figure 7). For example, our integrative analysis identified specific mutations that delineate clinical stage in the context of overlapping genomic features. This was not possible when analyzing mutations alone without the genomic context of other genetic aberrations. We also demonstrated that the predictive power was relatively improved using our integrative analysis approach (Table 2) as several other studies have demonstrated previously [24–29]. Despite of slight increase in predictive power, integrative analysis enabled us to identify critical genes that are supported by more than one genomic biomarker. Several top candidates such as GNAS (supported by methylation and mutation) and FHIT (supported by methylation and mutation) would not be detected if analysis was solely based on gene expression data.
The leading candidate, WRN, showed the higher number of genomic features that can delineate advanced clinical stage CRC (Figure 5). WRN is a known tumor suppressor. Hereditary germ line mutations in the WRN gene cause the adult-progerioid genetic disorder known as Werner’s syndrome (WS). This disorder is associated with an increased risk of cancer [41–43]. The increased risk of cancer in individuals with WS has established the role of WRN as a tumor suppressor gene. However, genetic aberrations in WRN have never been described in sporadic CRC development. WRN is a human RecQ DNA helicase family and has both the characteristic 3′ to 5′ helicase activity and also a 3′ to 5′ exonuclease activity. Generally, we observed the WRN was subject to hemizygous gene loss, a category of genetic aberrations that are increasingly being scrutinized for a specific role in cancer development. Solimini et al. recently provided evidence that the gene dosage changes as result of hemizygous deletions play an important role in tumor progression . We are pursuing translational validation studies and further experiments to address the role of WRN in advanced CRC.
The results from analysis of the individual clinical parameters of the TMN criteria provided us with additional biological insight. There was significant overlap between genes associated with advanced clinical stage and N status, more so than any other comparison among the various clinical parameters we examined. This may be an indication that genes associated with lymph node invasion are relatively critical for delineating advanced stage of CRC compared to other parameters such as tumor invasiveness level. This conclusion is supported by the fact that lymph node status remains the most important and reproducible prognostic clinical parameter for colon cancer . Interestingly, we did not identify a gene specific for distant organ metastasis. This may be related to the smaller number of stage IV CRCs in early releases of the TCGA data set or the potential greater role of lymph node metastasis as an early indicator of metastatic progression. By grouping stage III and IV patients as advanced disease, we improve the sample size at the risk of increasing within-group heterogeneity. As the TCGA CRC data set matures, we will have opportunity to analyze a large data set for associations.
We evaluated the performance of elastic-net on the large genomic data set with clinical information by using control data sets. Our analysis successfully identified the clinically relevant genomic signatures in three cases: i) Identification of synthetic genes that have stage-associated genomic features, ii) Identification of the IDH1 gene from GBM mutation data and iii) identification of methylation on MLH1 and mutations on TGFBR2 for MSI status of CRC. In the case of the CRC stage association, some of the selected genomic features for CRC stage from each genomic assay were consistent with the results from the TCGA CRC study (Figure 4). Our study supports the overall reliability of the candidates associated with clinical stage from elastic-net.
Interestingly, cancer genes (e.g. TP53, KRAS, APC) with the highest frequency of genetic aberrations were not among the genes identified in delineating advanced clinical stage of CRC. These cancer driver genes were mutated frequently across all clinical stages. For instance, APC was mutated in 90%, 74%, 77%, and 93% of stage I, II, III and IV respectively. Therefore, our interpretation is that while these cancer drivers play a critical role in the initial neoplastic development and maintenance, they play lesser role for influencing metastatic progression and advanced clinical stage.
Our study is distinct and novel compared to the TCGA CRC study; unlike TCGA, we considered four major features of genetic aberrations simultaneously rather than specific genetic aberrations in isolation. As we demonstrate, this improves the performance of our approach; it leads us to some gene candidates delineating advanced clinical stage not previously recognized. Integration of different genomic data gives us more informative results as we have shown it in this study. Multiple genomic features enable us to prioritize the gene aberration for follow-up validation studies. Another feature of our study is the flexibility of our analysis approach. Using elastic net, we can readily modify our bioinformatics pipeline to consider other clinical parameters. In the future, we anticipate applying additional clinical parameters such as drug response for our analysis as these data become available.
One of the main limitations of our study is the absence of an independent data set for validation. We demonstrated the robustness of our candidates by bootstrap resampling. We also considered over 11 other genomic studies of CRC that looked at differences among early versus advanced stage disease . Only two analyzed tumors from all clinical stages usually with less than 80 samples. In addition, all of these studies used older gene expression microarrays lacking the number of features see in the TCGA and thus never reached the level of comprehensive multiple-platform analysis or high genomic resolution as was conducted by the TCGA. For this analysis, the ability to integrate heterogeneous genomic features was critical and the limitations of the other data sets made them less useful for our integrated analysis. As a validation analysis with a single platform, we opted to use an independent set of 354 samples with both clinical data and copy number variation data from the expanded TCGA CRC data. This separate analysis validated MALT1 as a top hit when only considering copy number analysis.
We also considered an integrative analysis of independent CRC samples from TCGA. At the time of this study, many of the TCGA samples have not undergone analysis with all of the genomic assays. Many of the other CRC samples, outside of the ones we use, lack sufficient clinical data. In the future, with the completion of the TCGA study and adequate clinical annotation afterwards, we will use the additional data sets from independent samples from our original analysis.
To improve our analysis, we are testing methods to integrate disparate classes of data; for example, this genomic data covers a range including binary for mutation, categorical for copy number, and continuous for mRNA and methylation. One potential approach involves turning the values of every genomic assay into categorical variable by discretization. For example, GISTIC is a method for producing discrete values for discriminating significant copy number alterations. However, discretization is less applicable to continuous data sets such as gene expression and methylation. Another possibility involves converting every genomic assay feature into a continuous value. For example, we can use several programs that predict the functional impacts of mutations, thus convert binary mutation values (e.g. stop mutation, substitutions, etc.) into continuous pathogenicity values. We will test whether homogeneity in different genomic values may improve the predictive power of our analysis and improve the ranking of genes in terms of their importance biologically and clinically.