Integrative regression network for genomic association study
© The Author(s). 2016
Published: 12 August 2016
The increasing availability of multiple types of genomic profiles measured from the same cancer patients has provided numerous opportunities for investigating genomic mechanisms underlying cancer. In particular, association studies of gene expression traits with respect to multi-layered genomic features are highly useful for uncovering the underlying mechanism. Conventional correlation-based association tests are limited because they are prone to revealing indirect associations. Moreover, integration of multiple types of genomic features raises another challenge.
In this study, we propose a new framework for association studies called integrative regression network that identifies genomic associations on multiple high-dimensional genomic profiles by taking into account the associations between as well as within profiles. We employed high-dimensional regression techniques to first identify the associations between different genomic profiles. Based on the resulting regression coefficients, a regression network was constructed within each profile. For example, two methylation features having similar regression coefficients with respect to a number of gene expression traits are likely to be involved in the same biological process and therefore we define an edge between two methylation features in the regression network. To extract more reliable associations, multiple sparse structured regression techniques were applied and the resulting multiple networks were merged as the integrative regression network using a similarity network fusion technique.
Experiments were carried out using four different sparse structured regression methods on five cancer types from TCGA. The advantages and disadvantages of each regression method were also explored. We found large inconsistency in the results from different regression methods, which supports the need to extract the proposed integrative regression network from multiple complementary regression techniques. Fusing multiple regression networks by using similarity measurements led to the identification of significant gene pairs and a resulting network with better topological properties.
We developed and validated the integrative regression network scheme on multi-layered genomic profiles from TCGA. Our method facilitates identification of the strong signals as well as weaker signals by fusing information from different regression techniques. It could be extended to integrate results obtained from different cancer types as well.
Ongoing efforts by The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have provided an exceptional opportunity for biomedical researchers and practitioners to explore the mechanisms and to identify important biomarkers underlying cancer. Large-scale analyses of the available datasets, which cover genomic, transcriptomic, epigenomic, and clinical profiles, have revealed important characteristics of genomic associations in cancer. Additionally, ‘cancer stat fact sheets’ indicate that the number of new cases and the expected mortality rate of cancer are rapidly increasing. Ongoing studies of gene expression with respect to multi-layered genomic features are highly useful for overcoming the poor prognosis of cancer.
In this study, we identified genomic associations using multiple genomic profiles. Given the high level of noise and extremely large data dimension, simple correlation-based association tests are prone to revealing indirect or false-positive genomic associations. Instead, we employed high-dimensional multivariate regression techniques to identify genomic associations between different high-dimensional genomic profiles. Moreover, we constructed a regression network utilizing the regression coefficient vector or matrix. The regression network was constructed within each profile such as mRNA expression or methylation, but takes into account the association between the two different genomic profiles. To extract more robust and statistically significant results, we used multiple regression techniques and then integrated the resultant regression networks into an integrative regression network by using a network fusion technique.
Various sparse structured regression techniques have been proposed to address the challenges arising in a high-dimensional regression setting, both for the input and output variables. A widely used L1-regularized linear regression known as Lasso produces sparse regression coefficients when the number of features is large. Variants of Lasso have been proposed to incorporate structural information of genomic features as inputs and expression traits as outputs. Graph-guided Fused Lasso (GFLasso), for example, utilizes the network structure among output variables in a multiple-output regression setting. This is particularly suitable for association studies that consider gene expression traits as output variables because gene expression traits have a natural network structure. In Sparse Group Lasso (SGL), input variables (genomic features) are assumed to behave in groups; thus, by utilizing grouping information of the features such as pathway groups, the method identifies important genes in common pathways of interest. For problems with grouped covariates, this method can impose sparsity both at the group level and within each group. The more recently described Structured Input–output Lasso (SIOL) method combines structural constraints on both the inputs and the outputs. Similar to GFLasso, this method considers output structural information and, similar to SGL, it considers input group information. SIOL identifies non-zero coefficients using both the structural information and the grouping effects on the input and output variables.
Each of these sparse structured regression methods exhibits advantages and disadvantages. Rather than selecting the single best method, we built an integrative regression network by fusing multiple regression networks. We adopted the existing approach of Similarity Network Fusion (SNF) for network integration. The final fused network compiles shared as well as complementary information from all of the networks used in the fusion by identifying similarities across them. Given the natural propagative behavior of SNF, the produced output showed less noise and captured important signals (both stronger and weaker signals). We demonstrate the proposed approach for an association study using methylation and gene expression data for five cancer datasets from TCGA.
Overview of the proposed method
Data & pre-processing
Data acquisition platforms
Broad Institute HT-HG-U133A Platform
Details of all cancer profiles before and after filtration process
High-dimensional regression methods
Least absolute shrinkage and selection operator (Lasso)
The second term of equation (2), the L1 penalty, induces a sparse solution by driving many irrelevant beta coefficients to exactly zero. The result of Lasso is a set of features that are strongly associated with the given expression trait, and the influence of each feature j is given by its regression coefficient β_j, which measures how strongly or weakly the feature influences the trait. This procedure is applied to each of the multiple gene expression traits independently. Lasso is effective when J (number of features) ≫ N (number of samples) and only a small number of inputs are expected to influence the outputs. This is implemented in R using the glmnet package. The optimal parameter λ was chosen by cross-validation.
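To make the sparsity mechanism concrete, the following is a minimal Python sketch of Lasso for a single expression trait, solved by cyclic coordinate descent with soft-thresholding. It is not the glmnet implementation used in the study; the objective scaling (1/2N times the squared error plus λ times the L1 norm), the function names, and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2N)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # per-feature curvature
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]     # remove feature j's current contribution
            rho = X[:, j] @ resid / n      # correlation with the partial residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]     # put the updated contribution back
    return beta

# Simulated example (assumed data): J >> N, only three informative features.
rng = np.random.default_rng(0)
N, J = 60, 200
X = rng.standard_normal((N, J))
true_beta = np.zeros(J)
true_beta[:3] = [2.0, -1.5, 1.0]
y = X @ true_beta + 0.1 * rng.standard_normal(N)

beta_hat = lasso_cd(X, y, lam=0.3)
n_nonzero = int(np.count_nonzero(beta_hat))
print("non-zero coefficients:", n_nonzero, "of", J)
```

In a multi-trait setting, the same routine would be run once per expression trait, and the resulting coefficient vectors stacked as the columns of the beta matrix.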
Graph guided fused lasso (GFLasso)
To select optimal regularization parameters, we first identified the median of the non-zero beta coefficients and multiplied it by the total number of gene expression traits. The obtained value was used as the initial value of λ, and the initial γ was fixed at 1. We then evaluated different λ and γ values, for example fixing γ and setting λ to λ/2, λ, and 2λ, and then fixing λ and setting γ to γ/2, γ, and 2γ, while monitoring the mean squared error (MSE), the density of the regression coefficients, and the execution time. Through this empirical study, we selected the λ and γ values with the smallest MSE. Based on previous studies, the correlation threshold was fixed at 0.7 for all datasets throughout the experiments; thus, f(r_ml) was always greater than or equal to 0.7, so that only very highly correlated gene expression features were considered.
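The search described above can be sketched in Python as follows. This is only an illustration under assumptions: the helper names are hypothetical, the median is taken over coefficient magnitudes (the text does not say whether signs are kept), the preliminary beta matrix is a toy array, and `toy_mse` stands in for an actual GFLasso fit followed by an MSE evaluation.

```python
import numpy as np

def initial_lambda(beta, n_traits):
    """Median magnitude of the non-zero coefficients times the number of traits."""
    nonzero = np.abs(beta[beta != 0])
    return float(np.median(nonzero)) * n_traits

def candidate_grid(lam0, gamma0=1.0):
    """Vary lambda with gamma fixed, then gamma with lambda fixed."""
    factors = (0.5, 1.0, 2.0)
    grid = [(lam0 * f, gamma0) for f in factors]
    grid += [(lam0, gamma0 * f) for f in factors if f != 1.0]
    return grid

def pick_best(grid, mse_of):
    """Return the (lambda, gamma) pair with the smallest MSE."""
    return min(grid, key=lambda pair: mse_of(*pair))

# Toy preliminary coefficients: 2 features x 3 traits (assumed values).
beta = np.array([[0.0, 0.2, -0.4],
                 [0.6, 0.0, 0.0]])
lam0 = initial_lambda(beta, n_traits=beta.shape[1])
grid = candidate_grid(lam0)

# Placeholder for "fit GFLasso with (lam, gamma) and measure the MSE".
def toy_mse(lam, gamma):
    return (lam - 1.0) ** 2 + (gamma - 2.0) ** 2

best = pick_best(grid, toy_mse)
print(lam0, grid, best)
```

In practice the density of the coefficients and the running time would be recorded alongside the MSE for each grid point, as the text describes.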
Sparse group lasso (SGL)
Here, α ∈ [0, 1] is a parameter for the convex combination of the Lasso and group Lasso penalties (α = 0 gives the group Lasso fit, α = 1 gives the Lasso fit), and n and m represent the number of samples and the number of feature groups, respectively. This is implemented in R using the ‘SGL’ package.
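The role of α can be illustrated with a short Python sketch that evaluates a sparse group Lasso penalty for a given coefficient vector. The square-root group-size weight follows the usual sparse-group Lasso formulation and is an assumption here, as are the function name and the toy values; this is not the code of the ‘SGL’ package.

```python
import numpy as np

def sgl_penalty(beta, groups, lam, alpha):
    """lam * (alpha * ||beta||_1 + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2)."""
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    lasso_term = np.abs(beta).sum()
    group_term = 0.0
    for g in np.unique(groups):
        b_g = beta[groups == g]                       # coefficients of group g
        group_term += np.sqrt(b_g.size) * np.linalg.norm(b_g)
    return lam * (alpha * lasso_term + (1.0 - alpha) * group_term)

beta = [3.0, 4.0, 0.0, 0.0, 1.0]
groups = [0, 0, 1, 1, 2]                              # three feature groups

pen_lasso = sgl_penalty(beta, groups, lam=1.0, alpha=1.0)   # pure Lasso penalty
pen_group = sgl_penalty(beta, groups, lam=1.0, alpha=0.0)   # pure group Lasso penalty
pen_mixed = sgl_penalty(beta, groups, lam=1.0, alpha=0.1)   # the alpha used in this study
print(pen_lasso, pen_group, pen_mixed)
```

A group whose coefficients are all zero contributes nothing to the group term, which is how whole groups can be dropped, while the L1 term can still zero out individual coefficients inside a retained group.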
To define the grouping of features, we applied clustering techniques to the feature data. Hierarchical clustering was chosen after comparison with k-means and k-median clustering. The hclust function in R was used with Euclidean distance and Ward’s linkage method. Different numbers of groups (10, 20, 50, and 80) were tried, and 20 groups were chosen because they gave better clustering results and MSE. For regularization parameter selection, the minimum value of the penalty parameter, as a fraction of the maximum value, was set to 0.8 and α was set to 0.1.
Structured input–output lasso (SIOL)
As with SGL, the number of clusters/groups was chosen as 20. Parameter tuning was performed individually on each dataset through cross-validation. The identified λ1 was 0.1, λ2 ranged from 0.25 to 0.35, and λ3 ranged from 0.15 to 0.25 for all cancer profiles.
Construction of regression network and its integration
Identifying a cutoff for edge filtering in regression networks
Performance investigation of different regression methods
Because SIOL takes structural information into account, it significantly outperformed all other regression methods. GFLasso and SGL tended to produce comparable results, and Lasso, which uses no structural information, produced the largest MSE. Ranked from the smallest to the largest MSE, the methods were SIOL, GFLasso, SGL, and Lasso. The procedure was applied to multiple cancer profiles as shown in Fig. 3, and the behavior was observed to be similar for all cancer data types.
Discovering common genomic features of all methods without fusion technique
We further investigated the combined results of all four methods to identify influential predictors of cancer. The common predictors that were retrieved by all regression methods were collected. We focused on genomic features identified by all methods, as they are the strongest predictors of the expression traits. Because the β value measures how strongly each predictor variable influences the response variable, the top 200 gene pairs ranked by β value were collected for each of the four regression methods.
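The overlap analysis described above can be sketched in Python as follows. The function names and the tiny coefficient matrices are hypothetical, and ranking by the absolute value of β is an assumption; the study used the top 200 pairs from each of four methods.

```python
import numpy as np

def top_pairs(beta, k):
    """Return the k (feature, trait) index pairs with the largest |beta|."""
    order = np.argsort(np.abs(beta), axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, beta.shape)
    return set(zip(rows.tolist(), cols.tolist()))

def common_pairs(betas, k):
    """Pairs that appear in the top-k list of every method."""
    sets = [top_pairs(b, k) for b in betas.values()]
    return set.intersection(*sets)

# Toy coefficient matrices (rows: features, columns: traits), one per method.
betas = {
    "method_a": np.array([[0.9, 0.1], [0.0, -0.8]]),
    "method_b": np.array([[0.7, 0.0], [0.6, 0.1]]),
    "method_c": np.array([[-0.5, 0.4], [0.0, 0.0]]),
}
common = common_pairs(betas, k=2)
print(common)
```

Here only the pair (0, 0) is ranked in the top two by all three toy methods, which mirrors how few associations survive a strict intersection across methods.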
Figures 4 and 5 show that the results from different regression methods are very inconsistent. A naïve combination of the results would lead to a biased and inconsistent study. When the top 50, 100, or 150 regression coefficients were selected, no genomic feature was identified by all four regression methods (see Additional file 1). Figure 4 shows that the number of common genes identified by all regression methods from the top 200 regression coefficients was negligible: 3, 2, 4, 2, and 1 for the breast, colon, GBM, kidney, and lung cancer data sets, respectively. Therefore, rather than selecting a single regression method, we integrated the results obtained using various regression methods.
Integrative regression network
Permutation scheme to select significant pairs in regression networks
Regression coefficients measure the association strength between genomic features and expression traits. We fused these beta coefficients using similarity measurements for the different cancer profiles. A similar study can be conducted using correlation measurements, but correlation is highly prone to identifying additional indirect genomic associations, which may appear redundantly across different types of genomic data. We measured the affinities or similarities of methylation features (using the beta matrix) and of mRNA expression traits (by transposing the beta matrix). The four affinity matrices from the four regression methods were fused using SNF. The final integrative network was constructed from the pairs with the strongest affinity in each network and from those pairs that were supported by all networks (with either stronger or weaker affinity values). Additionally, the affinity of each individual regression method versus the final fused network was examined.
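The construction can be illustrated with a simplified Python sketch. It is not the SNFtool implementation used in the study: the affinity here is a Gaussian kernel with a single global bandwidth rather than SNF's neighbor-dependent scaling, the per-iteration renormalization is simplified, and all function names, parameter values, and the random beta matrices are assumptions for illustration.

```python
import numpy as np

def affinity(M, sigma=0.5):
    """Gaussian kernel on pairwise squared Euclidean distances between rows of M."""
    sq = (M ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * M @ M.T, 0.0)
    return np.exp(-d2 / (sigma * (d2.mean() + 1e-12)))

def full_kernel(W):
    """Off-diagonal weights of each row sum to 1/2; the diagonal is set to 1/2."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k):
    """Keep each row's k largest off-diagonal weights, normalised to sum to 1."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    S = np.zeros_like(W)
    idx = np.argsort(W, axis=1)[:, -k:]
    rows = np.arange(W.shape[0])[:, None]
    S[rows, idx] = W[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

def fuse(affinities, k=3, n_iter=10):
    """Cross-diffusion: each network is updated through the average of the others."""
    P = [full_kernel(W) for W in affinities]
    S = [local_kernel(W, k) for W in affinities]
    m = len(P)
    for _ in range(n_iter):
        P_new = []
        for v in range(m):
            others = sum(P[u] for u in range(m) if u != v) / (m - 1)
            Pv = S[v] @ others @ S[v].T
            P_new.append(full_kernel((Pv + Pv.T) / 2.0))
        P = P_new
    fused = sum(P) / m
    return (fused + fused.T) / 2.0

# Four hypothetical beta matrices (8 methylation features x 5 expression traits).
rng = np.random.default_rng(1)
betas = [rng.standard_normal((8, 5)) for _ in range(4)]

fused_features = fuse([affinity(B) for B in betas], k=3)    # rows of beta
fused_traits = fuse([affinity(B.T) for B in betas], k=3)    # rows of beta transposed
print(fused_features.shape, fused_traits.shape)
```

Because every update of one network passes through the others, a pair supported weakly by all methods can accumulate affinity, while a pair supported strongly by only one method is retained through that method's local kernel.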
Number of edges after filtering by the identified cutoff in each individual network and the fused network (columns: DNA methylation features, mRNA expression traits)
Regression network properties
Network properties of methylation features and mRNA expression traits of the lung cancer profile. Properties listed for the methylation network and the mRNA expression network: number of nodes, average number of neighbors, and R² of node degree distribution.
Functional characterization of the identified genes
Significantly enriched GO BP terms (top 5) for the largest connected component of integrative regression network of methylation features
GO:0006468 ~ protein amino acid phosphorylation
IPR008266:Tyrosine protein kinase, active site
IPR001245:Tyrosine protein kinase
GO:0004672 ~ protein kinase activity
IPR008266:Tyrosine protein kinase, active site
IPR001245:Tyrosine protein kinase
GO:0042127 ~ regulation of cell proliferation
GO:0008284 ~ positive regulation of cell proliferation
IPR001245:Tyrosine protein kinase
GO:0010033 ~ response to organic substance
GO:0043067 ~ regulation of programmed cell death
GO:0010941 ~ regulation of cell death
GO:0032403 ~ protein complex binding
hsa05200:Pathways in cancer
GO:0042127 ~ regulation of cell proliferation
hsa05200:Pathways in cancer
GO:0007169 ~ transmembrane receptor protein tyrosine kinase signaling pathway
GO:0007167 ~ enzyme linked receptor protein signaling pathway
In the enrichment study using gene expression networks, cancer-related terms were prominently observed, which may be a consequence of intersecting the expression genes with the COSMIC cancer census genes. To cross-verify this result, we randomly selected 30 genes from COSMIC, performed a gene enrichment test, collected the top 5 terms, and repeated the same procedure for 30 iterations. Seven of the top 10 terms were chromosomal rearrangements, but their smallest FDR-corrected p-values were several times larger than the p-values obtained from the enrichment test of expression traits. A similar trend was identified for other significant terms, such as disease mutation, nucleus, and hsa05200: pathways in cancer, among others.
From the fused networks of methylation features, we collected the hub genes, i.e., those with the highest node degrees (see Additional file 3). Crucial cancer-causing genes were identified from the fused network, including AKT, KRAS, fibroblast growth factor receptors (FGFRs), anaplastic lymphoma kinase, and ERBBs. Previous studies demonstrated that the PI(3)K/AKT pathway is a strong therapeutic target in clear cell renal cell carcinoma. The KRAS oncogene is mutated in approximately 35–45 % of colorectal cancers, and KRAS mutations are considered to be more predominant in pancreatic, thyroid, colorectal, and lung cancers [27, 28]. The anaplastic lymphoma kinase gene was also found to be relevant to lung cancer. Overexpression of FGFRs can lead to multiple cancer types, and higher levels of fibroblast growth factor receptors were found in prostate, breast, lung, brain, gastric, sarcoma, head and neck, and multiple myeloma cancers. ERBB2 is typically amplified in tumors and overexpressed in breast cancer, and ERBBs are very important in cancer studies [31, 32].
Discussion and Conclusion
In this study, we presented an integrative regression network built by combining the results of different regression methods. Given the highly correlated nature of genomic profiles, association analysis based on conventional correlation tests or on multiple regression methods can produce inconsistent results. To address this issue and construct a more reliable association network from genomic profiles, we constructed a regression network by measuring the similarity of regression coefficient vectors in a high-dimensional multivariate multiple-output regression setting. The results from different regression methods were further fused using a similarity fusion technique. The fused network facilitated identification of the strongest signals as well as weaker signals, which increased the signal-to-noise ratio.
The GO enrichment test revealed that the final fused network could identify genes with the lowest FDR-corrected p-values, and numerous cancer-related features were recognized using the fusion technique. The genes identified using the fusion technique behaved very similarly within the cancer profile: if genes g1 and g2 have nearly the same regression coefficient magnitudes and were identified by two or more regression methods, or by a single regression method but with a very high similarity, then SNF propagates g1 and g2 as nodes in the final network with their affinities (similarities) as edges. Understanding cancer using this process can provide guidance for predicting the prognosis, developing effective therapies, and identifying subtypes of cancer.
We developed an effective method for analyzing genes involved in cancer that integrates results from different regression methods. Although our analysis was done on each of the different cancer types separately, the approach can be readily applied to integrate the results from multiple cancer types, which may reveal behavior common across cancers. Given the ease of use of the fusion technique (SNF), this method can be conveniently adapted to different types of studies in different domains. The properties of SNF, such as its propagation effect over iterations, robustness against noise, and scaling to a large number of genes, enable application of this method to many domains.
Publication of this article has been funded by Basic Science Research Program through the National Research Foundation (NRF) of Korea funded by the Ministry of Science, ICT, and Future Planning (MSIP) (2014R1A1A3051169 & 2010–0028631). This article has been published as part of BMC Medical Genomics Volume 9 Supplement 1, 2016. Selected articles from the 5th Translational Bioinformatics Conference (TBC 2015): medical genomics. The full contents of the supplement are available online https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-9-supplement-1.
Availability of data and materials
The TCGA datasets used for analysis are publicly available at http://tcga-data.nci.nih.gov/tcga/.
RR and KS designed and developed the study. HJ and RR performed the experiments and inferred the results. KS and HJ provided experienced guidance and timely support. RR and KS wrote the manuscript, and all authors read and approved the manuscript.
The authors declared that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- TCGA. The Cancer Genome Atlas. Available from: http://cancergenome.nih.gov/.
- ICGC. International Cancer Genome Consortium Available from: https://icgc.org/icgc
- Stat Fact Sheets. Surveillance, Epidemiology, and End Results Program Turning Cancer Data Into Discovery. Available from: http://seer.cancer.gov/statfacts/.
- Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996;58(1):267–88.
- Kim S, Sohn K-A, Xing EP. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics. 2009;25(12):i204–12.
- Simon N et al. A Sparse-Group Lasso. J Comput Graph Stat. 2012;22(2):231–45.
- Lee S, Xing EP. Leveraging input and output structures for joint mapping of epistatic and marginal eQTLs. Bioinformatics. 2012;28(12):i137–46.
- Wang B et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Meth. 2014;11(3):333–7.
- Sohn K-A et al. Relative impact of multi-layered genomic data on gene expression phenotypes in serous ovarian tumors. BMC Syst Biol. 2013;7 Suppl 6:S9.
- COSMIC. Catalogue of somatic mutations in cancer. Available from: http://cancer.sanger.ac.uk.
- Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1.
- Marttinen P et al. Genome-wide association studies with high-dimensional phenotypes. Stat Appl Genet Mol Biol. 2013;12(4):413–31.
- Sohn K-A, Kim S. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. In: Lawrence N, Girolami M, editors. International Conference on Artificial Intelligence and Statistics, 21-23 April 2012. Vol. 22. La Palma, Canary Islands: JMLR W&CP; 2012. p. 1081–9.
- Sailing Lab. GFlasso Available from: http://www.sailing.cs.cmu.edu/main/?page_id=463.
- Iordache, M.-D. A sparse regression approach to hyperspectral unmixing. PhD diss. INSTITUTO SUPERIOR TÉCNICO, Department of Electrical and Computer Engineering; 2011.
- SGL. Fit a GLM (or cox model) with a combination of lasso and group lasso regularization. Available from: https://cran.r-project.org/web/packages/SGL/index.html.
- Lee S, Xing EP. Structured Input-Output Lasso, with Application to eQTL Mapping, and a Thresholding Algorithm for Fast Estimation. 2012. arXiv preprint arXiv:1205.1989.
- Sailing Lab. Struct I/O Lasso Available from: http://www.sailing.cs.cmu.edu/main/?page_id=484.
- SNFtool. Similarity Network Fusion Available from: https://cran.r-project.org/web/packages/SNFtool/index.html.
- Pearl J. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann; 2014.
- Hu T et al. Characterizing genetic interactions in human disease association studies using statistical epistasis networks. BMC Bioinformatics. 2011;12(1):364.
- Cytoscape. Available from: http://www.cytoscape.org/cy3.html
- Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005;4(1):Article 17.
- da Huang W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44–57.
- da Huang W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1–13.
- The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature. 2013;499(7456):43–9.
- Tan C, Du X. KRAS mutation testing in metastatic colorectal cancer. World J Gastroenterol WJG. 2012;18(37):5171–80.
- Markman B et al. EGFR and KRAS in colorectal cancer. Adv Clin Chem. 2010;51:71–119.
- El-Telbany A, Ma PC. Cancer Genes in Lung Cancer: Racial Disparities: Are There Any? Genes Cancer. 2012;3(7–8):467–80.
- Ahmad I, Iwata T, Leung HY. Mechanisms of FGFR-mediated carcinogenesis. Biochimica et Biophysica Acta (BBA) Mole Cell Res. 2012;1823(4):850–60.
- Yarden Y, Pines G. The ERBB network: at last, cancer therapy meets systems biology. Nat Rev Cancer. 2012;12(8):553–63.
- Hynes NE, Lane HA. ERBB receptors and cancer: the complexity of targeted inhibitors. Nat Rev Cancer. 2005;5(5):341–54.
- Yang D et al. Integrated analyses identify a master microRNA regulatory network for the mesenchymal subtype in serous ovarian cancer. Cancer Cell. 2013;23(2):186–99.