 Research
 Open Access
 Published:
Pathway analysis of rare variants for the clustered phenotypes by using hierarchical structured components analysis
BMC Medical Genomics volume 12, Article number: 100 (2019)
Abstract
Backgrounds
Recent largescale genetic studies often involve clustered phenotypes such as repeated measurements. Compared to a series of univariate analyses of single phenotypes, an analysis of clustered phenotypes can be useful for substantially increasing statistical power to detect more genetic associations. Moreover, for the analysis of rare variants, incorporation of biological information can boost weak effects of the rare variants.
Results
Through simulation studies, we showed that the proposed method outperforms other method currently available for pathwaylevel analysis of clustered phenotypes. Moreover, a real data analysis using a largescale whole exome sequencing dataset of 995 samples with metabolic syndromerelated phenotypes successfully identified the glyoxylate and dicarboxylate metabolism pathway that could not be identified by the univariate analyses of single phenotypes and other existing method.
Conclusion
In this paper, we introduced a novel pathwaylevel association test by combining hierarchical structured components analysis and penalized generalized estimating equations. The proposed method analyzes all pathways in a single unified model while considering their correlations. C/C++ implementation of PHARAOHGEE is publicly available at http://statgen.snu.ac.kr/software/pharaohgee/.
Backgrounds
The history of GenomeWide Association Studies (GWAS) now has reached two decades, and those GWAS have identified almost 60,000 unique associations of over 3000 traits [1]. However, despite the steeply increasing GWAS discoveries, those discoveries explain only a small portion of expected phenotypic variations [2, 3], a phenomenon known as “missing heritability” [2]. Some of the possible explanation for such phenomenon include genegene interaction, pleiotropic effect, and rare variants [3].
For the analysis of rare variants, the low statistical power caused by the sparseness of rare variants is one of the major issues. The use of biological information such as genes or pathways has been proven to escalate the statistical power and improve the biological interpretation, for identifying statistically significant genes and pathways associated with complex traits such as highdensity lipoprotein levels, obesity, schizophrenia, and multiple cancers [4,5,6,7,8]. Taking the advantages of the pathwaylevel analysis, we have developed statistical methods PHARAOH that investigates pathwaylevel associations [9] and PHARAOHmulti that extends PHARAOH to the analysis of multiple continuous phenotypes [10]. Our PHARAOH method has two exclusive features. First, it employs the hierarchy of biological process by constructing a hierarchical structural model of the rare variants, genes, pathways, and phenotype(s). Second, it considers all pathways within a single unified model with statistical regularization, hence effectively controlling the correlations between genes and pathways.
Another approach to improving the statistical power is a simultaneous analysis of clustered phenotypes. For example, the analysis of repeatedly measured phenotypes outperforms the analysis of crosssectionally observed phenotypes, since the information on the temporal differences within a subject improves the power [11]. Many recent GWAS have analyzed the repeatedly measured phenotypes and discovered many novel associations, such as fasting glucose, body mass index, and lung function [12,13,14]. In the repeated measures analysis, a consideration of the correlations between the repeated measurements is crucial. Neglecting the nature of clustered phenotypes may result in loss of statistical power [15].
The Generalized Estimating Equations (GEE) approach is one of the most commonly used methods for the analysis of clustered and correlated phenotypes [15]. The major advantages of GEE include that it can handle a wide class of phenotypes such as binary, count, and continuous traits from an exponential family distribution and that its estimator is consistent regardless of the specification of the working correlation structure. In these respects, the GEE approach has been contributed to the discovery of genetic components from various studies including association studies of lung cancer [16], ophthalmological measurements [16, 17], and genedrug interaction analysis [18]. For the analysis of expression datasets, various extensions of GEE have been proposed such as the repeated microarray experiment and penalized GEE for microRNA dataset [17, 18]. For genelevel tests, several GEE methods have been developed, including Longitudinal Genetic Random Field (LGRF) and GEEKM [19, 20].
However, unlike the genelevel analyses, to the best of our knowledge, only one method based on GEE has employed the pathwaylevel analysis of the correlated phenotypes [21] with the R package GEEaSPU. Note that GEEaSPU employs the adaptive Sum of Powered score (aSPU) and adapts the GEE framework to enable pathwaylevel analysis of genetic variants [21]. However, the GEEaSPU method cannot handle the correlations between the pathways, which can result in the biased results.
In order to address this problem, we propose a novel pathwaylevel association test for clustered and correlated phenotypes such as repeated measurements, Pathwaybased approach using HierArchical component of collapsed RAre variants Of Highthroughput sequencing data using Generalized Estimating Equations (PHARAOHGEE). While the existing GEE based pathwaylevel method GEEaSPU implements the individual “pathwaywise” test assuming all tests are independent, the proposed PHARAOHGEE method implements a “global test” that considers the correlation among the pathways into account by putting all pathways simultaneously into a single model. Moreover, PHARAOHGEE can handle various types of phenotypes (e.g., binary), and it also retains the advantages of PHARAOH, such as the hierarchical model that mimics the natural biological processes. By providing PHARAOHGEE program using a powerful and fast C/C++ based framework WISARD [22], it supports various genetic data formats and provides affordable performance.
Results
We used a workstation system consists of two Intel Xeon E5–2640 CPUs and 256GiB of RAM. Due to the limitation of the compared method, the R version 3.4.0 and R package ‘GEEaSPU’ were used with default settings.
Simulation study
For our simulation study, we generated 300 replicates from the simulated data pool. Each replicate consisted of 10 pathways in which the first pathway was causal and the other nine were noncausal (i.e., no effect). For each replicate, the proposed PHARAOHGEE method was applied to the 10 pathways simultaneously, whereas GEEaSPU was applied to each pathway individually. Here we assumed that the first pathway is causal and the others are noncausal. For the causal pathway, we considered three different parameter settings: four genelevel effects (w = 0.1, 0.2, 0.5 and 1.0), three pathwaylevel effects (β =0.15, 0.2 and 0.25), two correlations of phenotypes (ρ =0.25 and 0.5). For all test results, we applied the BH stepup procedure to control the False Discovery Rate (FDR) at 5% level [23]. Details on simulation procedure can be found on Methods section.
First, we evaluated the type 1 errors of PHARAOHGEE and GEEaSPU. For the given parameter settings for the causal pathway, we evaluated the type 1 errors using 9 noncausal pathways with significance level α = 0.01. As shown in Fig. 1, all methods controlled the type 1 error rates appropriately, regardless of the parameter values.
Second, we evaluated statistical power of the methods where power was computed as a proportion of the causal pathway being statistically significant at the FDR < 0.05 over 300 replicates. In addition to three parameter settings for the causal pathway, we consider two cases when the numbers of significant genes within the causal pathway are only one (H_{1} = 1) and two (H_{1} = 2) out of ten simulated genes, respectively. As shown in Fig. 2, PHARAOHGEE outperforms GEEaSPU in all simulation scenarios.
In the power analysis, there were two additional interesting findings. First, when the proportion of significant genes in the causal pathway became smaller, the proposed method tended to outperform GEEaSPU. Second, PHARAOHGEE showed less reduction of statistical power than GEEaSPU when the phenotypic correlation ρ increased. In real practical situation where only a fraction of genes is likely related to phenotypes and that the correlations among clustered phenotypes are high, these findings suggest that PHARAOHGEE would be more powerful for detecting true biological signals than GEEaSPU.
Analysis of whole exome sequencing (WES) dataset using clustered phenotypes
To demonstrate the usefulness of PHARAOHGEE, we analyzed a largescale sequencing dataset with six phenotypes related to the metabolic syndrome: systolic blood pressure (SBP), diastolic blood pressure (DBP), triglycerides (TG), fasting glucose (FASTGLU), waist circumference (WAIST), and highdensity lipoprotein (HDL). Before the analysis, we binarized these phenotypes according to the metabolic syndrome criteria of International Diabetes Federation (IDF) consensus worldwide definition of the metabolic syndrome (https://www.idf.org). Metabolic syndrome is diagnosed as the presence of three or more of the following criteria: (1) WAIST ≥90 cm in males and ≥ 80 cm in females; (2) elevated TG ≥ 150 mg/dL or taking medication; (3) HDLcholesterol < 40 mg/dL in males and < 50 mg/dL in females or taking lipidlowering agents; (4) systolic blood pressure ≥ 130 mmHg or diastolic blood pressure ≥ 85 mmHg or taking antihypertensive medications; and (5) elevated FASTGLU ≥100 mg/dL or oral hypoglycemic agents use. From these six metabolic syndrome related phenotypes, we derived five clustered binary traits. Especially, we combined two blood pressure phenotypes (SBP & DBP) into a single phenotype, named BP, by setting 1 if either SBP or DBP satisfied the diagnosis criteria of metabolic syndrome and 0 otherwise. All other phenotypes were binarized if the diagnosis criteria of metabolic syndrome was satisfied and 0 otherwise.
We applied PHARAOH for the univariate analysis of each binary phenotype and applied PHARAOHGEE and GEEaSPU for the multivariate analysis of the five binary phenotypes. We conducted the multiple testing adjustment to both univariate and multivariate analyses by using the BH stepup procedure [23]. The unstructured covariance structure of the phenotypes was assumed for both PHARAOHGEE and GEEaSPU. Figure 3 presents quantilequantile (QQ) plots showing that PHARAOH and PHARAOHGEE led to no substantial deflation or inflation of pvalues.
Table 1 exhibits the pathways with the five smallest qvalues identified by PHARAOHGEE, as well as their qvalues under PHAROH and GEEaSPU. PHARAOHGEE was able to identify one KEGG pathway, the glyoxylate and dicarboxylate metabolism, at the qvalue threshold of 0.1. None of these pathways turned out to be statistically significant in the univariate analyses of PHARAOH, always resulting in larger qvalues than those from PHARAOHGEE. Although the same glyoxylate and dicarboxylate pathway had the lowest pvalue by GEEaSPU, it failed to pass the qvalue threshold of 0.1, after the multiple testing adjustment. Thus, our real data analyses showed the relatively superior performance of PHARAOHGEE.
Among the five pathways identified by PHARAOHGEE, a recent study suggests a strong relationship between the metabolic syndrome and two pathways (glyoxylate and dicarboxylate, and fatty acid metabolisms), through their role in abdominal obesity [24]. In addition, the glycosphingolipid biosynthesis and MAPK signaling pathways are reported to be related to the metabolic syndrome via insulin resistance that plays a critical role in manifestation of the metabolic syndrome [25, 26].
Conclusion
An analysis of the clustered phenotypes provides more information than the crosssectional studies. Recent large cohort studies keep producing repeatedly measured phenotypes. We introduced a novel statistical method for the pathway analysis of the largescale genetic dataset with clustered phenotypes. While our previous PHARAOHmulti method can handle only continuous phenotypes, the proposed PHARAOHGEE can handle various phenotypes such as clustered binary and count phenotypes under the various correlation structures. Through the comparison study using the simulated datasets, we demonstrated that the proposed PHARAOHGEE method outperforms an existing pathway method. Furthermore, our application to the largescale WES dataset successfully identified one pathway that has not been discovered in the analyses of individual phenotype with the multiple testing adjustments.
Discussion
Compared to GEEaSPU the only currently available method for pathwaylevel test of clustered phenotypes, the proposed method has many advantages. First, PHARAOHGEE effectively controls the complex correlations among the pathways by constructing a unified hierarchical, doublypenalized statistical model. Second, it successfully reflects the nature of biological process from GSCA framework and takes clustered phenotypes into account from GEE framework. In conclusion, we hope that PHARAOHGEE can serve as a main tool for the pathwaylevel analysis of clustered phenotypes in genetic studies.
Currently, we have a number of considerations for our future research. Although we considered many possible combinations of parameters in the simulation setting, a further extensive simulation study is required for more comprehensive comparison with existing pathwaybased methods. In addition, we will perform a replication study using other independent datasets with the metabolic syndrome phenotypes. Finally, we will employ other penalization methods such as lasso and elasticnet.
Methods
PHARAOHGEE method
Technically, the proposed method is an extension of the doublyregularized Generalized Structured Component Analysis into the GEE framework [27] that imposes ridge penalties [28] on both genepathway and pathwayphenotype relationships. From the previous studies, we successfully demonstrated that those two ridge penalties effectively control the correlations between genes and pathways [9, 10]. PHARAOHGEE aims to identify associations between Q clustered phenotypes and K pathways, each of which is linked to T_{k} genes (k = 1, ⋯, K). An example of the PHARAOHGEE model is depicted in Fig. 4.
Let y_{iq} be the value of the q^{th} phenotype measured on the i^{th} individual (i = 1, …, N; q = 1, …, Q) and \( \tilde{y}_{i}={\left[{y}_{i1},\cdots, {y}_{iQ}\right]}^{\prime } \) be a Q × 1 vector of the clustered phenotypes of the i^{th} individual. Similar to the previous description of the PHARAOH model [9], we assume that y_{iq} follows an exponential family distribution with a mean μ_{iq}. Let Σ_{i} be the Q × Q covariance matrix of \( \tilde{y}_{i} \). Then,
where R_{i}(α) is a socalled “working correlation matrix”, α is a parameter vector that fully characterizes R_{i}(α), and \( {A}_i^{1/2}=\operatorname{diag}\left[\operatorname{var}\left({\mu}_{ij}\right)\right] \), i.e., a Q × Q diagonal matrix with the marginal variance of responses. Liang and Zeger [29] suggested various choices for R_{i}(α), e.g., the independence covariance structure, R_{i}(α) = I_{Q}, where I_{Q} is the identity matrix of order Q.
Let \( \tilde{x}_{i}^{\prime }=\left[1,\cdots, 1;{x}_{i11},\cdots, {x}_{i1{T}_1};\mathbf{\cdots};{x}_{iK1},\cdots, {x}_{iK{T}_K}\right] \) be a (T + 1) × 1 vector consisting of all genelevel collapsed variables for the i^{th} individual across K pathways, where T = \( {\Sigma}_{k=1}^K{T}_k \). The genelevel collapsed variables are generated as the weighted sums of rare variants. Let X be an N × (T + 1) matrix of the genelevel collapsed variables for N observations, as expressed in (2).
As in the previous methods [9], we standardize X to satisfy the conventional scaling constraint diag(X^{′}X) = NI. Each element of X, x_{ikt}, denotes a genelevel summary of the i^{th} sample for the t^{th} gene (t = 1, ⋯, T_{k}) in the k^{th} pathway and is generated by the weighted sum of rare variants that is same as the previous work [9, 10]. Let W denote a (T + 1) × (K + 1) matrix consisting of component weights w_{tk}, which are assigned to x_{ikt}. This matrix can be generally expressed as
Let η_{iq} and g(·) denote the i^{th} linear predictors of the q^{th} phenotype and a link function, respectively. We define the proposed PHARAOHGEE model as
where \( {f}_{ik}={\sum}_{t=1}^{T_k}{x}_{ik t}{w}_{tk} \) is the component score of the i^{th} individual for the k^{th} pathway \( {\overset{\sim }{\boldsymbol{f}}}_i=\left[1,{f}_{i1},\cdots, {f}_{iK}\right] \), and \( {\overset{\sim }{\boldsymbol{\beta}}}_q=\left[{\beta}_{0q}\ {\beta}_{1q}\cdots {\beta}_{Kq}\right] \) is a vector of coefficients linking K pathways to the q^{th} phenotype. We can statistically examine the joint effects of the k^{th} pathway on Q phenotypes by testing the null hypothesis H_{0}: β_{k1} = ... = β_{kQ} = 0. Moreover, it is possible to evaluate the effect of one gene on a single phenotype mediated by its corresponding pathway.
Parameter estimation
For simplicity, we describe the propose method, assuming that the phenotype \( \tilde{y}_{i} \) is continuous. It is technically straightforward to extend the method to other phenotypes from exponential distributions. In parameter estimation, we add two L_{2} penalty terms to control for potential adverse influences of high correlations between genes and/or pathways. Specifically, to estimate the parameters W and B, we seek to minimize the following penalized estimating equations.
where U is the estimating equation for the parameters, B is a matrix consisting of all regression coefficients \( {\overset{\sim }{\beta}}_q \), tr(·) denotes the trace of matrix, and λ_{G} and λ_{P} denote ridge parameters on the L_{2} penalty terms for the weights and regression coefficients, respectively. A more detail on the estimating equation and solving process can be found on elsewhere [9].
To minimize ϕ_{α, B, W}, we use an iterative algorithm that repeats the following steps until no substantial changes in parameter estimates occur.
Step 1: We update B for fixed W and R_{i}(α). Let b = vec(B) denote a vector formed by stacking all columns of B one below another. This is equivalent to minimizing the following estimating equations
where \( {\boldsymbol{Q}}_i={\boldsymbol{f}}_i^{\prime}\otimes \mathbf{I} \) and ⊗ denotes Kronecker product. Then, b can be estimated by \( \hat{\boldsymbol{b}}={\left(\sum \limits_{i=1}^N{\boldsymbol{Q}}_i^{\prime }{\Sigma}_i^{1}{\boldsymbol{Q}}_i+{\lambda}_P\mathrm{I}\right)}^{1}\left(\sum \limits_{i=1}^N{\boldsymbol{Q}}_i^{\prime }{\Sigma}_i^{1}{y}_i\right) \), and \( \hat{\boldsymbol{B}} \) is reconstructed from \( \hat{\boldsymbol{b}} \).
Step 2: We update W for fixed B and R_{i}(α). Let w = vec(W). Similar to step 1, it is equivalent to minimizing
where \( {M}_i=\tilde{x}_{i}^{\prime}\otimes {\boldsymbol{B}}^{\prime }, \) w* is the vector formed by eliminating all zero elements of w, and M_{i} is the matrix formed by removing the columns of \( \tilde{x}_{i}^{\prime}\otimes {B}^{\prime } \) corresponding to the zero elements of w. Then, w_{∗} can be estimated by \( {\hat{\boldsymbol{w}}}_{\ast }={\left(\sum \limits_{i=1}^N{M}_i^{\prime }{\Sigma}_i^{1}{M}_i+{\lambda}_G\mathrm{I}\right)}^{1}\left(\sum \limits_{i=1}^N{M}_i^{\prime }{\Sigma}_i^{1}{z}_i\right). \) Then, the estimated W is reconstructed from \( {\hat{\boldsymbol{w}}}_{\ast } \).
Step 3: We update R_{i}(α) from the updated B and W using Pearson residuals with the variance function of the distribution ν,
where \( {\hat{\mu}}_{ij}={\beta}_{0q}+\sum \limits_{k=1}^K{f}_{ik}{\hat{\beta}}_{kq}. \) Finally, the dispersion parameter φ is estimated consistently by
We apply kfold crossvalidation (CV) to estimate the values of λ_{G} and λ_{P}, which compares the quasideviance values [30] of a twodimensional grid of candidate values of λ_{G} and λ_{P}.
Significance testing and multiple correction
Resampling methods can be used to test the statistical significance of the estimated effects of all pathways on a given set of clustered phenotypes. In the proposed method, we utilize a permutation test to obtain pvalues. By permuting the phenotypes, the method first generates the empirical null distributions of both pathways and gene coefficients. By computing the quantile of the estimated pathway and gene coefficients from the nonpermuted dataset with the corresponding null distribution, we can obtain an empirical pvalue for any specific pathway and gene.
In our study, we want to test the joint effects of pathways on clustered phenotypes. In our previous study, we introduced two approaches to test β_{k1}, ..., β_{kQ} simultaneously and suggested the Waldtype statistics [10]. Similarly, we construct a single statistic that combines all Q coefficients. Here, we define a Waldtype statistic T as.
Under penalized GEE, the estimated covariance \( \operatorname{cov}\left({\hat{\overset{\sim }{\beta}}}_k\right) \) can be obtained in two ways. One way is to calculate it directly, as introduced by Wang et al. [31] as follows.
where \( {H}_{\hat{\overset{\sim }{\beta }}}=\sum \limits_{i=1}^N\tilde{x}_{i}^{\prime }{A}_i^{1/2}{R}_i^{1}\left(\alpha \right){A}_i^{1/2}\tilde{x}_{i} \), \( {E}_{\hat{\overset{\sim }{\beta }}}= tr\left({B}^{\prime }B\right) \), and \( {M}_{\hat{\overset{\sim }{\beta }}}=\sum \limits_{i=1}^N\tilde{x}_{i}^{\prime }{A}_i^{1/2}{R}_i^{1}\left(\alpha \right){e}_{\hat{\overset{\sim }{\beta }}}{e}_{\hat{\overset{\sim }{\beta}}}^{\prime }{R}_i^{1}\left(\alpha \right){A}_i^{1/2}\tilde{x}_{i} \) with \( {e}_{\hat{\overset{\sim }{\beta }}}={A}_i^{1/2}\left(\tilde{y}_{i}{\overset{\sim }{\mu}}_i\right) \). The other indirect way is to calculate it as the sample covariance of \( {\overset{\sim }{\boldsymbol{\beta}}}_k \) from permutations. We use this indirect way to reduce computational burden.
For the calculated pvalues, we implemented two types of multiple testing procedure as we discussed earlier [10]. In short, we applied two approaches: Westfall & Young permutation procedure [32] that effectively considers the correlation of pvalues, and the BenjaminiHochberg (BH) stepup procedure [23] that computes qvalues by False Discovery Rate (FDR) adjustment.
Simulation study
We conducted a simple simulation study to investigate the performance of PHARAOHGEE and to compare the proposed method with the existing methods. We first simulated a large pool of rare genetic variants using SimRare [33]. All simulation settings were unchanged except for the 1Kbp of gene length. From the pool, one thousands of replicates were generated, each of those consists of 1000 individuals and 10 pathways. Finally, the phenotypes were simulated from the below model that assumes only the first pathway is causal:
where H_{1} and M_{1t} denote the number of causal genes in the causal pathway and the number of causal rare variants in the t^{th} causal gene, respectively. Note that M_{1t} was the number of rare variants in the simulated gene varies and was used as an input variable in our simulation study. We set γ_{1tj} to log_{10}MAF_{tj}, which represents the effect of the j^{th} genetic variant of the t^{th} gene. For the simplicity, we generated the phenotypes from the simulated linear predictor η_{iq}, by using it as a binarization threshold from the randomly generated variables from the multivariate normal distribution MVN(0, Σ). For each replicate, all rare variants were collapsed into genes.
Exome sequencing dataset with clustered phenotypes
In order to illustrate PHARAOHGEE for investigating associations between multiple pathways and the clustered phenotypes, we analyzed a largescale WES dataset from a Korean population cohort. Our WES dataset consists of nextgeneration sequencing data of 1087 individuals’ genomes, using the Illumina HiSeq2000 platform (Illumina, Inc., San Diego, CA), as a part of the T2DGENES consortium [34]. All individuals of the dataset were originated from a large Korean cohort named the Korean Association REsource (KARE) study [35]. For our analysis, we selected six phenotypes related to the metabolic disease: SBP, DBP, TG, FASTGLU, WAIST and HDL. In our analysis, we considered 995 individuals with complete phenotypes of interest. We then applied two pathway databases Biocarta and KEGG from Molecular Signatures Database [36], which is a curated collection of multiple pathway databases.
Abbreviations
 BH:

BenjaminiHochberg
 CV:

Crossvalidation
 DBP:

Diastolic blood pressure
 FASTGLU:

Fasting glucose
 FDR:

False Discovery Rate
 GEE:

Generalized estimating equations
 GWAS:

Genomewide association studies
 HDL:

Highdensity lipoprotein
 IDF:

International Diabetes Federation
 KARE:

Korean Association REsource
 SBP:

Systolic blood pressure
 TG:

Triglycerides
 WAIST:

Waist circumference
 WES:

Whole exome sequencing
References
 1.
MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, Junkins H, McMahon A, Milano A, Morales J, et al. The new NHGRIEBI catalog of published genomewide association studies (GWAS catalog). Nucleic Acids Res. 2017;45(D1):D896–901.
 2.
Maher B. Personal genomes: the case of the missing heritability. Nature. 2008;456(7218):18–21.
 3.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.
 4.
Ahituv N, Kavaslar N, Schackwitz W, Ustaszewska A, Martin J, Hebert S, Doelle H, Ersoy B, Kryukov G, Schmidt S, et al. Medical sequencing at the extremes of human body mass. Am J Hum Genet. 2007;80(4):779–91.
 5.
Brunham LR, Singaraja RR, Hayden MR. Variations on a gene: rare and common variants in ABCA1 and their impact on HDL cholesterol levels and atherosclerosis. Annu Rev Nutr. 2006;26:105–29.
 6.
Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004;305(5685):869–72.
 7.
Slatter TL, Jones GT, Williams MJ, van Rij AM, McCormick SP. Novel rare mutations and promoter haplotypes in ABCA1 contribute to lowHDLC levels. Clin Genet. 2008;73(2):179–84.
 8.
Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, Nord AS, Kusenda M, Malhotra D, Bhandari A, et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science. 2008;320(5875):539–43.
 9.
Lee S, Choi S, Kim YJ, Kim BJ, T2DGENES Consortium, Hwang H, Park T. Pathwaybased approach using hierarchical components of collapsed rare variants. Bioinformatics. 2016;32(17):i586–94.
 10.
Lee S, Kim Y, Choi S, Hwang H, Park T. Pathwaybased approach using hierarchical components of rare variants to analyze multiple phenotypes. BMC Bioinformatics. 2018;19(Suppl 4:79.
 11.
Landerman LR, Mustillo SA, Land KC. Modeling repeated measures of dichotomous data: testing whether the withinperson trajectory of change varies across levels of betweenperson factors. Soc Sci Res. 2011;40(5):1456–64.
 12.
RasmussenTorvik LJ, Alonso A, Li M, Kao W, Kottgen A, Yan Y, Couper D, Boerwinkle E, Bielinski SJ, Pankow JS. Impact of repeated measures and sample selection on genomewide association studies of fasting glucose. Genet Epidemiol. 2010;34(7):665–73.
 13.
Mei H, Chen W, Jiang F, He J, Srinivasan S, Smith EN, Schork N, Murray S, Berenson GS. Longitudinal replication studies of GWAS risk SNPs influencing body mass index over the course of childhood and adulthood. PLoS One. 2012;7(2):e31470.
 14.
Tang W, Kowgier M, Loth DW, Soler Artigas M, Joubert BR, Hodge E, Gharib SA, Smith AV, Ruczinski I, Gudnason V, et al. Largescale genomewide association studies and metaanalyses of longitudinal change in adult lung function. PLoS One. 2014;9(7):e100776.
 15.
Mukherjee B, Ko YA, Vanderweele T, Roy A, Park SK, Chen J. Principal interactions analysis for repeated measures data: application to genegene and geneenvironment interactions. Stat Med. 2012;31(22):2531–51.
 16.
Schifano ED, Li L, Christiani DC, Lin X. Genomewide association analysis for multiple continuous secondary phenotypes. Am J Hum Genet. 2013;92(5):744–59.
 17.
Fan Q, Teo YY, Saw SM. Application of advanced statistics in ophthalmology. Invest Ophthalmol Vis Sci. 2011;52(9):6059–65.
 18.
Sitlani CM, Rice KM, Lumley T, McKnight B, Cupples LA, Avery CL, Noordam R, Stricker BH, Whitsel EA, Psaty BM. Generalized estimating equations for genomewide association studies using longitudinal phenotype data. Stat Med. 2015;34(1):118–30.
 19.
He Z, Zhang M, Lee S, Smith JA, Guo X, Palmas W, Kardia SL, Diez Roux AV, Mukherjee B. Setbased tests for genetic association in longitudinal studies. Biometrics. 2015;71(3):606–15.
 20.
Wang X, Zhang Z, Morris N, Cai T, Lee S, Wang C, Yu TW, Walsh CA, Lin X. Rare variant association test in familybased sequencing studies. Brief Bioinform. 2017;18(6):954–61.
 21.
Kim J, Zhang Y, Pan W, Alzheimer's Disease Neuroimaging I. Powerful and adaptive testing for multitrait and multiSNP associations with GWAS and sequencing data. Genetics. 2016;203(2):715–31.
 22.
Lee S, Choi S, Qiao D, Cho M, Silverman EK, Park T, Won S. WISARD: workbench for integrated superfast association studies for related datasets. BMC Med Genet. 2018;11(Suppl 2):39.
 23.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995;57(1):289–300.
 24.
Chen G, Ye G, Zhang X, Liu X, Tu Y, Ye Z, Liu J, Guo Q, Wang Z, Wang L, et al. Metabolomics reveals protection of resveratrol in dietinduced metabolic risk factors in abdominal muscle. Cell Physiol Biochem. 2018;45(3):1136–48.
 25.
Gehart H, Kumpf S, Ittner A, Ricci R. MAPK signalling in cellular metabolism: stress or wellness? EMBO Rep. 2010;11(11):834–40.
 26.
Aerts JM, Boot RG, van Eijk M, Groener J, Bijl N, Lombardo E, Bietrix FM, Dekker N, Groen AK, Ottenhoff R, et al. Glycosphingolipids and insulin resistance. Adv Exp Med Biol. 2011;721:99–119.
 27.
Hwang H, Takane Y. Generalized structured component analysis. Psychometrika. 2004;69(1):81–99.
 28.
Hoerl AE, Kennard RW. Ridge regression  biased estimation for nonorthogonal problems. Technometrics. 1970;12(1):55.
 29.
Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
 30.
Li B. A deviance function for the quasilikelihood method. Biometrika. 1993;80(4):741–53.
 31.
Wang L, Zhou J, Qu A. Penalized generalized estimating equations for highdimensional longitudinal data analysis. Biometrics. 2012;68(2):353–60.
 32.
Westfall PH, Young SS. Resamplingbased multiple testing : examples and methods for Pvalue adjustment. New York: Wiley; 1993.
 33.
Li B, Wang G, Leal SM. SimRare: a program to generate and analyze sequencebased data for association studies of quantitative and qualitative traits. Bioinformatics. 2012;28(20):2703–4.
 34.
Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, Ma C, Fontanillas P, Moutsianas L, McCarthy DJ, et al. The genetic architecture of type 2 diabetes. Nature. 2016;536(7614):41–7.
 35.
Cho YS, Go MJ, Kim YJ, Heo JY, Oh JH, Ban HJ, Yoon D, Lee MH, Kim DJ, Park M, et al. A largescale genomewide association study of Asian populations uncovers genetic factors influencing eight quantitative traits. Nat Genet. 2009;41(5):527–34.
 36.
Liberzon A, Subramanian A, Pinchback R, Thorvaldsdottir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27(12):1739–40.
Acknowledgements
Not applicable.
Funding
Publication costs are funded by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI) grant (HI16C2037). Also, this work was supported by the Bio & Medical Technology Development Program of the National Research Foundation of Korea (NRF) grant (2013M3A9C4078158) and by grants of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI16C2037, HI15C2165, HI16C2048).
Availability of data and materials
We provide PHARAOHGEE method as a program from the website (http://statgen.snu.ac.kr/software/pharaohgee). The KARE exome sequencing dataset is a part of T2DGENES consortium, and is available upon approval of T2DGENES project committee.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 12 Supplement 5, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Medical Genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume12supplement5.
Author information
Affiliations
Contributions
SL and SK performed all analyses and developed the software implementation. SL, SK and TP conducted the entire study, developed the methodology, and wrote the manuscript. YK and BO helped with the performing of analyses. HH helped developing the methodology. All of the authors have read and approved of the final manuscript.
Corresponding author
Correspondence to Taesung Park.
Ethics declarations
Ethics approval and consent to participate
We used the exome sequencing data of 1,037 samples from KARE. KARE study is a part of Korean Genome Epidemiology Study (KoGES), and the dataset was used under the partnership of T2DGENES. All participants of KARE study provided written informed consent. The study using KARE samples was approved by two independent institutional review boards at Seoul National University.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lee, S., Kim, S., Kim, Y. et al. Pathway analysis of rare variants for the clustered phenotypes by using hierarchical structured components analysis. BMC Med Genomics 12, 100 (2019). https://doi.org/10.1186/s1292001905174
Published:
Keywords
 Generalized estimating equations
 Clustered phenotypes
 Pathway analysis
 Rare variants