HisCoM-PCA: Hierarchical structural Component Model for Pathway analysis of Common vAriants

Abstract

Introduction
Genome-wide association studies (GWAS) have been highly successful in identifying associations between genetic variants and traits of interest. GWAS typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits such as type 2 diabetes (T2D) [1]. To identify common variants in GWAS, statistical methods such as logistic regression and linear regression have been widely used. Since most of these methods are based on single-variant analysis, statistically significant results may lack a clear biological interpretation. In addition, it has been reported that the identified SNPs explain only a small portion of the total heritability of traits [2]. To improve the interpretability of SNP-level results, many gene-based and pathway-based association analysis methods have been developed. Biological pathways, which interact with each other in complex ways, often have a more direct influence on the related biological behaviors than individual genes do [3]. Thus, pathway-based results are easier to interpret than SNP-based results.
The pathway-based association methods developed for GWAS often identify pathways from the results of single-SNP analysis. These methods often use only the top SNPs ranked by the p-values obtained from single-SNP analysis. However, such an approach ignores the genetic information in the SNPs that are not selected [4][5][6]. In addition, pathways are often highly correlated with each other, potentially because they share many genes. Methods that neglect these correlations may produce misleading association results [7].
To address these deficiencies, a hierarchical component model named PHARAOH (Pathway-based approach using HierArchical components of collapsed RAre variants Of High-throughput sequencing data) was previously developed. PHARAOH performs pathway analysis for rare variants using a single hierarchical model.
PHARAOH includes a collapsing step for rare variants, since rare-variant data are usually sparse. Gene-level summary statistics are obtained with a weighting approach designed for rare variants. PHARAOH analyzes all genes and pathways together by adding ridge-type penalties on both the gene and pathway effects on traits [8]. Because of this specialized collapsing step, PHARAOH is generally used for rare variants rather than common variants.
In this study, we applied HisCoM-PCA to binary phenotypes, type 2 diabetes (T2D) and hypertension (HT), and to continuous phenotypes, systolic blood pressure (SBP) and diastolic blood pressure (DBP), using large-scale SNP data from a Korean population study (8,840 samples) [9] and the KEGG pathway database (186 pathways) [10]. Furthermore, HisCoM-PCA was compared with three existing pathway-based approaches: GSA-SNP2 [4], sARTP [11], and MAGMA [12]. To evaluate the power and type I error of HisCoM-PCA, a simulation study was performed with the dataset generated for the Genetic Analysis Workshop (GAW) 17 [13]. The empirical power of HisCoM-PCA was compared with that of the three existing methods.
[...] Ansung, which represent urban and rural populations [9]. Hundreds of papers have used this cohort data for genetic analysis. The common-variant genotype data of 8,840 individuals were produced with the Affymetrix Genome-Wide Human SNP Array 5.0. This chip contains about 500,000 autosomal SNPs, and a total of 352,228 SNPs are available. In this study, we excluded SNPs with a minor allele frequency (MAF) below 0.05, a genotype calling rate below 95%, or a Hardy-Weinberg equilibrium p-value below 10⁻⁶.
We retained only subjects with consistent gender information and calling rates above 90%. After this quality control process, missing values were imputed only for existing variants.
[...] treatments, which are likely to influence blood pressure [15]. The basic characteristics and blood pressure of the subjects are listed in Table 2. The overall scheme of HisCoM-PCA is shown in Figure 1.
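As a rough illustration of the SNP-level filters described above, the following Python sketch applies the three thresholds (MAF, genotype calling rate, and Hardy-Weinberg equilibrium p-value) to a genotype matrix. The function name `snp_qc_mask`, the 0/1/2 allele-count coding with NaN for missing calls, and the 1-degree-of-freedom chi-square HWE test are assumptions made for this example only; dedicated tools may use a different HWE test (for instance an exact test), so borderline SNPs may be classified differently.

```python
import math
import numpy as np


def snp_qc_mask(genotypes, maf_min=0.05, call_rate_min=0.95, hwe_p_min=1e-6):
    """Return a boolean mask over SNPs (columns) that pass all three filters.

    genotypes: (n_samples, n_snps) array of allele counts 0/1/2, NaN = missing.
    This input layout is an assumption of the sketch, not the study's format.
    """
    g = np.asarray(genotypes, dtype=float)
    n_called = np.sum(~np.isnan(g), axis=0)
    call_rate = n_called / g.shape[0]

    # Genotype counts per SNP (comparisons with NaN are False, so missing
    # calls are ignored automatically).
    n_aa = np.sum(g == 0, axis=0)
    n_ab = np.sum(g == 1, axis=0)
    n_bb = np.sum(g == 2, axis=0)
    n = np.maximum(n_called, 1)  # avoid division by zero for empty columns

    freq_b = (n_ab + 2 * n_bb) / (2 * n)
    maf = np.minimum(freq_b, 1 - freq_b)

    # Chi-square goodness-of-fit test against Hardy-Weinberg proportions.
    p, q = 1 - freq_b, freq_b
    expected = np.stack([n * p**2, 2 * n * p * q, n * q**2])
    observed = np.stack([n_aa, n_ab, n_bb])
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (observed - expected) ** 2 / expected, 0.0)
    chi2 = terms.sum(axis=0)
    # Survival function of a chi-square with 1 df: P(X > c) = erfc(sqrt(c / 2)).
    hwe_p = np.array([math.erfc(math.sqrt(c / 2.0)) for c in chi2])

    return (maf >= maf_min) & (call_rate >= call_rate_min) & (hwe_p >= hwe_p_min)
```

Calling `snp_qc_mask(G)` with the default arguments keeps a SNP only if it satisfies all three criteria, which corresponds to excluding a SNP that fails any one of them.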
The alternating regularized least-squares (ALS) algorithm is used to estimate the model parameters in this component-based approach. In the ALS algorithm, two steps are alternated until convergence [18]. To accommodate the correlations that can exist in biological processes, we penalize the effects of both genes and pathways. In this study, we adopt a ridge-type penalty to control multicollinearity between genes and between pathways. We then seek to maximize the penalized log-likelihood function, which is given as follows:
$$\phi = \sum_{i=1}^{N} \log p(y_i; \gamma_i, \delta) - \frac{1}{2}\lambda_G \sum_{k=1}^{K}\sum_{t=1}^{T_k} w_{kt}^2 - \frac{1}{2}\lambda_P \sum_{k=1}^{K} \beta_k^2,$$
where $p(y_i; \gamma_i, \delta)$ is the probability distribution for the phenotype of the $i$-th individual, $w_{kt}$ is the weight of the $t$-th gene in the $k$-th pathway, $\beta_k$ is the coefficient of the $k$-th pathway, and $\lambda_G$ and $\lambda_P$ are ridge parameters for genes and pathways. After estimation, we perform a permutation test by resampling the phenotypes to test the significance of the parameters.
For the KARE data, PLINK 1.90 [19] was used to perform quality control with the criteria described in the Materials section.
The SNPs were mapped to UCSC hg19 genomic coordinates.
Missing genotype data were imputed using Beagle 5.0 [20]. The SNPs were then annotated with genes using SnpEff v.4.3 [21].
[...] common pathways for HT, three common pathways for SBP, and five common pathways for DBP.
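The alternating estimation with ridge penalties and the permutation test described in the estimation paragraph can be sketched for a continuous phenotype as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian phenotype (so that maximizing the penalized log-likelihood amounts to minimizing a penalized residual sum of squares), omits any standardization of the pathway components, and does not cover binary phenotypes. The function names `fit_als_ridge` and `permutation_pvalues` and the input layout (one gene-level matrix per pathway) are hypothetical.

```python
import numpy as np


def fit_als_ridge(gene_blocks, y, lam_gene=1.0, lam_path=1.0, n_iter=200, tol=1e-10):
    """Alternate two ridge updates: pathway coefficients, then gene weights.

    gene_blocks: list of (n_samples, n_genes_k) arrays, one per pathway
    (hypothetical layout). Returns (weights, beta, loss_history).
    """
    y = np.asarray(y, dtype=float) - np.mean(y)
    blocks = [np.asarray(b, dtype=float) - np.mean(b, axis=0) for b in gene_blocks]
    sizes = [b.shape[1] for b in blocks]
    weights = [np.ones(s) / np.sqrt(s) for s in sizes]  # non-zero starting weights
    history = []
    for _ in range(n_iter):
        # Step 1: gene weights fixed; ridge update of the pathway coefficients.
        F = np.column_stack([b @ w for b, w in zip(blocks, weights)])
        beta = np.linalg.solve(F.T @ F + lam_path * np.eye(F.shape[1]), F.T @ y)
        # Step 2: pathway coefficients fixed; ridge update of all gene weights.
        Z = np.hstack([bk * b for bk, b in zip(beta, blocks)])
        w_all = np.linalg.solve(Z.T @ Z + lam_gene * np.eye(Z.shape[1]), Z.T @ y)
        weights = np.split(w_all, np.cumsum(sizes)[:-1])
        # Penalized residual sum of squares (negative penalized Gaussian
        # log-likelihood up to scaling); each step cannot increase it.
        resid = y - Z @ w_all
        loss = resid @ resid + lam_gene * (w_all @ w_all) + lam_path * (beta @ beta)
        history.append(float(loss))
        if len(history) > 1 and abs(history[-2] - loss) < tol * max(1.0, abs(loss)):
            break
    return weights, beta, history


def permutation_pvalues(gene_blocks, y, n_perm=200, seed=0, **fit_kwargs):
    """Permutation p-values for pathway coefficients, shuffling the phenotype."""
    rng = np.random.default_rng(seed)
    _, beta_obs, _ = fit_als_ridge(gene_blocks, y, **fit_kwargs)
    exceed = np.zeros(len(beta_obs))
    for _ in range(n_perm):
        _, beta_null, _ = fit_als_ridge(gene_blocks, rng.permutation(y), **fit_kwargs)
        exceed += np.abs(beta_null) >= np.abs(beta_obs)
    return (exceed + 1) / (n_perm + 1)
```

Because each of the two updates exactly minimizes the same penalized objective over one block of parameters, the recorded loss is non-increasing across iterations, which gives a simple convergence check. The permutation p-values use the absolute pathway coefficient as the test statistic; this choice is an assumption of the sketch.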

Real data analysis for T2D
For the T2D analysis, HisCoM-PCA successfully identified well-known pathways that are biologically related to T2D. For example, the calcium signaling pathway, the renin-angiotensin system pathway, and the phosphatidylinositol signaling pathway are known to be related to insulin resistance or insulin sensitivity [22][23][24][25]. Calcium signaling is crucial for insulin secretion in pancreatic β-cells [22,23]. In the phosphatidylinositol signaling system, PI3K/PtdIns P3 signaling is known to play an important role in the insulin-stimulated glucose metabolism pathway, which is associated with obesity and T2D [25]. Some diseases, such as Alzheimer's disease (AD), asthma, and dilated cardiomyopathy, have been reported to share molecular pathways or risk factors with T2D [26][27][28][29]. For example, several studies have shown that insulin resistance is related to the risk of AD as well as T2D [26]. The application of HisCoM-PCA to T2D successfully identified the pathways of these diseases. In addition, the folate biosynthesis pathway and the hedgehog signaling pathway have also been reported to be potentially related to T2D [30,31].
These pathway results for T2D from HisCoM-PCA and the four other methods are summarized in Table 3.
Pathways related to blood pressure (BP) were also identified by HisCoM-PCA using the phenotypes HT, SBP, and DBP. The calcium signaling pathway and the complement and coagulation cascades pathway are known to be related to BP regulation [32,33]. [...] Table 4.

Discussion
HisCoM-PCA is a novel method for pathway analysis of GWAS [...] HisCoM-PCA also considers correlations between pathways, an aspect usually neglected by other methods. Correlation between pathways may influence the combined effect of pathways on traits, just as correlation between genes does within a specific pathway. To accommodate correlation between genes and between pathways, HisCoM-PCA applies a ridge-type penalty to the coefficient estimates for both genes and pathways. Cross-validation is then used to select the optimal tuning parameters of the ridge-type penalties. While accounting for these correlations, HisCoM-PCA performs gene-based and pathway-based analyses simultaneously, using all genes and pathways. In contrast, most existing methods perform these two analyses separately and are often limited to analyzing a single gene or a single pathway at a time.
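The role of cross-validation in choosing a ridge tuning parameter can be illustrated with a small Python sketch. For brevity it tunes a single penalty in an ordinary ridge regression on centered data; applying the same idea to HisCoM-PCA would presumably mean searching a grid over the pair of gene and pathway penalties, which is not shown here. The function names `ridge_fit` and `cv_select_lambda`, the fold count, and the squared-error criterion are assumptions of this example.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X'X + lam * I)^{-1} X'y, no intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def cv_select_lambda(X, y, lambdas, n_folds=5, seed=0):
    """Pick the penalty with the smallest k-fold mean squared prediction error."""
    rng = np.random.default_rng(seed)
    indices = np.arange(len(y))
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cv_error = []
    for lam in lambdas:
        sq_err = 0.0
        for held_out in folds:
            train = np.setdiff1d(indices, held_out)
            coef = ridge_fit(X[train], y[train], lam)
            sq_err += np.sum((y[held_out] - X[held_out] @ coef) ** 2)
        cv_error.append(sq_err / len(y))
    best = int(np.argmin(cv_error))
    return lambdas[best], np.array(cv_error)
```

The same folds are reused for every candidate penalty so that the candidates are compared on identical held-out samples.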
In addition to the above advantages, HisCoM-PCA offers users considerable flexibility. First, the PC selection criteria can be defined by the user. Second, users can perform both non-target and target pathway analysis. Since HisCoM-PCA controls for the correlation between pathways, it is useful for detecting associated pathways that have similar molecular mechanisms. Thus, we strongly believe that our method, [...] (1) type 2 diabetes; (2) hypertension; (3) systolic blood pressure; (4) diastolic blood pressure [...]