Skip to main content

Hierarchical structural component model for pathway analysis of common variants

Abstract

Background

Genome-wide association studies (GWAS) have been widely used to identify phenotype-related genetic variants using many statistical methods, such as logistic and linear regression. However, GWAS-identified SNPs, as identified with stringent statistical significance, explain just a small portion of the overall estimated genetic heritability. To address this ‘missing heritability’ issue, gene- and pathway-based analysis, and biological mechanisms, have been used for many GWAS studies. However, many of these methods often neglect the correlation between genes and between pathways.

Methods

We constructed a hierarchical component model that considers correlations both between genes and between pathways. Based on this model, we propose a novel pathway analysis method for GWAS datasets, Hierarchical structural Component Model for Pathway analysis of Common vAriants (HisCoM-PCA). HisCoM-PCA first summarizes the common variants of each gene, first at the gene-level, and then analyzes all pathways simultaneously by ridge-type penalization of both the gene and pathway effects on the phenotype. Statistical significance of the gene and pathway coefficients can be examined by permutation tests.

Results

Using the simulation data set of Genetic Analysis Workshop 17 (GAW17), for both binary and continuous phenotypes, we showed that HisCoM-PCA well-controlled type I error, and had a higher empirical power compared to several other methods. In addition, we applied our method to a SNP chip dataset of KARE for four human physiologic traits: (1) type 2 diabetes; (2) hypertension; (3) systolic blood pressure; and (4) diastolic blood pressure. Those results showed that HisCoM-PCA could successfully identify signal pathways with superior statistical and biological significance.

Conclusions

Our approach has the advantage of providing an intuitive biological interpretation for associations between common variants and phenotypes, via pathway information, potentially addressing the missing heritability conundrum.

Background

Genome-wide association studies (GWAS) have greatly advanced our understanding of the association between sets of genetic variants (genotypes) and traits of interest (phenotypes). GWAS typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits (phenotypes), such as type 2 diabetes (T2D) [1]. To identify common variants in GWAS, many statistical methods, including logistic and linear regression, have been widely used. Since most of these methods are based on single variant analysis, their statistically significant results sometimes may suffer from a lack of biological interpretation. In addition, it has been reported that only a small portion of the total heritability, of specific traits, can be explained by these identified SNPs [2]. To enhance interpretation of SNP association results, many gene-based and pathway-based association analysis methods have been developed. Biological pathways, which complexly interact with each other, always have more direct influence on related biological behaviors, as compared to genes [3]. Thus, it is easier to interpret pathway-based results than SNP-based results. Such pathway-based association methods, developed for GWAS, often identify pathways based on results from a single analysis of SNPs. These methods often use only the most statistically significant SNPs, according to the p-values obtained from single SNP analysis. However, such analyses ignore genetic information from the SNPs that are not selected [4,5,6]. In addition, high correlations always exist between pathways, potentially arising from many genes shared between pathways. Thus, methods neglecting these correlations may mislead phenotype association results [7].

Considering these deficiencies, a hierarchical component model has been constructed, PHARAOH (Pathway-based approach using HierArchical components of collapsed RAre variants Of High-throughput sequencing data). PHARAOH performs pathway analysis for rare variants using a single hierarchical model, and includes a collapsing step for rare variants, whose data are usually sparse. PHARAOH gene-level summary statistics are obtained by a special weight approach for rare variants, and analyzes entire genes and pathways by adding ridge-type penalties on both gene and pathway effects on traits [8]. PHARAOH is usually used to analyze rare variants, rather than common variants, due to the special collapsing step, since common variant data usually needs dimension reduction instead of collapsing. In this study, we utilized the main framework of PHARAOH, and principal component analysis (PCA), to construct a hierarchical component model for common variants. Based on this model, we proposed a novel pathway analysis method for GWAS datasets, named Hierarchical structural Component Model for Pathway analysis of Common vAriants (HisCoM-PCA).

HisCoM-PCA has several distinctive features. First, HisCoM-PCA can identify associations between a distinct trait and entire pathways, using a single model. It can simultaneously quantify both the effects of pathways and genes to the phenotype. Second, HisCoM-PCA performs pathway analysis using gene-level summary statistics from SNPs within the same genes. Third, HisCoM-PCA allows potential correlations between genes and between pathways by adding ridge-type penalties to both genes and pathways effects. In addition, HisCoM-PCA may not only be used for binary phenotypes, but also continuous phenotypes. Overall, HisCoM-PCA can identify associated genes and pathways, by controlling correlations within them.

In this study, we applied HisCoM-PCA for two binary phenotypes, type 2 diabetes (T2D) and hypertension (HT), and two continuous phenotypes, systolic blood pressure (SBP) and diastolic blood pressure (DBP), using large-scale SNP data from a Korean population study, KARE (8840 samples) [9], and the KEGG pathway database (186 pathways) [10]. Furthermore, HisCoM-PCA was compared to three existing pathway-based approaches: GSA-SNP2 [4], sARTP [11], and MAGMA [12]. To check the power and type I error of HisCoM-PCA, a simulation study was performed using the Genetic Analysis Workshop (GAW) 17 generated dataset [13]. The empirical power of HisCoM-PCA was then compared to three other existing methods. The results of both a simulation study and real data analysis demonstrated that HisCoM-PCA could successfully identify statistically associated and biologically plausible pathways, for complex traits of interest.

Methods

KARE cohort dataset

The Korea Association REsource (KARE project) is a nearly 9000-participant cohort GWAS study of Korean populations from Ansan and Ansung, representing city and countryside populations, respectively [9]. The common variant genotype data of 8840 individuals were generated using the Affymetrix Genome-Wide Human SNP array 5.0. This chip consists of about 50 million autosomal SNPs, with a total of 352,228 SNPs available after quality control. In this study, we excluded SNPs with minor allele frequencies (MAFs) ≤ 0.05, genotype calling rates < 95%, and Hardy-Weinberg equilibrium p-values < 10−6. Thus, we only kept the subjects with gender consistencies, and those whose calling rates were > 90%. After such quality control processes, missing values were imputed only for existing variants.

Definition of type 2 diabetes

An individual is defined as T2D, according to the following criteria: (1) under treatment for T2D; (2) fasting plasma glucose (FPG) ≥ 126 mg/dL, 2-h postprandial blood glucose (Glu120) ≥ 200 mg/dL, or glycated hemoglobin (HbA1c) ≥ 6.5%; and (3) age of disease onset ≥40 years. Resultantly, a total of 1288 subjects were diagnosed as T2D, among 8840 individuals, with another 3687 individuals selected as normal subjects by the inclusion criteria: (1) FPG < 100 mg/dL, Glu120 < 140 mg/dL and HbA1c < 5.7%; and (2) no history of diabetes [14]. Demographic variables of the 4974 selected subjects are summarized in Table 1.

Table 1 Demographic variables for KARE cohort (T2D)

Definition of hypertension

A total of 2008 individuals were defined as hypertensive, according to the following criteria: (1) SBP ≥ 140 mmHg and/or DBP ≥ 90 mmHg; and (2) treatment with antihypertension medication, while 4569 individuals were defined as normotensive controls according to the criteria: SBP < 120 mmHg and DBP < 80 mmHg. Subjects with pre-hypertensive status were excluded from the analysis. For quantitative trait analysis of SBP and DBP, 1019 subjects were excluded due to hypertensive therapy or drug treatments, variables that influence blood pressure [15]. The basic characteristics and blood pressure of the subjects are listed in Table 2.

Table 2 Basic characteristics of study subjects

HisCoM-PCA

Step 1: SNPs dimension reduction by principal component analysis (PCA)

The first step of HisCoM-PCA reduces the dimensions of the common variants, located in the same genes, by PCA. After PCA was performed for each gene, part of principle components (PCs), as gene-level summary statistics, are chosen to represent the corresponding genes. In order to reduce the high dimension of SNP data, we only select a small number of PCs for each gene for the next pathway analysis. In order to determine the number of PCs for used for the pathway analysis, we use the following simple criteria: (1) using only the first PC and (2) using the PCs whose cumulative proportion of variances are more than 30%. We use a R’s function prcomp in the stats package to conduct PCA.

Step 2: pathway analysis with a hierarchical component model (HisCoM)

After reducing the dimensions of common variants for each gene, pathway analysis is performed, using the selected PCs, with a hierarchical component model, as previously used for pathway analysis of rare variants [8]. Before the analysis, genes are mapped to the well-defined pathways in the pathway databases such as the Kyoto Encyclopedia of Genes and Genome (KEGG). Then, the PCs for these mapped genes are derived and assigned to the corresponding pathways. These PCs are used as input dataset for step 2.

In this model, pathways are defined as a weighted component of a set of PCs (Fig. 1). Let us define yj as the phenotype of the jth subject and assume that phenotype independently follow an exponential family distribution (j = 1, …, N). Let K be the number of pathways, Tk be the number of genes in the kth pathway and Nkt be the number of PCs for the tth gene in the kth pathway. Let gkti denote the ith PC derived from PCA at the step 1 (k = 1, …, K; t = 1, …, Tk; i = 1, …, Nkt). These gkti s represent the genes and have continuous values. Let wkti denote a weight assigned to gkti and βk denote the coefficient connecting the kth pathway to the phenotype. For each individual, the relationships between PCs and binary phenotype are established in such a way that:

$$ logit\left(\pi \right)={\beta}_0+\sum \limits_{k=1}^K\left[\sum \limits_{t=1}^{T_k}\sum \limits_{i=1}^{N_{kt}}{g}_{kt i}{w}_{kt i}\right]{\beta}_k $$
Fig. 1
figure 1

A schematic diagram of HisCoM-PCA y is the phenotype of subject; pathwayk means kth pathway; Genekt − PCi denotes ith PC of the tth gene in kth pathway; SNPktr is rth SNP of the tth gene in kth pathway; lktr denotes a loading value of SNPktr from PCA; wkti denotes the weight assigned to Genekt − PCi; βk denote the coefficient connecting the kth pathway to the phenotype

To estimate the parameters in HisCoM-PCA, we use the alternating least squares (ALS) algorithm. The ALS algorithm was originally proposed by de Leeuw et al. [16] and adopted by Hwang and Takane for the generalized structural component analysis (GSCA) [17], and later by Lee et al. for the penalized log-likelihood function [8]. The ALS algorithm minimizes the objective function in the framework of least squares estimation. We use the ALS algorithm for the penalized log-likelihood function of Lee et al. [8]. Our ALS algorithm consists of two steps and these two steps iterate until convergence.

Step2 − 1: For fixing the weight coefficient estimates wkti, update the pathway coefficient estimates βk, in the sense of least squares.

Step2 − 2: For fixing pathway coefficient estimates βk, update the weight coefficient estimates wkti, in the sense of least squares.

To take into account potential correlations between genes and between pathways, we utilize a penalization approach. In this study, we adopt a ridge-type penalty to control multi-collinearity between genes and between pathways. Then, we sought to maximize the penalized log-likelihood function, given as follows:

$$ \phi ={\sum}_{j=1}^N\log p\left({y}_j;{\gamma}_j,\delta \right)-\frac{1}{2}{\lambda}_g{\sum}_{k=1}^K{\sum}_{t=1}^{T_k}{\sum}_{i=1}^{N_{kt}}{w}_{kt i}^2-\frac{1}{2}{\lambda}_p{\sum}_{k=1}^K{\beta}_k^2, $$

where p(yj; γj, δ) is the probability distribution for the phenotype of the jth individual, and λg and λp are ridge parameters for genes and pathways, respectively. After estimation, we performed permutation testing by resampling the phenotypes, to test the significance of the parameters. Here, we use a tool called WISARD (Workbench for Integrated Superfast Association study with Related Data) [18], which was developed for fast and a comprehensive analysis of SNP-chip and next-generation sequencing data. WISARD can perform the standard pathway analysis with SNP data as input. Instead of SNPs, we use the PCs derived from step 1 as input of WISARD to perform our PC based pathway analysis.

Simulation study

To check the power and type Ι error rate of HisCoM-PCA, a simulation study was performed using simulation data from the Genetic Analysis Workshop 17 (GAW17) [13]. In brief, a GAW17 simulation dataset was generated for 697 individuals from the 1000 Genomes Project [19], containing 24,487 SNVs and four phenotypes (Q1, Q2, Q4, and AFFECTED). The SNPs with minor allele frequencies (MAFs) ≤ 0.05, or genotype calling rates < 95% were excluded in the simulation study. We also kept the subjects with gender consistencies, and those whose calling rates were > 90%. Among the four simulated phenotypes, only Q1 was generated using pathway information, and was simulated to be affected by 9 genes from the vascular endothelial growth factor (VEGF) pathway, as defined by Ingenuity Pathway Analysis [20]. We next examined the power according to the proportion of identifying the VEGF pathway from the entirety of pathways in the KEGG database. Type I error of HisCoM-PCA was examined by the proportion of identifying null pathways which did not contain causal genes. Both type I error and power were calculated by analysis for Q1. To compare the power with other existing methods, we also analyzed the GAW17 dataset using sARTP [11], a self-contained version of MAGMA, a competitive version of MAGMA [12] and GSA-SNP2 [4].

Results

Simulation study using the gene analysis workshop 17 (GAW17) dataset

To check the power and type I error of HisCoM-PCA, we performed a simulation study using the GAW17 dataset, for both binary and continuous types of a Q1 trait. For binary phenotypes, we transformed the continuous values of Q1 to binary values, using the median. Each SNP was then assigned to a gene, if its location was in, or within 20 kb of, the gene, and the KEGG database then used to map genes and pathways. In the simulation study, we chose the first PCs and PCs whose cumulative proportion of variances was more than 30%, after PCA of each gene. The tuning parameters of our method, λg and λp, were optimized based on five-fold CV.

To investigate where the type I error rate is controlled, we examined type I error by the proportion of identifying a null pathway whose number of genes was the same as the VEGF pathway. We checked the type I errors of HisCoM-PCA, sARTP, competitive version of MAGMA, self-contained version of MAGMA, and GSA-SNP2 (Fig. 2).

Fig. 2
figure 2

Empirical type I errors and powers of HisCoM-PCA and other methods (a) Empirical type I errors of HisCoM-PCA, sARTP, two versions of MAGMA, and GSA-SNP2. Empirical type I error indicates the times of identifying a null pathway among 200 replicates. (b) Empirical powers of HisCoM-PCA, competitive version of MAGMA and GSA-SNP2. Empirical power indicates the times of identifying VEGF pathway among 200 replicates

To that end, HisCoM-PCA controlled type I error with PC selection criteria. GSA-SNP2, and the competitive version of MAGMA, also controlled type I error well. However, the type I errors of sARTP, and the self-contained version of MAGMA, were too inflated. Thus, we only compared the power of HisCoM-PCA with GSA-SNP2, and the competitive version of MAGMA. To examine the power, we calculated proportion of identifying VEGF pathways from 168 KEGG pathways, with 200 replicates. The powers of the three methods are shown in Fig. 2.

For both continuous and binary phenotypes, HisCoM-PCA showed the highest power, compared to the other methods. The powers of HisCoM-PCA, with two types of phenotypes, and two criteria of PC selection, were all higher than 0.95. However, the power of GSA-SNP2 were only 0.21 for the binary phenotype and 0.29 for the continuous phenotype, respectively. While MAGMA showed higher power than GSA-SNP2, it showed only 0.6 power for the binary phenotype. However, HisCoM-PCA showed similar powers with either PC selection criteria, while all the methods showed higher power with continuous vs. binary phenotypes.

Real data analysis of common variants from KARE

For KARE data, PLINK 1.90 [21] was used to perform quality control analysis using the criteria described in the Materials section. The SNPs were mapped to the UCSC hg19 genomic coordination. Missing genotype data was imputed using the Beagle 5.0 [22] software program. Then, the SNPs were annotated with genes using SnpEff v.4.3 [23]. After mapping these genes to the KEGG pathway database, a total of 3996 genes were matched to 186 KEGG pathways. The distribution of the number of SNPs per gene is given in Additional file 1: Figure S1. For the 3996 genes, we used the following simple criteria to choose the number of PCs: (1) using only the first PC and (2) using the PCs whose cumulative proportion of variances were more than 30%. When using the first criterion, each gene used only one PC. Thus, there were 3996 PCs. On the other hand, when using the second criterion, one gene could have multiple PCs. As a result, 4486 PCs were used. We then performed pathway analysis for four phenotypes: type-2 diabetes (T2D), hypertension (HT), systolic blood pressure (SBP), and diastolic blood pressure (DBP). Following association tests conducted in other previous studies of the KARE dataset, age, sex, body mass index (BMI), and area were included as covariates in the pathway analysis. In addition to HisCoM-PCA, other existing methods such as sARTP, MAGMA (self-contained and competitive versions) and GSA-SNP2 were used for comparison. The tuning parameters λg and λp, were chosen based on five-fold cross-validation (CV). To test the pathways’ significance, we performed permutation tests by generating 1000 permuted phenotypes.

HisCoM-PCA, using the first PCs, successfully identified 14 pathways for T2D, 15 pathways for HT, 3 pathways for SBP, and 9 pathways for DBP, respectively, at a 5% significance level. HisCoM-PCA, using the 4486 PCs, identified 13 pathways for T2D, 20 pathways for HT, 6 pathways for SBP, and 7 pathways for DBP, respectively, at the same significance level. These different PC selection criteria provided very consistent results. Both identified 10 common pathways for T2D, 14 common pathways for HT, three common pathways for SBP, and five common pathways for DBP. The summary results of GSA-SNP2, MAGMA (competitive version) and HisCoM-PCA for KARE data analysis are shown in Additional file 1: Figure S2. As a multiple testing correction method, the false discovery rate (FDR) was used to calculate corrected p-values for each pathway. When HisCoM-PCA was used for the first PCs, only three pathways had FDR corrected p-values less than 0.1 for HT. However, none of pathways passed this threshold for other phenotypes. When HisCoM-PCA was used for the PCs whose cumulative proportion of variance is more than 30%, two pathways for T2D and one pathway for HT passed the same threshold of FDR corrected p-value, respectively. None of pathways passed this threshold for SBP and DBP.

For T2D analysis, HisCoM-PCA successfully identified several well-known pathways biologically related to T2D. For example, pathways such as calcium signaling, the renin-angiotensin system, and phosphatidylinositol signaling, are known to be related to insulin resistance or insulin sensitivity [24,25,26,27]. Of these, calcium signaling is crucial for insulin secretion in pancreatic β-cells [24, 25], while phosphatidylinositol signaling is known to play an important role in an insulin-stimulated glucose metabolism pathway associated with obesity and T2D [27]. Moreover, some diseases, such as Alzheimer’s disease (AD), asthma, and dilated cardiomyopathy have been reported to share molecular pathways or risk factors with T2D [28,29,30,31], and several studies have shown that insulin resistance is related to risk of AD, as well as T2D [28]. These results demonstratee that application of HisCoM-PCA to T2D successfully identified various pathways of these diseases. In addition, folate biosynthesis and hedgehog signaling have also been reported to potentially relate to T2D [32, 33]. These pathway results for T2D, using HisCoM-PCA, and the other four methods, are summarized in Table 3.

Table 3 Pathways identified for T2D

The pathways related to blood pressure (BP) were also identified by HisCoM-PCA using the phenotypes HT, SBP, and DBP. In that regard, calcium signaling pathway and the complement and coagulation cascades pathway were previously shown to be related to BP regulation [34, 35]. BP regulation is influenced by regulators of vascular tone, which are dependent on ion channels, such as voltage-gated Ca2+ channels, members of the calcium signaling pathway. Moreover, the kallikrein-kinin system (KKS) importantly regulates BP by influencing vascular tone and renal salt processing. It is also well known that KKS is a large picture of the complement and coagulation cascades pathway. Other disease pathways, such as maturity onset diabetes of the young (MODY) and hypertrophic cardiomyopathy (HCM) were identified by HisCoM-PCA, and previous studies have shown that MODY and HCM may also associate with HP [36,37,38]. These pathway results for HT, SBP, and DBP, using HisCoM-PCA and the other four methods, are shown in Table 4.

Table 4 Pathways identified for BP

Discussion

HisCoM-PCA is a novel method for pathway analysis of GWAS data. By applying HisCoM-PCA to a large population study dataset (KARE), we identified several biologically associated pathways for type-2 diabetes (T2D) and blood pressure (BP). For BP, we used three phenotypes: hypertension (HT), systolic blood pressure (SBP), and diastolic blood pressure (DBP). Whether the phenotype of interest is continuous or binary, HisCoM-PCA can successfully detect associated pathways with statistical significance. As self-directed validation, some pathways related to HT were also identified for SBP or DBP, simultaneously providing significant p-values. Beside pathway analysis, we performed gene analysis using HisCoM-PCA at the same time. The reported genes identified by HisCoM-PCA at the 5% nominal significant level are summarized in Additional file 1: Tables S1 and S2. To that end, HisCoM-PCA identified several genes well known to genetically influence T2D or BP, demonstrating that HisCoM-PCA can detect both pathways and genes having biological significance.

Other existing pathway methods revealed numbers of significant pathways. As shown in simulation studies, however, they have high chance of being false positives. On the other hand, some pathways identified by HisCoM-PCA were previously reported to be related to T2D or BP, while these pathways were not significant by other pathway identification methods we used for comparison. In addition, some pathways were jointly identified by other methods and HisCoM-PCA. Real data analysis showed that HisCoM-PCA can provide new candidates that other methods cannot successfully identify.

We also examined empirical power and type I error rate for both binary and continuous phenotypes, using the Genetic Analysis Workshop 17 (GAW17) simulation dataset for GWAS. Compared to several methods, HisCoM-PCA controlled type I error well and showed high statistical power. However, some methods, such as sARTP and the self-contained version of MAGMA, did not control type I error well. Moreover, the methods that well controlled type I error showed lower power than HisCoM-PCA. In the simulation study, HisCoM-PCA analysis of the first PC showed similar power to HisCoM-PCA with PCs whose cumulative proportion of variance was more than 30%. This may indicate that the power is similar, using multiple PCs, which can save a lot of computing time.

Conclusions

In this study, we proposed a novel pathway analysis method HisCoM-PCA for GWAS datasets. Over the existing methods, HisCoM-PCA has several advantages. HisCoM-PCA performs gene-based and pathway-based analysis directly from raw data, while GSA-SNP2 and MAGMA use only summary measures such as p-values or test statistics of univariate analysis. These are gene-level summary measures and are used as inputs to perform pathway analysis. However, since the values of these summaries do not directly represent the raw genetic data, this issue probably leads to false discoveries. In HisCoM-PCA, we can obtain gene-level summary statistics, by PCA, for each gene. These statistics are a linear combination of SNPs from the raw data. Using these values, subsequent analysis for genes and pathways may decrease the possibility of false discoveries.

HisCoM-PCA also considers correlations between pathways, an aspect usually neglected by other methods. Correlation between pathways may influence the combined effect of pathways on traits, similar to when correlations exist between genes in a specific pathway. To allow correlation between genes and between pathways, HisCoM-PCA applies a ridge-type penalization approach on coefficient estimation for both genes and pathways by analyzing entire pathways simultaneously. Cross-validation is then used to detect the optimal tuning parameters of ridge-type penalties. Note that other methods can only analyze one pathway at a time. For the mapped 186 pathways in our KARE data, HisCoM-PCA analyzed them with one big model, while other methods performed single pathway analyses 186 times.

Furthermore, only HisCoM-PCA have an ability to perform conditional inference for identifying a novel pathway given known pathways. For example, for a given well-known pathway related to a trait of interest, HisCoM-PCA can only identify additional pathways for this given pathway. HisCoM-PCA can also perform stepwise selection of pathways by adding one pathway sequentially given the selected pathways.

In addition to the above advantages, HisCoM-PCA has high flexibility for users. First, PC selection criteria may be defined by the user. Second, users can perform both non-target and target pathway analysis. Since HisCoM-PCA controls the correlation between pathways, it is useful to detect associated pathways having similar molecular mechanisms. However, HisCoM-PCA has higher computational burden than other methods due to permutation test. We are currently working on developing an asymptotic test which can replace the permutation test. We strongly believe that our method, HisCoM-PCA, can be applied to any number of GWAS studies, resulting in the successful identification of genes and pathways associated with specific phenotypes.

Availability of data and materials

The genotype data of KARE samples are available by sending a request to the Distribution desk of Korea Biobank Network, National Institute of Health, Korea.

Abbreviations

AD:

Alzheimer’s disease

ALS:

Alternating regulated least squares

BMI:

Body mass index

BP:

Blood pressure

CV:

Cross-validation

DBP:

Diastolic blood pressure

FPG:

Fasting plasma glucose

GAW17:

Genetic analysis workshop 17

Glu120:

2-h Postprandial blood glucose

HbA1c:

Glycated hemoglobin

HCM:

Hypertrophic cardiomyopathy

HisCoM:

Hierarchical component model

HisCoM-PCA:

Hierarchical structural component model for pathway analysis of common variants

HT:

Hypertension

KARE:

Korea Association REsource

KEGG:

Kyoto Encyclopedia of Genes and Genome

KKS:

Kallikrein-kinin system

MAFs:

Minor allele frequencies

MODY:

Maturity onset diabetes of the young

PCA:

Principal component analysis

PCs:

Principal components

PHARAOH:

Pathway-based approach using hierarchical components of collapsed rare variants of high-throughput sequencing data

SBP:

Systolic blood pressure

SD:

Standard deviation

T2D:

Type 2 diabetes

VEGF:

Vascular endothelial growth factor

WISARD:

Workbench for integrated superfast association study with related data

References

  1. Xue A, Wu Y, Zhu Z, Zhang F, Kemper KE, Zheng Z, Yengo L, Lloyd-Jones LR, Sidorenko J, Wu Y. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat Commun. 2018;9(1):2941.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  2. Prasad R, Groop L. Genetics of type 2 diabetes—pitfalls and possibilities. Genes. 2015;6(1):87–123.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S. The genetic landscape of a cell. Science. 2010;327(5964):425–31.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Yoon S, Nguyen HCT, Yoo YJ, Kim J, Baik B, Kim S, Kim J, Kim S, Nam D. Efficient pathway enrichment and network analysis of GWAS summary data using GSA-SNP2. Nucleic Acids Res. 2018;46(10):e60.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  5. Zhang K, Cui S, Chang S, Zhang L, Wang J. i-GSEA4GWAS: a web server for identification of pathways/gene sets associated with traits by applying an improved gene set enrichment analysis to genome-wide association study. Nucleic Acids Res. 2010;38(suppl_2):W90–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Segrè AV, Groop L, Mootha VK, Daly MJ, Altshuler D, Consortium D, Investigators M. Common inherited variation in mitochondrial genes is not enriched for associations with type 2 diabetes or related glycemic traits. PLoS Genet. 2010;6(8):e1001058.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  7. Alexa A, Rahnenführer J, Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006;22(13):1600–7.

    Article  CAS  PubMed  Google Scholar 

  8. Lee S, Choi S, Kim YJ, Kim B-J, Consortium Td-G, Hwang H, Park T. Pathway-based approach using hierarchical components of collapsed rare variants. Bioinformatics. 2016;32(17):i586–94.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Cho YS, Go MJ, Kim YJ, Heo JY, Oh JH, Ban H-J, Yoon D, Lee MH, Kim D-J, Park M. A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits. Nat Genet. 2009;41(5):527.

    Article  CAS  PubMed  Google Scholar 

  10. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004;32(suppl_1):D277–80.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Zhang H, Wheeler W, Hyland PL, Yang Y, Shi J, Chatterjee N, Yu K. A powerful procedure for pathway-based meta-analysis using summary statistics identifies 43 pathways associated with type II diabetes in European populations. PLoS Genet. 2016;12(6):e1006122.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  12. de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput Biol. 2015;11(4):e1004219.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  13. Almasy L, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J. Genetic Analysis Workshop 17 mini-exome simulation. BMC Proceedings. 2011;5:S2.

  14. Lim J, Koh I, Cho YS. Identification of genetic loci stratified by diabetic status and microRNA related SNPs influencing kidney function in Korean populations. Genes Genomics. 2016;38(7):601–9.

    Article  CAS  Google Scholar 

  15. Jin H-S, Hong K-W, Lim J-E, Oh B. Replication of an African-American GWAS on blood pressure and hypertension in the Korean population. Genes Genomics. 2011;33(2):127.

    Article  Google Scholar 

  16. De Leeuw J, Young FW, Takane Y. Additive structure in qualitative data: an alternating least squares method with optimal scaling features. Psychometrika. 1976;41(4):471–503.

    Article  Google Scholar 

  17. Hwang H, Takane Y. Generalized structured component analysis. Psychometrika. 2004;69(1):81–99.

    Article  Google Scholar 

  18. Lee S, Choi S, Qiao D, Cho M, Silverman EK, Park T, Won S. WISARD: workbench for integrated superfast association studies for related datasets. BMC Med Genet. 2018;11(2):39.

    CAS  Google Scholar 

  19. Consortium GP. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061.

    Article  CAS  Google Scholar 

  20. Ingenuity Pathways Analysis software. http://www.ingenuity.com. Accessed 2 Apr 2019.

  21. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4(1):7.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  22. Browning BL, Zhou Y, Browning SR. A one-penny imputed genome from next-generation reference panels. Am J Hum Genet. 2018;103(3):338–48.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012;6(2):80–92.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Levy J. Abnormal cell calcium homeostasis in type 2 diabetes mellitus. Endocrine. 1999;10(1):1–6.

    Article  CAS  PubMed  Google Scholar 

  25. Hodgkin M, Hills C, Squires P. The calcium-sensing receptor and insulin secretion: a role outside systemic control 15 years on. J Endocrinol. 2008;199(1):1–4.

    Article  CAS  PubMed  Google Scholar 

  26. Scheen AJ. Prevention of type 2 diabetes mellitus through inhibition of the renin-angiotensin system. Drugs. 2004;64(22):2537–65.

    Article  CAS  PubMed  Google Scholar 

  27. Manna P, Jain SK. Phosphatidylinositol-3, 4, 5-triphosphate and cellular signaling: implications for obesity and diabetes. Cell Physiol Biochem. 2015;35(4):1253–75.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Han W, Li C. Linking type 2 diabetes and Alzheimer’s disease. Proc Natl Acad Sci. 2010;107(15):6557–8.

    Article  CAS  PubMed  Google Scholar 

  29. Luchsinger JA, Gustafson DR. Adiposity, type 2 diabetes, and Alzheimer’s disease. J Alzheimers Dis. 2009;16(4):693–704.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  30. Shore SA. Obesity and asthma: possible mechanisms. J Allergy Clin Immunol. 2008;121(5):1087–93.

    Article  PubMed  Google Scholar 

  31. Chan KHK, Huang Y-T, Meng Q, Wu C, Reiner A, Sobel EM, Tinker L, Lusis AJ, Yang X, Liu S. Shared molecular pathways and gene networks for cardiovascular disease and type 2 diabetes mellitus in women across diverse ethnicities. Circ Cardiovasc Genet. 2014;7(6):911–9.

    Article  CAS  PubMed  Google Scholar 

  32. Al-Maskari MY, Waly MI, Ali A, Al-Shuaibi YS, Ouhtit A. Folate and vitamin B12 deficiency and hyperhomocysteinemia promote oxidative stress in adult type 2 diabetes. Nutrition. 2012;28(7–8):e23–6.

    Article  CAS  PubMed  Google Scholar 

  33. Thomas MK, Rastalsky N, Lee JH, Habener JF. Hedgehog signaling regulation of insulin production by pancreatic beta-cells. Diabetes. 2000;49(12):2039–47.

    Article  CAS  PubMed  Google Scholar 

  34. Adragna NC, Lauf PK. K–cl cotransport function and its potential contribution to cardiovascular disease. Pathophysiology. 2007;14(3–4):135–46.

    Article  CAS  PubMed  Google Scholar 

  35. Kraja AT, Hunt SC, Rao D, Dávila-Román VG, Arnett DK, Province MA. Genetics of hypertension and cardiovascular disease and their interconnected pathways: lessons from large studies. Curr Hypertens Rep. 2011;13(1):46–54.

    Article  PubMed  PubMed Central  Google Scholar 

  36. Schober E, Rami B, Grabert M, Thon A, Kapellen T, Reinehr T, Holl R. Phenotypical aspects of maturity-onset diabetes of the young (MODY diabetes) in comparison with type 2 diabetes mellitus (T2DM) in children and adolescents: experience from a large multicentre database. Diabet Med. 2009;26(5):466–73.

    Article  CAS  PubMed  Google Scholar 

  37. Misawa K, Nitta Y, Matsubara T, Oe K, Kiyama M, Shimizu M, Mabuchi H. Difference in coronary blood flow dynamics between patients with hypertension and those with hypertrophic cardiomyopathy. Hypertens Res. 2002;25(5):711–6.

    Article  PubMed  Google Scholar 

  38. Takeda A, Takeda N. Different pathophysiology of cardiac hypertrophy in hypertension and hypertrophic cardiomyopathy. J Mol Cell Cardiol. 1997;29(11):2961–5.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

The authors thank Professor Heungsun Hwang, Dr. Sungkyoung Choi, and Dr. Yongkang Kim for the helpful comments.

About this supplement

This article has been published as part of BMC Medical Genomics Volume 13 Supplement 3, 2020: Proceedings of the Joint International GIW & ABACBS-2019 Conference: medical genomics (part 2). The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-13-supplement-3.

Funding

Publication costs are funded by the Bio & Medical Technology Development Program of the National Research Foundation of Korea (NRF) grant (2013M3A9C4078158). Also, the analysis and interpretation of data was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI16C2037) and the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (2013M3A9C4078158).

Author information

Authors and Affiliations

Authors

Contributions

NJ performed all analyses. SL developed the software implementation. TP and NJ conceived the main idea, conducted the entire study, and wrote the manuscript. SL helped with the performing of analyses. SL and TP developed the methodology. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Taesung Park.

Ethics declarations

Ethics approval and consent to participate

We used the SNP chip data of 8,840 samples from KARE. KARE study is a part of Korean Genome Epidemiology Study (KoGES), and the dataset was used under the partnership of T2D-GENES. All participants of KARE study provided written informed consent. The study using KARE samples was approved by two independent institutional review boards at Seoul National University.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1: Table S1.

Reported genes identified by HisCoM-PCA (the first PC) for type 2 diabetes. Table S2. Reported genes identified by HisCoM-PCA (the first PC) for blood pressure. Figure S1. Summary for number of SNPs in each gene excluding genes with one SNP. Figure S2. Summary of significant results from GSA-SNP2, MAGMA and HisCoM-PCA for four phenotypes: T2D, HT, SBP and DBP.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Jiang, N., Lee, S. & Park, T. Hierarchical structural component model for pathway analysis of common variants. BMC Med Genomics 13 (Suppl 3), 26 (2020). https://doi.org/10.1186/s12920-019-0650-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12920-019-0650-0

Keywords