 Research
 Open Access
 Published:
MetaQtest: metaanalysis of quadratic test for rare variants
BMC Medical Genomics volume 12, Article number: 102 (2019)
Abstract
Background
In genomewide association studies (GWASs), metaanalysis has been widely used to improve statistical power by combining the results of different studies. Metaanalysis can detect phenotype associated variants that are failed to be detected in single studies. Especially, in biomedical sciences, metaanalysis is often necessary not only for improving statistical power, but also for reducing unavoidable limitation in data collection. As nextgeneration sequencing (NGS) technology has been developed, metaanalysis of rare variants is proceeding briskly along with metaanalysis of common variants in GWASs. However, metaanalysis on a single variant that is commonly used in common variant association test is improper for rare variants. A sparse signal of rare variant undermines the association signal and its large number causes multiple testing problem. To overcome these problems, we propose a metaanalysis method at the genelevel rather than variant level.
Results
Among many methods that have been developed, we used the unified quadratic tests (Qtests); Qtest is more powerful than or as powerful as other tests such as Sequence Kernel Association Tests (SKAT). Since there are three different versions of Qtest (QTest1, QTest2, QTest3), each assumes different relationships among multiple rare variants, we extended them into metastudy accordingly. For metaanalysis, we consider two types of approaches, the one is to combine regression coefficients and the other is to combine test statistics from each single study. We extend the Qtest for metaanalysis, proposing Meta Quadratic Test (MetaQtest). Meta Qtest avoids the limitations of MetaSKAT. It does not only consider genetic heterogeneity among studies as MetaSKAT but also reflects diverse real situations; since we extend three different Qtests into metaanalysis respectively, flexible Meta Qtest suggests way to deal with genelevel rare variant metaanalysis efficiently From the results of real data analysis of blood pressure trait, our metaanalysis could successfully discovered genes, KCNA5 and CABIN1 that are already well known for relevance with hypertension disease and they are not detected in MetaSKAT.
Conclusion
As exemplified by an application to T2D Genes projects data set, MetaQtest more effectively identified genes associated with hypertension disease than MetaSKAT did.
Background
Genomewide association studies (GWASs) have identified many loci that contributed to human complex traits. As genotyping technologies such as nextgeneration sequencing (NGS) technology evolve, we have been able to gain larger data and more accurate information on human genetics. Discovery of rare variants is one of the most valuable crops of the NGS technologies [1]. The subject of analysis naturally went over from common variants to relatively less studied rare variants, because GWASs on common variants could not entirely explain geneticheritability, only explains small portion of expected heritability. Such phenomenon, known as “missing heritability”, posed the necessity of analyzing rare variant in human disease with a belief that rare variants play an important role in association study [2].
Persisting on same methods in common variant analyses is not appropriate for dealing with the rare variants [3]. Due to the fact that only few people share rare variants, we need a larger sample size than in common variant association test. Small sample size could markedly lower the power of a statistical test. Besides, if each variant effect is weak then single variant analysis has lower power to detect true weak signal. Therefore, in such situations instead of single variant association test, gene level test that handles multiple variants in a gene could be helpful in strengthening the signals by considering several weak ones at a time [4]. In addition to the benefit of increasing the power, genebased multimarker test mitigate the burden of multiple testing correction and easily interpret biological functional meaning of detected genes from the result of test. For these reasons, genebased test is often used for rare variants analysis.
Over the past few years, various statistical methods for genelevel rare variant association test are developed. From collapsing based methods such as Combined Multivariate and Collapsing (CMC) and variable thresholds test (VT test) to variance component tests such as Calpha and Sequence Kernel Association Tests (SKAT or SKATO), each method performs well in different situations [5,6,7,8,9]. CMC method is the one of most representative burden type tests. It unifies collapsing technique and multivariate ttest, Hotelling’s Ttest. Based on variants’ minor allele frequency (MAF), variants are divided into several subgroups, then their genotype values are summarized in 0 or 1. With collapsed genotype values, Hotelling’s Ttest is conducted. Likewise, CMC or VT test is also well used burden test, but it is adaptive burden test. Compared to regular burden test like CMC, VT test allows flexible MAF threshold. Since appropriate threshold can impact on power, VT test can increase the power by choosing the optimal threshold that maximizes test statistic [6]. CMC and VT test are powerful under the assumptions of high proportion of causal variants and same direction of their effects on certain disease; most singlenucleotide polymorphisms (SNPs) in a gene are causal and they are all protective or deleterious. When a small number of SNPs are causal and some of causal are protective and others are deleterious in a gene, burden test loses the power, and Calpha or SKAT outperform them. Both Calpha and SKAT are variance component test that test the variance components instead of means. Calpha test is designed for casecontrol studies without covariates. Under null hypothesis that says that no variants are associated with a phenotype, for casecontrol data, the distribution of allele counts follows binomial distribution. It compares the observed variance of counts with expected variance. The test statistic for Calpha includes squared terms that are observed sample variances, thus Calpha is robust even in the presence of different directions of variants’ effects (because the signs of effects are canceled out and only their effect sizes remain in the test statistic). Despite of the advantage as noted earlier, Calpha has some disadvantages too. P value is computed using permutation that requires intensive computation and covariate adjust is not available. The method proposed to solve these problems is SKAT. SKAT is variance component score test implemented in a regression framework. To test the null hypothesis of zero regression coefficients effect sizes of genetic variants in a gene, SKAT assumes that each regression coefficient follows arbitrary distribution with zero mean and the variance, product of weight and variance component. Then, testing original null hypothesis becomes the same as testing whether each variance component is zero or not. Variance component score statistic is employed in this process. Since SKAT is derived from regression model, it can include covariate terms easily. P value is calculated analytically and diverse kernel functions that can explain genetic similarity between individuals are introduced. Furthermore, optimal version of SKAT, SKATO is proposed to achieve robust power regardless of directions of variants. SKATO is combination of burden test and SKAT, weighted average of burden and SKAT. SKATO searches the optimal weight that is the weight obtaining the minimum p value of test statistics. However, even though robust power is substantially attained by SKATO test, there is still no uniformly most powerful test in all situations [10].
Genelevel Qtest is another powerful rare variant association test [11]. It also uses classical multiple linear regression as well as SKAT, but it takes Wald test based on an eigenvalue decomposition of regression coefficients. Quadratic form of test statistics in Qtest is efficiently implemented in diverse scenarios which embrace various cases in the relation of SNPs in a gene. First, Qtest considers proportion and effect sizes of causal rare variants. Second, it considers the direction of causal variant effects. Finally, it can be used even in the presence of rare variants with common variants, together; this is the major advantage of Qtest. In other words, Qtest could achieve robust power in any case, and its exceeding power was verified in enormous simulation studies.
Metaanalysis is a popular approach to increase the power in GWASs [12]. Aggregated summary from diverse studies recovers as much information as individuallevel data but without any exertion of pooling early stage data sets. In this respect, metastudy has an advantage of increasing the sample size and preserving computational efficiency [13]. Metaanalysis is sometimes essential in inevitable circumstances where individuallevel data cannot be distributed although quickly advancing NGS technologies allow us to have sequencing data at smaller cost than before, producing data still requires considerable time and money. Not only because of this, but also because of releasing personal data to public is a sensitive issue, not all individuallevel datasets are shared. Therefore, metaanalysis, which requires results only, becomes exceptionally useful.
The recently proposed MetaSKAT is extended SKAT for metaanalysis for genelevel rare variants. MetaSKAT aggregates the score statistics of each variants in a gene came from SKAT. Depending on the assumption of genetic effect, homogeneity or heterogeneity of genetic effects across studies, it aggregates the summary score statistics. When genetic effects are homogeneous, summary statistics are combined across the studies first and then combined across the variants in a gene, but when genetic effects are heterogeneous, the combining order is reversed. Optimal MetaSKAT that is weighted sum of test statistic of MetaSKAT and its burden for metaanalysis is also proposed to embrace the merits of burden and nonburden test together. However, simulation results of MetaSKAT show that type I error rates are somewhat uncontrolled. There is another limitation of MetaSKAT. When cohort specific genes are detected by MetaSKAT, it does not report that which cohort is highly associated with a phenotype [14].
In this paper, we extend the Qtest for metaanalysis, proposing Meta Quadratic Test (MetaQtest). Meta Qtest avoids the limitations of MetaSKAT. It does not only consider genetic heterogeneity among studies as MetaSKAT but also reflects diverse real situations; since we extend three different Qtests into metaanalysis respectively, flexible Meta Qtest suggests way to deal with genelevel rare variant metaanalysis efficiently.
Materials and methods
Qtest for single study
Qtest is established in a multiple linear regression framework for a quantitative trait.
where y_{ki} is the phenotype of the ith individual in the kth study, S_{kji} is the genotype value coded 0, 1 or 2 under an additive genotype model (dominant or recessive model is also applicable), β_{k} = (β_{k0}, ⋯, β_{km})^{,} is the vector of regression coefficients for genetic effects of m SNPs, Z_{ki} is the covariate value, and γ is the corresponding vector of regression coefficients. After fitting model, using the estimated regression coefficients, β_{k}, QTest1 creates the new variable, pooled coefficients, β_{pooled, k}. Then, the null hypothesis of interest is to test whether pooled genetic effect exists or not:
However, for QTest2 we test the vector of regression coefficients, β_{k}, rather than pooled coefficient to allow for considering bidirectional variants effect. Corresponding null hypothesis is
To test the hypothesis, Qtest constructs a Waldtype statistic which has the form of quadratic statistics. Depending on the different assumptions needed to collapse the estimates of effects size parameters, there are three versions of Qtest: QTest1, QTest2, and QTest3.
Burden Test: Quadratic Test1 (QTest_{1})
QTest_{1} is a burden type of test. A basic assumption is that all of SNPs in a gene have the same direction of effects on the phenotype. That is, we assume that all variants within a region are all deleterious or protective. If the assumption is true, the power of the burden test becomes higher by aggregated effects of each variant. When combining variants, Qtest_{1} uses the inverse variance weighting method that gives more weight to SNPs that have smaller variances,
Also, like SKAT, QTest_{1} introduces MAF based weight,\( {\mathrm{w}}_{kj}=1/\sqrt{MAF_j\left(1{MAF}_j\right)} \), proposed by Madsen and Browning [15]. By using the MAF based weight, our research focus on the rare SNPs is justified (rarer variant has higher probability of being causal variant). With these two different weights, aggregated effects are expressed as \( {\hat{\beta}}_{pooled,k}={\alpha}_k^T{W}_k{\hat{\beta}}_k,\kern0.5em \mathrm{where}\ {\alpha}_k \) is the vector of each variant’s inverse variance weight and W_{k} is the diagonal matrix of each variant’s MAF based weight,
\( {\hat{\beta}}_k \)is the vector of estimated coefficients of variants. In the ordinal regression framework, \( {\hat{\beta}}_{pooled,k} \) follows a normal distribution with zero mean and \( {\alpha}_k^T{W}_k\mathit{\operatorname{var}}\left({\hat{\beta}}_k\right){W}_k{\alpha}_k \) variance. Based on the distribution of \( {\hat{\beta}}_{pooled,k} \), the Wald type of statistics QTest_{1} is given by
Nonburden test: quadratic Test2 (QTest_{2})
In case of mixed variants in direction of effect, Qtest_{1} would work poorly. Thus, QTest_{2} assumes that some of variants are protective, and the others are negative. Instead of aggregating the effects of variants, it directly constructs Wald statistics for \( {\hat{\beta}}_k \),
where m is the number of variants in a gene. However, the number of rare SNPs in a gene is usually large, so a large degree of freedom lowers the power. Thus, QTest2 proposes a gamma method that could lower the degree of freedom [16].
The final form of Qtest2 statistics is
Optimal test (unified test): quadratic Test3 (QTest_{3})
Qtest_{3} is an optimal method, weighted average of burden type (Qtest_{1}) and nonburden type (Qtest_{2}) statistics. To combine two statistics that follow chi square distributions with different degrees of freedoms, there is a step for making Q_{2} has the same degree of freedom as Q_{1} (df = 1). To summarize the steps, first we define the new parameter, \( {\hat{\beta}}_k^{\ast } \) that is independent with \( {\hat{\beta}}_{pooled,k} \), and based on the new defined \( {\hat{\beta}}_k^{\ast } \), we can get the Q_{2} statistics, \( {Q}_{2\mid 1}^{\ast } \). Next step is to use a gamma method to make \( {Q}_{2\mid 1}^{\ast } \) has 1 degree of freedom (Q_{2 ∣ 1}). Using final two statistics, Q_{1} and Q_{2 ∣ 1}, we can get the optimal Qtest_{3} statistics through a grid search of weight. Final p value of Q_{3} is calculated by empirically. Precalculated empirical distributions are employed in the final step, so the computational burden is reduced.
Since Qtest_{3} could accommodate both scenarios that assume same or different direction of SNPs in a gene, its result is usually robust. The followings are detailed steps of Qtest_{3}.
In step 1 in algorithm 1, we compute \( {\hat{\beta}}_k^{\ast } \) to make \( {\hat{\beta}}_k^{\ast}\perp {\hat{\beta}}_{pooled,k} \). In step 2, we compute \( {Q}_{2\mid 1}^{\ast } \), where \( {V}_k^{\ast }=\mathit{\operatorname{var}}\left({\hat{\beta}}_k^{\ast}\right) \) and \( {U}_k^{\ast } \) consists of eigenvalue vectors of \( {V}_k^{\ast } \), and \( {\Lambda}_k^{\ast } \) is the diagonal matrix whose diagonal elements are eigenvalues of \( {V}_k^{\ast }. \) By step 3, we could obtain Q_{2 ∣ 1} which follows \( {\chi}_1^2 \) distribution, where p_{2 ∣ 1} is obtained from \( {Q}_{2\mid 1}^{\ast } \).
MetaQtest for Meta analysis
Metaanalysis requires summary statistics from each single study. P value and zscore have been conventional summary statistics that could combine the results across study, but in GWAS metaanalysis, the methods using p value and zscore are generally inferior to the model based metaanalysis; those methods cannot efficiently take account of the betweendataset heterogeneity [17]. In model based metaanalysis, there are two different approaches, fixed effect model and random effect model [18]. Under the fixed effect model, effect sizes of all studies are presumed to be same, in other case, to be different [19]. MetaQtest considers both cases. Consequently, each QTest has fixed and random versions of metaanalysis.
Another significant feature in MetaQtest is it is extended keeping the original statistical model structure, structure of QTest. Qtest_{1} is burden type and Qtest_{2} is nonburden type test. This fact is also applied to MetaQtest. MetaQtest_{1} is a burden and MetaQtest_{2} is a nonburden type test. MetaQtest keeps not only type of statistics but also the process derived the statistics. By doing that, it can consistently maximize merits of test statistics.
Summary statistics used and the way to synthesize them are essence in metaanalysis [20]. Depending on the type of modelbased meatanalysis (fix or random) and the type of test statistics (burden or nonburden), metaanalysis requires different input values and takes different approaches for combining those values.
Burden test: Meta quadratic Test1 (metaQtest_{1})
Metaanalysis assuming homogeneous genetic effects across studies: HomometaQ_{1}
First, we extend Qtest1 to metaanalysis maintaining burden type test. We assume that all variants in a gene across studies are from one single study. In Qtest1, to aggregate the effects of variants, \( {\hat{\beta}}_{pooled,k}={\alpha}_k^T{W}_k{\hat{\beta}}_k \) is introduced. Similarly, we introduce \( {\hat{\beta}}_{pooled} \) that adds up all of variants throughout studies. The way of combining each variant’s effect is exactly same as Qtest_{1}, weighted sum of effect sizes of all variants. When we suppose that there are K studies, the estimated pooled effect size of regression coefficients is \( {\hat{\beta}}_{pooled}={\alpha}^TW\hat{\beta}={\sum}_{j=1}^m{\alpha}_j{w}_j{\hat{\beta}}_j \), where j is the index for variants, \( \hat{\beta} \) is the column vector composed of \( {\hat{\beta}}_j \), α is the column vector that includes α_{j} s (j = 1, 2…,m) as components, and W is the diagonal matrix with w_{j} s as diagonal elements.
It is natural that we collapse the variants in the same locus with appropriate weights, such as sample size. Testing hypothesis in this case is H_{0} : β_{pooled} = 0. Since under the null assumption, \( {\hat{\beta}}_{pooled} \) is normally distributed with zero mean and variance α^{T}WVWα, the Wald type of test statistics is
Since the statistics for metaanalysis is constructed based on the preexisting result of individual level data analysis, \( \mathit{\operatorname{var}}\left(\hat{\beta}\right) \) cannot be estimated by the raw data. However, if we assume the independence among studies, we can define \( \mathit{\operatorname{var}}\left(\hat{\beta}\right) \) as the block diagonal matrix that has diagonal with \( \mathit{\operatorname{var}}\left({\hat{\beta}}_k\right) \),
We can easily check that Q_{hom − meta − q1} follows chisquare distribution with 1 degree of freedom, because it is the square of single standard normal random variable,\( \kern0.5em {\hat{\beta}}_{pooled} \). To draw the statistics, Q_{hom − meta − q1}, we need estimated \( {\hat{\beta}}_{kj}s \) from each study, its variance and MAF weights as inputs. Thus, we call this approach betabased approach, also because this is burden type, it is like assuming homogeneous effect (same regression coefficients) across the studies.
Metaanalysis assuming heterogeneous genetic effects across studies: HetmetaQ1
Assuming a variant may have a different effect across studies, we can consider the case of heterogeneous effects over studies. This assumption is in accordance with a meta model with random effects. Since we allow the heterogeneity of effect, we can use the results of each study’s regression fit. The model fittings were carried out separately, so the test statistics for association are made of different estimated regression coefficients. Accordingly, the outcome of model fitting in each study that takes account of heterogeneity is appropriate for summary statistics for metaanalysis.
The fact that Q_{1} follows the chisquare distribution also makes us combine Q_{1} s easily. Thus, test statistics of each study would be a naïve and handy summary statistic for metaanalysis
However, Q_{het − meta − q1} is a statistic for burden test in respect of sum of single burden statistics. Wald type of statistics for a single variant is distributed as the chisquare distribution with one degrees of freedom. When we assume the independence between studies, then the sum of test statistics also follows a chisquare distribution with K degree freedom, where K is the number of studies in metaanalysis, \( {\mathrm{Q}}_{het metaq1}\sim {\upchi}_K^2 \). In the process of deriving Q_{het − meta − q1}, we only require the test statistics. Therefore, we call this approach statistics based meta.
Nonburden test: Meta quadratic Test2 (metaQtest_{2})
Metaanalysis assuming homogeneous genetic effects across studies: HomometaQ2
We extend Qtest2 to metaanalysis in the same manner as the MetaQtest1: fixed(homo) and random(hetero) version of meta. However, the main difference between MetaQtest1 is that we do not combine the effects of different variants, we dose only collapse the same variants. The basic idea of MetaQtest2 is that we regard SNPs on the same locus as the single SNP. For this reason, this approach is different with burden metaanalysis.
It is possible to collapse variants of same locus, because of the assumption of homogeneity in genetic effects size throughout studies. In order to make representative genetic effects size that combines SNPs on the same locus, we use weighted sum. Weights are given as proportional to sample size of study. Representative of regression coefficient of jth SNP is expressed as below,
For the sake of simplicity, we assume that the number of variants in a gene is the same across the study, m = m_{1} = m_{2} = ⋯ = m_{K}.
Since all populations do not share the same variants commonly, some variants exit only in a specific population. We thus put an indicator function in front of index of sample size. To maintain the framework of Qtest_{2}, we also need variancecovariance matrix of estimated regression coefficients. Under the independence assumption of studies, we derive the variancecovariance matrix, V, analogous to \( {\hat{\beta}}_j \) using the sample size proportional weights,
The following step is identical to QTest2. Using eigendecomposed V = UΛU^{T}, we construct the Wald type statistic,
Metaanalysis assuming heterogeneous genetic effects across studies: HetmetaQ2
For rare variant analysis, overlapped variants with other studies are not many. Thus, the approach of homometaQ2 that aggregates the overlapped SNPs in the same locus could not be appropriate for the rare variant analysis; the estimate of variancecovariance matrix is unstable. An easy alternative way is to assume not homogeneity but heterogeneity of genetic effects. To satisfy this assumption and to accommodate different directional effects among variants in a region, we use the result of single study.
Combining single study results using p value is conventional, but this method has a limitation upon achievement of heterogeneity among studies; it just assumes independence among studies. Instead of pvalue based method, we combine test statistics of each study for Qtest2. Q_{2} follows a mixture of chisquare distributions and the summation of the Q_{2} then also follows a mixture of chisquare distributions.
In Homo/HeterometaQ2, the value of a determines the shape of the mixture of chisquare distribution and this value depends on the gene size (number of SNPs in a gene). Therefore, the choice of a should be careful.
Optimal test (unified test): Meta quadratic Test3 (MetaQtest_{3})
We develop an optimal quadratic metaanalysis for robust power gain. A burden type test is known to be more powerful when most variants in a region are causal and their effect directions are the same, but a nonburden type is opposite to this case. Therefore, applying one of the two tests can lose the power to detect the associated variants with a phenotype, if every single gene dose not satisfy the same assumption. A unified approach is to use a weighted average of burden and nonburden test. Since this approach allows two conflicted assumptions about direction among variants (burden or nonburden), an optimal test usually achieves the robust power regardless of directional assumptions.
Metaanalysis assuming homogeneous genetic effects across studies: HomometaQ3
When studies are homogeneous, we suggest Q_{hom − meta − q3} that is weighted average of Q_{hom − meta − q1} and Q_{hom − meta − q2}. We build a test for meta with an adjustment for homogeneous case. The following are steps for constructing Q_{hom − meta − q3}. The steps are identical to Qtest3 but for substituting collapsed variables, (\( \hat{\beta} \), \( {\hat{\beta}}_{pooled} \), α, W, V) instead of vector of variables from each study, (\( {\hat{\beta}}_k \), \( {\hat{\beta}}_{pooled,k} \), α_{k}, W_{k}, V_{k}). The collapsed variables are defined in homo MetaQtest. Like Qtest3, the empirical distribution of suggested test statistic is calculated from precalculated distribution in the step 5.
Metaanalysis Assuming Heterogeneous Genetic Effects across Studies: HetmetaQ3.
When studies are heterogeneous, we suggest Q_{het − meta − q3} that is weighted average of Q_{het − meta − q1} and Q_{het − meta − q2}. The following are steps for Q_{het − meta − q3}. The steps are similar to Qtest3 and MetaQTest3 but, in step 2 we use the sum of Wald type statistics, \( {Q}_{2\mid 1}^{\ast } \) from each study. For the sake of simplicity, we assume that the number of variants in a gene is the same across the study, thus the degree of freedom is just multiplication K by m.
Numerical simulations
To evaluate the properties of the proposed methods, we performed the simulation study. We generated the genotype data using COSI program that implements coalescent model, and we obtained both length 200 kb and 10,000 Europeanlike haplotypes and AfricanAmericanlike haplotypes [21]. By randomly mating with haplotypes, we could obtain genotype data for analysis. The 3 kb regions are also randomly selected for each gene and rare variants under the threshold are analyzed [22]. We filtered out common variants with the threshold of MAF larger than 0.03 and singleton SNPs. Left figures consider MAF < 0.05 and right figures consider MAF < 0.01. In European population, 82.93% variants have MAF < 0.01 and 67.51% variants have MAF < 0.001, this frequency is different from that of generated simulation data set in MetaSKAT (86 and 76% respectively). Since COSI program gives randomness in generating haplotypes, this difference could happen even using the same input parameters. Table 1 shows the simulation study settings which we borrowed the basic idea of them in MetaSKAT. For precise comparison of our methods with MetaSKAT, we brought their setting here and tried to prove better performance of MetaQtest even in the setting for MetaSKAT. Giving variety to kind of population such as sample size and number of covariates, we considered six different scenarios like MetaSKAT. Using COSI program, MetaSKAT generated genotype data set for different populations and created 3 single studies. The number of different scenarios of simulation that they considered is 6, and half of them have the same study sample size and rest of them have different sample size. To give heterogeneity among the studies, in addition to different size, they also allocated different population groups and different number of covariates to each study. We compared the proposed methods with other existing metaanalysis methods. Metaanalysis methods that we considered are as follows: MetaQtest (HommetaQ1, HetmetaQ1, HommetaQ2, HetmetaQ2, HommetaQ3, HetmetaQ3), MetaSKAT (HommetaSKAT, HetmetaSKAT, HommetaSKATO, HetmetaSKATO), MetaBurden, Fisher’s combined probability test and Stouffer’s Zscore method [23, 24].
Type I error and power simulations
For type I error simulations, we generated 100 phenotypes for 100 each gene under the null hypothesis; there is no association between a gene and a phenotype.
As the type I error simulation of MetaSKAT, X_{k1i} is the covariate taking 0 or 1 value with equal probability 0.5. The rest of covariates, from X_{k2i} to \( {X}_{k{q}_ki} \) are taken from a standard normal distribution. The information of covariates for each study is in the Table 1. The index, q_{k} indicates the number of covariates for kth study. Since we generated 100 genes and 100 phenotypes, we carried out 10,000 times association tests, and the level of significance, we gave α = 0.01 and 0.001.
Unlike type I error simulations, we generated 1000 genotype datasets and made phenotypes using the below model for power simulations,
G_{ki, causal} is a vector of causal variants in a gene, and β_{k, causal} is their effect size. For the proportion of causal variants, we set four cases, 10, 20, 30 and 50% of variants are causal like in MetaSKAT. To illustrate the effect of burden or nonburden type test, we also assumed that all variants are positive or 80% are positive and rest of 20% are negative. Since the number of causal variants or positive variants could not be integer in some cases because of small number of variants in a gene, we generated number of causal variants in Bernoulli distribution. The regression coefficient of genetic effect is given same as the MetaSKAT, β_{k, causal} = clog_{10}(MAF). However, we used study specific MAF rather than population MAF, because in reality we hardly get the population MAF even in Metaanalysis. Defining the coefficient in this way reflects the assumption of rare variant study; the rarer SNPs have the larger effects on the phenotype. We set c = 0.475 in 5% of causal variants, 0.375 for 10%, 0.25 for 20%, and 0.175 for 50% of causal variants respectively. Since the effect size of regression coefficient depends on the MAF, the case for different MAF across the studies assumes heterogeneous effect and the case for the same MAF assumes homogeneous effect. These assumptions rely on the belief that different populations could have different MAF, and this phenomenon is called population stratification. In the simulation work, there are some differences with MetaSKAT. First, we used haplotypes that have different distribution of allele frequencies with that of MetaSKAT and second, we used study specific MAF rather than population MAF that was used in simulation of MetaSKAT. Finally, we used Bernoulli distribution to assign causal variants rather than exact given proportion percentages. We expected that the differences might cause the slightly different power results of MetaSKAT calculated here with MetaSKAT paper.
Materials
Metaanalysis for genelevel rare variants association studies
We analyzed whole exome sequencing data from Type 2 Diabetes Genetic Exploration by Nextgeneration sequencing in multiEthnic Samples (T2DGENES) consortium. Sequencing was performed at the Broad Institute with Agilent v2 capture reagent on HiSeq platform. The consortia have exome sequencing data of 13,000 individuals from 5 different ancestry groups: African Americans (AJ, AW), American Hispanics (HA, HS), East Asian (EK, ES), South Asians (SL, SS), and European (UA, UF, UG, US, and UB). The words in parenthesis stand for the populations in the ancestry group. For metaanalysis, we selected East Asian ancestry group, EK from Korean samples and ES from Singaporean samples. Total number of samples in EK is 1086 and in ES is 1078. The primary goal of two consortia is to identify novel type 2 diabetes (T2D) related genetic factors, but data also includes several diabetesrelated quantitative traits such as lipid and blood pressure traits. We applied the proposed meta methods to lipid and blood pressure traits: total cholesterol (CHOL), HDL cholesterol (HDL), LDL cholesterol (LDL), triglycerides (TG), systolic blood pressure (SBP) and diastolic blood pressure (DBP). Subjects having medication that might affect the traits were excluded for the analysis. Sample sizes considered missing count and medication are described in Table 2. For covariates, we used age, sex information (BMI information was used only for blood pressure phenotypes).
Results
Type I error
Empirical type I error rates under the setting of Scenario 1 and 2 are calculated. Since first scenario consider homogeneous case and second scenario considers heterogeneous case, we only calculate type I error in these two scenarios. The rates are calculated under each significant level, 0.01 and 0.001. Empirical type I error rates under the setting of Scenario 1 are given in Table 3. For both MetaQTest and MetaSKAT, type I error rates were well controlled, but in HetmetaSKATO, type I error rate was somewhat inflated. Empirical type I error rates under the setting of Scenario 2 are given in Table 4. For both MetaQ and MetaSKAT, type I error rates were also well controlled, but in HommetaQ2, type I error rate was somewhat inflated.
Power
To compare the power, we performed the simulation work. First, we compared power of our meta method with those of joint analysis to check efficiency of meta method. When metaanalysis yields highly comparable results with joint analysis, the meta method serves well as an alternative of joint analysis. Since jointanalysis cannot handle heterogeneity between studies such as different number of covariates directly, we only performed jointanalysis in homogeneous scenario settings (scenario 1 and 4). In Fig. 1, the blue 8 dots indicate each case for senario1 (causal 5 to 50% and all same direction of variants to different direction) and the pink dots are for scenario 4. Jointanalysis is consistently powerful, but HommetaQ also achieves highly comparable power compared with jointanalysis. Especially, for scenario 1, our meta methods recovers almost power of jointanalysis.
Figure 2 shows power comparisons of eight different meta methods when all causal variants are risk increasing. In the figure, first 4 bars indicate our proposed methods, HommetaQ2, HommetaQ3, HetmetaQ2 and HetmetaQ3 in order. The next 4 bars indicate competing MetaSKAT methods, HommetaSKAT, HommetaSKATO, HetmetaSKAT and HetmetaSKATO. In the first scenarios, since we assume homogeneous genetic effects, HommetaQ are more powerful than HetmetaQ. Up to 20% causal, HommetaQ2 is the most powerful but in 50% causal case, as the power of burden test increases, HommetaQ3, optimal type, becomes the most powerful. In these scenarios, our methods outperform in 5%~ 20% percentages of causal variants. However, when a causal percentage reaches 50%, then MetaSKAT (HommetaSKATO) overtakes the power (HommetaQ3: 74.2% and HommetaSKATO: 76.4%). The reversal occurs, because HommetaQ1 works poorly even HommetaQ2 is more powerful than HommetaSKAT when it is expected to outperform in a case of high causal percentage. But, the power differences are not so big, about 2% difference. Inscenario 2, we gave heterogeneity between studies by giving different number of covariates. The powers of HetmetaQ are higher than those of HommetaQ and their power differences are clearly different, about twice higher in 5% of causal variants. HetmetaQ outperform except in the case of 50% of casual variants and this trend also appears in the rest of scenarios: as the number of causal variants increases, MetaSKAT is more powerful than our method. The reason why this trend is maintained is that our burden type test has lower power than MetaBurden and even our nonburden type test when burden type test should have the highest power is more powerful. In scenario 3, population of studies is different and the number of covariate is also different. In this more heterogeneous case, HetmetaQ has higher power than HommetaQ, and the value of power itself is higher than in scenario 2. Our method outperform up to 20% of causal variants. As the percentage of casual variants increases the power of our method becomes lower than MetaSKAT. One of the distinguishable features of our method is that the power of our methods varies sensitively depending on the range of heterogeneity. The gap of power differences of homo and heterogeneous cases are larger than those of MetaSKAT. All methods in MetaSKAT have the robust power but in 50% of casual variants. Thus, when there are information about heterogeneity prior to analysis and the percentage of causal variants is expected to be lower, then our method is more powerful than MetaSKAT. For the rest of scenarios, we gained similar power results with previous scenarios; only in the case of 50% causal variants, MetaSKAT has higher power than MetaQ, but in scenario 4, our methods outperformed regardless of causal variants percentages and in scenario 6, in the case of 20% causal variants with 50% causal, MetaSKAT outperforms.
Figure 3 shows power comparisons of meta methods when 20% of causal variants are risk decreasing and 80% are risk increasing. In scenario 1, HommetaQ have higher power than MetaSKAT regardless of percentage of causal variants. However, the power differences become smaller as casual percentage increases. Compared to the case of all risk causal variants, in 50% of causal variants our method still have higher power because power loss of burden type is not large. In scenario 2 and 3, HetmetaQ have higher power than MetaSKAT. Our methods show significant difference between HommetaQ and HetmetaQ, but MetaSKAT has little difference between their methods. In scenario 5, up to 20% causal, our best method is powerful than MetaSKAT, but when causal variants are 50%, then MetaSKAT outperforms. Moreover, in scenario 6, in only 20% causal, MetaSKAT outperforms, but the power differences between our proposed methods and MetaSKAT are smaller than those observed in the previous, because in this setting nonburden type test is more powerful than burden, so our burden type test that is less powerful is less involved here. The reason why our burden type test, HommetaQ, have little power in heterogeneous case is that the assumption used in burden type test is not kept well in heterogeneous case. In heterogeneous case, we can hardly say that all of variants in a gene across the different population groups are detected in any population and have same direction. Thus, we do not include the result of HommetaQ in simulation and real data analysis.
Real data analysis results
We applied suggested methods to the real lipid and blood pressure data. All considered meta methods did not detect any significant genes for lipid traits. But, for blood pressure traits we gained the different association results with lipid traits. Before analysis, we filtered common SNPs. Thus, remaining number of genes for the analysis of SBP is 9303 and SNPs in each gene have MAF < 0.05 and MAC > 4. Unlike MetaSKAT methods that failed to detect any significant genes as well in the lipid traits analysis (at the bonferroni’s significant level and arbitrary level, 1.00e04), MetaQtest detected PCDHA9 gene in HetmetaQ2 (p value = 4.75e05) at the bonferroni’s significant level (α = 5.37e06), but this gene is also not known to be related with blood pressure. According to GO annotations, it is involved in calcium ion binding. However, the gene that has the second smallest p value in both HetmetaQ2 (p value = 1.19e05) and HetmetaQ3 (p value = 2.41e05) is KCNA5 and this gene is already known for having strong relevance to hypertension [25, 26]. Table 5 shows that p values of genes that were detected at the threshold, 1.00e04 have the smallest value in MetaQtest. Moreover, we compared our meta method results with that of single QTest analysis to verify the improved power of meta methods. Table 6 shows that PCDHA9 gene is also detected in QTest2 in EK single study analysis and its p value (p value = 8.22e07) is smaller than in HetmetaQ2. However, its p value in ES single study analysis has very small value, so p value from MetaQtest is the compromised these two p values from EK and ES single QTest analysis. But, p values of KCNA5 from each single QTest are both larger than p value from MetaQtest. This fact can say that KCNA5 which is known for one of causal genes of hypertension arterial was only detected by MetaQtest and MetaQtest improved the statistical power of QTest.
For the analysis of DBP, after filtering common variants, the number of genes is 10,528 and SNPs in each gene have MAF < 0.05 and MAC > 3. In the metaanalysis of DBP, two genes, DTYMK and PCCA were detected at the Bonferroni’s significant level (α = 4.75e06). Table 7 shows that MetaAnalysis results for testing the rare variants effects on DBP. DTYMK was detected in HetmetaQ3 (p value = 1.17e06) and PCCA was detected in HommetaSKATO (p value = 3.18e06). DTYMK, however, is only found in EK samples, thus MetaQtest result is only stemmed from EK QTest result (Table 8 shows that p values of QTest are the same with that of MetaQtest). When a gene exists in a single study, then our MetaQtest result reflects the result of single study. DTYMK is not currently identified as related with blood pressure, but as related with thymidylate kinase activity. Another detected gene in MetaSKAT, PCCA is not known to be associated with blood pressure function, but it is related with propionic academia and pccarelated propionic academia.
PCCA gene was also detected using QTest1 of ES sample analysis (p value = 2.76e06, Table 8). But EK QTest results offset the results of MetaQtest, so MetaQtest could not detect this gene.
Although CABIN1 gene was not discovered at the Bonferroni’s significant level, but it is discovered at the FDR adjusted p value. CABIN1 which was detected using HetmetaQ2 and HetmetaQ3 (p value = 8.67e06 and 6.91e06 respectively) is known to be associated hypertension arterial and purpura thrombotic thrombocytopenic that is closely related blood pressure. Table 8 shows that QTest results of EK and ES population groups. CABIN1 was not detected using single QTest.
Discussion and conclusion
We propose MetaQtest for metaanalysis of genelevel rare variant association studies. The basis of MetaQtest is preserving the prior phase of association studies that is Qtest. MetaQtest retains the relationship among multiple variants in a region. Further, MetaQtest considers whether the genetic effects on the phenotype are same across studies or not. Assuming same effects corresponds to fixed effect metaanalysis model or else random effect model. By considering direction of variant effects and their equivalence in effect size at the same time, MetaQtest can cover broad range of realistic metaanalysis cases.
We investigated the performance of MetaQtest through simulation and real data analysis. Simulation studies showed that type I error rates were controlled well and MetaQtest, particularly HetmetaQ achieved the higher than or as powerful as MetaSKAT in various scenarios. However, when causal variants are over than 50%, then our methods are not powerful as MetaSKAT. Thus, when there are small percentage of causal variants, our method is more powerful in all scenario settings. Since Hommeta test uses estimating regression coefficients, satisfaction of model assumption affects the power of MetaQtest sensitively. If there are no many overlapped variants in a gene across single studies, the assumption of Hommeta is broken and HommetaQ hardly perform well. For this reason, in the result of real data analysis (EK and ES sample have the small proportion of shared SNPs in a gene), HommetaQ have inflated p values and we thought that most of them are false positive. Thus, in the result of real data analysis, we excluded the HommetaQ results. Therefore, we expect that the prior test of heterogeneity of genetic effects can help us to determine appropriate model. In real data analysis, we have shown that HetmetaQ searched out some known genes associated blood pressure trait. However, it also discovered some novel genes that are not at known to be associated with blood pressure at least for now, thus to validate biological relationship between them, more research and experiment in biology field are needed. For the future research, we can combine adaptive test that can determine degree of heterogeneity of genetic effects to improve the power further.
Abbreviations
 AJ:

African American Jackson Heart Study Candidate Gene Association Resource
 AW:

African American Wake Forest Study
 BMI:

Body mass index
 CHOL:

Cholestrol
 DBP:

Diastolic blood pressure
 EK:

East Asian Korea Association Research Project (KARE) and Korean National Institute of Health (KNIH)
 ES:

East Asian Singapore Diabetes Cohort Study and Singapore Prospective Study Program
 GWAS:

Genome wide association test
 HA:

Hispanic San Antonio Mexican American Family Studies, Texas
 HDL:

Highdensity lipoprotein cholestrol
 HS:

Hispanic Starr County, Texas
 LDL:

Lowdensity lipoprotein cholestrol
 QTest:

Quadratic test
 SBP:

Systolic blood pressure
 SKAT:

Sequence kerneal association test
 SL:

South Asian London Life Sciences Population (LOLIPOP)
 SS:

South Asian Singapore Indian Eye Study
 T2DGENES:

A consortium for Type 2 Diabetes Genetic Exploration by Nextegeneration sequencing in multiEthnic Samples
 TG:

Triglycerides
 UA:

European Longevity Genes in Founder Populations (Ashkenazi)
 UB:

European MalmoBotnia Study
 UF:

European Metabolic Syndrome in Men Study (Finnish)
 UG:

European Kooperative Gesundheitsforschung in der Region Augsburg (KORA)
 US:

UK Type 2 Diabetes Genetics Consoritum
References
 1.
Asimit J, Zeggini E. Rare variant association analysis methods for complex traits. Annu Rev Genet. 2010;44:293–308.
 2.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–53.
 3.
Hu YJ, Berndt SI, Gustafsson S, Ganna A. Genetic investigation of, a.T.C., Hirschhorn, J., north, K.E., Ingelsson, E., and Lin, DY : Metaanalysis of genelevel associations for rare variants based on singlevariant statistics. Am J Hum Genet. 2013;93:236–48.
 4.
Neale BM, Sham PC. The future of association studies: genebased analysis and replication. Am J Hum Genet. 2013;75:353–62.
 5.
Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83:311–21.
 6.
Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exonresequencing studies. Am J Hum Genet. 2010;86:832–8.
 7.
Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, OrhoMelander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. Testing for an unusual distribution of rare variants. PLoS Genet. 2011;7:e1001322.
 8.
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89:82–93.
 9.
Lee S, Emond, Bamshad MJ, Barnes MJ, Rieder KC, Nickerson MJ, Team DA, Christiani DC, Wurfel MM, Lin X. Optimal unified approach for rarevariant association testing with application to smallsample casecontrol wholeexome sequencing studies. Am J Hum Genet. 2012;91:224–37.
 10.
Ladouceur M, Dastani Z, Aulchenko YS, Greenwood CM, Richards JB. The empirical power of rare variant association methods: results from sanger sequencing in 1,998 individuals. PLoS Genet. 2012;8:e1002496.
 11.
Lee J, Kim YJ, Lee JY, T2DGenes Consortium, Lee S, Park T. Geneset association test for nextgeneration sequencing data. Bioinformatics. 2016;32:i611–19.
 12.
Lin DY, Zeng D. Metaanalysis of genomewide association studies: no efficiency gain in using individual participant data. Genet Epidemiol. 2010;34:60–6.
 13.
Evangelou E, Ioannidis JPA. Metaanalysis methods for genomewide association studies and beyond. Nat Rev Genet. 2013;14:379–89.
 14.
Lee S, Teslovich TM, Boehnke M, Lin X. General framework for metaanalysis of rare variants in sequencing association studies. Am J Hum Genet. 2013;93:42–53.
 15.
Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5:e1000384.
 16.
Biernacka JM, Jenkins GD, Wang L, Moyer AM, Fridley BL. Use of the gamma method for selfcontained geneset analysis of SNP data. Eur J Hum Genet. 2012;20:565–71.
 17.
Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient metaanalysis of genomewide association scans. Bioinformatics. 2010;26:2190–1.
 18.
Zeggini E, Ioannidis JP. Metaanalysis in genomewide association studies. Pharmacogenomics. 2009;10:191–201.
 19.
Nikolakopoulou A, Mavridis D, Salanti G. How to interpret metaanalysis models: fixed effect and random effects metaanalyses. Evid Based Ment Health. 2014;17:64.
 20.
Mosteller F, Colditz GA. Understanding research synthesis (metaanalysis). Annu Rev Public Health. 1996;17:1–23.
 21.
Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15:1576–83.
 22.
Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI reference sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012;40:130–5.
 23.
Fisher RA, Genetiker S, Genetician S, Britain G, Ge’ne’ticien S. Statistical methods for Researc workers. Edinburgh: Oliver and Boyd; 1970.
 24.
Stouffer SA, Suchman EA, DeVinney LC, Star SA, Williams RM Jr. The American soldier: adjustment during army life. In: Studies in Social Psychology in World War II, vol. 1; 1949.
 25.
Remillard CV, Tigno DD, Platoshyn O, Burg ED, Brevnova EE, Conger D, Nicholson A, Rana BK, Channick RN, Rubin LJ, et al. Function of Kv1.5 channels and genetic variations of KCNA5 in patients with idiopathic pulmonary arterial hypertension. Am J Phys Cell Phys. 2007;292:1837–53.
 26.
Pousada G, Baloira A, Vilarino C, Cifrian JM, Valverde D. Novel mutations in BMPR2, ACVRL1 and KCNA5 genes and hemodynamic parameters in patients with pulmonary arterial hypertension. PLoS One. 2014;9:e100261.
Funding
Publication costs are funded by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI) grant (HI16C2037). Also, this work was supported by the Bio & Medical Technology Development Program of the National Research Foundation of Korea (NRF) grant (2013M3A9C4078158) and by grants of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI16C2037, HI15C2165, HI16C2048).
Availability of data and materials
Not applicable.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 12 Supplement 5, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Medical Genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume12supplement5.
Author information
Affiliations
Consortia
Contributions
JK and TP developed statistical method. JK performed statistical analysis. TP conceived and planned the experiments. JK, JL. YK and TP planned and carried out the simulations. JK, JL BO, TP contributed to the interpretation of the results. JK and TP took the lead in writing the manuscript. All authors provided critical feedback and helped shape the research, analysis and manuscript. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Taesung Park.
Ethics declarations
Ethics approval and consent to participate
We used the exome sequencing data of 1,086 samples from KARE and 1,078 from LOLIPOP. KARE study is a part of Korean Genome Epidemiology Study (KoGES), and the dataset was used under the partnership of T2DGENES. The dataset of LOLIPOP study was also used under the partnership of T2DGENES. All participants of KARE study provided written informed consent. The study using KARE and LOLIPOP samples was approved by two independent institutional review boards at Seoul National University.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Ka, J., Lee, J., Kim, Y. et al. MetaQtest: metaanalysis of quadratic test for rare variants. BMC Med Genomics 12, 102 (2019). https://doi.org/10.1186/s1292001905165
Published:
Keywords
 Metaanalysis
 Rare variant analysis
 Exome sequencing
 MetaQtest