Here we propose a novel region-based interaction detection method for genome-wide case-control studies that utilizes SNP-based interaction test statistics and their covariances. The LD contrast test is adopted to measure SNP-based interaction effects. We derive the covariance of LD contrast test statistics, which enables a robust aggregation of SNP-SNP interactions within a region pair. Candidate regions are determined from gene definitions or BOOST results.
Genomic data formats
There are two alleles at almost every base-pair (bp) position in the human genome, one from the maternal chromosome and the other from the paternal chromosome. The combination of the two alleles is called the genotype of that bp position. SNPs are defined as base-pair positions that can exhibit different genotype values in different individuals. Typically a SNP has only two possible allele values in the population: one major allele with higher frequency (denoted as B) and one minor allele (denoted as b). Correspondingly, a typical SNP has three genotypes, i.e., BB, Bb and bb, where Bb is called a heterozygous genotype and the other two are called homozygous genotypes. GWAS uses microarrays to generate SNP genotype data. In SNP data analysis, we use 0/1/2, 0/1/1, and 0/0/1 for BB/Bb/bb as the encoding schemes for the additive, dominant, and recessive genetic models, respectively. A more flexible strategy is to estimate the effects of the three genotypes independently, at the price of increased degrees of freedom. Allele data could also be used for analysis, with 0/1 as the numerical values of the major/minor alleles. However, statistical inference needs to be performed in advance to retrieve allele information from the original genotype data, a task known as haplotype phasing in the GWAS community. In this paper, we focus on the analysis of genotype data.
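The three encoding schemes above can be sketched as follows (a minimal illustration; the function name and the minor-allele-count input convention are our own assumptions, not from the original):

```python
import numpy as np

# Encode genotypes given as minor-allele counts (0 = BB, 1 = Bb, 2 = bb)
# under the three genetic models described above.
def encode(genotypes, model="additive"):
    g = np.asarray(genotypes)
    if model == "additive":      # BB/Bb/bb -> 0/1/2
        return g
    if model == "dominant":      # BB/Bb/bb -> 0/1/1
        return (g >= 1).astype(int)
    if model == "recessive":     # BB/Bb/bb -> 0/0/1
        return (g == 2).astype(int)
    raise ValueError("unknown model: %s" % model)
```

For example, `encode([0, 1, 2], "dominant")` yields `[0, 1, 1]`.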
LD contrast test for SNP interaction detection
Current interaction detection methods are mainly based on deviations from additive effects under an assumed linear or logistic regression model. Nevertheless, this approach is not necessarily the most powerful one, given the uncertainty of the underlying biochemical mechanisms. The linkage disequilibrium (LD) contrast test provides another valuable perspective on this problem. Empirical studies have shown that the LD contrast test can achieve higher power than logistic regression under certain disease models for case-control studies [6]. In this paper, the LD contrast test is adopted to generate SNP-based interaction test statistics because of its clear statistical meaning and mathematical simplicity.
LD quantifies the statistical association between the allele values of two genetic loci. It is defined as the deviation from independence of two SNPs (with alleles A and B),
$$ LD = p(A,B)-p(A)p(B). $$
(1)
To avoid the ambiguity caused by haplotype phasing, composite LD (CLD), which only requires genotype data, is commonly used to approximate LD. CLD is defined as [15]:
$$\begin{array}{@{}rcl@{}} &&\ \ \ CLD = p_{AB}+p'_{AB}-2p(A)p(B) \\ with&&\ \left\{ \begin{array}{ll} p_{AB} = P^{AB}_{AB}+\frac{1}{2}\left(P^{AB}_{Ab}+P^{AB}_{aB}+P^{AB}_{ab}\right) \\ p'_{AB} = P^{AB}_{AB}+\frac{1}{2}\left(P^{AB}_{Ab}+P^{AB}_{aB}+P^{aB}_{Ab}\right) \end{array} \right., \end{array} $$
(2)
where the subscript and the superscript represent the two gametes passed to offspring, and P denotes the probability of the specific gamete combination. CLD can be regarded as a simplified form of phasing that facilitates analysis based on genotype data. The statistical properties of CLD have been well studied [16, 17]. One important fact is that CLD corresponds to the sample correlation coefficient \(\hat {r}\) of genotype values under the additive model,
$$ {{} \begin{aligned} \hat{r}_{genotype} \,=\, \frac{CLD}{\sqrt{p(1-p) + D_{A}}\sqrt{q(1-q) + D_{B}}} \approx \frac{CLD}{\sqrt{p(1-p)}\sqrt{q(1-q)}}, \end{aligned}} $$
(3)
where p=p(A), q=p(B), and D_A and D_B represent the Hardy-Weinberg disequilibrium coefficients, i.e., D_A = p_AA − p²(A) and D_B = p_BB − p²(B). D_A and D_B are nearly 0 in GWAS datasets after quality control.
A similar result holds for the original LD and allele values,
$$ \hat{r}_{allele} = \frac{LD}{\sqrt{p(1-p)}\sqrt{q(1-q)}}. $$
(4)
Therefore, CLD can also be viewed as an approximation of LD that uses the correlation coefficient of 0/1/2 genotype data under the additive model in place of that of 0/1 allele values, at the price of implicitly conducting phasing with equal probabilities for the two-allele combinations.
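Under this approximation, CLD can be recovered from the genotype correlation by rearranging (3). A minimal sketch, assuming Hardy-Weinberg equilibrium so that D_A and D_B are negligible (the function name is illustrative):

```python
import numpy as np

# Approximate CLD from 0/1/2 genotype vectors by rearranging Eq. (3),
# assuming Hardy-Weinberg equilibrium (D_A and D_B close to 0).
def cld_hat(gA, gB):
    gA, gB = np.asarray(gA, float), np.asarray(gB, float)
    r = np.corrcoef(gA, gB)[0, 1]          # genotype correlation r_hat
    # allele frequency at each SNP; p(1-p) is symmetric, so it does not
    # matter whether the counts refer to the major or the minor allele
    p, q = gA.mean() / 2, gB.mean() / 2
    return r * np.sqrt(p * (1 - p)) * np.sqrt(q * (1 - q))
```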
If two SNPs work synergistically to contribute to the same pathways, they are less likely to be separated during recombination and tend to be inherited together in the case group. As a result, the SNP-SNP LD pattern should differ between patients and healthy people. Therefore, checking the difference in LD patterns between cases and controls provides an alternative way to detect interaction. The LD contrast test was proposed to test this difference statistically [11]. The test statistic based on CLD has the following form:
$$ \chi^{2} = \frac{\left(\hat{CLD}_{AB}^{case}-\hat{CLD}_{AB}^{control}\right)^{2}} {Var\left(\hat{CLD}_{AB}^{case}\right)+Var\left(\hat{CLD}_{AB}^{control}\right)}, $$
(5)
which follows a 1-df χ2 distribution under the null hypothesis that there is no LD difference between cases and controls.
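The statistic in (5) can be sketched directly from genotype vectors, estimating CLD by the sample covariance and its variance by plug-in moments (the special case (U,V)=(X,Y) of the covariance formula derived in the next subsection); function names are our own illustration:

```python
import numpy as np

# LD contrast test of Eq. (5): CLD is estimated by the sample covariance
# of genotype values, and its variance by plug-in moments, with
# delta_2 = tau_2 = cov(X,Y)^2 and sigma_2 = var(X)var(Y).
def var_sample_cov(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xc, yc = x - x.mean(), y - y.mean()
    d4 = np.mean(xc**2 * yc**2)                  # plug-in delta_4
    d2 = t2 = np.cov(x, y, ddof=1)[0, 1] ** 2    # delta_2 = tau_2
    s2 = np.var(x, ddof=1) * np.var(y, ddof=1)   # sigma_2
    return (d4 - d2 + (s2 + t2) / (n - 1)) / n

def ld_contrast_chi2(gA_case, gB_case, gA_ctrl, gB_ctrl):
    d_case = np.cov(gA_case, gB_case, ddof=1)[0, 1]
    d_ctrl = np.cov(gA_ctrl, gB_ctrl, ddof=1)[0, 1]
    v = var_sample_cov(gA_case, gB_case) + var_sample_cov(gA_ctrl, gB_ctrl)
    return (d_case - d_ctrl) ** 2 / v            # ~ chi^2 with 1 df under H0
```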
Covariance between SNP interactions
The key issue in aggregating individual SNP-SNP interaction effects is correcting the inflated effect sizes caused by correlations among the individual test statistics. We leverage the fact that (composite) LD is estimated by the sample covariance of the genotype values of two SNPs to derive the correlation coefficients of the LD contrast test statistics.
Suppose two interacting SNP pairs (X,Y) and (U,V) have contrast LDs
$$ \left\{ \begin{array}{ll} \Delta LD_{XY} = \hat{cov}(X, Y | case) - \hat{cov}(X, Y | control) \\ \Delta LD_{UV} = \hat{cov}(U, V | case) - \hat{cov}(U, V | control) \end{array}\right.. $$
(6)
The corresponding LD contrast test statistics read:
$$ T_{XY}=\frac{\Delta \hat{LD_{XY}}}{\sqrt{Var(\Delta\hat{LD_{XY}})}}\ \text{and}\ \ T_{UV}=\frac{\Delta \hat{LD_{UV}}}{\sqrt{Var(\Delta\hat{LD_{UV}})}}. $$
(7)
The covariance of the two test statistics reads:
$$ {{} \begin{aligned} cov(T_{XY}, T_{UV}) \approx \frac{cov(\Delta\hat{LD_{XY}}, \Delta\hat{LD_{UV}})}{\sqrt{cov(\Delta\hat{LD_{XY}}, \Delta\hat{LD_{XY}})cov(\Delta\hat{LD_{UV}}, \Delta\hat{LD_{UV}})}}. \end{aligned}} $$
(8)
In GWAS, it is commonly assumed that samples are drawn independently from the population. Under this assumption, we can derive the following theorems.
Theorem 1.
The covariance of contrast LDs can be decomposed into components from cases and controls separately,
$$ {{} \begin{aligned} cov(\Delta LD_{XY}, \Delta LD_{UV}) = & cov[\hat{cov}(X,Y|case),\hat{cov}(U,V|case)] \\ &+ cov[\hat{cov}(X,Y|control),\hat{cov}(U,V|control)]. \end{aligned}} $$
(9)
Proof 1.
ΔLD is the difference of the two sample covariances in cases and controls. By the bilinearity of covariance, cov(ΔLD_XY, ΔLD_UV) can be decomposed into four covariances between pairs of sample covariances. Because individuals are assumed to be independent, the two cross terms, each involving one sample covariance from cases and one from controls, are 0. Therefore, Theorem 1 holds. □
Theorem 2.
The covariance of sample covariances reads
$$ cov\left[\hat{cov}(X, Y), \hat{cov}(U,V)\right]=\frac{1}{n}\left(\delta_{4}-\delta_{2}+\frac{\sigma_{2}+\tau_{2}}{n-1}\right), $$
(10)
where
$$\begin{array}{@{}rcl@{}} \delta_{4} &=& E\left[(X-EX)(Y-EY)(U-EU)(V-EV)\right], \\ \delta_{2} &=& cov(X,Y)cov(U,V), \\ \sigma_{2} &=& cov(X,U)cov(Y,V), \\ \tau_{2} &=& cov(X,V)cov(Y,U). \end{array} $$
Proof 2.
The covariance of sample covariances can be rewritten as
$$ {\begin{aligned} &cov\left[\hat{cov}(X, Y), \hat{cov}(U,V)\right]\\ &\quad= cov\left[{\textstyle \frac{1}{2n(n-1)}}\sum_{i=1}^{n}\sum_{j=1}^{n}(X_{i}-X_{j})(Y_{i}-Y_{j}), {\textstyle \frac{1}{2n(n-1)}}\right.\\& \quad\left.\sum_{i=1}^{n}\sum_{j=1}^{n}(U_{i}-U_{j})(V_{i}-V_{j})\right] \\ &\quad= {\textstyle \frac{1}{4n^{2}(n-1)^{2}}}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{l=1}^{n}\\ &\quad cov\left[(X_{i}-X_{j})(Y_{i}-Y_{j}),\ (U_{k}-U_{l})(V_{k}-V_{l})\right]. \end{aligned}} $$
(11)
We consider the following four conditions: (1) i=j or k=l; (2) i≠j, i≠k, i≠l, j≠k, j≠l, and k≠l; (3) i≠j and {i=k, j=l or i=l, j=k}; (4) i≠j, k≠l, and {i=k or i=l or j=k or j=l}. The basic covariance unit in (11) can be rewritten as
$$ {{} \begin{aligned} &cov\left\{\left[(X_{i}-EX)-(X_{j}-EX)\right]\left[(Y_{i}-EY)-(Y_{j}-EY)\right]\right., \\ &\left. \left[(U_{k}\,-\,EU)\,-\,(U_{l}\,-\,EU)\right]\left[(V_{k}-EV)-(V_{l}-EV)\right]\right\}. \end{aligned}} $$
(12)
There are 2n³−n², n(n−1)(n−2)(n−3), 2n(n−1), and 4n(n−1)(n−2) terms for the four conditions, respectively. We can further separate (12) into 16 components and calculate their values under each condition. The derivation is straightforward, and the conclusion follows. □
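The formula in (10) lends itself to a direct plug-in estimator that replaces the population moments with their sample analogues (the estimator of δ4 is justified by Theorem 3 below). A sketch with an illustrative function name:

```python
import numpy as np

# Plug-in estimate of Eq. (10): delta_4, delta_2, sigma_2 and tau_2 are
# replaced by their sample analogues (see Theorem 3 for delta_4).
def cov_of_sample_covs(x, y, u, v):
    x, y, u, v = (np.asarray(a, float) for a in (x, y, u, v))
    n = len(x)
    xc, yc = x - x.mean(), y - y.mean()
    uc, vc = u - u.mean(), v - v.mean()
    d4 = np.mean(xc * yc * uc * vc)              # delta_4
    cov = lambda a, b: np.cov(a, b, ddof=1)[0, 1]
    d2 = cov(x, y) * cov(u, v)                   # delta_2
    s2 = cov(x, u) * cov(y, v)                   # sigma_2
    t2 = cov(x, v) * cov(y, u)                   # tau_2
    return (d4 - d2 + (s2 + t2) / (n - 1)) / n
```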
Theorem 3.
The sample mean of (X−EX)(Y−EY)(U−EU)(V−EV) is an asymptotically unbiased estimator of δ4,
$$ {{} \begin{aligned} &E\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_{i}-\frac{1}{n}\sum_{j=1}^{n}X_{j}\right) \left(Y_{i}-\frac{1}{n}\sum_{j=1}^{n}Y_{j}\right)\right.\\ &\left. \left(U_{i}-\frac{1}{n}\sum_{j=1}^{n}U_{j}\right) \left(V_{i}-\frac{1}{n}\sum_{j=1}^{n}V_{j}\right)\right] \\ &= \left(1\,-\,\frac{4}{n}\,+\,\frac{6}{n^{2}}\,-\,\frac{3}{n^{3}}\right)\delta_{4} \,+\, \left[\frac{2(n-1)}{n^{2}}\,-\,\frac{3(n-1)}{n^{3}}\right]\left(\delta_{2}+\sigma_{2}\,+\,\tau_{2}\right) \xrightarrow{n\rightarrow\infty} \delta_{4}. \end{aligned}} $$
(13)
Proof 3.
Equation (13) can be rewritten as
$$ {\begin{aligned} &\sum_{i=1}^{n}E\left\{\left[(X_{i}-EX)-\left(\frac{1}{n}\sum_{j=1}^{n}X_{j}-EX\right)\right]\right. \\ &\quad \left[(Y_{i}-EY)-\left(\frac{1}{n}\sum_{j=1}^{n}Y_{j}-EY\right)\right] \\ &\quad \left[(U_{i}-EU)-\left(\frac{1}{n}\sum_{j=1}^{n}U_{j}-EU\right)\right]\\&\quad \left. \left[(V_{i}-EV)-\left(\frac{1}{n}\sum_{j=1}^{n}V_{j}-EV\right)\right]\right\}. \end{aligned}} $$
(14)
Again, (14) can be separated into 16 components, which are solvable under the independence assumption. The rest of the proof is omitted due to the page limit. □
By combining (8)-(13), the covariance of the LD contrast test statistics can be estimated. Note that the variance of the standardized LD contrast test statistic is approximately 1,
$$ {} Var(T_{XY})\,=\,Var\left[\frac{\Delta \hat{LD_{XY}}}{\sqrt{Var(\Delta\hat{LD_{XY}})}}\right] \approx \frac{Var(\Delta\hat{LD_{XY}})}{Var(\Delta\hat{LD_{XY}})}=1. $$
(15)
Therefore, the covariance of T_XY and T_UV reduces to the corresponding correlation coefficient,
$$ corr(T_{XY},T_{UV}) \approx cov(T_{XY}, T_{UV}). $$
(16)
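Putting Theorem 1, (8), and (16) together, the correlation between two LD contrast statistics can be estimated as below (a self-contained plug-in sketch of our own, not necessarily the authors' implementation; the helper reuses the plug-in form of (10)):

```python
import numpy as np

# Correlation of two LD contrast statistics via Eqs. (8)-(9) and (16):
# the numerator and both variances are decomposed into case and control
# components (Theorem 1), each estimated with the plug-in Eq. (10).
def _cov_scov(x, y, u, v):
    x, y, u, v = (np.asarray(a, float) for a in (x, y, u, v))
    n = len(x)
    xc, yc, uc, vc = (a - a.mean() for a in (x, y, u, v))
    d4 = np.mean(xc * yc * uc * vc)
    cov = lambda a, b: np.cov(a, b, ddof=1)[0, 1]
    d2 = cov(x, y) * cov(u, v)
    s2 = cov(x, u) * cov(y, v)
    t2 = cov(x, v) * cov(y, u)
    return (d4 - d2 + (s2 + t2) / (n - 1)) / n

def corr_T(case, ctrl):
    # case, ctrl: (X, Y, U, V) genotype vectors in cases resp. controls
    X, Y, U, V = case
    Xc, Yc, Uc, Vc = ctrl
    num = _cov_scov(X, Y, U, V) + _cov_scov(Xc, Yc, Uc, Vc)
    vxy = _cov_scov(X, Y, X, Y) + _cov_scov(Xc, Yc, Xc, Yc)
    vuv = _cov_scov(U, V, U, V) + _cov_scov(Uc, Vc, Uc, Vc)
    return num / np.sqrt(vxy * vuv)
```

When the two SNP pairs coincide, the estimate equals 1 by construction.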
The test statistic for region-based interactions
To aggregate the SNP-SNP interaction test statistics, a minimum-p-value-based method is adopted. In detail, we assume a multivariate normal distribution MVN(0,Σ) for the observed test statistics z_i, i=1,2,...,k_1k_2, where k_1 and k_2 are the numbers of SNPs in the two regions. The covariance matrix Σ is estimated using (8)-(13).
The region-based p-value is then defined as the probability of observing a value larger than the largest absolute SNP-SNP interaction test statistic under MVN(0,Σ). Denote the absolute value of the test statistic corresponding to the minimum p-value as T:
$$ T=\left|\Phi^{-1}\left(\frac{min(p_{i},\ i=1,2,..,k_{1}k_{2})}{2}\right)\right|. $$
(17)
Then the p-value for this region-region interaction reads,
$$ {\begin{aligned} p_{\text{region-region}} &= Pr\left[max(|z_{i}|,\ i=1,2,..,k_{1}k_{2})\right.\\ & \left.\geq T\ |\ z_{i}\sim MVN(0,\Sigma)\right] \\ &= 1 - Pr\left[max(|z_{i}|,\ i=1,2,..,k_{1}k_{2})\right.\\ & \left.< T\ |\ z_{i}\sim MVN(0,\Sigma)\right] \\ &= 1 - Pr\left[\{|z_{i}|<T,\ i=1,2,..,k_{1}k_{2}\} |\ z_{i}\right.\\ &\left.\sim MVN(0,\Sigma)\right]. \end{aligned}} $$
(18)
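The rectangle probability in (18) can be approximated by Monte Carlo sampling from MVN(0,Σ); since T in (17) equals the largest |z_i|, the observed statistics can be used directly. A sketch (the sampling scheme and function name are our own illustration, not necessarily the authors' implementation):

```python
import numpy as np

# Monte Carlo sketch of Eqs. (17)-(18): region-region p-value as the
# probability that max|z_i| under MVN(0, Sigma) exceeds the observed T.
def region_pvalue(z_obs, Sigma, n_draws=100_000, seed=0):
    rng = np.random.default_rng(seed)
    T = np.max(np.abs(z_obs))                    # T of Eq. (17)
    # small ridge keeps the Cholesky factorization numerically stable
    L = np.linalg.cholesky(Sigma + 1e-8 * np.eye(len(Sigma)))
    draws = rng.standard_normal((n_draws, len(Sigma))) @ L.T
    inside = np.all(np.abs(draws) < T, axis=1)   # rectangle event of (18)
    return 1.0 - inside.mean()
```

As a sanity check, for a single statistic z = 1.96 with Σ = I the estimate is close to the two-sided normal p-value of about 0.05.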
In this paper, we use the results of GBOOST [18], the GPU version of BOOST, to specify candidate regions. The regions could also be selected by examining potential pathogenic pathways or protein-protein interaction networks.