Modified entropy-based procedure detects gene-gene-interactions in unconventional genetic models.

BACKGROUND
Since it is assumed that genetic interactions play an important role in understanding the mechanisms of complex diseases, different statistical approaches have been suggested in recent years for this task. One interesting approach is the entropy-based IGENT method by Kwon et al. that promises an efficient detection of main effects and interaction effects simultaneously. However, a modification is required if the aim is to only detect interaction effects.


METHODS
Based on the IGENT method, we present a modification that leads to a conditional mutual information based approach under the condition of linkage equilibrium. The modified estimator is investigated in a comprehensive simulation based on five genetic interaction models and applied to real data from the genome-wide association study by the North American Rheumatoid Arthritis Consortium (NARAC).


RESULTS
The presented modification of IGENT controls the type I error in all simulated constellations. Furthermore, it provides high power for detecting pure interactions specifically on unconventional genetic models both in simulation and real data.


CONCLUSIONS
The proposed method uses the IGENT software, which is free available, simple and fast, and detects pure interactions on unconventional genetic models. Our results demonstrate that this modification is an attractive complement to established analysis methods.


Background
It is generally assumed that genetic interactions play an important role in understanding the mechanisms of complex diseases such as coronary heart disease, Alzheimer's disease, breast cancer or diabetes [1]. In the statistical sense, interaction refers to a situation in which the effect of one factor depends on the values of another factor on a given scale. In our case, genetic interactions come in two flavours. Firstly, interactions between genetic loci, usually termed gene-gene interactions or epistasis, occur, as described for rheumatoid arthritis (RA) [2][3][4]. Specifically, Liu et al. [2] described interactions between the locations DQA2 and DQB2 in the HLA region on chromosome 6. Secondly, gene-environment interactions denote interactions between genetic and environmental factors [5]. For example, Chandra et al. reported that the interaction between serum cholesterol levels and the sigma4 genotype [6] plays a role in Alzheimer's disease.
As a starting point, we will in this paper focus on the detection of gene-gene interactions in a case-control setting, although interactions of higher order can also be of interest. For simplification, our description is restricted to the situation where diallelic genetic markers such as single nucleotide polymorphisms (SNPs) are used leading to three possible genotypes. However, the results are generalizable to gene-environment-interactions with categorical environmental factors or indeed interactions of any categorical variables.
Generally, a single locus testing strategy is undertaken as the primary analysis in a genome-wide association study (GWAS), but this may be unsuitable to detect loci that interact with other variants since the relevant loci may not display effects on their own [7]. A variety of methods exists to detect or control for the presence of gene-gene interactions [8]. For a binary phenotype, most of them are based on the saturated logistic regression model for interaction [9,10] or a simplication of it. The regression parameters are penetrances, odds or log odds. As Cordell [8] notes, this procedure implicitly assumes that the scale used for the regression parameters is the scale of interest. The saturated model has nine two-locus genotypes that are modeled by one intercept, four main effects parameters and four interaction parameters with a dummy coding of the genotypes. Although the saturated model is the best fitting one, a model with fewer parameters might be preferable, e.g. because of greater stability. This can be achieved by assuming a specific genetic model and thus estimating, for example, the additive effect of the number of risk alleles at both genetic loci. In this case, only one parameter is estimated for the interaction effect. In this line, the standard software PLINK [11] provides an overall 4 degree-of-freedom (df ) test for interaction or a derived 1-df test assuming additive effects at both loci.
However, interaction models have been observed in reality that cannot easily be described by regression models for gene-gene-interactions without dummy coding. An example is given by Ziegler and König for sporadic breast cancer [10]. Therefore, the restriction to linear models is not optimal. As an alternative, a number of novel techniques have been developed in the last years that are based on different concepts such as rank building (ANOVA technique) [12], data-mining with the multifactor dimensionality reduction (MDR) approach [13,14], machine learning methods with random forests (RF) [14,15] or with support vector machines (SVM) [16]. Although promising, it is not always clear which interaction effects can be reliably detected by these methods [17].
Another new idea is borrowed from information theory, the entropy-based method. This concept is modelfree and measures the uncertainty or disorder in a system and could therefore lend itself to detect interactions for many genotype constellations. This technique is suggested as particularly powerful and, because of the nonlinearity, as better able to capture nonlinear relationships between genetic variants or other variables [18]. Ferrario et al. reviewed different entropy-based measures providing information on suggested test statistics, simulations and implementations [18]. Focusing on second order interaction, there are three important concepts, namely conditional mutual information, information gain, and relative information gain. These are based on the following definitions: First, entropy can be defined as a measure for uncertainty in a random variable [19]. Then, mutual information refers to the reduction of uncertainty of one variable conditional on the knowledge of the other variable [18]. Furthermore, mutual information can also be conditioned on a third variable yielding the conditional mutual information (CMI) [18], which has been used for a test statistic by Zuo et al. [20]. Second, the term information gain is defined in different ways: Fan et al. [21] subtract the mutual information of two genetic variants estimated in the cases from the same quantities estimated in the controls. Alternatively, in the method IGENT Kwon et al. [22] subtract the conditional entropy of the phenotype, given two genetic variants, from the entropy of the phenotype. Kwon et al. [22] also work with the socalled relative information gain, which is given as the relation between the information gain and the entropy of the phenotype.
The advantage of the approach by Kwon et al. lies in the freely available and fast implementation that is also called IGENT [18]. The software has been implemented in C++ and is available at http://statgen.snu.ac.kr/software/igent/ (different from the information in the paper from 2014 [22]). In contrast, Zuo et al. [20] work with an individual software.
One characteristic of entropy-based procedures is that main effects may also present themselves as a deviation from disorder, i.e. the entropy falls and the test reacts. This is an advantage if we search for main effects and interaction effects simultaneously and do not need to distinguish between either. This might be the case in a first step of an analysis in which the focus is on generally learning about variants having an effect on the outcome. Also, this offers the possibility to reduce the variants for a second computationally more intensive step. However, it should be noted that if we are interested in interaction effects only, a main effect of one or both genetic variants without an interaction will lead to a false positive result. Zuo et al. [20] state that the CMI concept achieves better or comparable control of the false positive error, compared to four previously proposed model-free metrics [20].
In the following, we are only interested in interaction effects, so it is necessary to eliminate main effects, without diminishing the advantages of the entropy approach. We therefore introduce a modification of IGENT that eliminates the problem of the increased type-I-error in the case of only main effects, but keeps the advantages of the entropy method as far as possible. We illustrate the behavior of the proposed procedure with data simulated for different genotypic models and apply it to the analysis of real data on the genetic background of RA [23,24]). The same data set was analyzed previously by Liu et al. [2], who utilized a regression-based approach combined with random forest analyses and Chattopadhyay et al. [3] who worked with three non-parametric scores. Furthermore, most comparisons of interaction methods have so far focused on assessing deviation from additive or multiplicative effects. However, we assume that the strongest advantage of entropy-based methods is seen in more unconventional interaction models not following classical genetic models, and we considered these unconventional interaction models in our simulations.

Entropy and IGENT-estimator
Entropy is originally a term from thermodynamics referring to the level of disorder or uncertainty. Information theory has utilized this phenomenon as a measure for the lack of structure in a system [19]. Shannon defines the entropy H of a set of probabilities p 1 , . . . , p n as − p i log p i .
In the context of a disease state D depending on the genotypes at two genetic loci, Kwon et al. [22] derive the Information Gain similarly as follows: First let entropy of the phenotype be written as with D 0 and D 1 denoting the unaffected and affected state, respectively.
The second order entropy (conditional entropy of the disease state on the genotypes G) is then given by the expression Here, for (i, j = 0, 1, 2), i and j define the genotype at the 1st and 2nd locus. Thus, all possible genetic models are considered.
From that, the Information Gain can be derived as The estimator (2nd order) then leads to For this, X ijk are the observations of G ij and D k in N individuals, leading to the meanX = 2 Furthermore we define the mutual information as Finally, the conditional mutual information is given by Considering two genetic loci with three genotypes each, we construct 3 × 3 contingency tables which tabulate the values in terms of penetrances or odds for the resulting 9 genotype combinations. In this scenario, the IGENT method estimates an imbalance of the 9 odds, i.e., the deviation for the value of 1 for all odds.
The estimator of IG(D|G) (IGENT-estimator) asymptotically and approximately follows a gamma distribution under the null hypothesis that genotype combinations and disease states are independent. This null hypothesis is also violated in the case of association with one or both of the genetic variants. Thus, the type-I-error is inflated if this estimator is used as a test for interactions only.

Modification of the iGENT-estimator
In the following, we want to utilize the IGENT approach while eliminating the influence of main effects so as to yield a purely interactive effect etimator. For this, we first estimate the main effect of one genetic variant by applying the IGENT approach to a 1st order calculation. Specifically, the Information Gain of 1st order for the first genotype is given by This can be estimated by withP i.k = 2 j=0 P ijk . The estimator (1st order) for the second genotype is given analogously.
This leads to the following intuitive modification of IGENT: All three components can be estimated within the IGENT software.
The estimator of IGmod is Here, the factorÊ ijk takes the value 1 if the two genotypes are conditionally independent, and the factorL ij. equals 1 under the condition of no correlation between the genetic variants, i.e., no linkage disequilibrium (LD).
Under this condition of no LD, we can simplify the formula for IGmod(D|G) to . This conversion shows that this estimator works with conditional mutual information. Specifically, it estimates the deviation from the conditional independence, and it follows asymptotically and approximately a gamma distribution with shape-parameter 4 and scale parameter 1 N ln (2) under the null hypothesis of conditional independence of the genetic variants [25].

Comparison with genoCMI
Recently, Zuo et al. [20] introduced an estimator called GenoCMI that was defined as follows: Under the condition of no LD, this is identical to our modified IGENT-estimator IGmod, except for the basis of the logarithm. GenoCMI follows asymptotically and approximately a χ 2 (ν)/2N distribution where the degree of freedom ν is 8. This statement is equivalent to the above result of a gamma distribution with shape parameter 4 and scale parameter 1 N ln (2) . An obvious disadvantage of GenoCMI is that there is no freely available software implementation, whereas IGENT is freely available and efficiently implemented.

Simulation models for gene-gene-interactions
The aim of our simulation study was to evaluate the performance of different estimators not only in commonly assumed interaction models but also in more unusual interaction settings. We therefore selected five interaction models, of which the first four models were proposed by Wan et al. [26], the fifth model was based on Ritchie et al. [27] (there model 4) and was generated using the epistasis model discovery method of Moore et al. [28]. Of note, the models display interaction effects but little or no main effects and can be written as 3 × 3 contingency tables of odds for the first interacting variant with genotypes aa, aA, and AA, and the second interacting variant with genotypes bb, bB, BB, where the minor alleles are denoted by capital letters (see Supplement). The specific values are each determined by a prevalence parameter α and a multiplicative interaction parameter θ and are shown in Table 1 and visualized in the accompanying Figs. 1, 2, 3, 4 and 5 (after conversion to the case-control scenario).
Three models (epistasis model, two allele interactionmodel and XOR model) display only two levels of risk, whereas the multiplicative and the no margin-model a more complicated risk pattern. Specifically, the no margin-model was selected because it exhibits interaction effects in the absence of any main effects. Furthermore, all marginal odds are equal for this special model.

Data simulation
We simulated data according to the five genotype interaction models using the software GAMETES [29,30]. Genotypes were generated according to Hardy-Weinberg proportions, and a range of allele frequencies and patterns of risk-genotype associations was chosen [31].
For the models according to Wan (multiplicative model, epistasis model, two allele interaction-model and XOR model), three different minor allele frequencies (MAFs) were chosen (0.1, 0.2, and 0.4) leading to twelve odds tables (see Supplement). Setting the prevalence to 0.1 and the heritability to 0.03 for the multiplicative model and to 0.02 for the other models, the prevalence parameter g and the interaction parameter t were determined. The heritability is defined as described by Wan et al. [26]: For utilization in GAMETES, these odds tables were converted to penetrance tables. The no margin-model is directly taken from Ritchie et al. [27,32]. With the MAF for the interacting variants set to 0.25 this model shows no marginal effects.
Each model was simulated with 100 variants, with 2 of them interacting, 1000 replicates (datasets), and 800 cases and 800 controls. The MAFs of the noninteracting SNPs were chosen randomly between 0.05 and 0.5.
For simulations under the null hypothesis, we used the four different interaction models according to Wan et al.   [26] with three different MAFs, resulting in 12 models that were simulated with 1000 replicates each. Given that for every model, 100 SNPs were simulated with two interacting SNPs, we had 4949 non-interacting SNP pairs per model. Thus, a total of more than 50 million SNP pairs without interactions were simulated. Simulated data was evaluated at significance thresholds of 5 × 10 −2 , 5 × 10 −3 , 5 × 10 −4 , 5 × 10 −5 , and 5 × 10 −6 .
In the resulting data sets, we estimated IGmod 0 as described above.

Comparison with other approaches
To compare our estimator with previous established approaches, we estimated logistic regression models testing for interaction in an additive genetic model with 1 df as implemented in the module epistasis of the software PLINK [11]. Furthermore, we performed likelihood ratio tests (LRT) comparing a model with 4 parameters (2 additive and 2 dominant terms) with the full model with 8 parameters [33]. Finally, another entropy-based approach with the test statistic T IG was used as described by Fan et al. [21].  Essentially, Fan et al. [21] subtract the mutual information of two genetic variants estimated in the cases from the same quantities estimated in the controls.

Submodel classification
To illustrate the underlying genetic models of the simulated data, we utilized one step of the model-based multifactor dimensionality reduction (MB-MDR) algorithm, which is an efficient algorithm to perform multiple testing in epistasis screening [34]. The procedure tabulates the frequencies of cases and controls in the 3 × 3 genotype combinations and uses a test for association between the trait and the specific genotype combination. The test is performed for every cell of the 3×3 contingency table and denotes the individuals with the genotype combination of the specific cell as having a high risk of being affected (H), or a low risk (L), or not sufficient evidence or information (O). The result is a 3 × 3 matrix, denoted as HLO-matrix. For illustration, these matrices are shown in Table 1 (right) under the assumption that sample size is large enough to yield sufficient evidence.

Real data
We use the data from the genome-wide association study by the North American Rheumatoid Arthritis Consortium (NARAC) that were also analyzed by Liu et al. [2]. The data set comprises genotype data of 2,062 individuals, 868 cases with RA and 1,194 controls, predominantly of Northern European origin. The data had been genotyped on the Illumina 550k platform. After exclusion of monomorphic SNPs and SNPs showing deviation from HWE at p < 0.0001, 515,680 SNPs were available for further analysis [24]. Quality control procedures included removing individuals who had a low overall call rate (< 95 %) of SNPs [23]. From these data we select the HLA-region on chromosome 6 encompassing 2010 SNPs. This area offers a large number of SNPs of which many are associated with RA, and previous analyses hinted at gene-gene interactions in this region. To reduce the number of SNPpairs to investigate, we further selected only SNPs overlapping with the 749 SNPs analyzed by Liu et al. [2]. Because of the assumption of no LD, we furthermore eliminated all SNP pairs with a LD of r 2 > 0.01.

Significance thresholds for the modified IGENT interaction evaluation
As described above, the estimator IGmod 0 asymptotically and approximately follows a gamma distribution with shape parameter 4 and scale parameter 1 N ln (2) under the null hypothesis of conditional independence of the genetic variants [25].
Because of the fact that • the characteristic of the underlying distribution is given only asymptotically and approximately, • the Bonferroni correction is very conservative, and • the test statistic is dependent on allele frequencies and marginal effects, fixing the shape parameter at 4 is partially very conservative. Thus, we utilize alternative cut-offs to identify relevant pairs of interacting SNPs. For the simulated data, we set the global significance level to α = 0.05 and apply a Bonferroni correction to adjust for the number of interactions being tested from the gamma distribution with shape parameter 2 (liberal criterion). For the real data, we also set the global significance level to α = 0.05 but apply a Bonferroni correction to adjust for the number of interactions being tested from the gamma distribution with shape parameter 4 (conservative criterion). Assuming the scale parameter of 1 N ln (2) , this leads to a cut-off at < 0.012836 (see below) for the simulated data and at < 0.017331 for the real data.
For the regression analysis we set the global significance level to α = 0.05 with Bonferroni correction (for the simulated data based on the number of SNP pairs), which leads to significance levels of 1 × 10 −5 for the simulated data and 5 × 10 −8 for the real data.

Evaluation of type I error
Firstly, the type I error is evaluated in the simulation data. For the null simulation we took about 59 million SNP pairs like described above. The global significance level of 5% is controlled using the liberal cut-off of 0.012836 (Table 2). This is due to the structure of the null simulation data with different MAFs, but only a small portion of SNPs with main effects. Table 2 shows that the type I error matches the different thresholds of 5 × 10 −2 , 5 × 10 −3 , 5 × 10 −4 , 5 × 10 −5 , and 5× 10 −6 under the gamma distribution with shape parameter 2 very well. Figure 6 shows the power to detect the interaction in the simulated data for all interaction models and MAF settings. There are four scenarios that display a limited power of < 80%, i.e., the multiplicative model with MAF=0.1, 0.2 and 0.4, and the two allele interaction-model with MAF=0.1. By comparison, the logistic regression model shows overall lower power to detect the interactions, with scenarios having power < 80% including the multiplicative model with MAF=0.4, the epistasis model with MAF=0.1, the two allele interaction-model as well as the XOR model with all MAFs. The no margin-model provides satisfying power for both estimators, the proposed estimator and logistic regression. Thus, specifically for the epistasis model at a low MAF and unconventional models at all MAFs, the proposed estimator IGmod 0 performs better than the classical logistic regression in detecting interactions. The LRT shows overall comparable power as IGmod 0 with better performance in the multiplicative model but lower power in the epistasis and XOR models with low MAF. Finally, the estimator T IG provides good results (power > 80%) in the epistasis and XOR models with higher MAFs and for the no margin-model, but the power is lower than that of the proposed estimator throughout.

Submodel classification
We observed an intermediate power of IGmod 0 depending on the specific setting and interaction model. Thus, to get a more detailed impression of the underlying interactions, we classified the frequently occuring identified interaction pairs using HLO-matrices as described above. Table 3 shows the frequencies of the more common submodels occuring in the simulation of the multiplicative Table 2 Type I error in simulated data at cut-off from gamma distribution with shape parameter 2 It can be seen that IGmod 0 provides high power for mostly complete submodels, in which almost every cell contains enough information to perform the association Frequency (freq) of specific submodels and power to detect the interaction for specific submodels. Submodels are described by HLO-matrices as illustrated in Table 1 test. In incomplete submodels, in which not all cells of the HLO-matrix show an effect or the cells contain not enough information to perform the association test, IGmod 0 shows more difficulty to detect the interaction.

Multiplicative model
In contrast, the regression model provides a power higher than 80% in all submodels.

Epistasis model
Analogously, Table 4 shows the respective submodel results for the epistasis model with MAF=0.1. Frequency (freq) of specific submodels and power to detect the interaction for specific submodels. Submodels are described by HLO-matrices as illustrated in Table 1 Notably, many results with incomplete submodels were obtained in which one or more cells of the HLO-matrix did not show evidence for association or did not contain enough information for association testing. For example, 41% of the relevant pairs led to HLO-matrices which show association only in four out of nine possible cells. However, IGmod 0 provides high power for all submodels, whereas the regression model has more difficulty detecting interactions in incomplete submodels.

Real data
In the analysis of the real data set, we obtain 211 relevant SNP pairs, of which 102 display a model similar to the multiplicative model. Further 90 relevant SNP pairs resemble the epistasis model, and 19 SNP pairs follow more unconventional models. Compared to the regression analysis, our estimator identifies 31 SNP pairs as relevant that are not detected by the regression approach, and these are shown in Tables 5 and 6.
Most of the interaction pairs essentially follow a multiplicative model (multi) or an incomplete multiplicative model (multi incompl). There are seven SNP pairs for which the model resembles the epistatic model (epi), and the remaining are either similar to the XOR model or are difficult to classify (other).   Table 6 Of note, 39 SNP pairs were detected by the regression approach at a significance level of 1 × 10 −9 but not by IGmod 0 . All of these pairs belong to the categories of multiplicative and epistatic models. They do achieve a relatively high test statistic value of more than 0.01, but do not meet our conservative significance criterion of 0.017331.
Comparing the SNP pairs identified by IGmod 0 with the results by Liu et al. [2], we find that all eight SNP pairs reported by Liu et al. [2] are also detected by our proposed estimator (Table 7). Notably, all of these eight pairs belong to either the multiplicative or the epistasis model and have relatively high MAFs (> 32%).
Finally, we compared our results with those by Chattopadhyay et al. [3]. Their 20 top ranked 2-way interactions within the HLA region contain SNPs with a low MAF, but they were excluded from our evaluation either due to an LD of more than 0.01 or because they were not contained in the set of 749 SNPs analyzed by Liu et al. [2].

Discussion
In this paper we proposed a modification of the IGENT estimator for second order genetic interactions, which exploits the advantages of the entropy methods, but controls the type I error. It is a tool designed specifically to detect various patterns conditional on a baseline model, and foremost to detect interactions. The estimator asymptotically and approximately follows a gamma distribution with a shape parameter that depends on the existence of margin effects. The calculation can be made within the IGENT software, which is simple to apply and very fast.
Our simulations show that the power of our method depends on marginal effects, and the lowest power was observed for the multiplicative model. Instead, the advantages of our modification lie in the possibility to detect interactions for unconventional interaction models (like two allele interaction-model and XOR model), and for incomplete models (e.g. like last HLO-matrix in Table 4), in which not all genotype combinations are observed with sufficient frequency. Thus, IGmod 0 outperforms all other approaches in the XOR and epistasis models especially at low MAFs and is better than all approaches except the LRT in the two allele model, where IGmod 0 and the LRT are comparable. Submodel of interaction as detailed in Table 6, p SNP 1 and p SNP 2 as p-values from 1st order calculation, linkage disequilibrium (LD) between SNPs, TS pair as value of the test statistic from 2nd order calculation, p-value as result from 2nd order calculation In the analysis of the real data on RA, we find the results of the simulation confirmed. We observe a similar number of SNP pairs detected by one but not the other approach. Again, our modification is more likely to uncover unconventional or incomplete models.
In general, our results confirm the conclusion of Zuo et al. [20] that GenoCMI measures achieve a control of the false positive error in the presence of main effects.

Conclusions
In conclusion, we proposed a modification of the IGENT method, which is a fast and efficient entropy-based interaction analysis algorithm [22]. The modification reduces the type I error, so it can easily identify second order genegene interactions on a genome-wide scale. The analysis of simulated and real data has shown that, in contrast to classical regression approaches, more unconventional interaction models can be detected with this approach, which makes it an attractive complement to established analysis methods.