IGENT: efficient entropy based algorithm for genome-wide gene-gene interaction analysis
BMC Medical Genomics volume 7, Article number: S6 (2014)
Abstract
Background
With the development of high-throughput genotyping and sequencing technology, there is growing evidence of associations between genetic variants and complex traits. Despite the thousands of genetic variants discovered, these markers have been shown to explain only a very small proportion of the underlying genetic variance of complex traits. Gene-gene interaction (GGI) analysis is expected to unveil a large portion of the unexplained heritability of complex traits.
Methods
In this work, we propose IGENT, an Information theory-based GEnome-wide gene-gene iNTeraction method. IGENT is an efficient algorithm for identifying genome-wide gene-gene interactions (GGI) and gene-environment interactions (GEI). To detect significant GGIs at genome-wide scale, it is important to reduce the computational burden significantly. Our method uses information gain (IG) and evaluates its significance without resampling.
Results
Through our simulation studies, the power of IGENT is shown to be better than or equivalent to that of BOOST. The proposed method successfully detected GGIs for bipolar disorder in the Wellcome Trust Case Control Consortium (WTCCC) data and for age-related macular degeneration (AMD).
Conclusions
The proposed method is implemented in C++ and is available for Windows, Linux and Mac OS X.
Background
Recently, genome-wide association studies (GWAS) have been successful in understanding biological mechanisms and elucidating pathways that underlie complex genetic diseases [1]. However, GWAS have been shown to explain only a small portion of the heritability of most complex diseases [2]. To find the 'missing heritability' of complex diseases and understand their genetic causes, gene-gene interaction (GGI) is expected to play an important role, because complex diseases are known to be controlled by multiple contributing genetic loci.
There are several statistical methods for detecting GGI [3]. One conventional method for characterizing interaction is regression analysis that includes main effects and the relevant interaction terms. However, higher-order interactions often cause the cell counts to be sparse, so that the parameter estimates may not be obtainable. To avoid the sparsity problem in higher-order interactions, data mining methods such as support vector machines (SVM) and random forests (RF) have been applied to find GGI. However, these methods can handle only a small number of variants because of their heavy computation [4, 5].
The multifactor dimensionality reduction (MDR) method proposed by Ritchie et al. [6] is a nonparametric method that avoids the sparsity problem by converting a high-dimensional multilocus model into a one-dimensional model. MDR evaluates classifiers, which are SNP combinations associated with the disease of interest, to predict and classify disease status through cross-validation and permutation testing. The k-fold cross-validation splits the data into k subsets. The classifier is built on (k − 1) subsets of the data and evaluated by calculating the test accuracy on the remaining subset. This process is repeated for each subset. In addition to cross-validation, the permutation test can assess the statistical significance of MDR classifiers. However, it is infeasible to use permutation tests for genome-wide interaction analysis because the permutation test is computationally intensive. To overcome this heavy computational burden, Pattin et al. proposed an efficient hypothesis test using the extreme value distribution (EVD) [7]. Their simulation results showed that the proposed testing method requires at least 20 permutations to achieve power similar to that of a 1000-fold permutation test.
Although MDR has a simple structure and fast computation, it is hard to find high-order interactions in large-scale datasets because of its exhaustive search scheme. For example, detecting 2nd-order interactions among 300,000 SNPs requires MDR to evaluate 4.5 × 10^{10} combinations. With 10-fold cross-validation or a 1000-fold permutation test, the analysis takes 10 times or 1000 times longer, respectively.
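The size of this exhaustive search can be checked with a few lines of arithmetic. The following Python sketch counts the SNP pairs for 300,000 SNPs and scales the count by the cross-validation and permutation factors mentioned above; the variable names are illustrative only.

```python
from math import comb

n_snps = 300_000

# Number of distinct SNP pairs examined by an exhaustive 2nd-order search.
n_pairs = comb(n_snps, 2)

# Each cross-validation fold or permutation repeats the full scan once.
evaluations_10fold_cv = 10 * n_pairs
evaluations_1000_permutations = 1000 * n_pairs

print(f"pairs:             {n_pairs:.2e}")
print(f"10-fold CV:        {evaluations_10fold_cv:.2e}")
print(f"1000 permutations: {evaluations_1000_permutations:.2e}")
```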
Wan et al. proposed BOOST, a fast method for detecting gene-gene interactions using Boolean operation-based screening and testing [8]. BOOST is computationally efficient and detects statistically significant interactions based on an approximated likelihood ratio statistic. Their simulation study showed that BOOST has higher statistical power than PLINK.
Recently, several approaches based on information theory for modelling GGI have been proposed [9–11]. Shannon founded information theory in 1948 by introducing entropy, a measure of complexity, in his mathematical theory of communication [12].
Dawy et al. [9] proposed a relevance-chain method to identify strongly associated lower-order interactions and build high-order interactions using conditional mutual information. This method can provide fast detection of high-order interactions, but it performs poorly for GGI with no strong marginal effects. Chanda et al. [10] proposed the k-way interaction information (KWII) metric and the total correlation information (TCI) for GGI identification. These entropy-based measures represent the amount of redundancy and dependency information between SNPs and an environmental variable. This method performs a permutation test to assess the statistical significance of the detected interaction models. Ruiz-Marín et al. [11] proposed an entropy-based test for single-locus association analysis. Although it was more powerful than the conventional Fisher tests, this method needs to be extended to handle GGI analysis. Yee et al. [13] proposed a modified entropy-based method to evaluate the interactions between single SNP combinations. Their method was shown to be superior to the MDR method in most simulation cases. However, applying this entropy-based method directly to genome-wide data would be infeasible because of its computationally intensive permutations.
In this paper, we develop a fast and efficient entropy-based method, named IGENT (Information theory-based GEnome-wide gene-gene iNTeraction method), to identify gene-gene interactions at genome-wide scale. IGENT supports two strategies for identifying disease-related gene-gene interactions. One is an exhaustive search approach for lower-order interactions such as 2nd-order interactions, and the other is a stepwise selection approach for higher-order interactions. With tens of thousands of SNPs from thousands of samples, it is difficult to calculate higher-order interactions exhaustively because the computational burden is too heavy; for these, IGENT provides the stepwise approach. The evaluation is based on the approximate gamma distribution of the information gain and does not require a permutation procedure, which allows us to overcome the computational burden of GGI analysis at genome-wide scale [14].
Methods
Information theory
To detect GGIs associated with phenotypes, our measure is based on basic concepts of information theory. The entropy, which measures the amount of uncertainty, is defined as
$$H(Y) = -\sum_{j} p(Y = y_{j}) \log_{2} p(Y = y_{j}),$$
where the entropy H(Y) of a discrete random variable Y is a function of the probability distribution p(Y = y_{j}) and measures the average amount of information contained in Y or, equivalently, the amount of uncertainty removed upon revealing the outcome of Y. The conditional entropy of Y given another discrete random variable X is
$$H(Y \mid X) = -\sum_{i} p(X = x_{i}) \sum_{j} p(Y = y_{j} \mid X = x_{i}) \log_{2} p(Y = y_{j} \mid X = x_{i}).$$
The information gain (IG) is defined as follows:
$$IG(Y; X) = H(Y) - H(Y \mid X).$$
IG, which is also called mutual information (MI), can be interpreted as the reduction in entropy (or uncertainty) of one random variable given another. For independent random variables X and Y, IG is known to follow approximately a gamma distribution with shape parameter a = (|X| − 1)(|Y| − 1)/2 and scale parameter b = 1/(N ln 2) [14]:
$$IG(Y; X) \sim \Gamma\left(\tfrac{1}{2}(|X| - 1)(|Y| - 1),\ \frac{1}{N \ln 2}\right), \qquad (1)$$
where N is the sample size and |X| and |Y| denote the number of levels of the random variables X and Y.
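As a minimal sketch of these definitions, the following Python code computes the entropy, the conditional entropy and the IG from a contingency table of counts. Probabilities are replaced by observed frequencies and logarithms are taken to base 2 (an assumption consistent with the ln 2 factor in the scale parameter); the function names and the example table are illustrative and are not taken from the IGENT source code.

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def conditional_entropy(table):
    """H(Y|X) for a table where table[i][j] counts (X = x_i, Y = y_j)."""
    n = sum(sum(row) for row in table)
    return sum((sum(row) / n) * entropy(row) for row in table if sum(row) > 0)

def information_gain(table):
    """IG(Y; X) = H(Y) - H(Y|X)."""
    column_totals = [sum(column) for column in zip(*table)]
    return entropy(column_totals) - conditional_entropy(table)

# Hypothetical 2 x 2 table: rows are levels of X, columns are levels of Y.
example = [[30, 10],
           [10, 30]]
print(information_gain(example))
```

A table in which Y is fully determined by X gives IG equal to H(Y), and a table with identical row proportions gives IG equal to zero.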
Entropy-based gene-gene interaction analysis
We use the information gain to detect GGIs associated with a phenotype. Given a case-control study with n individuals, let Y be the disease status and X be the SNP combination; then the information gain of the SNP combination with respect to disease status is given as
$$IG(Y; X) = H(Y) - H(Y \mid X),$$
with the probabilities estimated by the observed frequencies among the n individuals. The value of IG represents the strength of the association. Since, under the null hypothesis of no association, IG approximately follows the gamma distribution in (1), we can assess the statistical significance of the association between SNP combinations and disease.
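The significance test can be sketched in Python as follows. The sketch assumes that the shape parameter of the null gamma distribution is half the degrees of freedom (|X| − 1)(|Y| − 1), which makes the test agree with the classical likelihood-ratio (G) test, since 2N ln 2 · IG is then chi-square distributed. For integer degrees of freedom the gamma tail probability has a closed form, so only the standard library is needed. The function names and the example numbers are hypothetical.

```python
from math import erfc, exp, gamma, log, sqrt

def gamma_upper_tail(x, dof):
    """P(G > x) for G ~ Gamma(shape=dof/2, scale=1), with integer dof >= 1."""
    if x <= 0:
        return 1.0
    if dof % 2 == 0:
        # Integer shape m = dof/2: Q(m, x) = exp(-x) * sum_{k<m} x^k / k!
        term = total = 1.0
        for k in range(1, dof // 2):
            term *= x / k
            total += term
        return exp(-x) * total
    # Half-integer shape: start from Q(1/2, x) = erfc(sqrt(x)) and apply
    # Q(a + 1, x) = Q(a, x) + x^a * exp(-x) / Gamma(a + 1).
    total = erfc(sqrt(x))
    for k in range(dof // 2):
        total += exp(-x) * x ** (k + 0.5) / gamma(k + 1.5)
    return total

def ig_pvalue(ig, n, levels_x, levels_y=2):
    """Approximate p-value of an observed IG (in bits) from n individuals."""
    dof = (levels_x - 1) * (levels_y - 1)
    # Dividing IG by the scale b = 1/(N ln 2) gives a unit-scale gamma variate.
    return gamma_upper_tail(ig * n * log(2), dof)

# Hypothetical SNP pair: 9 genotype combinations, 2000 individuals, IG = 0.02 bits.
p = ig_pvalue(ig=0.02, n=2000, levels_x=9)
print(p)
```

Because no resampling is involved, the cost of each test is dominated by building the contingency table rather than by the p-value calculation.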
Exhaustive searching approach and stepwise selection approach
We propose IGENT, an entropy-based gene-gene interaction method for genome-wide interaction analysis. IGENT supports exhaustive search (IGENT_exhaust) for lower-order interactions and stepwise search (IGENT_stepwise) for higher-order interactions. Figure 1 illustrates the exhaustive search and stepwise selection approaches.
IGENT_exhaust performs an exhaustive search over all possible combinations of variants for the given low order. IGENT_stepwise selects higher-order interactions in a stepwise manner. The detailed steps are summarized as follows.
1. Initial step: for all SNPs, calculate the 1^{st}-order IG^{k}, where k is the order (for the 1^{st} order, k = 1).
2. Select the SNPs or SNP combinations with p^{k} < t, where p^{k} is the p-value from the hypothesis test using the gamma distribution and t is the significance threshold.
3. Calculate IG^{k+1} for the (k+1)-order interactions formed by adding one additional SNP to each selected SNP or SNP combination.
4. If there are significant interactions of order k+1, set k = k + 1 and repeat steps 2 to 4. Otherwise, stop the forward addition and repeat steps 2 to 4 with the next-ranked combination.
This IGENT_stepwise selection approach reduces the search space dramatically. With large genome-wide data, this approach makes it feasible to discover higher-order interactions. Although the stepwise algorithm is not guaranteed to find the globally optimal interaction model, it provides at least a locally optimal interaction model with some marginal effects. Consequently, the stepwise approach is limited in detecting gene-gene interactions without any marginal effects.
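The stepwise selection can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the IGENT C++ implementation: all retained combinations of one order are extended together instead of being followed one ranked combination at a time, the gamma shape is taken as half the degrees of freedom, and the names `stepwise`, `threshold` and `max_order` as well as the toy data are hypothetical.

```python
from collections import Counter
from itertools import product
from math import erfc, exp, gamma, log, log2, sqrt

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gamma_upper_tail(x, dof):
    """P(G > x) for G ~ Gamma(shape=dof/2, scale=1), integer dof."""
    if x <= 0 or dof == 0:
        return 1.0
    if dof % 2 == 0:
        term = total = 1.0
        for k in range(1, dof // 2):
            term *= x / k
            total += term
        return exp(-x) * total
    total = erfc(sqrt(x))
    for k in range(dof // 2):
        total += exp(-x) * x ** (k + 0.5) / gamma(k + 1.5)
    return total

def pvalue(xs, ys):
    """Gamma-approximation p-value of IG(Y; X) from paired observations."""
    n = len(ys)
    x_counts, y_counts, joint = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    h_y_given_x = sum(
        x_counts[x] / n * entropy([joint[(x, y)] for y in y_counts])
        for x in x_counts
    )
    ig = max(entropy(y_counts.values()) - h_y_given_x, 0.0)
    dof = (len(x_counts) - 1) * (len(y_counts) - 1)
    return gamma_upper_tail(ig * n * log(2), dof)

def stepwise(snps, status, threshold=0.05, max_order=3):
    def combo_pvalue(combo):
        # A combination is one variable whose levels are genotype tuples.
        xs = list(zip(*(snps[i] for i in combo)))
        return pvalue(xs, status)

    # Steps 1-2: keep single SNPs whose p-value is below the threshold.
    current = [(i,) for i in range(len(snps)) if combo_pvalue((i,)) < threshold]
    selected = list(current)
    # Steps 3-4: extend the retained combinations by one SNP at a time.
    for _ in range(2, max_order + 1):
        candidates = {
            tuple(sorted(combo + (j,)))
            for combo in current
            for j in range(len(snps))
            if j not in combo
        }
        current = [c for c in sorted(candidates) if combo_pvalue(c) < threshold]
        if not current:
            break
        selected.extend(current)
    return selected

# Toy data: disease requires a minor allele at SNP 0 and SNP 1; SNP 2 is noise.
rows = [g for g in product(range(3), repeat=3) for _ in range(4)]
snps = [list(column) for column in zip(*rows)]
status = [int(g0 >= 1 and g1 >= 1) for g0, g1, _ in rows]
result = stepwise(snps, status)
print(result)
```

On this toy data the uninformative SNP is dropped at the first order, so no search starts from it. Because the test measures the association of the whole combination with disease status, combinations that contain a SNP with a strong marginal effect can also pass the threshold in this sketch.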
Implementation
Our method is implemented in C++ and runs on Windows, Linux and Mac OS X. The program supports both exhaustive search and stepwise search.
Simulation studies
The main purpose of our method is to identify epistatic interactions from genome-wide data. To detect gene-gene interactions in genome-wide data, computational efficiency is a key issue. In simulation 1, we compared the computational efficiency of IGENT with that of other methods such as BOOST, MDR, RF and SVM. Among these methods, only IGENT and BOOST were shown to be feasible for gene-gene interaction analysis at genome-wide scale, as shown for simulation 1 in the Results section. Thus, we mainly compared IGENT and BOOST with regard to the power of identifying causal gene-gene interactions in simulations 2, 3, and 4. In simulation 5, we compared IGENT_exhaust and IGENT_stepwise.
For these simulation studies, we use the following three epistatic model sets:
1) Epistatic model set 1: eight interaction models
Models 1-1, 1-2, and 1-3 have different strengths of genetic effects with a fixed interaction structure, minor allele frequencies (MAF) and prevalence, and have been used by Namkung et al. [15]. Models 1-4, 1-5, and 1-6 have different interaction structures and penetrance functions, which were used by Ritchie et al. [16]. Models 1-7 and 1-8 were used by Bush et al. [17]. The eight interaction models are summarized in Additional file 1.
2) Epistatic model set 2: four interaction models with main effects
Model 2-1 is a multiplicative model. Model 2-2 is an epistasis model that has been used to describe handedness and the colour of swine. Model 2-3 is a classical epistasis model. Model 2-4 is the XOR model. The details of these four models have been described by Wan et al. [8].
3) Epistatic model set 3: seventy interaction models without main effects
Seventy disease models without main effects have been proposed by Velez et al. [18]. These 70 epistatic models are distributed across six heritability values (0.01, 0.025, 0.05, 0.1, 0.2, and 0.4) and two different MAFs (0.2 and 0.4).
Using these epistatic model sets, we conduct the following five simulation studies.
Simulation 1: comparing computational efficiency for genome-wide gene-gene interaction analysis
To compare the computational efficiency of IGENT, BOOST, MDR, SVM and RF, we construct simulation data using epistatic model set 1. Each epistatic model contains 2000 individuals balanced between cases and controls. Various numbers of SNPs (50, 100, 500, 1K, 2K, 5K, 10K, 100K, 350K, and 500K) are considered. All analyses are carried out on a single core of a 3.16 GHz CPU with 4 GB of memory on Linux.
Simulation 2: estimating type I error in null simulation
To assess the type I error, we construct 1000 replicates of null simulation data with 1000 SNPs and 1000 individuals based on epistatic model set 1. In these null simulation data, no SNP is associated with disease status. Using the null simulation, we compare the false positive rates of IGENT and BOOST.
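A null data set of this kind could be generated as in the following Python sketch. The text does not state how the genotypes are sampled, so the sketch assumes Hardy-Weinberg proportions at a single hypothetical MAF and a balanced case-control design; `simulate_null` and its parameters are illustrative names.

```python
import random

def simulate_null(n_individuals=1000, n_snps=1000, maf=0.2, seed=0):
    """Return (genotypes, status) with no SNP associated with disease status."""
    rng = random.Random(seed)
    # Balanced case (1) / control (0) labels, shuffled so that they carry
    # no information about any genotype.
    status = [i % 2 for i in range(n_individuals)]
    rng.shuffle(status)
    # Genotype = number of minor alleles, drawn under Hardy-Weinberg proportions.
    weights = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    genotypes = [
        rng.choices([0, 1, 2], weights=weights, k=n_individuals)
        for _ in range(n_snps)
    ]
    return genotypes, status

# Small example; the simulation described above uses 1000 SNPs and 1000 individuals.
genotypes, status = simulate_null(n_individuals=100, n_snps=20)
print(len(genotypes), len(status), sum(status))
```

Repeating the generation with different seeds and recording how often a test falls below the nominal level gives an empirical type I error estimate.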
Simulation 3: comparing the power of gene-gene interaction with main effects
To compare the power of IGENT and BOOST for gene-gene interactions with main effects, we use epistatic model set 2. The MAFs of the disease-associated SNPs are set to 0.1, 0.2, and 0.4. Each data set has 1000 SNPs from 800 or 1600 individuals. We generate 100 replicate data sets under each setting.
Simulation 4: comparing the power of gene-gene interaction without main effects
To evaluate the detection of causal gene-gene interactions with no marginal effects, we use epistatic model set 3. Using the 70 epistasis models in the set, we generate 100 replicate data sets with 1000 SNPs (one pair is the causal interaction; the others are non-causal SNPs) for each of four sample sizes (200, 400, 800, and 1600 individuals).
Simulation 5: comparing the efficiency of stepwise search approach
To assess the efficiency of IGENT_stepwise, we use epistatic model set 1. We generate 100 replicate data sets with 50 SNPs from 400 individuals. Through this simulation, we compare the power and computational efficiency of IGENT_stepwise and IGENT_exhaust.
Genome-wide data
Bipolar disorder (BD) data analysis
Using the bipolar disorder data from the Wellcome Trust Case Control Consortium (WTCCC) [19], we performed genome-wide gene-gene interaction analysis for 2^{nd}-order and higher-order interactions. SNPs with call rates < 95% were excluded from the analysis. SNPs with a Hardy-Weinberg equilibrium (HWE) p-value < 5.7 × 10^{-7} were filtered out. Of the remaining SNPs, only those with a MAF of at least 5% were carried forward for further analysis. All quality control steps were conducted using PLINK version 1.07 [20] and R scripts. We performed imputation using fastPHASE version 1.2 [21] to increase the density of interrogated SNPs. After the quality control and imputation process, the WTCCC-BD dataset contained 354,022 SNPs and 4,806 samples.
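The three quality control filters can be expressed as a single predicate. The sketch below is illustrative Python, not the PLINK commands or R scripts used in the analysis; the `SnpStats` record, its field names and the example values are hypothetical, and the HWE threshold is taken to be 5.7 × 10^{-7}.

```python
from dataclasses import dataclass

@dataclass
class SnpStats:
    snp_id: str
    call_rate: float   # fraction of individuals with a genotype call
    hwe_p: float       # p-value of the Hardy-Weinberg equilibrium test
    maf: float         # minor allele frequency

def passes_qc(snp, min_call_rate=0.95, min_hwe_p=5.7e-7, min_maf=0.05):
    """Keep a SNP only if it survives all three filters."""
    return (
        snp.call_rate >= min_call_rate
        and snp.hwe_p >= min_hwe_p
        and snp.maf >= min_maf
    )

# Hypothetical per-SNP summaries.
candidates = [
    SnpStats("snp_a", call_rate=0.99, hwe_p=0.40, maf=0.21),
    SnpStats("snp_b", call_rate=0.93, hwe_p=0.55, maf=0.30),   # low call rate
    SnpStats("snp_c", call_rate=0.98, hwe_p=1e-9, maf=0.12),   # HWE failure
    SnpStats("snp_d", call_rate=0.99, hwe_p=0.70, maf=0.02),   # rare variant
]
kept = [s.snp_id for s in candidates if passes_qc(s)]
print(kept)
```

Changing `min_maf` to 0.01 gives the variant of the filter described for the AMD data.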
IGENT was applied to the WTCCC-BD data for exhaustive two-way interaction analysis of 6.27 × 10^{10} SNP pairs and for stepwise selection of higher-order interactions.
Age-related macular degeneration (AMD) data analysis
For a second real data application, we used the AMD data set, which contains 116,209 SNPs genotyped in 96 cases and 50 controls from the Age-Related Eye Disease Study (AREDS) [22]. We conducted the same quality control process as in the BD data analysis, except that the MAF threshold was 0.01 (SNPs with MAF < 0.01 were excluded). All quality control steps were conducted using PLINK version 1.07 [20] and R scripts. After quality control, 102,504 SNPs from 146 individuals remained. Pairwise interaction analysis of all 5,253,483,756 pairs was conducted with IGENT_exhaust and BOOST. In addition, IGENT_stepwise was performed for higher-order interactions.
Results
Simulation results
In this section, we report simulation studies that evaluate the properties of IGENT and compare it with previously proposed methods. To detect gene-gene interactions in genome-wide data, computational efficiency is a key issue. In simulation 1, we compared the computational efficiency of IGENT with that of other methods such as BOOST, MDR, RF, and SVM. Among these methods, only IGENT and BOOST were shown to be feasible for gene-gene interaction analysis at genome-wide scale. We therefore mainly compared IGENT and BOOST with regard to the power of identifying causal gene-gene interactions in simulations 2, 3, and 4. In simulation 5, we compared IGENT_stepwise and IGENT_exhaust.
Simulation 1: comparing computational efficiency for genome-wide gene-gene interaction analysis
To compare the computational efficiency of IGENT with that of other methods including BOOST, MDR, RF, and SVM, we conducted 2^{nd}-order interaction analysis with various numbers of SNPs (50 to 500K). We used the LIBSVM library [23] and the "randomForest" R package [24] for the SVM and RF methods, respectively. All methods used an exhaustive search strategy for a fair comparison.
Table 1 presents the computation times required by each method to finish the 2^{nd}-order interaction analysis. For the simulation data with 350K SNPs, IGENT_exhaust and BOOST finish the interaction analysis in about 2.17 days and 1.8 days, respectively. However, because of their heavy computation times, MDR, RF, and SVM are not feasible for gene-gene interaction analysis of genome-wide datasets. To focus on genome-wide interaction analysis, we thus compare the power of IGENT and BOOST in simulations 2, 3, and 4.
Simulation 2: estimating type I error in null simulation
The type I error rates of IGENT_exhaust and BOOST are shown in Table 2. Although the type I error rates of IGENT_exhaust and BOOST appear slightly higher than the nominal value, the nominal value lies within their confidence intervals, so the type I errors of both methods agree with the nominal level.
Simulation 3: comparing the power of gene-gene interaction with main effects
In simulation 3, we compared IGENT_exhaust, IGENT_stepwise, and BOOST for detecting causal gene-gene interactions with main effects. IGENT was run in both exhaustive and stepwise modes, and BOOST was run in exhaustive mode, to search for 2^{nd}-order interactions. The power is calculated as the proportion of the 100 data sets in which the interaction of the disease-associated SNPs is detected. In all simulation data, an interaction was counted as detected when its p-value after Bonferroni correction for multiple comparisons was < 0.05. In stepwise mode, only variants with a marginal p-value < 0.05 proceeded to the next step for calculating 2^{nd}-order interactions. IGENT_exhaust showed the best power in most models, the exception being Model 2-4 (Figure 2). The performance of BOOST became worse in the simulation models with low minor allele frequency (MAF 0.1 and 0.2). The average power of IGENT_stepwise was about 60% of that of IGENT_exhaust, but its computing time was less than 1% (only 0.43%) of that of IGENT_exhaust.
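The detection criterion and the power calculation can be illustrated with a short Python sketch. It assumes that the Bonferroni correction counts every SNP pair in a data set as one test, which the text does not state explicitly; the function names and the example p-values are hypothetical.

```python
from math import comb

def bonferroni_threshold(n_snps, order=2, alpha=0.05):
    """Raw p-value cut-off equivalent to a Bonferroni-adjusted p-value < alpha."""
    return alpha / comb(n_snps, order)

def power(causal_pvalues, threshold):
    """Fraction of replicate data sets in which the causal interaction is detected."""
    return sum(p < threshold for p in causal_pvalues) / len(causal_pvalues)

# 1000 SNPs give 499,500 pairwise tests per data set.
threshold = bonferroni_threshold(1000)

# Hypothetical raw p-values of the causal pair in four replicate data sets.
example_pvalues = [1e-9, 3e-8, 2e-6, 0.4]
print(threshold, power(example_pvalues, threshold))
```

In the simulations described above, the list would hold one p-value per replicate, that is, 100 values per setting.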
Simulation 4: comparing the power of gene-gene interaction without main effects
In simulation 4, in which the causal gene-gene interactions have no main effects, IGENT_exhaust performed better than or equivalently to BOOST in most simulation models. In the simulation models with lower MAF and small sample sizes, BOOST showed poor performance. However, the two methods provided equivalent results for models with a MAF of 0.4 or large sample sizes (Figure 3).
Simulation 5: comparing the efficiency of stepwise analysis and exhaustive analysis of IGENT
We evaluated the performance of IGENT_stepwise in simulation 5 based on epistatic model set 1. All models were designed with 2^{nd}-order interaction effects and no marginal effects. Although these simulation models do not include interaction effects above the 2^{nd} order, spurious higher-order interactions may still show large effects on the phenotype. To allow for such spurious higher-order interactions, we exhaustively identified interactions from the 1^{st} to the 4^{th} order. By comparing the interactions identified by IGENT_exhaust with those identified by IGENT_stepwise, we were able to evaluate the performance of IGENT_stepwise.
Table 3 shows that IGENT_stepwise achieves 66 to 93% of the power of IGENT_exhaust while using only 12 to 36% of its computation. For genome-wide interaction analysis, IGENT_stepwise can therefore perform high-order interaction analysis very efficiently.
Analysis of real data: WTCCC bipolar disorder (BD) data
We conducted genome-wide two-way and higher-order interaction analyses with the WTCCC-BD dataset [19]. IGENT_exhaust evaluated all two-way interaction pairs (6.27 × 10^{10}) in about 74 hours on a 3.16 GHz CPU with 4 GB of memory on Linux. IGENT_stepwise took about 1.5 hours for the higher-order interactions on the same system. In the exhaustive two-way analysis, IGENT_exhaust reported 39 significant interactions. Of these 39 interactions, 26 pairs were also reported by IGENT_stepwise. Among the hub genes of these interactions, LOC390730, DPP10, and CDC25B have been reported to have strong marginal effects in a previous study [19] (Table 4). B2GALT5, PI15, TLE4, AKAP10, and CHST2 did not show significant associations in single-locus analysis but showed strong interactions. These genes have been reported as causal genes associated with bipolar disorder in other studies [25–30].
Figure 4 shows the interaction network of WTCCC-BD constructed from the two-way interaction analysis by IGENT. In the two-way interaction network, a node represents a gene containing one or more SNPs, and an edge represents an interaction reported by the IGENT analysis. Node size shows the degree of the node, and edge width shows the number of SNP-SNP interactions. All significant interactions were annotated using the HuGE Navigator database [31] and the GWAS catalog [32]. This network graph represents the two-way interactions of genome-wide association with bipolar disorder and facilitates biological interpretation.
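The node and edge attributes of such a network can be derived from a list of significant SNP pairs, as in the following Python sketch. The input format, the function name `build_network` and the placeholder identifiers are hypothetical, and the sketch assumes that pairs of SNPs within the same gene are not drawn as edges.

```python
from collections import Counter

def build_network(snp_pairs):
    """snp_pairs: iterable of ((snp_a, gene_a), (snp_b, gene_b)) tuples.

    Returns (degree, edge_width): the number of distinct partner genes per
    gene, and the number of SNP-SNP interactions per gene pair.
    """
    edge_width = Counter()
    for (_, gene_a), (_, gene_b) in snp_pairs:
        if gene_a != gene_b:
            edge_width[tuple(sorted((gene_a, gene_b)))] += 1
    degree = Counter()
    for gene_a, gene_b in edge_width:
        degree[gene_a] += 1
        degree[gene_b] += 1
    return degree, edge_width

# Placeholder identifiers, not results from the analysis.
significant_pairs = [
    (("rsA1", "GENE_A"), ("rsB1", "GENE_B")),
    (("rsA2", "GENE_A"), ("rsB1", "GENE_B")),
    (("rsA1", "GENE_A"), ("rsC1", "GENE_C")),
]
degree, edge_width = build_network(significant_pairs)
print(dict(degree), dict(edge_width))
```

Genes with a high degree in this representation correspond to the hub genes of the network.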
Analysis of real data: AMD data
We conducted 2^{nd}-order and high-order interaction analyses of the AMD data using IGENT and BOOST. Table 5 shows the top 5 interactions or SNPs identified by IGENT. In the AMD data, two SNPs, rs380390 (CFH) and rs1329428 (CFH), have strong marginal effects. These SNPs were previously reported to be strongly associated with AMD [22]. IGENT also detected two interactions, CFH (rs380390) - SGCD (rs931798) and CFH (rs1329428) - MED27 (rs9328536). Each of these two interactions includes a SNP with a strong marginal effect.
Discussion
In this paper, we proposed a fast method for searching for high-order interactions associated with complex diseases. IGENT uses information gain, which represents the strength of association between a GGI and the phenotype without assuming a specific genetic model. The IG measure can be used to compare association strength across different orders of interaction. IGENT adopts an exhaustive search scheme that investigates all possible interactions for lower-order interactions and a stepwise search scheme for higher-order interactions. The permutation and exhaustive search schemes of previous GGI methods are computationally too intensive to be employed for high-order interactions in large genome-wide data sets.
Note that IGENT is about as fast as BOOST and shows power better than or equivalent to that of BOOST. BOOST is known to have the limitation that the degrees of freedom of the statistical test must be reduced when the contingency table is too sparse because of low MAF [8]. IGENT, however, shows stable performance across various epistasis models even with low MAF.
To evaluate the significance of IGENT's results, we used a hypothesis testing framework based on the gamma approximation. IG is known to follow approximately a gamma distribution under the null hypothesis. Using this approximation instead of permutation, we can easily identify statistically significant interactions and reduce the computation time remarkably. A stepwise approach is more efficient than an exhaustive approach in terms of computation. However, the stepwise approach involves a trade-off between computational efficiency and detection of the optimal gene-gene interactions. Our stepwise approach, IGENT_stepwise, greatly reduces the search space for detecting GGI with marginal effects. Although GGI without marginal effects can be generated mathematically [33–35], it is still unclear in practice how a GGI model without marginal effects is biologically associated with a complex disease [3].
In the exhaustive search scheme, our simulation results showed that IGENT_exhaust performed better than or equivalently to BOOST in most models, as shown in Figures 2 and 3. Although both BOOST and IGENT were computationally fast and efficient, IGENT showed power higher than or equivalent to that of BOOST.
Conclusions
In conclusion, we proposed a fast and efficient entropy-based GGI analysis method. Owing to its fast and efficient computation scheme, it can identify gene-gene interactions at genome-wide scale. In real GWAS data analyses, IGENT successfully identified low-order and high-order interactions. IGENT has been implemented in C++ and is available at http://bibs.snu.ac.kr/software/igent.
Abbreviations
IGENT: Interactions analysis method in Genome-wide scale based on ENTropy
WTCCC: the Wellcome Trust Case Control Consortium
BD: bipolar disorder
SVM: support vector machine
RF: random forest
GGI: gene-gene interaction
IG: information gain
References
1. Seng KC, Seng CK: The success of the genome-wide association approach: a brief story of a long struggle. Eur J Hum Genet. 2008, 16: 554-64. 10.1038/ejhg.2008.12.
2. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH: Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010, 11: 446-50. 10.1038/nrg2809.
3. Cordell HJ: Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009, 10: 392-404. 10.1038/nrg2579.
4. Chen SH, Sun J, Dimitrov L, Turner AR, Adams TS, Meyers DA, Chang BL, Zheng SL, Grönberg H, Xu J, Hsu FC: A support vector machine approach for detecting gene-gene interaction. Genet Epidemiol. 2008, 32: 152-67. 10.1002/gepi.20272.
5. Winham SJ, Colby CL, Freimuth RR, Wang X, de Andrade M, Huebner M, Biernacka JM: SNP interaction detection with random forests in high-dimensional genetic data. BMC Bioinformatics. 2012, 13: 164. 10.1186/1471-2105-13-164.
6. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69: 138-47. 10.1086/321276.
7. Pattin KA, White BC, Barney N, Gui J, Nelson HH, Kelsey KT, Andrew AS, Karagas MR, Moore JH: A computationally efficient hypothesis testing method for epistasis analysis using multifactor dimensionality reduction. Genet Epidemiol. 2009, 33: 87-94. 10.1002/gepi.20360.
8. Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NL, Yu W: BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am J Hum Genet. 2010, 87: 325-40. 10.1016/j.ajhg.2010.07.021.
9. Dawy Z, Goebel B, Hagenauer J, Andreoli C, Meitinger T, Mueller JC: Gene mapping and marker clustering using Shannon's mutual information. IEEE/ACM Trans Comput Biol Bioinform. 2006, 3: 47-56. 10.1109/TCBB.2006.9.
10. Chanda P, Zhang A, Brazeau D, Sucheston L, Freudenheim JL, Ambrosone C, Ramanathan M: Information-theoretic metrics for visualizing gene-environment interactions. Am J Hum Genet. 2007, 81: 939-63. 10.1086/521878.
11. Ruiz-Marín M, Matilla-García M, Cordoba JA, Susillo-González JL, Romo-Astorga A, González-Pérez A, Ruiz A, Gayán J: An entropy test for single-locus genetic association analysis. BMC Genet. 2010, 11: 19.
12. Shannon CE: A mathematical theory of communication. Bell Syst Tech J. 1948, 27: 379-423.
13. Yee J, Kwon MS, Park T, Park M: A modified entropy-based approach for identifying gene-gene interactions in case-control study. PLoS ONE. 2013, 8: e69321. 10.1371/journal.pone.0069321.
14. Goebel B, Dawy Z, Hagenauer J, Muller J: An approximation to the distribution of finite sample size mutual information estimates. Proc IEEE Int'l Conf Comm. 2005, May
15. Namkung J, Kim K, Yi S, Chung W, Kwon MS, Park T: New evaluation measures for multifactor dimensionality reduction classifiers in gene-gene interaction analysis. Bioinformatics. 2009, 25: 338-45. 10.1093/bioinformatics/btn629.
16. Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003, 24: 150-7. 10.1002/gepi.10218.
17. Bush WS, Edwards TL, Dudek SM, McKinney BA, Ritchie MD: Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction. BMC Bioinformatics. 2008, 9: 238-244. 10.1186/1471-2105-9-238.
18. Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, Moore JH: A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol. 2007, 31: 306-15. 10.1002/gepi.20211.
19. Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661-78. 10.1038/nature05911.
20. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-75. 10.1086/519795.
21. Scheet P, Stephens M: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006, 78: 629-44. 10.1086/502802.
22. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J: Complement factor H polymorphism in age-related macular degeneration. Science. 2005, 308: 385-9. 10.1126/science.1109557.
23. Chang CC, Lin CJ: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2 (3): 1-27.
24. Breiman L: Random Forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324.
25. Hamshere ML, Green EK, Jones IR, Jones L, Moskvina V, Kirov G, Grozeva D, Nikolov I, Vukcevic D, Caesar S, Gordon-Smith K, Fraser C, Russell E, Breen G, St Clair D, Collier DA, Young AH, Ferrier IN, Farmer A, McGuffin P, Wellcome Trust Case Control Consortium, Holmans PA, Owen MJ, O'Donovan MC, Craddock N: Genetic utility of broadly defined bipolar schizoaffective disorder as a diagnostic concept. Br J Psychiatry. 2009, 195: 23-9. 10.1192/bjp.bp.108.061424.
26. Martinowich K, Schloesser RJ, Manji HK: Bipolar disorder: from genes to behavior pathways. J Clin Invest. 2009, 119: 726-36. 10.1172/JCI37703.
27. Laje G, Allen AS, Akula N, Manji H, John Rush A, McMahon FJ: Genome-wide association study of suicidal ideation emerging during citalopram treatment of depressed outpatients. Pharmacogenet Genomics. 2009, 19: 666-74. 10.1097/FPC.0b013e32832e4bcd.
28. Djurovic S, Gustafsson O, Mattingsdal M, Athanasiu L, Bjella T, Tesli M, Agartz I, Lorentzen S, Melle I, Morken G, Andreassen OA: A genome-wide association study of bipolar disorder in Norwegian individuals, followed by replication in Icelandic sample. J Affect Disord. 2010, 126: 312-6. 10.1016/j.jad.2010.04.007.
29. Iwamoto K, Ueda J, Bundo M, Kojima T, Kato T: Survey of the effect of genetic variations on gene expression in human prefrontal cortex and its application to genetics of psychiatric disorders. Neurosci Res. 2011, 70: 238-42. 10.1016/j.neures.2011.02.012.
30. van Winkel R, Genetic Risk and Outcome of Psychosis (GROUP) Investigators: Family-based analysis of genetic variation underlying psychosis-inducing effects of cannabis: sibling analysis and proband follow-up. Arch Gen Psychiatry. 2011, 68: 148-57. 10.1001/archgenpsychiatry.2010.152.
31. Yu W, Gwinn M, Clyne M, Yesupriya A, Khoury MJ: A navigator for human genome epidemiology. Nat Genet. 2008, 40: 124-5. 10.1038/ng0208-124.
32. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009, 106: 9362-7. 10.1073/pnas.0903103106.
33. Botstein D, Risch N: Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003, 33 (Suppl): 228-37.
34. Kotti S, Bickeboller H, Clerget-Darpoux F: Strategy for detecting susceptibility genes with weak or no marginal effect. Hum Hered. 2007, 63: 85-92. 10.1159/000099180.
35. Culverhouse R, Suarez BK, Lin J, Reich T: A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet. 2002, 70: 461-71. 10.1086/338759.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (2012R1A3A2026438, 20080062618) and by the Korea Healthcare Technology R&D Project, Ministry for Health & Welfare, Republic of Korea (A101915).
Declarations
Publication of this article was funded by Seoul National University.
This article has been published as part of BMC Medical Genomics Volume 7 Supplement 1, 2014: Selected articles from the 3rd Translational Bioinformatics Conference (TBC/ISCB-Asia 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/7/S1.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
MK performed the programming and the analysis, and drafted the manuscript. MP participated in the design of the study. TP conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.
Cite this article
Kwon, M., Park, M. & Park, T. IGENT: efficient entropy based algorithm for genome-wide gene-gene interaction analysis. BMC Med Genomics 7, S6 (2014). https://doi.org/10.1186/1755-8794-7-S1-S6
Keywords
 Support Vector Machine
 Random Forest
 Information Gain
 Multifactor Dimensionality Reduction
 Wellcome Trust Case Control Consortium