 Research
 Open Access
 Published:
HiSSI: highorder SNPSNP interactions detection based on efficient significant pattern and differential evolution
BMC Medical Genomics volume 12, Article number: 139 (2019)
Abstract
Background
Detecting single nucleotide polymorphism (SNP) interactions is an important and challenging task in genomewide association studies (GWAS). Various efforts have been devoted to detect SNP interactions. However, the large volume of SNP datasets results in such a big number of highorder SNP combinations that restrict the power of detecting interactions.
Methods
In this paper, to combat with this challenge, we propose a twostage approach (called HiSSI) to detect highorder SNPSNP interactions. In the screening stage, HiSSI employs a statistically significant pattern that takes into account family wise error rate, to control false positives and to effectively screen twolocus combinations candidate set. In the searching stage, HiSSI applies two different search strategies (exhaustive search and heuristic search based on differential evolution along with χ^{2}test) on candidate pairwise SNP combinations to detect highorder SNP interactions.
Results
Extensive experiments on simulated datasets are conducted to evaluate HiSSI and recently proposed and related approaches on both twolocus and threelocus disease models. A real genomewide dataset: breast cancer dataset collected from the Wellcome Trust Case Control Consortium (WTCCC) is also used to test HiSSI.
Conclusions
Simulated experiments on both twolocus and threelocus disease models show that HiSSI is more powerful than other related approaches. Real experiment on breast cancer dataset, in which HiSSI detects some significantly twolocus and threelocus interactions associated with breast cancer, again corroborate the effectiveness of HiSSI in highorder SNPSNP interaction identification.
Background
It has been widely recognized that single nucleotide polymorphisms (SNPs) are associated with a variety of human complex diseases. Genomewide association study (GWAS) has became a powerful tool for detecting SNPs and detected hundreds of single SNPs associated with complex diseases [1]. However, these single SNPs can only explain a portion of the theoretical estimated heritability of complex diseases [2]. Complex diseases are influenced by various genetic variants and environmental factors. Therefore, SNPSNP interactions defined as various joint effects of genetic variations should also be considered to better understand etiology of complex diseases.
Existing approaches for searching twolocus SNP interactions can be grouped into three categories: exhaustive search, stochastic search and machine learning based search. Methods based on exhaustive search enumerate all possible SNP combinations of twolocus and perform interaction tests for each combination. Ritchie et al. [3] proposed the multifactor dimensionality reduction (MDR) approach, which partitions genotype combinations into two classes and exhaustively searches the best SNP combination by predicting the disease status. Stochastic methods use the random sampling procedures to search the space of SNP combinations. Zhang et al. [4] proposed a Bayesian epistasis association mapping approach, which iteratively uses the Markov chain Monte Carlo to search twolocus interactions. Machining learning methods [5], such as random forest, neural networks and support vector machines, also have been applied to discover SNP interactions. Bureau et al. [6] focused on measures of predictive importance and applied random forest to discover predictive polymorphisms or markers of a phenotype, which are likely to affect disease susceptibility.
There are some challenges in detecting highorder SNP interactions. The first is the computational challenge. Although the overall complexity is linear with the number of individuals, it becomes exponential with the increase of locus. For example, for a dataset containing 1 million SNPs, the number of combinations to be tested is tremendous: 5 ×10^{11} pairwise interactions, 1.7 ×10^{17} 3way interactions, 8.3 ×10^{27} 5way interactions [7]. Therefore, exhaustively searching highorder epistatic interactions would be a heavy computational burden. The second is the statistical challenge. To balance the falsepositive rate and falsenegative rate, many stringent significance thresholds should be applied.
Several highorder SNP interactions detection approaches were developed to attack the aforementioned challenges. Xie et al. [8] proposed EDCF (Epistasis Detector based on the Clustering of relatively Frequent items) to detect multilocus epistatic interactions based on twolocus interaction models. EDCF is a twostage method, it firstly groups all genotype combinations into three clusters and then evaluates the significance of interaction modules based on χ^{2}test. Guo et al. [9] proposed a twostage method called DCHE (Dynamic Clustering for Highorder genomewide Epistatic interactions detecting). DCHE dynamically groups all genotype combinations into three to six subgroups, and then separately adopts χ^{2}test to evaluate the candidate pairwise combination in each subgroup. Yang et al. [10] proposed a stochastic search method (DECMDR). DECMDR combines the differential evolution algorithm [11] with a classification based multifactordimensionality reduction to detect the significant associations between cases and controls among all possible SNP combinations.
These highorder SNP interactions detection approaches still have some limitations. Most of these approaches do not control false positives and apply Bonferroni correction [12] in multiple hypothesis test for GWAS. Bonferroni correction is simple, but it is often overly conservative when the number of SNP is very huge. The correction comes at the cost of increasing the probability of producing false negatives, i.e., reducing statistical power [13, 14].
In this paper, we propose a twostage approach named HiSSI to detect highorder SNP interactions based on candidate pairwise SNP combinations. In the screening stage, a statistically significant pattern considering family wise error rate (FWER) is introduced to control false positives in multiple hypothesis test. HiSSI makes the statistically significant pattern faster and more memoryefficient via a fast WestfallYoung permutation testing [15], and obtains a corrected significant threshold to screen significant pairwise SNP combination candidates. In the search stage, HiSSI employs two different strategies to search highorder SNP interactions. For a small set, HiSSI uses the exhaustive search. For a large set, HiSSI employs a heuristic search technique named differential evolution (DE) algorithm [10, 11] along with χ^{2}test. We conduct simulation studies with various twolocus and threelocus disease models to comparatively study the power of HiSSI and that of stateoftheart approaches, including EDCF [8], DCHE [9] and DECMDR [10]. The empirical study demonstrates that our proposed HiSSI is generally more powerful than these approaches. Further study on a real breast cancer (BC) dataset shows that HiSSI also detects some twolocus and threelocus combinations that are significantly associated with breast cancer. These experiments prove that HiSSI is capable to identify highorder interactions from genomewide data.
Methods
Problem statement
Suppose a genotype dataset include N samples and M SNPs. We use y to denote the phenotype (including case and control), P(s(i,j)) to denote the pattern of pairwise SNP (ith SNP and jth SNP) combinations. Let N_{1} and N_{0} denote the number of affected samples (i.e., cases) and the number of controls.
Suppose a SNP with a major allele A, and a minor allele a. Three genotypes of a SNP are the homozygous reference genotype (AA), the heterozygous genotype (Aa), and the homozygous variant genotype (aa). Generally, these three genotypes are encoded as {0, 1, 2}. In this paper, for kth (k={0,1,2}) genotype of ith (i={1,2,…,M}) SNP, we encode it as {0, 1} by the ratio of the number of case and the number of control, which can be calculated as:
where N_{0ik} and N_{1ik} denote the number of kth genotype of ith SNP under control and case set, respectively; \(\omega =\frac {N_{1}}{N_{0}}\) is a balance factor to control the influence of unbalanced GWAS datasets. If R_{ik}>1, the genotype is encoded by 1; otherwise, encoded by 0. In this way, each SNP is encoded by {0, 1}. For each pairwise SNP combination P(s(i,j)), it is also encoded by {0, 1} instead of nine genotype combinations as follows:
where S_{i} and S_{j} denote the ith and jth SNPs.
In the screening stage, HiSSI attempts to find all significant candidate pairwise SNP combinations (snp_{i},snp_{j}) such that P(s(i,j)) is statistically associated with the phenotype y after correction for multiple hypothesis testing. In the search stage, HiSSI tries to find out highorder SNP interactions based on candidate set. The whole procedure of HiSSI is illustrated in Fig. 1. The following two subsections elaborate on these two stages, respectively.
Stage 1: screening pairwise SNP combinations
For each pairwise SNP combination P(s(i,j)), we can obtain the 2 ×2 contingency table for P(s(i,j)) and phenotype y as Table 1.
HiSSI evaluates the association between the phenotype y and the variable P(s(i,j)) by χ^{2}test [16]. Suppose p_{i,j} is the corresponding pvalue of the twolocus combination (snp_{i},snp_{j}) derived from the contingency table. If p_{i,j}≤δ^{∗} (δ^{∗} is the corrected significant threshold), HiSSI deems the twolocus combination is significant and places it into candidate set.
HiSSI utilizes the minimum attainable pvalue and the set of testable SNP combinations at significance level δ to make the permutationtesting more fast and efficient. Since the minimum attainable pvalue Ψ(x) is symmetric about N/2 [17], there are only \(\lfloor \frac N2 \rfloor \) +1 different values of Ψ(x) denoted as \(\{\delta _{0},\delta _{1},\ldots,\delta _{\lfloor \frac N2 \rfloor }\}\), which is a monotonically decreasing sequence. Σ(δ) is the testable region, one twolocus combination (snp_{i},snp_{j}) is testable if and only if the marginal x∈Σ(δ). \(\Sigma _{k}=[\sigma _{l}^{k},\sigma _{r}^{k}]\bigcup [N\sigma _{r}^{k},N\sigma _{l}^{k}]\) can be computed by starting from Σ(δ_{0})=[0,N] and iteratively shrinked to obtain Σ(δ_{k}) from Σ(δ_{k−1}).
At initialization, HiSSI generates J phenotypes based on J permutations, initializes J different minimum pvalues \(\{p_{min}^{(j)}\}_{j=1}^{J} \) =1 (the maximum value a pvalue can take) and initializes the corrected significance threshold as δ=δ_{1}, δ_{1} is the largest value that Ψ(x) can take other than the trivial value δ_{0}=1, which deems all SNP pairs that are testable and significant. Then, HiSSI computes the corresponding testability region Σ_{k} and \(\sigma _{l}^{k}\). For each twolocus combination, HiSSI computes x_{(i,j)} and check if the combination is testable or not in the current corrected significance level δ. If x_{i,j}∈Σ_{k}, the combination is testable and needs to be processed. In such case, HiSSI does not need to exhaustively analyze all twolocus combinations, and only needs to analyze these combinations whose marginal x is in testable region Σ. By updating the minimum pvalues by J permutations, FWER can be obtained. If FWER (δ)>α, k needs to be increased so as to decrease FWER (δ) to control the false positives. The corrected significance threshold δ^{∗} can be calculated as follows:
The above processes are summarized in Algorithm 1.
Once the corrected significance threshold δ^{∗} is obtained, for each twolocus combination, HiSSI computes the marginal x_{i,j} and a_{i,j}, which is the number of \(\{P(s_{i}(i,j))\}_{i=1}^{N}\) under the cases, and then computes the corresponding pvalue via χ^{2}test. If p_{i,j}≤δ^{∗}, HiSSI deems the combination is significant and places it into the candidate set.
Stage 2: highorder SNP interactions detection
In the search stage, HiSSI provides two strategies (exhaustive search and DEbased search) to search highorder SNP interactions based on candidate set.
Exhaustive search for small candidate set
Exhaustive search is affordable when the candidate set is small and has a larger chance to detect highorder SNP interactions than heuristic search. HiSSI applies exhaustive search on a small candidate set. To exhaustively search KSNP (K≥3) interactions, HiSSI combines all candidate SNP pairs to a set of KSNP, and computes the corresponding pvalue obtained by χ^{2}test of KSNP. HiSSI reports these combinations whose pvalues are smaller than the corrected significant thresholds δ^{∗} of K SNPs, obtained by the Algorithm 1.
Heuristic search for large candidate set
For a large candidate set, HiSSI employs a heuristic search approach based on differential evolution (DE) algorithm [11, 18–23] with χ^{2}test to identify highorder SNP interactions. DE is a powerful heuristic and parallel direct search approach with few control variables. Here, we take K=3 as an example to illustrate the process of detecting highorder interactions. The DEbased search strategy is presented as follows.
 1.
Initialization: for the candidate set C obtained from the first stage, a target vector is employed to represent a combination of three SNPs from C and defined as: :
$$ \begin{aligned} X_{i,g} = (f_{1,i,g},f_{2,i,g},f_{3,i,g}{f \in \mathrm{C}}), \ \ i=1,2,\ldots,ps \end{aligned} $$(4)where ps is the population size, i.e., the number of randomly generated target vectors; g means the gth iteration. i is the ith target vector in the population, f_{j,i,g}(j=1,2,3) represents one of the three SNPs in the ith target vector in the gth generation. At the initialization (g=0), f_{j,i,g}(j=1,2,3) are randomly generated as follows:
$$ \begin{aligned} f_{j,i,0} = \text{rand}_{j}([0,1))\times(upperlower)+lower, \ \ j=1,2,3 \end{aligned} $$(5)where upper and lower are the upper and lower bounds of the indexes of the candidate set. rand_{j}([0,1)) randomly generates a uniformly distributed random value within the range [0,1).
 2.
Mutation: in the mutation operation, each target vector generates a mutant vector:
$$ \begin{aligned} V_{i,g+1} = X_{r1,g}+F \cdot (X_{r2,g}{}X_{r3,g}), \ \ i=1,2,\ldots,ps \end{aligned} $$(6)where r1, r2 and r3∈(1,2,…,ps) are the random indices of the population, and they are mutually different. X_{r1,g}, X_{r2,g} and X_{r3,g} are the selected three target vectors. F∈[0,2] is a real and constant factor that controls the amplification of differential variation (X_{r2,g}−X_{r3,g}).
 3.
Recombination: in the recombination operation, the mutant vector V_{i,g+1} and the current target vector X_{i,g} are incorporated to generate a trial vector:
$$ U_{i,g+1} = (u_{1,i,g+1},u_{2,i,g+1},u_{3,i,g+1}) $$(7)where
$$ \begin{aligned} u_{j,i,g+1}\,=\, \begin{cases}{} \!v_{j,i,g+1}, \ \text{if}\ randb(j) \!\leq\! CR \ \text{or} \ j\,=\,rnbr(i)\\ \!x_{j,i,g}, \ \ \ \ \text{if} \ randb(j) \!>\! CR \ \text{or} \ j\!\neq\! rnbr(i) \end{cases} \ \ \!\!\!\!\!\!\!j=1,2,3 \end{aligned} $$(8)where randb(j) is the jth evaluation of a uniform random number generator with an outcome in [0,1], CR∈[0,1] is the crossover constant. rnbr(i) is a randomly chosen index in (1,2,3), it ensures that U_{i,g+1} obtains at least one parameter from V_{i,g+1}.
 4.
Boundary constraints [10]: a trail vector must be checked whether it is a feasible SNP combination (i.e., no parameters in the trial vector outside of the problem space), and can be adjusted as follows:
$$ \begin{aligned} u_{j,i,g+1}= \begin{cases}{} \text{rand}_{j}([0,1))\times(upperlower)+lower, \\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{if }(u_{j,i,g}\textless lower \ \text{or} \ u_{j,i,g+1}\textgreater upper)\\ x_{j,i,g}, \ \ \ \ \ \ \ \text{otherwise} \end{cases} \end{aligned} $$(9)  5.
Selection: the selection operation determines whether the target vector X_{i,g} is replaced by the trial vector U_{i,g+1} in the next generation or not. An evaluation function is used to evaluate the target and trial vectors. Here, HiSSI employs the chisquare test as the evaluation function. If the corresponding pvalue of trial vector U_{i,g+1} obtained by chisquare tese yields a better value than the corresponding pvalue of target vector X_{i,g}, namely p(U_{i,g+1})<p(X_{i,g}), then the target vector X_{i,g+1} is set to U_{i,g+1} in the next generation; otherwise, X_{i,g+1} is set to X_{i,g}.
Through the above four iterative operations (step (2)(5)), the value of the target vector can be improved by competing between target vectors and trial vectors. These four operations are repeated until the maximum number of generations (g_{max}) is reached, and the target vector with the best fitness value is the detected highorder SNP interaction.
FWER control
In GWAS, SNP interaction detection leads to a multiple hypothesis testing problem that generates lots of false positives. To alleviate this problem, Boferroni correction [12] and permutationtesting [24], are widely used for correcting the multiple testing problem. However, Bonferroni correction only works when the number of test patterns is known in advance and small [14]. HiSSI applies a fast permutationtesting method [15] to strictly control the family wise error rate (FWER), defined as the possibility of producing at least one false positive. In the permutationtesting, HiSSI generates a resampled dataset by randomly permuting the phenotype. Then, HiSSI computes the minimum pvalue across all genotype combinations. Repeating the permutation for a sufficiently number (J) of times, it obtains J different minimum pvalues \(\{p_{min}^{(t)}\}_{t=1}^{J} \). The FWER can be evaluated as:
where 1[·] is an indicator function which takes value 1 if its argument is true and 0 otherwise; δ is the corrected significance threshold.
FWER control requires FWER ≤α with α being the desired significant threshold. By doing this, the corrected significant threshold δ is chosen appropriately. The optimal δ^{∗} is obtained by solving the same optimization problem as Equation (3). In addition, the optimization problem also can yield the highest power (the probability of detecting true positives), and strictly control the FWER.
Results
In this section, we evaluate the performance of HiSSI on both simulated and real datasets. In the simulated study, we compare HiSSI with EDCF [8], DCHE [9], DECMDR [10] and HiSSIBC on different disease models (including twolocus and threelocus) with different parameters settings. HiSSIBC is a variant of HiSSI, it obtains the corrected significant threshold using the Bonferroni correction. We adopt the same measure of power suggested by Wan et al. [25] as follows:
where D^{′} is the number of datasets where exist true SNP interactions, and D is the number of all datasets. The definition of marginal effect size λ of a disease locus is the same as the one used in Zhang et al. [4]:
where p_{AA} and p_{Aa} denote the penetrance of genotype AA and Aa, respectively. For the real study, we apply HiSSI on the real breast cancer (BC) GWAS dataset collected from Wellcome Trust Case Control Consortium (WTCCC) [26].
Experiments on simulated datasets
To do comprehensive experimental comparison, we conduct simulation experiments on both twolocus and threelocus disease models. Since the number of candidate SNP combinations is small after screening in the first stage, we apply exhaustive search to detect highorder interaction.
Twolocus disease models
Three twolocus disease models (Model1, Model2 and Model3) are used to compare HiSSI with EDCF [8], DCHE [9], DECMDR [10] and HiSSIBC. Model1 and Model2 are proposed by Marchini et al. [27], where Model1 with a threshold effect, and Model2 with a multiplicative effect. Model3 is proposed by Zhang et al. [4] with an additive effect. The marginal effect size is relatively small in the simulation study, λ=0.2 for Model1, Model2, and Model3. Minor allele frequencies (MAFs) are the same for both loci at three levels: MAF = 0.1, 0.2 and 0.4; and for Linkage disequilibrium (LD), r^{2} is set to 0.7 and 1.0: r^{2}=0.7 is simulated for disease loci ungenotyped, but their LD markers genotyped; r^{2}=1.0 is simulated for directly genotyped disease loci. We use the same simulation program as [4] to simulate 100 datasets under each parameter setting for each disease model. Each dataset contains 100 SNPs, and the sample size is fixed to 1000, 2000 and 4000.
Figure 2 reveals the performance of different approaches on these three models. The power of all methods improves significantly when the sample size increasing from 2000 to 4000, and r^{2} changing from 0.7 to 1. However, the power of most approaches decreases as the MAFs of disease associated markers varying from 0.2 to 0.4. The trend is consistent with the results in [4, 27].
Across all models, HiSSI outperforms HiSSIBC, which evidences that the adopted permutation test is more effective than Bonferroni correction for controlling false positives in multiple hypothesis test. In addition, HiSSI also has a better performance than other approaches (EDCF, DCHE and DECMDR) for Model1–Model3 except some cases, Model1 with N = 2000, r^{2} = 0.7, MAF = 0.4, and Model2 with r^{2} = 1.0, MAF = 0.4. In these cases, HiSSI has a lower power than EDCF and DCHE. That is because HiSSI may lose some genetic associations, since it partitions twolocus genotype combinations into two groups, which is much smaller than the number of genotypes. On the contrary, EDCF and DCHE partition genotype combinations into more groups than HiSSI; EDCF has three groups, DCHE has three to six groups; whose numbers are larger than two and can retain more genetic information. In most cases, DECMDR has the lowest power, since it applies heuristic search and only reports the optimal solution. Another interesting observation is that the power of EDCF drastically decreases when N = 4000 with r^{2} = 0.7 and MAF = 0.2. One possible reason is that EDCF divides each threelocus combinations into three groups and uses the chisquare test with two degrees of freedom to measure the significance, resulting in more false positives.
In addition, highdimensional simulation datasets with 1000 SNPs, 2000 and 4000 samples on Model2 are also used to test HiSSI and other comparing approaches. The settings of MAF and LD are the same as before and the simulation datasets are also generated by the same simulation program as Zhang et al. [4].
Figure 3 reveals the performance of different approaches on Model2 with 1000 SNPs. Similarly, the power of all approaches significantly increases when the sample size increase from 2000 to 4000, r^{2} varies from 0.7 to 1; and decreases when MAF varies from 0.2 to 0.4. For the model with 1000 SNPs, HiSSI still outperforms HiSSIBC, which confirms the effectiveness of permutation test on highdimensional datasets. HiSSI has a better performance than other approaches except r^{2} = 1.0, MAF = 0.4. In such case, HiSSI has a lower power than EDCF and DCHE. EDCF loses its power when N = 4000 with r^{2} = 0.7 and MAF = 0.2. All these results are consistent with the results on the small simulation datasets with Model2.
Threelocus disease models
We use two threelocus disease models (Model4 and Model5) to test the ability of HiSSI in detecting highorder SNP interactions. Model4 is a threelocus interaction model proposed by Zhang et al. [4]. Model5 is the extension of Model1, which is a twolocus interaction model with a threshold effect. The sample size increases from 2000 to 4000; the minor allele frequencies (MAFs) is set to 0.1, 0.2, and 0.4; the r^{2} changes from 0.7 to 1.0; and the marginal effect is set to λ=0.3 for Model4 and Model5. We use the same simulation program in Zhang et al. [4] to simulate 100 datasets under each parameter setting for each disease model, and each dataset contains 100 SNPs.
Figure 4 shows the performance of different approaches on two threelocus disease models for highorder interactions detection. The power of all approaches significantly improves with the sample size increasing from 2000 to 4000, and r^{2} changing from 0.7 to 1. Besides, for Model5, the power of most approaches decreases with MAFs of the disease associated markers varying from 0.2 to 0.4. This trend is consistent with the results in twolocus disease model. However, the trend is not obvious for Model4 with MAF varying from 0.1 to 0.4.
For these two models, HiSSI again has a better performance than HiSSIBC, which shows that permutation test is also more effective than Bonferroni correction in detecting highorder interactions. In addition, HiSSI obtains the highest power for Model4–Model5 except some cases, Model4 with N = 2000, r^{2} = 1.0, MAF = 0.2, 0.4; and N = 4000, MAF = 0.4. In these cases, HiSSI has a lower power than EDCF and DCHE. That is due to the same reason as twolocus disease models, resulting more false positives in HiSSI. For both models, the power of EDCF still drastically decreases when N = 4000 with r^{2} = 0.7 and MAF = 0.2, that is consistent with the results in twolocus disease model. Since EDCF, DCHE and HiSSIBC all employ corrected Bonferroni correction to calculate the threshold, from the power between HiSSI between these methods, we can conclude that permutation test is more effective than Bonferroni correction for controlling false positives in multiple hypothesis test. In most cases, DECMDR has the lowest power, since it applies heuristic search in a larger search space and only reports the optimal solution.
Besides, highdimensional simulation datasets with 1000 SNPs, 2000 and 4000 samples on Model4 and Model5 are also used to test HiSSI and other comparing approaches. The settings about MAF and LD are the same as the simulation datasets with 100 SNPs. Figure 5 reveals the performance of different approaches on Model4 and Model5 with 1000 SNPs. The trend for power of all approaches is consistent with that on the small simulation datasets. For both the two models, HiSSI still has a better power than HiSSIBC; and HiSSI obtains the highest power except Model4 with MAF = 0.4, and N = 2000, r^{2} = 1.0, MAF = 0.2. In these cases, the power of HiSSI is lower than EDCF and DCHE. All these results are consistent with the results on small simulation datasets.
In addition, we also conduct experiments on two models (twolocus and threelocus models) without marginal effect with 100 and 1000 SNPs. The experimental settings, results and analysis can be found in Additional file 1. All simulated models used in simulated experiments are showed in Additional file 2.
Experiment on the breast cancer dataset
A real breast cancer dataset (BC) collected from WTCCC project [26] is used to further evaluate HiSSI. It is reported that breast cancer is caused by a combination of genetic and environmental risk factors [28]. The BC dataset contains 15347 SNPs from 1045 affected individuals and 3893 normal individuals. Quality control is performed to exclude very low rate samples and SNPs. For a SNP, if its call rate <95% across all samples, or its pvalue (HardyWeinberg equilibrium) <0.0001 in controls, or with MAF <0.1, the SNP will be excluded. For a sample, if its call rate <98%, the sample will be excluded. Through the quality control, the BC dataset contains 1045 case samples and 3893 control samples with 5607 SNPs.
Some significant twolocus and threelocus combinations on BC dataset identified by HiSSI is shown in Table 2. In the twolocus combinations, (rs1108842, rs2289247) is in gene GNL3 on chromosome 3. The protein encoded by GNL3 may interact with p53 and may be involved in tumorigenesis. (rs2242665, rs2856705) is on chromosome 6, where rs2856705 is susceptibly associated with breast cancer [29]. (rs1801197, rs6971091) is on chromosome 7, where rs1801197 is located in gene CALCR. It is evidenced that rs1801197/CALCR can lead to breast cancer [29]. (rs365990, rs7158731) is on chromosome 14, where rs365990 is in gene MYH6, and rs7158731 is in gene ZNF839. MYH6 encodes the alpha heavy chain subunit of cardiac myosin, and mutations in this gene cause familial hypertrophic cardiomyopathy and atrial septal defect 3. It is reported that MYH6 and ZNF839 are associated with breast cancer [29]. (rs8059973, rs3785181) is on chromosome 16, where rs8059973 is in gene DBNDD1, rs3785181 is in gene GAS11. rs8059973/DBNDD1 is associated with breast cancer [29]. GAS11 includes 11 exons spanning 25 kb and maps to a region of chromosome 16 that is sometimes deleted in breast and prostrate cancer. This gene is a putative tumor suppressor gene and is reported as being associated with breast cancer [30]. (rs2822558, rs2822787) is on chromosome 21, where rs2822558 is located in gene ABCC13. ABCC13 is a member of the superfamily of genes encoding ATPbinding cassette (ABC) transporters. It is reported that rs2822558/ABCC13 is related to breast cancer [29].
In the threelocus combination, (rs879882, rs2523608, rs805262) is on chromosome 6. rs879882 is in gene POU5F1, which plays a key role in embryonic development and stem cell pluripotency [31]. rs2523608 is located at gene HLAB and belongs to human leukocyte antigen (HLA) class I heavy chain paralogs. HLA class I antigen expression plays a central role in the immune system and is closely related to the aggressiveness and prognosis of BC [32]. rs805262 belongs to gene BAG6, which was first characterized as part of a cluster of genes located within the human major histocompatibility complex class III region. In addition, BAG6 is implicated in the control of apoptosis and is associated with basal cell carcinoma [33]. These identified significant twolocus and threelocus combinations demonstrate that HiSSI is capable to detect SNP interactions on genomewide data.
Parameter setting

In the screening stage, we set J=100 (number of permutations), α=0.05 (target FWER).

In the search stage, there are four common parameters of DE algorithm: population size (ps), generation size (g), the scaling factor (F) and crossover constant (CR). We set these parameters according to previous studies [10, 34] For real dataset, we set: ps=500, g=500, F=0.5 and CR=0.5.
Discussion
Comparison between HiSSI and other approaches

Comparison between HiSSI and HiSSIBC: HISSIBC is a variant of HiSSI, the main difference between HiSSI and HiSSIBC is that HiSSI employs a fast permutation test to obtain corrected significant threshold, while HiSSIBC uses the Bonferroni correction. For all simulation datasets on different disease models (including twolocus and threelocus), HiSSI always outperforms HiSSIBC, which demonstrates that permutation test is more effective than Bonferroni correction in correcting multiple testing.

Comparison between HiSSI and EDCF, DCHE: HiSSI utilizes a statistically significant pattern combined with permutation test to partition genotype combinations into two subgroups, which considers FWER to control false positives; while EDCF partitions genotype combinations into three subgroups, and DCHE dynamically partitions genotype combinations into three to six subgroups. Moreover, both EDCF and DCHE utilize the Bonferroni correction to correct multiple testing. The results on simulation datasets reveals HiSSI has a better performance than EDCF and DCHE, which proves the effectiveness of significant pattern in controlling false positives.

Comparison between HiSSI and DECMDR: both DECMDR and HiSSI utilize differential evolution (DE) algorithm to identify SNP interactions. DECMDR utilizes DE algorithm in the whole search space and uses the classification based multifactordimensionality reduction (CMDR) as a fitness measure to evaluate values of solutions in the DE process. While HiSSI utilizes DE algorithm in a smaller search space based on candidate set and the chisquare test as the fitness measure in DE process, it has a higher probability to search the true interactions. Since MDR is timeconsuming and only reports the optimal solution, DECMDR has a lower power than other approaches in most cases.
The advantages and limitations of HiSSI
The development of HiSSI is to overcome of the limitations of existing approaches on detecting highorder SNP interactions from genomewide data. HiSSI displays several advantages over existing methods:
HiSSI applies a FWERconstrained statistically significant pattern to strictly control false positives in multiple hypothesis test.
HiSSI utilizes a fast permutation testing to obtain corrected significant threshold, which avoids analyzing all twolocus combinations, greatly reduces the total runtime; and also avoids the conservatism of Bonferroni correction.
HiSSI provides two alternative search strategies, exhaustive search and heuristic search for different sizes of GWAS datasets.
The running time of HiSSI is relatively long compared with other approaches. It is a general problem for existing approaches that employ permutation test. Although HiSSI utilizes a fast permutation test, which is faster than traditional permutation test, it is still timeconsuming compared with heuristic algorithms and those approaches with Bonferroni correction. In addition, HiSSI does not directly control the main effects, which may introduce the negative influence of main effects for pairwise SNP combinations; and HiSSI only partitions genotype combinations into two groups, which may lose some genetic association. These limitations may degrade the performance of HiSSI. Future work can be extended to address the above limitations.
Conclusions
Detecting potential SNPSNP interactions in GWAS is an indispensable and challenging problem. In this paper, we proposed a twostage method called HiSSI to solve the problem. In the screening stage, HiSSI controls the false positives using an efficient statistically significant pattern that considers the family wise error rate, and obtains significant candidate pairwise SNP combinations. In the search stage, HiSSI utilizes two different strategies, exhaustive search and DEbased search, to detect highorder SNP interactions. Exhaustive search is applied to a small candidate set, and DEbased search is used for a large candidate set. A series of simulation experiments on both twolocus and threelocus disease models show that HiSSI is more powerful than other related approaches in detecting SNP interactions. Further experiment on a real WTCCC dataset corroborates that HiSSI is capable to identify highorder SNP interactions from genomewide data.
Availability of data and materials
The source code of HiSSI is available at http://mlda.swu.edu.cn/codes.php?name=HiSSI. The simulated disease models are specified in Additional file 2; and the simulated datasets are generated by the program in BEAM (http://bioinformatics.ust.hk/SNPHarvester.html). The genomewide Breast Cancer dataset is requested from Wellcome Trust Case Control Consortium (WTCCC), and its accession number is “EGAD00000000013”. WTCCC datasets cannot be shared without the permission from WTCCC. The researchers interested in WTCCC datasets can also apply them from WTCCC (https://www.wtccc.org.uk/).
Abbreviations
 BC:

Breast cancer
 DE:

Differential evolution
 FWER:

Family wise error rate
 GWAS:

GenomeWide association study
 HiSSIBC:

Highorder SNPSNP interactions detection with Bonferroni correction
 HiSSI:

Highorder SNPSNP interactions detection
 LD:

Linkage disequilibrium
 MAF:

Minor allele frequency
 MDR:

Multifactor dimensionality reduction
 SNP:

Single nucleotide polymorphism
 WTCCC:

Wellcome trust case control consortium
References
 1
Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, et al. The nhgri gwas catalog, a curated resource of snptrait associations. Nucleic Acids Res. 2013; 42(D1):1001–6.
 2
Kraft P, Hunter DJ. Genetic risk prediction–are we there yet?New Engl J Med. 2009; 360(17):1701–3.
 3
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Human Genet. 2001; 69(1):138–47.
 4
Zhang Y, Liu JS. Bayesian inference of epistatic interactions in casecontrol studies. Nature Genet. 2007; 39(9):1167.
 5
UpstillGoddard R, Eccles D, Fliege J, Collins A. Machine learning approaches for the discovery of gene–gene interactions in disease data. Briefings in Bioinformatics. 2012; 14(2):251–60.
 6
Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, Van Eerdewegh P. Identifying snps predictive of phenotype using random forests. Genet Epidemiol: Official Publ Int Genet Epidemiol Soc. 2005; 28(2):171–82.
 7
Ritchie MD. Finding the epistasis needles in the genoewide haystack. Methods in Molecular Biology. 2015; 2015:19–33.
 8
Xie MZ, Li J, Jiang T. Detecting genomewide epistases based on the clustering of relatively frequent items. Bioinformatics. 2012; 28(1):5–12.
 9
Guo X, Meng Y, Yu N, Pan Y. Cloud computing for detecting highorder genomewide epistatic interaction via dynamic clustering. BMC Bioinformatics. 2014; 15(1):102.
 10
Yang C, Chuang L, Lin Y. Cmdr based differential evolution identifies the epistatic interaction in genomewide association studies. Bioinformatics. 2017; 33(15):2354–62.
 11
Storn R, Price K. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J Global Opt. 1997; 11(4):341–59.
 12
Weisstein EW. Bonferroni correction. From MathWorldA Wolfram Web Resource. 2019 update. 2004. http://mathworld.wolfram.com/BonferroniCorrection.html.
 13
Nakagawa S. A farewell to bonferroni: the problems of low statistical power and publication bias. Behav Ecol. 2004; 15(6):1044–5.
 14
Li Y, Zhao Y, Wang G, Wang Z, Gao M. Elmbased largescale genetic association study via statistically significant pattern. Trans Syst, IEEE, Man, and Cybernet: Syst. 2017; PP(99):1–14.
 15
LlinaresLópez F, Sugiyama M, Papaxanthos L, Borgwardt K. Fast and memoryefficient significant pattern mining via permutation testing. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: 2015. p. 725–34.
 16
Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, Vol. 50; 1990. pp. 157–175.
 17
LlinaresLópez F, Grimm DG, Bodenham DA, Gieraths U, Sugiyama M, Rowan B, Borgwardt K. Genomewide detection of intervals of genetic heterogeneity associated with complex traits. Bioinformatics. 2015; 31(12):240–9.
 18
Yang M, Guan J, Li C. Differential evolution with autoenhanced population diversity: The experiments on the cec’2016 competition. In: Evolution Computation: 2016. p. 4785–9.
 19
Yang M, Li C, Cai Z, Guan J. Differential evolution with autoenhanced population diversity. IEEE Trans Cybernet. 2015; 45(2):302.
 20
Yang M, Cai Z, Li C, Guan J. An improved jade algorithm for global optimization. In: Evol Comput: 2014. p. 806–12.
 21
Yang M, Guan J, Cai Z, Li C. A selfadaptive differential evolutionary algorithm based on population reduction with minimum distance. Int J Innov Comput Appl. 2014; 6(1):13–24.
 22
Yang M, Guan J, Cai Z, Wang L. Selfadapting differential evolution algorithm with chaos random for global numerical optimization. In: International Symposium on Intelligence Computation and Applications: 2010. p. 112–122.
 23
Fang Z, Yang M, Zhang G, Guan J. A hybrid differential evolutionary algorithm based on the hierarchical clustering. In: Evol Comput: 2016. p. 2367–74.
 24
Chaubey YP. Resamplingbased multiple testing: Examples and methods for pvalue adjustment. Taylor & Francis. 1993.
 25
Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NL, Yu W. Boost: A fast approach to detecting genegene interactions in genomewide casecontrol studies. Am J Human Genet. 2010; 87(3):325–40.
 26
Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Duncanson A, Kwiatkowski DP, McCarthy MI, Ouwehand WH, Samani NJ, et al. Association scan of 14,500 nonsynonymous snps in four diseases identifies autoimmunity variants. Nature Genet. 2007; 39(11):1329.
 27
Marchini J, Donnelly P, Cardon LR. Genomewide strategies for detecting multiple loci that influence complex diseases. Nature Genet. 2005; 37(4):413–7.
 28
Michailidou K, Hall P, GonzalezNeira A, Ghoussaini M, Dennis J, Milne RL, Schmidt MK, ChangClaude J, Bojesen SE, Bolla MK. Largescale genotyping identifies 41 new loci associated with breast cancer risk. Nature Genet. 2013; 45(4):1–2.
 29
Milne RL, Burwinkel B, Michailidou K, AriasPerez JI, Zamora MP, MenéndezRodríguez P, Hardisson D, Mendiola M, GonzálezNeira A, Pita G, et al. Common nonsynonymous snps associated with breast cancer susceptibility: findings from the breast cancer association consortium. Human Mole Genet. 2014; 23(22):6096–111.
 30
Whitmore SA, Settasatian C, Crawford J, Lower KM, Mccallum B, Seshadri R, Cornelisse CJ, Moerland EW, CletonJansen AM, Tipping AJ. Characterization and screening for mutations of the growth arrestspecific 11 (gas11) and c16orf3 genes at 16q24.3 in breast cancer. Genomics. 1998; 52(3):325–31.
 31
Cai S, Geng S, Jin F, Liu J, Qu C, Chen B. Pou5f1/oct4 expression in breast cancer tissue is significantly associated with nonsentinel lymph node metastasis. BMC Cancer. 2016; 16(1):175.
 32
Hicklin, Daniel J, Marincola, Francesco M, Ferrone, Soldano. Hla class i antigen downregulation in human cancers: Tcell immunotherapy revives an old story. Mole Med Today. 1999; 5(4):178–86.
 33
Zhang M, Liang L, Morar N, Dixon AL, Lathrop GM, Ding J, Moffatt MF, Cookson WOC, Kraft P, Qureshi AA. Integrating pathway analysis and genetics of gene expression for genomewide association study of basal cell carcinoma. Human Genet. 2012; 131(4):615–23.
 34
Price K, Storn RM, Lampinen JA. Differential Evolution: a Practical Approach to Global Optimization; 2006.
Acknowledgements
The twopage short paper of this work was an oral presentation in the 14th International Symposium on Bioinformatics Research and Applications (ISBRA 2018).
About this supplement
This article has been published as part of BMC Medical Genomics, Volume 12 Supplement 7, 2019: Selected articles from the 14th International Symposium on Bioinformatics Research and Applications (ISBRA18): medical genomics. The full contents of the supplement are available at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume12supplement7.
Funding
This work is supported by Natural Science Foundation of China (61873214, 61872300, 61871020 and 61562054), and partially supported by Chongqing Graduate Student Research Innovation Project (Grant No. CYS19113), the National Key Research and Development Plan Task of China (Grant No. 2016YFC0901902), Natural Science Foundation of CQ CSTC (cstc2018jcyjAX0228 and cstc2016jcyjA0351), Fundamental Research Funds for the Central Universities (XDJK2019B024), the Open Research Project of The Hubei Key Laboratory of Intelligent GeoInformation Processing (KLIGIP2017A05), the Natural Science Fundamental Research Plan of Shaanxi Province (2016JM6038), and the Fundamental Research Funds for the Central Universities, NWSUAF, China (2452015060). The funders had no role in the study design, data collection, analysis and interpretation, decision to publish, or preparation of the manuscript.
Author information
Affiliations
Contributions
XC and JL implemented and conducted the experiments; JW and MG initialized and conceived the whole program; XC, JL and JW analyzed the results, drafted and finalized the manuscript; MG revised the manuscript. All the authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Experiments on models without marginal effect. Two disease models (a twolocus and a threelocus models) without marginal effect are used to test the performances of different approaches under different parameter settings.
Additional file 2
Simulated disease models. Simulated twolocus and threelocus models used in the simulation experiments are listed in tables.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Cao, X., Liu, J., Guo, M. et al. HiSSI: highorder SNPSNP interactions detection based on efficient significant pattern and differential evolution. BMC Med Genomics 12, 139 (2019). https://doi.org/10.1186/s1292001905846
Received:
Accepted:
Published:
Keywords
 Genomewide association studies
 Highorder SNP interactions
 Statistically significant pattern
 Family wise error rate
 Differential evolution