 Technical advance
 Open Access
 Published:
A new efficient method to detect genetic interactions for lung cancer GWAS
BMC Medical Genomics volume 13, Article number: 162 (2020)
Abstract
Background
Genomewide association studies (GWAS) have proven successful in predicting genetic risk of disease using singlelocus models; however, identifying single nucleotide polymorphism (SNP) interactions at the genomewide scale is limited due to computational and statistical challenges. We addressed the computational burden encountered when detecting SNP interactions for survival analysis, such as age of diseaseonset. To confront this problem, we developed a novel algorithm, called the Efficient Survival Multifactor Dimensionality Reduction (ESMDR) method, which used Martingale Residuals as the outcome parameter to estimate survival outcomes, and implemented the Quantitative Multifactor Dimensionality Reduction method to identify significant interactions associated with age of diseaseonset.
Methods
To demonstrate efficacy, we evaluated this method on two simulation data sets to estimate the type I error rate and power. Simulations showed that ESMDR identified interactions using less computational workload and allowed for adjustment of covariates. We applied ESMDR on the OncoArrayTRICL Consortium data with 14,935 cases and 12,787 controls for lung cancer (SNPs = 108,254) to search over all twoway interactions to identify genetic interactions associated with lung cancer ageofonset. We tested the best model in an independent data set from the OncoArrayTRICL data.
Results
Our experiment on the OncoArrayTRICL data identified many oneway and twoway models with a singlebase deletion in the noncoding region of BRCA1 (HR 1.24, P = 3.15 × 10^{–15}), as the top marker to predict age of lung cancer onset.
Conclusions
From the results of our extensive simulations and analysis of a large GWAS study, we demonstrated that our method is an efficient algorithm that identified genetic interactions to include in our models to predict survival outcomes.
Background
A fundamental aim of studying human genetics is to predict disease risk from genomic data. Genomewide association studies (GWAS) that used singlelocus models by testing each single nucleotide polymorphism (SNP) for association with a phenotype, proved to be instrumental in identifying thousands of genetic variants associated with human traits and disorders [1,2,3,4]. However, most of the findings explained only a small proportion of the genetic effects on diseases and traits [1, 5]. The complex biological mechanisms and genetic architectures of diseases motivated researchers to not only study main additive effects of single genetic variations, but also interactions between multiple variants with nonadditive effects to explain more of the heritability of complex diseases [6,7,8,9,10]. As the availability of large genomewide genotype and next generation sequencing data continues to grow, detecting genetic interactions (i.e., SNP interactions) will become more feasible with increasing power to detect significant associations [11]. At the same time, epistasis detection faces computational and statistical challenges in analyzing highdimensional data and in testing millions of interaction models from an exhaustive search in GWAS [6, 12]. The number of tests increases exponentially when analyzing higher orders of interactions, which require immense computing resources and processing time. Additionally, if the genotypic combinations that confer risk are nonadditive, finding the combinations of genotypes that increase risk can become a complex combinatorial challenge [7].
With the arrival of multidimensional and complicated genetic data sets, researchers have adapted to this growth by integrating machine learning methods to analyze complex genetic architectures. In genetic epidemiology, a popular series of methods were centered around a machine learning approach adapted to detect gene–gene interactions called the Multifactor Dimensionality Reduction (MDR) method. First introduced by Ritchie et al. (2001), MDR aimed at reducing highdimensional genetic interacting loci to a onedimensional binary variable that could be easily classified into high and low risk groups [7]. While MDR have successfully facilitated detection and characterization of multiple genetic loci, there were disadvantages to this algorithm that limited its use on diverse data structures such as survival data, which is often a primary outcome of interest in cancer research. Gui et al. [13, 14] have expanded on the MDR algorithm to different phenotypes, survival and continuous outcomes data. Survival MDR (SurvMDR) extended the analysis of dichotomous traits in MDR to censored and timetoevent survival data using a logrank test to classify sets of multiloci combinations. This algorithm demonstrated proficiency in identifying genetic interactions associated with censored timetodeath or timetoevent data; however, it was more computationally demanding than MDR and it did not allow for covariate adjustments important for controlling confounding factors [13]. Quantitative MDR (QMDR) offered a computationally efficient algorithm to identify genetic interactions associated with a quantitative outcome, but it also did not allow for covariate adjustments such as age, gender, environmental toxins, and other confounding factors to accurately identify genetic association relations [14].
Currently, there are limited methods capable of identifying genomewide genetic interactions efficiently with adjustment for covariates when studying age of diseaseonset, such as a patient’s age at first diagnosis or recurrence of disease, for largescale studies due to computational demands. It is important to have reliable estimates on the age of first diagnosis to understand the etiology of the disease and to tailor clinical practices, especially when determining the appropriate starting age for diagnostic screening, such as lung cancer screening [15]. In this study, we demonstrated how the Efficient Survival Multifactor Dimensionality Reduction (ESMDR) method improved on the efficiency of SurvMDR and allowed for adjustment of covariate effects to analyze largescale survival and genetic data to analyze age of diseaseonset in association with SNP interactions. Our method used Martingale Residuals as the estimated survival outcome with adjustment for confounding factors that provided an efficient and effective identification of genetic interactions associated with survival outcomes. We demonstrated the strength of the proposed method by designing two simulations to evaluate the 5% type I error threshold through an evaluation of the empirical null distribution and to analyze the predictive power of ESMDR. To analyze the effectiveness of the ESMDR method, we evaluated our approach using the genomewide genotyped lung cancer OncoArrayTRICL (Transdisciplinary Research Into Cancer of the Lung) Consortium data to detect and characterize SNP interactions that were associated with lung cancer ageofonset.
Methods
In this section, we discuss how we improved the computational efficiency without reducing accuracy to develop the ESMDR method when analyzing SNP interactions (i.e., joint effects of two SNPs) in association with age of diseaseonset.
Incorporating martingale residuals for ageofonset survival analysis
ESMDR improved the efficiency of SurvMDR and applied the QMDR algorithm to analyze age of diseaseonset in association with genetic interactions. Our novel ESMDR approach used a combination of survival analysis and QMDR for continuous outcome analysis in two steps. In the first step, we started replacing event time and status with Martingale Residuals with covariate adjustment as a new continuous score. In the second step, we applied QMDR to efficiently categorize the genotype combinations into highrisk and lowrisk groups. The best model was determined in the same way as QMDR, by using the crossvalidated ttest statistic computed from a continuous variable attribute (e.g., Martingale Residuals) to determine the best interaction and overall model.
The novel algorithm for ESMDR was performed as follows:

1
Selected K SNPs from all the SNPs in the data set and created a contingency table among every genotype combination of K SNPs.

2
For each multilocus genotype combination cell, summed the Martingale Residuals between samples with and without each genotype combination.

3
Labeled cells “highrisk” if the sum of the Martingale Residuals was positive; otherwise negative Martingale Residuals were labeled “lowrisk”.

4
Pooled all the highrisk labeled cells into one group and all the lowrisk labeled cells into another group to create a new onedimensional variable.
Using Martingale Residuals to determine high or low risk group for survival data analysis was comparable to using the logrank test statistic in SurvMDR, however, more efficiently when classifying genotype combinations. It can be shown that the sum of the Martingale Residuals is a good surrogate variable of the logrank test statistics for the purpose of determining high/low risk groups for each genotype combination. Next, we compared the similarities of the equations for Martingale Residuals and the logrank test statistic. The sign and magnitude of the Martingale Residuals were dependent on the association of SNPs and the hazard function in the following equation:
In this equation, δ_{i}(t) denotes the number of observed events that occur at each survival time t. The number of expected events was calculated using the coxproportional hazards model with x as the genetic factor and y as the adjusted covariate. The logrank test statistic was defined as the following:
Here, we show that Martingale Residuals is equivalent to the numerator of the logrank test statistic. Therefore, the sum of the Martingale Residuals is equal to the logrank test statistic when the variance is set to 1. This infers that using Martingale Residuals as a substitute for the logrank test statistic in evaluating genomic combinations associated with survival outcomes could provide the same data reduction and categorization process as SurvMDR.
Evaluation through simulations
Our purpose of running a simulation study was to evaluate how well ESMDR performed and how well it performed compared with SurvMDR. To demonstrate the strength of the proposed method, two simulations were designed to evaluate the testing score’s null distribution to evaluate the type I error rate and to analyze power.
Simulation I
The first simulation study was created to estimate the 5% type I error threshold by evaluating an empirical null distribution with independent noninteracting SNPs and quantitative outcome values. Here, we created sets of SNPs (m = 10, 20, 50) with additive coding and sample sizes (n = {200, 400, 800, 1600}) in the simulation data. For every combination of m and n, we simulated m SNPs with minor allele frequencies (MAF) drawn from the uniform distribution over the interval U (0.1, 0.5). Then we simulated n continuous outcomes from a standard normal distribution. The SNP and continuous outcome data were created independently to ensure that there were no associations between SNPs and the outcome. These steps were repeated to create 1000 null data sets for 24 different groups varied by the number of SNPs, sample size, and MAF. As a result, a total of 24,000 data sets were generated. Simulations were conducted in R 3.0.0 (Vienna, Austria). To determine whether the type I error rate was close to 5%, we analyzed the percentage of times that ESMDR randomly identified two interacting SNPs from a null data set.
Simulation II
The second simulation study was created to evaluate the power of ESMDR with a data set that included quantitative outcome variables and a pair of functional interacting SNPs and 18 noninteracting SNPs. SurvMDR was performed to evaluate whether ESMDR was as effective as SurvMDR in identifying functional SNPs.
The simulation data sets included different penetrance functions that described the probabilistic relationship between the quantitative outcome variable and functional SNPs generated with additive coding. We considered two different MAFs (0.2 and 0.4) and seven different broadsense heritability statistics (0.01, 0.02, 0.05, 0.1, 0.2, 0.3, and 0.4) to create a total of 14 unique model combinations, where the two functional SNPs associated with the outcome were evenly distributed across the seven heritability statistics. To create a purely epistatic model, each of the 14 unique models had one or the other functional SNP (MAF 0.2 or 0.4) with no main effects. The 14 alleleheritability frequency combinations were replicated five times to generate 70 models with varying sample sizes that included size (n) = {400, 800, 1600}.
Assuming SNP1 and SNP2 were the two functional SNPs. Let f_{ij} be an element from the ith row and jth column of a penetrance function. We generated the binary variable from a Bernoulli distribution with the following:
We randomly selected 200 highrisk subjects and 200 lowrisk subjects from each of the 70 probabilistic models to create one simulated data set. We repeated this simulation 100 times to obtain at total of 7,000 data sets.To generate the survival time, we used the Coxproportional hazards (Cox ph) model:
In this equation, h_{0}(t) is the baseline hazard function with a Weibull distribution using the shape parameter of 5 and the scale parameter of 2. The x is the genetic factor fixed at value 1 for high risk patients and 0 for low risk patients. ß represents the effect size or the log hazard ratio for a oneunit increase in x (all other covariates held constant). The censoring fractions were sampled from the uniform distribution over the interval U (0,4) from the Bernoulli distribution, resulting in 40% censoring. Finally, we merged survival time and censoring status with the SNP data.
We used Martingale Residuals in our novel ESMDR method to classify each multilocus genotype combination into highrisk and lowrisk groups. The Martingale Residual is the stochastic component and in residual form gives the following:
In this example, δ(t) denotes the number of expected events that occurred at each survival time t. Assuming a null model with no target effects (ß = 0), this residual is the difference between the observed events and expected number of events. The sign and magnitude of the Martingale Residuals are dependent on the association of SNPs and the hazard rate function. Each individual genotype with a positive Martingale Residual (i.e., greater than or equal to 0) was classified as highrisk. Otherwise, a negative Martingale Residual was classified as lowrisk. For every multilocus genotype combination of SNPs, we computed the sum of the Martingale Residuals to obtain a new variable that could be used to classify into the highrisk or lowrisk group.
To estimate the power of the proposed method, we ran ESMDR on each of the 7000 data sets and searched for the best model over all possible one (i.e., singlelocus), two (i.e., two interacting loci), and threeway (i.e., three interacting loci) interaction models, using the Tstatistic testing score. We also used the 95th percentile of the testing score from the null models as a threshold to guard against any nonsignificant findings. The power was estimated as the percentage of time ESMDR correctly included the two functional interacting SNPs in the best model out of each set of 7000 data sets. This significant threshold for the results was at the 0.05 level. For comparison, we ran SurvMDR on the simulated data to define its power. Training and testing scores for ESMDR were analyzed using twofold crossvalidation. The rational for using twofold crossvalidation [16] was that there would be no overlap between training sets and that all the predicted values were independent of each other. The best model was selected with the smallest prediction error and largest consistency in including the two functional interacting SNPs.
OncoArrayTRICL genotyping and quality control
A total of 533,631 SNPs from 57,775 individuals in the OncoArrayTRICL Consortium populationbased study, selected from 29 studies across North America and Europe, as well as Asia, were genotyped using the Illumina OncoArray500K BeadChip Platform, which included the genomewide backbone and select loci known to be associated with cancer phenotypes. To facilitate efficient genotyping and minimize variability that might arise from genotyping at multiple sites, genotyping was conducted at the following five institutions: the Center for Inherited Disease Research, the Beijing Genome Institute, the Helmholtz Zentrum München, Copenhagen University Hospital, and the University of Cambridge. Quality control steps described previously were followed for this OncoArrayTRICL data set [17]. The following participants were excluded from the current study: participants who lacked lung cancer status (did not participate in the lung cancer studies), smoking status, and age and gender information at diagnosis, participants who were close relatives (second degree relatives or closer), duplicate individuals, with nonEuropean ancestry, with lowquality extracted DNA, with low callrate for genotype data, and participants who did not pass other quality control measures. As a result, a total of 14,935 lung cancer cases and 12,787 controls remained in the current study. We restricted SNP filtering to a minimum to include more SNPs for analysis. We included SNPs with MAF ≥ 0.01 and SNPs with 50% and above genotyping rate.
OncoArrayTRICL data analysis
We applied ESMDR to the OncoArrayTRICL Consortium populationbased study to identify genetic interactions in association with lung cancer ageofonset. The OncoArrayTRICL Consortium is a collaboration among world leaders to investigate common causes of cancer susceptibility and progression [17]. Lung cancer cases and controls were genotyped using the OncoArray genotyping array known to tag cancer traits and susceptibility loci in addition to the GWAS backbone; this array consisted of approximately 533,000 tagged SNPs. We identified 27,722 participants (14,935 lung cancer cases and 12,787 healthy controls) aged 15–96 years of European ancestry. All participants provided informed consent and each study site obtained approval from their ethics committee. In this analysis, lung cancer ageofonset, cases (event at diagnosis age), controls (censored at interview age), and a covariate (smoking status) constituted the survival outcome data that were substituted by Martingale Residuals. We randomly sampled 2/3 of the data into a training set and 1/3 as the testing set. We applied our novel ESMDR method to perform an exhaustive oneway and twoway model search. We used PLINK as a prefiltering step to identify uncorrelated and independent SNPs. SNPs that were in linkage disequilibrium were removed, using a stringent correlation threshold of 0.1. After this filtering step, 108,254 SNPs remained. We searched over all oneway and twoway interactions in the training set to identify models consistently selected with the largest training score determined by twofold crossvalidation and we analysed the prediction error of the chosen top 10 models in the testing set. In our real data analysis, we also considered joint detection of the two SNPs with main effects to be successful detection of the functional interaction model. We performed a 10,000fold permutation test to evaluate the significance of chosen models.
To build a predictive model that combined the strength of both oneway and twoway models, we took all the SNPs involved in the top 1000 oneway models and all the SNPs from the top 1000 twoway interactions models and applied a penalized Cox regression method to filter and select the best predictive models to evaluate genetic factors associated with age of lung cancer onset. We ranked the test scores from highest to lowest and picked the top SNPs that best predicted lung cancer onset.
To construct predictive models linking SNPs to censored survival data, we used the least absolute shrinkage and selection operator (Lasso) penalized estimation for the Cox regression model to select top SNPs that were relevant to patients’ ages of lung cancer onset to create a prediction model with a parsimonious set of SNPs that could provide good prediction accuracy [18]. The Lasso procedure is a popular method for variable selection when the number of samples is significantly less than the number of predictor variables in the prediction model [19]. Briefly, Lasso is similar to the forward stepwise method in that it provides coefficient shrinkage as well as variable selection by driving nonsignificant coefficients in a regression model to zero [19]. Therefore, Lasso is a valuable tool to filter SNPs that are not associated with the outcome or highly correlated with other SNPs, especially in situations when the sample size is smaller than the number of SNP predictors.
Survival plots were generated using the Kaplan–Meier (KM) method to visualize the differences in age of lung cancer onset between highrisk and lowrisk groups based on top identified SNPs associated with lung cancer risk. To adjust for additional factors related to patient survival, the Cox ph regression model included adjustment for smoking status as a covariate in the model.
To assess the performance of our model in predicting lung cancer onset at different age intervals, we applied timedependent receiver operating characteristic (ROC) curve and area under the curve (AUC) to evaluate the predictive performance of the best models, previously introduced by Heagerty et al. [20]. In our study, with a given score function f(X), the timedependent sensitivity and specificity functions were defined as follows:
We defined the corresponding ROC(tf(X)) curve for any time t as the plot of sensitivity(c, tf(X)) versus 1—specificity(c, tf(X)) with the cutoff point c varying. The AUC is the area under the ROC(tf(X)) curve, which was denoted as AUC(tf(X)) [18]. Here, the δ(t) is the event indicator at time t. In this study, a larger AUC at time t based on the score function f(X) indicated better predictability of timetoevent at time t as measured by sensitivity and specificity evaluated at time t.
Results
Assessing type 1 error in simulation I
In the first simulation, we determined whether the type I error rate was close to the expected value when there were no SNP interaction effects. Assuming a data set that included 20 noninteracting SNPs and a total sample size of 400, we expected the type I error rate to be 0.05.
In Fig. 1, the null distributions for the one and twoway models followed the normal distribution quite closely, whereas the threeway model displayed a slight right skew. Nevertheless, the upper right tail regions almost perfectly overlapped with the upper right tail of the normal distribution for all three interaction models. This showed that the use of the 95th quantile of the empirical distribution as a threshold to remove false positives was suitable. This would greatly reduce the computing time by comparing testing scores with the prior calculated empirical distribution [14]. In Table 1, we used the 95th quantile of data sets with 400 samples to estimate the type 1 error rate. The estimated rate for type I error was tightly distributed around 5% with a range from 4.5 to 5.6% for the oneway and twoway models. The estimated error rate for the threeway model was greater with a range from 7.5 to 9.7%; however, it also exhibited a trend towards 5% with increasing sample size. Based on the results in Table 1, simulation I revealed that with every twofold increase in the sample size, there was an average 0.6% decrease in error rate for the threeway model. As a result, we expected that the type I error rate would converge to 5% with sample sizes greater than approximately 12,800.
Assessing power and speed in simulation II
In the second simulation, we estimated the power of ESMDR with a data set that included quantitative outcome variables and a pair of functional interacting SNPs and 18 noninteracting SNPs. We determined whether the power of ESMDR was comparable to SurvMDR in identifying the two functional SNPs. We counted the number of times that the functional SNP pair was correctly identified and divided that number by the total number of data sets (500 for this simulation) to get the estimated success rate.
Figure 2 presents a comparison of the power to identify only the two (i.e., stringent model) interacting SNPs (SNP1 and SNP2) for ESMDR and SurvMDR on simulated data. Table 2 displays the percent change in power to detect only the two functional interacting SNPs between ESMDR and SurvMDR. Overall, ESMDR performed better than SurvMDR for larger sample sizes. In addition, both ESMDR and SurvMDR demonstrated increasing power to detect functional SNPs with increasing heritability frequencies.
Figure 3 displays a comparison in power to identify the two interacting SNPs (SNP1 and SNP2) plus an additional SNP (i.e., flexible model) between ESMDR and SurvMDR. Table 3 shows the percent change in power to detect the two interacting SNPs plus an additional SNP. Here, we also demonstrated that ESMDR had greater power compared to SurvMDR. Again, ESMDR performed better than SurvMDR with larger sample sizes.
We compared the computing time between ESMDR and SurvMDR for 100 simulated data sets, for one, two, and threeway interactions, and with tenfold crossvalidation. The computing time for SurvMDR was 734.5 min versus 2.25 min for ESMDR, both of which were run on 1 node in the highperformance computing cluster called Discovery with AMD 3.1 Ghz CPU and 64 GB of memory. Discovery uses a Linux RedHat 6.7 operating system and is comprised of 160 computing nodes (3000 + cores), 12.5 TB of memory, and is available to the Dartmouth research community.
Application to OncoArrayTRICL data set
The main goal was to identify SNPs with main effects and SNP interactions that were associated with lung cancer susceptibility at different ages of disease onset. Using a populationbased study, we applied ESMDR on the OncoArrayTRICL Consortium data with 14,935 cases and 12,787 controls for lung cancer to search over all oneway and twoway interactions to identify genetic interactions in relation to lung cancer ageofonset. For this study, we included 533,631 genotyped variants and removed SNPs in linkage disequilibrium (LD) > 0.1 (n = 108,254 SNPs).
Table 4 lists the top 10 oneway test results generated by ESMDR and crossvalidation. Using ESMDR, highly significant SNPs were identified in association with lung cancer ageofonset. Table 5 displays the top 10 twoway interactions identified by ESMDR that were associated with lung cancer ageofonset. Due to the observed inflation of the type 1 error rate for 3way interactions in the simulation study, a 3way interaction was not evaluated in the OncoArrayTRICL data analysis. For Table 6, we combined SNPs from the top 1000 oneway loci and the top 1000 twoway interactions, ranked the SNP scores from highest to lowest, and applied the Lasso Cox regression method to filter and select the best genetic factors that predicted age of lung cancer onset. Table 6 exhibits the top 10 significant SNPs selected by Lasso Cox regression. To visualize the difference in age of lung cancer onset between the high risk and low risk groups, Fig. 4 illustrates the contrast using the KM) survival curve. KM curves for top oneway SNPs in the intronic region of TULP1, FKBP5 (rs6906359), inbetween genes GTF2IP1, PMS2P5 (rs149743903), and in the deletion of a noncoding region of BRCA1 (rs749410065) (NC_000017.10:g.41196821delT per Human Genome Variation Society nomenclature), and for a top twoway interacting SNPs in gene regions of BRCA1 (rs749410065) and CBR1, LOC100133286 (rs151043730) displayed a clear separation of curves between the high and lowrisk groups. This demonstrated the efficacy of ESMDR using Martingale Residuals to differentiate high risk and low risk groups based on genotype variation when evaluating lung cancer ageofonset. We continued our analysis with a comparison of smoking only and smoking plus SNP models to determine the best performance in predicting lung cancer onset at different ages. We used a common graphical plot called the area under the receiver operating characteristic (ROC) curve, also known as AUC, to measure the performance of our models to discriminate the best parameters at predicting lung cancer onset at different ages based on accuracy. In Fig. 5, the xaxis corresponds to the age of lung cancer onset, starting from 15 to > 80 years, and the yaxis indicates the AUC, ranging from 0.4 to 1. We examined the predictive performance of 7 different models with various tuning parameters identified from Cox Lasso regression, such as smoking only and smoking plus 2 SNPs, 4 SNPs, 13 SNPs, 19 SNPs, 29 SNPs, and 183 SNPs. This figure shows the average of the estimated AUCs over the OncoArrayTRICL data using the predictive scores from the independent leftout test data set. The plot displays good predictive performances of models generated using ESMDR. The AUC for models with more SNPs lies between 0.6 and 0.7 and continues to increase at later ages of onset. There is a noticeable decrease in AUC for ages 40 and below. This could be due to the limited number of lung cancer cases identified for individuals below the age of 40, which indicated that the models might not be appropriate to predict lung cancer diagnoses at 40 years and younger. The AUC of both smoking only and smoking with SNPs increased with age from age 40 and older. However, the AUC, depending on the number of SNPs in the models, differed by age. The model with the largest number of SNPs plus smoking performed the best at AUC 0.68 between ages 40 and 80 of onset compared to the smokingonly model with an AUC of 0.55. There was a noticeable trend where incremental additions of SNPs in the model increased the AUC for ageofonset between 40 and 80 + . On the other hand, the AUC for smokingonly and smoking plus fewer SNP models (e.g., 2 and 4) displayed the opposite trend where it increased around 90 + years of age.
Discussion
In this study, we present a novel algorithm to identify genetic interactions associated with the ageofonset for lung cancer. We demonstrated in two simulation studies that our ESMDR method was properly controlled for at the 5% type I error rate under the null distribution and improved power to detect causal SNPs. We identified new loci that were biologically plausible for lung cancer onset using the large OncoArrayTRICL data with 27,722 individuals. There are two unique contributions from this study. First, we offer a more computationally efficient algorithm, ESMDR, a method that analyses survival data by using Martingale residuals in place of survival outcome data. Second, ESMDR includes the ability to adjust for covariates, such as smoking status, a necessary step to control for confounding factors, whereas existing methods, used for survival analysis such as SurvMDR, are unable to provide.
Using the MDR method to reduce the size of multiple dimensions to a single dimension to identify multilocus genetic interactions in highdimensional genomic data sets has been a wellestablished approach. Richie et al. (2001) first introduced MDR, a nonparametric (i.e., no parameters are estimated) and genetic modelfree (i.e., no genetic model is assumed) model, that condensed multiple genetic loci into a single variable in order to categorize genotypes into two groups [7]. The goal was to group genotypes into highrisk and lowrisk categories associated with and without disease outcomes, respectively. However, MDR was restricted by its inability to analyze different outcome variables other than binary variables and it did not allow for the adjustment of confounding factors that was critical in preventing false association analyses. Therefore, an extension of the traditional MDR method was developed to analyze censored survival data, called Survival MDR or SurvMDR.
Like the original MDR algorithm, SurvMDR is a nonparametric and genetic modelfree method proposed by Gui et al. (2011), and it was developed to allow for the analysis of timetoevent data, such as patient survival time or time to disease relapse [13]. SurvMDR used the logrank test statistic to compare survival times between samples with and without the multilocus risk genotype combination and classified them into high and low risk groups [13]. SurvMDR also used crossvalidation to identify the optimal set of K SNPs and overall best model. While SurvMDR was successful in identifying SNP interactions associated with timetoevent outcomes, it was more computationally demanding than MDR and the inability to adjust for covariates persisted. Consequently, the MDR method was optimized further to develop the Quantitative MDR (QMDR) method to address the slowtocompute algorithm challenge [14].
QMDR optimized the MDR algorithm by offering a computationally efficient way to analyze quantitative or continuous trait outcomes. QMDR compared the mean value of each multilocus genotype to the overall mean and labeled each genotype combination as “highrisk” or “lowrisk”. Crossvalidation was also implemented in QMDR to identify the optimal set of K SNPs and overall best model. For each Kway interaction, the steps used for a kfold crossvalidation were similar to the SurvMDR method except for the step to identify the best Kway interaction. In this case, the largest Ttest statistic was used instead of the square of the logrank statistic when identifying the best interaction model. Inspired by the computational capabilities of QMDR to analyze quantitative outcomes associated with genetic variations, we leveraged this method’s straightforward computing efficiency to evaluate survival outcome data for timetoevent analysis.
Our approach transformed survival data (e.g., time and event status) into a single variable, Martingale Residuals, to use as a surrogate for timetodisease and disease status, with application of QMDR for rapid processing of genotype combinations into high and low risk groups. We were able to identify thousands of significant oneway and twoway models using ESMDR and crossvalidation when applied to the lung cancer OncoArrayTRICL data set. We were unable to compare the results of ESMDR and SurvMDR, both because SurvMDR would have taken an extensive amount of time (e.g., greater than 4 months) to conduct a genomewide genetic interaction analysis using the large OncoArrayTRICL data set, and because the current SurvMDR algorithm would not allow for adjustment of confounding factors such as smoking status.
When searching for SNP interactions using real data, we chose a twofold crossvalidation instead of a tenfold crossvalidation to evaluate the optimal oneway and twoway interaction models as described previously [14]. From the central limit theorem, assuming a sufficiently large sample size (n > 50) from a population with a finite level of variance, the mean of all samples from the same population would be approximately equal to the mean of the population. Therefore, we expected the testing scores with 400 samples from our simulation study to follow a standard normal distribution. However, Gui et al. (2013) displayed a slight right skew with a standard deviation of 1.6 in their empirical distributions that was due to extra variation introduced by overlapping training sets in their tenfold crossvalidation method [14, 21]. Furthermore, twofold crossvalidation had been advocated to perform hypothesis testing where the training folds were mutually independent with no overlap [21]. Consequently, we evaluated the optimal oneway and twoway interaction models and the overall best model using twofold crossvalidation.
We explored prediction models that included SNPs that could be used to forecast lung cancer onset. Figure 5 lays out the AUC estimates for each model. The AUC peaked around ageofonset less than 30 and greater than 90 years old. This may be due to the limited number of lung cancer cases (e.g., less than 10 cases) at younger and older ages. In general, based on AUC averages, age of lung cancer onset was strongly influenced by genetic variants, with increasing numbers of SNPs contributing to better AUC estimates. The plateauing of AUC averages for the 40–80 years old range revealed good estimates for age of onset for all models, which was likely due to the larger sample size for evaluation. Another plausible explanation for the high AUC for early and late age of onset was the likelihood that those cases contained the same combinations of risk SNPs in the models. The identified top SNPs with high AUC for age of onset were not only associated with early lung cancer cases, but they potentially could also contribute to late age of onset cases. The 2 SNP and 4 SNP models had strong associations with lung cancer cases and therefore, were responsible for high AUC averages for early and late age of onset of lung cancer. For the smoking only model, it played less of a role for early lung cancer onset because the adverse effects from smoking could require more time to develop. Over time the effects from smoking could be the main driver for late age lung cancer cases, which could explain why genetic factors do not seem to greatly effect cancer onset in later years. This interpretation could make biological sense since the effect of smoking over a longer time period could have compounding effects on cancer development. Conversely, cancer development due to genetics might appear at earlier rather than later years.
Limitations
While our novel ESMDR overcame some of the limitations described in previous methods used to evaluate genetic interactions, it was not without some of its own disadvantages. When analyzing survival data, the method did not directly evaluate survival variables such as time and event status. As a result, when using Martingale Residuals instead of specific survival outcomes data, we might be missing some important information that was needed to identify associations between SNP interactions and survival outcomes. Our QQ plot analysis from real data indicated a strong departure from the null distribution which indicated that there might exist a systematic bias. This result could be due to a combination of the large sample size and continuous outcome. As a result, we used permutation tests to evaluate the results from the OncoArrayTRICL data set. Another limitation came from over parameterizing our models, resulting in many multifactor cells with missing data [7]. This did not affect the classification of genotype combinations or identifying crossvalidation consistency of the model, however, it could affect our estimation of the prediction error [7]. Future studies would need to address this limitation. Next, we applied our ESMDR method to analyze survival outcomes using case–control studies, where estimating the agespecific incidence (e.g., ageofonset) was not typically designed for case–control studies. On the other hand, cohort studies, which are designed for survival analyses, are expensive and require a great deal of followup time to obtain ageofonset information. This could be one of the barriers in analyzing survival outcomes for large cohort studies; it could require a lot of time and resources to amass an extensive amount of data. In our study, we were able to analyze and identify potential genetic markers that predicted lung cancer risk using a large lung cancer GWAS consortium data, which could be followed up with further investigations for biological and functional significance. Due to fewer available observations of lung cancer ageofonset among younger individuals, we were limited in our ability to predict lung cancer onset for individuals 40 years and younger. With continuous efforts in recruiting participants in the OncoArrayTRICL Consortium, we might find more cases among the early onset population to better predict lung cancer risk in the future. Finally, there were no available validation data to replicate our top SNP findings because these SNPs were not likely genotyped in other GWAS data sets. Currently, there are ongoing efforts to collect external data that will include genotyping of our top SNP findings for replication.
Future studies
ESMDR is a powerful alternative to SurvMDR for identifying interactions, especially at the genomewide scale. We demonstrated its ability to identify highorder genetic interactions in simulated and real data sets. Although ESMDR addresses previous limitations of SurvMDR and other MDRlike methods, there are ways in which this method can be improved. While ESMDR had greatly improved computing efficiency, genomewide scans for interactions will still require massive computing resources, especially to analyze higherorder interactions. It will be necessary to optimize the selection of SNPs in predictive models, for example, by selecting genes known to participate in biological and metabolic pathways [22]. This can improve the predictive ability of ESMDR for two, three, and multiway interactions in a pathway analysis. Second, a future study may entail introducing variance back to Martingale Residuals by way of weighting each residual based on the timetoevent data. This can greatly improve our power for model selection without removing the efficiency of the algorithm.
Conclusions
In summary, the ESMDR method provides a way to analyze highorder interactions at the genomewide scale to advance studies of genetic interactions. We developed a new method that efficiently captures nonlinear and highorder interactions for timetoevent analysis. In general, ESMDR has improved power performance relative to SurvMDR using simulated data. Based on the noticeable trends, we are confident that with bigger sample sizes, ESMDR will continue to significantly gain in power to detect functional interacting SNPs without inflating the type I error rate. Providing new and improved methods to analyze epistasis or gene interactions may offer new opportunities to not only explain the missing heritability for complex disease risk, but can also potentially detect new genetic determinants that is important for clinical utility such as disease diagnosis and prognosis.
Availability of data and materials
The datasets generated and/or analyzed during the current study are available in the NCBI dbGaP repository, dbGaP Study Accession: phs001273.v3.p2, website: https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs001273.v3.p2.
Abbreviations
 GWAS:

Genomewide association studies
 SNP:

Single nucleotide polymorphism
 ESMDR:

Efficient Survival Multifactor Dimensionality Reduction
 HR:

Hazard ratio
 MDR:

Multifactor Dimensionality Reduction
 SurvMDR:

Survival MDR
 QMDR:

Quantitative MDR
 TRICL:

Transdisciplinary Research Into Cancer of the Lung
 MAF:

Minor allele frequency
 Cox ph:

Cox proportional hazards
 GHz:

Gigahertz
 CPU:

Central processing unit
 GB:

Gigabyte
 TB:

Terabyte
 Lasso:

Least absolute shrinkage and selection operator
 ROC:

Receiver operating characteristic
 AUC:

Area under the ROC curve
 bp:

Base pair
 ncRNA:

Noncoding transcript variant
 UTR:

Untranslated region
References
 1.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–53.
 2.
Hirschhorn JN, Daly MJ. Genomewide association studies for common diseases and complex traits. Nat Rev Genet. 2005;6:95–108.
 3.
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN. Genomewide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008;9:356–69.
 4.
Bush WS, Moore JH. Chapter 11: Genomewide association studies. PLoS Comput Biol. 2012;8:e1002822.
 5.
Maher B. Personal genomes: the case of the missing heritability. Nature. 2008;456:18–21.
 6.
GilbertDiamond D, Moore JH. Analysis of genegene interactions. Curr Protoc Hum Genet. 2011;Chapter 1:Unit1 14.
 7.
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69:138–47.
 8.
Moore JH, Williams SM. Epistasis and its implications for personal genetics. Am J Hum Genet. 2009;85:309–20.
 9.
Park MY, Hastie T. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008;9:30–50.
 10.
Andrew AS, Gui J, Sanderson AC, Mason RA, Morlock EV, Schned AR, Kelsey KT, Marsit CJ, Moore JH, Karagas MR. Bladder cancer SNP panel predicts susceptibility and survival. Hum Genet. 2009;125:527–39.
 11.
He H, Oetting WS, Brott MJ, Basu S. Power of multifactor dimensionality reduction and penalized logistic regression for detecting genegene interaction in a casecontrol study. BMC Med Genet. 2009;10:127.
 12.
Moore JH, Asselbergs FW, Williams SM. Bioinformatics challenges for genomewide association studies. Bioinformatics. 2010;26:445–55.
 13.
Gui J, Moore JH, Kelsey KT, Marsit CJ, Karagas MR, Andrew AS. A novel survival multifactor dimensionality reduction method for detecting genegene interactions with application to bladder cancer prognosis. Hum Genet. 2011;129:101–10.
 14.
Gui J, Moore JH, Williams SM, Andrews P, Hillege HL, van der Harst P, Navis G, Van Gilst WH, Asselbergs FW, GilbertDiamond D. A simple and computationally efficient approach to multifactor dimensionality reduction analysis of genegene interactions for quantitative traits. PLoS ONE. 2013;8:e66545.
 15.
Brandt A, Bermejo JL, Sundquist J, Hemminki K. Age of onset in familial cancer. Ann Oncol. 2008;19:2084–8.
 16.
Dietterich TG. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 1998;10:1895–923.
 17.
Amos CI, Dennis J, Wang Z, Byun J, Schumacher FR, Gayther SA, Casey G, Hunter DJ, Sellers TA, Gruber SB, et al. The OncoArray consortium: a network for understanding the genetic architecture of common cancers. Cancer Epidemiol Biomarkers Prev. 2017;26:126–35.
 18.
Gui J, Li H. Penalized Cox regression analysis in the highdimensional and lowsample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21:3001–8.
 19.
Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997;16:385–95.
 20.
Heagerty PJ, Lumley T, Pepe MS. Timedependent ROC curves for censored survival data and a diagnostic marker. Biometrics. 2000;56:337–44.
 21.
Bengio Y, Grandvalet Y. No unbiased estimator of the variance of kfold crossvalidation. J Mach Learn Res. 2004;5:1089–105.
 22.
Wang K, Li M, Hakonarson H. Analysing biological pathways in genomewide association studies. Nat Rev Genet. 2010;11:843–54.
Acknowledgements
The authors would like to thank all members of the Transdisciplinary Research in Cancer of the Lung (TRICL) Consortium for their data collection that made this study possible.
Funding
This publication was funded in part by the following: National Institute of Health (NIH), National Library of Medicine (NLM) Institute research grants R01LM012012 and T32LM012204, National Cancer Institute (NCI) grant U19CA203654, and the Cancer Prevention Research Institute of Texas (CPRIT) RR170048. NIHNLM R01LM012012, NCI U19CA203654 and CPRIT RR170048 funded the design, data collection, analysis and interpretation of the study and in writing the manuscript. NIHNLM T32LM012204 funded the analysis and interpretation of the data and in writing the manuscript.
Author information
Affiliations
Contributions
J.G. and C.I.A. contributed to the project design and oversight. J.G. contributed to the method development. J.G. and J.L. contributed to the implementation and benchmarking. J.L. and X.J. wrote the manuscript. J.G., C.I.A., J.L., X.J., T.A.M., and S.L. contributed to the data analysis and discussions. X.J., X.X., and J.L. contributed to the genetic annotation of identified genes. J.G., C.I.A., and J.L. contributed to the data interpretation. J.L., X.J., S.L., X.X., D.Z., E.J.D., D.C.C., M.B.S., S.M.A., S.Z., H.B., O.M., M.D.T., T.A.M., C.I.A., and J.G. contributed to the data preparation, manuscript editing, and discussion. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
The DartmouthHitchcock Health Institutional Review Board (DHH IRB) at the Office of Research Operations at DartmouthHitchcock reviewed and approved the study protocols and data analysis for the research titled: “GenomeWide Association Study of Common Cancers—The GAMEON Consortium (ONCOCHIP)”, IRB ID: CR00006213. All participants provided written informed consent and each study site approved the study protocols and data analysis from their local ethics committees.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Luyapan, J., Ji, X., Li, S. et al. A new efficient method to detect genetic interactions for lung cancer GWAS. BMC Med Genomics 13, 162 (2020). https://doi.org/10.1186/s12920020008079
Received:
Accepted:
Published:
Keywords
 Genetic interactions
 Machine learning
 Genomewide association study
 Lung cancer