High dimensional model representation of log likelihood ratio: binary classification with SNP data

Background Developing binary classification rules based on SNP observations has been a major challenge for many modern bioinformatics applications, e.g., predicting risk of future disease events in complex conditions such as cancer. The small-sample, high-dimensional nature of SNP data, the weak effect of each SNP on the outcome, and highly non-linear SNP interactions are several key factors complicating the analysis. Additionally, SNPs take a finite number of values and may be best understood as ordinal or categorical variables, but are treated as continuous by many algorithms. Methods We use the theory of high dimensional model representation (HDMR) to build appropriate low dimensional glass-box models, allowing us to account for the effects of feature interactions. We compute the second order HDMR expansion of the log likelihood ratio to account for the effects of single SNPs and their pairwise interactions. We propose a regression based approach, called linear approximation for block second order HDMR expansion of categorical observations (LABS-HDMR-CO), to approximate the HDMR coefficients. We show how HDMR can be used to detect pairwise SNP interactions, and propose the fixed pattern test (FPT) to identify statistically significant pairwise interactions. Results We apply LABS-HDMR-CO and FPT to synthetically generated HAPGEN2 data as well as to two GWAS cancer datasets. In these examples LABS-HDMR-CO enjoys superior accuracy compared with several algorithms used for SNP classification, while also taking pairwise interactions into account. FPT declares very few significant interactions in the small-sample GWAS datasets when bounding the false discovery rate (FDR) by 5%, due to the large number of tests performed. On the other hand, LABS-HDMR-CO utilizes a large number of SNP pairs to improve its prediction accuracy. In the larger HAPGEN2 dataset FPT declares a larger portion of the SNP pairs used by LABS-HDMR-CO significant.
Conclusion LABS-HDMR-CO and FPT are interesting methods to design prediction rules and detect pairwise feature interactions for SNP data. Reliably detecting pairwise SNP interactions and taking advantage of potential interactions to improve prediction accuracy are two different objectives addressed by these methods. While the large number of potential SNP interactions may result in low power of detection, potentially interacting SNP pairs, of which many might be false alarms, can still be used to improve prediction accuracy.

Keywords: Single nucleotide polymorphism, Binary classification, High dimensional model representation, Pairwise SNP interactions, Log likelihood ratio

Background
Many modern bioinformatics applications utilize data analysis methods originally developed and studied in the fields of statistics, signal processing, and machine learning. In particular, in many cases, the application can be formulated as a classification or regression problem. Data encountered in bioinformatics is typically "small-sample, high-dimensional", a challenge not encountered in many classical statistics problems and machine learning applications. For instance, to predict the risk of a person being diagnosed with a specific complex disease in the future, or to predict the effect of a treatment, e.g., for targeted therapy, one may collect several hundred thousand single nucleotide polymorphisms (SNPs) in a sample of several hundred or a few thousand patients with known labels. Although current "omics" data provide a deluge of information per sample point, research being restricted to small sample sizes impedes reliable analysis. While being a small-sample high-dimensional problem is typical of many bioinformatics applications, it is more pronounced when analyzing SNPs as (a) current technologies measure hundreds of thousands of SNPs, and (b) the sample size can easily be much smaller than the number of disease associated SNPs. Furthermore, many molecular features, and in particular SNPs, can be weak markers, meaning each individual feature alone cannot reliably predict the disease outcome, and a large collection of features needs to be considered together to obtain reliable predictions. Additionally, biological features are typically heavily dependent, for instance due to linkage disequilibrium (LD), and have complex interactions, which may exacerbate the difficulties in developing accurate prediction rules. Finally, note that interpretability is an important aspect of biological research.
Not only do we look for biological markers and prediction rules with high accuracy, but also require a glass-box model that can explain "how" and "why" the prediction rule has come to a specific decision. For example, that a specific mutation increases cancer risk by some amount, or presence of a combination of mutations is an indicator of high risk.
To that end, many pipelines implement a first phase of feature selection to reduce dimensionality, improve replicability, and increase prediction accuracy. It has been shown in many studies that such an approach is indeed valuable and useful in practical applications [1][2][3][4]. Additionally, penalized methods, such as those using LASSO or elastic net penalties, are widely used. Although feature selection as a means of dimensionality reduction is helpful, it is not always sufficient. For instance, the number of disease associated SNPs passing the selection stage can still be too large compared with the sample size.
Due to the large number of biological markers, their complex interactions, the need for interpretability, and the relative lack of large datasets as compared with other machine learning applications, it is typically desired to use low dimensional generative models. The idea is that although the "optimal" rule can be highly complex, it can be well approximated by a low dimensional model, and a proper low dimensional family, for instance generalized linear models (GLMs) with logit or probit links, is large enough to contain a point close to the best low dimensional representation. Thereby, such approximation will avoid over-fitting, improving prediction reliability and accuracy.
SNPs are among the most challenging biological features to analyze. Indeed, a single mutation in the deoxyribonucleic acid (DNA) might not greatly impact the risk of a complex disease. Therefore, it is reasonable to assume mutations are weak markers that should be jointly studied to arrive at a reliable decision rule. For instance, GLMs with logit link that take dosage data, i.e., the number of minor alleles at each SNP, are a very popular, if not the most popular, method for binary classification given SNP data. Refer to [3][4][5][6][7][8] for such examples. Additionally, GLMs can be used with sparsity inducing penalties, such as elastic net, and can include product terms of two SNPs to account for their interactions. However, in many cases it is not possible to use all strong pairwise interactions in the data for classification. For instance, given 1000 disease associated SNPs, there are about 500,000 SNP pairs that can be considered in the classification rule. In the absence of biological information or a set of known interacting SNP pairs, it is typically not computationally feasible to consider all possible pairwise interactions and use sparsity inducing penalties such as LASSO and elastic net. Furthermore, such formulations may enforce a linear risk function for many SNPs, implying that a SNP with two minor alleles should induce a risk twice that of the same SNP with one minor allele, which might not be a valid assumption.
For example, for a recessive SNP we may have that only the presence of two minor alleles increases the risk, while one minor allele has no effect on the risk. We would like to emphasize that we found little discussion studying how valid the logit link assumption with linear additive risks is in practice. Support vector machines (SVM), random forests, k nearest neighbors (kNN), and naive Bayes are other methods used for binary classification of SNP data [3,4,6,7]; however, they are not as popular as GLMs, even though in many cases they are suggested to outperform GLMs. Note that many of these methods, such as SVMs and kNN, require a notion of distance, meaning they interpret the number of minor alleles as real numbers, although it might be more suitable to treat dosage data as ordinal variables. Treating dosage data as real-valued variables, although they take a finite number of values, whether understood as ordinal or categorical variables, may be a reason why many off-the-shelf methods, even when they outperform GLMs, still do not perform adequately on SNP data.
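To make the recessive-SNP point concrete, the following toy sketch (all risk probabilities hypothetical) contrasts a recessive risk profile with the risks a logit model linear in dosage is forced to produce: matching the risks at dosages 0 and 2 necessarily overestimates the risk at dosage 1.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical recessive SNP: disease probability by dosage (minor-allele count).
# Dosages 0 and 1 carry the same baseline risk; only dosage 2 raises it.
true_risk = {0: 0.10, 1: 0.10, 2: 0.40}

# A logit-linear model in dosage, L(x) = b0 + b1*x, forces the log-odds to
# change by the same amount from 0->1 as from 1->2, which cannot match the
# recessive pattern above.
b0 = math.log(0.10 / 0.90)               # fit the dosage-0 risk exactly
b1 = 0.5 * (math.log(0.40 / 0.60) - b0)  # split the 0->2 log-odds change evenly
linear_risk = {x: sigmoid(b0 + b1 * x) for x in (0, 1, 2)}

for x in (0, 1, 2):
    print(x, true_risk[x], round(linear_risk[x], 3))
```

Here the linear model assigns roughly double the true risk at dosage 1, even though it fits dosages 0 and 2 exactly; an indicator-based coding avoids this constraint.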
Here we use the theory of high dimensional model representation (HDMR) to find the "best" second order approximation of the log likelihood ratio, and solve it for the case of categorical observations, which we believe is a better approach to model SNP data than treating dosage values as real numbers. By considering a second order expansion we can account for the effect of single SNPs and pairwise SNP interactions, where by SNP interactions we understand the non-zero terms in the second order HDMR expansion of the log likelihood ratio borrowing from two SNPs, which is explained in more detail in the "Methods" section. Additionally, we propose linear approximations, based on objectives studied in compressed sensing, to approximate the second order HDMR expansion. We use the Sobol indices, an extension of the R^2 statistic that is closely connected to the HDMR expansion, to compute statistics indicating whether there is a significant interaction for a specific SNP pair value. We apply the developed methods to simulated data based on the HAPGEN2 project [9], as well as to lung and breast cancer datasets, showing that the proposed methods enjoy higher classification accuracy while efficiently detecting strong pairwise SNP interactions.

Methods
Here we describe our classification methodology based on the High Dimensional Model Representation (HDMR) expansion, studied in detail in [10][11][12][13]. We first briefly review the general theory, how it applies to a binary classification problem, and how categorical observations simplify the process.

High dimensional model representation
HDMR is a powerful tool to represent a function of a random vector based on marginal observations. HDMR provides us with a hierarchy of functions that describe how the interactions of variables affect the output. In particular, assuming output Z is a function of input random vector X = [X_1, ..., X_D], i.e., Z = f(X), HDMR decomposes f(X) based on partial observations. Let F = {1, ..., D}. The HDMR expansion of Z is the collection of functions f_u(X_u) for all u ⊆ F such that

f(x) = Σ_{u⊆F} f_u(x_u),   (4)

under the condition that

∫ f_u(x_u) g(x_v) dμ(x) = 0 for all v ⊊ u and all functions g of x_v,

where for each x ∈ R^D, x_u is the restriction of x to elements in u, x_{−u} is the restriction of x to elements not in u, and μ is the probability measure of random vector X described by probability density function (p.d.f.) w [12]. Note this condition is equivalent to a hierarchical orthogonality criterion of the following form [12]:

∫ f_u(x_u) f_v(x_v) dμ(x) = 0 for all v ⊊ u.

Therefore, via the HDMR expansion we may write

f_u(x_u) = E[Z | X_u = x_u] − Σ_{v⊊u} f_v(x_v) − Σ_{v⊆F: v∩u≠∅, v⊄u} E[f_v(X_v) | X_u = x_u].   (6)

Equation (6) suggests that in the general case of dependent variables a component function f_u(x_u) depends on all other expansion terms that also have a non-empty intersection with u. However, assuming elements of X are independent, the last term of (6) equals zero and we may write

f_u(x_u) = E[Z | X_u = x_u] − Σ_{v⊊u} f_v(x_v).

While this greatly simplifies the process of computing the HDMR expansion, the independence assumption is too strong for SNPs, as they can be heavily correlated.
Observe that by considering sets u such that |u| ≤ d in (4) we arrive at the d-th order HDMR expansion of Z, which we hereafter denote by E_d(Z|X).
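As an illustration of the recursion above, the following sketch applies the independence-case formula f_u(x_u) = E[Z|X_u = x_u] − Σ_{v⊊u} f_v(x_v) to a toy function of two independent binary inputs, with conditional means estimated by Monte Carlo; the function and coefficients are hypothetical.

```python
import itertools
import random

random.seed(0)
xs = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(20000)]

def f(x):
    # toy output with an explicit pairwise interaction term 3*x0*x1
    return x[0] + 2 * x[1] + 3 * x[0] * x[1]

zs = [f(x) for x in xs]

def cond_mean(u, vals):
    """Monte Carlo estimate of E[Z | X_u = vals] for index tuple u."""
    sel = [z for x, z in zip(xs, zs) if all(x[i] == v for i, v in zip(u, vals))]
    return sum(sel) / len(sel)

# Recursion valid under independence: f_u = E[Z|X_u] - sum of lower-order terms.
f0 = sum(zs) / len(zs)                                    # f_emptyset = E[Z]
f1 = {i: {v: cond_mean((i,), (v,)) - f0 for v in (0, 1)} for i in (0, 1)}
f12 = {(a, b): cond_mean((0, 1), (a, b)) - f0 - f1[0][a] - f1[1][b]
       for a, b in itertools.product((0, 1), repeat=2)}

print(round(f12[(1, 1)], 2))  # nonzero: the expansion detects the interaction
```

The second order component f12 is nonzero precisely because of the 3*x0*x1 term; dropping it from f makes f12 vanish up to Monte Carlo noise.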

Second order HDMR for categorical observations
Now, additionally consider the case where X is a categorical random vector. In particular, for each f ∈ F, X_f is a categorical random variable with support C_f. In other words, C_f is the collection of categories X_f may take. Now, assuming X is categorical, the domain of f_u(x_u) is the finite collection of all combinations in C_u = ∏_{f∈u} C_f. Therefore, we may write the second order expansion as

E_2(Z|X) = b_0 + Σ_{f∈F} Σ_{q∈Q_f} b_f^q q(X_f) + Σ_{f_i<f_j} Σ_{q_i∈Q_{f_i}} Σ_{q_j∈Q_{f_j}} b_{f_i f_j}^{q_i q_j} q_i(X_{f_i}) q_j(X_{f_j}),   (10)

where Q_f is a family of indicator functions on C_f. Here our goal is to analyze SNP data when they are reported in dosage, i.e., for each SNP f we report the number of minor alleles, and hence C_f = {0, 1, 2}. We hereafter mainly focus on this special case to outline the procedure for estimating the coefficients b_f^q and b_{f_i f_j}^{q_i q_j} up to an affine transformation; however, the algorithms developed are more general. Note, with a little abuse of terminology, we use SNP to refer to the categorical dosage value, being the number of minor alleles. Our specific choice of Q_f is {1_{X_f≥1}, 1_{X_f=2}}, which will be made clear later. Note a deeper discussion on the HDMR expansion using extended bases can be found in [11,14,15].
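The evaluation of the expansion in Eq. (10) can be sketched as follows; the coefficients b below are hypothetical placeholders, chosen purely to show how the extended basis Q_f = {1[X_f ≥ 1], 1[X_f = 2]} enters the computation.

```python
# Minimal sketch of evaluating a second order expansion in the style of
# Eq. (10) for dosage SNPs, using the extended indicator basis
# Q_f = {1[X_f >= 1], 1[X_f = 2]} rather than one indicator per category.
def basis(dosage):
    """Map a dosage value in {0, 1, 2} to the two indicator features."""
    return (1 if dosage >= 1 else 0, 1 if dosage == 2 else 0)

def e2(x, b0, b_single, b_pair):
    """b_single[(f, q)] and b_pair[(fi, fj, qi, qj)] are HDMR-style weights."""
    q = {f: basis(v) for f, v in enumerate(x)}
    out = b0
    out += sum(b_single.get((f, k), 0.0) * q[f][k] for f in q for k in (0, 1))
    out += sum(b_pair.get((fi, fj, ki, kj), 0.0) * q[fi][ki] * q[fj][kj]
               for fi in q for fj in q if fi < fj
               for ki in (0, 1) for kj in (0, 1))
    return out

# Hypothetical example: SNP 0 acts dominantly (weight on 1[X>=1]), SNP 1 acts
# recessively (weight on 1[X=2]), plus one pairwise interaction term.
b_single = {(0, 0): 1.0, (1, 1): 2.0}
b_pair = {(0, 1, 0, 1): -0.5}
print(e2([1, 2], 0.0, b_single, b_pair))  # 1.0 + 2.0 - 0.5 = 2.5
```

Note how dominant and recessive effects each load on a single basis element, which is the motivation for this specific choice of Q_f.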

HDMR expansion for binary classification
Here we describe how HDMR can be used for a binary classification problem. Consider a binary classification problem with class labels y = 0, 1 and feature index set F. Let X be a random unlabeled observation with true label y_x. Given a labeled training sample S, it is desired to design a decision rule that assigns a label ŷ_x to X so that ŷ_x = y_x with high probability. Note, given the full joint distribution of X and y_x, one could have easily computed P(y_x = 1|X), or equivalently the log likelihood ratio L(X) = log(P(y_x = 1|X)/P(y_x = 0|X)), and used the decision rule ŷ_x = 1_{L(X)>T}, where 1_q is the indicator function of statement q being correct, and T is a threshold.
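As a toy illustration of this rule, suppose the joint distribution were known for a single pattern of two binarized features; all probabilities below are hypothetical.

```python
import math

# With the full joint distribution known, the likelihood-ratio rule
# y_hat = 1[L(X) > T] is directly computable. Hypothetical class-conditional
# probabilities over a pattern (x0, x1) of two binarized SNPs:
p1 = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.5}  # class 1
p0 = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}  # class 0

def log_lr(x):
    """Log likelihood ratio L(X) for equal class priors."""
    return math.log(p1[x] / p0[x])

T = 0.0  # threshold; equal priors and 0/1 loss give T = 0
for x in sorted(p1):
    print(x, round(log_lr(x), 2), int(log_lr(x) > T))
```

Only the pattern (1, 1), which is five times likelier under class 1, is labeled 1; the pattern (1, 0), equally likely under both classes, carries no evidence and sits exactly at the threshold.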
However, the full joint distribution is typically not available, and is usually estimated given a training sample S. Alternatively, many models assume the classification rule belongs to a family of rules parametrized by θ, and aim to estimate θ given S. For example, a GLM using the famous logit link assumes L(X) = β_0 + Σ_{f∈F} β_f X_f, where X_f is the value of X for feature f, and θ is the collection of β_0 and the β_f's. However, such a model is insufficient for many applications where it is desired to account for pairwise SNP interactions, and it is not easy to train using LASSO and elastic net penalties while accounting for pairwise interactions by adding terms of the form X_{f_i} X_{f_j} to the GLM. Here we develop an algorithm based on observations from the second order HDMR expansion of the log likelihood ratio, i.e., Z = L(X).
Following the derivation of the HDMR expansion in (10), we only need to substitute Z = L(X) to obtain the HDMR expansion. Now, to compute the HDMR expansion, we need the full joint distribution. However, we are almost never given the true underlying distribution parameters, and are given the training sample S instead.
Note when working with SNP data, the number of disease associated SNPs that may affect the output may be larger than the sample size. Therefore, we may not be able to arrive at a well-defined set of distribution parameter estimates, let alone hope that the estimates are accurate. Hence, in the following, we present an algorithm to compute the approximate second order HDMR expansion of the log likelihood ratio directly, without computing the distribution parameters, for categorical dosage SNP data.

Sobol indices, HDMR expansion, and variable selection
The Sobol indices are an extension of the R^2 statistic, and can be used for global sensitivity analysis [16]. They essentially quantify the portion of variance explained by each set of variables. For each set of features u, the total effect Sobol index and the main effect Sobol index are respectively defined as

S_u = Var(E[Z|X_u]) / Var(Z),   S_u^c = 1 − Var(E[Z|X_{−u}]) / Var(Z).

In other words, the total effect Sobol index describes the portion of variance that the set u explains, and the main effect Sobol index describes the amount of information present in u that is not present in any other feature. The approximate second order HDMR expansion of Eq. (10) can also be used to analyze the extent of the effect of each feature and feature pair on the class labels. Note that if a feature is independent of the class labels then S_{X_f} = 0, and if two features f_i and f_j do not have any interactions, i.e., the second order HDMR component for the pair is identically zero, then S_{X_{f_i},X_{f_j}} = 0. The use of S_{X_f} for variable selection and its connection to other methodologies are discussed below. We then study how the exact HDMR expansion motivates analyzing feature pairs, and study the special case of categorical features.

The Sobol indices and feature filtering
In the feature selection and biomarker discovery literature, univariate filters, or filters for short, refer to the family of feature selection algorithms that assess each feature individually and assign it a score, which is then used for selecting a subset of features [17,18]. Filters are fast, but do not take feature dependencies into account. Note in this taxonomy, univariate hypothesis tests, such as the t-test, equipped with multiple testing correction are an example of filters. Other methodologies used for feature selection include multivariate filters, wrappers, and embedded methods, which are studied in more detail in [1,17-19]. Here we study how the Sobol index for each single feature reduces to filtering, and how it connects to other filter methods. Let u = {f} be the set containing the single feature f. We have

S_f = Var(E[Y|X_f]) / Var(Y),   S_f^c = 1 − Var(E[Y|X_{−f}]) / Var(Y).

In other words, the total effect Sobol index measures how much information each feature contains about Y, and the main effect Sobol index measures how much information f carries about Y that is not present in any other feature.
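A simple plug-in estimate of S_f for a categorical feature can be sketched as follows; the conditional label probabilities defining the toy data are assumptions for illustration.

```python
import random

random.seed(1)

# Toy data: a dosage feature in {0, 1, 2} and a binary label Y drawn from a
# hypothetical model P(Y=1 | X=x).
n = 20000
data = []
for _ in range(n):
    x = random.choice([0, 1, 2])
    p = [0.2, 0.3, 0.6][x]                     # assumed P(Y=1|X=x)
    data.append((x, 1 if random.random() < p else 0))

def s_index(data):
    """Plug-in estimate of S_f = Var(E[Y|X_f]) / Var(Y)."""
    ys = [y for _, y in data]
    ybar = sum(ys) / len(ys)
    var_y = ybar * (1 - ybar)                  # variance of a Bernoulli label
    groups = {}
    for x, y in data:
        groups.setdefault(x, []).append(y)
    # Var(E[Y|X_f]) via group means, weighted by group frequency
    var_cond = sum(len(g) / len(data) * (sum(g) / len(g) - ybar) ** 2
                   for g in groups.values())
    return var_cond / var_y

print(round(s_index(data), 3))
```

Under an uninformative feature (constant P(Y=1|X)) the estimate is near zero, so thresholding this statistic implements the filter discussed next.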
Here we prefer to use S_f over S_f^c for three major reasons: (1) due to the dependencies among biological features, and since each individual feature might only have a very small impact on Y, the main effect index might be small for all features; (2) in a high-dimensional setting, looking at all features but one may cause over-fitting, making it impossible to reliably measure S_f^c; and (3) the computation cost of measuring S_f^c for all features can be infeasible. Looking at S_f, we would like to select the features which affect the output the most, i.e., have large S_f's, which we may formulate as the following hypothesis test:

H_0: S_f = 0  versus  H_1: S_f > 0.   (16)

Note that S_f = 0 if and only if f is independent of Y. Therefore, the null of the hypothesis test of Eq. (16) can be reformulated as follows: (1) given f, no decision rule with less error than a random decision can be built; (2) Y and f are independent; (3) f has the same distribution in both classes.
Note the first formulation has been used to develop methods that train a classifier/regression model and aim to assess whether it outperforms a random decision, for instance using a generalized linear model (GLM) with logit link and linear model β_0 + β_f X_f for each f, and outputting the p-value for β_f = 0. The second formulation leads to using independence criteria, such as the Hilbert-Schmidt independence criterion, with the null assuming Y and f are independent. Finally, the last formulation, which may be the most popular one, leads to hypothesis tests that aim to verify whether the class-conditioned distributions are different. The Kolmogorov-Smirnov (KS) test and the Wilcoxon rank sum test are examples of such tests. Note that different formulations give rise to different nulls and hence different p-values. However, under certain assumptions one may be able to identify different tests with each other. For example, assuming (independent) Gaussian features with equal variances in both classes, linear discriminant analysis (LDA) performs better than a random decision if and only if f has different means in the two classes, which is exactly what the Student's t-test measures. Additionally, assuming features are independent and Gaussian with unknown means and variances, quadratic discriminant analysis (QDA) performs better than a random decision if f does not have the same mean and variance in both classes, which is exactly what the likelihood ratio test of [20] measures; this is studied in more detail in [21].
Fisher's exact test and the χ2-test are two popular hypothesis tests used to determine whether two categorical features have similar distributions in both classes. The recently proposed optimal Bayesian filter (OBF) [22] directly measures the sample-conditioned probability of a categorical variable having distributional differences across the two classes, and is suggested to enjoy superior performance compared with several other selection algorithms used for identifying disease associated SNPs [22]. Note we will later use OBF for feature selection in our pipeline.

Pairwise SNP interactions
Suppose f_i and f_j are categorical random variables, and we would like to see if a specific pattern of the form 1_{X_{f_i}=c_i & X_{f_j}=c_j} carries significant information not available when each feature is considered individually. In other words, we would like to test whether the second order HDMR term corresponding to the feature pair f_i, f_j is non-zero. Let

ρ_y = corr(1_{X_{f_i}=c_i}, 1_{X_{f_j}=c_j} | Y = y), for y = 0, 1.

Note that the second order HDMR term of the log likelihood ratio borrowing from both features vanishes when the pattern indicators are equally correlated in the two classes. Therefore, we would like to see if ρ_0 ≠ ρ_1. Thereby, by a pairwise SNP interaction we understand the nonzero second order HDMR terms that involve two distinct SNPs, i.e., are not present in the first order expansion.
Here we see that such terms correspond to SNP pairs with unequal correlations between the two classes. Note that a linear additive model looking only at individual SNPs is not sufficient to compute the log likelihood ratio for such correlated categorical features. This is also in line with previous definitions proposed for quantifying SNP interactions, e.g., [6,23]. Now, to compute the p-value associated with each fixed SNP pair pattern, we use Fisher's r-to-z transformation and approximate the null distribution by the standard normal distribution. We compute the statistics

z_y = (1/2) log((1 + ρ̂_y) / (1 − ρ̂_y))   (19)

for y = 0, 1, where ρ̂_y is the estimate of the correlation coefficient from the data in class y. We then compute

Z_{f_i,f_j} = (z_1 − z_0) / sqrt(1/(n_1 − 3) + 1/(n_0 − 3)),

where n_y is the sample size in class y. Z_{f_i,f_j} approximately follows the standard normal distribution under the null, which we use to compute p-values. We hereafter call this method the fixed pattern test (FPT).
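The test statistic can be sketched in a few lines; the correlation values and sample sizes below are hypothetical, and the standard normal CDF is evaluated via the error function.

```python
import math

def fpt_pvalue(rho0, rho1, n0, n1):
    """Fixed pattern test sketch: compare the correlation of two indicator
    patterns across classes via Fisher's r-to-z transformation."""
    z0 = 0.5 * math.log((1 + rho0) / (1 - rho0))   # arctanh(rho0), Eq. (19)
    z1 = 0.5 * math.log((1 + rho1) / (1 - rho1))
    z = (z1 - z0) / math.sqrt(1 / (n0 - 3) + 1 / (n1 - 3))
    # two-sided p-value under the standard normal null
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical pair: near-zero correlation in controls, strong in cases.
z, p = fpt_pvalue(rho0=0.05, rho1=0.45, n0=400, n1=400)
print(round(z, 2), p < 0.05)
```

With equal correlations the statistic is exactly zero and the p-value is one, matching the null of no pairwise interaction.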

Algorithm
Here we describe the algorithm developed for classifier design.

Initial filtering
Here we outline the pre-processing we do on the categorical SNP data. When working with SNP data, dosage values, i.e., the number of minor alleles, are often reported, taking values in {0, 1, 2}. Note that many SNPs can be dominant or recessive. A dominant SNP is one for which one minor allele behaves similarly to two minor alleles, and a recessive SNP is one for which one minor allele behaves similarly to no minor alleles. Therefore, we do a first phase of pre-processing, and transform each SNP into two "binarized SNPs". For each SNP, we create two auxiliary features, one the indicator of the presence of a minor allele, and the other the indicator of two minor alleles. This way, if only the first auxiliary feature is used then we are dealing with a dominant SNP, if only the second auxiliary feature is used we have a recessive SNP, and if both are used the SNP does not fall into only one of the two categories; for instance, we may have an additive SNP. Examples of different SNP models and a discussion on them can be found in [24,25]. By this pre-processing, if k SNPs are observed, we have |F| = 2k binarized SNPs. We hereafter assume all SNPs are already binarized, i.e., for each feature f ∈ F we have C_f = {0, 1}, instead of using the family of constraints Q_f. Note that we only do so for notational convenience.
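The binarization step can be sketched as:

```python
# Sketch of the binarization pre-processing: each dosage SNP in {0, 1, 2}
# becomes two binary features, 1[X >= 1] (dominant-style coding) and
# 1[X = 2] (recessive-style coding).
def binarize(dosages):
    out = []
    for d in dosages:
        out.extend((1 if d >= 1 else 0, 1 if d == 2 else 0))
    return out

# k observed SNPs yield |F| = 2k binarized SNPs:
print(binarize([0, 1, 2]))  # -> [0, 0, 1, 0, 1, 1]
```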
As mentioned above, given that many studies measure several hundred thousand to a few million SNPs, not all of them can be directly input to a classifier, particularly since we aim to account for pairwise SNP interactions, i.e., the second order terms of the HDMR expansion. A first phase of filtering is typically inevitable to reduce the computational complexity of classifier design, i.e., of training the classification algorithm. Here we use the recently proposed optimal Bayesian filter (OBF) [22] to rank binarized SNPs, and we pick the top D features for classifier design. Note D is a parameter that can be determined later through cross validation, or be chosen as a large value so that most disease associated SNPs are captured. Note we have chosen OBF as it has outperformed many of the currently used methods for selecting disease associated SNPs, including the popular χ2-test [22].

Classification algorithm
Here we design the classifier used to label observations. Note here we are using binarized SNPs as our features. In the training, given the labeled training data and the binarized SNPs selected in the pre-processing step, the algorithm assigns coefficients to all SNPs and SNP pairs. Note D might be large, and given D features, there are 0.5D(D − 1) SNP pairs to use. Furthermore, there may be individual SNPs or SNP pairs that are so weak that not including them in the final classifier might actually improve performance. Therefore, in our classifier design we remove features and feature pairs that are too weak. The classifier design process can be broken into the following steps: (1) feature pair construction, (2) removing weak features, (3) removing weak feature pairs, (4) merging feature pairs into blocks, and (5) estimating classifier parameters and obtaining the risk function.

Feature pair construction
Given that we have binarized SNP values, for two features f_i and f_j we have four feature pairs of the form 1_{X_{f_i}=c_i & X_{f_j}=c_j}, with c_i, c_j ∈ {0, 1}; each such indicator is hereafter called a feature pair. Figure 1 illustrates how feature pairs are constructed.

Removing weak features
Here we remove features that are too weak to be used in the classifier. For each feature f, we find the risk associated with it, r_f. Features for which r_f < T_1 are removed, i.e., are assigned zero weight in the classification rule, where T_1 is a threshold chosen through cross validation.

Removing weak feature pairs
We then filter out feature pairs that are too weak. We again follow a procedure similar to our process for removing single features. For each feature pair f_i and f_j and pattern c_i c_j, define r_{f_i f_j}^{c_i c_j} as the risk associated with the pattern. Again, feature pairs for which r_{f_i f_j}^{c_i c_j} < T_2 are removed, i.e., are assigned zero weight in the classification rule. Figure 2b illustrates this process for feature pairs.
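A sketch of this filtering step follows; the frequency-ratio form of the risk below is an assumption for illustration, with the exact definition given in the paper's notation.

```python
import math

# Sketch of the weak-pair filter: score each feature-pair pattern by the log
# ratio of the pattern's empirical frequency in class 1 versus class 0, and
# drop pairs whose score falls below a threshold T2.
def pair_risk(X1, X0, i, j, eps=1e-6):
    f1 = sum(1 for x in X1 if x[i] == 1 and x[j] == 1) / len(X1)
    f0 = sum(1 for x in X0 if x[i] == 1 and x[j] == 1) / len(X0)
    return math.log((f1 + eps) / (f0 + eps))   # eps guards empty patterns

X1 = [(1, 1), (1, 1), (1, 0), (0, 1)]   # class 1 observations (binarized)
X0 = [(1, 1), (0, 0), (0, 1), (0, 0)]   # class 0 observations
r = pair_risk(X1, X0, 0, 1)
T2 = 0.3
print(round(r, 3), "keep" if abs(r) >= T2 else "drop")
```

Here the pattern (1, 1) is twice as frequent in class 1, so the pair survives the threshold and receives a weight in the classifier.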

Feature block construction
Recall given D features, there are 0.5D(D−1) feature pairs. Although our filtering removes many feature pairs, there still could be too many feature pairs to be easily used for classifier design. In addition, since (a) SNPs may be heavily correlated, (b) a feature may occur in many feature pairs, and (c) the binarization scheme described in the pre-processing step creates two binary features for each SNP, the binary features might be heavily correlated, creating dependencies among feature pairs. In other words, given that a specific pattern for a feature pair is observed, one may be able to estimate the value of many other feature pairs. For example, given we observe 1 for a specific feature pair that uses 1 X f i =2 , we can easily conclude all feature pairs that assume 1 X f i =2 are zero.
These dependencies exacerbate classifier design. Therefore, we propose the following procedure to reduce the number of weights to estimate. Note we could have used any other community detection algorithm instead; however, we observed the following procedure works well and enjoys a low computation cost. For each feature f_i and pattern c_i c_j we consider blocks of feature pairs of the forms P_{c_i,c_j}^{f_i} and N_{c_i,c_j}^{f_i}, where the P's and N's are collections of risk increasing and risk decreasing feature pairs, respectively. Figure 3 depicts this process. Afterwards, given an observation X, we report the ratio of feature pairs in each block which take value one. Among the constructed blocks, we again remove "weak blocks": for each block A, irrespective of being risk increasing or risk decreasing, we compute r_A, the logarithm of the expected ratio of observed patterns of block A in class 1 versus class 0. We then remove blocks for which r_A < T_3. Again, T_3 is a threshold that will be chosen through cross validation.
We observed this approach to (a) reduce the number of parameters to estimate when D is large, and (b) improve prediction performance. Note that this approach in effect assumes the feature pairs within a block share a common weight in the second order expansion. Here, in the simulations, we observed such an assumption improves classification performance when features, i.e., SNPs, are correlated, but we leave a mathematical analysis of its effect on classifier performance for future work. Now, given observation X, we construct the vector V(X), comprised of each feature value and the ratio of observed patterns in the risk increasing and risk decreasing blocks. The vector V(X) shall be used in the next section to assign a "risk" to observation X.
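The block summary entering V(X) can be sketched as follows, with hypothetical pair risks standing in for the filtered values of the previous step.

```python
# Sketch of block construction: surviving feature pairs that share a feature
# f_i and pattern are grouped into a risk-increasing block P (r > 0) and a
# risk-decreasing block N (r < 0); an observation is then summarized by the
# fraction of pairs in each block whose pattern is active.
pair_risks = {              # (f_i, f_j) -> risk r for the surviving pattern
    (0, 1): 0.9, (0, 2): 0.4, (0, 3): -0.7, (0, 4): -0.2,
}
P = [p for p, r in pair_risks.items() if r > 0]   # risk increasing block for f_0
N = [p for p, r in pair_risks.items() if r < 0]   # risk decreasing block for f_0

def block_ratio(block, x):
    """Fraction of feature pairs in the block whose pattern is observed."""
    active = sum(1 for (i, j) in block if x[i] == 1 and x[j] == 1)
    return active / len(block)

x = [1, 1, 0, 1, 0]         # a binarized observation
print(block_ratio(P, x), block_ratio(N, x))  # -> 0.5 0.5
```

Each block thus contributes a single entry of V(X), replacing one estimated weight per pair with one per block.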

Estimating classifier parameters
To complete our classifier construction, we need to estimate the HDMR coefficients. Note, given our construction of blocks in the previous section, we now only need to find a vector b so that we may write E_2(L(X)|X) ≈ E(L(X)) + b · V(X). However, we may be dealing with an ill-posed problem due to the number of coefficients to estimate being larger than the sample size. Note that although the HDMR expansion of the log likelihood ratio is unique, we mostly compare it with a threshold to assign a label to a newly observed point. Therefore, in many cases it is acceptable to work with an affine transformation of the log likelihood ratio. In other words, although not being able to find the exact second order HDMR expansion of the log likelihood ratio is not desirable, it is not catastrophic either, as any increasing affine transformation of the log likelihood ratio can be used as an equally good decision rule.
To circumvent the ill-posed problem, we use an objective function which is a variation of the objective functions studied in the compressed sensing literature [26] that aim to estimate a sparse signal given 1-bit quantized observations; in other words, optimization problems and formulations that aim to estimate a vector a from n observations of the form (x_i, y_i), i = 1, ..., n, where y = sign(a · x) and "·" denotes the inner product. Connections between these objectives and a convex relaxation of the logistic regression problem are discussed in [27]. Extensions that additionally consider noisy measurements in the form of random flips, i.e., y = sign(a · x) with probability 1 − ε and y = −sign(a · x) with probability ε, are studied in [27]. Finally, [28] studies extensions to non-Gaussian features. Note in [27] it is shown that, up to a constant, the error in recovering the signal a matches the minimax error for the unquantized compressed sensing problem. We use the following optimization problem to solve for the weights we wish to use:

b* = argmax_{‖b‖_2 ≤ 1, ‖b‖_1 ≤ s} [ (1/|S_1|) Σ_{X∈S_1} b · V(X) − (1/|S_0|) Σ_{X∈S_0} b · V(X) ],
where S y is the portion of data in class y. Figure 4 depicts how b * is selected given vectors V (X) for the training data.
Heuristically speaking, given a feature vector comprised of log likelihood ratios of partial observations x_u, here we find the weights that maximize the distance between the average points of the two classes. The heuristic for using such an objective is as follows: the HDMR expansion obtains the weights that result in the "best" low dimensional representation, i.e., the minimum mean square error (MSE) estimate of the log likelihood ratio. The underlying reason we do so is that we believe the HDMR expansion of the log likelihood ratio gives us a model that enjoys a low prediction error. On the other hand, weights that maximize the distance between the projections of the center points of the two classes onto a one dimensional space should also yield a low prediction error. Hence, such an objective should result in a model that is close to the HDMR expansion. Note that in special cases, for instance independent Gaussian features with equal variances in both classes, we can actually prove that such an approach minimizes the prediction error. Given b* we have everything we need for classification. Given a new observation X we find R(X) = b* · V(X), and we assign the class label Ŷ = 1_{R(X)>T}, for a threshold T. Note the thresholds T_1, T_2, T_3, and T are parameters of the model, and will be selected through the validation process, for instance by cross validation. We hereafter call the resulting classifier built for categorical X linear approximation for block second order HDMR expansion of categorical observations (LABS-HDMR-CO). The pseudo-code of LABS-HDMR-CO is provided in Algorithm 1.
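Under only the L2 constraint, the mean-separation objective has a closed-form solution, which the following sketch uses; the V(X) vectors are toy values, and the sparsity (L1) constraint of the actual objective is omitted here.

```python
import math

# Sketch of the weight estimation step: maximizing
# b . (mean(V | class 1) - mean(V | class 0)) over the unit L2 ball gives
# b* = delta / ||delta||, where delta is the difference of the class means.
V1 = [[1.0, 0.5, 0.0], [0.8, 0.7, 0.1]]   # V(X) vectors, class 1 (toy values)
V0 = [[0.2, 0.1, 0.4], [0.0, 0.3, 0.5]]   # V(X) vectors, class 0

def mean(vs):
    return [sum(col) / len(vs) for col in zip(*vs)]

delta = [a - b for a, b in zip(mean(V1), mean(V0))]
norm = math.sqrt(sum(d * d for d in delta))
b_star = [d / norm for d in delta]

def risk(v):
    """R(X) = b* . V(X), compared against a threshold T to assign a label."""
    return sum(bi * vi for bi, vi in zip(b_star, v))

print([round(risk(v), 3) for v in V1 + V0])  # class 1 scores exceed class 0
```

Adding the L1 constraint favors sparse weight vectors, which is the connection to the 1-bit compressed sensing objectives discussed above.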

Results
Here we use a model developed to generate SNP data to evaluate the performance of LABS-HDMR-CO, and compare it with several popular methods used for binary classification. We consider three datasets: a dataset based on the HAPGEN2 project, a lung cancer dataset, and a breast cancer dataset. OBF takes as input π(f), the prior probability that a SNP is disease associated, and the hyperparameter α describing the Dirichlet prior on each categorical SNP value. We assume π(f) is constant across features, hence not affecting the ranking, and set α = [2, 2] for each binarized SNP used with LABS-HDMR-CO.

HAPGEN2 data
Here we generate data from the HAPGEN2 project [9], reporting SNP values for more than 3.9 million SNPs, which we then convert to dosage. This dataset is generated by fixing one or two SNPs on each chromosome to be disease associated. The generated dataset contains 2000 controls (class 0) and 1000 cases (class 1). We randomly select 900 points in each class for training, and the rest is used as test data. We iterate 100 times, and measure the area under the curve (AUC) of the receiver operating characteristic (ROC) as our performance metric. In addition to LABS-HDMR-CO, we take the unprocessed data, use OBF for feature selection to select top features, and use several variants of GLMs with probit link for further selection and classification. We use a probit model that uses the top 1000 features with LASSO (L 1 ) penalty λ, and another that uses the top 500 features with an elastic net putting equal weights on the L 1 and L 2 penalties with penalty coefficient λ. We also use a variant that accounts for pairwise SNP interactions by considering terms of the form X i X j using the top 50 features and L 1 penalty λ. We only use the top 50 features (see Tables 1, 2) for the variant accounting for pairwise SNP interactions so that the total number of regressors in the regression model is comparable to the linear variants; we observed that using more features in the probit link accounting for pairwise interactions drastically increases runtime, causing infeasible computation cost. Finally, we also implement a naive Bayes classifier using the top 1000 features.
Fig. 4 Illustration of the b* selection process. Given the V(X) vectors for the two classes, denoted by red and green crosses, b* is chosen to maximize the distance between the centers of the projections of the V(X) vectors onto b.
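The evaluation protocol above, repeated random train/test splits scored by the AUC of the ROC, can be sketched with a rank-statistic AUC; the classifier scores below are synthetic stand-ins, not output of any of the models compared in this section.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney U) statistic."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Fraction of (positive, negative) pairs ranked correctly; ties count 1/2.
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Repeated random splits, as in the evaluation protocol above.
rng = np.random.default_rng(2)
aucs = []
for _ in range(10):                                  # the paper uses 100 iterations
    labels = rng.integers(0, 2, size=200)
    scores = labels + rng.normal(size=200)           # stand-in classifier scores
    aucs.append(auc(scores, labels))
print(round(float(np.mean(aucs)), 2))
```

Averaging the per-split AUCs, as done here, is what the reported percentages in this section summarize.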
Indeed, it is an advantage of LABS-HDMR-CO that can incorporate several hundred features in its regression model while accounting for pairwise SNP interactions with reasonable computational burden. Finally, our reasons to choose the probit link over the more popular logit link are 3 fold: (1) In the data generation model the risk of an SNP mutation is modeled as additional linear risk; corresponding to the logit link. Therefore, models based on the logit link get an unfair advantage that they exactly match the data generation model; while in reality almost always the assumed model deviates from reality. (2) We observed the computational cost to train a probit link is much less, about a third, of the logit link. Hence, to reduce computational cost of the GLM variants we compare with, we selected the probit link. (3) In practice the probit link behaves similar to the logit link, which is not surprising as the sigmoid and the cumulative distribution function (CDF) of the standard normal distribution are rather similar. Hence this choice better illustrates how slights deviations in the assumed link might affect performance of a GLM when dealing with SNP data.
Finally, note that all GLM variants and naive Bayes use OBF assuming three categories, i.e., three dosage values, for ranking features. We tested the popular χ 2 -test as well, and obtained lower AUCs for the GLMs and naive Bayes. This result strengthens the observations made in [22] that OBF provides better feature rankings compared with several other methods, including the popular χ 2 -test.
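For reference, the χ 2 -test ranking we compared against scores each SNP by the Pearson chi-squared statistic of its phenotype-by-dosage contingency table. A minimal sketch with made-up counts; for a 2 × 3 table the degrees of freedom are (2 − 1)(3 − 1) = 2, in which case the p-value has the closed form exp(−χ 2 /2):

```python
import math
import numpy as np

def chi2_stat(table):
    """Pearson chi-squared statistic for an r x c contingency table."""
    table = np.asarray(table, float)
    expected = np.outer(table.sum(1), table.sum(0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

# 2 x 3 table: phenotype (rows) by SNP dosage 0/1/2 (columns); counts made up.
table = [[50, 30, 20],
         [25, 35, 40]]
stat = chi2_stat(table)
p = math.exp(-stat / 2)   # chi-squared survival function for dof = 2
print(p < 0.05)           # -> True
```

Ranking SNPs by this p-value (smallest first) gives the χ 2 baseline ranking; OBF replaces the test statistic with a Bayes-factor-style score.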
The AUCs for LABS-HDMR-CO, the linear probit model with the L 1 penalty (probit(lin,LASSO)), the second order probit model with the L 1 penalty (probit(quad,LASSO)), the linear probit using the elastic net (probit(lin,elastic net)), and naive Bayes are 91.03%, 84.52%, 86.44%, 84.28%, and 86.83%, respectively. The larger AUC of LABS-HDMR-CO suggests it enjoys superior overall performance, i.e., LABS-HDMR-CO should typically enjoy a higher probability of detection for a fixed false alarm rate. Figure 5 plots the ROC curves of the classification algorithms. For the GLMs we tested λ = 0.01 : 0.01 : 0.2, and for each variant report the AUC of the λ with the best performance, i.e., highest AUC. The parameters of LABS-HDMR-CO are chosen through cross validation. As the results suggest, LABS-HDMR-CO enjoys superior performance compared with the other algorithms. In particular, for false positive rates larger than 2% LABS-HDMR-CO enjoys a higher probability of detection than all other algorithms. For small false alarm rates, though, naive Bayes has the highest true positive rate, closely followed by LABS-HDMR-CO. While the ROC curve of probit(lin,LASSO) increases monotonically with the false positive rate, the true positive rates of probit(quad,LASSO) and probit(lin,elastic net) barely increase for false positive rates between 0.04 and 0.05. Given that ROC curves are expected to be concave, we initially believed this might be an artifact of too few iterations (here 100); however, simulations implementing only these two GLMs with more iterations resulted in rather similar graphs. Moreover, in many iterations (using MATLAB's default settings) these two methods reported the warning of reaching the maximum number of iterations. Therefore, this behavior may be an artifact of the limited number of solver iterations or of numerical instabilities in the training stage.
Finally, we observed that the runtime of LABS-HDMR-CO using 1000 SNPs is comparable to that of the probit variant using 500 SNPs and the elastic net. This suggests LABS-HDMR-CO is extremely fast for a method that accounts for pairwise SNP interactions; note that given 1000 SNPs there are about 500,000 SNP pairs to evaluate. However, because LABS-HDMR-CO has more parameters to tune via cross validation, its total runtime exceeds that of a GLM with one tunable parameter. Note that in this work we did not test an elastic net probit that also optimizes over α, the relative weight between the L 1 and L 2 penalties; however, we expect the two dimensional search for such a model to result in computation costs comparable to LABS-HDMR-CO.
Finally, we use all of the data to find the top SNPs and SNP pairs with the largest risks, i.e., r f and r f i f j c i c j , respectively. Using Fisher's exact test and bounding the false discovery rate (FDR) by 5% using the Benjamini-Hochberg procedure [29], 785 SNPs are significant. Also, using FPT to identify significant pairwise SNP interaction patterns, among the 4 × C(1000, 2) ≈ 2 × 10 6 patterns to check, 1,046,864, about 52.4% of all tests, are significant when bounding FDR by 5%. Given that many selected SNPs are on the same chromosome, this is not surprising. Furthermore, we observed that all single SNPs that were not significant after bounding FDR are present in at least one significant SNP pair, emphasizing the importance of considering pairwise SNP interactions. See Supplementary Materials for details.
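Both the single-SNP screen and FPT control FDR with the Benjamini-Hochberg step-up procedure [29]; a minimal sketch of the rule (the p-values are made up for illustration):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of p-values declared significant at FDR level q."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    # Largest k with p_(k) <= (k/m) * q; reject hypotheses 1..k in sorted order.
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        significant[order[: k + 1]] = True
    return significant

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, q=0.05).sum())  # -> 2
```

Note that with roughly 2 × 10 6 FPT tests, the step-up thresholds k/m × q become very small for the top-ranked hypotheses, which is why large test counts make individual discoveries hard to declare in the small-sample datasets below.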

Lung cancer
Data obtained in [30] is deposited in the Gene Expression Omnibus (GEO) [31] under accession number GSE33355. It contains 61 sample pairs of cancerous and normal lung tissue specimens from non-smoking females, collected at National Taiwan University Hospital and Taichung Veterans General Hospital. The data is based on the GPL6801 platform and measures dosage values of 909622 SNPs. In our evaluation healthy and cancerous tissue specimens comprise classes 0 and 1, respectively; the goal is thus to determine whether a test point is normal or cancerous. Given the dataset, we randomly select 55 points in each class for training, use the rest for testing, and iterate 100 times. Figure 6a plots the ROC curves of the different methods. We observe that LABS-HDMR-CO enjoys a higher true positive rate at every false positive rate, suggesting its superior performance on this dataset. Cross validation sets D = 900, T 1 = 0.25, T 2 = 1.2, and T 3 = 0.8. The linear and quadratic probit with LASSO penalty set λ to 0.2 and 0.02, respectively, and the linear probit with elastic net sets λ = 0.13. Tables 3 and 4 list the top SNPs and SNP pairs with the largest risks used by LABS-HDMR-CO. Although cross validation suggests using D = 900 SNPs for prediction, Fisher's exact test with the Benjamini-Hochberg [29] procedure for FDR correction indicates that, among the SNPs used for prediction, only the top 250 are significant when bounding FDR by 5%. Although the remaining SNPs are not individually significant, their net effect is an improvement in prediction; this may be due to aggregating them as individual SNPs, or to specific SNP pair interactions among them that must be considered together, as a marker family, for their effect to be observed. Going back to the literature, we observe that many of the top SNPs and SNP pairs, or the genes they belong to, are shown or suggested to be affected in lung cancer.
For example, the top SNP rs9493858 is located on the SGK1 gene, which is suggested to be affected in lung cancer in several studies [32,33]. Looking at the top SNP pairs, we observe that the second highest pair maps to the NXN and MEOX2 genes, suggesting their interaction might be key to understanding lung cancer. Interestingly, mutations in NXN have been associated with colon cancer in east Asian populations [34], but its role in lung cancer requires further investigation. Furthermore, MEOX2 is also suggested to be affected in lung cancer [35]. In this dataset, except for one SNP pair, all risks are positive. In other words, certain mutations increase the cancer risk; however, the data does not suggest any candidate mutations that seem to further help healthy people guard against lung cancer. Now, using FPT to detect significant pairwise SNP interactions and bounding FDR by 5% using the Benjamini-Hochberg procedure, none of the pairs are significant; however, cross validation suggests T 2 = 1.2, resulting in 169324 SNP pairs being used in the analysis. These results suggest that although we cannot reliably tell which SNP pairs are true discoveries and which are false discoveries, the information carried by the weak SNP pairs outweighs the noise: by aggregating these pairs we extract more of the information encoded in the pairs than the noise they may insert into the decision rule, and the net effect is more reliable performance. Note that by encoding the SNP pairs in the analysis we may be able to say that, for a new test point, the net effect of SNP pair interactions is increased risk, i.e., increased likelihood of being a cancerous point; however, we may not be able to pinpoint which SNP pairs brought us to this conclusion; rather, we can only comment on their agglomerated net effect. Finally, we plot the differences in correlation coefficients of SNP pairs in Fig.
7a as we look for the pattern of both binarized SNPs taking value 1 as a pairwise interaction, i.e., Z f i f j 1,1 = 1 being an indicator of an interaction. The x-axis denotes the rank of the first SNP in the pair, and the y-axis that of the second. The z-axis, as well as the color of each circle corresponding to a SNP pair, denotes the difference in correlation coefficients. To avoid a cluttered figure, only SNP pairs with differences larger than 0.5 are plotted with a non-zero height. Only a small portion of SNP pairs seem to have large differences in correlation coefficients. Additionally, we observe that many SNPs are shared among the SNP pairs with large differences in correlation coefficients; in other words, a few SNPs are present in many of the SNP pairs with potential interactions. Putting SNP pairs with small differences in correlation coefficients aside, we observe that the remaining pairs resemble the pairwise patterns of Fig. 3 describing the heuristic behind LABS-HDMR-CO for merging SNP pairs and constructing blocks, suggesting that the suitability of such a strategy might further be biologically motivated.
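The screening quantity plotted in Fig. 7, the between-class difference in correlation coefficients of a binarized SNP pair, can be sketched as follows; the helper name and toy data are hypothetical, with one planted interaction:

```python
import numpy as np

def corr_diff(X0, X1):
    """Absolute difference between class-wise correlation matrices.

    X0, X1: (samples, features) 0/1 arrays of binarized SNPs per class.
    Entry (i, j) is |corr_1(i, j) - corr_0(i, j)|, the screening quantity
    used to flag candidate pairwise interactions.
    """
    c0 = np.corrcoef(X0, rowvar=False)
    c1 = np.corrcoef(X1, rowvar=False)
    return np.abs(c1 - c0)

rng = np.random.default_rng(3)
n, d = 300, 5
X0 = rng.integers(0, 2, size=(n, d))   # independent SNPs in class 0
X1 = rng.integers(0, 2, size=(n, d))
X1[:, 1] = X1[:, 0]                    # planted pairwise interaction in class 1
D = corr_diff(X0, X1)
i, j = np.unravel_index(np.argmax(D), D.shape)
print(sorted((int(i), int(j))))        # -> [0, 1]
```

Thresholding this matrix (at 0.5 and 0.8 in Figs. 7a and 7b, respectively) leaves only the pairs plotted with non-zero height.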

Breast cancer
Data obtained in [36] is deposited in GEO under accession number GSE16619, containing dosage data of 42 normal breast tissue samples and 69 cancerous samples. The data is based on the GPL6804 platform measuring 503590 SNPs. Normal and cancerous points comprise classes 0 and 1, respectively. We randomly select 35 normal points and 60 cancerous points for training, use the remaining data for testing, and iterate 100 times. Cross validation sets D = 1250, T 1 = 0.25, T 2 = 1, and T 3 = 0.5. The linear and quadratic probit with LASSO penalty set λ to 0.01 and 0.03, respectively, and the linear probit with elastic net sets λ = 0.02. Figure 6b plots the ROC curves of the different methods. For false positive rates below 5% LABS-HDMR-CO enjoys a higher true positive rate than the other algorithms. For higher false positive rates the linear probit with elastic net and the quadratic probit with LASSO perform almost identically and outperform the other algorithms, but are closely followed by LABS-HDMR-CO.
Tables 5 and 6 list the top SNPs and SNP pairs with the largest risks used by LABS-HDMR-CO, respectively. We observed that some of the SNP IDs present in the data file were not present in the GPL6804 platform description on the GEO website; for such SNPs we report their ID in the data file rather than their SNP ID. Although cross validation suggests D = 1250 for classification, within this set only 269 binarized SNPs are significant using Fisher's exact test when bounding FDR by 5% using the Benjamini-Hochberg procedure. Going back to the literature, we observe that several of the genes the top SNPs and SNP pairs map to are shown or suggested to be affected in breast cancer. For example, rs13129525, which ranks 9th, is located on the FAM171A1 gene, which is suggested to be affected in breast cancer [37]. Furthermore, the SNP pair ranking fourth maps to the CDK19 [38] and CCDC162P [39] genes, which are both shown to be affected in breast cancer. Similar to the lung cancer dataset, using FPT to detect significant pairwise interactions among the top D = 1250 SNPs and bounding FDR by 5% using the Benjamini-Hochberg procedure, only 4 pairs are significant, although cross validation suggests T 2 = 1, resulting in 567190 SNP pairs being used for classification. This suggests that in order to boost prediction accuracy we need to use many SNP pairs that are not significant, with the information contained in the true discoveries outweighing the noise of the many false discoveries present in the prediction rule. Finally, Fig. 7b plots the differences in correlation coefficients, where only pairs with differences larger than 0.8 are assigned a non-zero height. We again observe that a few SNPs are common among many pairs with potential interactions.

Discussion
Analyzing SNP data and developing classification rules given SNP observations is difficult when studying complex diseases. The small-sample, high-dimensional nature of the problem, individual SNPs being potentially weak markers, SNPs being categorical variables in nature, and their complex interactions are several important factors that make classifier design challenging. Because each individual SNP contributes only minimally to the class label, it seems necessary to account for SNP interactions to obtain reliable predictions. The proposed algorithm, LABS-HDMR-CO, aims to balance computation cost, complexity, and prediction performance by using a representation that accounts for pairwise interactions.
Although higher order HDMR expansions can be considered, given current technology, computation power, and sample sizes, accounting only for pairwise interactions seems to be the most one can hope for. Interestingly, in the simulated examples described here the linear models seem to perform better than expected. Although the current simulations are not sufficient to verify the performance and robustness of linear models for SNP classification, we expect this rather good performance to be due to the closeness of GLMs to first order HDMR expansions. Note that the probit and logit links have rather similar graphs, and the linear model with the logit link aims to compute the log likelihood ratio. In other words, the GLM with logit link assumes the log likelihood ratio is linear, i.e., that the risk of a SNP with two minor alleles is twice the risk of a SNP with one minor allele. The linear term thus resembles the first order HDMR expansion under this additional "dose-effect linearity" assumption. This assumption, similar to the assumptions we made in the development of LABS-HDMR-CO, reduces complexity and the number of parameters to estimate. Note that the training of a logit model is usually done by maximum likelihood (ML) estimation of the parameters. Finally, since the probit and logit links are very similar in shape, the superior performance of LABS-HDMR-CO over GLMs may be due to the following three reasons: (1) LABS-HDMR-CO uses the second order HDMR expansion while most GLMs used in practice mimic the first order HDMR expansion; (2) the preprocessing of LABS-HDMR-CO, decomposing SNP dosage data into two indicators, seems to better grasp the nonlinear nature of SNPs, while not affecting the flexibility of the algorithm to account for SNPs that are neither recessive nor dominant; and (3) the additional assumptions made in LABS-HDMR-CO seem to impose less rigidity on the model than the GLM assumption of linear additive risks.
When analyzing cancer datasets reporting SNP dosage values, we observed in the lung cancer dataset that LABS-HDMR-CO may enjoy much superior performance compared with other popular algorithms, and in the breast cancer dataset that it may perform only slightly worse than them. Furthermore, when the lack of reliable biological knowledge forces us to consider an extremely large number of potential SNP pairs, although we may not be able to reliably detect which SNP pairs are affected in the disease under study, we may be able to aggregate the information of many potential SNP pairs to improve prediction accuracy.

Conclusion
The analysis of genetic variants and their joint effect on complex diseases is a challenging task. In particular, SNPs are a difficult data type to handle due to their high-dimensionality, the weak effect of each individual SNP on the phenotype under study, the need to account for their complex joint interactions, and their categorical nature. These challenges make it difficult to develop classification rules with reliable predictions, and are exacerbated by the small sample sizes in many applications.
Here we revisited the binary classification problem given categorical SNP observations reported as dosage, and proposed LABS-HDMR-CO as an algorithm that produces classification rules with good prediction performance, takes several hundred SNPs as input, and accounts for their pairwise interactions. Additionally, LABS-HDMR-CO is inherently a very fast algorithm, with runtime comparable to a GLM with LASSO penalty. However, due to the 4 parameters (D, T 1 , T 2 , and T 3 ) tuned by cross validation, training the model may become quite computationally expensive. Note, though, that different threshold settings can be evaluated independently of each other, making LABS-HDMR-CO an ideal candidate for parallel computing. Further development of techniques to speed up the training process is an interesting direction for future work.
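Since the parameter settings can be evaluated independently of each other, the cross-validation grid parallelizes trivially; a sketch with hypothetical grids and a dummy scoring function standing in for a real cross-validation run:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical grids for the four tunable parameters of LABS-HDMR-CO.
D_grid  = [500, 900, 1250]
T1_grid = [0.25, 0.5]
T2_grid = [1.0, 1.2]
T3_grid = [0.5, 0.8]

def cv_score(params):
    """Stand-in for a cross-validation run at one parameter setting.

    In the real algorithm this would train LABS-HDMR-CO and return, e.g.,
    the held-out AUC; here it is a dummy score so the sketch is runnable.
    """
    D, T1, T2, T3 = params
    return -abs(D - 900) - abs(T2 - 1.2)   # dummy score peaking at D=900, T2=1.2

grid = list(product(D_grid, T1_grid, T2_grid, T3_grid))
with ThreadPoolExecutor() as pool:         # settings evaluated independently
    scores = list(pool.map(cv_score, grid))
best = grid[scores.index(max(scores))]
print(best[0], best[2])                    # -> 900 1.2
```

Each worker handles one setting, so the wall-clock cost of the grid search shrinks roughly with the number of available workers.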
The categorical nature of SNP dosage data has been a challenge for many machine learning algorithms, and many methods simply treat the values as real numbers. Although the current work requires further study, our initial results suggest that HDMR can be a suitable framework for studying SNP data as categorical variables. Additionally, although we only approximate the second order HDMR expansion, we know that the second order expansion is the best low dimensional representation in the mean square error (MSE) sense [16]. Therefore, it is not surprising that LABS-HDMR-CO enjoys superior prediction performance compared with many algorithms used to study SNP data.
The ability of HDMR to acknowledge the categorical nature of SNP data with complex non-linear interactions opens up a new avenue of research for developing low dimensional models suitable for categorical data. Last but not least, the close connection of HDMR with the Sobol indices makes it possible to identify significant SNP pairs with high interactions using the same methodology as used for classification. Interestingly, we observed that although we may not be able to detect which SNP pairs are affected in the disease under study, the net effect of aggregating many high-profile SNP pairs can boost prediction accuracy. In other words, in spite of not being able to declare where the "signal" is, i.e., which specific pairwise patterns are affected, it is possible to implicitly extract such information and take advantage of it to improve prediction accuracy.
The power of LABS-HDMR-CO to account for and combine many SNP pairs in its decision rule opens up many avenues of research demanding further investigation. For instance, the classification problem for categorical observations considered here may be extended to non-binary cases (multiple phenotypes), and we will pursue such an extension of our approach in future work. We note, however, that binary classification is the cornerstone of classification theory, and many solutions to multiclass problems can be formulated as a sequence of solutions to binary problems.