 Research
 Open Access
 Published:
High dimensional model representation of log likelihood ratio: binary classification with SNP data
BMC Medical Genomics volume 13, Article number: 133 (2020)
Abstract
Background
Developing binary classification rules based on SNP observations has been a major challenge for many modern bioinformatics applications, e.g., predicting risk of future disease events in complex conditions such as cancer. Smallsample, highdimensional nature of SNP data, weak effect of each SNP on the outcome, and highly nonlinear SNP interactions are several key factors complicating the analysis. Additionally, SNPs take a finite number of values which may be best understood as ordinal or categorical variables, but are treated as continuous ones by many algorithms.
Methods
We use the theory of high dimensional model representation (HDMR) to build appropriate low dimensional glassbox models, allowing us to account for the effects of feature interactions. We compute the second order HDMR expansion of the loglikelihood ratio to account for the effects of single SNPs and their pairwise interactions. We propose a regression based approach, called linear approximation for block second order HDMR expansion of categorical observations (LABSHDMRCO), to approximate the HDMR coefficients. We show how HDMR can be used to detect pairwise SNP interactions, and propose the fixed pattern test (FPT) to identify statistically significant pairwise interactions.
Results
We apply LABSHDMRCO and FPT to synthetically generated HAPGEN2 data as well as to two GWAS cancer datasets. In these examples LABSHDMRCO enjoys superior accuracy compared with several algorithms used for SNP classification, while also taking pairwise interactions into account. FPT declares very few significant interactions in the small sample GWAS datasets when bounding false discovery rate (FDR) by 5%, due to the large number of tests performed. On the other hand, LABSHDMRCO utilizes a large number of SNP pairs to improve its prediction accuracy. In the larger HAPGEN2 dataset FTP declares a larger portion of SNP pairs used by LABSHDMRCO as significant.
Conclusion
LABSHDMRCO and FPT are interesting methods to design prediction rules and detect pairwise feature interactions for SNP data. Reliably detecting pairwise SNP interactions and taking advantage of potential interactions to improve prediction accuracy are two different objectives addressed by these methods. While the large number of potential SNP interactions may result in low power of detection, potentially interacting SNP pairs, of which many might be false alarms, can still be used to improve prediction accuracy.
Background
Many modern bioinformatics applications utilize data analysis methods originally developed and studied in the fields of statistics, signal processing, and machine learning. In particular, in many cases, the application can be formulated as a classification or regression problem. Data encountered in bioinformatics is typically “smallsample highdimensional”, a challenge not encountered in many classical statistics problems and machine learning applications. For instance, to predict the risk of a person being diagnosed with a specific complex disease in the future, or to predict the effect of a treatment, e.g., for targeted therapy, one may collect several hundred thousand single nucleotide polymorphisms (SNPs) in a sample of several hundred or a few thousand patients with known labels. Although current “omics” data provide a deluge of information per sample point, research being restricted to small sample sizes impedes reliable analysis. While being a smallsample highdimensional problem is typical of many bioinformatics applications, it seems to be more pronounced when analyzing SNPs as (a) current technologies measure hundreds of thousands of SNPs, and (b) sample size can easily be much smaller than the number of disease associated SNPs. Furthermore, many molecular features, and in particular SNPs, can be weak markers, meaning each individual feature alone cannot reliably predict the disease outcome, and a large collection of features need to be considered together to obtain reliable predictions. Additionally, biological features are typically heavily dependent, for instance due to linkage disequilibrium (LD), and have complex interactions, which may exacerbate the difficulties in developing accurate prediction rules. Finally, note that interpretability is an important aspect in biological research. Not only do we look for biological markers and prediction rules with high accuracy, but also require a glassbox model that can explain “how” and “why” the prediction rule has come to a specific decision. For example, that a specific mutation increases cancer risk by some amount, or presence of a combination of mutations is an indicator of high risk.
To that end, many pipelines implement a first phase of feature selection to reduce dimensionality, improve replicability, and increase prediction accuracy. It has been shown in many studies that such approach in indeed valuable and useful in practical applications [1–4]. Additionally, penalized methods, such as those using LASSO or elastic net penalties, are widely used. Although feature selection as a means of dimensionality reduction is helpful, it is not always sufficient. For instance, the number of disease associated SNPs passing the selection stage can still be too large compared with sample size.
Due to the large number of biological markers, their complex interactions, the need for interpretability, and the relative lack of large datasets as compared with other machine learning applications, it is typically desired to use low dimensional generative models. The idea is that although the “optimal” rule can be highly complex, it can be well approximated by a low dimensional model, and a proper low dimensional family, for instance generalized linear models (GLMs) with logit or probit links, is large enough to contain a point close to the best low dimensional representation. Thereby, such approximation will avoid overfitting, improving prediction reliability and accuracy.
SNPs are among the most challenging biological features to analyze. Indeed a single mutation in the deoxyribonucleic acid (DNA) might not greatly impact the risk of a complex disease. Therefore, it is reasonable to assume mutations are weak markers that should be jointly studied to arrive at a reliable decision rule. For instance, GLMs with logit link that take dosage data, i.e., the number of minor alleles at each SNP, are a very popular, if not the most popular, method for binary classification given SNP data. Refer to [3–8] for such examples. Additionally, GLMs can be used with sparsity inducing penalties, such as elastic net, and can include product terms of two SNPs to account for their interactions. However, in many cases it is not possible to use all strong pairwise interactions in the data for classification. For instance, given 1000 disease associated SNPs, there are about 500,000 SNP pairs that can be considered in the classification rule. In absence of biological information or given a set of known interacting SNP pairs, it is typically not computationally feasible to consider all possible pairwise interactions and use sparsity inducing penalties such as LASSO and elastic net. Furthermore, such formulations may enforce a linear risk function for many SNPs, implying that a SNP with two minor alleles should induce a risk twice of that SNP having one minor allele, which might not be a valid assumption. For example, for a recessive SNP we may have that only presence of two minor alleles increases the risk, while one minor allele has no effect on the risk. We would like to emphasize that we found little discussion studying how valid the logit link assumption with linear additive risks is in practice.
Support vector machines (SVM), random forests, k nearest neighbors (kNN), and naive Bayes are other methods used for binary classification of SNP data [3, 4, 6, 7]; however, they are not as popular as GLMs, even though in many cases they are suggested to outperform GLMs. Note that many of these methods, such as SVMs and kNN, require a notion of distance, meaning they interpret the number of minor alleles as real numbers, although it might be more suitable to treat dosage data as ordinal variables. Treating dosage data as realvalued variables, although they take a finite number of values, whether understood as ordinal or categorical variables, may be a reason why many offtheshelf methods, despite outperforming GLMs, do not perform adequately on SNP data.
Here we use the theory of high dimensional model representation (HDMR) to find the “best” second order approximation of the log likelihood ratio, and solve it for the case of categorical observations, which we believe is a better approach to model SNP data than treating dosage values as real numbers. By considering a second order expansion we can account for the effect of single SNPs and pairwise SNP interactions, where by SNP interactions we understand the nonzero terms in the second order HDMR expansion of the log likelihood ratio borrowing from two SNPs which is explained in more detail the “Methods” section. Additionally, we propose linear approximations based on the objectives studied in compressed sensing to approximate the second order HDMR expansion. We use the Sobol indices, an extension of the R^{2} statistic that is closely connected to the HDMR expansion, to compute statistics indicating whether there is a significant interaction for a specific SNP pair value. We apply the developed method to a simulated data based on the HAPGEN2 [9] project, as well as lung and breast cancer datasets, showing the proposed methods enjoy higher classification accuracies as well as being able to efficiently detecting strong pairwise SNP interactions.
Methods
Here we describe our classification methodology based on the High Dimensional Model Representation (HDMR) expansion, studied in detail in [10–13]. We first briefly review the general theory, how it applies to a binary classification problem, and how the categorical observations simply the process.
High dimensional model representation
HDMR is a powerful tool to represent a function of a random vector based on marginal observations. HDMR provides us with a hierarchy of functions that describe how the interactions of variables affect the output. In particular, assuming output Z is a function of input random vector X=[X_{1},⋯,X_{D}], i.e., Z=f(X), HDMR decomposes f(X) based on partial observations. Let F={1,⋯,D}. The HDMR expansion of Z is the collection of functions f_{u}(X_{u}) for all u⊆F such that
under the condition that
where for each \(x \in \mathbb {R}^{D}\), x_{u} is the restriction of x to elements in u, x_{−u} is the restriction of x to elements not in u, and μ is the probability measure of random vector X described by probability density function (p.d.f.) w [12]. Note this condition is equivalent to a hierarchical orthogonality criterion of the following form [12]:
Therefore, via the HDMR expansion we may write
where
Equation 6 suggests that in the general case of dependent variables a component function, f_{u}(x_{u}) depends on all other expansion terms that also have a nonempty intersection with u. However, assuming elements of X are independent, the last term of (6) equals zero and we may write
While this greatly simplifies the process of computing the HDMR expansion, the independence assumptions is too strong for SNPs, as they can be heavily correlated. Observe that by considering sets u such that u≤d in (4) we arrive a the d^{th} order HDMR expansion of Z, which we hereafter denote by E_{d}(ZX).
Second order HDMR for categorical observations
Now, additionally consider the case where X is a categorical random vector. In particular, for each f∈F, X_{f} is a categorical random variable with support C_{f}. In other words, C_{f} is the collection of categories X_{f} may take. Now, assuming X is categorical, the domain of f_{u}(x_{u}) is the finite collection of all combinations in \(C_{u}=\prod \nolimits _{f \in u} C_{f}\). Therefore, we may write
where 1_{q} is the indicator function of statement q being true. Additionally, for the case of second order HDMR expansion we have
for some \(q_{f}^{c}, q_{f,f'}^{c,c'} \in \mathbb {R}\). Therefore, we can further simplify and write
for some \( q_{0},q^{c_{f}}_{f}, q_{f_{i}f_{j}}^{c_{f_{i}} c_{f_{j}}} \in \mathbb {R}\). Note instead of indicators \(1_{x_{f}=c}\) and \(1_{\left \{X_{i}=c_{f_{i}}, X_{j}=c_{f_{j}}\right \}}\), we can basically do a change of basis, and use a new set of indicators that are linearly independent and uniquely tell us the value of the categorical observation. More precisely, we may consider each \(Z^{f}_{c}=1_{\left \{X_{f}=c\right \}}\) as a binary random variable, and directly write the HDMR expansion using \(Z^{f}_{c}\)’s. Therefore, without loss of generality, we may assume C_{f}=2. Additionally, let \(Q_{f}=\left \{q^{i}_{f}: i=1,\cdots,C_{f}1 \right \}\) for be a collection of statements that uniquely determine the value of X_{f}, although they might not necessarily be in the form of \(1_{\{X_{f}=c\}}\). For example, suppose C_{f}={0,1,2}. Instead of statements \(1_{\left \{X_{f}=0\right \}}\) and \(1_{\left \{X_{f}=1\right \}}\) to determine the value of X_{f}, we can also use \(1_{\left \{X_{f} \geq 1\right \}}\) and \(1_{\left \{X_{f} =2 \right \}}\) to determine the category of X_{f}. Note for both cases we have \(1_{\left \{X_{f}=2\right \}}=11_{\left \{X_{f}=0\right \}}1_{\left \{X_{f}=1\right \}}\) and \(1_{\left \{X_{f}=1\right \}}=1_{\left \{X_{f} \geq 1\right \}}  1_{\left \{X_{f} =2\right \}}\). Therefore, we can further simplify and write
Here our goal is to analyze SNP data when they are reported in dosage, i.e., for each SNP f we report the number of minor alleles, and hence C_{f}={0,1,2}. We hereafter mainly focus on this special case to outline the procedure for estimating coefficients \(b^{q}_{f}\) and \(b^{q_{i} q_{j}}_{f_{i} f_{j}}\) up to an affine transformation; however, the algorithms developed are more general. Note with little abuse of terminology, we use SNP to refer to the categorical dosage value, being the number of minor alleles. Our specific choice of Q_{f} is \(1_{\left \{X_{f} \geq 1\right \}}\) and \(1_{\left \{X_{f} =2 \right \}}\), which will be made clear later. Note a deeper discussion on the HDMR expansion using extended bases can be found in [11, 14, 15].
HDMR expansion for binary classification
Here we describe how HDMR can be used for a binary classification problem. Consider a binary classification problem with class labels y=0,1 and feature index set F. Let X be a random unlabeled observation with true label y_{x}. Given S, it is desired to design a decision rule that assigns a label, \(\hat {y}_{x}\) to X so that \(\hat {y}_{x}=y_{x}\) with high probability. Note given the full joint distribution of X and y_{x}, one could have easily computed P(y_{x}=1X), or equivalently the log likelihood ratio L(X)= log(P(y_{x}=1X)/P(y_{x}=0X)), and use a decision rule \(\hat {y}_{x}=1_{L(X)>T}\), where 1_{q} is the indicator function of statement q being correct, and T is a threshold.
However, the full joint distribution is typically not available, as is usually estimated given training sample, S. Alternatively, many models assume the classification rule belongs to a family of rules parametrized by θ, and aim to estimate θ given S. For example, a GLM using the famous logit link assumes \(L(X)=\beta _{0}+ {\sum \nolimits }_{f \in F}\beta _{f} X_{f}\), where X_{f} is the value of X for feature f, and θ is the collection of β_{0} and β^{f}’s. However, such model is insufficient for many applications where it is desired to account for pairwise SNP interactions, and is not easy to train using LASSO and elastic net penalties while accounting for pairwise interactions by adding terms of the form \(\phantom {\dot {i}\!}X_{f} X_{f'}\) to the GLM. Here we develop an algorithm based on observations from second order HDMR expansion of E(L(X)X), i.e., Z=L(X).
Following the derivation of the HDMR expansion in (10), we only need to substitute Z=L(X) to obtain the HDMR expansion. Now, to compute the HDMR expansion, we need the full joint distribution. However, we are almost never given the true underlying distribution parameters, and are given the training sample \(\mathscr {S}\) instead. Note when working with SNP data, the number of disease associated SNPs that may affect the output may be larger than the sample size. Therefore, we may not be able to arrive at a welldefined set of distribution parameters estimates, let alone hoping the estimates to be accurate. Therefore, in the following, we present an algorithm to compute the approximate second order HDMR expansion of the log likelihood ratio directly without computing the distribution parameters for categorical dosage SNP data.
Sobol indices, HDMR expansion, and variable selection
The Sobol indices are an extension of the R^{2} statistic, and can be used for global sensitivity analysis [16]. They basically explain the portion of variance explained by each set of variables. For each set of features, u, the total effect Sobol index and the main effect Sobol index are respectively defined as
In other words, the total effect Sobol index describes the portion of variance that the set u explains, and the main effect Sobol index describes the amount of information present in u that is not present in any other feature.
The approximate second order HDMR expansion of Eq. (10) can also be used to analyze the extent of the effect of each feature and feature pair on the class labels. Note that if a feature is independent of the class labels then S(X_{f})=0 and if two features f_{i} and f_{j} do not have any interactions, i.e.,
then \(S\left (X_{f_{i},f_{j}}\right)=0\). The use of S(X_{f}) for variable selection and its connection to other methodologies are discussed below. We then study how the exact HDMR expansion motivates analyzing feature pairs, and study the special case of categorical features.
The Sobol indices and feature filtering
In feature selection and biomarker discovery literature, univariate filters, or filters in short, refer to the family of feature selection algorithms that assess each feature individually and assign a score to each individual feature, which is then used for selecting a subset of features [17, 18]. Filters are fast, but do not take feature dependencies into account. Note in this taxonomy, univariate hypothesis tests, such as ttest, equipped with multiple testing correction are an example of filters. Other methodologies used for feature selection include multivariate filters, wrappers, and embedded methods, which are studied in more detail in [1, 17–19]. Here we study how the Sobol index for each single feature reduces to filtering, and how it connects to other filter methods.
Let u={f} be the set of single feature f. We have
In other words, the total effect Sobol index measures how much information each feature contains about Y, and the main effect Sobol index measure how much information f carries about Y that is not present in any other feature. Here we prefer to use S_{f} over \(S^{c}_{f}\) for three major reasons: (1) Due to the dependencies among biological features and that each individual feature might only have a very small impact on Y, the main effect index might be small for all features. (2) In a highdimensional setting looking at all features but one may cause overfitting, making it impossible to reliably measure \(S^{c}_{f}\). Finally, (3) the computation cost to measure \(S^{c}_{f}\) for all features can be infeasible.
Looking at S_{f}, we would like to select features which affect the output the most, i.e., have large S_{f}’s, which we may formulate as the following hypothesis test:
Note that S_{f}=0 if and only if f is independent of Y. Therefore, the null of the hypothesis test of Eq. (16) can be reformulated as follows:

1
Given f no decision rule with less error than random decision can be built.

2
Y and f are independent.

3
f has the same distribution in both classes.
Note the first formulation has been used to develop methods that train a classifier/regression model, and aim to assess if it outperforms a random decision, for instance using a generalized linear model (GLM) with logit link and linear model β_{0}+β_{f} for each f and outputting the pvalue for β_{f}=0. The second formulation leads to using independence criteria, such as HilbertSchmidt independence criterion, and the null assuming Y and f are independent. Finally, the last formulation, which may be the most popular one, leads to hypothesis tests that aim to verify if the classconditioned distributions are different. The KolmogorovSimirnov (KS) test and Wilcoxon rank sum test are examples of such tests. Note that different formulations give rise to different nulls and hence different pvalues. However, under certain assumptions one may be able to identify different tests with each other. For example, assuming (independent) Gaussian features with equal variances in both classes, linear discriminant analysis (LDA) performs better than a random decision if and only if f has different means in both classes, which is exactly what studentt test measures. Additionally, assuming features are independent and Gaussian with some unknown mean and variances, quadratic discriminant analysis (QDA) performs better than a random decision if f does not have the same mean and variances in both classes, which is exactly what the likelihood ratio test of does [20], which is studied in more detail in [21].
Fisher’s exact test and χ^{2}test are two popular hypothesis tests used to determine if two categorical features have similar distributions in both classes. The recently proposed optimal Bayesian filter (OBF) [22] directly measures the sample conditioned probability of a categorical variable having distributional differences across two classes, is suggested to enjoy superior performance compared with several other selection algorithms used for identifying disease associated SNPs [22]. Note we will later use OBF for feature selection in our pipeline.
Pairwise SNP interactions
Suppose f_{i} and f_{j} are categorical random variables, and we would like to see if a specific patten in the form of \(1_{X_{f_{i}}=c_{i} \& X_{f_{j}}=c_{j}}\) carries significant information not available if we consider each feature individually. In other words, we would like to test if the second order HDMR term corresponding to feature pair f_{i},f_{j} is nonzero. Let
Note that
Therefore, we would like to see if \(k_{0}^{f_{i},f_{j}} \neq k_{1}^{f_{i},f_{j}}\). Thereby, by a pairwise SNP interaction we understand the nonzero second order HDMR terms that involve two distinct SNPs, i.e., are not present in the first order expansion. Here we see that such terms correspond to SNP pairs with unequal correlations between the two classes. Note that a linear additive model looking only at individual SNPs is not sufficient to compute the log likelihood ratio for such correlated categorical features. This is also in line with previous definitions proposed for quantifying SNP interactions, e.g., [6, 23].
Now to compute the pvalue associated with each fixed SNP pair pattern we use Fisher’s r to z transformation and approximate the distribution of the null \(\left (k_{0}^{f_{i},f_{j}} = k_{1}^{f_{i},f_{j}}\right)\) by the standard normal distribution. We compute the statistics
for y=0,1, where \(\hat {\rho }^{f_{i},f_{j}}_{y}\) is the estimate of correlation coefficient from data. We then compute
where n_{y} is sample size in class y. \(Z^{f_{i},f_{j}}\phantom {\dot {i}\!}\) approximately follows the standard normal distribution, which we use to compute pvalues. We hereafter call this method fixed pattern test (FPT).
Algorithm
Here we describe the algorithm developed for classifier design.
Initial filtering
Here we outline the preprocessing we do on the categorical SNP data. When working with SNP data, many times dosage values, i.e., the number of minor alleles, are reported which take values in {0,1,2}. Note that many SNPs can be dominant or recessive. A dominant SNP is one for which a SNP with one minor allele behaves similar to a SNP with two minor alleles, and a recessive SNP is one for which a SNP with one minor allele behaves similar to a SNP with non minor alleles. Therefore, we do a first phase preprocessing, and transform each SNP to two “binarized SNP”s. For each SNP, we create two auxiliary features, one the indicator of the presence of a minor allele, and another the indicator of two minor alleles. This way, if only the first auxiliary feature is used then we are dealing with a dominant SNP, if only the second auxiliary feature is used we have a recessive SNP, and if both are used the SNP does not fall into only one of the two categories, for instance we may have an additive SNP. Examples of different SNP models and a discussion on them can be found in [24, 25].
By this preprocessing, if k SNPs are observed, we have F=2k binarized SNPs. We hereafter assume all SNPs are already binarized, i.e., for each feature f∈F we have C_{f}={0,1}, instead of using the family of constraints Q^{f}. Note that we only do so for notational convenience.
As mentioned above, given that many studies may measure several hundred thousand or a few million SNPs, they all cannot be directly inputted to a classifier, particularly that we aim to account for pairwise SNP interactions, i.e., the second order terms of the HDMR expansion. A first phase filtration is typically inevitable to reduce computational complexity of classifier design, i.e., the training of the classification algorithm. Here we use the recently proposed optimal Bayesian filter (OBF) [22], to rank binarized SNPs, and we pick the top D features for classifier design. Note D is a parameter that can be determined later through cross validation, or be chosen as a large value so that most disease associated SNPs are captured. Note we have chosen OBF as it has outperformed many of the currently used methods for selecting disease SNPs, including the popular χ^{2}test [22].
Classification algorithm
Here we design the classifier used to label observations. Note here we are using binarized SNPs as our features. In the training, given the labeled training data and selected binarized SNPs of the preprocessing step, the algorithm assigns coefficients to all SNPs and SNP pairs. Note D might be large, and given D features, there are 0.5D(D−1) SNP pairs to use. Furthermore, there may be weak individual SNPs or SNP pairs that are so weak that not including them in the final classifier might actually improve performance. Therefore, in our classifier design we remove features and feature pairs that are too weak. The classifier design process can be broken to the following steps: (1) feature pair construction, (2) removing weak features, (3) removing weak feature pairs, (3) merging feature pairs into blocks, and (4) estimating classifier parameters and obtaining the risk function.
Feature pair construction
Given that we have binarized SNP values, for two features f_{i} and f_{j}, we have four feature pairs of the form \(Z^{f_{i},f_{j}}_{c_{i},c_{j}}=1_{\left \{X_{i}=c_{f_{i}} \& X_{j}=c_{f_{j}}\right \}}\). Each of the created \(Z^{f_{i},f_{j}}_{c_{i},c_{j}}\)’s is hereafter called a feature pair. Figure 1 illustrates how to generate feature pairs \(Z^{f_{i},f_{j}}_{c_{i},c_{j}}\) from binarized SNPs.
Removing weak features
Here we remove features that are too weak to be used in the classifier. For each feature, f, we find the risk associated to it, r^{f}. We first compute \(v^{f}=\max \left \{\log \hat {p}^{f}_{1,1}/\hat {p}^{f}_{1,0},  \log \hat {p}^{f}_{0,1}/\hat {p}^{f}_{0,0} \right \}\), where \(\hat {p}^{f}_{c,y}\) is the posterior probability of X_{f}=c in class y and is obtained through OBF. If \(\hat {p}^{f}_{c,1}>\hat {p}^{f}_{c,0}\) for the pattern c obtaining the maximum in the definition of v^{f} we set r^{f}=v^{f}; otherwise, r^{f}=−v^{f}. We assign the zero coefficient, i.e., remove, features for which r^{f}<T_{1}. Note T_{1} a model parameter. Note features for which \(\hat {p}^{f}_{1,0}>\hat {p}^{f}_{1,0}\) are called risk increasing or positive risk features, and features for which \(\hat {p}^{f}_{1,0}<\hat {p}^{f}_{1,0}\) are called risk decreasing features. Figure 2a illustrates this process for single features.
Removing weak feature pairs
We then filter out feature pairs that are too weak. We again follow a procedure similar to our process for removing single features. For each feature pair f_{i} and f_{j} define \(r^{f_{i}f_{j}}_{c_{i}c_{j}}=\log \left (\hat {p}^{f_{i}f_{j}}_{c_{i}c_{j}}(1)/\hat {p}^{f_{i}f_{j}}_{c_{i}c_{j}}(0)\right)\log \left (\hat {p}^{f_{i}}_{c_{i},1}/\hat {p}^{f_{i}}_{c_{i},0}\right)\log \left (\hat {p}^{f_{j}}_{c_{j},1}/\hat {p}^{f_{j}}_{c_{j},0}\right)\) where \(\hat {p}^{f_{i}f_{j}}_{c_{i}c_{j}}(y)\) is the sample conditioned probability that feature pair f_{i},f_{j} satisfies \(X_{f_{i}}=c_{i}\) and \(X_{f_{j}}=c_{j}\) in class y. Again, feature pairs for which \(\left r^{f_{i}f_{j}}_{c_{i}c_{j}}\right <T_{2}\) are removed, i.e., are assigned zero weight in the classification rule. Note again feature pairs for which \(\hat {p}^{f_{i}f_{j}}_{c_{i}c_{j}}(1)>\hat {p}^{f_{i}f_{j}}_{c_{i}c_{j}}(0)\) are called risk increasing or positive risk feature pairs, and feature pairs for which \(\hat {p}^{f_{i}f_{j}}_{c_{i}c_{j}}(1)<\hat {p}^{f_{i}f_{j}}_{c_{i}c_{j}}(0)\) are called risk decreasing or negative risk feature pairs. Here we have defined \(r^{f_{i}f_{j}}_{c_{i}c_{j}}\) so that the interaction of a feature pair comprised of two independent features would not enter the classification rule. Figure 2b illustrates this process for feature pairs.
Feature block construction
Recall given D features, there are 0.5D(D−1) feature pairs. Although our filtering removes many feature pairs, there still could be too many feature pairs to be easily used for classifier design. In addition, since (a) SNPs may be heavily correlated, (b) a feature may occur in many feature pairs, and (c) the binarization scheme described in the preprocessing step creates two binary features for each SNP, the binary features might be heavily correlated, creating dependencies among feature pairs. In other words, given that a specific pattern for a feature pair is observed, one may be able to estimate the value of many other feature pairs. For example, given we observe \(1_{\left \{X_{f_{i}}=1\right \}}=0\) for a specific feature pair that uses \(1_{\left \{X_{f_{i}}=2\right \}}\), we can easily conclude all feature pairs that assume \(1_{\left \{X_{f_{i}}=2\right \}}\) are zero. These dependencies exacerbate classifier design. Therefore, we propose the following procedure to reduce the number of weights to estimate. Note we could have used any other community detection algorithm instead; however, we observed the following procedure works well, and enjoys low computation cost.
We consider blocks of feature pairs of the following forms for each feature f_{i} and pattern c_{i}c_{j}:
where P’s and N’s are collections of risk increasing and risk decreasing feature pairs, respectively. Figure 3 depicts this process. Afterwards, given an observation X, we report the ratio of feature pairs in each block which take value one. Among the constructed blocks, we again remove “weak blocks”, i.e., for each block A, irrespective of being risk increasing or risk decreasing, we compute r^{A}, the logarithm of the expected ratio of observed patterns of block A in class 1 versus class 0. We then remove blocks for which r^{A}<T_{3}. Again, T_{3} is threshold that will be chosen through cross validation.
We observed this approach to (a) reduce the number of parameters to estimate when D is large, and (b) improve prediction performance. Note that this approach is equivalent to decomposing \(w_{f_{i}f_{j}}^{c_{f_{i}} c_{f_{j}}}\) to two terms, one for the block \(A^{f_{i}}_{c_{i} c_{j}}\) and another for block \(A^{f_{j}}_{c_{j} c_{i}}\), where A is either P or N, and assuming all features pairs in \(A^{f_{i}}_{c_{i} c_{j}}\) have the same decomposed coefficient in their expansion. Here, in the simulations, we observed such assumptions improves classification performance when features, i.e., SNPs, are correlated, but leave a mathematical analysis of such assumption on the classifier performance for future work.
Now, given observation X, we construct the vector V(X), comprised of each feature value, and the ratio of observed patterns in risk increasing and risk decreasing blocks. The vector V(X) shall be used in the next section to assign a “risk” to observation X.
Estimating classifier parameters
To complete our classifier construction, we need to estimate HDMR coefficients. Note given our construction of blocks in the previous section, we now only need to find a vector b so that we may write that E_{2}(L(X)X)≈E(L(X))+b.V(X). However, we may be dealing with an illposed problem due the number of coefficients to estimate being larger than the sample size. Note that although the HDMR expansion of the log likelihood ratio is unique, we mostly compare it with a threshold to assign a label to a newly observed point. Therefore, in many cases it is acceptable to work with a affine transformation of the log likelihood ratio. In other words, although not being able to find the exact second order HDMR expansion of the log likelihood ratio is not desirable, it is not catastrophic either, as any affine transformation of the log likelihood ratio can be used as an equally good decision rule.
To circumvent the illposed problem, we use an objective function which is a variation of objective functions mostly studied in the compressed sensing literature [26] that aim to estimate a sparse signal given 1bit quantized observations. In other words, optimization problems and formulations that aim to estimate vector a from n observations of the form {(x_{i},y_{i}):i=1:n} where y=sign(a·x) and “ ·” denotes inner product. Connections between these objectives and a convex relaxation to the logistic regression problem is discussed in [27]. Extensions that additionally consider noisy measurements in the form of random flips, i.e.,
are studied in [27]. Finally, [28] studies extensions to nonGaussian features. Note in [27] it is shown that up to a constant the error in recovering the signal a matches the minimax error for the unquantized compressed sensing problem. We use the following optimization problem to solve for the weights we wish to use.
where \(\mathscr {S}_{y}\) is the portion of data in class y. Figure 4 depicts how b^{∗} is selected given vectors V(X) for the training data. Heuristically speaking, given a feature vector in the form of log likelihood ratios of partial observations x_{u}, here we find weights that maximize the distance between the average points of each class. The heuristic for using such objective is as follows: the HDMR expansion obtains the weights that result in the “best” low dimensional representation, i.e., we find the mean square error (MSE) estimate of the log likelihood ratio. The underlying reason we do so is that we believe the HDMR expansion of the log likelihood ratio gives a us a model that enjoys a low prediction error. On the other hand, weights that maximize the distance between the projections of the center points of the two classes to a one dimensional space should also yield low prediction error. Hence, such objective should result in a model that is close to the HDMR expansion. Note that in special cases, for instance independent Gaussian features with equal variances in both classes, we can actually prove that such approach minimizes the prediction error.
Given b^{∗} we have everything need for classification. Given a new observation X we find R(X)=b^{∗}·V(X), and we assign class label \(\hat {Y}=1_{R(X)>T}\), for threshold T. Note the thresholds T_{1}, T_{2}, and T are parameters of the model, and will be selected through the validation process, for instance, by cross validation. We hereafter call the resulting classifier built for categorical X as linear approximation for block second order HDMR expansion of categorical observations (LABSHDMRCO). The pseudocode of LASHDMRCO is provided in Algorithm 1.
Results
Here we use a model developed to generate SNP data to evaluate the performance of LABSHDMRCO, and compare it with several popular methods used for binary classification. We consider three datasets, a dataset based on the HAPGEN2 project, a lung cancer dataset, and a breast cancer dataset. OBF takes π(f), the prior probability a SNP is disease associated, and hyperparameter α describing the Dirichlet prior on each categorical SNP value as input. We assume π(f) is constant for all features, hence not affecting the ranking, and set α=[2, 2] for each binarized SNP used with LABSHDMRCO and α=[1, 1, 1] for each nonbinarized SNP used with other classification rules. Note the choice of α to be the all one vector simplifies to a uniform prior. As LABSHDMRCO uses feature pairs, a uniform prior on binarized SNP patterns suggests α=[2, 2].
HAPGEN2 data
Here we generate data from the HAPGEN2 project [9], reporting SNP values for more than 3.9 million SNPs, which we then convert to dosage. This dataset is generated by fixing one or two SNPs on each chromosome to be disease associated. The generated dataset contains 2000 controls (class 0), and 1000 cases (class 1). We randomly select 900 points in each class for training, and the rest is used as test data. We iterate 100 times, and measure the area under curve (AUC) of the receiver operator characteristic (ROC) as our performance metric
In addition to LABSHDMRCO, we use the nonprocessed data, use OBF for feature selection to select top features, and use several variants of GLMs with probit link for further selection and classification. We use a probit model that uses top 1000 features with LASSO (L_{1}) penalty λ and another that uses top 500 features with elastic net putting equal weights on L_{1} and L_{2} penalties with penalty coefficient λ. We also use a variant that accounts for pairwise SNP interactions by considering terms of the form X_{i}X_{j} using top 50 features and L_{1} penalty λ. We only use the top 50 features (see Tables 1, 2) for the variant accounting for pairwise SNP interactions so that the total number of regressors to use in the regression model is comparable to the linear variants. We observed that larger values of features in the probit link accounting for pairwise interactions drastically increases runtime, causing infeasible computation cost. Finally, we also implement a naive Bayes classifier using top 1000 features.
Indeed, it is an advantage of LABSHDMRCO that can incorporate several hundred features in its regression model while accounting for pairwise SNP interactions with reasonable computational burden. Finally, our reasons to choose the probit link over the more popular logit link are 3 fold: (1) In the data generation model the risk of an SNP mutation is modeled as additional linear risk; corresponding to the logit link. Therefore, models based on the logit link get an unfair advantage that they exactly match the data generation model; while in reality almost always the assumed model deviates from reality. (2) We observed the computational cost to train a probit link is much less, about a third, of the logit link. Hence, to reduce computational cost of the GLM variants we compare with, we selected the probit link. (3) In practice the probit link behaves similar to the logit link, which is not surprising as the sigmoid and the cumulative distribution function (CDF) of the standard normal distribution are rather similar. Hence this choice better illustrates how slights deviations in the assumed link might affect performance of a GLM when dealing with SNP data.
Finally, note that all GLM variants and naive Bayes use OBF assuming three categories, i.e., three dosage values, for ranking features. We tested the popular χ^{2}test as well, and obtained lower AUCs for the GLMs and naive Bayes. This results strengthens the observations made in [22] that OBF provides better feature rankings compared several other methods, including the popular χ^{2}test.
The AUCs for LABSHDMRCO, the linear probit model using L_{1} penalty (probit(lin,LASSO)), second order probit model with L_{1} penalty (probit(quad,LASSO)), linear probit using elastic net (probit(lin,elastic net)) and naive Bayes are 91.03%, 84.52%, 86.44%,84.28%, and 86.83%, respectively. The larger AUC of LABSHDMRCO suggests it enjoys superior overall performance, i.e., LABSHDMRCO should typically enjoy a higher probability of detection for a fixed false alarm rate value. Figure 5 plots the ROC curve of the classification algorithms. For the GLMs we tested λ=0.01:0.01:0.2, and for each variant report the AUC of the λ with superior performance, i.e., highest AUC. Also, the parameters of LABSHDMRCO are chosen through cross validation. As the results suggest LABSHDMRCO enjoys superior performance compared with other algorithms. In particular, for false positive rates larger than 2% LABSHDMRCO enjoys higher probability of detection compared with all other algorithms. For small false alarm rates though, naive Bayes had the highest true positive rate, but was closely followed by LABSHDMRCO. While the ROC curve of probit(lin,LASSO) seems to monotonically increase with respect to false positive rate, true positive rate of probit(quad,LASSO) and probit(lin,elastic net) seem to only infinitesimally increase for false positive rates between 0.04 and 0.05. Given that ROC curves must be concave, we initially believed this might be an artifact of not enough iterations (here being 100); however, simulations only implementing these two GLMs with more iterations resulted in rather similar graphs. However, in many iterations (using MATLAB’s default settings) the warning of reaching maximum number of iterations was reported for these two methods. Therefore, this may be an artifact of the limited number of iterations or numerical instabilities of the training stage.
Finally, we observed the runtime of LABSHDMRCO using 1000 SNPs is comparable to the probit variant using 500 SNPs and elastic net. This suggests LABSHDMRCO is extremely fast for a method that accounts for pairwise SNP interactions. Note given 1000 SNPs there are about 500,000 SNP pairs to evaluate. However, LABSHDMRCO has more parameters to tune via cross validation, its total runtime is more than a GLM with one tunable parameter. Note in this work we did not test an elastic net probit that also optimizes over α, the relative weights between L_{1} and L_{2} penalties; however, we expect the two dimensional search for such model might result in computation costs comparable to LABSHDMRCO.
Finally, we use all of data, to find the top SNPs and SNP pairs with largest risks, i.e., r^{f} and \(r^{f_{i}f_{j}}_{c_{i}c_{j}}\), respectively. Using Fisher’s exact test and bounding the false discovery rate (FDR) by 5% using the BanjaminiHochberg procedure [29] 785 SNPs are significant. Also, using FPT for identifying significant pairwise SNP interaction patterns, among the \(4\times \binom {1000}{2} \approx 2 \times 10^{6}\) patterns to check, 1046864, about 52.4% of all tests, are significant when bounding FDR by 5%. Given than many selected SNPs are on the same chromosome this is not surprising. Furthermore, we observed that all single SNPs that were not significant after bounding FDR are present in at least one SNP pair, emphasizing the importance of considering pairwise SNP interactions See for details.
Lung cancer
Data obtained in [30] is deposited on gene expression omnibus (GEO) [31] with accession number GSE33355. It contains 61 sample pairs of cancer and normal lung tissue specimen from nonsmoking females collected at national Taiwan university hospital and Taichung veterans general hospital. The data is based on the GPL6801 platform and measures dosage values of 909622 SNPs. In our evaluation healthy and cancerous tissue specimen comprise classes 0 and 1, respectively. Thereby, the goal is to determine if a test point is normal or cancerous. Given the dataset we randomly select 55 points in each class for training, use the rest for testing, and iterate 100 times. Figure 6a plots the ROC curve of different methods. We observe that LABSHDMRCO enjoys a higher true positive rate for each given false positive rate, suggesting its superior performance on this dataset. Cross validation sets D=900, T_{1}=0.25, T_{2}=1.2, and T_{3}=0.8. Linear and quadratic probit with LASSO penalty set λ to 0.2 and 0.02, respectively, and the linear probit with elastic net sets λ=0.13.
Tables 3 and 4 list the top SNPs and SNP pairs with largest risks used by LABSHDMRCO. Although cross validation suggests using D=900 SNPs for prediction, the Fisher’s exact test using the BenjaminiHochberg [29] procedure for FDR correction suggests that only looking at the SNPs used for prediction, only the top 250 are significant bounding FDR by 5%. Although the remaining SNPs are not significant, their net effect is an improvement in prediction which may be due to agglomerating them as individual SNPs, or specific SNP pair intereactions among them that should be considered together as a marker family to observe their effect. Going back to the literature we observe many of the top SNPs and SNP pairs, or the genes they belong to, are shown or suggested to be affected in lung cancer. For example, the top SNP rs9493858 is located on the SGK1 gene, which is suggested to be affected in lung cancer in several studies [32, 33]. Looking at the top SNP pairs, we observe the second highest SNP pair map to NXN and MEOX2 genes, suggesting their interaction might be key to understanding lung cancer. Interestingly, mutations in NXN has been associated with colon cancer in east Asian populations [34], but its role in lung cancer requires further investigation. Furthermore, MEOX2 is also suggested to be affected in lung cancer [35]. In this dataset, except for one SNP pair, all risks are positive. In other words, certain mutations increase the cancer risk; however, the data does not suggests any candidate mutations that seem to further help healthy people guard against lung cancer.
Now, using FPT to detect significant pairwise SNP interactions, bounding FDR by 5% using the BenjaminiHochberg procedure, none of the pairs are significant; however, cross validation suggests T=1.2, resulting in using 169324 SNP pairs in its analysis. These results suggest although we cannot reliably tell which SNP pairs are true discoveries and which are false discoveries, the information present of the weak SNP pairs outweighs the noise, by aggregating these pairs we can extract the information of encoded in the pairs more than the noise that may be inserted to the decision rule, insert the added information in the prediction rule, and the net effect is more reliable performance. Note that by encoding the SNP pairs in the analysis we may be able to say that for a new test point the net effect of SNP pair interactions is increased risk, i.e., likelihood of being a cancerous point; however, we may not be able to pinpoint which SNP pairs brought us to this conclusion, rather, we can only comment on their agglomerated net effect.
Finally, we plot the amount of difference in correlation coefficients of SNP pairs in Fig. 7a as we look for the pattern of both binarized SNPs taking value 1 to have a pairwise interaction, i.e., \(Z^{f_{i}f_{j}}_{1,1}=1\) being an indicator of an interaction. The xaxis denotes the rank of the first SNP in the pair, and yaxis denotes the second. The zaxis as well as the color of each circle corresponding to a SNP pair denote the amount of the difference in correlation coefficients. To avoid a cluttered figure though, only SNP pairs with differences larger than 0.5 are plotted with a nonzero height. Only a small portion of SNP pairs seem to have large differences in correlation coefficients. Additionally, we observe that many SNPs are common among SNP pairs with large differences in correlation coefficients. In other words, few SNPs are present in many of the SNP pairs with potential interactions. Putting SNP pairs with small differences in correlation coefficients aside, we observe the remaining pairs resemble the pairwise patterns of Fig. 3 describing the heuristic behind LABSHDMRCO to merge SNP pairs and construct blocks, suggesting suitability of such strategy might further be biologically motivated.
Breast cancer
Data obtained in [36] is deposited on GEO with accession number GSE16619, containing dosage data of 42 normal breast tissue samples and 69 cancerous samples. The data is based on the GPL6804 platform measuring 503590 SNPs. Normal and cancerous points comprise classes 0 and 1, respectively. We randomly select 35 normal points and 60 cancerous points for training, use the remaining data for testing, and iterate 100 times. Cross validation sets D=1250, T_{1}=0.25, T_{2}=1, and T_{3}=0.5. Linear and quadratic probit with LASSO penalty set λ to 0.01 and 0.03, respectively, and the linear probit with elastic net sets λ=0.02. Figure 6b plots the ROC curve of different methods. For false positive rates below 5% LABSHDMRCO enjoys a higher true positive rate than other algorithms. For higher false positive rates linear probit with elastic net and quadratic problit with LASSO perform almost identical and superior to other algorithms, but are closely followed by LABSHDMRCO.
Tables 5 and 6 list the top SNPs and SNP pairs with largest risks used by LABSHDMRCO, respectively. We observed some of the SNP IDs present in the data file were not present GPL6804 platform description on the GEO website. For such SNPs we report their ID in the datafile rather than their SNP ID. Although cross validation suggests D=1250 for classification, only looking at this set, 269 binarized SNPs are significant using the Fisher’s exact test bounding FDR by 5% using the BenjaminiHochberg procedure. Going back to the literature we observe several of the genes top SNPs and SNP pairs map to are shown or suggested to be affected in breast cancer. For example, rs13129525 which ranks 9 is located on the FAM171A1 gene which is suggested to be affected in breast cancer [37]. Furthermore, the SNP pair ranking fourth map to CDK19 [38] and CCDC162P [39] genes, which are both shown to be affected in breast cancer. Similar to the lung cancer dataset we again observe that SNP pairs have positive risks, while in contrast to the lung cancer dataset many individual SNPs have negative risk, meaning certain point mutations may reduce the breast cancer relapse risk.
Using FPT to detect significant pairwise interactions among the top D=1250 SNPs, bounding FDR by 5% using the BenjaminiHochberg procedure only 4 pairs are significant, although cross validation suggests T_{2}=1, resulting in using 567190 SNP pairs for classification. This suggests in order to boost our prediction accuracies we need to use many SNP pairs that are not significant, but the information contained in the true discoveries outweighs the noise of the many false discoveries present in the prediction rule. Finally, Fig. 7b plots the differences in correlation coefficients where only pairs with differences larger than 0.8 are assigned a nonzero height. We again observe that few SNPs are common among many pairs with potential interactions.
Discussion
Analyzing SNP data and developing classification rules given SNP observations is difficult when studying complex diseases. The smallsample highdimensional nature of the problem, individual SNPs being potentially weak markers, SNPs being categorical variables in nature, and their complex interactions are several important factors that make classifier design a challenging task. Due to each individual SNP contributing only minimally to the class labels, it seems necessary to account for SNP interactions to obtain reliable predictions. The proposed algorithm, LABSHDMRCO aims to balance computation cost, complexity, and prediction performance by using a representation that accounts for pairwise interactions. Although higher order HDMR expansions can be considered, given current technology, computation power, and sample sizes, accounting only for pairwise interactions seems to be the most one can hope for.
Interestingly, we observed in our simulated examples described here that the linear models seem to perform better than expected. Although the current simulations are not sufficient to verify the performance and robustness of linear models for SNP classification, we expect this rather good performance to be due to closeness of GLMs to the first degree HDMR expansions. Note that probit and logit links have rather similar graphs, and the linear model of the logit link aims to compute the log likelihood ratio. In other words, the GLM with logit link assumes the log likelihood ratio is linear, and assumes the risk of a SNP with two minor alleles is twice the risk of a SNP with one minor allele. Note the linear term basically resembles the first order HDMR expansion under this additional “doseeffect linearity” assumption. This assumption, similar to the assumptions we made here in the development of LABSHDMRCO, reduces complexity and the number of parameters to estimate. Note that the training of a logit model is usually done by maximum likelihood (ML) estimation of the parameters. Finally, note that since probit and logit links are very similar in shape, the superior performance of LABSHDMRCO over GLMs may be due to the following three reasons: (1) LABSHDMRCO uses second order HDMR expansion while most GLMs used in practice mimic first order HDMR expansion, (2) the preprocessing of LABSHDMRCO decomposing SNP dosage data to two indicators seems to better grasp the nonlinear nature of SNPs, while not affecting the flexibility of the algorithm to account for SNPs that are neither recessive nor dominant, and (3) the additional assumptions made in LABSHDMRCO seem to enforce less rigidness in the model than the assumptions of GLMs on linear additive risks.
When analyzing cancer datasets reporting SNP dosage values, in the lung cancer dataset we observed that LABSHDMRCO may enjoy much superior performance compared with other popular algorithms, and in the breast cancer dataset we observed may perform only slightly inferior to them. Furthermore, when lack of reliable biological knowledge results in the need of considering extremely large number of potential SNP pairs, although we may not be able to reliably detect which SNP pairs are affected in the disease under study, we may be able to aggregate the information of many potential SNP pairs to improve prediction accuracy. In other words, when working with SNP data, it seems sets with large FDRs might still carry enough signal to improve prediction accuracies. Finally, note that the HAPGEN2 data is based on real work sequences, and we can expect it to adequately mimic real world scenarios. We observed much higher accuracies for all classification rules for the HAPGEN2 dataset compared with other cancer datasets. As SNPs seem to be equally weak in all datasets, for instance risks of individual SNPs are not very different, we may hypothesize that the relatively small sample size of cancer datasets may contribute to inferior performance, as the trained classifiers have errors much larger than the Bayes error, and that the larger samples are necessary for more reliable predictions.
Conclusion
The analysis of genetic variants and their joint effect on complex diseases is a challenging task. In particular, SNPs are a difficult data type to handle due to their highdimensionality, weak effects of each individual SNP on the phenotype under study, the need to account their joint complex interactions, and their categorical nature. These challenges make it difficult to develop classification rules with reliable predictions, and are exacerbated by the small sample sizes in many applications.
Here we revisited the binary classification problem given categorical SNP observations when reported in dosage, and proposed LABSHDMRCO as an algorithm that produces classification rules with good prediction performance that can take several hundred SNPs as input, and account for their pairwise interactions. Additionally, LABSHDMRCO is a very fast algorithm in nature, with runtime comparable to a GLM with LASSO penalty. However, due to the 4 parameters to tune (D, T_{1}, T_{2}, and T_{3}) in cross validation, the training of the model may become quite computationally expensive. However, note that different thresholds can be evaluated independent of each other, making LABSHDMRCO an ideal candidate for parallel computing. Further development of techniques to speed up the training process is an interesting direction for future work that should be explored in more detail.
The categorical nature of SNP dosage data has been a challenge for many machine learning algorithms, and treating them as real numbers has been the focus of many methods. Although the current work requires further study, our initial results suggest that HDMR can a be suitable framework to study SNP data as categorical variables. Additionally, although we only approximate the second order HDMR expansion, we already know that the second order expansion is the best low dimensional representation in the mean square error (MSE) sense [16]. Therefore, it is not surprising that LABSHDMRCO enjoys superior prediction performance compared with many algorithms used to study SNP data.
The ability of HDMR to acknowledge categorical nature of SNP data with complex nonlinear interactions opens up a new avenue of research to develop low dimensional models suitable for categorical data. Last but not least, the close connection of HDMR with the Sobol indices gives one the ability to identify significant SNP pairs with high interactions, using the same methodology as used for classification problems. Interestingly, we observed that although we may not be able to detect which SNP pairs are affected in the disease under study, the net effect of aggregating many highprofile SNP pairs can boost the prediction accuracy. In other words, in spite of not being able to declare where the “signal” is, i.e., which specific pairwise patterns are affected, it is possible to implicitly extract and take advantage of such information to improve prediction accuracy.
The power, of LABSHDMRCO to account for and combine many SNP pairs in its decision rule opens up many avenues of research demanding further investigation. For instance, the type of a classification problem for categorical observations considered here may be extended to nonbinary cases (multiple phenotypes) as well, and we will pursue such an extension of our approach in future work. We note, however, that binary classification is the cornerstone of classification theory and many solutions to multiclass problems can be formulated as a sequence of solutions to binary class problems.
Nomenclature
1_{s}: indicator function of statement s being true y: the class label taking values 0 or 1 D: the number of features that pass the first phase filtration of the LABSHDMRCO classification algorithm F: the set of feature indices u: an arbitrary subset of F f_{u}: the HDMR component for set u⊆F X: the observation random vector comprised of categorical variables X_{u}: restriction of X to features in u⊆F X_{f}: used instead of X_{{f}} for f∈F L(X): the log likelihood ratio of point X belonging to class 1 n_{y}: sample size in class y\({\hat {p}^{f}_{c},{y}}\): OBF’s estimate of the probability of feature f taking value c in class y\({\hat {p}^{f_{i},f_{j}}_{c_{i},c_{j}}(y)}\): OBF’s estimate of the probability of features f_{i} and f_{j} taking values c_{i} and c_{j}, respectively, in class y\({q^{c}_{u}}\): the HDMR coefficient of when feature set u takes value c C_{f}: the set of categorical values feature f∈F can take r^{f}: the risk associated to binarized feature f\({r^{f}_if_{j}}\): the risk associated to binarized feature pair comprised of f_{i} and f_{j} R(X): the risk associated to observation X\({\mathscr {S}}\): the training sample\({\mathscr {S}_{y}}\): the training sample in class y S(u): the Sobol index of feature set u S^{c}(u): the main effect Sobol index of feature set u T: the threshold used to assign a class to a test point T_{1}: the threhsold used to remove weak features T_{2}: the threhsold used to remove weak feature pairs T_{3}: the threhsold used to remove weak feature blocks b^{∗}: the estimated approximate HDMR coefficients obtained from the training sample\({w^{q}_{u}}\): the HDMR coefficient of when the observed values of feature set u satisfy constraint q Z=f(X): is the dependent varibale whose HDMR expansion is being computed E_{d}(ZX): the d^{th} order HDMR expansion of Z given observation X\({Z^{f_{i},f_{j}}_{c_{i},c_{j}}}\): the indicator of features f_{i} and f_{j} taking values c_{i} and c_{j}, respectively\({\hat {\rho }^{f_{i},f_{j}}_{y}}\): correlation coefficient between f_{i} and f_{j}\({k_{c_{i} c_{j}}^{f_{i},f_{j}}(y)}\): the ratio between probability mass function value of observing features f_{i} and f_{j} taking values c_{i} and c_{j},repectievly, and the probability mass function value assuming f_{i} and f_{j} are independent in class y\({z^{f_{i},f_{j}}_{y}}\): Fisher’s r to z transform value for Bernoulli random variables f_{i} and f_{j}\({N^{f_{i}}_{c_{1} c2}}\): the block comprised of feature pairs that (a) have large negative risks, (b) contain f_{i}, (c) f_{i} takes value c_{1}, and (d) the other feature in the pair, f_{j}, takes value c_{2}\({P^{f_{i}}_{c_{1} c2}}\): the block comprised of feature pairs that (a) have large positive risks, (b) contain f_{i}, (c) f_{i} takes value c_{1}, and (d) the other feature in the pair, f_{j}, takes value c_{2} r^{A}: the risk associated to block A being the ratio of observed pairwise feature patterns of A averaged over all samples
Availability of data and materials
A MATLAB implementation of LABSHDMRCO is available on Github. GWAS datasets are publicly available on GEO, and the synthetically generated HAPGEN2 data is available upon request.
Abbreviations
 AUC:

area under curve
 DNA:

deoxyribonucleic acid
 FDR:

false discovery rate
 FPT:

fixed pattern test
 GEO:

gene expression omnibus
 GLM:

generalized linear model
 HDMR:

high dimensional model representation
 LABSHDMRCO:

linear approximation for block second order HDMR expansion for categorical observations
 LASSO:

least absolute shrinkage and selection operator
 MSE:

mean square error
 ROC:

receiver operator characteristic
 SNP:

single nucleotide polymorphism
References
Sima C, Dougherty ER. What should be expected from feature selection in smallsample settings. Bioinformatics. 2006; 22(19):2430–6.
Hua J, Tembe WD, Dougherty ER. Performance of featureselection methods in the classification of highdimension data. Pattern Recog. 2009; 42(3):409–24.
Huang HH, Xu T, Yang J. Comparing logistic regression, support vector machines, and permanental classification methods in predicting hypertension. BMC Proceedings. 2014; 8(1):96.
Long N, Gianola D, Rosa GJ, Weigel KA, Avendaño S. Comparison of classification methods for detecting associations between snps and chick mortality. Genet Sel Evol. 2009; 41(1):18.
Long N, Gianola D, Rosa GJM, Weigel KA, Avendaño S. Machine learning classification procedure for selecting snps in genomic selection: application to early mortality in broilers. J Anim Breeding Genet. 2007; 124(6):377–89.
Schwender H, Ickstadt K. Identification of SNP interactions using logic regression. Biostatistics. 2007; 9(1):187–98.
GarcíaMagariños M, LópezdeUllibarri I, Cao R, Salas A. Evaluating the ability of treebased methods and logistic regression for the detection of snpsnp interaction. Ann Hum Genet. 2009; 73(3):360–9.
Weissfeld JL, Lin Y, Lin HM, Kurland BF, Wilson DO, Fuhrman CR, Pennathur A, Romkes M, Nukui T, Yuan JM, et al. Lung cancer risk prediction using common snps located in gwasidentified susceptibility regions. J Thorac Oncol. 2015; 10(11):1538–45.
Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011; 27(16):2304–5.
Rabitz H, Aliş ÖF. General foundations of highdimensional model representations. J Math Chem. 1999; 25:197–233.
Li G, Rabitz H. General formulation of HDMR component functions with independent and correlated variables. J Math Chem. 2012; 50(1):99–130.
Hooker G. Generalized functional ANOVA diagnostics for highdimensional functions of dependent variables. J Comput Graph Stat. 2007; 16(3):709–32.
Sobol IM. Theorems and examples on high dimensional model representation. Reliab Eng Syst Saf. 2003; 79(2):187–93.
Alış ÖF, Rabitz H. Efficient implementation of high dimensional model representations. J Math Chem. 2001; 29(2):127–42.
Li G, Hu J, Wang SW, Georgopoulos PG, Schoendorf J, Rabitz H. Random samplinghigh dimensional model representation (RSHDMR) and orthogonality of its different order component functions. J Phys Chem A. 2006; 110(7):2474–85.
Lu R, Wang D, Wang M, Rempała GA. Estimation of Sobol’s sensitivity indices under generalized linear models. Commun StatTheory Methods. 2018; 47(21):5163–95.
Ilyin SE, Belkowski SM, PlataSalamán CR. Biomarker discovery and validation: technologies and integrative approaches. Trends Biotechnol. 2004; 22(8):411–6.
Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007; 23(19):2507–17.
Diamandis EP. Cancer biomarkers: can we turn recent failures into success?J Natl Cancer Inst. 2010; 102(19):1462–7.
Pearson ES, Neyman J. On the problem of two samples In: Neyman J, Pearson ES, editors. Joint Statistical Papers (1967). Cambridge: Cambridge University Press: 1930. p. 99–115.
Zhang L, Xu X, Chen G. The exact likelihood ratio test for equality of two normal populations. Am Stat. 2012; 66(3):180–4.
Foroughi pour A, Dalton LA. Optimal bayesian feature filtering for singlenucleotide polymorphism data. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Kansas: IEEE: 2017. p. 2290–2.
Shen J, Li Z, Song Z, Chen J, Shi Y. Genomewide twolocus interaction analysis identifies multiple epistatic snp pairs that confer risk of prostate cancer: A crosspopulation study. Int J Cancer. 2017; 140(9):2075–84.
Han SA, Song JY, Min SY, Park WS, Kim MJ, Chung JH, Kwon KH. A genetic association analysis of polymorphisms, rs2282695 and rs12373539, in the FOSB gene and papillary thyroid cancer. Exp Ther Med. 2012; 4(3):519–23.
Samuels ME. Saturation of the human phenome. Curr Genomics. 2010; 11(7):482–99.
Plan Y, Vershynin R. Onebit compressed sensing by linear programming. Commun Pur Appl Math. 2013; 66(8):1275–97.
Plan Y, Vershynin R. Robust 1bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Trans Inf Theory. 2013; 59(1):482–94.
Ai A, Lapanowski A, Plan Y, Vershynin R. Onebit compressed sensing with nonGaussian measurements. Linear Algebra Appl. 2014; 441:222–39.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995; 57(1):289–300.
Lu TP, Lai LC, Tsai MH, Chen PC, Hsu CP, Lee JM, Hsiao CK, Chuang EY. Integrated analyses of copy number variations and gene expression in lung adenocarcinoma. PloS ONE. 2011; 6(9):24829.
Edgar R, Domrachev M, et al. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002; 30(1):207–10.
Abbruzzese C, Mattarocci S, Pizzuti L, Mileo AM, Visca P, Antoniani B, Alessandrini G, Facciolo F, Amato R, D’Antona L, et al. Determination of sgk1 mrna in nonsmall cell lung cancer samples underlines high expression in squamous cell carcinomas. J Exp Clin Cancer Res. 2012; 31(1):4.
Matschke J, Wiebeck E, Hurst S, Rudner J, Jendrossek V. Role of sgk1 for fatty acid uptake, cell survival and radioresistance of ncih460 lung cancer cells exposed to acute or chronic cycling severe hypoxia. Radiat Oncol. 2016; 11(1):75.
Zhang B, Jia WH, Matsuda K, Kweon SS, Matsuo K, Xiang YB, Shin A, Jee SH, Kim DH, Cai Q, et al. Largescale genetic study in east asians identifies six new loci associated with colorectal cancer risk. Nat Genet. 2014; 46(6):533.
Cortese R, Hartmann O, Berlin K, Eckhardt F. Correlative gene expression and dna methylation profiling in lung development nominate new biomarkers in lung cancer. Int J Biochem Cell Biol. 2008; 40(8):1494–508.
Kadota M, Sato M, Duncan B, Ooshima A, Yang HH, DiazMeyer N, Gere S, Kageyama SI, Fukuoka J, Nagata T, et al. Identification of novel gene amplifications in breast cancer and coexistence of gene amplification with an activating mutation of pik3ca. Cancer Res. 2009; 69(18):7357–65.
SantuarioFacio SK, CardonaHuerta S, PerezParamo YX, Trevino V, HernandezCabrera F, RojasMartinez A, UscangaPerales G, MartinezRodriguez JL, MartinezJacobo L, PadillaRivas G, MuñozMaldonado G, GonzalezGuerrero JF, ValeroGomez J, VazquezGuerrero AL, MartinezRodriguez HG, BarbozaQuintana A, BarbozaQuintana O, GarzaGuajardo R, OrtizLopez R. A new gene expression signature for triplenegative breast cancer using frozen fresh tissue before neoadjuvant chemotherapy. Mol Med. 2017; 23(1):101–11.
V Broude E, Gyorffy B, A Chumanevich A, Chen M, SJ McDermott M, Shtutman M, F Catroppo J, B Roninson I. Expression of cdk8 and cdk8interacting genes as potential biomarkers in breast cancer. Curr Cancer Drug Targets. 2015; 15(8):739–49.
Miyagawa Y, Matsushita Y, Suzuki H, Komatsu M, Yoshimaru T, Kimura R, Yanai A, Honda J, Tangoku A, Sasa M, et al. Frequent downregulation of lrrc26 by epigenetic alterations is involved in the malignant progression of triplenegative breast cancer. Int J Oncol. 2018; 52(5):1539–58.
Acknowledgments
Not Applicable.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 13 Supplement 9, 2020: The International Conference on Intelligent Biology and Medicine (ICIBM) 2019: Computational methods and application in medical genomics (part 2). The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume13supplement9.;
Funding
The research and publication costs were partially supported by the National Science Foundation grant DMS1853587 to GR and by the Mathematical Biosciences Institute (MBI) at The Ohio State University. MBI is supported by the National Science Foundation under grant DMS1440386. The funding sources had no role in the design of the study and in writing of the manuscript.
Author information
Authors and Affiliations
Contributions
AFP and GR designed the approach and performed analyses. AFP, MP, GR were involved in initial manuscript preparation. LESC and EK provided simulated data. LAD, AFP, MP, GR, EK and LESC were involved in discussions throughout the study and in preparing the final manuscript. All author(s) have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not Applicable.
Consent for publication
Not Applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Supplementary: top SNPs of LABSHDMRCO. The top SNPs used by LABSHDMRCO for the HAPGEN2, breast cancer, and lung cancer datasets are provided in the Supplementary.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
pour, A.F., Pietrzak, M., SuchestonCampbell, L.E. et al. High dimensional model representation of log likelihood ratio: binary classification with SNP data. BMC Med Genomics 13 (Suppl 9), 133 (2020). https://doi.org/10.1186/s12920020007741
Published:
DOI: https://doi.org/10.1186/s12920020007741
Keywords
 Single nucleotide polymorphism
 Binary classification
 High dimensional model representation
 Pairwise SNP interactions
 Log likelihood ratio