Diagnostic biases in translational bioinformatics
BMC Medical Genomics volume 8, Article number: 46 (2015)
Abstract
Background
With the surge of translational medicine and computational omics research, complex disease diagnosis increasingly relies on molecular signature detection driven by massive omics data. However, how to detect and prevent possible diagnostic biases in translational bioinformatics remains an unsolved problem despite its importance in the coming era of personalized medicine.
Methods
In this study, we comprehensively investigate the diagnostic bias problem by analyzing benchmark gene array, protein array, RNA-Seq and miRNA-Seq data under the framework of support vector machines (SVM) with different model selection methods. We further categorize the diagnostic biases into different types through rigorous kernel matrix analysis and provide effective machine learning methods to conquer them.
Results
In this study, we comprehensively investigate the diagnostic bias problem by analyzing benchmark gene array, protein array, RNA-Seq and miRNA-Seq data under the framework of support vector machines. We found that diagnostic biases occur for data with different distributions and for SVM with different kernels. Moreover, we identify three types of diagnostic biases in SVM diagnostics: overfitting bias, label skewness bias, and underfitting bias, and present the corresponding reasons through rigorous analysis. Compared with the overfitting and underfitting biases, the label skewness bias is more challenging to detect and conquer because its deceptive accuracy makes it easy to mistake for a normal diagnostic case. To tackle this problem, we propose derivative component analysis based support vector machines (DCA-SVM), which conquer the label skewness bias while achieving rivaling clinical diagnostic results.
Conclusions
Our studies demonstrate that the diagnostic biases are mainly caused by three major factors: kernel selection, the signal amplification mechanism in high-throughput profiling, and the training data label distribution. Moreover, the proposed DCA-SVM diagnosis provides a generic solution for overcoming the label skewness bias, owing to the powerful feature extraction capability of derivative component analysis. Our work identifies and solves an important but less addressed problem in translational research. It also has a positive impact on machine learning by adding new results to kernel-based learning for omics data.
Background
With the surge of translational medicine and computational omics research, complex disease diagnosis increasingly relies on disease signatures discovered from the sheer enormity of high-throughput omics data [1–4]. Identifying disease molecular signatures from different pathological states not only captures the subtlety between disease subtypes and controls, but also supports disease gene hunting, related pathway queries, genome-wide association study (GWAS) investigations, and subsequent drug target identification [5–7]. The translational technologies in medicine, along with the exponential growth of high-throughput data in genomics, transcriptomics, and proteomics, are preparing for the coming era of personalized medicine, which customizes medical decisions and practices to individual patients [6, 8].
Although different state-of-the-art classifiers have been widely employed in such massive data-driven disease diagnostics to enhance diagnostic accuracy, there has been almost no investigation of their diagnostic biases, which are essential to the success of translational medicine [9, 10]. In our context, a diagnostic bias means that a classifier cannot conduct an unbiased diagnosis for given input omics data. Instead, it may tend to favor one phenotype or even totally ignore the other, even if the diagnostic accuracy sometimes appears reasonable.
In other words, given training data consisting of m normalized omics samples \(x_{i}\) and their corresponding labels \(y_{i}\in \{-1,+1\}\), i.e. \(\{(x_{i},y_{i})\}_{i=1}^{m}\), the decision function \(f(x\mid x_{1},x_{2},\cdots,x_{m})\) inferred by the classifier demonstrates some bias in determining the class type (phenotype) of a new sample \(x^{*}\), which is assumed to follow the same normalization procedure as the training data. The bias can be due to inappropriate parameter choice, model selection, biased label distribution, or even some special characteristics of the input data. We generally assume that all training and test samples are drawn from normalized population data for the convenience of diagnosis, which avoids possible renormalization and classifier retraining overhead in subsequent diagnosis. For example, a diagnostic result \(f(x^{*}\mid x_{1},x_{2},\cdots,x_{m})=1\) is probably obtained because almost all training samples are labeled ‘+1’, even if the true label of \(x^{*}\) is \(y^{*}=-1\).
As a result, inaccurate or even deceptive diagnostic results would be produced and lead to inaccurate or even totally wrong clinical decision making. In particular, such a diagnostic bias can happen to any classifier due to different decision models, input data distributions, and/or model selection choices.
As such, a comprehensive and rigorous investigation of the diagnostic bias problem is urgently demanded by translational research, because a robust disease diagnostic requires a classifier that achieves both efficiency and security. Efficiency means that the classifier can attain a high level of diagnostic accuracy with good generalization capability. Security means that the classifier can recognize each label type without bias by avoiding possible biases in the inference of its decision function. Quite a few previous studies have addressed the efficiency problem, but almost no previous literature has addressed the security issue, i.e. the diagnostic bias problem in translational research. In particular, we need to answer the following questions about diagnostic bias: when does it happen, why does it happen, and how can we conquer it while achieving efficiency?
To answer these key questions, we employ support vector machines (SVM) as a representative classifier in this study to investigate disease diagnostic bias, because of their rigorous decision model, good scalability, and popularity in translational medicine [11–13]. We present the following novel findings from benchmark gene array, protein array, RNA-Seq and miRNA-Seq data in this work.
First, diagnostic biases can happen for an SVM classifier under any kernel and in different model selections, although they are more likely to occur under nonlinear kernels. Given input data with two different phenotypes, diagnostic biases usually appear as extremely imbalanced sensitivity and specificity values, even if the diagnostic accuracy appears reasonable. Moreover, diagnostic biases seem to be irrespective of data distributions: we have observed them in both normally distributed and negative binomial distributed data.
Second, there are three types of diagnostic biases in SVM diagnostics: overfitting bias, label skewness bias, and underfitting bias. The overfitting and label skewness biases both demonstrate a majority-count phenotype favor mechanism, i.e., only majority-count samples can be recognized in diagnosis. The overfitting, label skewness, and underfitting biases are mainly caused by a built-in molecular signal amplification mechanism in omics data profiling, data label skewness, and inappropriate kernel selection, respectively.
The built-in signal amplification mechanism is mainly responsible for the overfitting biases. It refers to the fact that all high-throughput omics profiling systems employ real-time PCR or similar approaches to amplify gene or protein expression levels exponentially [14, 15]. Data label skewness, which is mainly responsible for the label skewness biases, means that class label distributions are skewed toward one specific type of samples (e.g., positive). For convenience, we call the type of samples with the larger count in the label set the majority-count type. Inappropriate kernel selection simply means that a wrong kernel choice causes the corresponding SVM classifier to lose diagnostic capability, which results in the underfitting biases.
Third, the label skewness bias is more challenging to detect and conquer because its deceptive accuracy makes it easy to mistake for a normal diagnostic case. To tackle this problem, we propose derivative component analysis based support vector machines (DCA-SVM) to conquer the label skewness bias, and compare its performance with that of state-of-the-art peers. The proposed DCA-SVM diagnosis not only conquers the label skewness bias but also achieves rivaling clinical diagnostic results by leveraging the powerful feature extraction capabilities of derivative component analysis [16].
Our studies comprehensively identify different diagnostic biases and present novel, effective solutions for this important but less addressed problem. Compared with our previous work on conquering SVM overfitting [10], this study provides more systematic and novel results for kernel-based learning for omics data and translational bioinformatics. In particular, our studies are the first to identify the label skewness bias, which has usually been mistaken for a normal diagnostic case in the past literature, and provide a method that overcomes this bias with rivaling clinical performance. As such, this work will have positive impacts on the translational research and machine learning fields.
Methods
As a widely used diagnostic method with good scalability, support vector machines (SVM) can be described as follows. Given a training data set \(\{(x_{i},y_{i})\}_{i=1}^{m}\), \(x_{i}\in \Re^{n}\), with labels \(y_{i}\in \{-1,+1\}\), an SVM computes an optimal separating hyperplane \((w\cdot x)+b=0\) to attain the maximum margin between the positive and negative observations (samples), where w and b are the normal vector and bias of the hyperplane, respectively. The margin refers to the maximal width between the two boundary hyperplanes parallel to the optimal separating hyperplane.
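If the training data are linearly separable, this is equivalent to finding w and b that solve the quadratic programming (QP) problem \(\arg \min _{w,b}\frac {1}{2}\|w\|^{2}\) under the condition \(y_{i}(w\cdot x_{i}+b)-1\geqslant 0\) for each observation \(x_{i}\) in the training data [13]. The QP problem can be solved by seeking the Lagrange multipliers \(\alpha_{i}\geq 0,\ i=1,2,\cdots,m\), in the following dual problem,
\[\max_{\alpha}\ \sum_{i=1}^{m}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}\cdot x_{j})\quad \text{subject to}\quad \sum_{i=1}^{m}\alpha_{i}y_{i}=0,\ \alpha_{i}\geq 0,\]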
where w and b can be calculated from \(w=\sum _{i=1}^{m}\alpha _{i}y_{i}x_{i}\) and \(y_{i}(w\cdot x_{i}+b)-1=0\) (for any support vector) respectively. As a result, the class type of an unknown sample \(x'\) can be determined as \(f(x')=sign\left(\sum _{i=1}^{m}\alpha _{i}y_{i}(x_{i}\cdot x')+b\right).\) That is, the support vectors, which are the training samples \(x_{i}\) corresponding to \(\alpha_{i}>0\), totally determine the diagnosis according to the spatial locations of test samples with respect to them. Geometrically, the support vectors are the data points closest to the optimal separating hyperplane and can usually be identified in a corresponding visualization.
If the training data are not linearly separable, the SVM classifier can only find an optimal separating hyperplane that separates many but not all training samples. In other words, the SVM classifier permits misclassification errors in this soft-margin case [12]. Mathematically, this is equivalent to adding slack variables \(\xi_{i}\) and a penalty parameter C to the original problem under the L_{1} or L_{2} norm. The penalty parameter C, also called the box constraint parameter, is the upper bound of all Lagrange multipliers \(\alpha_{i}\) in the dual problem under the L_{1} norm.
For example, under the L_{1} norm regularization the original problem is updated as \(\arg \min _{w,b,\xi}\left(\frac {1}{2}\|w\|^{2}+C\sum _{i=1}^{m}\xi _{i}\right)\) under the conditions \(y_{i}(w\cdot x_{i}+b)\geqslant 1-\xi _{i}\) and \(\xi_{i}\geq 0\). Similarly, the original problem is updated as \(\arg \min _{w,b,\xi}\left(\frac {1}{2}\|w\|^{2}+C\sum _{i=1}^{m}\xi _{i}^{2}\right)\) under the same conditions for the L_{2} norm regularization. The w, b and corresponding support vectors can be obtained by solving the corresponding dual problems [12].
If the training data do not have a simple hyperplane as an effective separating criterion, they can be mapped to a higher or even infinite-dimensional feature space Γ using a mapping function \(\phi: x_{i}\rightarrow \Gamma\), and an optimal nonlinear decision boundary can be constructed in Γ to achieve better separation. Correspondingly, the decision function for an unknown sample \(x'\) is formulated as \(f(x')=sign\left(\sum _{i=1}^{m}\alpha _{i}y_{i}(\phi (x_{i})\cdot \phi (x'))+b\right).\) Note that the inner product \((\phi(x_{i})\cdot \phi(x_{j}))\) in Γ can be evaluated implicitly in the input space \(\Re^{n}\) by any kernel \(k(x_{i},x_{j})=(\phi(x_{i})\cdot \phi(x_{j}))\) whose kernel matrix is positive definite, that is, \(f(x')=sign\left(\sum _{i=1}^{m}\alpha _{i}y_{i}k(x',x_{i})+b\right).\)
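As a minimal sketch of the kernel decision function above, the following Python evaluates the sign of the weighted kernel sum plus the bias, using non-negative multipliers and explicit class labels. The function name `decide`, the toy support vectors, and the multiplier values are illustrative assumptions, not values from this study.

```python
def linear_kernel(x, z):
    """Plain inner product k(x, z) = x . z."""
    return sum(a * b for a, b in zip(x, z))


def decide(x_new, support_vectors, alphas, labels, b, kernel=linear_kernel):
    """Return the predicted label (+1 or -1) of x_new.

    alphas are the non-negative Lagrange multipliers and labels the
    class labels (+1 / -1) of the support vectors.
    """
    score = b + sum(
        a * y * kernel(x_new, sv)
        for a, y, sv in zip(alphas, labels, support_vectors)
    )
    return 1 if score >= 0 else -1


# Hypothetical toy model: two support vectors on either side of the origin.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
alphas = [0.25, 0.25]   # satisfies sum(alpha_i * y_i) = 0
labels = [1, -1]
b = 0.0

pred_pos = decide([2.0, 0.0], support_vectors, alphas, labels, b)
pred_neg = decide([-3.0, 1.0], support_vectors, alphas, labels, b)
```

Any other kernel can be passed through the `kernel` argument without changing the rest of the function, which mirrors how the kernel replaces the inner product in the text.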
Kernel selection
Although a whole class of kernel functions is available, we mainly focus on the following kernels in our experiment: the Gaussian radial basis function (‘rbf’) kernel \(k(x,x')=\exp(-\|x-x'\|^{2}/2\sigma^{2})\), the quadratic (‘quad’) kernel \(k(x,x')=(1+(x\cdot x'))^{2}\), the multilayer perceptron (‘mlp’) kernel \(k(x,x')=\tanh((x\cdot x')-1)\), and the widely used linear kernel \(k(x,x')=(x\cdot x')\). In addition, we design an adjusted Gaussian kernel function, ‘rbf2’, which is obtained by setting the bandwidth parameter in the original Gaussian kernel to the total variation of all m training samples, \(\sigma ^{2}=\frac {1}{(m-1)^{2}}\sum _{i,j}\|x_{i}-x_{j}\|^{2}\), to demonstrate the impact of parameter tuning on SVM diagnosis under the Gaussian ‘rbf’ kernel.
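The kernels listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the toolbox code: the function names are hypothetical, and the ‘rbf2’ bandwidth follows the total-variation formula as read from the text, with the sum running over all ordered pairs (i, j).

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def sq_dist(x, z):
    return sum((a - b) ** 2 for a, b in zip(x, z))


def k_linear(x, z):
    return dot(x, z)


def k_quad(x, z):
    return (1.0 + dot(x, z)) ** 2


def k_mlp(x, z):
    return math.tanh(dot(x, z) - 1.0)


def k_rbf(x, z, sigma2=1.0):
    """Gaussian kernel; sigma2 = 1 gives 'rbf', a tuned sigma2 gives 'rbf2'."""
    return math.exp(-sq_dist(x, z) / (2.0 * sigma2))


def rbf2_bandwidth(samples):
    """sigma^2 = sum_{i,j} ||x_i - x_j||^2 / (m - 1)^2 over all training samples."""
    m = len(samples)
    total = sum(sq_dist(xi, xj) for xi in samples for xj in samples)
    return total / (m - 1) ** 2


# Toy training samples (assumed values, for illustration only).
samples = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
sigma2 = rbf2_bandwidth(samples)
k_default = k_rbf(samples[0], samples[1])          # 'rbf' with sigma^2 = 1
k_tuned = k_rbf(samples[0], samples[1], sigma2)    # 'rbf2'
```

On these toy samples the tuned bandwidth lifts the kernel value between the first two samples from nearly zero to a value well inside (0, 1), which is the effect the ‘rbf2’ kernel is designed to probe.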
In practice, different SVM variants are applied in disease diagnosis for their advantages in modeling or implementation. The least-squares SVM (LS-SVM) is one of them [12, 17]. It employs only equality constraints to reformulate the standard SVM (C-SVM). As a result, the normal w and bias b of the optimal separating hyperplane are calculated by solving a linear system instead of a quadratic programming problem [18].
Previous results have reported that LS-SVM is comparable to the classic SVM in terms of performance and generalization [12, 18]. In this work, we employ LS-SVM in place of the classic SVM in disease diagnosis for its efficiency and simplicity [17, 18]. The LS-SVM implementation is taken from the MATLAB R2012b Bioinformatics Toolbox, which implements the L_{2} soft-margin SVM classifier [19].
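For orientation, the standard least-squares SVM classifier can be written as below. This is a sketch in the usual textbook notation (error variables \(e_{i}\), matrix entries \(\Omega_{ij}\)), which is assumed here rather than taken from the toolbox; the toolbox implementation may differ in detail.

```latex
\[
\min_{w,b,e}\ \frac{1}{2}\|w\|^{2}+\frac{C}{2}\sum_{i=1}^{m}e_{i}^{2}
\quad \text{subject to} \quad
y_{i}\,(w\cdot\phi(x_{i})+b)=1-e_{i},\quad i=1,\dots,m.
\]
% Eliminating w and e from the optimality conditions leaves one linear system:
\[
\begin{bmatrix} 0 & y^{T} \\ y & \Omega + C^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix},
\qquad \Omega_{ij}=y_{i}y_{j}\,k(x_{i},x_{j}).
\]
```

The squared error term plays the role of the L_{2} soft margin, and the equality constraints are what turn the quadratic program into a single linear solve.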
SVM classifier parameterization
Since we aim to address generic diagnostic bias problems in translational bioinformatics through support vector machines, we avoid SVM models with too many parameters and very special parameter values, which would limit the generality of the results. As such, we employ the LS-SVM model for its built-in advantage of simpler parameter setting compared with other models [17]. Moreover, we use generic default parameters in the SVM diagnosis to guarantee the reproducibility and generalization of our results.
The most important parameter in our context is the penalty parameter C, which directly affects the training errors and generalization. A large C may produce better diagnostic results but risks the loss of generalization of the classifier; a small C may lead to poorer diagnostic results but enhance the classifier’s generalization. In our context, the penalty parameter C is set to 1.0 uniformly in all diagnoses, instead of being rescaled for different groups of samples, to guarantee comparable results across data sets with skewed or balanced label distributions. In particular, such a parameter choice leads to more comparable and easily interpretable values of the Lagrange multipliers \(\alpha_{i}\), which are the weights of the support vectors. Although a grid search can be employed to seek ‘optimal’ C values by trying a geometric sequence such as 2^{−10},2^{−9},…,2^{0},…,2^{10} under a specified cross validation for each data set [13], such an approach may not produce generalizable diagnostic results and may demand prohibitive training time.
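The grid search mentioned above can be sketched as follows. Only the candidate grid from 2^{−10} to 2^{10} comes from the text; the scoring callable `cv_score` is a hypothetical placeholder for whatever cross-validated accuracy routine is used, and the toy score below is purely illustrative.

```python
def make_c_grid(low_exp=-10, high_exp=10):
    """Geometric sequence 2^low_exp, ..., 2^high_exp of candidate C values."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1)]


def grid_search(c_grid, cv_score):
    """Return the C with the highest cross-validated score.

    cv_score is a hypothetical callable C -> score supplied by the caller.
    """
    return max(c_grid, key=cv_score)


c_grid = make_c_grid()

# Toy stand-in for a cross-validation routine, peaking at C = 4.
best_c = grid_search(c_grid, cv_score=lambda c: -abs(c - 4.0))
```

With 21 candidates, each data set needs 21 complete cross-validation runs, which is the training-time cost the text refers to when it fixes C at 1.0 instead.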
Furthermore, we automatically scale the training samples to zero mean and unit variance before training, which is equivalent to the corresponding feature scaling [13], to optimize the structure of the kernel matrix for the sake of learning efficiency and subsequent diagnostic generalization.
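A minimal sketch of this scaling step is given below. It assumes per-feature standardization with the sample standard deviation, and it reuses the training statistics for test samples so that new samples follow the same normalization; the function names are hypothetical and the exact convention of the toolbox is not verified here.

```python
import statistics


def fit_scaler(train_rows):
    """Per-feature mean and sample standard deviation of the training data."""
    columns = list(zip(*train_rows))
    means = [statistics.mean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [statistics.stdev(col) or 1.0 for col in columns]
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize rows with statistics learned from the training data."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy data: three training samples with two features each.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_scaler(train)
train_scaled = apply_scaler(train, means, stds)
test_scaled = apply_scaler([[4.0, 20.0]], means, stds)
```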
Model selection
We employ widely used cross-validation methods for model selection, including k-fold cross-validation (k-fold CV) and an independent training and test set approach, in addition to leave-one-out cross-validation (LOOCV), for the sake of a comprehensive diagnostic bias investigation. The k-fold CV randomly partitions the training data into k disjoint subsets of approximately equal size, removes the i^{th} subset from the training data, and employs the remaining k−1 subsets to construct the decision function and infer the class types of the samples in the removed subset. Moreover, in the independent training and test set approach, we randomly select 50 % of the input omics data for training and the other 50 % for testing, and repeat this process 500 times for each data set to fully investigate different diagnostic biases and validate the effectiveness of our proposed bias-conquering algorithm.
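The two partitioning schemes described above can be sketched as follows. The helper names and seeds are assumptions for illustration, and the sketch uses plain random partitioning without stratifying by label, since the text does not state whether the splits were stratified.

```python
import random


def kfold_indices(m, k, seed=0):
    """Randomly partition range(m) into k disjoint folds of near-equal size."""
    idx = list(range(m))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_test_pairs(folds):
    """Yield (train, test) index lists, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train, test


def half_split(m, seed):
    """Random 50/50 split of range(m) into training and test indices."""
    idx = list(range(m))
    random.Random(seed).shuffle(idx)
    return idx[: m // 2], idx[m // 2:]


folds = kfold_indices(m=47, k=5)                        # e.g. 47 samples, 5 folds
pairs = list(train_test_pairs(folds))
splits = [half_split(47, seed=s) for s in range(500)]   # 500 repetitions
```

With skewed labels, an unstratified fold can contain very few minority-count samples, so per-fold sensitivity and specificity can vary considerably between trials.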
Data selection and preprocessing
We first choose three benchmark omics data sets in our experiment: BreastIBC, Hepatocellular carcinoma (HCC), and Kidney, which are produced by state-of-the-art gene array, protein array and RNA-Seq technologies respectively [20–22]. Table 1 gives the detailed information of the three data sets in terms of platforms, sample distributions, and feature numbers, where a feature refers to a gene (probe), m/z ratio, or transcript in our context.
These data are normalized and processed by different methods. For example, the robust multiarray average (RMA) method is applied to normalize the BreastIBC data, and Reads Per Kilobase per Million mapped reads (RPKM) is used to normalize the Kidney data [23–25]. The original raw BreastIBC data set was retrieved from the NCBI Gene Expression Omnibus (GEO) series with accession number GSE5847, and consists of 13 inflammatory breast cancer (‘IBC’) and 34 non-inflammatory breast cancer (‘NIBC’) stromal cell samples across 22,283 probes [21, 26]. We further filtered out small-variance genes and obtained our BreastIBC data set with 18,995 probes. The Hepatocellular carcinoma (HCC) data is a mass spectral proteomic data set generated on the MALDI-TOF platform; its detailed normalization process can be found in Ressom et al.’s work [20].
Both the BreastIBC and HCC data follow normal distributions, and the Kidney data approximately follow negative binomial (NB) distributions [25]. In addition, the sample label distributions of these data also differ. The HCC data have an almost balanced distribution: 78 hepatocellular carcinoma vs 72 normal samples. The BreastIBC and Kidney data, however, have obviously skewed label distributions, where the majority-count samples far outnumber the minority-count samples (e.g. 13 ‘IBC’ vs 34 ‘NIBC’ in the BreastIBC data; 68 normal vs 475 renal cell carcinoma tumor samples in the Kidney data).
Results
We introduce the following measures for diagnostic bias investigation: diagnostic accuracy, sensitivity, specificity, positive predictive ratio (PPR), and negative predictive ratio (NPR). The diagnostic accuracy is the ratio of correctly diagnosed test samples (targets) to total test samples (targets), i.e. \(accuracy=\frac {TP+TN}{TP+FP+TN+FN},\) where TP (TN) is the number of positive (negative) samples correctly diagnosed, and FP (FN) is the number of negative (positive) samples incorrectly diagnosed. The sensitivity, specificity, PPR, and NPR are defined as \(sensitivity=\frac {TP}{TP+FN},\) \(specificity=\frac {TN}{TN+FP},\) \(PPR=\frac {TP}{TP+FP},\) and \(NPR=\frac {TN}{TN+FN}\) respectively. We use targets and samples interchangeably in this study.
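The five measures can be computed directly from the confusion counts, as in the sketch below. The function names and the toy labels are assumptions for illustration; a zero denominator is mapped to NaN so that undefined ratios stay visible instead of raising an error.

```python
import math


def ratio(num, den):
    """num / den, or NaN when the denominator is zero."""
    return num / den if den else math.nan


def diagnostic_measures(y_true, y_pred):
    """Accuracy, sensitivity, specificity, PPR and NPR for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == -1 and p == -1 for t, p in pairs)
    fp = sum(t == -1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == -1 for t, p in pairs)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "PPR": ratio(tp, tp + fp),
        "NPR": ratio(tn, tn + fn),
    }


# Toy labels: 4 positive and 2 negative samples, one error of each kind.
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, -1, 1]
measures = diagnostic_measures(y_true, y_pred)
```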
We conduct SVM diagnosis under 5-fold cross validation for the three data sets with the following kernels: ‘linear’, ‘quad’, ‘mlp’, ‘rbf’, and ‘rbf2’, where the bandwidth parameter σ^{2} in the ‘rbf’ and ‘rbf2’ kernels is set to 1 and to the total variation of all training samples respectively. Each sample in the training data is scaled to zero mean and unit variance before the optimal separating hyperplane is built in SVM diagnostics. Table 2 presents the SVM diagnoses for the three benchmark data sets with five kernels under the 5-fold cross validation. We have the following interesting findings about diagnostic biases.
Three diagnostic biases
Diagnostic biases can take place in an SVM classifier with any kernel, but they are more likely to occur under nonlinear kernels. In fact, they can happen for almost all SVM classifiers under three different scenarios: overfitting bias, label skewness bias, and underfitting bias. It is worth pointing out that the overfitting bias and label skewness bias may demonstrate similar diagnostic results, although they have different causes.
Overfitting biases
The overfitting bias demonstrates the majority-count phenotype favor mechanism in diagnosis under nonlinear kernels such as ‘rbf’. That is, the SVM classifier always diagnoses an unknown sample as the type with the majority count in the training data (e.g., the ‘NIBC’ type for the BreastIBC data). Consequently, its diagnostic accuracy equals or approximates the majority-count ratio of the input data. For example, the SVM with the ‘rbf’ kernel (SVM-rbf) has diagnostic accuracies that approximate or exactly equal the corresponding majority-count ratios for the three data sets: \(72.56\,\%\approx \frac {34}{34+13}=72.34\,\%,\) \(52.00\,\%=\frac {78}{78+72},\) and \(87.48\,\%=\frac {475}{475+68}\) respectively.
Why does NaN appear in diagnostic results?
Why is the corresponding NPR reported as NaN in the diagnostics (Table 2)? The reason is that the classifier can only recognize the majority-count samples, which are specified as the positive target type in our experiment. That is, each diagnostic trial has zero true negatives and zero false negatives, i.e. TN=0 and FN=0, because all negative targets, which are the minority-count samples in our experiment, are diagnosed as the positive type. As a result, \(NPR=\frac {TN}{TN+FN}=\frac{0}{0}\) is NaN. Likewise, the corresponding sensitivity values are always 100 % \(\left (\frac {TP}{TP+FN}=\frac {TP}{TP}=1.0\right)\) and the specificity values are always 0 % \(\left(\frac {TN}{TN+FP}=\frac {0}{FP}=0.0\right)\), where FP is the total number of negative samples, which are the minority-count samples in our diagnostic experiments.
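As a worked illustration, suppose for simplicity that all 47 BreastIBC samples were diagnosed in a single trial by a classifier that labels everything as the positive (‘NIBC’) type. The pooling into one trial is an assumption made here for clarity; the study itself reports per-fold averages.

```latex
\[
TP=34,\quad FN=0,\quad FP=13,\quad TN=0
\]
\[
accuracy=\frac{34+0}{34+13+0+0}\approx 72.34\,\%,\quad
sensitivity=\frac{34}{34+0}=100\,\%,\quad
specificity=\frac{0}{0+13}=0\,\%,
\]
\[
PPR=\frac{34}{34+13}\approx 72.34\,\%,\quad
NPR=\frac{0}{0+0}\ \text{(undefined, reported as NaN)}.
\]
```

The accuracy and the PPR coincide with the majority-count ratio because every sample is assigned to the majority type.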
Similarly, the SVM with the ‘rbf2’ kernel, which is obtained by tuning the bandwidth parameter in the original Gaussian kernel, demonstrates similar diagnostic results. Although it shows some improvement for the protein array data (HCC data), it still demonstrates the majority-phenotype favor mechanism for the gene array and RNA-Seq data. This indicates that simply tuning the bandwidth parameter may not be a good way to conquer such a diagnostic bias.
Label skewness biases
Unlike the overfitting bias, the label skewness bias appears in two different cases. In the first, SVM classifiers with a linear or nonlinear kernel (e.g., ‘quad’) demonstrate an explicit label skewness bias by presenting a diagnostic accuracy close to the majority-count ratio together with an imbalanced pair of sensitivity and specificity. For example, Table 2 shows that both the SVM-linear and SVM-quad classifiers achieve 74.56 % accuracy, which is close to the majority-count ratio of 72.34 %, with an imbalanced sensitivity of 97.14 % and specificity of 16.67 % for the BreastIBC data. This indicates that such a model recognizes only a few negative targets in one or more diagnostic trials, while diagnosing nearly all positive targets and most negative targets as the positive type, which is the majority-count type specified in our implementation.
In the second, a linear kernel SVM demonstrates an implicit label skewness bias by presenting a normal diagnostic accuracy together with an imbalanced pair of sensitivity and specificity. For example, the SVM-linear classifier achieves 90.23 % accuracy with sensitivity 96.84 % and specificity 44.07 % for the Kidney data. Such a result indicates that there are many more false positives than false negatives due to the dominance of the positive type in the training data.
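The following sketch shows why a skewed label distribution lets a high accuracy hide a poor specificity. It takes the sensitivity and specificity quoted above and assumes the class sizes of the Kidney data (475 positive, 68 negative); the function name is hypothetical, and pooling all samples into one trial is a simplification of the per-fold averages.

```python
def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Overall accuracy implied by per-class rates and class sizes."""
    return (sensitivity * n_pos + specificity * n_neg) / (n_pos + n_neg)


# Rates quoted in the text; class sizes assumed to be those of the Kidney data.
skewed = accuracy_from_rates(0.9684, 0.4407, n_pos=475, n_neg=68)

# Same rates with hypothetical equal class sizes, for comparison.
balanced = accuracy_from_rates(0.9684, 0.4407, n_pos=100, n_neg=100)
```

Under the skewed class sizes the implied accuracy is about 90 %, whereas the same per-class rates on equally sized classes would give only about 70 %, which makes the bias visible.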
Not all linear kernel SVMs encounter diagnostic bias. For instance, the SVM-linear classifier achieves 94.02 % accuracy with 95.81 % sensitivity and 94.21 % specificity for the Hepatocellular carcinoma (HCC) data, whose 78 HCC and 72 normal samples have a more balanced label distribution than the BreastIBC and Kidney data.
Underfitting biases
The underfitting bias means that an SVM classifier with a nonlinear kernel such as ‘mlp’ leads to an underfitting model in diagnostics. The model itself is inappropriate for disease diagnostics because the high-dimensional feature space generated by the kernel function may distort the information conveyed by the original data [12, 27]. As a result, the SVM classifier has quite low diagnostic performance due to underfitting. For example, the SVM-mlp classifier has a diagnostic accuracy of about 50 % for all three data sets. That is, the classifier is equivalent to a random classifier that conducts almost ad hoc diagnosis because of the underfitting bias.
Finally, the diagnostic biases seem to be irrespective of data distributions. In our experiment, they occur for the gene and protein array data, which follow normal distributions, and for the RNA-Seq count data, which follow negative binomial (NB) distributions [25].
Diagnostic biases under other cross validations
It is worth pointing out that diagnostic biases can also happen under other cross validations, such as the independent training and test set approach and leave-one-out cross validation (LOOCV), besides k-fold cross validation. This is because diagnostic biases may occur in each diagnostic trial under a specific kernel due to the built-in characteristics of the input data, which we discuss in the next section. For example, we generate 100 independent training and test sets for the BreastIBC data, where each sample has a 50 % likelihood of being selected for the training or the test set. The SVM-rbf and SVM-linear classifiers have almost the same performance as illustrated in Table 2: the former has an average accuracy of 72.70 % ± 6.48 % with sensitivity 100.00 % ± 0.00 % and specificity 0.00 % ± 0.00 %; the latter has an average accuracy of 73.83 % ± 7.02 % with sensitivity 92.87 % ± 6.58 % and specificity 25.45 % ± 15.82 %. Similar results can also be found for this data set under LOOCV.
What are the reasons for diagnostic biases?
There are different reasons for the three diagnostic biases, though the overfitting bias and label skewness bias may demonstrate similar diagnostic results.
The overfitting bias is rooted in the large or even huge pairwise distances \(d_{ij}=\|x_{i}-x_{j}\|\) between omics samples, which imply that the corresponding similarities in the feature space, i.e. the kernel values under the ‘rbf’ kernel \(k(x_{i},x_{j})=\exp(-\|x_{i}-x_{j}\|^{2}/2)\), are zero or tiny values close to zero for \(i\neq j\). As a result, the kernel matrix is an identity or approximately identity matrix, which causes the SVM classifier to recognize only the majority-count type samples.
Figure 1 shows boxplots of all squared pairwise sample distances \(d_{\textit {ij}}^{2}\ (i\neq j)\) for each data set in the first row, and the kernel matrices of the three data sets under the ‘rbf’ kernel in the second row, where each data set is viewed as the population of training data. It is interesting to see that the minimum \(d_{\textit {ij}}^{2}\) is greater than 10^{2}, which means that the kernel value between any two distinct samples is approximately zero: \(k(x_{i},x_{j})\leq \exp(-10^{2}/2)\sim 10^{-22}\). As a result, the corresponding kernel matrix is an identity matrix, as illustrated by the corresponding plot in the second row.
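The collapse of the ‘rbf’ kernel matrix to an identity matrix is easy to reproduce on toy data. The sketch below uses three assumed two-dimensional samples whose squared pairwise distances are all at least 10^{2}; the sample values and function name are illustrative only.

```python
import math


def rbf_kernel_matrix(samples, sigma2=1.0):
    """Gaussian kernel matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma2))."""
    def sq_dist(x, z):
        return sum((a - b) ** 2 for a, b in zip(x, z))
    return [[math.exp(-sq_dist(x, z) / (2.0 * sigma2)) for z in samples]
            for x in samples]


# Toy samples whose squared pairwise distances are all >= 10^2.
samples = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
K = rbf_kernel_matrix(samples)
largest_off_diagonal = max(K[i][j] for i in range(3) for j in range(3) if i != j)
```

When every kernel value between a test sample and the training samples is this small, the kernel sum in the decision function vanishes and the prediction reduces to the sign of the bias b, so every test sample receives the same label.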
The large or even huge pairwise sample distances in each omics data set are rooted in the molecular signal amplification mechanism in high-throughput profiling, where gene array, protein array and RNA-Seq technologies all employ real-time PCR or similar approaches to amplify gene and protein expression levels exponentially [14, 15]. On the one hand, the amplified molecular signals greatly increase the sensitivity to disease phenotypes and corresponding genotypes in diagnostics [28]. On the other hand, the pairwise distances between samples become large or even huge mathematically, even if each sample is standardized to zero mean and unit standard deviation.
The label skewness bias is due to the skewness of the label distribution, which yields more support vectors from the majority-count type, so that the class type of an unknown sample is more likely to be determined as the majority-count type. Figure 2 shows the distributions of the α values, i.e., the Lagrange multipliers \(\alpha_{1},\alpha_{2},\cdots,\alpha_{m}\) in the dual problem, in each diagnostic trial of the 5-fold cross validation. As the weights of the corresponding support vectors, their values are always positive or zero, as we pointed out before. However, our SVM implementation attaches a sign to each weight to indicate its class, i.e. a positive (negative) sign means that the weight (e.g. \(\alpha_{1}\)) belongs to a support vector in the positive (negative) target group. The distributions of α values are nearly balanced for the Hepatocellular carcinoma (HCC) data, which has a relatively balanced sample label distribution: the number of positive signs is almost equal to the number of negative signs. However, the distributions of α values of the BreastIBC and Kidney data are obviously skewed toward the positive targets, which are the majority-count samples in each data set. In other words, more support vectors are found for the majority-count type, which increases the likelihood that an unknown sample is detected as the majority-count type in the subsequent decision making. For example, since 256 and 178 α values carry positive and negative signs respectively in the 5^{th} diagnostic trial for the Kidney data, a test sample is more likely to be detected as a positive target.
On the other hand, the corresponding b values, which are the intercepts of the hyperplane that separates the two groups in the normalized data space, are positive in every trial. For example, the b values of the five diagnostic trials for the Kidney and BreastIBC data are [0.7425, 0.7603, 0.7333, 0.7649, 0.7465] and [0.4594, 0.4210, 0.4594, 0.4594, 0.4359] respectively. As such, given a test sample \(x'\), the decision function \(f(x')=sign\left(\sum _{i=1}^{k}\alpha _{i}k(x',x_{i})+b\right)\), where the \(\alpha_{i}\) are the signed weights of the k support vectors, is more likely to determine it as the positive type, because most support vectors are from the positive type (the majority-count type) and the intercept value b is positive.
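The role of the signed weights and the positive intercept can be illustrated with a small sketch. The signed α values and kernel values below are hypothetical; only the intercept 0.7425 is taken from the Kidney values quoted above, and the function names are assumptions.

```python
def count_signs(signed_alphas):
    """Count support-vector weights carrying positive and negative signs."""
    n_pos = sum(a > 0 for a in signed_alphas)
    n_neg = sum(a < 0 for a in signed_alphas)
    return n_pos, n_neg


def decide(kernel_values, signed_alphas, b):
    """sign(sum_i alpha_i * k(x', x_i) + b) with class-signed weights."""
    score = sum(a * k for a, k in zip(signed_alphas, kernel_values)) + b
    return 1 if score >= 0 else -1


# Hypothetical signed weights: three positive-class and two negative-class
# support vectors; the intercept is the first Kidney value quoted in the text.
signed_alphas = [0.4, 0.3, 0.3, -0.5, -0.5]
b = 0.7425
n_pos, n_neg = count_signs(signed_alphas)

# A test sample equally similar to every support vector: the kernel sum
# cancels (the weights sum to zero) and the positive intercept decides.
label_neutral = decide([0.5] * 5, signed_alphas, b)

# A test sample similar only to the negative-class support vectors must
# accumulate a kernel sum below -b before it is labeled negative.
label_weak_neg = decide([0.0, 0.0, 0.0, 0.7, 0.7], signed_alphas, b)
label_strong_neg = decide([0.0, 0.0, 0.0, 1.0, 1.0], signed_alphas, b)
```

In this toy setting a sample that resembles only the negative-class support vectors is still labeled positive unless that resemblance is strong enough to outweigh the intercept.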
The underfitting bias is caused by an inappropriate kernel function such as ‘mlp’, which results in a kernel matrix whose entries are all ‘1’s and which therefore has no capability to distinguish different samples. To some degree, it corresponds to the extreme case of an SVM classifier under a Gaussian kernel with too large a bandwidth parameter, which also leads to a kernel matrix with all ‘1’ entries. Note that the underfitting bias, like the overfitting bias, is independent of the input data label distribution, though it corresponds to a kernel matrix with all ‘1’ entries instead of the identity kernel matrix of the overfitting bias or the normal kernel matrix of the label skewness bias.
Figure 3 shows the ‘mlp’ and ‘linear’ kernel matrices of the three data sets, where each data is treated as a training population. It is clear to see that the kernel matrices under the underfitting bias are flat matrices with all ‘1’ entries, but the kernel matrices under the linear kernel appear to be normal for all three data sets, even if there are explicit and implicit label skewness biases for the BreastIBC and Kidney data respectively.
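The two degenerate kernel matrices discussed above (all ‘1’ entries versus identity) can be reproduced with a Gaussian kernel at extreme bandwidths. The following is a minimal sketch under stated assumptions: the toy samples, the bandwidth values and the function name `gaussian_kernel_matrix` are hypothetical, and the kernel is written in the common form exp(−‖x_i − x_j‖²/(2σ²)), which may differ in scaling from the toolbox used in this study.

```python
import math


def gaussian_kernel_matrix(samples, sigma):
    """K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [[math.exp(-sq_dist(u, v) / (2.0 * sigma ** 2)) for v in samples]
            for u in samples]


# Hypothetical toy samples with large pairwise distances.
samples = [[0.0, 0.0], [30.0, 40.0], [-60.0, 80.0]]

# Bandwidth far too large: every entry is numerically 1 (underfitting case).
flat = gaussian_kernel_matrix(samples, sigma=1e6)

# Bandwidth far too small relative to the distances: off-diagonal entries
# vanish and the matrix becomes the identity (overfitting case).
identity = gaussian_kernel_matrix(samples, sigma=1.0)

for name, matrix in (("flat", flat), ("identity", identity)):
    print(name, [[round(v, 6) for v in row] for row in matrix])
```

In both extremes the kernel matrix carries no information about which samples resemble each other, which is why neither classifier can separate the two groups.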
Diagnostic bias conquering
There are no systematic approaches available to conquer diagnostic biases, owing to the gap between machine learning and translational bioinformatics [10]. Although previous work in data mining has investigated imbalanced data in SVM classification, it mainly focuses on ‘imbalanced data’ whose sample label distributions are extremely imbalanced (e.g., 99.5 % positive labels and 0.5 % negative labels) [29, 30]. Moreover, these imbalanced data are not high-throughput omics data and do not have the ‘large number of variables but small number of observations’ characteristic shared by all high-throughput omics data [11]. Thus, a more general but omics data focused algorithm is needed to overcome the diagnostic biases.
The overfitting and underfitting biases can be ‘conquered’ by avoiding the kernels that lead to identity, nearly identity, or all ‘1’ kernel matrices. However, it can be challenging to conquer the label skewness bias, especially the implicit case, which has a ‘reasonable’ diagnostic accuracy but unbalanced sensitivity and specificity.
In this work, we propose derivative component analysis (DCA) based support vector machines (DCASVM) to conquer the label skewness bias by extracting true signals, i.e. by uncovering latent data characteristics from the input data [16]. The true signals share the same dimensionality as the original data but capture the essential data characteristics. We introduce DCA briefly as follows; more details about this algorithm can be found in Han’s previous work on DCA [16].
Derivative component analysis (DCA)

1. Input: X^{t}=[x_{1},x_{2}⋯x_{ n }], x_{ i }∈ℜ^{p}; DWT level J; cutoff τ; wavelet ψ; variability explanation threshold ρ
2. Output: true signals X^{∗}
3. Step 1: Conduct J-level DWT with wavelet ψ for X^{t} to obtain the detail coefficient matrices cD_{ j } and the approximation matrix cA_{ J }: [cD_{1},cD_{2},⋯,cD_{ J };cA_{ J }], where \({cD}_{j}\in \Re ^{p_{j}\times n}, {cA}_{J}\in \Re ^{p_{J}\times n},p_{j}=\left \lceil {{p}/{2^{j}}}\right \rceil \).
4. Step 2: Extract subtle data characteristics, remove system noise and retrieve global data characteristics
  (a) Conduct PCA for cD_{ j }, 1≤j≤τ to obtain its PC matrix \(U=[u_{1},u_{2},\cdots u_{p_{j}}], u_{i}\in \Re ^{n}\) and score matrix \(S=[s_{1},s_{2}\cdots s_{p_{j}}], s_{i}\in \Re ^{p_{j}}, i=1,2\cdots p_{j}\).
  (b) Identify PCs u_{1},u_{2}⋯u_{ m } such that their variability explanation ratio ρ_{ m }≥ρ
  (c) Reconstruct \({cD}_{j}\leftarrow \frac {1}{p_{j}}{cD}_{j}\vec {(1)}\vec {(1)}^{T}+\sum _{i=1}^{m}u_{i}\times {s_{i}^{T}}, \vec {(1)}\in \Re ^{p_{j}}\) with all entries being ‘1’s
  (d) Reconstruct cD_{ j }, τ<j≤J and cA_{ J } under a variability explanation ratio of at least 95 %
5. Step 3: Approximate the original data by the corresponding inverse DWT with the wavelet ψ: X^{∗}←inverseDWT([cD_{1},cD_{2}⋯cD_{ J };cA_{ J }]).
In our implementation, we uniformly set the transform level J=7 for the wavelet ‘db8’ and the cutoff τ=2, and apply the first-PC-based detail coefficient matrix reconstruction in DCA for convenience of implementation [16, 31].
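The DWT scaffolding of Steps 1 and 3 can be sketched in a few lines. The example below is only an illustration under loud assumptions: it substitutes the Haar wavelet for ‘db8’ to stay dependency-free, handles a single profile rather than the whole matrix X^{t}, pads odd lengths by repeating the last value, and leaves out the PCA-based coefficient reconstruction of Step 2 entirely. The names `haar_dwt` and `haar_idwt` and the synthetic profile are hypothetical. It shows that the coefficient lengths follow p_{j}=⌈p/2^{j}⌉ and that the inverse transform recovers the input when no coefficients are modified.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal, levels):
    """Multi-level Haar DWT. Returns ([cD_1, ..., cD_J], cA_J)."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) % 2:                 # pad odd lengths (assumption)
            approx.append(approx[-1])
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return details, approx


def haar_idwt(details, approx, length):
    """Invert haar_dwt and trim the padding back to the original length."""
    current = list(approx)
    for j in range(len(details) - 1, -1, -1):
        rebuilt = []
        for s, d in zip(current, details[j]):
            rebuilt.append((s + d) / SQRT2)
            rebuilt.append((s - d) / SQRT2)
        # Length of the approximation one level up (before padding).
        target = len(details[j - 1]) if j > 0 else length
        current = rebuilt[:target]
    return current


# Synthetic profile with p = 100 features, decomposed with J = 7 levels.
profile = [math.sin(0.1 * i) + 0.01 * i for i in range(100)]
details, approx = haar_dwt(profile, levels=7)
sizes = [len(d) for d in details]
restored = haar_idwt(details, approx, len(profile))

print(sizes)
print(max(abs(a - b) for a, b in zip(profile, restored)))
```

In DCA proper, the detail coefficients at levels 1≤j≤τ would be replaced by their PCA-based reconstruction before the inverse transform; here they are passed through unchanged so that only the decomposition and reconstruction steps are visible.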
Derivative component analysis based support vector machines (DCASVM)
Given training data X=[x_{1},x_{2}⋯x_{ p }]^{T} and their labels \(\{x_{i},c_{i}\}_{i=1}^{p}, c_{i}\in \{-1,+1\}\), the corresponding true signals Y=[y_{1},y_{2}⋯y_{ p }]^{T} are computed by using DCA. Then, a maximum-margin hyperplane O_{ h }:w^{T}ϕ(y)+b=0 in the feature space is constructed to separate the ‘+1’ (‘cancer’) and ‘−1’ (‘control’) types of samples in the true signals Y, which is equivalent to solving the following optimization problem with a parameter μ>0,
The dual problem of this constrained minimization problem can be formulated as follows, where k(y_{ i },y_{ j })=(ϕ(y_{ i })·ϕ(y_{ j }))
The b and α_{ i },i=1,2⋯p can be obtained by solving the corresponding linear system of the dual problem. The decision rule \(f(x')=sign\left (\sum _{i=1}^{p}\alpha _{i}k(y_{i},y')+b\right)\) is used to determine the class type of a testing sample x^{′}, where y^{′} is its corresponding vector computed from DCA. The function k(y_{ i },y^{′}) is a kernel function mapping y_{ i } and y^{′} into a same-dimensional or higher-dimensional feature space; it is chosen as the linear kernel k(y_{ i },y^{′})=(y_{ i }·y^{′}) in our experiment.
Random undersampling Boost (RUBoost)
To demonstrate the effectiveness of the proposed algorithm, we include an ensemble learning method, random undersampling Boost (RUBoost), as well as the original SVM as comparison algorithms [29]. We choose this ensemble learning method because it is believed to perform well for imbalanced data [29, 30, 32]. We employ an ensemble of 1000 deep trees with a minimal leaf size of 5 and a learning rate of 0.1 in RUBoost learning to attain a high ensemble accuracy.
Table 3 compares the performance of the proposed DCASVM with those of SVM and RUBoost under the 5-fold cross validation. Our algorithm not only fully conquers the label skewness biases for the BreastIBC and Kidney data, but also achieves exceptional diagnostic results for all three data sets, because its extraction of latent data characteristics forces a data characteristics driven diagnosis. It is noted that the extracted latent data characteristics contribute to the structure optimization of the kernel matrices, which enhances the classifier’s detectability [31, 33, 34].
For example, the explicit label skewness diagnostic bias illustrated in the BreastIBC data is overcome by achieving 97.78 % diagnostic accuracy with 100 % sensitivity and 90 % specificity. Whereas all negative targets were previously recognized as positive targets in some diagnostic trials, the total negative prediction rate (NPR) is now 100 % and the positive prediction rate (PPR) is 97 %. Moreover, the implicit label skewness diagnostic bias illustrated in the Kidney data is overcome by achieving 99.81 % diagnostic accuracy with 99.79 % sensitivity and 100 % specificity, compared to the original 90.23 % diagnostic accuracy with 96.84 % sensitivity and 44.07 % specificity.
Furthermore, DCASVM achieves exceptional diagnostics on the HCC data, attaining 99.33 % diagnostic accuracy with 100 % sensitivity and 98.57 % specificity, compared to the original 94.02 % accuracy with 95.81 % sensitivity and 92.42 % specificity. By contrast, the RUBoost diagnosis shows some improvement in balancing the sensitivity and specificity, but it has a relatively low diagnostic accuracy, especially for the balanced HCC data, and needs a long learning time.
Figure 4 compares the ROC plots of the DCASVM, SVM, PCASVM and ICASVM diagnoses under the 5-fold cross validation for the BreastIBC and Kidney data [16, 33]. The proposed DCASVM diagnosis conquers the label skewness bias and achieves the best performance, which makes it a good candidate for personalized diagnostics in the coming era of personalized medicine, given its unbiased and exceptional diagnostic performance across different omics data. It is worthwhile to point out that such rivaling-clinical diagnosis arises mainly because the true signal extraction in DCA forces the SVM hyperplane construction to rely on both subtle and global data characteristics of the whole profile in a denoised feature space, which seems to contribute greatly to a robust and consistent high-accuracy diagnosis. In fact, since this consistent performance holds across different data sets rather than on a single data set, it almost excludes any possibility of overfitting. Moreover, the following two subsections further demonstrate that such exceptional performance does not stem from overfitting, because our proposed algorithm works consistently well for different data sets with different training and test data selection methods. In particular, the phenotype separation results in Fig. 5 strongly validate its effectiveness from a biomarker discovery and visualization standpoint.
Independent data sets: brain low grade glioma (LGG) TCGA data
To further demonstrate the effectiveness of our proposed algorithm, we have retrieved level-3 TCGA data for brain low grade gliomas (LGG) from the TCGA portal, which include gene expression, protein expression, RNASeq and miRNASeq data [22, 35]. LGG refers to the grade I and grade II glioma tumors, which are usually considered benign brain tumors compared with grade III and IV glioma tumors. Since the gene and protein expression data only contain grade I glioma samples, which prevents us from doing diagnostics from a translational bioinformatics viewpoint, we include the RNASeq and miRNASeq data as the independent data sets GliomaRNASeq and GliomaMiRNASeq for our algorithm testing. Detailed information about the two data sets can be found in Table 4, where each feature refers to a gene or microRNA.
Normalization
It is noted that both are ‘imbalanced data’, where 96.63 % and 95.88 % of the samples are grade II tumors respectively, and both approximately follow the negative binomial (NB) distribution. The raw GliomaRNASeq data, a large data set that requires 14.5 gigabytes of storage, is normalized by dividing each sample by a scale factor s=Q_{3}/1000, where Q_{3} is the 75th percentile of each sample. The raw GliomaMiRNASeq data is normalized by the count-per-million method, in which all counts in a sample are adjusted to reads per million to facilitate comparison between samples [36].
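The two normalizations named above, upper-quartile scaling with s=Q_{3}/1000 and counts per million, can be written compactly for a single sample. This is a minimal sketch rather than the pipeline used in this study: the function names and the toy counts are hypothetical, and it assumes that Q_{3} is taken over all counts in the sample (including zeros) with an inclusive percentile definition.

```python
import statistics


def upper_quartile_scale(counts):
    """Divide one sample's counts by s = Q3 / 1000 (Q3 = 75th percentile)."""
    q3 = statistics.quantiles(counts, n=4, method="inclusive")[2]
    s = q3 / 1000.0          # assumes Q3 > 0; an all-zero sample would fail
    return [c / s for c in counts]


def counts_per_million(counts):
    """Rescale one sample's counts so that they sum to one million."""
    total = sum(counts)      # assumes the sample has at least one read
    return [c * 1e6 / total for c in counts]


# Hypothetical raw counts for one sample with five features.
raw = [0, 10, 20, 30, 40]
uq = upper_quartile_scale(raw)
cpm = counts_per_million(raw)

print(uq)
print(cpm)
```

After upper-quartile scaling the 75th-percentile feature sits at 1000 in every sample, and after the count-per-million adjustment every sample has the same total, which is what makes samples comparable.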
Monte Carlo simulation oriented training and test data selection
Different from the previous k-fold cross-validation, we randomly select 50 % of the Glioma RNASeq (miRNASeq) samples for training and the other 50 % for testing, and repeat this process 500 times in our diagnostic experiments. Such a Monte Carlo simulation oriented independent choice of training and test data has an advantage over the previous k-fold cross-validation in evaluating the effectiveness of the proposed algorithm. This is because it reduces the dependence between training and test data by fully leveraging the two omics data sets, which have a large number of observations.
Table 5 compares the diagnostic results of DCASVM with those of SVM under four different kernels (‘linear’, ‘rbf’, ‘quad’ and ‘mlp’) for the two data sets. It is not a surprise that the SVM-mlp classifier encounters the underfitting bias for both LGG data sets, demonstrating quite low diagnostic accuracy values. Similarly, the SVM-rbf classifier still suffers from the overfitting bias by only recognizing the majority-count phenotypes. That is, its average diagnostic accuracy closely approximates the majority-count ratios of the GliomaRNASeq and GliomaMiRNASeq data sets, \(96.68\,\%\approx \frac {516}{516+18}\) and \(96.63\,\%\approx \frac {512}{512+18}\) respectively. For the same reason, its average positive prediction rate is just its diagnostic accuracy, because the SVM-rbf classifier diagnoses all samples as positive samples. In turn, the corresponding negative prediction ratio \(NPR=\frac {TN}{TN+FN}\) is NaN because TN=FN=0 in each diagnostic case, and the sensitivity and specificity are 100 % and 0 % respectively.
As in the previous cases, the SVM-linear and SVM-quad classifiers both encounter the explicit label skewness bias because both data sets are imbalanced: the GliomaRNASeq data has 18 grade I and 516 grade II gliomas, and the GliomaMiRNASeq data has 18 grade I and 512 grade II gliomas.
The explicit label skewness bias demonstrates a deceptive diagnostic accuracy that is close to the majority-count ratio of each data set. For example, the SVM-linear classifier achieves an average accuracy of 95.87 % and 93.78 % for the two data sets respectively, both of which are close to the majority-count ratios 96.68 % and 96.63 %. However, both diagnostic results are characterized by imbalanced sensitivity and specificity, and by imbalanced positive and negative prediction rates. For example, the SVM-linear classifier achieves 98.77 % sensitivity and 12.10 % specificity.
Although its average negative prediction ratio (NPR) appears to be NaN, this exception is caused by the fact that both TN and FN are zero in some diagnostic trials, due to the majority-count phenotype favor mechanism. In fact, it is easy to estimate that its average NPR should be a small percentage, because the corresponding average PPR is 97.02 %, i.e. very few negative targets, or even none, are correctly diagnosed in each diagnosis. As such, the ‘high’ diagnostic accuracy does not mean that the classifiers have high detection capabilities. Instead, the ‘high’ diagnostic accuracy comes from the high majority-count ratio.
However, the proposed DCASVM algorithm successfully overcomes the diagnostic biases and achieves rivaling-clinical diagnostic accuracy with balanced sensitivity and specificity for the two data sets. In particular, we still employ the transform level J=7 and cutoff τ=2, and keep the first-PC-based detail coefficient matrix reconstruction in DCA for the sake of consistency.
Such a result is consistent with the previous results from gene/protein expression and RNASeq data with k-fold cross validation. For example, our DCASVM classifier achieves 99.52 % (sensitivity: 99.64 %, specificity: 97.00 %, NPR: 91.98 %, PPR: 99.87 %) and 99.63 % (sensitivity: 99.73 %, specificity: 97.52 %, NPR: 93.13 %, PPR: 99.89 %) average diagnostic accuracy for the GliomaRNASeq and GliomaMiRNASeq data respectively. Considering the different types of omics data and the different training and test data selections, such a result strongly suggests the effectiveness of our proposed method in conquering the diagnostic biases.
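The repeated random 50 %/50 % selection described above can be sketched as an index generator. This is an illustrative sketch only: the function name `monte_carlo_splits`, the fixed seed and the unstratified shuffling are assumptions, since the text does not state how the random selection was implemented or whether class proportions were preserved in each half.

```python
import random


def monte_carlo_splits(n_samples, n_trials=500, train_fraction=0.5, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random splits."""
    rng = random.Random(seed)            # fixed seed only for repeatability
    n_train = int(round(n_samples * train_fraction))
    indices = list(range(n_samples))
    for _ in range(n_trials):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


# 534 = 516 grade II + 18 grade I samples, the GliomaRNASeq counts
# reported in the text.
splits = list(monte_carlo_splits(534))
train_idx, test_idx = splits[0]

print(len(splits), len(train_idx), len(test_idx))
```

Each trial would then train a classifier on `train_idx`, evaluate it on `test_idx`, and the reported figures would be averages over the 500 trials. With only 18 minority samples, an unstratified split can leave very few of them in one half, which is one reason a stratified variant might be preferred in practice.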
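The behaviour of the measures for a classifier that labels every sample as positive can be checked directly from the confusion-matrix counts. The sketch below is a hypothetical illustration: the helper `diagnostic_metrics` is not part of the study's code, and the counts correspond to applying such a classifier to all 534 GliomaRNASeq samples at once rather than to the 50 % test halves used in the experiments.

```python
import math


def diagnostic_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, PPR and NPR from raw counts.

    A ratio with a zero denominator is returned as NaN, which is how an
    undefined NPR arises when TN = FN = 0.
    """
    def ratio(num, den):
        return num / den if den else math.nan

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "PPR": ratio(tp, tp + fp),
        "NPR": ratio(tn, tn + fn),
    }


# Every sample is called positive: 516 grade II (positive) samples become
# true positives and 18 grade I (negative) samples become false positives.
all_positive = diagnostic_metrics(tp=516, tn=0, fp=18, fn=0)

for name, value in all_positive.items():
    print(name, value)
```

The accuracy and PPR both equal the majority-count ratio 516/(516+18), the sensitivity is 1, the specificity is 0 and the NPR is undefined, which matches the pattern described for the SVM-rbf classifier.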
Diagnostic index
We create a diagnostic index \(\beta =-\log _{2}a-\log _{2}\frac {s+p}{2},\) where a, s, and p represent the accuracy, sensitivity and specificity, to evaluate whether a classifier is subject to any diagnostic bias. A small diagnostic index value (e.g., β=0.01) means that the classifier achieves a good accuracy with only a light degree of diagnostic bias. The smallest diagnostic index, β=0, corresponds to a perfect diagnosis: a=s=p=100 %. In contrast, a large β (e.g., 2.0) means that the classifier has a poor diagnostic accuracy or a high degree of diagnostic bias. Table 6 compares the diagnostic index values of the proposed DCASVM with those of the other classifiers. Its β values are the lowest among all diagnostic index values, which again validates the effectiveness of the proposed algorithm in conquering the label skewness bias and achieving rivaling-clinical diagnostic results.
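As a worked illustration of the diagnostic index, the sketch below evaluates β for figures quoted earlier in the text. It assumes that the index takes the form β = −log₂ a − log₂((s+p)/2), with a, s and p expressed as fractions in (0, 1], which yields β=0 for a perfect diagnosis. The function name `diagnostic_index` is hypothetical, and the computed values are not claimed to reproduce Table 6.

```python
import math


def diagnostic_index(accuracy, sensitivity, specificity):
    """beta = -log2(a) - log2((s + p) / 2) for fractions in (0, 1].

    Raises ValueError if accuracy is 0 or if s + p is 0.
    """
    balanced = (sensitivity + specificity) / 2.0
    return -math.log2(accuracy) - math.log2(balanced)


# Perfect diagnosis: a = s = p = 100 %.
perfect = diagnostic_index(1.0, 1.0, 1.0)

# Kidney data, figures quoted in the text for the original SVM
# (90.23 % accuracy, 96.84 % sensitivity, 44.07 % specificity)
# and for DCASVM (99.81 %, 99.79 %, 100 %).
svm_kidney = diagnostic_index(0.9023, 0.9684, 0.4407)
dcasvm_kidney = diagnostic_index(0.9981, 0.9979, 1.0)

print(perfect, svm_kidney, dcasvm_kidney)
```

The second term penalizes an imbalance between sensitivity and specificity even when the accuracy looks reasonable, which is exactly the situation of the implicit label skewness bias.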
Derivative component analysis based phenotype separation
The diagnostic results from the proposed DCASVM classifier indicate that the high-dimensional omics data in our experiment are linearly separable after derivative component analysis. In other words, support vectors can be found to separate the two groups of samples geometrically, according to the definition of linear separability [12]. This also suggests that disease biomarkers can be identified from the omics data to discriminate different phenotypes in such translational bioinformatics based disease diagnostics. As such, we present the following biomarker discovery method, which captures disease biomarkers, together with a visualization technique that shows the possible support vectors in phenotype separation, in order to further ‘prove’ and validate the effectiveness of our proposed algorithm.
Our biomarker discovery method assumes that the input data are normally distributed. If the input data are not normally distributed, we apply the transform Y=E(log(X+1))/var(log(X+1)) to convert them to approximately normally distributed data. Here log(X+1) is obtained by applying the log transform elementwise to X+1, i.e. to the input data X with 1 added to each entry. Similarly, E(log(X+1)) updates log(X+1) by adjusting each column with its corresponding mean, var(log(X+1)) is the matrix in which each column is a vector consisting of the variance of the corresponding column of log(X+1), and Y is obtained by the elementwise division of E(log(X+1)) by var(log(X+1)).
Then, derivative component analysis (DCA) is applied to the normally distributed omics data to retrieve its true signals, using the same parameter setting as in the previous experiments. Finally, the classic two-sample t-test is employed to identify the differentially expressed features (e.g. genes) with the smallest p-values from the extracted true signals as potential biomarkers. It is worthwhile to point out that a large number of tiny p-values will come from the t-test due to the denoising process in DCA. Although we can obtain a set of well-supported biomarkers from the statistical test applied to the true signals, we prefer to employ the top three biomarkers to conduct phenotype separation and the corresponding support vector finding, for convenience of visualization.
Figure 5 shows the corresponding phenotype separations for four data sets from different high-throughput technologies and platforms: GliomaRNASeq (LGG RNASeq), GliomaMiRNASeq (LGG MiRNASeq), Kidney (Kidney (KIRC) RNASeq), and HCC (HCC MALDI-TOF), using the top three biomarkers of each. Each yellow/red dot in the visualization represents a sample. For example, the 18 yellow dots represent the 18 grade I glioma samples in the NW plot for the LGG RNASeq data. The three biomarkers discovered from each data set demonstrate linear separability very well, and the corresponding support vectors can easily be found from each phenotype separation.
Such results strongly suggest the effectiveness of our proposed algorithm and provide visual support for DCASVM’s rivaling-clinical diagnostic performance. Furthermore, they provide more insight into the latent structures of the omics data, which can contribute to deciphering the different pathological substates of tumors. For example, the NE subfigure discloses that the 512 grade II tumors of the GliomaMiRNASeq data span three different clusters, which may indicate that grade II tumors have different pathological substates due to different genetic alterations [35]. It is also noted that such results apply to the BreastIBC data as well, though it is not included in Fig. 5.
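The transform described above can be sketched for a small samples-by-features table. This is a minimal illustration under stated assumptions: the function name `log_standardize` and the toy counts are hypothetical, the natural logarithm and the sample (n−1) variance are assumed because the text does not specify either, and a feature with zero variance would need separate handling.

```python
import math
import statistics


def log_standardize(X):
    """Column-wise (log(x + 1) - column mean) / column variance.

    X is a list of rows (samples); each column is one feature.
    """
    logged = [[math.log(v + 1.0) for v in row] for row in X]
    columns = list(zip(*logged))
    means = [statistics.fmean(col) for col in columns]
    variances = [statistics.variance(col) for col in columns]  # assumption
    return [[(v - m) / s2 for v, m, s2 in zip(row, means, variances)]
            for row in logged]


# Hypothetical count data: three samples, two features.
X = [[0, 10], [3, 100], [8, 1000]]
Y = log_standardize(X)
column_sums = [sum(col) for col in zip(*Y)]

print(Y)
print(column_sums)
```

Each column of the result is centred at zero, and the division by the column variance follows the formula as written in the text rather than the more common division by the standard deviation.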
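The final selection step can be sketched as a ranking of features by their two-sample t statistic. This is an illustrative sketch and not the code used in this study: it assumes the pooled-variance (Student) form of the t-test, ranks by |t| instead of computing p-values (for a fixed pair of group sizes the degrees of freedom are the same for every feature, so the two rankings agree), and uses hypothetical names and toy data standing in for the extracted true signals.

```python
import math
import statistics


def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    va = statistics.variance(group_a)
    vb = statistics.variance(group_b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    return diff / math.sqrt(pooled * (1.0 / na + 1.0 / nb))


def top_features(X, labels, k=3):
    """Indices of the k features with the largest |t|."""
    scores = []
    for j, column in enumerate(zip(*X)):
        pos = [v for v, c in zip(column, labels) if c == 1]
        neg = [v for v, c in zip(column, labels) if c == -1]
        scores.append((abs(pooled_t_statistic(pos, neg)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


# Toy stand-in for true signals: six samples, four features; the last
# feature carries no group difference.
X = [
    [5.1, 0.9, 3.0, 1.0],
    [4.9, 1.1, 3.2, 2.0],
    [5.0, 1.0, 2.9, 1.5],
    [1.0, 3.1, 1.0, 1.4],
    [1.2, 2.9, 1.1, 1.1],
    [0.9, 3.0, 0.8, 2.1],
]
labels = [1, 1, 1, -1, -1, -1]
selected = top_features(X, labels)

print(selected)
```

The three selected feature indices would then serve as the axes of a three-dimensional scatter plot of the samples, which is the kind of view shown in Fig. 5.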
Discussion
In this work, we comprehensively investigate diagnostic bias in translational bioinformatics by using support vector machines (SVM). It is worthwhile to point out that the overfitting bias and underfitting bias can be viewed as special diagnostic biases associated with kernel-based learning, though they still happen in diagnoses based on other classifiers. However, the label skewness bias can be widely found in other classifiers, because SVM classifiers with different kernels can be viewed as ‘simulations’ of different classifiers [12]. For example, an SVM-linear classifier can be viewed as a simulation of linear discriminant analysis (LDA), because the two usually have a similar or the same level of performance [37]. In fact, LDA does demonstrate the label skewness diagnostic bias on the BreastIBC data under the same cross validation, achieving 71.83 % accuracy with 94.17 % sensitivity and 15 % specificity.
We have also applied a multilayer perceptron (MLP) classifier to the five data sets to investigate the occurrence of diagnostic biases, given its comparable performance with respect to SVM and other classifiers such as decision trees [38, 39]. We still use the 5-fold cross validation for convenience of comparison. The MLP classifier has 10 neurons in its input layer, two hidden layers with 5 neurons each, and two neurons in its output layer. The Levenberg-Marquardt optimization is employed to train the network, in which the maximum number of epochs and the minimum performance gradient in training are set to 10^{3} and 10^{−9} respectively [40]. Interestingly, we find that it encounters different diagnostic biases on almost all data sets under the 5-fold cross validation, the exception being the Hepatocellular carcinoma (HCC) data, where it has an accuracy of 85.91 % with sensitivity 90.29 % and specificity 81.92 %. For example, it achieves 92.18 % accuracy (sensitivity: 95.40 %, specificity: 0.0 %) for the GliomaRNASeq data, and 96.07 % accuracy (sensitivity: 99.40 %, specificity: 1.08 %) for the GliomaMiRNASeq data. Obviously, it encounters an overfitting diagnosis by diagnosing all test samples as majority-count samples, with an approximately zero specificity. In addition, it demonstrates the explicit label skewness biases for the Kidney and BreastIBC data with low diagnostic accuracy: 79.73 % (sensitivity: 14.45 %, specificity: 89.09 %) and 65.78 % (sensitivity: 85.71 %, specificity: 13.33 %) respectively. All these results strongly demonstrate the generality of the diagnostic biases we have identified.
Unlike ad hoc approaches that conquer diagnostic bias by tuning parameters, the proposed DCASVM demonstrates rivaling-clinical level diagnostic results by overcoming both explicit and implicit label skewness biases. Although statistical test based feature selection can conquer diagnostic bias well for some data, it may not generalize to other data with different distributions. For example, the SVM-linear classifier can achieve an excellent diagnostic performance on the BreastIBC data, with an average diagnostic accuracy of 98.00 % (sensitivity: 100 %, specificity: 93.33 %) under the 5-fold cross validation, if we only pick the top-ranked 200 genes (features) from this data by using the Bayesian t-test [41]. However, if we apply the same feature selection approach to the Hepatocellular carcinoma (HCC) data, the classifier only attains a mediocre performance with an average diagnostic accuracy of 88.03 % (sensitivity: 84.76 %, specificity: 91.08 %), which is far from the more than 94 % diagnostic accuracy achieved by the same classifier without any feature selection. On the other hand, such a feature selection method, which assumes a normal distribution, cannot be applied directly to the RNASeq and MiRNASeq data, because these data are not normally distributed. Thus, such a feature filtering approach is not a good choice for overcoming diagnostic biases. In contrast, our derivative component analysis (DCA) is a generic feature extraction algorithm that has no special data distribution requirements but retrieves true signals from each omics data set by capturing essential data behaviors. As such, the proposed DCASVM diagnosis can be viewed as a generic solution for the diagnostic bias problem in translational bioinformatics.
Although we assume that training and testing samples are picked from a normalized population in our context, our method can still work well when the testing samples are not normalized, or are normalized with a different approach from the training ones. A renormalization process will then be required, but it can differ for different types of omics data. For example, the renormalization for microarray data is usually done by normalizing all the training and testing samples before retraining the classifier in diagnostics [42, 43]. This is mainly because microarray data generally has strong background signals that make comparisons of expression levels between genes within a single sample impossible [44, 45]. Because their data generation mechanism is fundamentally different from that of microarray data, RNASeq or MiRNASeq data allow the expression levels of different genes to be compared within a single sample [44]. As such, the renormalization for this type of data can be done by normalizing only each testing sample with the corresponding normalization method (e.g. DESeq normalization) before the proposed diagnosis [24, 46].
Conclusions
Our studies comprehensively investigate the diagnostic bias problem in translational bioinformatics by analyzing benchmark gene array, protein array, RNASeq and miRNASeq data. We identify three types of diagnostic biases in SVM diagnosis (overfitting bias, label skewness bias, and underfitting bias) and disclose the reasons for their occurrence through rigorous analysis. As we pointed out before, the diagnostic biases, which happen for almost all kernels and for data with different distributions, are caused by three major factors: kernel selection, the special signal amplification mechanism in high-throughput profiling, and the training data label distribution.
Interestingly, the overfitting bias and label skewness bias both demonstrate a majority-count phenotype favor mechanism in diagnosis, which means that only majority-count samples can be recognized in diagnosis. However, the former is rooted in the molecular signal amplification mechanism in high-throughput profiling, which leads to large or even huge pairwise distances in the training data. The latter is caused by the unbalanced label distributions in the training data.
Unlike the other diagnostic biases, the label skewness bias is hard to detect and conquer, especially the implicit label skewness bias, which usually demonstrates a quite normal or even good diagnostic accuracy but with imbalanced sensitivity and specificity. Our studies propose a DCASVM that not only conquers the bias but also achieves rivaling-clinical diagnostic results by leveraging the powerful feature extraction capabilities of derivative component analysis. Our work is significant in translational bioinformatics for identifying and solving an important problem, and it also has a positive impact on machine learning by adding new results to kernel-based learning for omics data.
In our future studies, we plan to investigate the label skewness bias in multi-class diagnostics, which can be more complicated and more widely applied in medical informatics than the current binary diagnostics [47]. Moreover, we are interested in investigating diagnostic biases in deep learning methods, given their importance in big omics data oriented diagnostics [48, 49], in addition to integrating different types of omics data sets to conduct differential expression analysis [50].
Availability of supporting data
All data sets used in this paper are publicly available from https://sites.google.com/site/tbdiagnosticbiases/.
References
Berger B, Peng J, Singh M. Computational solutions for omics data. Nat Rev Genet. 2013; 14(5):333–46.
Han H, Li XL, Ng SK, Ji Z. Multiresolutiontest for consistent phenotype discrimination and biomarker discovery in translational bioinformatics. J Bioinformatics Comput Biol. 2013; 11(06):1343010.
NepomucenoChamorro I, Azuaje F, Devaux Y, Nazarov PV, Muller A, AguilarRuiz JS, et al. Prognostic transcriptional association networks: a new supervised approach based on regression trees. Bioinformatics. 2011; 27(2):252–8.
NepomucenoChamorro I, AguilarRuiz JS, Riquelme JC. Inferring gene regression networks with model trees. BMC Bioinformatics. 2010; 11:517.
Shah NH, Tenenbaum JD. The coming age of datadriven medicine: translational bioinformatics’ next frontier. J Am Med Inform Assoc. 2012; 19:e2–e4.
Canuel V, Rance B, Avillach P, Degoulet P, Burgun A. Translational research platforms integrating clinical and omics data: a review of publicly available solutions. Brief Bioinform. 2015; 16(2):280–90.
Lai Y, Zhang F, Nayak TK, Modarres R, Lee NH, McCaffrey TA. Concordant integrative gene set enrichment analysis of multiple largescale twosample expression data sets. BMC Genomics. 2014; 15(Suppl 1):S6.
Chen R, Mias GI, LiPookThan J, Jiang L, Lam HY, Chen R, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell. 2012; 148(6):1293–307.
Chien S, Bashir R, Nerem RM, Pettigrew R. Engineering as a new frontier for translational medicine. Sci Transl Med. 2015; 7(281):281fs13.
Han H, Jiang X. Overcome support vector machine diagnosis overfitting. Cancer Inform. 2014; Sl:1145–158.
Han H, Li X. Multiresolution independent component analysis for highperformance tumor classification and biomarker discovery. BMC Bioinformatics. 2011; 12(S1):S7.
ShaweTaylor J, Cristianini N. Support Vector Machines and other kernelbased learning methods. New York NY: Cambridge University Press; 2000.
Hastie T, Tibshirani R, Friedman J. The Elements of statistical learning, Second edition. New York: Springer; 2008.
Blomquist TM, Crawford EL, Lovett JL, Yeo J, Stanoszek LM, Levin A, et al. Targeted RNAsequencing with competitive multiplexPCR amplicon libraries. PLoS ONE. 2013; 8(11):e79120.
Nagy ZB, Kelemen JZ, Fehér LZ, Zvara A, Juhász K, Pusás LG. Realtime polymerase chain reactionbased exponential sample amplification for microarray gene expression profiling. Anal Biochem. 2005; 337(1):76–83.
Han H. Derivative component analysis for mass spectral serum proteomic profiles. BMC Med Genomics. 2014; 7:S1.
Suykens JAK, Vandewalle J. Least squares support vector machine classifiers. Neural Process Lett. 1999; 9(3):293–300.
Van GT, Suykens JAK, Baesens B, Viaene S, Vanthienen J, Dedene G, et al. Benchmarking least squares support vector machine classifiers. Mach Learn. 2004; 54(1):5–32.
Bioinformatics Toolbox. http://www.mathworks.com/products/bioinfo/.
Ressom H, Varghese R, Drake S, Hortin G, AbdelHamid M, Loffredo C, et al. Peak selection from MALDITOF mass spectra using ant colony optimization. Bioinformatics. 2007; 23(5):619–26.
Boersma BJ, Reimers M, Yi M, Ludwig JA, Luke BT, Stephens RM, et al. A stromal gene signature associated with inflammatory breast cancer. Int J Cancer. 2008; 122(6):1324–32.
TCGA portal. https://tcgadata.nci.nih.gov/tcga/.
Irizarry R, Hobbs B, Collin F, BeazerBarclay Y, Antonellis K, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003; 4:249.
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, et al. A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2013; 14(6):671–83.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNAseq an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008; 18(9):1509–17.
The NCBI Gene Expression Omnibus (GEO). http://www.ncbi.nlm.nih.gov/geo/.
Haasdonk B. Feature space interpretation of svms with indefinite kernels. IEEE Trans Pattern Anal Mach Intell. 2005; 27(4):482–92.
Rallapalli G, Kemen EM, RobertSeilaniantz A, Segonzac C, Etherington G, Sohn KH, et al.EXPRSS: an Illumina based highthroughput expressionprofiling method to reveal transcriptional dynamics. BMC Genomics. 2014; 15:341.
Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A. RUSBoost: Improving classification performance when training data is skewed. In: 19th International Conference on Pattern Recognition (ICPR). Tampa, FL: IEEE: 2008. p. 1–4.
Sun Y, Wong AC, Kamel M. Classification of imbalanced data, a review. Int J Patt Recogn Artif Intell. 2009; 23:687.
Jolliffe I. Principal component analysis. New York: Springer; 2002.
Oh S, Lee MS, Zhang BT. Ensemble learning with active example selection for imbalanced biomedical data classification. IEEE/ACM Trans Comput Biol Bioinform. 2011; 8(2):316–25.
Han X. Nonnegative principal component analysis for cancer molecular pattern discovery. IEEE/ACM Trans Comput Biol Bioinform. 2010; 7(3):537–49.
Han X. Improving gene expression cancer molecular pattern discovery using nonnegative principal component analysis. Genome Informat. 2008; 21:200–11.
Zhang J, Wu G, Miller CP, Tatevossian RG, Dalton JD, Tang B, et al. Whole-genome sequencing identifies genetic alterations in pediatric low-grade gliomas. Nat Genet. 2013; 45(6):602–12.
Tam S, Tsao MS, McPherson JD. Optimization of miRNA-seq data preprocessing. Brief Bioinform. 2015:1–14. doi:10.1093/bib/bbv019.
McLachlan G. Discriminant Analysis and Statistical Pattern Recognition. Hoboken, NJ USA: Wiley Interscience; 2005.
Nazarov PV, Apanasovich VV, Lutkovski VM, Yatskou MM, Koehorst RBM, Hemminga MA. Artificial neural network modification of simulation-based fitting: application to a protein-lipid system. J Chem Inf Comput Sci. 2004; 44(2):568–74.
Huang J, Lu J, Ling CX. Comparing naive Bayes, decision trees, and SVM with AUC and accuracy. In: Third IEEE International Conference on Data Mining. Melbourne, Florida: IEEE: 2003. p. 553–6.
Jing X. Robust adaptive learning of feedforward neural networks via LMI optimizations. IEEE Trans Neural Netw. 2012; 31:33–45.
Fox RJ, Dimmic MW. A two-sample Bayesian t-test for microarray data. BMC Bioinformatics. 2006; 10(7):126.
McCall MN, Bolstad BM, Irizarry RA. Frozen robust multiarray analysis. Biostatistics. 2010; 11(2):242–53.
Han X. Inferring species phylogenies: a microarray approach. Comput Intell Bioinformatics Lecture Notes Comput Sci. 2006; 4115:485–93.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11:R25.
Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009; 10:57–63.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:R106.
Tapia E, Ornella L, Bulacio P, Angelone L. Multiclass classification of microarray data samples with a reduced number of genes. BMC Bioinformatics. 2011; 12:59.
Fakoor R, Ladhak F, Nazi A, Huber M. Using deep learning to enhance cancer diagnosis and classification. In: Proceedings of the ICML Workshop on the Role of Machine Learning in Transforming Healthcare. Atlanta, Georgia: JMLR: W&CP: 2013.
Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2015; 31(5):761–3.
Lai Y, Eckenrode SE, She JX. A statistical framework for integrating two microarray data sets in differential expression analysis. BMC Bioinformatics. 2009; 10(Suppl 1):S23.
Acknowledgements
This work was partially supported by the startup funding package provided to Han by Fordham University.
Additional information
Competing interests
The author declares that he has no competing interests.
Authors’ contributions
Han did all the work for this study.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Han, H. Diagnostic biases in translational bioinformatics. BMC Med Genomics 8, 46 (2015). https://doi.org/10.1186/s12920-015-0116-y
DOI: https://doi.org/10.1186/s12920-015-0116-y