- Research
- Open Access
Gene masking - a technique to improve accuracy for cancer classification with high dimensionality in microarray data
https://doi.org/10.1186/s12920-016-0233-2
© The Author(s) 2016
- Published: 5 December 2016
Abstract
Background
High dimensional feature space generally degrades classification in several applications. In this paper, we propose a strategy called gene masking, in which non-contributing dimensions are heuristically removed from the data to improve classification accuracy.
Methods
Gene masking is implemented via a binary encoded genetic algorithm that can be integrated seamlessly with classifiers during the training phase of classification to perform feature selection. It can also be used to discriminate between features that contribute most to the classification, thereby, allowing researchers to isolate features that may have special significance.
Results
This technique was applied on publicly available datasets whereby it substantially reduced the number of features used for classification while maintaining high accuracies.
Conclusion
The proposed technique can be extremely useful in feature selection as it heuristically removes non-contributing features to improve the performance of classifiers.
Keywords
- Genetic Algorithm
- Feature Selection
- Classification Accuracy
- High Classification Accuracy
- Genetic Algorithm Operator
Background
Traditionally, clinical methods such as ultrasonography, X-ray, Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) are employed to detect cancers [1]. However, many cancers cannot be distinguished easily using these traditional approaches. An alternative approach to improve detection is to analyze microarray gene profiles. In microarray gene profiles, mRNA samples are used to measure the expression levels of genes, which can number in the thousands. This in turn makes the detection and classification of cancers difficult due to the high dimensionality of the data [2]; therefore, there is a need for computational methods to help improve the classification of cancers using microarray gene profiles.
Generally, computational methods are used to remove non-contributing and noisy dimensions from data while simultaneously trying to maintain a high classification rate [3]. Additionally, class imbalance is an important consideration in the classification of biomedical data, and there are techniques [4] that incorporate the class distribution within the classification algorithm. Our approach is different in that we separate classification from data preprocessing, where we assume class imbalance has already been handled.
Feature selection and extraction is a well researched topic in biomedical fields, especially in the areas concerning microarray data [5–7]. Several methods have been discussed relating to feature selection for microarray data [6, 8–17] and they can be broadly categorized into two groups, filter based methods and wrapper based methods. In filter based methods, genes are selected prior to training the classification model whereas wrapper based methods involve gene selection within the classification process [5, 18, 19].
The importance of selecting features from gene subsets or groups has recently become a popular topic in microarray research [7, 20]. For instance, the top-r feature selection proposed by Sharma et al. [20] provides very good results based on a small subset of genes; however, it has a few drawbacks. Firstly, it is quite computationally expensive, requiring between (h+1)C2 × (d/h) and (2^h − 1) × (d/h) search combinations, where h is the block size and d is the total number of dimensions [20]. Additionally, initial parameter selection is crucial and greatly affects the final results: top-r is sensitive to the choice of block size and the number of resulting blocks, and selecting an ideal value of h can be tricky [20]. Lastly, it should be noted that top-r does not fully consider the interaction among features, only the interaction amongst the top-r features from each block [5].
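To make the search-cost bounds above concrete, the short sketch below evaluates both expressions for an illustrative (hypothetical) choice of h = 4 blocks over d = 2308 genes; the values of h and the resulting counts are for illustration only, not taken from the paper.

```python
from math import comb

# Illustrative cost of the top-r block search: d genes split into
# blocks of size h (hypothetical values, chosen so d/h is an integer).
d, h = 2308, 4
n_blocks = d // h                      # d/h blocks

lower = comb(h + 1, 2) * n_blocks      # (h+1)C2 x (d/h) combinations
upper = (2**h - 1) * n_blocks          # (2^h - 1) x (d/h) combinations

print(lower, upper)
```

Even for this small block size the search already spans thousands of combinations, which illustrates why the cost grows quickly with h.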
In this paper, we consider the classification of the small round blue cell tumor (SRBCT) [21] dataset, which has been categorized into 4 types of cancers and has 2308 gene expressions. Khan et al. [1], Tibshirani et al. [21] and Kumar et al. [22] have previously worked on this dataset, all reporting 100% classification accuracy with 96, 43 and 13 genes respectively. While Khan et al. [1] and Tibshirani et al. [21] use the full dataset of 2308 genes for their analyses, Kumar et al. [22] begin from a reduced set of 96 genes (taken from the findings of Khan et al. [1]), because their approach is too computationally complex to use all 2308 genes. Our motivation in this paper is to build upon the approach of Kumar et al. [22] with a new method that does not suffer from similar limitations: a wrapper based method that commences with the entire feature set from the microarray data, without any prior feature selection, and achieves high classification accuracy with as few features as possible.
Furthermore, we validate our approach using the mixed-lineage leukemia (MLL) [23] and lung cancer (LC) [24] datasets. The MLL dataset comprises 3 classes, with each sample containing 12,582 gene expressions, while the LC dataset contains 2 cancer types, with each sample comprising 12,533 gene expressions. We applied gene masking with the nearest shrunken centroid classifier to significantly reduce the number of dimensions in these datasets while maintaining 100% classification accuracy.
Methods
Gene masking is derived from the genetic algorithm, which is used to search for an optimal gene mask: one that provides the greatest performance gain while removing the largest number of features for the selected classification algorithm. For this study, the Nearest Centroid and Nearest Shrunken Centroid classifiers were used for classification.
Genetic algorithm
The genetic algorithm (GA) is a heuristic search algorithm inspired by Darwin’s theory of natural selection. It was first introduced by Holland and simulates the natural processes of evolution, namely selection, crossover and mutation. GA is a competitive search algorithm in which the evolution of individuals is directed mainly by the principle of “survival of the fittest”. The fitness of an individual is determined by a fitness function, and individuals with higher fitness have a greater chance of contributing to the next generation than their less fit counterparts [25]. More details on the GA processes and functions are described in later sections.
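The loop described above can be sketched in a few lines. The following is a minimal, illustrative binary GA, not the authors' implementation: the fitness function here is the toy OneMax problem (count of 1-bits), standing in for any problem-specific fitness, and all parameter values are arbitrary.

```python
import random

def fitness(bits):
    # Toy fitness: number of 1-bits (OneMax). A real application would
    # substitute its own fitness function here.
    return sum(bits)

def roulette_select(pop, fits):
    # Roulette wheel selection: probability proportional to fitness.
    pick = random.uniform(0, sum(fits))
    acc = 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= pick:
            return ind
    return pop[-1]

def evolve(n_bits=20, pop_size=30, generations=50, cx_rate=0.85, mut_rate=0.1):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        fits = [fitness(ind) for ind in pop]
        new_pop = [max(pop, key=fitness)]           # elitism: carry the best over
        while len(new_pop) < pop_size:
            p1, p2 = roulette_select(pop, fits), roulette_select(pop, fits)
            child = list(p1)
            if random.random() < cx_rate:           # one-point binary crossover
                cut = random.randrange(1, n_bits)
                child = p1[:cut] + p2[cut:]
            for i in range(n_bits):                 # bit-flip mutation
                if random.random() < mut_rate:
                    child[i] ^= 1
            new_pop.append(child)
        pop = new_pop
        best = max(pop + [best], key=fitness)       # track best seen so far
    return best

random.seed(0)
best = evolve()
```

Elitism (keeping the fittest individual unchanged) guarantees the best solution found never regresses between generations.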
Nearest centroid classifier
Nearest Centroid Classifier (NCC) is a basic prototype classifier that builds a classification model from centroids, where each centroid is the mean of the samples of a particular class. A sample is assigned the label of the class whose centroid it is closest to [21].
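A minimal sketch of this classifier follows; it is illustrative only (a production version would live in a library such as scikit-learn), with the class names and the toy data invented for the example.

```python
import numpy as np

class NearestCentroid:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # one centroid per class: the mean of that class's samples
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # assign each sample to the class of its nearest centroid
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

# toy 2-feature, 2-class data
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
clf = NearestCentroid().fit(X, y)
preds = clf.predict(np.array([[0.1, 0.1], [4.8, 5.2]]))
print(preds)  # -> [0 1]
```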
Nearest shrunken centroid classifier
Nearest Shrunken Centroid Classifier (NSCC) [21], is a simple modification of NCC that uses “de-noised” versions of the centroids. Features that are noisy and have little variation from the overall mean are removed during shrinkage. The amount of shrinkage is determined by a constant Δ, where a larger value of Δ removes a larger number of features. Therefore, it can be stated that this classifier has an “in-built” feature selection mechanism.
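The shrinkage idea can be sketched as soft-thresholding each centroid's deviation from the overall mean by Δ. This is a simplified illustration: the full NSCC of Tibshirani et al. [21] additionally standardizes deviations by within-class variability before thresholding, which is omitted here.

```python
import numpy as np

def shrink_centroids(centroids, overall_mean, delta):
    # soft-threshold each centroid's deviation from the overall mean:
    # deviations smaller than delta collapse to zero (feature removed)
    dev = centroids - overall_mean
    shrunk = np.sign(dev) * np.maximum(np.abs(dev) - delta, 0.0)
    return overall_mean + shrunk

centroids = np.array([[1.0, 0.1, 3.0],
                      [2.0, 0.2, 1.0]])   # 2 classes, 3 features
overall = centroids.mean(axis=0)
shrunk = shrink_centroids(centroids, overall, delta=0.2)

# feature 2 shrinks to the overall mean in every class, so it no longer
# influences classification -- the "in-built" feature selection
surviving = np.any(shrunk != overall, axis=0)
print(surviving)  # -> [ True False  True]
```

A larger Δ collapses more features to the overall mean, matching the statement above that more shrinkage removes more features.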
Gene masking
Gene masking is a technique that incorporates evolutionary techniques to reduce the dimensionality of data within the training phase of the classification model. The basic premise of this technique is to heuristically remove non-contributing features from the data while training the classifier. A feature’s contribution is determined by its impact on classification accuracy: a feature is non-contributing if its removal or presence has minimal effect on that accuracy. By reducing the dimensionality of data, gene masking helps improve classifier performance and reduces the computational complexity of the problem. Moreover, it can be used as a feature isolation technique that allows for the identification of the features that contribute the most towards classification.
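Mechanically, a gene mask is a binary vector over the features: a 1 keeps a gene and a 0 drops it. A minimal sketch of applying such a mask to a (samples × genes) expression matrix:

```python
import numpy as np

def apply_mask(X, mask):
    # keep only the columns (genes) whose mask bit is 1
    mask = np.asarray(mask, dtype=bool)
    return X[:, mask]

X = np.arange(12).reshape(3, 4)   # toy data: 3 samples, 4 genes
mask = [1, 0, 1, 0]               # keep genes 0 and 2
Xm = apply_mask(X, mask)
print(Xm.shape)  # -> (3, 2)
```

The classifier is then trained on the masked matrix `Xm`, so the GA searches over masks rather than modifying the classifier itself.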
Overview
Illustration of gene masking on the original dataset to produce a masked dataset
In gene masking, the GA processes are unmodified and the algorithm goes through its basic set of genetic operations. For each generation, fitness is calculated for every mask in the population. These masks are then exposed to the three GA operators: selection, crossover and mutation. Finally, the best performing mask is chosen once the generation limit of the GA is reached.
Flowchart depicting the relation of Genetic Algorithm and Classifier in gene masking where the best chromosome represents the best gene mask discovered
Process details
Illustration of fitness evaluation with gene masking. Cross validation is performed using a classifier and the average accuracy is used for fitness calculation
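The fitness evaluation pictured above can be sketched as follows: mask the training data, run k-fold cross validation with the chosen classifier, and use the mean fold accuracy as the mask's fitness. This is an illustrative sketch, not the authors' exact implementation; `classifier` is assumed to be any object with `fit`/`predict` methods, and the 1-nearest-neighbour classifier and toy data are invented for the demonstration.

```python
import numpy as np

def mask_fitness(X, y, mask, classifier, k=5, seed=0):
    # fitness of one gene mask = mean accuracy over k CV folds
    Xm = X[:, np.asarray(mask, dtype=bool)]       # apply the mask
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        classifier.fit(Xm[train_idx], y[train_idx])
        accs.append(np.mean(classifier.predict(Xm[test_idx]) == y[test_idx]))
    return float(np.mean(accs))

class OneNN:
    # trivial 1-nearest-neighbour classifier for the demo
    def fit(self, X, y):
        self.X_, self.y_ = X, y
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None] - self.X_[None], axis=2)
        return self.y_[d.argmin(axis=1)]

rng = np.random.default_rng(1)
y = np.array([0] * 5 + [1] * 5)
X = rng.normal(size=(10, 4))
X[:, 0] = y * 5.0 + X[:, 0] * 0.1   # gene 0 carries the class signal
fit = mask_fitness(X, y, [1, 0, 0, 0], OneNN())
print(fit)  # -> 1.0 (classes are separable on the one kept gene)
```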
Upon fitness evaluation, the GA goes through its orthodox set of operators: selection, crossover and mutation. Selection is performed using roulette wheel selection, which is biased towards individuals with higher fitness. Crossover is accomplished via a random one-point binary crossover that swaps genes, and mutation is performed by negating gene values at random locations. However, to preserve the highest performing chromosome between generations, elite selection is used to ensure that the mask with the highest fitness is passed to the next generation unmodified by the GA operators.
This process of performing fitness evaluations and applying genetic operators continues until the number of generations specified during the initial parameter configuration is reached. The best chromosome discovered during the evolution of the population is selected. This chromosome represents the gene mask that yielded the highest fitness value during training. The best evolved gene mask is subsequently used for masking the test dataset during the testing phase.
Experiment and discussion
Primarily, we had considered the SRBCT dataset for gene masking. The following sections provide details on the data, and the experiment and its results.
Dataset
Gene masking was applied to the dataset containing gene expression profiles, obtained using cDNA microarrays, of the small round blue cell tumors (SRBCT) of childhood, named for their similar appearance under routine histology. Each tumor belongs to one of four classes: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL) or the Ewing family of tumors (EWS). The dataset comprises 63 training samples and 25 test samples, each of which contains 2308 gene expressions from cDNA microarrays [1, 21]. Of the 25 test samples, 5 are not SRBCTs and were discarded for the purpose of this study, since corresponding non-SRBCT samples were not present in the training set. Classification by microarrays is a difficult task since the number of features (genes) is relatively large whereas the number of samples is relatively small, and it is also important to identify the genes that contribute most to classification [1, 21].
Results
GA, and consequently gene masking, is stochastic by nature. Multiple runs with the same parameter combination were therefore executed while tuning the GA parameters, to obtain a consolidated view of the performance of the gene masks produced by each combination.
Genetic algorithm parameters
Parameter | Value |
---|---|
GA type | Binary |
Population size | 105 |
Chromosome length | 2308 |
No. of generations | 50000 |
Selection function | Roulette wheel |
Crossover rate | 0.85 |
Mutation rate | 0.10 |
Elite conservation | Yes, num_elite=1 |
Parameter tuning and selection method used in this study
Parameter tuning and selection |
---|
Let S be the set of training samples |
Let CR be the crossover rate and MR be the mutation rate |
Let k be the number of cross validation folds, where k = 5 is fixed |
Let α be the Accuracy to Elimination Ratio |
Define the GA parameters apart from CR and MR as those highlighted in Table 1 |
Define α to belong to the set (0, 0.1, 0.2, …, 0.9, 1] |
Define CR to belong to the set (0.5, 0.55, 0.6, …, 0.95, 1] |
Define MR to belong to the set [0, 0.05, 0.1, …, 0.45, 0.5) |
For each combination of { α, CR, MR}: |
- Perform k-fold cross validation using the classifier and gene masking on the set of samples S |
- Report the results obtained by the best performing gene mask |
- Repeat for 10 iterations |
Select the best performing combination of { α, CR, MR} for testing and reporting |
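The tuning loop above scores masks using the Accuracy to Elimination Ratio α. One plausible form of that score, a weighted sum of cross-validation accuracy and the fraction of genes a mask removes, is sketched below. This exact formula is an assumption for illustration, not taken verbatim from the paper; it is consistent with the later observation that a lower α signifies a greater preference towards dimensionality reduction.

```python
def mask_score(accuracy, mask, alpha):
    # assumed fitness: alpha weights accuracy, (1 - alpha) weights the
    # fraction of genes eliminated by the mask
    eliminated = mask.count(0) / len(mask)
    return alpha * accuracy + (1.0 - alpha) * eliminated

# with a low alpha, a sparser mask can outrank a slightly more accurate
# but denser one
sparse = mask_score(0.95, [1, 0, 0, 0], alpha=0.3)   # 3 of 4 genes removed
dense = mask_score(1.00, [1, 1, 1, 0], alpha=0.3)    # 1 of 4 genes removed
print(sparse > dense)  # -> True
```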
During the initial phase of experiments, NCC was used with gene masking to evaluate performance on the SRBCT dataset. This approach yielded good results with 100% classification accuracy; however, there was only about a 28% reduction in genes (about 650 genes removed) from the original microarray data, which may be attributed to the fact that NCC is a very basic classifier. Additionally, with NCC, a lower value of α (signifying a greater preference towards dimensionality reduction) yielded better results, with α=0.3 giving 100% training and test accuracies.
Gene masking and NSCC performance on SRBCT test set with different values for Δ with α = 0.9
Δ | Genes left after shrinkage | Genes left after masking | Test accuracy |
---|---|---|---|
3 | 343 | 36 | 0.9 |
3.5 | 280 | 23 | 0.95 |
4 | 235 | 21 | 0.95 |
4.5 | 208 | 15 | 0.9 |
5 | 174 | 14 | 0.95 |
5.5 | 158 | 14 | 0.95 |
6 | 135 | 12 | 0.95 |
6.5 | 124 | 15 | 1 |
7 | 112 | 16 | 1 |
7.5 | 102 | 13 | 1 |
8 | 90 | 17 | 1 |
8.5 | 80 | 20 | 1 |
9 | 72 | 19 | 1 |
9.5 | 65 | 18 | 0.95 |
10 | 61 | 14 | 0.8 |
10.5 | 54 | 15 | 0.75 |
11 | 48 | 12 | 0.75 |
11.5 | 42 | 13 | 0.8 |
12 | 41 | 10 | 0.8 |
Comparison of performance of NCC and NSCC with gene masking
NCC | NSCC | |
---|---|---|
Number of genes remaining | 1637 | 13 |
Training accuracy | 100% | 100% |
Test accuracy | 100% | 100% |
Comparison of performance of similar techniques
Method (Classifier) | Number of genes | Accuracy |
---|---|---|
PCA, MLP, Neural Network [1] | 96 | 100% |
Nearest Shrunken Centroid [21] | 43 | 100% |
Information gain + SVM [26] | 150 | 95% |
Twoing rule + SVM [26] | 150 | 95% |
Sum minority + SVM [26] | 150 | 95% |
Max minority + SVM [26] | 150 | 91% |
Gini index + SVM [26] | 150 | 95% |
Sum of variances + SVM [26] | 150 | 95% |
t-statistics + SVM [26] | 150 | 95% |
One-dimensional SVM + SVM [26] | 150 | 95% |
Information gain + LDA with NCC [20] | 4 | 70% |
Chi-squared + NNC [20] | 4 | 70% |
Gain Ratio + NNC [20] | 4 | 85% |
Gene masking + ANN [22] | 13 | 100% |
Gene masking + NCC (this paper) | 650 | 100% |
Gene masking + NSCC (this paper) | 13 | 100% |
The 13 genes selected via gene masking with their relative occurrence in other solutions
Image ID | Name | Percentage occurrence | In [21] | In [1] | In [22] |
---|---|---|---|---|---|
39093 | methionine aminopeptidase; eIF-2-associated p67 | 42.86% | No | Yes | No |
365826 | growth arrest-specific 1 | 100% | No | Yes | No |
1416782 | creatine kinase, brain | 100% | No | Yes | No |
461425 | myosin MYL4 | 71.43% | Yes | Yes | No |
810057 | cold shock domain protein A | 100% | Yes | No | No |
866702 | protein tyrosine phosphatase, non-receptor type 13 (APO-1/CD95 (Fas)-associated phosphatase) | 57.14% | Yes | Yes | Yes |
854899 | dual specificity phosphatase 6 | 28.57% | No | Yes | No |
629896 | microtubule-associated protein 1B | 71.43% | No | Yes | Yes |
214572 | ESTs | 100% | No | No | No |
208718 | annexin A1 | 100% | No | Yes | No |
784224 | fibroblast growth factor receptor | 100% | Yes | Yes | No |
204545 | ESTs | 57.14% | Yes | Yes | No |
295985 | ESTs | 100% | Yes | Yes | No |
Furthermore, due to the stochastic nature of gene masking, the gene masks that produce 100% accuracy do not tend to select the same combination of genes. Therefore, we have also identified and reported (in Table 6) the relative occurrence of these genes across the iterations in which solutions with 100% accuracy and 15 genes or fewer were observed.
Discussion
Gene masking can be very useful in feature selection and it can isolate features that lead to high classification accuracy. As per the results on the SRBCT dataset, it can be seen that gene masking can be used to identify features which have significant contribution towards classification.
However, in order to further investigate the proposed technique, gene masking in conjunction with NSCC was used to classify even larger datasets (in terms of the number of genes in the gene expression data): the mixed-lineage leukemia (MLL) [23] and lung cancer (LC) [24] datasets. The MLL dataset comprises 12,582 gene expressions per sample; it consists of 57 training samples and 15 test samples, each of which belongs to one of three cancer types: ALL, MLL or AML [23]. The LC dataset contains tissue samples of two cancer types, MPM and ADCA, consisting of 32 training samples and 149 test samples, with each sample comprising 12,533 gene expressions [24].
A summary of performance of gene masking with NSCC on MLL Leukemia and Lung Cancer datasets
Dataset | Genes remaining | Test accuracy |
---|---|---|
MLL Leukemia | 94 | 100% |
Lung Cancer | 90 | 100% |
It should be noted that gene masking is derived entirely from a basic binary GA. As with most evolutionary global optimization algorithms, the risk of getting stuck in local optima grows when the search space is extremely large, and a corresponding degradation in performance can be observed when searching for global optima in such a domain. Gene masking currently suffers from this limitation, as highlighted by the results summarized in Table 7 for the MLL and LC datasets.
Even with NSCC as the classifier, which provides an “in-built” feature selection procedure, the performance of gene masking was not as good as on the SRBCT dataset if dimensionality reduction is taken as the measure of performance. Increasing the amount of shrinkage in NSCC discards a great deal of information solely on the basis of the magnitude of variation from the overall mean, without considering feature interdependencies. Therefore, with NSCC, the MLL and LC datasets could only be shrunk to about 400 genes each prior to initializing gene masking. From there, gene masking was able to further reduce the number of genes required to maintain 100% accuracy to about 90 for both datasets.
Conclusion
Gene masking can be very useful in feature selection as it can isolate features that lead to high classification accuracy. It does so by considering the impact of features on classification and heuristically removes non-contributing features. In this paper, we have demonstrated its viability by achieving 100% accuracy while significantly reducing the number of genes required on SRBCT, MLL and LC datasets containing microarray gene expressions for cancers.
Declarations
Funding
Publication of this article was funded by CREST, JST, Japan.
This article has been published as part of BMC Medical Genomics Volume 9 Supplement 3, 2016. 15th International Conference On Bioinformatics (INCOB 2016): medical genomics. The full contents of the supplement are available online https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-9-supplement-3.
Availability of data and materials
Only publicly available data has been used in this research and the cancer profiles are for SRBCT, MLL and LC available on internet [1, 23, 24].
Authors’ contributions
SL designed the gene masking concept and programmed the genetic algorithm engine. HS led a team consisting of VVN, VWP and GS, implemented the gene masking concept in C++, and carried out all experiments. HS wrote the first draft of the paper. AS, TT and SL supervised the project and contributed to the preparation of the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
- Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001; 7(6):673–9.
- Sarhan AM. Cancer classification based on microarray gene expression data using DCT and ANN. J Theor Appl Inf Technol. 2009; 6(2):208–16.
- Ghodsi A. Dimensionality reduction: a short tutorial. Ontario: Department of Statistics and Actuarial Science, University of Waterloo; 2006.
- Blagus R, Lusa L. Improved shrunken centroid classifiers for high-dimensional class-imbalanced data. BMC Bioinforma. 2013; 14(1):64. doi:10.1186/1471-2105-14-64.
- Ghalwash MF, Cao XH, Stojkovic I, Obradovic Z. Structured feature selection using coordinate descent optimization. BMC Bioinforma. 2016; 17(1):1–14. doi:10.1186/s12859-016-0954-4.
- Marczyk M, Jaksik R, Polanski A, Polanska J. Adaptive filtering of microarray gene expression data based on Gaussian mixture decomposition. BMC Bioinforma. 2013; 14(1):1–12. doi:10.1186/1471-2105-14-101.
- Holec M, Kléma J, Železný F, Tolar J. Comparative evaluation of set-level techniques in predictive classification of gene expression samples. BMC Bioinforma. 2012; 13(10):1–15. doi:10.1186/1471-2105-13-S10-S15.
- Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002; 46(1):389–422. doi:10.1023/A:1012487302797.
- Swift S, Tucker A, Vinciotti V, Martin N, Orengo C, Liu X, Kellam P. Consensus clustering and functional interpretation of gene-expression data. Genome Biol. 2004; 5(11):1–16. doi:10.1186/gb-2004-5-11-r94.
- Mamitsuka H. Selecting features in microarray classification using ROC curves. Pattern Recognit. 2006; 39(12):2393–404. doi:10.1016/j.patcog.2006.07.010.
- Zhou J, Lu Z, Sun J, Yuan L, Wang F, Ye J. FeaFiner: biomarker identification from medical data through feature generalization and selection. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’13). New York: ACM; 2013. p. 1034–42. doi:10.1145/2487575.2487671.
- Sharma A, Paliwal KK. Cancer classification by gradient LDA technique using microarray gene expression data. Data Knowl Eng. 2008; 66(2):338–47.
- Brunet JP, Tamayo P, Golub TR, Mesirov JP. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci. 2004; 101(12):4164–9. doi:10.1073/pnas.0308531101.
- Sharma A, Paliwal KK. A gene selection algorithm using Bayesian classification approach. Am J Appl Sci. 2012; 9(1):127–31.
- Mitra S, Ghosh S. Feature selection and clustering of gene expression profiles using biological knowledge. IEEE Trans Syst Man Cybern Part C Appl Rev. 2012; 42(6):1590–9. doi:10.1109/TSMCC.2012.2209416.
- Sharma A, Imoto S, Miyano S. A filter based feature selection algorithm using null space of covariance matrix for DNA microarray gene expression data. Curr Bioinforma. 2012; 7(3):289–94.
- Sharma A, Paliwal KK, Imoto S, Miyano S. A feature selection method using improved regularized linear discriminant analysis. Mach Vis Appl. 2014; 25(3):775–86.
- Inza I, Larrañaga P, Blanco R, Cerrolaza AJ. Filter versus wrapper gene selection approaches in DNA microarray domains. Artif Intell Med. 2004; 31(2):91–103. doi:10.1016/j.artmed.2004.01.007.
- Leung Y, Hung Y. A multiple-filter-multiple-wrapper approach to gene selection and microarray data classification. IEEE/ACM Trans Comput Biol Bioinforma. 2010; 7(1):108–17. doi:10.1109/TCBB.2008.46.
- Sharma A, Imoto S, Miyano S. A top-r feature selection algorithm for microarray gene expression data. IEEE/ACM Trans Comput Biol Bioinforma. 2012; 9(3):754–64.
- Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci. 2002; 99(10):6567–72.
- Kumar R, Chand K, Lal SP. Gene reduction for cancer classification using cascaded neural network with gene masking. In: Sokolova M, van Beek P, editors. Advances in Artificial Intelligence: 27th Canadian Conference on Artificial Intelligence, Canadian AI 2014, Montréal, QC, Canada, May 6–9, 2014. Proceedings. Cham: Springer; 2014. p. 301–6.
- Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet. 2001; 30(1):41–7.
- Gordon GJ, Jensen RV, Hsiao LL, Gullans SR, Blumenstock JE, Ramaswamy S, Richards WG, Sugarbaker DJ, Bueno R. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 2002; 62(17):4963–7.
- Goldberg DE, Holland JH. Genetic algorithms and machine learning. Mach Learn. 1988; 3(2):95–9.
- Li T, Zhang C, Ogihara M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004; 20(15):2429–37. doi:10.1093/bioinformatics/bth267.