With the availability of high-throughput genomics data, methods for cancer classification and prediction based on molecular information have been vigorously pursued in recent years. The objective of such studies is to find important molecular markers and/or build a classifier that, with the selected markers as independent variables, accurately classifies the diagnostic disease status of a sample from expression data. Popular methods for this problem include Prediction Analysis of Microarrays (PAM, [1]), Top Scoring Pair (TSP, [2]), k-Top Scoring Pair (k-TSP, [3]), and Support Vector Machine (SVM, [4]). There are also many other endeavors, such as individual gene ranking by evaluating the discriminating power between classes (see [5, 6] and the references therein), gene filtering through relevance and correlation analyses [7, 8], gene selection for classification based on the Bayes error [9], comparing the distributions of within-class correlations with between-class correlations via Kullback-Leibler divergence [10], recursive feature addition with Lagging Prediction Peephole Optimization to choose the final optimal marker set [11], SVM-based recursive feature elimination [12, 13], random forests [14], and random subspace search [15, 16], among others.

There are a few challenges associated with such studies. One of them is that the number of independent variables (markers) is typically much larger than the number of available samples, a situation often referred to as the curse of dimensionality. To identify possibly nonlinear effects of many variables and their interactions, it is often necessary to estimate a large number of model parameters. A direct consequence of the curse of dimensionality is that the total number of parameters that the data can estimate is restricted by the number of samples. When the total number of parameters greatly exceeds the number of samples, overfitting occurs: the classifier predicts the phenotype well for the learning data but shows poor classification accuracy on independent test samples. Unfortunately, finding the globally best marker set requires modeling with every possible combination of markers, which is computationally infeasible when thousands of markers are available. The globally best set is the one with the best discriminating power for the different disease categories; its markers may or may not be the primary biological and pathological driving factors underlying disease progression. Hence, an effective practice is to first reduce the dimensionality of the marker space.

The TSP and k-TSP classifiers are two simple algorithms that select gene pairs with top scores to build classifiers. They were shown to perform well for binary classification with gene expression data [2, 3]. The gene pairs are selected based on simple pairwise comparisons between two marker expression levels within the same sample. Specifically, for a pair of markers $i$ and $j$, let $p_{ij}(C_1)$ be the percentage of training samples in class 1 in which the expression of one marker is less than that of the other marker in the same sample, and let $p_{ij}(C_2)$ be similarly defined for class 2. The score for a gene pair is defined as the estimated difference between the two percentages, $p_{ij}(C_1) - p_{ij}(C_2)$. The gene pair that receives the highest score is selected as the marker set for the TSP classifier, and the top k gene pairs with the highest scores are used for the k-TSP classifier. Tan et al. [3] extended the two classifiers to multi-class classification through one-vs-others, one-vs-one, and hierarchical classification (HC) schemes. They reported that the HC schemes for TSP and k-TSP gave better performance than the other two schemes.
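
The pair-scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation of [2, 3]. The function names `tsp_scores` and `top_pairs`, the class labels 1 and 2, the convention "marker i is less than marker j", and the toy data are all assumptions made for the example. The score is taken in absolute value, following the worked example given later in the text, and ties are broken by index order, which is a simplification.

```python
import numpy as np

def tsp_scores(X, y):
    """Return S with S[i, j] = |p_ij(C1) - p_ij(C2)|.

    X: (n_samples, n_genes) expression matrix; y: class labels in {1, 2}
    (hypothetical encoding chosen for this sketch).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # less[s, i, j] is True when gene i is expressed below gene j in sample s
    less = X[:, :, None] < X[:, None, :]
    p1 = less[y == 1].mean(axis=0)  # fraction of class-1 samples with i < j
    p2 = less[y == 2].mean(axis=0)  # fraction of class-2 samples with i < j
    return np.abs(p1 - p2)

def top_pairs(X, y, k=1):
    """Return the k highest-scoring pairs (i, j, score) with i < j."""
    S = tsp_scores(X, y)
    iu, ju = np.triu_indices(S.shape[0], k=1)
    order = np.argsort(-S[iu, ju], kind="stable")[:k]
    return [(int(iu[o]), int(ju[o]), float(S[iu[o], ju[o]])) for o in order]

# Toy data: gene 0 < gene 1 in every class-1 sample, reversed in class 2.
X = np.array([
    [1.0, 2.0, 5.0],   # class 1
    [0.5, 3.0, 4.0],   # class 1
    [2.0, 1.0, 6.0],   # class 2
    [3.0, 0.2, 5.0],   # class 2
])
y = [1, 1, 2, 2]
best = top_pairs(X, y, k=1)
```

With `k=1` the sketch mimics the TSP selection of a single pair; larger `k` returns the top k pairs by score, without the additional selection rules that the published k-TSP procedure may apply.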

The TSP and k-TSP classifiers have both advantages and disadvantages. They are simple to implement, and the resulting classifiers are easy to interpret. They are also invariant to monotone transformations, as they depend only on the relative rankings of gene expression values within the same sample. The overfitting problem is largely avoided because only simple comparisons are used. In addition, they differ from most algorithms, in which comparisons are mostly between expression values from different samples. Comparing expression values within the same sample, as TSP and k-TSP do, helps to eliminate the influence of sampling variability due to different subjects.
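
The invariance property can be checked numerically. The sketch below uses hypothetical expression values and a helper named `pairwise_order` (an assumption of this example). It verifies that the within-sample ordering of markers, which is all that TSP-type scores depend on, is unchanged by a log transform or by a different positive scaling applied to each sample.

```python
import numpy as np

def pairwise_order(X):
    """order[s, i, j] is True when gene i < gene j within sample s."""
    return X[:, :, None] < X[:, None, :]

# Hypothetical positive expression values: 2 samples x 3 genes.
X = np.array([
    [5.0, 120.0, 33.0],
    [800.0, 2.5, 60.0],
])

# A strictly increasing transform applied to every value.
log_X = np.log2(X)

# A different positive scale factor per sample (e.g. subject-level effects).
scaled_X = X * np.array([[2.0], [0.5]])

same_after_log = np.array_equal(pairwise_order(X), pairwise_order(log_X))
same_after_scaling = np.array_equal(pairwise_order(X), pairwise_order(scaled_X))
```

Both flags evaluate to `True` for this data, because a strictly increasing transformation within a sample cannot change which of two markers is larger.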

A disadvantage is related to how the scores for gene pairs are defined. As the scores are calculated from percentages, the sample size information is not fully utilized in TSP and k-TSP. For example, suppose 4 out of 10 samples in class 1 and 6 out of 10 samples in class 2 satisfy the condition that marker 1 has a smaller expression value than marker 2. The score for the pair of markers 1 and 2 is 0.2, which is the absolute difference between the two percentages. In another case, suppose all the counts are multiplied by 10, i.e. 40 out of 100 samples in class 1 and 60 out of 100 samples in class 2 satisfy the condition. The score for the marker pair is then identical to that in the previous case, so the additional information carried by the larger sample size is completely ignored by the TSP and k-TSP classifiers.
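
The arithmetic of this example can be reproduced directly. In the sketch below, `tsp_score` computes the absolute difference of the two percentages from counts. `diff_standard_error` is an illustrative addition that assumes a simple binomial model; it is not the score proposed in this article, and is included only to show one way in which the two cases differ even though their TSP scores coincide.

```python
import math

def tsp_score(k1, n1, k2, n2):
    """Absolute difference between the two class percentages."""
    return abs(k1 / n1 - k2 / n2)

def diff_standard_error(k1, n1, k2, n2):
    """Standard error of p1 - p2 under an assumed binomial model
    (illustration only; not the authors' score)."""
    p1, p2 = k1 / n1, k2 / n2
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

small = (4, 10, 6, 10)      # 4/10 in class 1, 6/10 in class 2
large = (40, 100, 60, 100)  # same proportions, ten times the samples

score_small = tsp_score(*small)
score_large = tsp_score(*large)
se_small = diff_standard_error(*small)
se_large = diff_standard_error(*large)
```

The two scores are both 0.2 (up to floating-point rounding), while the standard error for the larger sample is smaller, which is the kind of information a percentage-only score discards.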

The multi-class classifiers HC-TSP and HC-k-TSP are the two versions that showed the best performance among all TSP family classifiers [3]. They are derived from a scheme that performs sequential binary classification. At the root node, the training samples are partitioned into two classes, the largest class and the composite class. The largest class, which contains the largest number of samples, is treated as a leaf node for final classification of the phenotype. The composite class is then further partitioned in the same way as at the root. This scheme is intended to balance the two classes during each binary partition. However, the markers selected at each binary partition with TSP or k-TSP are not necessarily the best marker set for separating all the classes, since they are selected for their ability to separate the largest class from the composite class at that node. In addition, the selection of markers at each partition has no mechanism to control the redundancy of the candidate marker set. For example, in prostate cancer LNCaP cells, the forkhead transcription factor FOXO3a, which is a downstream substrate of phosphatidylinositol 3-kinase (PI3K/Akt), is a positive regulator for the induction of androgen receptor (AR) gene expression. Blocking AR function with AR interfering RNA leads to dramatic LNCaP cell death. Hence, inhibition of the PI3K/Akt pathway may result in the activation of the FOXO3a transcription factor, which may then induce AR gene expression to protect LNCaP prostate cancer cells from apoptosis. PI3K/Akt and FOXO3a could both be selected by the marker selection algorithm of HC-TSP or HC-k-TSP, even though they are evidently highly correlated.
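
The sequence of binary partitions in the hierarchical scheme can be sketched as follows. This covers only the partitioning order described above, not the marker selection at each node. The function name `hc_partitions`, the tie-breaking rule (alphabetical by class name), and the toy labels are assumptions of this example.

```python
from collections import Counter

def hc_partitions(labels):
    """List the binary splits of the hierarchical scheme.

    Each step is (leaf_class, composite_classes): the largest remaining
    class is split off as a leaf, and the rest form the composite class.
    Ties in class size are broken alphabetically (an assumption).
    """
    counts = Counter(labels)
    # Sorting once by decreasing size gives the order in which classes
    # become leaves, since removing a class does not change the others.
    remaining = sorted(counts, key=lambda c: (-counts[c], str(c)))
    steps = []
    while len(remaining) > 1:
        leaf, composite = remaining[0], remaining[1:]
        steps.append((leaf, list(composite)))
        remaining = composite
    return steps

# Toy training labels: class C is largest, then A, then B.
labels = ["A"] * 5 + ["B"] * 3 + ["C"] * 8
steps = hc_partitions(labels)
```

For these labels the first split separates C from the composite class {A, B}, and the second separates A from B. At each such split a binary TSP or k-TSP classifier would be trained, which is why the selected markers reflect only that node's two-way contrast.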

In this article, we propose a new algorithm to overcome the above problems of TSP family classifiers. We introduce a new definition of the score for each marker set so that the sample size information is fully utilized. In addition, it is unrealistic to assume that the number of informative genes is always even, as in TSP family classifiers. We present a new algorithm that performs a sequential search and does not restrict the number of informative markers to be even. The binary class and multi-class cases are unified into a single framework. The algorithm was applied to 9 binary class and 10 multi-class cancer genomics datasets. The resulting TSG classifier achieved better leave-one-out cross-validation accuracy for binary classification than the TSP or k-TSP classifiers. For the multi-class problems, our TSG classifier gives comparable performance to, or outperforms by a large margin, the TSP family and other popular classifiers in independent test accuracy for several cancer datasets. Beyond high accuracy, our new algorithm also has the advantage of yielding a small set of informative markers while retaining all the advantages of the TSP family classifiers.