Table 2 Empirically estimated effects and covariance

From: Optimally splitting cases for training and testing high dimensional classifiers

p     Bayes Acc.   n     Prev.   %t     Full data Accuracy   Opt. vs. t = 2/3   Opt. vs. t = 1/2
0.9   0.962        240   50%     58.3   0.961                0.001              0.002
0.6   0.861        240   50%     54.2   0.860                0.003              0.002
  1. Simulation results based on empirical estimates of the covariance matrix and effect sizes. Columns: p is the weight on the diagonal matrix; Bayes Acc. is the optimal accuracy possible; n is the total sample size; Prev. is the prevalence of the most prevalent group; %t is the optimal percentage of samples allocated to training; Full data Accuracy is the mean accuracy when n = 240; Opt. vs. t = 2/3 is the root mean squared difference (RMSD) between the optimal rule and the 2/3-to-training rule; and Opt. vs. t = 1/2 is the RMSD between the optimal rule and the 1/2-to-training rule. The sample covariance matrix S was calculated from [12]. Effect sizes were estimated by the Empirical Bayes method of [10], with effect sizes shrunk to 80% of the empirical size. We followed methods similar to those previously proposed ([16], [17], [18]) to obtain non-singular covariance matrix estimates, namely $\Sigma = p\,\mathrm{diag}(S) + (1 - p)\,S$, where diag(S) is the matrix with the diagonal elements of S on its diagonal and zeros elsewhere. Bayes accuracy is the optimal accuracy for a linear classifier in the population, which is $\Phi\left(\sqrt{\delta' \Sigma^{-1} \delta}\right)$ (e.g., [13]), where $\delta$ is a vector of half-distances between the class means. The number of informative genes was selected to achieve realistic Bayes (optimal) accuracies; all other gene effects were set to zero. Genes with the largest standardized fold changes were selected as informative.
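
The two calculations described in the note can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code. It assumes the shrinkage takes the form p * diag(S) + (1 - p) * S, which matches the description of p as the weight on the diagonal matrix, and that the Bayes accuracy of a linear classifier with equal class prevalences is Phi(sqrt(delta' Sigma^{-1} delta)), the standard result for two Gaussian classes with a common covariance. The function names and the toy 2 x 2 inputs are hypothetical and are not the values behind the table.

```python
from statistics import NormalDist


def shrink_to_diagonal(S, p):
    """Return p * diag(S) + (1 - p) * S (assumed form of the shrinkage)."""
    d = len(S)
    # Diagonal entries are unchanged; off-diagonal entries are scaled by (1 - p).
    return [
        [S[i][j] if i == j else (1.0 - p) * S[i][j] for j in range(d)]
        for i in range(d)
    ]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def bayes_accuracy(delta, sigma):
    """Phi(sqrt(delta' Sigma^{-1} delta)); assumes equal class prevalences."""
    w = solve(sigma, delta)  # w = Sigma^{-1} delta
    quad = sum(d_i * w_i for d_i, w_i in zip(delta, w))
    return NormalDist().cdf(quad ** 0.5)


# Toy inputs (hypothetical): a rank-deficient sample covariance and
# half-distances between the class means.
S = [[1.0, 1.0], [1.0, 1.0]]
delta = [0.5, 0.5]

for p in (0.9, 0.6):
    sigma = shrink_to_diagonal(S, p)
    print(p, round(bayes_accuracy(delta, sigma), 3))
```

The toy S is singular, so it cannot be inverted directly; any p > 0 pulls the off-diagonal entries toward zero and makes the estimate invertible, which is the purpose of the shrinkage step.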