Table 3 Applications to real datasets

From: Optimally splitting cases for training and testing high dimensional classifiers

| Dataset    | n   | Prevalence | %t  | Full dataset accuracy | Optimal vs. 2/3 rule | Optimal vs. 1/2 rule |
|------------|-----|------------|-----|-----------------------|----------------------|----------------------|
| Rosenwald  | 240 | 52%        | 63% | 0.96                  | 0.001                | 0.002                |
| Boer       | 152 | 53%        | 53% | 0.98                  | 0.004                | 2e-4                 |
| Golub      | 72  | 65%        | 56% | 0.95                  | 0.002                | 0.004                |
| Sun        | 131 | 62%        | 31% | 0.83                  | 0.022                | 0.008                |
| van't Veer | 117 | 67%        | 26% | 0.78                  | 0.004                | 0.001                |
  1. Results are from the nonparametric bootstrap with smooth-spline (or isotonic-regression) learning-curve method [Additional file 1]. n is the total number of samples from the two classes, and "Prevalence" is the prevalence of the majority class. %t is the percentage of samples allocated to the training set under the optimal allocation, t/n · 100%. "Full dataset accuracy" is the estimated mean accuracy on the full dataset of size n. "Optimal vs. 2/3 rule" is the difference between the root mean squared error for the optimal training-set allocation and that for the "2/3 to training set" allocation rule; the rightmost column is the corresponding difference for the "1/2 to training set" rule. The two classes for each dataset are: germinal center B-cell-like lymphoma versus other (Rosenwald et al., 2002); renal clear cell carcinoma primary tumor versus control normal kidney tissue (Boer et al., 2001); acute myelogenous leukemia versus acute lymphoblastic leukemia (Golub et al., 1999); glioblastoma versus oligodendroglioma (Sun et al., 2006); and grade 1/2 versus grade 3 lung cancer (van't Veer et al., 2002).
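The optimal allocations in the table come from trading off two sources of error: training on fewer than n cases biases the classifier's accuracy downward along the learning curve, while leaving fewer test cases inflates the variance of the accuracy estimate. The following is a minimal, self-contained sketch of that trade-off. All numbers in it are made up for illustration, and it substitutes a simple inverse-power-law learning curve fitted by grid search for the paper's smooth-spline / isotonic-regression fit; it is not the authors' implementation.

```python
# Hypothetical bootstrap estimates of mean classifier accuracy at a grid of
# training-set sizes (made-up numbers; the paper's method would obtain these
# by repeated resampling of the real data).
sizes = [20, 40, 60, 80, 100, 120, 140, 160]
accs = [0.70, 0.80, 0.86, 0.90, 0.92, 0.93, 0.94, 0.945]
n = 240  # total sample size, chosen to echo the Rosenwald row

def model(t, a, b, c):
    """Inverse-power-law learning curve: accuracy rises toward asymptote a."""
    return a - b * t ** (-c)

# Crude grid-search least-squares fit of (a, b, c); a smoothing spline or
# isotonic regression plays this role in the paper's method.
best = None
for a in (x / 1000 for x in range(946, 1000, 2)):
    for b in (x / 10 for x in range(5, 45)):
        for c in (x / 100 for x in range(20, 125, 5)):
            err = sum((model(t, a, b, c) - y) ** 2 for t, y in zip(sizes, accs))
            if best is None or err < best[0]:
                best = (err, a, b, c)
_, a, b, c = best

def mse(t):
    """Approximate MSE of the accuracy estimate for training-set size t:
    squared shortfall from training on t < n cases, plus the binomial
    variance of an accuracy estimate based on the n - t test cases."""
    a_t = model(t, a, b, c)
    a_n = model(n, a, b, c)  # extrapolated full-dataset accuracy
    return (a_n - a_t) ** 2 + a_t * (1.0 - a_t) / (n - t)

t_opt = min(range(30, n - 20), key=mse)
print(t_opt, f"{100 * t_opt / n:.0f}% to training")
```

With these made-up accuracies the minimizer lands well above a 50/50 split, illustrating why the optimal %t in the table varies by dataset: flatter learning curves push the optimum toward smaller training fractions, steeper ones toward larger.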