ID | Dataset | Nr. of samples | Rejected (QC) | Passed (QC) | Chip | Source | Reference |
---|---|---|---|---|---|---|---|
D1 | Richardson (I) | 47 | 5 | 42 | hgu133plus2 | GSE3744 | [39] |
D2 | Li | 115 | 6 | 109 | hgu133plus2 | GSE19615 | [40] |
D3 | Lu | 127 | 3 | 124 | hgu133plus2 | GSE5460 | [41] |
D4 | Bos | 204 | 16 | 188 | hgu133plus2 | GSE12276 | [42] |
D5 | Dedeurwaerder | 90 | 7 | 83 | hgu133plus2 | GSE20711 | [43] |
D6 | expO | 353 | 20 | 333 | hgu133plus2 | GSE2109 | [12] |
D7 | Kao | 327 | 33 | 294 | hgu133plus2 | GSE20685 | [44] |
D8 | Richardson (II) | 84 | 9 | 75 | hgu133plus2 | GSE18864 | [40] |
D9 | Sabatier | 266 | 24 | 242 | hgu133plus2 | GSE21653 | [45] |
D10 | Guedj | 537 | 36 | 501 | hgu133plus2 | E-MTAB-365 | [21] |
D11 | Symmans (III) | 32 | 4 | 28 | hgu133plus2 | GSE17700 | [46] |
D12 | Symmans (I) | 298 | 23 | 275 | hgu133a | GSE17705 | [46] |
D13 | Symmans (II) | 32 | 3 | 29 | hgu133a | GSE17700 | [46] |
D14 | Desmedt | 198 | 13 | 185 | hgu133a | GSE7390 | [47] |
D15 | Farmer | 49 | 3 | 46 | hgu133a | GSE1561 | [48] |
D16 | Schmidt | 200 | 18 | 182 | hgu133a | GSE11121 | [49] |
D17 | VDX | 344 | 29 | 315 | hgu133a | GSE2034,GSE5327 | [36, 37] |
D18 | Miller | 251 | 18 | 233 | hgu133a | GSE3494 | [50] |
D19 | Pawitan | 159 | 16 | 143 | hgu133a | GSE1456 | [51] |
D20 | Shi | 278 | 19 | 259 | hgu133a | GSE20194 | [52, 53] |
D21 | MSK | 99 | 8 | 91 | hgu133a | GSE2603 | [54, 55] |
D22 | UNT | 137 | 6 | 131 | hgu133a | GSE2990 | [56, 57] |
Total |  | 4227 | 319 | 3908 |  |  |  |
The compendium consists of 22 datasets, all measured on a single platform (Affymetrix). Expression was measured on two distinct array designs: hgu133plus2 (top 11 datasets, 2,182 samples) and hgu133a (bottom 11 datasets, 2,045 samples). We only considered the 22,215 probesets that these designs have in common, which represent all non-control probesets present on the hgu133a platform. Shared probesets are based on an identical set of probes with identical probe sequences. Remaining heterogeneity in these datasets was further reduced using frozen RMA [17] normalization and robust scaling [12] (Methods). Furthermore, an extensive quality control (QC) analysis was performed to identify and remove hybridizations that consistently showed indications of poor quality (Methods; Additional file 1: Section 1.2). In total, 319 samples (7.55%) were rejected on this basis.

ID: short dataset identifier; Dataset: dataset name; Nr. of samples: total number of available samples; Rejected: number of samples removed based on QC; Passed: total number of samples remaining after QC; Chip: array design used, i.e. hgu133plus2 or hgu133a; Source: the accession number under which the raw intensity data can be found at GEO [34], except for dataset D10, which is available at ArrayExpress [35] (accession number E-MTAB-365); Reference: reference to the main study.

The 344-sample VDX dataset (D17) combines the expression data of the 286-sample dataset by Wang et al. [36] and the 58-sample ER- dataset by Yu et al. [37]. Finally, note that the Symmans datasets (D11-D13) are ER+ datasets. To prevent bias due to scaling of a dataset with a highly skewed subtype distribution [26, 38], datasets D12 and D13 were first concatenated to the VDX dataset and subsequently scaled as a single dataset, after which the VDX samples were removed. Similarly, dataset D11 was combined with the expO dataset during scaling. A similar strategy was followed by Haibe-Kains et al. [12].
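The scaling strategy for the skewed datasets (concatenate with a reference dataset, scale jointly, then drop the reference samples) can be illustrated with a minimal Python sketch. The exact form of robust scaling is given in Methods and in [12], not here, so the quantile-based rescaling below is an assumption: each probeset is mapped so that its 2.5% and 97.5% quantiles fall at -1 and +1. The function names, the `q` parameter, and the simulated data are hypothetical and serve illustration only.

```python
import numpy as np

def robust_scale_params(x, q=0.05):
    """Per-probeset lower and upper quantiles (ASSUMED form of robust scaling)."""
    lo = np.quantile(x, q / 2, axis=0)
    hi = np.quantile(x, 1 - q / 2, axis=0)
    return lo, hi

def scale_with_reference(skewed, reference, q=0.05):
    """Scale a subtype-skewed dataset jointly with a reference dataset.

    Rows are samples, columns are probesets. The quantiles are estimated on
    the concatenated data, and only the rows of `skewed` are returned, i.e.
    the reference samples are removed again after scaling.
    """
    combined = np.vstack([skewed, reference])
    lo, hi = robust_scale_params(combined, q)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant probesets
    scaled = 2.0 * (combined - lo) / span - 1.0
    return scaled[: skewed.shape[0]]

# Simulated stand-ins: 32 skewed samples (as in D13) and 344 reference
# samples (as in VDX), with 5 probesets each. Values are random, not real data.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(344, 5))
skewed = rng.normal(2.0, 0.5, size=(32, 5))
scaled_skewed = scale_with_reference(skewed, reference)
```

Scaling the skewed dataset on its own would centre its high expression values around zero and erase the very signal that marks the subtype. Estimating the quantiles on the combined data keeps those values high relative to a mixed-subtype population.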