Table 1 Overview Affymetrix compendium

From: Breast cancer subtype predictors revisited: from consensus to concordance?

| ID | Dataset | Nr. of samples | Rejected (QC) | Passed (QC) | Chip | Source | Reference |
|---|---|---|---|---|---|---|---|
| D1 | Richardson (I) | 47 | 5 | 42 | hgu133plus2 | GSE3744 | [39] |
| D2 | Li | 115 | 6 | 109 | hgu133plus2 | GSE19615 | [40] |
| D3 | Lu | 127 | 3 | 124 | hgu133plus2 | GSE5460 | [41] |
| D4 | Bos | 204 | 16 | 188 | hgu133plus2 | GSE12276 | [42] |
| D5 | Dedeurwaerder | 90 | 7 | 83 | hgu133plus2 | GSE20711 | [43] |
| D6 | expO | 353 | 20 | 333 | hgu133plus2 | GSE2109 | [12] |
| D7 | Kao | 327 | 33 | 294 | hgu133plus2 | GSE20685 | [44] |
| D8 | Richardson (II) | 84 | 9 | 75 | hgu133plus2 | GSE18864 | [40] |
| D9 | Sabatier | 266 | 24 | 242 | hgu133plus2 | GSE21653 | [45] |
| D10 | Guedj | 537 | 36 | 501 | hgu133plus2 | E-MTAB-365 | [21] |
| D11 | Symmans (III) | 32 | 4 | 28 | hgu133plus2 | GSE17700 | [46] |
| D12 | Symmans (I) | 298 | 23 | 275 | hgu133a | GSE17705 | [46] |
| D13 | Symmans (II) | 32 | 3 | 29 | hgu133a | GSE17700 | [46] |
| D14 | Desmedt | 198 | 13 | 185 | hgu133a | GSE7390 | [47] |
| D15 | Farmer | 49 | 3 | 46 | hgu133a | GSE1561 | [48] |
| D16 | Schmidt | 200 | 18 | 182 | hgu133a | GSE11121 | [49] |
| D17 | VDX | 344 | 29 | 315 | hgu133a | GSE2034, GSE5327 | [36, 37] |
| D18 | Miller | 251 | 18 | 233 | hgu133a | GSE3494 | [50] |
| D19 | Pawitan | 159 | 16 | 143 | hgu133a | GSE1456 | [51] |
| D20 | Shi | 278 | 19 | 259 | hgu133a | GSE20194 | [52, 53] |
| D21 | MSK | 99 | 8 | 91 | hgu133a | GSE2603 | [54, 55] |
| D22 | UNT | 137 | 6 | 131 | hgu133a | GSE2990 | [56, 57] |
| Total | | 4227 | 319 | 3908 | | | |

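The totals in the table can be cross-checked from the per-dataset counts. The following is a minimal sketch, not part of the original study: the row tuples are transcribed from the table above, and the names `ROWS` and `summarize` are illustrative only.

```python
# (dataset id, chip, total samples, samples rejected by QC), transcribed from Table 1.
ROWS = [
    ("D1", "hgu133plus2", 47, 5), ("D2", "hgu133plus2", 115, 6),
    ("D3", "hgu133plus2", 127, 3), ("D4", "hgu133plus2", 204, 16),
    ("D5", "hgu133plus2", 90, 7), ("D6", "hgu133plus2", 353, 20),
    ("D7", "hgu133plus2", 327, 33), ("D8", "hgu133plus2", 84, 9),
    ("D9", "hgu133plus2", 266, 24), ("D10", "hgu133plus2", 537, 36),
    ("D11", "hgu133plus2", 32, 4), ("D12", "hgu133a", 298, 23),
    ("D13", "hgu133a", 32, 3), ("D14", "hgu133a", 198, 13),
    ("D15", "hgu133a", 49, 3), ("D16", "hgu133a", 200, 18),
    ("D17", "hgu133a", 344, 29), ("D18", "hgu133a", 251, 18),
    ("D19", "hgu133a", 159, 16), ("D20", "hgu133a", 278, 19),
    ("D21", "hgu133a", 99, 8), ("D22", "hgu133a", 137, 6),
]


def summarize(rows):
    """Aggregate sample counts overall and per array design."""
    total = sum(n for _, _, n, _ in rows)
    rejected = sum(r for _, _, _, r in rows)
    per_chip = {}
    for _, chip, n, _ in rows:
        per_chip[chip] = per_chip.get(chip, 0) + n
    return {
        "total": total,
        "rejected": rejected,
        "passed": total - rejected,
        "rejected_pct": round(100 * rejected / total, 2),
        "per_chip": per_chip,
    }


summary = summarize(ROWS)
print(summary)
```

The per-design subtotals and the overall rejection percentage computed this way can be compared against the Total row.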
The compendium consists of 22 datasets measured on a single platform, Affymetrix. Expression was measured on two distinct array designs: hgu133plus2 (top 11 datasets, 2,182 samples) and hgu133a (bottom 11 datasets, 2,045 samples). We only considered the 22,215 probesets that the two designs have in common, which are all non-control probesets present on the hgu133a platform. Shared probesets are based on an identical set of probes with identical probe sequences. Remaining heterogeneity between the datasets was further reduced using frozen RMA [17] normalization and robust scaling [12] (Methods). Furthermore, an extensive quality control (QC) analysis was performed to identify and remove hybridizations that consistently showed indications of poor quality (Methods; Additional file 1: Section 1.2). In total, 319 samples (7.55%) were rejected on this basis.

ID: short dataset identifier; Dataset: dataset name; Nr. of samples: total number of available samples; Rejected: number of samples removed based on QC; Passed: total number of samples remaining after QC; Chip: array design used, i.e. hgu133plus2 or hgu133a; Source: the accession number under which the raw intensity data can be found at GEO [34], except for dataset D10, which is available at ArrayExpress [35] (accession number E-MTAB-365); Reference: reference to the main study.

The 344-sample VDX dataset (D17) combines the expression data of the 286-sample dataset by Wang et al. [36] and the 58-sample ER- dataset by Yu et al. [37]. Finally, note that the Symmans datasets (D11-D13) are ER+ datasets. To prevent bias due to scaling a dataset with a highly skewed subtype distribution [26, 38], datasets D12 and D13 were first concatenated to the VDX dataset and scaled together with it as a single dataset, after which the VDX samples were removed. Similarly, dataset D11 was combined with the expO dataset during scaling. A similar strategy was followed by Haibe-Kains et al. [12].
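The reason for concatenating a skewed dataset to a mixed one before scaling can be illustrated on synthetic data. The sketch below is not the authors' implementation: the scaling formula (a per-probeset linear map sending the 2.5% and 97.5% percentiles to -1 and +1) is an assumed form of robust scaling, the function names `robust_scale` and `scale_with_reference` are hypothetical, and the intensities are simulated for a single ESR1-like probeset.

```python
import numpy as np


def robust_scale(x, lo=2.5, hi=97.5):
    """Per-column linear rescaling sending the lo/hi percentiles to -1/+1.

    Assumed form of robust scaling; the legend does not state the formula.
    """
    q_lo, q_hi = np.percentile(x, [lo, hi], axis=0)
    return (x - (q_hi + q_lo) / 2.0) / ((q_hi - q_lo) / 2.0)


def scale_with_reference(skewed, reference):
    """Concatenate, scale as a single dataset, then drop the reference rows."""
    combined = np.vstack([skewed, reference])
    return robust_scale(combined)[: skewed.shape[0]]


rng = np.random.default_rng(0)
# Simulated log2 intensities, shape (samples, probesets), one probeset.
er_pos = rng.normal(11.0, 0.5, size=(300, 1))
er_neg = rng.normal(7.0, 0.5, size=(100, 1))
reference = np.vstack([er_pos, er_neg])        # mixed ER+/ER-, stands in for VDX
skewed = rng.normal(11.0, 0.5, size=(300, 1))  # ER+ only, stands in for D12/D13

alone = robust_scale(skewed)
joint = scale_with_reference(skewed, reference)

print("median when scaled alone:", float(np.median(alone)))
print("median when scaled with reference:", float(np.median(joint)))
```

Scaled on its own, the ER+-only dataset is centered near zero, so its uniformly high expression of the ER-like probeset is lost. Scaled together with the mixed reference, the same samples remain clearly on the positive side, which is the effect the concatenation step is meant to preserve.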