Breast cancer subtype predictors revisited: from consensus to concordance?

Herman M. J. Sontrop, Marcel J. T. Reinders, Perry D. Moerland

doi:10.1186/s12920-016-0185-6

BMC Medical Genomics

Table 1 Overview Affymetrix compendium

ID	Dataset	Nr. of samples	Rejected (QC)	Passed (QC)	Chip	Source	Reference
D1	Richardson (I)	47	5	42	hgu133plus2	GSE3744	[39]
D2	Li	115	6	109	hgu133plus2	GSE19615	[40]
D3	Lu	127	3	124	hgu133plus2	GSE5460	[41]
D4	Bos	204	16	188	hgu133plus2	GSE12276	[42]
D5	Dedeurwaerder	90	7	83	hgu133plus2	GSE20711	[43]
D6	expO	353	20	333	hgu133plus2	GSE2109	[12]
D7	Kao	327	33	294	hgu133plus2	GSE20685	[44]
D8	Richardson (II)	84	9	75	hgu133plus2	GSE18864	[40]
D9	Sabatier	266	24	242	hgu133plus2	GSE21653	[45]
D10	Guedj	537	36	501	hgu133plus2	E-MTAB-365	[21]
D11	Symmans (III)	32	4	28	hgu133plus2	GSE17700	[46]
D12	Symmans (I)	298	23	275	hgu133a	GSE17705	[46]
D13	Symmans (II)	32	3	29	hgu133a	GSE17700	[46]
D14	Desmedt	198	13	185	hgu133a	GSE7390	[47]
D15	Farmer	49	3	46	hgu133a	GSE1561	[48]
D16	Schmidt	200	18	182	hgu133a	GSE11121	[49]
D17	VDX	344	29	315	hgu133a	GSE2034,GSE5327	[36, 37]
D18	Miller	251	18	233	hgu133a	GSE3494	[50]
D19	Pawitan	159	16	143	hgu133a	GSE1456	[51]
D20	Shi	278	19	259	hgu133a	GSE20194	[52, 53]
D21	MSK	99	8	91	hgu133a	GSE2603	[54, 55]
D22	UNT	137	6	131	hgu133a	GSE2990	[56, 57]
Total		4227	319	3908
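
The totals in the table can be re-derived from the individual rows. The following is a minimal sketch of that check; the row values are copied from Table 1, while the variable names and the tuple layout are illustrative assumptions.

```python
# Rows copied from Table 1: (ID, chip, total samples, rejected by QC, passed QC).
rows = [
    ("D1", "hgu133plus2", 47, 5, 42), ("D2", "hgu133plus2", 115, 6, 109),
    ("D3", "hgu133plus2", 127, 3, 124), ("D4", "hgu133plus2", 204, 16, 188),
    ("D5", "hgu133plus2", 90, 7, 83), ("D6", "hgu133plus2", 353, 20, 333),
    ("D7", "hgu133plus2", 327, 33, 294), ("D8", "hgu133plus2", 84, 9, 75),
    ("D9", "hgu133plus2", 266, 24, 242), ("D10", "hgu133plus2", 537, 36, 501),
    ("D11", "hgu133plus2", 32, 4, 28), ("D12", "hgu133a", 298, 23, 275),
    ("D13", "hgu133a", 32, 3, 29), ("D14", "hgu133a", 198, 13, 185),
    ("D15", "hgu133a", 49, 3, 46), ("D16", "hgu133a", 200, 18, 182),
    ("D17", "hgu133a", 344, 29, 315), ("D18", "hgu133a", 251, 18, 233),
    ("D19", "hgu133a", 159, 16, 143), ("D20", "hgu133a", 278, 19, 259),
    ("D21", "hgu133a", 99, 8, 91), ("D22", "hgu133a", 137, 6, 131),
]

# Every row should satisfy: total = rejected + passed.
inconsistent = [r[0] for r in rows if r[2] != r[3] + r[4]]

total = sum(r[2] for r in rows)
rejected = sum(r[3] for r in rows)
passed = sum(r[4] for r in rows)

# Sample counts per array design.
per_chip = {}
for _, chip, n, _, _ in rows:
    per_chip[chip] = per_chip.get(chip, 0) + n

rejected_pct = round(100 * rejected / total, 2)
print(total, rejected, passed, rejected_pct, per_chip, inconsistent)
```

With the values as listed, the sums reproduce the Total row and the per-design sample counts quoted in the legend.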

The compendium comprises 22 datasets, all measured on a single platform (Affymetrix). Expression data were measured on two distinct array designs: hgu133plus2 (top 11 datasets, 2,182 samples) and hgu133a (bottom 11 datasets, 2,045 samples). We only considered the 22,215 probesets that the two designs have in common, which represent all non-control probesets present on the hgu133a platform. Shared probesets are based on an identical set of probes with identical probe sequences. Remaining heterogeneity in these datasets was further reduced using frozen RMA [17] normalization and robust scaling [12] (Methods). Furthermore, an extensive quality control (QC) analysis was performed to identify and remove hybridizations that consistently showed indications of poor quality (Methods; Additional file 1: Section 1.2). In total, 319 samples (7.55%) were rejected on this basis.

ID: short dataset identifier; Dataset: dataset name; Nr. of samples: total number of available samples; Rejected: number of samples removed based on QC; Passed: number of samples remaining after QC; Chip: array design used, i.e. hgu133plus2 or hgu133a; Source: accession number under which the raw intensity data can be found at GEO [34], except for dataset D10, which is available at ArrayExpress [35] (accession number E-MTAB-365); Reference: reference to the main study.

The 344-sample VDX dataset (D17) combines the expression data of the 286-sample dataset by Wang et al. [36] and the 58-sample ER- dataset by Yu et al. [37]. Finally, note that the Symmans datasets (D11-D13) are ER+ datasets. To prevent bias due to scaling a dataset with a highly skewed subtype distribution [26, 38], datasets D12 and D13 were first concatenated to the VDX dataset and scaled together with it as a single dataset, after which the VDX samples were removed. Similarly, dataset D11 was combined with the expO dataset during scaling. A similar strategy was followed by Haibe-Kains et al. [12].
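
The joint-scaling strategy described in the legend can be illustrated with a small sketch. It assumes a quantile-based robust scaling per probeset, mapping the 2.5% and 97.5% quantiles to -1 and +1; that choice of quantiles, the function names `robust_scale` and `scale_with_reference`, and the simulated data are all assumptions made for illustration, not the settings confirmed by the paper.

```python
import numpy as np

def robust_scale(x, q=0.05):
    """Scale each column (probeset) so its q/2 and 1-q/2 quantiles map to -1 and +1.

    Assumed form of robust scaling; the paper's exact settings may differ.
    """
    lo = np.quantile(x, q / 2, axis=0)
    hi = np.quantile(x, 1 - q / 2, axis=0)
    return 2 * (x - lo) / (hi - lo) - 1

def scale_with_reference(skewed, reference, q=0.05):
    """Scale a skewed dataset jointly with a reference dataset, then drop the reference rows."""
    combined = np.vstack([skewed, reference])
    scaled = robust_scale(combined, q)
    return scaled[: skewed.shape[0]]

rng = np.random.default_rng(0)
# Hypothetical toy data: rows are samples, columns are probesets.
reference = rng.normal(loc=0.0, scale=1.0, size=(200, 3))  # stand-in for a mixed-subtype dataset
skewed = rng.normal(loc=1.5, scale=0.3, size=(40, 3))      # stand-in for an ER+-only dataset

alone = robust_scale(skewed)                     # scaled in isolation
joint = scale_with_reference(skewed, reference)  # scaled together with the reference

print(alone.mean(axis=0))
print(joint.mean(axis=0))
```

In this toy setting, scaling the skewed dataset on its own recentres it and hides its systematically high expression, whereas scaling it together with the reference keeps that shift visible. This is the kind of bias the concatenate-scale-remove procedure is meant to avoid.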

ISSN: 1755-8794
