The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets – improving meta-analysis and prediction of prognosis
© Sims et al; licensee BioMed Central Ltd. 2008
Received: 14 May 2008
Accepted: 21 September 2008
Published: 21 September 2008
The number of gene expression studies in the public domain is rapidly increasing, representing a highly valuable resource. However, dataset-specific bias precludes meta-analysis at the raw transcript level, even when the RNA is from comparable sources and has been processed on the same microarray platform using similar protocols. Here, we demonstrate, using Affymetrix data, that much of this bias can be removed, allowing multiple datasets to be legitimately combined for meaningful meta-analyses.
A series of validation datasets comparing breast cancer and normal breast cell lines (MCF7 and MCF10A) were generated to examine the variability between datasets generated using different amounts of starting RNA, alternative protocols, different generations of Affymetrix GeneChip or scanning hardware. We demonstrate that systematic, multiplicative biases are introduced at the RNA, hybridization and image-capture stages of a microarray experiment. Simple batch mean-centering was found to significantly reduce the level of inter-experimental variation, allowing raw transcript levels to be compared across datasets with confidence. By accounting for dataset-specific bias, we were able to assemble the largest gene expression dataset of primary breast tumours to-date (1107), from six previously published studies. Using this meta-dataset, we demonstrate that combining greater numbers of datasets or tumours leads to a greater overlap in differentially expressed genes and more accurate prognostic predictions. However, this is highly dependent upon the composition of the datasets and patient characteristics.
Multiplicative, systematic biases are introduced at many stages of microarray experiments. When these are reconciled, raw data can be directly integrated from different gene expression datasets leading to new biological findings with increased statistical power.
Successful microarray experiments are reliant on sufficient care being taken to minimize and account for experimental variability. Formalization and control of all stages of the experimental pipeline is now routine, and the need to associate experiments with detailed descriptions of protocols and techniques is now widely accepted . However, despite these efforts, it is still not possible to account for all potential sources of variation, and even identical experiments performed at different sites have been shown to produce significantly different results . This makes it difficult to routinely compare gene expression data generated from different experiments, even when using samples from comparable sources that have been processed on the same microarray platform using similar protocols.
Issues of experimental reproducibility have become increasingly important with the advent of microarray databases and repositories (e.g. ArrayExpress , GEO ), given the potential they offer for cross-experimental comparison and data mining. Even if it is possible to successfully control inter-experiment variation to a point where this might be possible, rapid developments in both the hybridization protocols and in the arrays themselves have also led to major improvements in the technology. Lower requirements for the amount of starting RNA have enabled gene expression profiling to be combined with cell sorting methods or laser capture microdissection, while increases in the number of features represented on the arrays have resulted in progressively more detailed coverage of the transcriptome. Since each advance in technology leads to genuine improvements, there is a strong incentive to use the latest arrays and protocols whenever possible. This is however, tempered by a lack of backward compatibility between datasets produced using different array and protocol versions, and any decision to move to a newer (better) iteration of the technology must be made with an appreciation of the difficulty in maintaining compatibility with previous studies.
In this study we first demonstrate, using an extended series of validation data, that Affymetrix datasets cannot in general be compared at the raw expression level due to systematic, multiplicative biases. Secondly, we show that simple batch mean-centering can significantly reduce the level of inter-experimental variation and that this allows raw transcript levels to be compared across datasets. The approach is then applied to a series of published breast cancer studies and we show that the integrated datasets possess increased statistical power and improved prognostic ability, compared to the individual datasets alone.
Systematic bias in microarray data
Summary of the effect of mean batch-centering on data generated from different experiments.
Within two-fold consistent (%)
SAM Common, top 1000
Data from different experiments
Amplified 10 ng and unamplified 10 μg protocols, RMA, U133A
U133A and plus 2.0 arrays common, MAS5 present = 4/6 chips
Exon v U133 plus 2 (consensus mapping) Plier and MAS5
Exon v U133 plus 2 (SIF mapping) Plier and MAS5
Standard 10 μg and revised 2 μg protocols, RMA, U133A
GeneChip 3000 and GeneArray 2500 scanners, RMA, U133A
NuGEN 10 ng and EpiStem 2 ng amplification, RMA, U133 plus 2
Batch mean-centering (see methods) of the amplified and unamplified datasets was found to dramatically increase the correspondence comparison across the datasets, with 100% of probesets having fold changes within two-fold between experimental branches (Fig. 1B black dots, Table 1). Similarly, when significance analysis of microarrays (SAM) was used to identify lists of probesets with statistically significant changes between the same replicate groupings used to generate Fig. 1, the intersection between sets was also found to be greater following mean batch correction (Table 1) and Pearson correlation of raw intensities was also found to increase (Figs. 1C, 1D and Additional File 1). It is notable that while correction improves fold-change correspondence across protocols, fold changes between datasets are preserved (Fig. 1E, Table 1).
The same approach was applied to a variety of other validation datasets that were designed to investigate the effect of using different generations of Affymetrix GeneChips, alternative protocols and scanning hardware (Table 1, Additional File 1). A systematic bias was found to be present in all datasets, and correspondence improved following mean-centering in all cases (median centering also performed similarly; data not shown). These results demonstrate that systematic, multiplicative bias is a widespread property within Affymetrix array data, and mean-centering was found to lead to improvements irrespective of the summarisation method used (RMA or MAS5 ) to process the data or when using alternative Chip Description Files (CDFs) to group probes according to Unigene cluster  (Additional File 2).
Integration of published breast tumour datasets
The greatest single difference between molecular subtypes has repeatedly been demonstrated to be between estrogen receptor-positive (ER+) luminal tumours and ER-negative basal tumours [6–8, 13, 14]. SAM analysis was used to identify probesets differentially expressed between the basal/basal-like and non basal-like/luminal subtypes using the combined data from both sets of samples. It was only following mean-centering that probesets were identified that represent genes that are known to characterize the differences between these subtypes (Additional File 3), including the fundamental estrogen receptor alpha and GATA binding protein 3, which maintains differentiation into the luminal cell fate in the mammary gland . In addition, following mean-centering, a greater number of statistically significant probesets were found to be differentially expressed between the tumour subtypes than were found between the two initial sets of samples (Additional File 4).
Published breast cancer datasets used in this study.
Array express/GEO ID
Tumour Size (cm)
Chin et al. 2006
Desmedt et al. 2007
Farmer et al 2005
Ivshina et al. 2006
Loi et al. 2007
U133A, U133 plus2.0
Minn et al. 2007
Pawitan et al. 2005
Richardson et al.
Sotiriou et al. 2006
Wang et al. 2005
Effect of dataset composition on differential gene expression.
SAM Common, top 1000
Unamplified MCF7 (3) v MCF10A (3)
Amplified MCF7 (3) v MCF10A (3)
Unamplified MCF7 (3) v MCF10A (3)
Amplified MCF7 (3) v MCF10A (2)
Richardson et al. Non-basal (12) v basal (12)
Farmer et al. Luminal A (12) v basal (12)
Richardson et al. Non-basal (7) v basal (19)
Farmer et al. Luminal A (15) v basal (14)
Richardson et al. Non-basal (3) v basal (19)
Farmer et al. Luminal A (15) v basal (3)
The mean-centering approach was compared with a previously described method for integrating breast cancer tumour microarray data generated on different platforms, using a distance weighted discrimination (DWD) method to adjust for systematic microarray data biases . For both the validation datasets and the published datasets, mean-centering out-performed DWD (Table 3 and Additional File 10).
Most variable genes
Mean-centering has been widely used in the past to compare relative gene expression of high and lowly expressed genes together within a single dataset, particularly for heatmaps and clustering programs . However, this is the first study to assess the utility of mean-centering for minimising the effects of dataset-specific bias and integration of multiple datasets. An unknown, systematic, multiplicative bias associated with each group of arrays processed together ('dataset') is simply removed when the GeneChips are considered relative to each other. The approach clearly shows significant improvements in the degree of correspondence found across datasets, without any loss of internal coherence within each of the initial datasets from which the integrated dataset is assembled. Relative intensities within each individual dataset are left unchanged (Figure 1D), with the consequence that both fold-changes and p-values produced by techniques such as SAM, remain identical to those found prior to correction (Table 1). Therefore, balanced corrected datasets can be treated with at least as much confidence as the initial uncorrected data. We have also demonstrated that combining greater numbers of datasets or tumours increases the overlap in differentially expressed probesets between studies and that this is further improved with mean-centering.
A number of previous studies have also investigated the level of consensus found between different experimental datasets. The mean-centering approach out-performed a distance weighted discrimination method  that attempted to adjust for systematic microarray data biases for integrating breast cancer tumour microarray data generated on different platforms. This group stated that they had also applied this technique to 'merge two distinct Affymetrix breast tumor datasets together' and 'saw similar, but less dramatic results due to fewer systematic biases present in datasets performed on the same Affymetrix microarrays' . Our results suggest that there are many sources of systematic biases in Affymetrix data, which are highly significant and multiplicative, but that these can be largely corrected for, allowing the integration of datasets. An empirical Bayes method to adjust for batch effects  (ComBat; http://statistics.byu.edu/johnson/ComBat/) has also been used to integrate published datasets for meta-analysis . This approach generated plots analogous to those in Additional File 8 for mean-centering and weighted mean-centering when ER status was used as a covariate (data not shown). The mean-centering method described in this study was used in a recent meta-analysis whilst our manuscript was under review , although no attempt was made to account for differences in dataset composition.
Combining two published studies without mean-centering, clearly demonstrated how dataset-specific biases can mask the biological differences between breast cancer tumour subtypes (Figure 2). The Farmer et al.  dataset was generated from trucut tumour biopsies (4 × 2.5 μm sections), necessitating RNA amplification prior to hybridization to U133A GeneChips. By contrast, RNA in the Richardson et al. 2006 study  was derived from tumours following surgical removal, so amplification was not required prior to hybridisation to U133 plus 2.0 GeneChips. Despite the experimental differences between the studies, both of which have been shown above to lead to significant deviations in measured raw intensities in our validation datasets, mean-centering appears to reconcile the data and leads to the identification of biologically plausible relationships not found when combining uncorrected data.
The gold standard for demonstrating the power of a gene expression classifier is to test it against independent datasets. However, if the molecular profile of a set of tumours is representative of its patient characteristics, then any prognostic signature will be dependant upon the composition of the patient cohort and therefore be dataset-specific. Thus in order to generate accurate prognostic predictions, the characteristics of this second 'test' dataset must have similar characteristics to the first 'training' set . Recently, strong time dependence was identified for a prognostic signature when comparing an independent validation dataset with a longer median follow-up time (14 years) compared to the original study (8 years) . A number of recent microarray studies have been performed after increasing the size of a dataset with additional samples [8, 18, 20, 33], however it is unclear whether subsequent changes in the results are due to changes in the sample composition of the extended dataset or simply to technical effects arising from the microarrays being processed in different batches. Some studies have also based their findings upon combined data from more than one type of Affymetrix GeneChip without evaluating any GeneChip-specific effects.
By integrating six published datasets with patient follow-up information we have demonstrated that combining breast cancer datasets can increase the accuracy of prognosis prediction and that this can be improved by removing systematic, multiplicative bias. The most accurate prognosis predictions are generated when the test-sets closely share the patient and tumour characteristics of the training-sets. An alternative approach to building ever larger combined datasets representing the whole breast cancer population, would be to concentrate on generating gene expression classifiers for clearly defined groups of patients (e.g. node-negative, ER-positive from patients aged 50–60, with 10 years of follow-up). Strict entry criterion would severely restrict the number of tumours eligible for inclusion, whilst taking no account of possible unknown confounding factors. In clinical practice, we urgently need single sample predictors , applicable to all patients and our work strongly suggests that these will be best generated from the largest possible cohorts (or integrated datasets) representing the wider population, which will involve large international collaborations and public sharing of data. The current consensus for best practices for breast cancer treatment are based upon bringing together data for hundreds of trials representing thousands of women within the Early Breast Cancer Trialists' Collaborative Group (EBCTCG). If we can begin to bring together many large datasets of gene expression data free from dataset-specific bias the opportunity exists to create a highly valuable resource. A possible 'new' breast cancer subtype characterized by the high expression of interferon regulated genes  was identified by cluster analysis of a combined (non-Affymetrix) dataset of 315 breast tumours, which is consistent with the notion that rare molecular subtypes will only be detected with larger datasets. Our findings are in agreement with the conclusions of a recent study  that integrated breast tumour datasets generated on two different microarray platforms; they also showed that the gene expression profile generated by integrated analysis of multiple datasets achieves better prediction of breast cancer recurrence, and that the performance of profiles is confounded by the known and unknown clinical background of patients . In the current study however, we demonstrate improved prediction of prognosis in datasets derived from the same platform integrated using a simpler scaling method of the raw data rather than a normalisation method reliant on fold changes. One limitation of our study is that it was not possible to use a single definition of follow-up endpoint across the published datasets, in each case we used the most conservative indicator of relapse available (recurrence-free survival, disease-free survival and distant metastasis-free survival) rather than overall survival. There was also some variation in both patient age and tumour size between the studies (Table 3). Variation in gene expression due to the heterogeneity of patient characteristics has begun to be addressed by studies that investigate the effects of age , race [41, 42] and differences in risk factors [43, 44] for breast cancer. Integrating large breast tumour gene expression datasets will potentially enable us to uncover more subtle population-level associations, providing that all clinical details and follow-up information is consistent, complete and made publicly available.
Systematic multiplicative biases are introduced at many stages of microarray experiments, however they can easily be accounted for, which can enable raw data to be directly integrated from different gene expression datasets in order to generate results with improved statistical power and greater biological significance.
Generation and processing of RNA comparing the two breast cell lines, MCF7 (cancer) and MCF10A (normal immortalised mammary epithelial) was described previously [45, 46]. Briefly, cells were grown in Dulbecco's modified Eagle's medium (DMEM) with 10% fetal calf serum (MCF7) or DMEM/F12 with 5% horse serum, 2 ng/ml, 0.5 μg/ml hydrocortisone, 0.5 μg/ml cholera toxin, and 5 μg/ml insulin (MCF10A). Minimally passaged cells (< 20) were obtained from the American Type Culture Collection ATCC. RNA was isolated using Trizol® (Ambion) according to manufacturer's instructions, purified using Qiagen RNeasy columns (Qiagen, Valencia, CA) and quantified using a Nanodrop spectrophotometer (Labtech). The quality and amount of starting RNA was confirmed with an Agilent Bioanalyzer 2100 (Agilent) prior to labelling and hybridisation to HG-U133A, HG-U133 plus 2.0, or Human Exon 1.0ST GeneChips (Affymetrix) using either the Affymetrix standard or small sample preparation protocols as previously described .
GeneChip processing and analysis
Each experiment was repeated in triplicate, with three samples per cell line for each amount of starting total RNA, protocol or GeneChip used (1 dataset = 3 × MCF7 and 3 × MCF10A = 6 GeneChips). The HGU133A Genechips for the standard protocol and amplification experiments were scanned using a GeneArray 2500 scanner and the HG-U133 plus 2.0, and Exon 1.0ST Genechips were scanned using a GeneChip Scanner 3000. All MCF7 and MCF10A microarray data is MIAME compliant and accessible via MIAME VICE . All protocols are described in full here, http://bioinformatics.picr.man.ac.uk/mbcf/downloads/. Raw spot readings were processed using R  and Bionconductor . Probeset summarisation was done using MAS 5.0 and RMA  as implemented in the Simpleaffy package  or plier algorithm from Affymetrix ExACT software. Mappings between the Exon and U133A plus 2.0 GeneChips were performed as described previously . Alternative cell description files, relating probesets to unigene sequences were implemented as described previously . Lists of common significantly differentially expressed genes before and after mean batch-centering were identified using SAM  analysis (siggenes BioConductor package) by adjust Δ value to find the top 1000 differentially expressed probesets using each protocol, as described previously .
Correction by batch mean-centering
The mean expression level per probeset for a given dataset is subtracted from the individual GeneChip expression level on the Log2 scale. This can simply be performed in R using the 'rowMeans' function. Whilst preparing the manuscript we also noticed that this can be achieved using the 'pamr.batchadjust' function within the pamr Bioconductor package.
Affymetrix data was downloaded from a total of ten datasets from published studies listed in Table 2 from the Gene Expression Omnibus  or Array Express  repositories. Raw .cel files were not available for the Wang et al. dataset, so all other datasets were normalized as in this study using the MAS 5.0 algorithm with a target intensity of 600 as implemented in the Simpleaffy package , using R  within BioConductor . NetAffx  was used to identify Affymetrix probesets representing the 'intrinsic gene set' previously used to classify human breast tumours . Centered average linkage clustering was performed using the Cluster  and TreeView programs as described previously . Supervised principal components analysis using the Superpc for R package was used as previously described [29, 30], in order to compare the predictive power of combining different published datasets. The follow up endpoints for the Loi et al., Pawitan et al. and Sotoriou et al. datasets were recurrence-free survival, for Desmedt et al. and Ivshina et al. datasets it was disease-free survival and for the Minn et al. and Wang et al. datasets it was distant metastasis-free survival.
AHS is supported by Breakthrough Breast Cancer, RBC is funded by a Breast Cancer Campaign Fellowship, YH, GS, MJO, SDP and CJM are supported by Cancer Research UK.
- Brazma A, Kapushesky M, Parkinson H, Sarkans U, Shojatalab M: Data storage and analysis in ArrayExpress. Methods Enzymol. 2006, 411: 370-386. 10.1016/S0076-6879(06)11020-4.PubMedView Article
- Chu TM, Deng S, Wolfinger R, Paules RS, Hamadeh HK: Cross-site comparison of gene expression data reveals high similarity. Environ Health Perspect. 2004, 112 (4): 449-455.PubMedPubMed CentralView Article
- Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R: NCBI GEO: mining millions of expression profiles – database and tools. Nucleic Acids Res. 2005, D562-566. 33 Database
- Pepper SD, Saunders EK, Edwards LE, Wilson CL, Miller CJ: The utility of MAS5 expression summary and detection call algorithms. BMC Bioinformatics. 2007, 8: 273-10.1186/1471-2105-8-273.PubMedPubMed CentralView Article
- Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, et al: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005, 33 (20): e175-10.1093/nar/gni179.PubMedPubMed CentralView Article
- Perou CM, Sorlie T, Eisen MB, Rijn van de M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, et al: Molecular portraits of human breast tumours. Nature. 2000, 406 (6797): 747-752. 10.1038/35021093.PubMedView Article
- Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, Rijn van de M, Jeffrey SS, et al: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA. 2001, 98 (19): 10869-10874. 10.1073/pnas.191367098.PubMedPubMed CentralView Article
- Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, et al: Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA. 2003, 100 (14): 8418-8423. 10.1073/pnas.0932692100.PubMedPubMed CentralView Article
- Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, Siani-Rose MA: NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res. 2003, 31 (1): 82-86. 10.1093/nar/gkg121.PubMedPubMed CentralView Article
- Richardson AL, Wang ZC, De Nicolo A, Lu X, Brown M, Miron A, Liao X, Iglehart JD, Livingston DM, Ganesan S: X chromosomal abnormalities in basal-like human breast cancer. Cancer Cell. 2006, 9 (2): 121-132. 10.1016/j.ccr.2006.01.013.PubMedView Article
- Farmer P, Bonnefoi H, Becette V, Tubiana-Hulin M, Fumoleau P, Larsimont D, Macgrogan G, Bergh J, Cameron D, Goldstein D, et al: Identification of molecular apocrine breast tumours by microarray analysis. Oncogene. 2005, 24 (29): 4660-4671. 10.1038/sj.onc.1208561.PubMedView Article
- Calza S, Hall P, Auer G, Bjohle J, Klaar S, Kronenwett U, Liu ET, Miller L, Ploner A, Smeds J, et al: Intrinsic molecular signature of breast cancer in a population-based cohort of 412 patients. Breast Cancer Res. 2006, 8 (4): R34-10.1186/bcr1517.PubMedPubMed CentralView Article
- Sorlie T, Wang Y, Xiao C, Johnsen H, Naume B, Samaha RR, Borresen-Dale AL: Distinct molecular mechanisms underlying clinically relevant subtypes of breast cancer: Gene expression analyses across three different platforms. BMC Genomics. 2006, 7 (1): 127-10.1186/1471-2164-7-127.PubMedPubMed CentralView Article
- Hu Z, Fan C, Oh DS, Marron JS, He X, Qaqish BF, Livasy C, Carey LA, Reynolds E, Dressler L, et al: The molecular portraits of breast tumors are conserved across microarray platforms. BMC Genomics. 2006, 7: 96-10.1186/1471-2164-7-96.PubMedPubMed CentralView Article
- Kouros-Mehr H, Slorach EM, Sternlicht MD, Werb Z: GATA-3 maintains the differentiation of the luminal cell fate in the mammary gland. Cell. 2006, 127 (5): 1041-1055. 10.1016/j.cell.2006.09.048.PubMedPubMed CentralView Article
- Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R, Kuo WL, Lapuk A, Neve RM, Qian Z, Ryder T, et al: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell. 2006, 10 (6): 529-541. 10.1016/j.ccr.2006.10.009.PubMedView Article
- Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS, et al: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res. 2007, 13 (11): 3207-3214. 10.1158/1078-0432.CCR-06-2765.PubMedView Article
- Ivshina AV, George J, Senko O, Mow B, Putti TC, Smeds J, Lindahl T, Pawitan Y, Hall P, Nordgren H, et al: Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Res. 2006, 66 (21): 10292-10301. 10.1158/0008-5472.CAN-05-4414.PubMedView Article
- Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, et al: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res. 2005, 7 (6): R953-964. 10.1186/bcr1325.PubMedPubMed CentralView Article
- Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, et al: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006, 98 (4): 262-272.PubMedView Article
- Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, et al: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365 (9460): 671-679.PubMedView Article
- Sims AH, Ong KR, Clarke RB, Howell A: High-throughput genomic technology in research and clinical management of breast cancer. Exploiting the potential of gene expression profiling: is it ready for the clinic?. Breast Cancer Res. 2006, 8 (5): 214-10.1186/bcr1605.PubMedPubMed CentralView Article
- van 't Veer LJ, Dai H, Vijver van de MJ, He YD, Hart AA, Mao M, Peterse HL, Kooy van der K, Marton MJ, Witteveen AT, et al: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415 (6871): 530-536. 10.1038/415530a.PubMedView Article
- Adler AS, Lin M, Horlings H, Nuyten DS, Vijver van de MJ, Chang HY: Genetic regulators of large-scale transcriptional signatures in cancer. Nat Genet. 2006, 38 (4): 421-430. 10.1038/ng1752.PubMedPubMed CentralView Article
- Chang HY, Nuyten DS, Sneddon JB, Hastie T, Tibshirani R, Sorlie T, Dai H, He YD, Van't Veer LJ, Bartelink H, et al: Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc Natl Acad Sci USA. 2005, 102 (10): 3531-3532. 10.1073/pnas.0409462102.View Article
- Chi JT, Wang Z, Nuyten DS, Rodriguez EH, Schaner ME, Salim A, Wang Y, Kristensen GB, Helland A, Borresen-Dale AL, et al: Gene expression programs in response to hypoxia: cell type specificity and prognostic significance in human cancers. PLoS Med. 2006, 3 (3): e47-10.1371/journal.pmed.0030047.PubMedPubMed CentralView Article
- Liu R, Wang X, Chen GY, Dalerba P, Gurney A, Hoey T, Sherlock G, Lewicki J, Shedden K, Clarke MF: The prognostic role of a gene signature from tumorigenic breast-cancer cells. N Engl J Med. 2007, 356 (3): 217-226. 10.1056/NEJMoa063994.PubMedView Article
- Fan C, Oh DS, Wessels L, Weigelt B, Nuyten DS, Nobel AB, van't Veer LJ, Perou CM: Concordance among gene-expression-based predictors for breast cancer. N Engl J Med. 2006, 355 (6): 560-569. 10.1056/NEJMoa052933.PubMedView Article
- Bair E, Hastie T, Debashis P, Tibshirani R: Prediction by supervised principal components. Stanford Tech Report. 2004
- Bair E, Tibshirani R: Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol. 2004, 2 (4): E108-10.1371/journal.pbio.0020108.PubMedPubMed CentralView Article
- Ihaka R, Gentleman R: R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996, 5: 299-314. 10.2307/1390807.
- Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM, Gillet C, Ellis P, Harris A, Bergh J, Foekens JA, et al: Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. J Clin Oncol. 2007, 25 (10): 1239-1246. 10.1200/JCO.2006.07.1522.PubMedView Article
- Minn AJ, Gupta GP, Padua D, Bos P, Nguyen DX, Nuyten D, Kreike B, Zhang Y, Wang Y, Ishwaran H, et al: Lung metastasis genes couple breast tumor size and metastatic spread. Proc Natl Acad Sci USA. 2007, 104 (16): 6740-6745. 10.1073/pnas.0701138104.PubMedPubMed CentralView Article
- Benito M, Parker J, Du Q, Wu J, Xiang D, Perou CM, Marron JS: Adjustment of systematic microarray data biases. Bioinformatics. 2004, 20 (1): 105-114. 10.1093/bioinformatics/btg385.PubMedView Article
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95 (25): 14863-14868. 10.1073/pnas.95.25.14863.PubMedPubMed CentralView Article
- Johnson WE, Li C, Rabinovic A: Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007, 8 (1): 118-127. 10.1093/biostatistics/kxj037.PubMedView Article
- Acharya CR, Hsu DS, Anders CK, Anguiano A, Salter KH, Walters KS, Redman RC, Tuchman SA, Moylan CA, Mukherjee S, et al: Gene expression signatures, clinicopathological features, and individualized therapy in breast cancer. Jama. 2008, 299 (13): 1574-1587. 10.1001/jama.299.13.1574.PubMedView Article
- Ben-Porath I, Thomson MW, Carey VJ, Ge R, Bell GW, Regev A, Weinberg RA: An embryonic stem cell-like gene expression signature in poorly differentiated aggressive human tumors. Nat Genet. 2008, 40 (5): 499-507. 10.1038/ng.127.PubMedPubMed CentralView Article
- Zhang Z, Chen D, Fenstermacher DA: Integrated analysis of independent gene expression microarray datasets improves the predictability of breast cancer outcome. BMC Genomics. 2007, 8 (1): 331-10.1186/1471-2164-8-331.PubMedPubMed CentralView Article
- Yau C, Fedele V, Roydasgupta R, Fridlyand J, Hubbard A, Gray JW, Chew K, Dairkee SH, Moore DH, Schittulli F, et al: Aging impacts transcriptome but not genome of hormone-dependent breast cancers. Breast Cancer Res. 2007, 9 (5): R59-10.1186/bcr1765.PubMedPubMed CentralView Article
- Amend K, Hicks D, Ambrosone CB: Breast cancer in african-american women: differences in tumor biology from European-american women. Cancer Res. 2006, 66 (17): 8327-8330. 10.1158/0008-5472.CAN-06-1927.PubMedView Article
- Carey LA, Perou CM, Livasy CA, Dressler LG, Cowan D, Conway K, Karaca G, Troester MA, Tse CK, Edmiston S, et al: Race, breast cancer subtypes, and survival in the Carolina Breast Cancer Study. Jama. 2006, 295 (21): 2492-2502. 10.1001/jama.295.21.2492.PubMedView Article
- Millikan RC, Newman B, Tse CK, Moorman PG, Conway K, Smith LV, Labbok MH, Geradts J, Bensen JT, Jackson S, et al: Epidemiology of basal-like breast cancer. Breast Cancer Res Treat. 2007
- Yang XR, Sherman ME, Rimm DL, Lissowska J, Brinton LA, Peplonska B, Hewitt SM, Anderson WF, Szeszenia-Dabrowska N, Bardin-Mikolajczak A, et al: Differences in risk factors for breast cancer molecular subtypes in a population-based study. Cancer Epidemiol Biomarkers Prev. 2007, 16 (3): 439-443. 10.1158/1055-9965.EPI-06-0806.PubMedView Article
- Wilson CL, Pepper SD, Hey Y, Miller CJ: Amplification protocols introduce systematic but reproducible errors into gene expression studies. Biotechniques. 2004, 36 (3): 498-506.PubMed
- Okoniewski MJ, Hey Y, Pepper SD, Miller CJ: High correspondance between Affymetrix exon and standard expression arrays. Biotechniques. 2007, 42 (2): 181-185. 10.2144/000112315.PubMedView Article
- MIAME VICE. [http://bioinformatics.picr.man.ac.uk/vice]
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5 (10): R80-10.1186/gb-2004-5-10-r80.PubMedPubMed CentralView Article
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4 (2): 249-264. 10.1093/biostatistics/4.2.249.PubMedView Article
- Wilson CL, Miller CJ: Simpleaffy: a BioConductor package for Affymetrix Quality Control and data analysis. Bioinformatics. 2005, 21 (18): 3683-3685. 10.1093/bioinformatics/bti605.PubMedView Article
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98 (9): 5116-5121. 10.1073/pnas.091062498.PubMedPubMed CentralView Article
- Bergamaschi A, Kim YH, Wang P, Sorlie T, Hernandez-Boussard T, Lonning PE, Tibshirani R, Borresen-Dale AL, Pollack JR: Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer. Genes Chromosomes Cancer. 2006, 45 (11): 1033-1040. 10.1002/gcc.20366.PubMedView Article
- The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1755-8794/1/42/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.