Functional characterization of breast cancer using pathway profiles

Background The molecular characteristics of human diseases are often represented by a list of genes termed “signature genes”. A significant challenge facing this approach is that of reproducibility: signatures developed on a set of patients may fail to perform well on different sets of patients. As diseases are resulted from perturbed cellular functions, irrespective of the particular genes that contribute to the function, it may be more appropriate to characterize diseases based on these perturbed cellular functions. Methods We proposed a profile-based approach to characterize a disease using a binary vector whose elements indicate whether a given function is perturbed based on the enrichment analysis of expression data between normal and tumor tissues. Using breast cancer and its four primary clinically relevant subtypes as examples, this approach is evaluated based on the reproducibility, accuracy and resolution of the resulting pathway profiles. Results Pathway profiles for breast cancer and its subtypes are constructed based on data obtained from microarray and RNA-Seq data sets provided by The Cancer Genome Atlas (TCGA), and an additional microarray data set provided by The European Genome-phenome Archive (EGA). An average reproducibility of 68% is achieved between different data sets (TCGA microarray vs. EGA microarray data) and 67% average reproducibility is achieved between different technologies (TCGA microarray vs. TCGA RNA-Seq data). Among the enriched pathways, 74% of them are known to be associated with breast cancer or other cancers. About 40% of the identified pathways are enriched in all four subtypes, with 4, 2, 4, and 7 pathways enriched only in luminal A, luminal B, triple-negative, and HER2+ subtypes, respectively. Comparison of profiles between subtypes, as well as other diseases, shows that luminal A and luminal B subtypes are more similar to the HER2+ subtype than to the triple-negative subtype, and subtypes of breast cancer are more likely to be closer to each other than to other diseases. Conclusions Our results demonstrate that pathway profiles can successfully characterize both common and distinct functional characteristics of four subtypes of breast cancer and other related diseases, with acceptable reproducibility, high accuracy and reasonable resolution.

Annotate EGA data sets. It has been noted that many Illumina probes have unreliable original annotations, which leads to sub-optimal performance for the downstream biology analysis [6]. As a consequence, we re-annotated probes of EGA data according to the results in [6] and only used the probes with either 'Perfect' or 'Good' annotation. When multiple probes target the same gene, we used the median of these probes.
Differentially expressed genes. We used two-tailed t-test to identify differentially expressed genes (DEGs) for each subtype of breast cancer in each data set. After p-values for all genes were computed, we computed the FDR to correct for multiple testing problem using the method by Benjamini and Hochberg [7].
The comparison of the reproducibility between enriched pathways and top 1500 DEGs is based on following concerns. 1) From a hypergeometric statistics perspective, larger number of genes will achieve better reproducibility. We therefore used the top 1500 DEGs in the reproducibility comparison instead of known signature genes available in the literature because the number of the latter is much smaller than 1500. In addition, the selection of 1500 is based on our observation that the increase in reproducibility of the top DEGs slows down quickly after its number reaches 1500 ( Figure S1). 2) We also performed the comparison using the top 6000 DEGs and the enriched pathways still achieve better reproducibility ( Figure S2), while the total KEGG pathways used in this study only covers 5584 genes.
Determination of FDR cut-off. In this study, we used FDR cut-off 0.1 to select enriched pathways. In general, decreasing the cut-off value will increase the percentage of true positives, but also decrease common enriched pathways (CEPs) across different data sets and increase false negatives. Empirically, how to determine cut-off depends on the application. For our case, we intended to characterize subtypes of breast cancer by all the dysregulated pathways instead of finding the most significant biomarkers. Thus we considered the following points when determining FDR cut-off. First, results of each data set should contain a reasonable number of enriched pathways for further analysis. Using a too stringent DFR cut-off may miss some potentially significant results. Second, results of different data sets should have a nontrivial overlap. Third, by literature search we found that quite a few of enriched pathway are related to breast cancer or cancer when using this cut-off value. In some cases, the pathway enrichment analysis result may have no significant pathways or too many significant pathways when using a general FDR cut-off value. Then one may use the top ranked pathways for analysis. This strategy can still make sense in some extent. In addition, some applications require the same number of enriched pathways for different data sets. In this case, people can determine the appropriate number of top pathways by making sure that the top pathways of each data set can all satisfy the same constraint of FDR cut-off.
Coverage of KEGG pathway genes. In this work, all three used breast cancer data sets have a good coverage of pathway genes. For pathway enrichment analysis, a fundamental requirement is that most of the pathway genes should be contained by the platform. The used KEGG pathways contain 5584 genes. TCGA RNA-Seq, TCGA microarray and EGA microarray data sets contain about 20360, 17814 and 17621 genes, which cover 97%, 91% and 90% pathways genes, respectively. We also investigated how many genes of an individual pathway were covered by platforms. We defined the pathway gene coverage rate for a pathway as the percentage of genes of a pathway which are contained by a data set. We then plotted the curves of the empirical cumulative distribution function (CDF) for the pathway gene coverage rate for different data sets as shown in Figure S3. It is clearly seen that all three data sets have a good coverage of genes for each pathway. For TCGA RNA-Seq, TCGA microarray and EGA microarray data sets, only 0.6%, 5.7% and 9.7% pathways have pathway gene coverage rates smaller than or equal to 80%, respectively. Currently, there is still no a standard to judge whether a platform is acceptable when considering coverage of pathway genes. An experiential method is to compare the CDF for pathway gene coverage rate of a platform with that of a widely used platform such as the Affymetrix HG-U133A chip. The platform will be acceptable, if its CDF curve for pathway gene coverage rate is about in the right of that of the Affymetrix HG-U133A chip. In Figure S3 we can see that all the three used breast cancer data sets have better pathway gene coverage than that of the Affymetrix HG-U133A chip.
Comparison between consistent and inconsistent pathways. We have grouped enriched pathways according to the number of data sets in which they are enriched in Additional file 4 so that it will be much easier to identify the enriched pathways that are consistent or inconsistent across the three data sets. After carefully observing these consistent or inconsistent results, we think that the consistent pathways (enriched in all data sets) are more likely to be true positive results than those inconsistent ones in general because 1) we find that consistent pathways generally have lower p-values and the pathways enriched in only one data set generally have higher p-values ( Figure S4); 2) it is much harder to find literature support for pathways enriched in only one data set than others.
EGA validation data set. The EGA data [3] has an additional validation data set besides the discovery data set used in our analysis. We have generated pathway profiles using EGA validation data set. We found that the reproducibility of the enriched pathways between EGA discovery and validation data sets are 81%, 87%, 91% and 89% for luminal A, luminal B, triple-negative, and HER2+ subtypes, respectively. As expected, the reproducibility between EGA discovery set and validation set is higher than the rest cases reported in the manuscript of all the four subtypes of breast cancer ( Figure S5). As also shown in Figure S5, there is little change in reproducibility between the two TCGA data sets and the EGA data set when substituting the EGA discovery set with the validation set. The another reason why the EGA discovery set is used in the manuscript is because the discovery set seems more homogenerous than the validation set. For an example, we found 17% tumor samples in EGA validation data set are in the normal-like subtype based on PAM50 classification. This percentage is much higher than that of other three data sets, which are 3%, 2%, and 6% for TCGA RNA-Seq data set, TCGA microarray data set, and EGA discovery data set, respectively. Currently, there are still doubts about the real existence of normallike subtype and some researchers believe they could be a technical artifact from high contamination with normal tissue [8]. Taken these factors together, the EGA discovery data set is therefore used in the manuscript. ER+ specific pathways. ER+ specific pathways are those pathways specific to both of luminal A and luminal B subtypes, which are all ER+ tumors. It is valuable to have an insight on these ER+ specific pathways, as ER status is a very important factor in planning breast cancer treatment. Among 6 ER+ specific pathways listed in Table S1, four of them have supporting evidence from previous studies: the Primary bile acid biosynthesis, Jak-STAT signaling pathway, Complement and coagulation cascades, and GnRH signaling pathway. Lower levels of unconjugated bile acids have been observed in the serum of women with ER+ tumors than that in the healthy women [9]. Further study [10] suggested that naturally occurring bile acids influence the growth and steroid receptor function of human breast cancer cells based on the experiments in the ER+ MCF-7 human breast cancer cell line. For the Jak-STAT signaling pathway, the presence of pStat5 is found predominantly in well-differentiated estrogen receptor (ER) -positive tumors and is associated with favorable prognosis [11]. The complement and coagulation cascades pathway has also been shown to be involved specifically with ER+ breast cancer, where one study [12] noted that this pathway was enriched for proteins in plasma samples of ER+ breast cancer compared to control plasma, where 467 quantified proteins were mapped to their corresponding genes to do pathway enrichment analysis. For the GnRH signaling pathway, it has been demonstrated that estradiol positive and negative feedback on GnRH neuron firing activity require signaling via ERα [13], which is overexpressed in ER+ tumors. Figure S1 Reproducibility of top DEGs between different data sets vs. number of top DEGs for each subtype of breast cancer. The reproducibility of top DEGs is plotted against the number of top DEGs. In general, the reproducibility increases as the number of DEGs increases. On the other hand, the reproducibility increases more slightly even though the number of top DEGs still keeps increasing. It is clear that for each subtype of breast cancer the pair between two TCGA data sets has the highest reproducibility and the pair between TCGA microarray and EGA microarray data sets has the lowest reproducibility when the number of top DEGs is larger than some value.  Figure S3 Curves of the empirical cumulative distribution function (CDF) for the percentage of genes of a pathway which are contained by different data sets/platforms. The pathway gene coverage rate is defined as the percentage of genes of a pathway which are covered by a data set/platform. X-axis is the pathway gene coverage rate. Y-axis represents the percentage of used KEGG pathways whose pathway gene coverage rates are smaller or equal to the corresponding pathway gene coverage rate.

Figure S5
Reproducibility between two TCGA data sets (RNA-Seq and microarray) and two EGA microarray data sets (discovery set and validation set). The FDR cut-off 0.1 is used. It can be seen that whether using the EGA discovery set or validation set does not affect the reproducibility between TCGA data sets and EGA data sets much. On the other hand, the reproducibility between two EGA data sets is obviously higher than that between TCGA data sets and EGA data sets. Tables   Table S1 ER+