Significant random signatures reveals new biomarker for breast cancer

Background In 2011, Venet et al. proposed that, at least in the case of breast cancer, most published signatures are not significantly more associated with outcome than randomly generated signatures. They suggested that the nominal p-value is not a good estimator of the significance of a signature. Therefore, one can reasonably postulate that some information might be present in such significant random signatures.
Methods In this research, we first show that, using an empirical p-value, these published signatures are more significant than random signatures. In other words, the proposed empirical p-value can be considered a complementary criterion to the nominal p-value for distinguishing random signatures from significant ones. Secondly, we develop a novel computational method to extract the information embedded within significant random signatures. In our method, a score is assigned to each gene based on the number of times it appears in significant random signatures. Then, these scores are diffused through a protein-protein interaction network and a permutation procedure is used to determine the genes with significant scores. The genes with significant scores are considered as the set of significant genes.
Results First, we applied our method to the NKI breast cancer dataset to obtain a set of significant genes in breast cancer from significant random signatures. Secondly, the prognostic performance of the computed set of significant genes was evaluated using the DMFS and RFS datasets. We observed that the top ranked genes from this set can successfully separate patients with poor prognosis from those with good prognosis. Finally, we investigated the expression pattern of TAT, the first gene reported in our set, in malignant breast cancer vs. adjacent normal tissue and in mammospheres.
Conclusion Applying the method, we found a set of significant genes in breast cancer, including TAT, a gene that has never been reported as an important gene in breast cancer. Our results show that the expression of TAT is repressed in tumors, suggesting that this gene could act as a tumor suppressor in breast cancer and could be used as a new biomarker.


Background
Cancer is a complex disease caused by uncontrolled division of abnormal cells in the body. This uncontrolled division is usually due to one or several mutations in so-called cancer driver genes, which increase the survival and proliferation of the cells under favorable microenvironmental conditions. Breast cancer is a leading cause of death among women [1]. Some evidence shows that a rare population of cells inside the tumor is responsible for growth, development, invasion and metastasis [2,3]. Therefore, discovering and controlling the mechanisms that regulate self-renewal and metastasis in tumors before they reach the late stage is essential for personalized patient care [4,5]. Different cancer driver genes have been described in breast cancer, including TP53, BRCA1 and PALB2 [6]. Cancer genes do not act separately, and deregulation of various genes from different pathways can lead to cancer initiation or progression [7,8]. These genes give selective advantages to the cells, leading to profound changes in the cellular and molecular phenotype of the cancer cells as compared to their normal counterparts. Many transcriptomic studies have shown that cancer cells exhibit specific expression profiles, and that these profiles can be used not only to separate normal from cancer cells but also to classify tumor samples with different clinico-pathological features [9]. Classical methods that search for cancer driver genes by looking at mutations can fail to discover important prognostic or therapeutic targets that are differentially expressed without carrying mutations. For this reason, substantial efforts have been made to predict gene signatures related to human cancer [10][11][12][13][14][15][16][17] and also cancer stem cells.
Some methods are based on single gene features, while others take into account the functional relationships between genes by considering a predefined biological network such as a co-expression network [12,16] or a protein-protein interaction (PPI) network [15,17].
Recent studies report that the performance of many network-based methods is comparable to that of methods based on single genes, and that they offer limited improvement in gene signature stability across datasets [12,13]. However, some approaches that produce informative genes or sub-networks by considering functionally related genes have been more successful in overcoming this problem [14,15]. An important task is the evaluation of the significance of a cancer signature. On the other hand, it is possible that many randomly created gene signatures, like known or predicted ones, are able to separate normal from cancer cells. The effectiveness of random genes in classifying samples is difficult to interpret. Several possibilities must be checked before drawing a general conclusion about why randomly selected genes carry differential information between controls and disease, and generic causal disease genes are very important for discovering the true signatures.
Statistical tests are usually applied to identify the association between a signature and outcome [18][19][20]. In 2011, Venet et al. [21] reported that gene signatures unrelated to cancer are significantly associated with breast cancer outcome. They compared 48 published breast cancer outcome signatures to random signatures of identical size and showed that the generated random signatures could significantly separate patients with good and poor outcome, even with nominal p-values lower than those of the published signatures. They suggested that the nominal p-value is not a good estimator of the significance of a signature and further hypothesized that such significant random signatures contain genes associated with proliferation and, to a lesser extent, the cell cycle. In this research, we show that, using an empirical p-value, the published cancer-related signatures are more significant than random signatures and that most of the random signatures are not significant with respect to the empirical p-value. We show that random signatures for which both the nominal and the empirical p-value are significant are informative and can be used to predict genes that are highly associated with cancer (in our case, breast cancer). To identify the information in such random signatures, we introduce a novel method. Briefly, a score is assigned to each gene representing the frequency of its presence in the significant random signatures. The scores are then diffused through a PPI network and a permutation procedure is used to determine the genes with significant scores. The subset of genes whose scores are significant is considered as the set of significant genes. This computational methodology is applied to the NKI cohort [10], a breast cancer dataset studied by Venet et al., to compute a set of significant genes. The disease association of this set is investigated using the GAD tool in the DAVID Functional Annotation server [22]. It is shown that this set is significantly related to breast cancer.
To evaluate the prognostic performance of the computed set of significant genes, we use the Distant Metastasis-Free Survival (DMFS) and Recurrence-Free Survival (RFS) datasets [12], which the Amsterdam Classification Evaluation Suite (ACES) assembled by compiling a large cohort of breast cancer samples from the National Center for Biotechnology Information's (NCBI's) Gene Expression Omnibus (GEO). The results show that the top ranked genes from the set of significant genes can successfully separate patients with poor and good prognosis in these datasets. To further investigate the function of the set of significant genes, pathway enrichment analysis is performed. Interestingly, the significantly enriched pathways are highly related to cancer, especially breast cancer, and can separate patients with poor prognosis from those with good prognosis. Finally, we investigated the association of the top 10 genes with breast cancer. Among them, only tyrosine aminotransferase (TAT), the first-ranked gene, has not been reported as a significant gene in breast cancer, and we show that this gene is frequently down-regulated in breast tumor samples. Therefore, we suggest TAT as a novel biomarker in breast cancer, and its potential as a tumor-suppressor gene should be further investigated.

Computing the empirical p-value for a signature
To compute the nominal p-value for a signature (or random signature), we follow Venet et al. [21]: the 295 patients of the NKI cohort [10] and the overall survival end-points are considered and the same outcome association estimation procedure is used. First, the cohort is split at the median of the first principal component (PC1) of the signature. Then, given this binary stratification of the cohort, the (observed) nominal p-value of the signature is computed using the standard Cox procedure (R package) [23]. The empirical p-value is then computed with a permutation procedure [14]. A permutation test is a statistical tool for constructing sampling distributions. Similar to bootstrapping, it builds the sampling distribution by resampling the observed data points. Under the null hypothesis of a permutation test, the sample labels are exchangeable, i.e. the outcome is independent of the observed variables [14,24]. By permuting the outcome values during the test, we observe many possible alternative outcomes and evaluate the significance of the true labels using the calculated nominal p-values. In the NKI cohort, we randomly shuffle the labels (N or ∼N) and compare the nominal p-value of each of the 48 breast cancer signature groups to the 1000 nominal p-values obtained by the permutation process. For the k-th breast cancer signature group, with nominal p-value $p_k$ and the 1000 nominal p-values $p(1), p(2), \ldots, p(1000)$ resulting from the permutation process, the Benjamini-Hochberg (BH) procedure is used to control the False Discovery Rate (FDR) in multiple testing [25]. Given α and the ordered sequence $p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)}$ of the m = 1001 nominal p-values, the adjusted p-values based on the BH method are calculated as:

$$p^{BH}_{(i)} = \min\left\{ \min_{j \ge i} \frac{m \, p_{(j)}}{j},\ 1 \right\} \qquad (1)$$

For the k-th breast cancer signature group, the p-value of the permutation test, called the empirical p-value, is equal to the fraction of the 1000 adjusted nominal p-values that are less than or equal to the adjusted nominal p-value of the k-th group ($p^{BH}_k$), as shown in Eq. 2:

$$p^{emp}_k = \frac{\left| \{\, i : p^{BH}(i) \le p^{BH}_k,\ 1 \le i \le 1000 \,\} \right|}{1000} \qquad (2)$$

where $p^{BH}(i)$ is the adjusted nominal p-value of the i-th permutation test. The discoveries, i.e. the significant tests, are those with an empirical p-value less than α = 0.05. The adjusted nominal p-value of each of the 48 breast cancer signature groups and the adjusted nominal p-values of its 1000 permutations are shown in Fig. 1.
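The empirical p-value described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `bh_adjust` and `empirical_p_value` are hypothetical, the nominal p-values are assumed to have been computed already (for example by the Cox procedure), and the BH adjustment is the standard step-up form applied to the observed p-value pooled with its permutation p-values.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def empirical_p_value(observed_p, permuted_ps):
    """Fraction of permuted adjusted p-values <= the observed adjusted p-value."""
    # Assumption: the observed p-value is adjusted together with its permutations.
    adjusted = bh_adjust([observed_p] + list(permuted_ps))
    observed_adj, permuted_adj = adjusted[0], adjusted[1:]
    hits = sum(1 for p in permuted_adj if p <= observed_adj)
    return hits / len(permuted_adj)
```

With 1000 permutation p-values per signature, a signature would be called significant when `empirical_p_value` returns a value below 0.05.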

Meta-analysis and diffusion kernel approach to extract the information embedded in significant random signatures
In a complex disease like cancer, genes do not act in isolation and the interactions between them play a significant role [7,8]. To take these interactions into account, the corresponding protein of each gene is considered and a PPI network is inferred using the STRING database [26]. All the Entrez IDs from the expression dataset and the Ensembl protein IDs from the STRING database are mapped to their gene names (HUGO symbols). The interactions between proteins in the STRING database include physical and functional associations. In our algorithm, the evidence of conserved neighbors, co-occurrence, fusion, co-expression and experiments is used to derive the interactions. Considering the significant random signatures, a score is assigned to each gene based on the number of times it is observed in these signatures. For example, a gene that occurs in 20 significant random signatures gets a score of 20. Let n be the number of genes and $S = (S_1, S_2, \ldots, S_n)$ be the scores of the genes. In this step, we construct a weighted graph G with nodes corresponding to the genes. Each node of G gets the score of its corresponding gene, and the weights of the edges of G are the interaction scores between the proteins coded by the genes, which are obtained from STRING. The score of an interaction reflects the confidence in the prediction of that interaction. The gene scores are diffused through G using the diffusion kernel of Kondor and Lafferty [15,27], as described below. The Laplacian matrix of a simple graph is defined as H = D − A, where D is the degree matrix and A is the adjacency matrix of the graph. For a simple graph G, A is a zero-one matrix whose diagonal entries are all zero, and the i-th diagonal entry of D is the sum of the entries in the i-th row of A. A similar approach can be used to construct the Laplacian matrix of a weighted graph G. In this case, the ij-th entry of the matrix A is the weight of the edge between genes i and j.
Similarly, the i-th diagonal entry of matrix D is the sum of the entries in the i-th row of A, and the Laplacian matrix is again defined as H = D − A. Considering $w_{ij}$ as the weight of the edge between genes i and j in graph G, the Laplacian matrix of G is $H = [H_{ij}]$, where:

$$H_{ij} = \begin{cases} -w_{ij} & \text{if } i \ne j \text{ and genes } i \text{ and } j \text{ interact} \\ \sum_{k \ne i} w_{ik} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

The diffusion kernel with generator H and bandwidth β is defined as:

$$K_\beta = e^{-\beta H}$$

where β is the diffusion strength. For kernels with low diffusion strength, scores are diffused only to a few well-connected neighbors, whereas for kernels with high diffusion strength, scores are diffused to distant nodes through the network. In this work, β is set to 0.3, since in [27] it is reported to achieve the lowest error rate on the breast cancer dataset. Using the matrix $K_\beta$, the new scores of the genes, called diffusion scores, are computed as follows:

$$S_\beta = K_\beta \, S^{T}$$

In effect, the diffusion score of a gene depends on its own score, the scores of its neighbors and the scores of more distant nodes.
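The diffusion step can be illustrated on a toy network. The sketch below is an assumption-laden illustration rather than the authors' code: it takes the kernel to be exp(−βH) with H = D − A, approximates the matrix exponential with a truncated Taylor series so that it needs no external library, and uses hypothetical helper names (`weighted_laplacian`, `diffusion_kernel`, `diffuse`). For a network with thousands of genes, a numerical routine such as `scipy.linalg.expm` would be the usual choice.

```python
def weighted_laplacian(weights):
    """H = D - A for a symmetric edge-weight matrix (list of lists)."""
    n = len(weights)
    H = [[-float(weights[i][j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        # Diagonal entry: sum of the weights of the edges incident to gene i.
        H[i][i] = float(sum(weights[i][j] for j in range(n) if j != i))
    return H


def mat_mul(A, B):
    """Plain matrix product of two square lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(H, beta=0.3, terms=30):
    """Approximate exp(-beta * H) by a truncated Taylor series (small graphs only)."""
    n = len(H)
    M = [[-beta * H[i][j] for j in range(n)] for i in range(n)]
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in K]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, M)]  # M^k / k!
        K = [[K[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return K


def diffuse(scores, K):
    """Diffusion scores: kernel matrix applied to the raw gene scores."""
    n = len(scores)
    return [sum(K[i][j] * scores[j] for j in range(n)) for i in range(n)]
```

On a three-gene chain where only the first gene has a raw score, the diffused score decreases with network distance from that gene while the total score is preserved, which matches the qualitative description of low versus high diffusion strength.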

Identifying significant genes by permutation procedure
To determine the significance of the diffusion scores of the genes, the following random permutation procedure is used. Let $S_\beta = (S_\beta(1), S_\beta(2), \ldots, S_\beta(n))$, where $S_\beta(i)$ denotes the diffusion score of gene i, and let $\phi_1, \phi_2, \ldots, \phi_{1000}$ be 1000 random permutations of {1, 2, ..., n}. We construct 1000 random diffusion scores $S^r_\beta$ by permuting $S_\beta$ according to $\phi_1, \phi_2, \ldots, \phi_{1000}$, as follows:

$$S^r_\beta = \left( S_\beta(\phi_r(1)), S_\beta(\phi_r(2)), \ldots, S_\beta(\phi_r(n)) \right), \quad 1 \le r \le 1000$$

Let $S^r_\beta(j)$ be the random diffusion score of gene j in the vector $S^r_\beta$. The null set $\{ S^r_\beta(j) \mid 1 \le r \le 1000 \}$ is considered for this gene. Then, the permutation score of $S_\beta(j)$ is computed by:

$$\text{perm}(j) = \frac{\left| \{\, r : S^r_\beta(j) \ge S_\beta(j),\ 1 \le r \le 1000 \,\} \right|}{1000}$$

The genes with a permutation score less than 0.05 are considered as the set of significant genes. The significant genes are sorted first by their permutation scores and then by their scores.
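A minimal sketch of this permutation procedure is given below. It rests on assumptions that the text does not fully specify: the null set for a gene is built by shuffling the vector of diffusion scores, the permutation score is the fraction of shuffled values at least as large as the observed one, and ties in the permutation score are broken by descending diffusion score. The function names and the `seed` parameter are hypothetical.

```python
import random


def permutation_scores(diffusion_scores, n_perm=1000, seed=0):
    """Fraction of permutations whose value at gene j is >= the observed score."""
    rng = random.Random(seed)
    n = len(diffusion_scores)
    exceed = [0] * n
    for _ in range(n_perm):
        shuffled = list(diffusion_scores)
        rng.shuffle(shuffled)  # one random permutation of the score vector
        for j in range(n):
            if shuffled[j] >= diffusion_scores[j]:
                exceed[j] += 1
    return [count / n_perm for count in exceed]


def significant_genes(genes, diffusion_scores, alpha=0.05, n_perm=1000, seed=0):
    """Genes with permutation score < alpha, sorted by that score, then by score."""
    perm = permutation_scores(diffusion_scores, n_perm=n_perm, seed=seed)
    kept = [(perm[j], -diffusion_scores[j], genes[j])
            for j in range(len(genes)) if perm[j] < alpha]
    return [gene for _, _, gene in sorted(kept)]
```

Fixing the seed only makes the illustration reproducible; it is not part of the described method.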

Computing a pathway-score
Let SG be the set of significant genes computed by the method. For a pathway P, let the set $P_{SG} = \{g_1, g_2, \ldots, g_k\}$ be the genes in SG that are present in pathway P. Each gene $g_i$ in $P_{SG}$ is given two values, $\mu_N(g_i)$ and $\mu_{\sim N}(g_i)$, computed using the following equations:

$$\mu_N(g_i) = \frac{1}{|N|} \sum_{p_j \in N} e_{g_i,p_j}, \qquad \mu_{\sim N}(g_i) = \frac{1}{|{\sim}N|} \sum_{p_j \in {\sim}N} e_{g_i,p_j}$$

where $p_j$ ranges over the patients of phenotype N or ∼N and $e_{g_i,p_j}$ denotes the expression value of gene $g_i$ in patient $p_j$. Similar to the procedure described by Lim et al. [14], for each patient $p_k$ in phenotype ∼N we define two new scores for pathway P, $score^{p_k}_N(P)$ and $score^{p_k}_{\sim N}(P)$, which are obtained with a weighted mean approach. For instance, $score^{p_k}_N(P)$ is a weighted mean of the values $(e_{g_i,p_k} - \mu_N(g_i))^2$, with the non-negative weights $e_{g_i,p_k}$:

$$score^{p_k}_N(P) = \frac{\sum_{i=1}^{k} e_{g_i,p_k} \left( e_{g_i,p_k} - \mu_N(g_i) \right)^2}{\sum_{i=1}^{k} e_{g_i,p_k}}$$

and $score^{p_k}_{\sim N}(P)$ is defined in the same way with $\mu_{\sim N}(g_i)$ in place of $\mu_N(g_i)$. In this formula, the weights are the expression values of the genes in SG that are present in pathway P. We use the non-negative terms $(e_{g_i,p_k} - \mu_N(g_i))^2$ and $(e_{g_i,p_k} - \mu_{\sim N}(g_i))^2$ as measures of the difference in gene expression with respect to the N and ∼N groups, respectively.
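The weighted-mean pathway score can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the data layout (a nested dictionary `expr[gene][patient]`), the function names and the assumption that expression values are non-negative with a positive sum are all introduced here for the example.

```python
def group_mean(expr, gene, patients):
    """Mean expression of one gene over a group of patients."""
    return sum(expr[gene][p] for p in patients) / len(patients)


def pathway_score(expr, pathway_genes, patient, reference_patients):
    """Weighted mean of squared deviations from the reference-group mean.

    Weights are the patient's own expression values (assumed non-negative).
    """
    numerator = 0.0
    denominator = 0.0
    for gene in pathway_genes:
        e = expr[gene][patient]
        mu = group_mean(expr, gene, reference_patients)
        numerator += e * (e - mu) ** 2
        denominator += e
    return numerator / denominator
```

Calling `pathway_score` once with the N patients and once with the ∼N patients as `reference_patients` gives the two scores of a patient for a pathway.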

Patients and cell line selection
The ethics committee at the Royan Institute approved this study, and all the patients gave written informed consent on the use of clinical specimens for medical research. Ten breast cancer patients undergoing curative resection are included in this study. The median age of the patients is 50 years (range 37-58 years). All patients are diagnosed with invasive ductal carcinoma; four of them are also metastatic. All patients underwent curative surgery; however, three of them received neoadjuvant therapy before surgery. Both tumor and adjacent non-tumor tissue (defined as tissue at least 1 cm from the tumor edge) are processed immediately after the operation. The expression of TAT is evaluated by quantitative real-time polymerase chain reaction (qRT-PCR) in all ten paired specimens. Among breast cancer cell lines, MCF-7 (characterized as metastatic, ER+, PR+/-, HER2- and Luminal A type) and MDA-MB-231 (characterized as metastatic, ER-, PR-, HER2-, Claudin-low type and highly invasive) are selected and subjected to mammosphere formation and further analysis of TAT expression.

RNA extraction and quantitative real-time polymerase chain reaction (qRT-PCR)
The expression of TAT (tyrosine aminotransferase) is assessed with specific primers (F: 5'ATGCTGATCTCTGTTATGGG3', R: 5'CACATCGTTCTCAAATTCTGG3') in tumor tissue, normal tissue and cell lines. Briefly, all specimens are preserved at -80°C until RNA extraction. Total RNA is isolated using Trizol reagent (Qiagen, USA) and treated with DNAse I (Fermentas, USA) for 30 minutes to digest the genomic DNA. The quality of the RNA samples is monitored by agarose gel electrophoresis and a spectrophotometer (Biowave II, UK). A total of 2 μg of RNA is reverse transcribed with a cDNA synthesis kit (Fermentas, USA) and random hexamer primers according to the manufacturer's instructions. Transcript levels are determined using the SYBR Green master mix (Takara, Japan) and a Rotorgene 6000. Gene expression is normalized to the GAPDH housekeeping gene (F: 5'CTCATTTCCTGGTATGACAACGA3', R: 5'CTTCCTCTTGTGCTCTTGCT3'). Relative quantification of gene expression is calculated using the Ct method.

Computing empirical p-value for published breast cancer signatures
In

Extracting significant genes embedded in empirically significant random signatures
Like Venet et al. [21], we hypothesize that significant random signatures contain information. We introduce a novel method to extract the biologically relevant information in significant random signatures (see "Methods"). To obtain a set of significant genes in breast cancer from significant random signatures, we use the NKI cohort, a breast cancer dataset studied by Venet et al. [21]. To this end, a set of 1000 random signatures of identical size is generated for each of the 48 published breast cancer signatures. A random signature is considered significant if it is associated with breast cancer outcome according to both its nominal and its empirical p-value. To illustrate the procedure, consider one of the 48 signature groups, which contains 106 genes. First, we select 106 random genes from the set of all human genes. We then repeat this process 1000 times to construct 1000 random signatures of identical size. By applying the same procedure to each of the 48 signature groups, we obtain 48,000 random signatures.
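The generation of random signatures and the counting of gene occurrences in the significant ones can be sketched as below. The helper names are hypothetical, and `is_significant` is a placeholder callback standing in for the combined nominal and empirical p-value test, which is not reimplemented here.

```python
import random
from collections import Counter


def random_signatures(all_genes, size, n_signatures=1000, seed=0):
    """Draw n_signatures random gene sets of the given size, without replacement."""
    rng = random.Random(seed)
    return [rng.sample(all_genes, size) for _ in range(n_signatures)]


def gene_scores(signatures, is_significant):
    """Score of a gene = number of significant random signatures containing it."""
    counts = Counter()
    for signature in signatures:
        # is_significant is a hypothetical stand-in for the p-value criteria.
        if is_significant(signature):
            counts.update(set(signature))
    return counts
```

For a published signature of 106 genes, `random_signatures(all_genes, 106)` yields its 1000 size-matched random signatures; repeating this for all 48 published signatures gives the 48,000 random signatures.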

Disease association of significant genes
To investigate the association of the top ranked genes with disease, the Genetic Association Database (GAD) tool in the DAVID Functional Annotation server [22] is used. GAD is an archive of published genetic association studies, which allows the analysis of complex common human genetic diseases [28]. Given the 840 top ranked genes, the top-level disease and disease class assigned by GAD are breast cancer and cancer, with p-value = 0.0007 and p-value = 0.00098, respectively. Table 1 shows the enriched diseases and disease classes obtained for different sets of genes. It can be seen from this table that the disease classes of the sets of genes other than the first 840 top ranked ones are not related to cancer. This highlights how our method can extract meaningful information from significant random signatures.

Association of top 20 genes with DMFS and RFS datasets
To further investigate the importance of the genes extracted with our method, the prognostic performance of the top significant genes is computed using the DMFS and RFS datasets. These two datasets, introduced by Staiger et al. [12], are two cohorts of breast cancer samples from NCBI's GEO. The DMFS dataset is collected from six studies (Ivshina, Hatzis-Pusztai, Desmedt-June07, Miller, Schmidt, Loi), with 190 and 433 samples for poor and good prognosis, respectively. The RFS dataset contains 12 studies (Ivshina, Hatzis-Pusztai, Desmedt-June07, Minn, Miller, WangY-ErasmusMC, Schmidt, Pawitan, Symmans, Loi, Zhang, WangY), with 455 and 1161 samples for poor and good prognosis, respectively. The DMFS dataset is a subset of the RFS dataset. They differ, however, in that patients in the RFS dataset are labeled according to recurrence-free survival, whereas patients in the DMFS dataset are labeled according to distant metastasis-free survival. Among the top twenty significant genes computed previously, sixteen genes have gene expression

Prognosis value of the pathways associated with significant genes
To investigate the functions of the set of significant genes, hereinafter referred to as SG, pathway enrichment analysis is performed using ConsensusPathDB [29]. Only the pathways enriched with a p-value less than 10^-9 are considered (Table 2). Table 2 shows 22 enriched pathways from the KEGG, Wikipathways, SMPDB and PID databases. The association of these pathways with cancer is surveyed through an extensive literature search. Among the 22 identified pathways, 14 are directly involved in cancer development and mostly contribute to cell cycle, proliferation and self-renewal ability. The remaining pathways affect tumor progression indirectly. The significance of these pathways is then evaluated using the DMFS and RFS datasets. To find the prognostic value of the suggested pathways, a defined pathway-score is assigned to each patient and a statistical test is applied to distinguish the populations of scores for phenotypes N (good) and ∼N (poor). Considering pathway P, for each patient p_k in phenotype N, two scores, score^{p_k}_N(P) and score^{p_k}_∼N(P), are defined (see "Methods" for more details). The populations of the pathway-scores score^{p_k}_N(P) and score^{p_k}_∼N(P) are expected to differ for a pathway P that behaves differently in the two phenotypes N and ∼N. A t-test is applied to test H_0 (there is no important difference between the pathway-scores) versus H_1 (there is a difference between the pathway-scores). Most of the selected pathways can significantly separate the poor and good samples, with p-value < α (α = 0.05).

Association of top 10 genes with cancer
To gain better insight into the importance of the significant genes extracted from empirically significant random signatures, we investigated the role of the 10 most significant genes. An extensive literature search shows that most of the top 10 genes are reported to be associated with breast cancer or with cancer in general. Table 3 presents a summary of the functions of these genes. Among the listed genes, BIRC5, SEC14L2, thymidine kinase (TK1), ZNF385B, CLIC6, ELOVL1, CHAF1B and TFF1 have been reported to have a role in early detection of cancers, tumor progression and metastasis in most cancer types, including breast cancer (see Table 3). PHYHD1 [30] was recently identified as a predictor of progression-free survival and metastasis in prostate cancers. Surprisingly, the most significant gene, TAT (tyrosine aminotransferase), has not been reported to have a role in breast cancer. TAT encodes a mitochondrial enzyme that is mainly expressed in the liver and contributes to metabolism and carbon metabolism pathways [31]. The TAT gene is located on chromosome 16 at position q22.2. Intriguingly, this chromosomal region is frequently deleted in many tumors, including breast, liver, lung and gastric tumors, suggesting the existence of a tumor suppressor gene within this region [31,32]. A tumor suppressive mechanism of the TAT gene has previously been reported in hepatocellular carcinoma (HCC). Indeed, down-regulation of TAT is widely detected in primary HCC and is significantly associated with either the loss of the TAT allele or hypermethylation of TAT [32]. Introduction of TAT into HCC cells prevents their tumorigenicity, and TAT has a pro-apoptotic effect through the mitochondrial pathway [31]. Loss of chromosome 16q is widely reported in low-grade and luminal (ER+) breast cancer [31][32][33][34][35]. However, this study is the first to suggest a role for this gene in breast cancer.

Expression pattern of TAT in malignant breast cancer vs. adjacent normal tissue and mammospheres vs. parental adherent cells
Based on our data, we hypothesized that TAT could play an important role in breast cancer. Therefore, its expression is evaluated in breast tumor samples. All tumors in the present study are classified as invasive ductal carcinoma (IDC). Three samples are ER+, PR+ and HER2+. Three patients underwent neoadjuvant therapy prior to surgery because of their histopathological characteristics and tumor stage. As shown in Fig. 5, in most cases TAT is under-expressed in the tumor as compared to adjacent normal tissue. However, two samples over-expressed TAT. Surprisingly, the expression of TAT increased in mammospheres derived from MCF-7 and MDA-MB-231 as compared to their adherent counterparts (about 3.2-fold, p < 0.001). The decreased expression of TAT in tumor as compared to normal tissue is confirmed in the TCGA BRCA dataset. Only the cases for which RNA-seq data are available for both tumor and adjacent normal tissue are considered for analysis. A massive and highly significant (p-value < 10^-15) decrease of TAT expression is observed in tumors as compared to their adjacent tissue in most of the samples (87/112, median decrease of 20-fold, Fig. 6).

Discussion
Nominal p-values are most commonly used to show the significance of observations. In 2011, Venet et al. [21] suggested that nominal p-values are not reliable measures of the significance of the association between a human cancer signature and outcome. They showed that, at least in the case of breast cancer, signatures reported in the literature are no better than randomly generated signatures. To show this, they generated random signatures that could separate patients with good and poor outcome with significant nominal p-values. They further suggested that such significant random signatures are due to genes associated with proliferation and the cell cycle.
In this research, we first show that, when the empirical p-value is used as a complementary criterion to the nominal p-value, most of the random signatures are not more significant than published signatures related to breast cancer. Next, we focus on the subset of random signatures for which both the empirical and the nominal p-value are significant. This subset of random signatures may contain information that makes them as significant as published ones. To show that the significant random signatures are informative, we apply a computational method to extract the information embedded within them. To do this, we define a novel scoring method that gives each gene a score based on the number of significant signatures that contain it. Since genes do not act in isolation in a complex disease like cancer and the interactions between them play a significant role, we consider the relationships between the genes in a PPI network. To this end, a diffusion method on the PPI network is used to smooth the scores of the genes. Using a permutation method, the genes with significant scores are selected as cancer-related genes.
We applied this method to the NKI cohort, a breast cancer dataset studied by Venet et al. [21], to obtain a set of significant genes in breast cancer. This predicted set of genes is shown to be related to breast cancer. To evaluate the prognostic performance of the computed set of significant genes, we used the DMFS and RFS datasets, which contain cohorts of 6 and 12 datasets from GEO, respectively, introduced by Staiger et al. [12]. We show that the set of significant genes can separate patients with poor and good prognosis in these datasets. To assess the accuracy of the method, we proceeded as follows. First, pathway enrichment analysis using ConsensusPathDB is performed on this set of genes, considering the KEGG, Wikipathways, SMPDB and PID databases. All enriched pathways, including cell cycle, the p53 signaling pathway and DNA damage response, are associated with cancer development. Secondly, for most of the significant genes obtained by this method (including all of the 10 most significant genes), a role in cancer initiation or progression has been described in multiple types of cancer. In fact, 8 of these 10 genes have been shown or are suspected to play key roles in breast cancer development (see Table 3), highlighting the effectiveness of our method. In addition, our method could effectively identify new important candidates for the cancer type being studied: it identified TAT, which has not so far been reported in breast cancer. In summary, the obtained results demonstrate the accuracy of the proposed method, as it can extract meaningful information from a set of completely random signatures. This method allows the identification of genes whose expression has predictive value and that are associated with cancer-related pathways. Finally, we checked the expression of TAT in human breast cancer tissues as well as in mammospheres as a model of breast cancer stem cells.
TAT is down-regulated in most of the invasive ductal carcinoma patients (71%) included in this study and in TCGA patients from the BRCA project. Interestingly, a previous study reported that TAT, which is located on chromosome 16q, has a tumor suppressive role in hepatocellular carcinoma (HCC) [31]. Indeed, down-regulation of TAT expression is widely detected in primary HCC and is significantly associated with either the loss of the TAT allele or hypermethylation of TAT. Introduction of TAT into HCC cells prevents their tumorigenicity. TAT has also been shown to exert a pro-apoptotic effect through the mitochondrial pathway [31]. Although the role of TAT in breast cancer is unclear, the loss of chromosome 16q has been widely reported in low-grade and luminal (ER+) breast cancer [31][32][33][34][35]. TAT is down-regulated in seven of the ten patients in the present study, suggesting that loss or low expression of TAT could contribute to the initiation and/or progression of breast cancer. However, TAT is up-regulated in two patients as well as in mammospheres derived from malignant breast cancer cell lines. Mammospheres are a model for enriching breast cancer stem cells [36,37]. Several studies indicate that breast cancer stem cells are responsible for resistance to chemotherapy [38,39] and induction of metastasis [40]. Therefore, the similarity of TAT expression in mammospheres and in these two patients leads to the hypothesis that over-expression of TAT may be associated with resistance of the tumor to therapy. This hypothesis can be the subject of future research.

Conclusion
In conclusion, random signatures can contain significant information that helps to discover new cancer genes. The method we developed can be used to rank the genes extracted from significant random signatures and to predict important signatures in cancer. In addition, this study is the first to suggest a role for TAT in breast cancer. However, further investigations should be conducted to elucidate the putative tumor suppressor properties of TAT in breast cancer as well as its potential importance in stem cells, metastasis and drug resistance.
Additional file 1: The set of 840 significant genes resulting from the "Extracting significant genes embedded in empirically significant random signatures" subsection.