 Research Article
 Open Access
Reproducible detection of disease-associated markers from gene expression data
BMC Medical Genomics volume 9, Article number: 53 (2016)
Abstract
Background
Detection of disease-associated markers plays a crucial role in gene screening for biological studies. Two-sample test statistics, such as the t-statistic, are widely used to rank genes based on gene expression data. However, the resultant gene ranking is often not reproducible among different data sets. Such irreproducibility may be caused by disease heterogeneity.
Results
When we divided data into two subsets, we found that the signs of the two t-statistics were often reversed. Focusing on such instability, we proposed a sign-sum statistic that counts the signs of the t-statistics for all possible subsets. The proposed method excludes genes affected by heterogeneity, thereby improving the reproducibility of gene ranking. We compared the sign-sum statistic with the t-statistic by a theoretical evaluation of the upper confidence limit. Through simulations and applications to real data sets, we show that the sign-sum statistic exhibits superior performance.
Conclusion
We derive the sign-sum statistic to obtain a robust gene ranking. The sign-sum statistic gives a more reproducible ranking than the t-statistic. Using simulated data sets, we show that the sign-sum statistic excludes hetero-type genes well. For the real data sets as well, the sign-sum statistic performs well in terms of ranking reproducibility.
Background
Detection of disease-associated markers plays a crucial role in gene screening for biological studies. In this field, statisticians seek to identify informative genes as candidates for further investigation. To this end, it is desirable to correctly rank genes according to their degree of differential expression. In such efforts, two-sample test statistics, such as the t-statistic and the Wilcoxon rank-sum statistic, are widely used to rank genes based on gene expression data.
However, the resultant gene rankings are often not reproducible among different data sets. Such irreproducibility may be caused by disease heterogeneity [1]. In fact, we can easily confirm this ranking irreproducibility in the microarray data used by [2] (see Application for more detail). That data set contains 51 non-metastatic samples and 46 metastatic samples from patients with breast cancer, a disease that is heterogeneous due to the existence of multiple subtypes [3]. We divided the full data into two independent sets (data1 and data2), transformed each data set such that the t-scores of data1 were positive without loss of generality, and then ranked genes using the t-statistic in data1 and data2 separately. Figure 1 shows the correspondence of these ranking scores. Some genes that were top-ranked in data1 had rather low scores in data2. Moreover, for many genes, even the signs of the t-scores were mismatched. This may be due to statistical variation arising from the limited sample size or to heterogeneous factors in breast cancer. Thus, the simple t-statistic or correlation yields an unstable estimate that strongly depends on the data set used.
Such heterogeneity has also been discussed in the context of cancer outlier analysis [4, 5], where methods based on two-sample test statistics were developed to detect genes that are over- or down-expressed in a subset of the disease group compared with the normal group. In this paper, we focus not on the subset but rather on the whole set. That is, our main aim is to detect the genes that are differentially expressed across the entire disease group. We therefore develop a ranking method that is robust to heterogeneity.
Novel methods such as the lasso can determine highly ranked genes and classifiers simultaneously. However, [6] supported the importance of choosing a filtering method that yields a gene ranking corresponding to feature selection, rather than to classification in machine-learning theory. In other words, pre-selection and evaluation of the resultant gene set must be separated from the classifier’s performance. Moreover, each top-ranked gene must itself be informative or effective in some sense, e.g., robust with respect to heterogeneity, which is the focus of this paper. Therefore, we consider a ranking score derived independently for each gene, unlike [7], which took correlations between genes into consideration.
Because t-statistics and correlations fluctuate strongly due to sample variation, combining a sampling method with a two-sample test statistic should improve reproducibility. The effects of sampling methods have been demonstrated by multiple studies, both theoretical [8] and applied [9]. Meanwhile, [10] presented counterarguments: several approaches to feature selection with ensemble learning by the sampling method are ineffective in terms of predictive ability, stability, and interpretability. Those authors concluded that the simple Student’s t-test exhibits superior performance in these regards. Dabney [11] also argued that the simple t-statistic is more accurate than the modified t-statistic and shrunken centroids. However, those authors’ conclusions are based on empirical studies in the context of particular data. We argue that the sampling method is effective from the standpoint of robustness with respect to heterogeneous factors: heterogeneity represents a mixture of two or more classes, so integrating information from many small subsamples is a natural way to capture it. To stabilize the performance of the simple t-statistic, we derived a sign-sum statistic that improves ranking reproducibility. This novel statistic repeatedly counts the sign of the mean difference between subsets of the normal and disease groups. The sign-sum statistic is an extension of the Wilcoxon rank-sum statistic, which itself has superior robustness but inferior power relative to the t-statistic [12]. In this paper, we present the asymptotic properties of the sign-sum statistic and demonstrate its superior performance through simulations and applications to real data.
Methods
Derivation
Let \(X_{ij}\) be the gene expression level for sample i=1,…,n and gene j=1,…,p. We assume that all samples fall into one of two groups, 0 or 1, which denote the normal and disease groups consisting of \(n_{0}\) and \(n_{1}\) samples, respectively. Then, a two-sample t-statistic is defined by
$$ t_{j} = \frac{\sqrt{n}\,(\bar{X}_{j1} - \bar{X}_{j0})}{s_{j}}, $$(1)
where \(\bar{X}_{jy}\) is the sample mean of group y for gene j. There are two options for \(s_{j}\): a pooled Student’s type or a non-pooled Welch’s type. In this paper, we use Welch’s t-statistic; therefore, \(s_{j}\) is written as \(s_{j} = \sqrt{s_{1j}^{2}/\hat{\pi}_{1} + s_{0j}^{2}/\hat{\pi}_{0}}\), where \(s_{yj}\) is the sample standard deviation of gene j in group y and \(\hat{\pi}_{y} = n_{y}/n\). Without loss of generality, we can assume that \(\bar{X}_{j1} - \bar{X}_{j0} \geq 0\).
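As an illustration, the Welch-type t-statistic described above can be computed for all genes at once. This is a minimal sketch, not the authors' code: the function name `welch_t`, the samples-by-genes array layout, and the 0/1 label vector are assumptions made for the example.

```python
import numpy as np

def welch_t(X, y):
    """Welch-type t-statistic for every gene (hypothetical helper).

    X : (n, p) array of expression levels, rows are samples.
    y : length-n array of labels, 1 = disease, 0 = normal.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    X1, X0 = X[y == 1], X[y == 0]
    pi1, pi0 = len(X1) / n, len(X0) / n          # sample ratios n_y / n
    # s_j = sqrt(s_1j^2 / pi1_hat + s_0j^2 / pi0_hat), non-pooled (Welch) form
    s = np.sqrt(X1.var(axis=0, ddof=1) / pi1 + X0.var(axis=0, ddof=1) / pi0)
    return np.sqrt(n) * (X1.mean(axis=0) - X0.mean(axis=0)) / s
```

With this scaling, the value equals the familiar Welch form, mean difference divided by \(\sqrt{s_{1j}^{2}/n_{1} + s_{0j}^{2}/n_{0}}\).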
However, as discussed in Background, the t-statistics can fluctuate between two divided data sets, and even the signs of the t-statistics can be mismatched. Therefore, we focus our attention on the signs of the t-statistics. If the t-statistics are evaluated on the full sample, the signs are positive for all genes by the assumption above. However, if the t-statistics are evaluated on subsets of the full sample, the signs may change, as shown in Fig. 1. Therefore, we derived a sign-sum statistic that counts the signs of the t-statistics for all possible subsets.
We pick samples of sizes a and b from groups 1 and 0, respectively; thus, there are \(\binom{n_{1}}{a}\) and \(\binom{n_{0}}{b}\) combinations of subsets from each class. The sign-sum statistic is defined as
$$ U_{j}^{S} = \frac{1}{k_{1} k_{0}} \sum_{t_{1}=1}^{k_{1}} \sum_{t_{0}=1}^{k_{0}} \mathrm{H}\left(\bar{X}_{j1t_{1}} - \bar{X}_{j0t_{0}}\right), $$(2)
where \(k_{1}=\binom{n_{1}}{a}\), \(k_{0}=\binom{n_{0}}{b}\), \(\mathrm{H}(x)\) is the Heaviside step function that takes the value 0 if x<0 and 1 otherwise, and \(\bar{X}_{jyt}\) is the sample mean of gene j in the t-th subset of group y. A larger value of the sign-sum statistic means that the signs of the t-statistics evaluated on subsamples are more stably positive. We can show that the sign-sum statistic is an extension of Wilcoxon’s rank-sum statistic; in fact, if a and b are both equal to 1, then the two are equivalent.
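The counting idea can be sketched for a single gene as follows. Enumerating all subsets is rarely feasible, so this example draws random subsets instead. That Monte Carlo shortcut, the function names, and the default arguments are assumptions for illustration and are not taken from the paper or its Additional file 2.

```python
import numpy as np

def sign_sum(x1, x0, a=1, b=10, n_draws=2000, seed=0):
    """Monte Carlo approximation of the sign-sum statistic for one gene.

    x1, x0 : expression levels in the disease and normal groups.
    a, b   : subsample sizes drawn from the disease and normal groups.
    """
    rng = np.random.default_rng(seed)
    x1 = np.asarray(x1, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    hits = 0
    for _ in range(n_draws):
        m1 = rng.choice(x1, size=a, replace=False).mean()
        m0 = rng.choice(x0, size=b, replace=False).mean()
        hits += int(m1 - m0 >= 0)      # Heaviside step: 1 if x >= 0, else 0
    return hits / n_draws

def sign_sum_exact_11(x1, x0):
    """Exact value for a = b = 1: the fraction of (disease, normal) pairs
    in which the disease value is at least the normal value."""
    x1 = np.asarray(x1, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    return float(np.mean(x1[:, None] >= x0[None, :]))
```

For a = b = 1 and data without ties, the exact value is the Mann-Whitney pair count divided by \(n_{1} n_{0}\), which is one way to see the stated equivalence with the Wilcoxon statistic.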
For comparison, we derived a t-statistic evaluated by subsamples. In a manner similar to the derivation of the sign-sum statistic, it is defined as
$$ U_{j}^{T} = \frac{1}{k_{1} k_{0}} \sum_{t_{1}=1}^{k_{1}} \sum_{t_{0}=1}^{k_{0}} \frac{\sqrt{a+b}\,\left(\bar{X}_{j1t_{1}} - \bar{X}_{j0t_{0}}\right)}{s_{j}}. $$(3)
These statistics are denoted by the character U because they are members of the class of U-statistics, as shown in Additional file 1. In the next subsection, we compare the sign-sum statistic (2) with the t-statistic evaluated by subsamples (3) from the perspective of U-statistics.
By the assumption \(\bar{X}_{j1} - \bar{X}_{j0} \geq 0\), a gene with a larger value of the statistic is regarded as more informative for the detection of differentially expressed genes; thus, the gene ranking is obtained by sorting the values of the statistic in descending order over all genes.
Robustness for heterogeneity
Heterogeneous disease factors can cause a mixture of two or more classes in some gene expression levels in the disease group. We call genes affected by such factors “hetero genes” and unaffected genes “homo genes”. The sign-sum statistic can effectively detect such heterogeneity. To demonstrate this, we provide a theorem about asymptotic confidence intervals.
Let U be a general two-sample U-statistic (we drop the gene index for simplicity). Because the U-statistic is asymptotically normal, its asymptotic confidence interval is \(\mathrm{E}[U] \pm (\sigma_{U}/\sqrt{n}) Z_{\alpha/2}\), where \(\sigma^{2}_{U}\) is the asymptotic variance of U and \(Z_{\alpha/2}\) is the upper 100α/2 percentile of the standard normal distribution. Because \(U^{T}\) and \(U^{S}\) are U-statistics, they can be evaluated by the interval estimators shown in Theorem 1.
Theorem 1

1. The asymptotic confidence interval of the t-statistic evaluated by subsamples with level α is
$$ \frac{\sqrt{a+b}\,(\mu_{1} - \mu_{0})}{\sqrt{\cfrac{\sigma_{1}^{2}}{\pi_{1}} + \cfrac{\sigma_{0}^{2}}{\pi_{0}}}} \pm Z_{\alpha/2} \frac{(a+b)^{1/2}}{\sqrt{n}} $$(4)
if \(\hat{\pi}_{1} \rightarrow \pi_{1}\) and \(\hat{\pi}_{0} \rightarrow \pi_{0}\), where \(\pi_{1}+\pi_{0}=1\) and \(Z_{\alpha/2}\) is the upper 100α/2 percentile of the standard normal distribution.

2. The asymptotic confidence interval of the sign-sum statistic (2) with level α is
$$ \mathrm{E}[U^{S}] \pm Z_{\alpha/2} \frac{\tilde{\sigma}}{\sqrt{n}}, $$(5)
if \(\hat{\pi}_{1} \rightarrow \pi_{1}\) and \(\hat{\pi}_{0} \rightarrow \pi_{0}\), where \(Z_{\alpha/2}\) is the upper 100α/2 percentile of the standard normal distribution, and
$$ \mathrm{E}[U^{S}] = \mathrm{E}[G_{1}(V_{1})], $$(6)
$$ \tilde{\sigma}^{2} = \frac{a^{2}}{\pi_{1}} \text{Var}[G_{1}(V_{1})] + \frac{b^{2}}{\pi_{0}} \text{Var}[G_{0}(V_{0})], $$(7)
where \(G_{y}(v)=\Pr(W_{y} \leq v)\) for y=0,1, and
$$ \begin{array}{rcl} V_{1} &=& \dfrac{1}{a} X_{11}, \quad W_{1} = -\dfrac{1}{a}\sum\limits_{i=2}^{a} X_{1i} + \dfrac{1}{b}\sum\limits_{j=1}^{b} X_{0j}, \\ V_{0} &=& -\dfrac{1}{b} X_{01}, \quad W_{0} = -\dfrac{1}{a}\sum\limits_{i=1}^{a} X_{1i} + \dfrac{1}{b}\sum\limits_{j=2}^{b} X_{0j}. \end{array} $$
Here the \(X_{1i}\) and \(X_{0j}\) are independently distributed according to \(F_{1}\) and \(F_{0}\), which denote the distribution functions of gene expression levels in the disease and normal groups, respectively.
A proof of Theorem 1 is given in Additional file 1. We note that \(V_{1}-W_{1}\) and \(V_{0}-W_{0}\) both represent the difference between the subsample means of the disease and normal groups, written so as to isolate one observation from the disease group and one from the normal group, respectively. A property of U-statistics allows us to evaluate the asymptotic variance of the sign-sum statistic by the conditional distribution of \(W_{y}\) given \(V_{y}\) for each group y. The difference between the two statistics is mainly due to the fact that the sign-sum statistic is a sum of nonlinear functions of the mean differences evaluated on subsamples. Consequently, information about \(F_{1}\) and \(F_{0}\) is strongly reflected in the sign-sum statistic as a and b change. We can discriminate hetero genes from homo genes by this property, as shown in the next subsection.
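To make interval (4) concrete, the following sketch evaluates its end points for given population parameters. The function name and argument names are hypothetical, and the standard-library `NormalDist` is used only to obtain \(Z_{\alpha/2}\).

```python
from statistics import NormalDist
import math

def subsample_t_interval(mu1, mu0, var1, var0, pi1, a, b, n, alpha=0.05):
    """End points of the asymptotic interval (4) for the subsampled t-statistic."""
    pi0 = 1.0 - pi1
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper 100*alpha/2 percentile
    centre = math.sqrt(a + b) * (mu1 - mu0) / math.sqrt(var1 / pi1 + var0 / pi0)
    half_width = z * math.sqrt(a + b) / math.sqrt(n)
    return centre - half_width, centre + half_width
```

Because the interval depends on the two distributions only through their means and variances, two genes that share those moments receive the same interval, whatever the shape of \(F_{1}\).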
The effects of different parameter settings
Here we aim to remove heterogeneous factors by choosing the values of a and b used in the sign-sum statistic, or equivalently by controlling the subsample sizes from the disease and normal groups. If the t-value of a homo gene is larger than that of a hetero gene, then the t-statistic easily distinguishes the homo gene from the hetero gene. However, if the t-values of the two genes are equal, then the t-statistic will confuse these genes, because their confidence intervals are equal. Such homo genes will nevertheless be top-ranked by the sign-sum statistic if we find a and b such that
The difficulty and importance of considering such hetero genes are also discussed in [13] in the context of the false positive rate. The sign-sum statistic repeatedly counts the sign of the difference between the means of the disease and normal groups. Hence, sign mismatches due to heterogeneity in the disease group would be detected effectively with a small value of a, chosen so that the sample mean of the disease group fluctuates. This consideration is supported by numerical evaluations of specific situations, as described below.
Let all gene expression levels in the normal group follow N(0,1) without loss of generality. Then, the homo gene expression levels of the disease group follow \(N(\mu_{1},\sigma_{1}^{2})\), and the hetero gene expression levels follow \(\tau_{1} N(m_{1},v_{1}^{2}) + \tau_{2} N(m_{2},v_{2}^{2})\), where \(\tau_{1}\) and \(\tau_{2}\) are mixing proportions with \(\tau_{1}+\tau_{2}=1\). Moreover, we constrain the expectations of the t-statistics of these genes to be equal in the limiting sense of convergence in probability. That is,
$$ \frac{\mu_{1}}{\sqrt{\cfrac{\sigma_{1}^{2}}{\pi_{1}} + \cfrac{1}{\pi_{0}}}} = \frac{\mu_{1}^{*}}{\sqrt{\cfrac{\sigma_{1}^{*2}}{\pi_{1}} + \cfrac{1}{\pi_{0}}}}, $$(9)
where \(\pi_{1}\), \(\pi_{0}\), \(\mu_{1}^{*}\), and \(\sigma_{1}^{*}\) are the sample ratio of the disease group, the sample ratio of the normal group, the expected mean of the hetero gene, and the expected standard deviation of the hetero gene, respectively. The ranking by t-statistics fluctuates because the interval estimators of the hetero and homo genes almost overlap.
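The equal-expectation constraint can be checked numerically. The sketch below takes the constraint to mean equality of the limiting value of the t-statistic divided by \(\sqrt{n}\), with the normal group distributed as N(0,1). That reading, the helper names, and the closed-form solution for \(m_{2}\) (derived here for \(m_{1}=0\), \(v_{1}^{2}=v_{2}^{2}=1\), \(\mu_{1}=1\), and \(\pi_{1}=\pi_{0}=1/2\)) are assumptions of this example rather than formulas quoted from the paper.

```python
import math

def mixture_moments(tau1, m1, v1_sq, m2, v2_sq):
    """Mean and variance of tau1*N(m1, v1_sq) + (1 - tau1)*N(m2, v2_sq)."""
    tau2 = 1.0 - tau1
    mean = tau1 * m1 + tau2 * m2
    var = tau1 * (v1_sq + m1 ** 2) + tau2 * (v2_sq + m2 ** 2) - mean ** 2
    return mean, var

def t_limit(mu, sigma_sq, pi1=0.5):
    """Limit of t / sqrt(n) when the normal group is N(0, 1)."""
    pi0 = 1.0 - pi1
    return mu / math.sqrt(sigma_sq / pi1 + 1.0 / pi0)

def solve_m2(sigma1_sq, tau1):
    """m2 for which tau1*N(0,1) + tau2*N(m2,1) has the same t_limit as
    N(1, sigma1_sq), assuming pi1 = pi0 = 1/2.
    Requires (sigma1_sq + 1) * tau2 > tau1."""
    tau2 = 1.0 - tau1
    return math.sqrt(2.0 / (tau2 * ((sigma1_sq + 1.0) * tau2 - tau1)))
```

For example, with \(\sigma_{1}^{2}=4\) and \(\tau_{1}=0.75\), `solve_m2` returns 4, and the mixture then has mean 1 and variance 4, the same first two moments as the homo gene.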
The asymptotic confidence interval of the sign-sum statistic is not evaluated analytically because it has an integral form. In some situations, however, we can confirm numerically that the upper confidence limits of the sign-sum statistic differ between the homo and hetero genes. Figure 2 shows one such situation, in the same setting as simulation (I) in Simulation. Thus, the sign-sum statistic can distinguish homo and hetero genes, whereas the t-statistic cannot. Figure 2 shows that (i) for each fixed value of a (the sampling size from the disease group), a larger value of b (the sampling size from the normal group) is better at discriminating these scenarios, and (ii) a smaller value of a tends to be superior. This is because disease heterogeneity affects the difference between a sensitive estimator of the mean in the disease group and a stable estimator of the mean in the normal group. Although it would be ideal to obtain an optimal setting for the sampling sizes in general situations, based on these observations we fixed a at 1 and allowed b to be 1, 5, or 10. Below, the sign-sum statistic for given a and b is denoted by \(s_{a,b}\).
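For a = 1 and Gaussian components, the expectation of the sign-sum statistic (though not its full confidence interval) has a simple closed form, which helps to see why a larger b separates the two kinds of gene. The sketch below is a worked illustration under those assumptions. The function name is hypothetical, and the hetero parameters (an equal mixture of N(0,1) and N(√8,1)) come from one reading of the equal-t constraint for setting (I), not from values stated in the text.

```python
from statistics import NormalDist
import math

def expected_sign_sum_a1(components, b):
    """E[H(X1 - mean of b normal samples)] when a = 1.

    components : list of (weight, mean, variance) describing the
                 disease-group distribution as a Gaussian mixture.
    The normal group is assumed to be N(0, 1), so the mean of b normal
    samples is N(0, 1/b) and each component difference is Gaussian.
    """
    phi = NormalDist().cdf
    return sum(w * phi(m / math.sqrt(v + 1.0 / b)) for w, m, v in components)

homo = [(1.0, 1.0, 1.0)]                                   # N(1, 1)
hetero = [(0.5, 0.0, 1.0), (0.5, math.sqrt(8.0), 1.0)]     # assumed setting (I)

for b in (1, 5, 10):
    gap = expected_sign_sum_a1(homo, b) - expected_sign_sum_a1(hetero, b)
    print(f"b={b:2d}  homo - hetero = {gap:.4f}")
```

In this setting the homo gene has the larger expectation for every b, and the gap widens as b grows, which is consistent with observation (i) above.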
Simulation
We carried out simple simulation studies to evaluate the performance of the sign-sum statistic. With the number of genes set at 1000, we generated expression levels for 100 homo genes and 100 hetero genes; the remaining 800 were non-informative genes whose expression level distributions were equal in the disease and normal groups. Gene expression levels in the normal group were assumed to be drawn from a standard normal distribution N(0,1) without loss of generality. Homo gene expression levels in the disease group were drawn from a normal distribution \(N(1,\sigma_{1}^{2})\), and hetero gene expression levels in the disease group were drawn from a normal mixture distribution \(\tau_{1} N(0,1) + \tau_{2} N(m_{2},1)\), where \(\tau_{1}\) and \(\tau_{2}\) are positive values with \(\tau_{1}+\tau_{2}=1\). The mixture model implies that a proportion \(\tau_{1}\) of gene expression levels in the disease group cannot be discriminated from those in the normal group, as in real data.
We considered three situations in which the t-statistic confuses the homo and hetero genes under constraint (9), with different parameters: (I) \(\sigma_{1}^{2}=1, \tau_{1}=0.5\), (II) \(\sigma_{1}^{2}=4, \tau_{1}=0.75\), and (III) \(\sigma_{1}^{2}=1, \tau_{1}=0.25\), with sample sizes n=200 and n=1000 and equal \(n_{0}\) and \(n_{1}\). We compared the gene rankings among three statistics, namely the simple t-statistic with subsamples, the simple t-statistic without subsamples, and the sign-sum statistic, over 100 repetitions. The sampling sizes were fixed at a=1 and b=1 for the t-statistic, and at a=1 and b=1, 5, and 10 for the sign-sum statistic. Robustness with respect to heterogeneity was measured by the number of homo genes among the top 100 ranked genes. Although \(U_{j}^{T}\) and \(U_{j}^{S}\) are defined over all possible subsets, in this case we only need to evaluate enough combinations to achieve convergence of the top 100 rankings, as described in Additional file 2.
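The simulation design can be sketched as follows. This is an illustrative reconstruction, not the authors' R code: the function names are hypothetical, \(m_{2}\) is obtained from one reading of the equal-t constraint with \(\pi_{1}=\pi_{0}=1/2\), and the same random subsets are shared across genes to keep the loop short.

```python
import numpy as np

def simulate(n=200, sigma1_sq=1.0, tau1=0.5, seed=0):
    """Return (X1, X0): disease and normal samples for 1000 genes.
    Columns 0-99 are homo genes, 100-199 hetero genes, the rest non-informative."""
    rng = np.random.default_rng(seed)
    n1 = n0 = n // 2
    tau2 = 1.0 - tau1
    # Assumed solution of the equal-t constraint for m2 (pi1 = pi0 = 1/2).
    m2 = np.sqrt(2.0 / (tau2 * ((sigma1_sq + 1.0) * tau2 - tau1)))
    X0 = rng.normal(size=(n0, 1000))
    X1 = rng.normal(size=(n1, 1000))
    X1[:, :100] = rng.normal(1.0, np.sqrt(sigma1_sq), size=(n1, 100))
    shifted = rng.random((n1, 100)) < tau2          # component N(m2, 1)
    X1[:, 100:200] = rng.normal(size=(n1, 100)) + shifted * m2
    return X1, X0

def sign_sum_all(X1, X0, b=10, n_draws=2000, seed=1):
    """Approximate s_{1,b} for every gene from random subsets (a = 1)."""
    rng = np.random.default_rng(seed)
    score = np.zeros(X1.shape[1])
    for _ in range(n_draws):
        i = rng.integers(X1.shape[0])                        # one disease sample
        idx0 = rng.choice(X0.shape[0], size=b, replace=False)
        score += X1[i] >= X0[idx0].mean(axis=0)              # Heaviside step
    return score / n_draws

def homo_in_top(score, k=100):
    """Number of homo genes (columns 0-99) among the k highest scores."""
    top = np.argsort(-score)[:k]
    return int(np.sum(top < 100))
```

A typical call would be `X1, X0 = simulate(n=200, sigma1_sq=1.0, tau1=0.5)` followed by `homo_in_top(sign_sum_all(X1, X0, b=10))`; the count varies with the seed and the number of draws.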
Application
We compared the t-statistic with the sign-sum statistic using five real data sets [2, 14–17]. The data set in [2] (breast cancer data) contains gene expression profiles of primary breast tumors from 97 subjects, of whom 46 relapsed and 51 remained relapse-free for 5 years. We applied the same filtering used in [2], yielding a final full data set consisting of 97 samples and 5420 genes. The data set in [14] (cohort data) combines 454 gene expression samples from different diseases. We picked 32 samples from lung cancer tumors, 45 samples from pancreatic ductal adenocarcinoma tumors, and 70 samples from unaffected individuals, yielding a final full data set consisting of 147 samples and 863 genes. The data set in [15] (prostate cancer data) contains expression levels of 6144 genes for 455 prostate cancer tumors, of which 103 are fusion status-positive and 352 are fusion status-negative. The data set in [16] (breast cancer data2) contains expression levels of 17489 genes for 286 breast cancer tumors, of which 107 are from subjects who relapsed and 179 from subjects who were relapse-free within 5 years. The data set in [17] (leukemia data) contains expression levels of 7129 genes for 72 leukemia samples, of which 47 are from the acute lymphoid leukemia group and 25 from the acute myelogenous leukemia group.
For these data, reproducibility was measured as follows. First, we divided the original data randomly into two data sets while keeping the sample ratio of the disease and normal groups the same as in the full data set. After ranking the genes by the t-statistic and by the sign-sum statistic, we selected the top-ranked 100 genes in each data set and counted the genes that overlapped between the two selections. This procedure was repeated for 100 trials, and we compared the t-statistic with the sign-sum statistic based on the mean and standard deviation of the overlap counts. To account for the difference in the number of genes among the data sets, we also used ORRS (Overlap Ratio to Random Selection). For p genes, the ORRS for the top k ranking is defined as
$$ \mathrm{ORRS} = \frac{1}{T} \sum_{t=1}^{T} \frac{|S_{1t} \cap S_{2t}|}{N_{p,k}}, $$(10)
where \(N_{p,k}=\sum_{i=0}^{k} i \binom{k}{i} \binom{p-k}{k-i} / \binom{p}{k} = k^{2}/p\), and \(S_{1t}\) and \(S_{2t}\) are the sets of top k-ranked genes for the two divided data sets on the t-th trial; in this case, k=100 and T=100. \(N_{p,k}\) is the expected number of overlapping genes under random selection. A larger ORRS value means that the selection is more reproducible relative to random selection.
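The overlap measure can be sketched as below. It assumes that ORRS is the mean overlap of the two top-k lists divided by \(N_{p,k}\), which is one reading of the description above; the function names and the input format (a list of ranking pairs, best gene first) are hypothetical.

```python
from math import comb

def expected_random_overlap(p, k):
    """N_{p,k}: expected overlap of two independent random k-subsets of p genes,
    computed from the hypergeometric sum (equals k**2 / p)."""
    total = sum(i * comb(k, i) * comb(p - k, k - i) for i in range(k + 1))
    return total / comb(p, k)

def orrs(ranking_pairs, p, k):
    """Mean top-k overlap over trials, relative to random selection.

    ranking_pairs : list of (ranking1, ranking2), each a sequence of gene
                    identifiers sorted from best to worst.
    """
    n_pk = k * k / p
    overlaps = [len(set(r1[:k]) & set(r2[:k])) for r1, r2 in ranking_pairs]
    return sum(overlaps) / len(overlaps) / n_pk
```

For the breast cancer data (p = 5420, k = 100), \(N_{p,k}\) is about 1.85, so even a handful of shared genes corresponds to an overlap well above what random selection would give.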
Results and discussion
The performance of the sign-sum statistic
Table 1 shows the simulation results. We observe that the sign-sum statistic selected more homo genes, which are highly associated with the class labels, than the t-statistic did. Overall, \(s_{1,10}\), the sign-sum statistic with sampling size 1 from the disease group and 10 from the normal group, performed best in Situations (I) and (II). \(s_{1,1}\), \(s_{1,5}\), and \(s_{1,10}\) were competitive with one another and performed better than the t-statistic in Situation (III). These results confirm the stable behavior of the sign-sum statistic suggested by Fig. 2. Figure 3, which shows the ranking scores from one of the 100 trials, also illustrates the superior performance of the sign-sum statistic: homo and hetero genes were well discriminated by the sign-sum statistic but confused by the t-statistic. The ranking was better with a large sample size (n=1000) than with a small one (n=200). When the sample size was 1000, almost all homo genes were ranked higher than hetero genes. Table 2 shows the application results, which indicate that the sign-sum statistic performed better with respect to the reproducibility measures. Overall, \(s_{1,10}\) performed best, in agreement with the simulation study in Table 1.
Discussion
Gene rankings are often not reproducible among different studies [18]. Ensemble or resampling methods have been reported to be effective for obtaining a robust ranking [8, 9]. On the other hand, it has also been reported that resampling methods do not improve reproducibility [10]. In this paper, we evaluated a resampling method for robustness with respect to heterogeneity in a microarray study. We focused on the sign mismatch of t-scores in the context of a classification problem. We often found that genes with large t-scores in the training data had small or sign-reversed t-scores in the test data. The sign-sum statistic was developed based on these two motivations. Using numerical simulation, we showed that the sign-sum statistic improves robustness with respect to heterogeneity relative to the t-statistic. Furthermore, the sign-sum statistic allowed us to obtain a reproducible ranking in applications to real data. These conclusions are supported by an evaluation of the upper confidence limit (Theorem 1).
In the context of gene screening, the FDR (False Discovery Rate) has been studied with methods such as SAM [19] and the ranking procedure based on q-values [20] in order to decide the cutoff value for gene ranking. However, it is less meaningful to focus on the cutoff value until we have a correct and stable gene ranking. Therefore, in this study, we focused on obtaining a reproducible gene ranking. Obtaining the cutoff value of the sign-sum statistic is a goal for future work.
In this paper, we focused on robustness with respect to heterogeneity. However, we should still confirm that the resulting genes are informative. In fact, the cancer outlier methods, which focus on the hetero genes, provide high reproducibility. However, we consider that the top-ranked differentially expressed genes in any ranking should be effective for the subsequent prediction problem. Although further study is needed on this point, we performed a simple examination to ensure a certain degree of predictive power. Table 3 shows the predictive performance of DLDA (Diagonal Linear Discriminant Analysis) measured by AUC (Area Under the Curve). The AUC was calculated for all 100 trials used in Application, treating the two randomly divided data sets as training and test data. The scores were based on the top 10, 50, or 100 genes in each ranking. In particular, the table shows that the t-statistic and the sign-sum statistic have comparable predictive performance, even though the DLDA predictor is constructed from the t-value of each gene. Thus, the sign-sum statistic improves ranking reproducibility without loss of predictive performance of the resultant genes.
Gene ranking is essential in biological investigations. In this study, we were motivated by the desire to identify robust and predictive biomarkers. Hetero genes may be informative for some patients but uninformative for others. In this sense, hetero genes should be excluded from gene rankings if their predictive performance is equal to or less than that of homo genes.
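The predictive check can be sketched with a textbook form of DLDA and a pairwise AUC. This is a generic illustration under assumptions: the pooled-variance discriminant, the function names, and the array layout are choices made for the example and may differ from the exact predictor used for Table 3.

```python
import numpy as np

def dlda_fit(X, y, genes):
    """Fit a diagonal linear discriminant on the selected gene columns.
    X : (n, p) training data, y : 0/1 labels, genes : column indices."""
    X1, X0 = X[y == 1][:, genes], X[y == 0][:, genes]
    m1, m0 = X1.mean(axis=0), X0.mean(axis=0)
    pooled = ((len(X1) - 1) * X1.var(axis=0, ddof=1)
              + (len(X0) - 1) * X0.var(axis=0, ddof=1)) / (len(X1) + len(X0) - 2)
    weights = (m1 - m0) / pooled          # per-gene weight, no covariances
    midpoint = (m1 + m0) / 2.0
    return genes, weights, midpoint

def dlda_score(model, X):
    """Discriminant score; larger values favour the disease group."""
    genes, weights, midpoint = model
    return (X[:, genes] - midpoint) @ weights

def auc(scores, y):
    """AUC as the fraction of (disease, normal) pairs ranked correctly,
    counting ties as one half."""
    s1, s0 = scores[y == 1], scores[y == 0]
    diff = s1[:, None] - s0[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))
```

In the setting described above, `genes` would hold the indices of the top 10, 50, or 100 genes from a ranking computed on the training half, and `auc` would be evaluated on the test half.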
Conclusions
The t-statistic confuses homo and hetero genes, as shown in the simulation study. Such heterogeneity may also cause ranking irreproducibility in real data analysis. In fact, even the signs of the t-statistics of many genes are mismatched in the real data. We present the sign-sum statistic to obtain a robust ranking. The robustness of the sign-sum statistic with respect to heterogeneity is shown by an evaluation of the upper confidence limit. With the sign-sum statistic, we can obtain a more reproducible ranking for simulated data that assume heterogeneous factors, for the breast cancer data, where the disease is known to be heterogeneous, and for the data that include different disease statuses.
Availability of supporting data
The data sets supporting the results of this article are available at http://bioinformatics.nki.nl/data/vantVeer_Nature_2002/ and in GEO under accession number GSE31568.
Abbreviations
AUC, area under curve; DLDA, diagonal linear discriminant analysis
References
1. Di Camillo B, Sanavia T, Martini M, Jurman G, Sambo F, Barla A, Squillario M, Furlanello C, Toffolo G, Cobelli C. Effect of size and heterogeneity of samples on biomarker discovery: synthetic and real data assessment. PLoS ONE. 2012; 7:32200.
2. Van’t Veer L, Dai H, Van de Vijver M, He Y, Hart A, Mao M, Peterse H, Van Der Kooy K, Marton M, Witteveen A, Schreiber G, Kerkhoven R, Roberts C, Linsley P, Bernards R, Friend S. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002; 415:530–6.
3. Sorlie T, Perou C, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen M, Van De Rijn M, Jeffrey S, Thorsen T, Quist H, Matese J, Brown PO, Botstein D, Lonning P, Borresen-Dale AL. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A. 2001; 98:10869–74.
4. Tibshirani R, Hastie T. Outlier sums for differential gene expression analysis. Biostatistics. 2007; 8:2–8.
5. Wu B. Cancer outlier differential gene expression detection. Biostatistics. 2007; 8:566–75.
6. Draminski M, Rada-Iglesias A, Enroth S, Wadelius C, Koronacki J, Komorowski J. Monte Carlo feature selection for supervised classification. Bioinformatics. 2008; 24:110–7.
7. Zuber V, Strimmer K. Gene ranking and biomarker discovery under correlation. Bioinformatics. 2009; 25:2700–9.
8. Meinshausen N, Bühlmann P. Stability selection. J R Stat Soc Ser B: Stat Methodol. 2010; 72:417–73.
9. Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y. Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics. 2010; 26:392–8.
10. Haury AC, Gestraud P, Vert J. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS ONE. 2011; 6:28210.
11. Dabney A. Classification of microarrays to nearest centroids. Bioinformatics. 2005; 21:4148–54.
12. Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics. 2002; 18:546–54.
13. Pepe M. Selecting differentially expressed genes from microarray experiments. Biometrics. 2003; 59:133–42.
14. Keller A, Leidinger P, Bauer A, El-Sharawy A, Haas J, et al. Toward the blood-borne miRNome of human diseases. Nat Methods. 2011; 8:841–3.
15. Setlur S, Mertz K, Hoshida Y, Demichelis F, Lupien M, et al. Estrogen-dependent signaling in a molecularly distinct subclass of aggressive prostate cancer. J Natl Cancer Inst. 2008; 100:815–25.
16. Wang Y, Klijn J, Zhang Y, Sieuwerts A, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005; 365:671–9.
17. Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield C, Lander E. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999; 286:527–31.
18. Fan C, Oh D, Wessels L, Weigelt B, Nuyten D, Nobel A, et al. Concordance among gene-expression-based predictors for breast cancer. N Engl J Med. 2006; 355:560–9.
19. Tusher V, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001; 98:5116–21.
20. Storey J, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003; 100:9440–5.
Acknowledgements
We thank the reviewers of our manuscript for careful reading and for giving beneficial suggestions. This work was supported by JSPS KAKENHI Grant Number 25280008.
Authors’ contributions
KO, OK and SE designed the methods of this article. KO carried out the simulation study and data analysis, and wrote the paper. All authors have read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Additional files
Additional file 1
Proof of Theorem 1. We give the proof of Theorem 1 in this file. (PDF 146 kb)
Additional file 2
R code of the sign-sum statistic. We give an example of the R code of the sign-sum statistic. (PDF 34.9 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Omae, K., Komori, O. & Eguchi, S. Reproducible detection of disease-associated markers from gene expression data. BMC Med Genomics 9, 53 (2016) doi:10.1186/s12920-016-0214-5
Keywords
 Gene expression analysis
 Gene screening
 Heterogeneity
 Subsampling method
 Two-sample test
 U-statistic