SAGE library construction
26 normal and 11 lung cancer SAGE libraries were constructed and used in the analysis . The construction of the 26 normal libraries has been previously described [25, 26]. 24 of these libraries were generated from exfoliated bronchial cells obtained from bronchial brushes, and two libraries from normal lung parenchyma (Additional file 1). Conversely, the 11 cancer libraries were generated from biopsied specimens with six libraries representing lung squamous cell carcinoma and five libraries representing carcinoma in situ. This data can be found at the GEO database with the following series accession numbers: GSE3707, GSE5473, and GSE7898. All samples were acquired under approval by the University of British Columbia - British Columbia Cancer Agency Research Ethics Board (UBC-BCCA-REB) and all subjects provided written consent.
SAGE data from public domain
Publicly available SAGE data were also used in this analysis, representing both brain and breast cancer. Specifically, six normal and 12 breast cancer libraries (Additional file 2) and 7 normal and 19 brain cancer libraries were used (Additional file 3). The libraries were obtained from the cancer genome anatomy project (CGAP) database http://cgap.nci.nih.gov[27, 28].
Given that SAGE libraries are expensive to generate, the number of libraries in a given study is typically small (i.e., in 10's, rather than in 100's). The permutation test is a non-parametric test, which does not assume any underlying distribution. The number of samples required for the test to achieve sufficient statistical power is relatively low compared to other statistical tests (e.g., t-test and χ2
-square test). Furthermore, each additional sample increases the power of the test exponentially. The permutation test is a test of the means between two different distributions. Without loss of generality, let us assume that one distribution is for the gene expression level of a particular gene in normal tissues (i.e. subscript n
), and that the other distribution is for cancerous tissues (i.e. subscript c
). Genes are selected using the following hypotheses:
If there is little difference between the two means, it would make no difference if we mix the cancerous samples with the normal samples. But, if the null hypothesis is rejected, it indicates that the gene expression levels of normal and cancer samples are sufficiently different (the alternative hypothesis). In the following, we show our specific implementation of the test. Let n and c be the number of normal tissue samples and the number of cancerous tissue samples respectively.
A. For each gene, select all the gene-specific normalized tag counts from the normal libraries and all the gene-specific normalized tag counts from the cancer libraries.
B. Randomly select n counts to create a simulated normal set, and calculate the simulated normal mean μ
C. Similarly, select the remaining c counts form the simulated cancerous set. Calculate the simulated cancer mean μ
D. Consider the random variable v = μ
, called the simulated difference.
E. Repeat the steps A to D above m times. Let μ and σ denote the mean and the standard deviation of v.
F. Now separate the libraries back into their true identity: normal or cancerous. Calculate the true observed difference O = μ
, where μ
denotes the true mean count of the cancerous libraries, and μ
denotes the true mean of the normal libraries.
G. Calculate the Permutation Score PS where
H. Repeat all the above steps for each gene. Sort the permutation score in descending order.
The permutation score is one way to measure how likely the actual observed difference occurs by chance. It is based on standardization, i.e., subtracting the mean and then divided by the standard deviation. The more the true observed difference is from the average (expressed as multiples of the standard deviation), the less likely that the true observed difference is a coincidence. That is to say, the larger the permutation score, the more significant is the observed difference between cancerous and normal samples.
On the other hand, for the sake of evaluating the constancy requirement, the ideal reference gene would have a permutation score equal to 0. This means that there is no difference in the distributions of expression levels between cancerous and normal samples. For the results reported here, we used m = 10,000 permutations.
Raw tag counts for each SAGE library were normalized to tags per million (TPM) to facilitate adequate comparison among libraries. Tag-to-gene mapping was performed using the February 5th, 2007 version of SAGEGenie . In cases where multiple SAGE tags mapped to the same gene, the tags were collapsed to capture all potential transcript variants, and a cumulative tag count was utilized for analysis.
Statistical criteria for reference gene selection
The permutation test outlined above was used to identify genes which were statistically similar when comparing the libraries from normal tissue (bronchial epithelium and lung parenchyma) and cancerous tissue of the lung. Three main criteria were used for reference gene selection: permutation score (described above) ≤ 0.15; at least two SAGE tags observed in each library; and an overall average count of ≥ 25 across all samples. For the analysis in brain and breast tissue, the first two criteria were maintained, but due to the lower sequencing depth, an average count of ≥ 10 across all samples was used instead.
Quantitative RT-PCR validation in clinical lung cancer specimens
One microgram of total RNA from 15 lung tumor and matched non-malignant parenchyma samples were converted to cDNA using the High-Capacity cDNA archive kit (Applied Biosystems Inc., Foster City CA). One hundred nanograms of cDNA were utilized for qPCR using the TaqMan Gene Expression Assay (Applied Biosystems Inc). All fifteen lung NEPS genes and six additional reference genes were assayed. All TaqMan probes were pre-optimized by Applied Biosystems. Primer IDs for all genes are provided in Additional file 4. The 30 samples were assayed in triplicate in parallel along with negative (no cDNA template) controls using the 7500 Fast Real-Time PCR System. Appropriate cDNA dilutions were used such that the exponential phase of the amplification curves were within the 40 PCR cycles recommended by the manufacturer (i.e. ranging from 16-36 cycles for the 20 genes and 1-13 cycles for 18SRNA). Cycle thresholds were determined from amplification curves using 7500 Fast System software.
For the analysis of qPCR data, three different methods were used. Within each method, all genes were ranked from best to worst. Subsequently, for each gene, a cumulative ranking across all three methods was determined by summing its rank from each individual method. Two previously published methods, geNorm  and NormFinder , and the variance of cycle threshold difference (dCt) across all 15 tumor/matched non-malignant sample pairs were the approaches used to determine constancy.
Analysis of publicly available microarray datasets
Lung NEPS genes were used to re-normalize two publicly available microarray datasets. Microarray data were obtained from GEO at NCBI under accession numbers GSE10072  and GSE12428 .
For the Affymetrix data (GSE10072), Raw CEL files were processed through Affymetrix's Microarray Array Suite (MAS) 5.0 algorithm in the "affy" package in Bioconductor [32, 33]. Briefly, MAS 5.0 is a three step process which involves a global background signal correction, correction of the probe value for cross-hybridization and spurious signals using mismatch probes which are off by one base, and finally, scale normalization of each experiment to a fixed median intensity to facilitate inter-experimental comparison http://media.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf. Probes were filtered on MAS 5.0 calls, and those having a "P" or "M" call in at least 50% of samples were retained. This resulted in a dataset of 11440 probes. Of the 15 lung NEPS reference genes, 12 were represented on the array platform. Of those 12 genes, probes which had a "P" call in 100% of the samples were used for the calculation of the scaling factor with only one probe/gene allowed. If two probes met these criteria for one gene, the probe with the highest mean expression was chosen. After employing these criteria, eight probes were used (Additional file 5), which represented genes PPP1CB, B2M, RPL4, CAPZB, ATP5J, RAB5C, NDUFA1, and HSPA1A.
For the Agilent microarray data (GSE12428), all lung NEPS genes were represented on this microarray platform. Data was processed as described previously . In the cases where lung NEPS genes were represented with multiple probes, the probe with the maximum average intensity across the dataset was used. A list of the probes used is given in Additional file 6. Since each sample had at least two replicate experiments, the average across replicate experiments was used for each probe.
To determine the scaling factor, for each sample, linear regression analysis was performed comparing the values for the reference gene (x) versus the average values for the reference genes across the sample set (y). The slope of the line based on least-squares fitting was then multiplied to each value in the experiment.
Next, Significance Analysis of Microarrays (SAM) was performed to determine differentially expressed genes between non-malignant and malignant samples for both microarray datasets using the "samr" package in R . Unpaired analysis was performed using the normal samples versus tumor samples and the delta parameter set to 0.4. Probes which had a Q-value% ≤ 5 were considered significant. For the Affymetrix dataset, results were compared between the dataset normalized with MAS 5.0 alone and MAS 5.0 + NEPS scaling and for the Agilent dataset, the comparison was done between median normalization alone and NEPS scaling followed by median normalization.