Training set patient population
Patients were enrolled in the AEGIS trials (Airway Epithelium Gene Expression In the DiagnosiS of Lung Cancer), designed as prospective, observational cohort studies (registered as NCT01309087 and NCT00746759) of current and former cigarette smokers with a suspicion of lung cancer undergoing bronchoscopy as part of their diagnostic workup. A set of patients from one of the cohorts (“AEGIS 1”) was selected for the exclusive purpose of training a gene expression classifier. All enrolled patients were followed post-bronchoscopy until a final diagnosis was made, or for 12 months. Patients were diagnosed as having primary lung cancer based on cytopathology obtained at bronchoscopy or, when bronchoscopy did not lead to a diagnosis of lung cancer, upon subsequent lung biopsy such as transthoracic needle biopsy (TTNB) or surgical lung biopsy (SLB). Patients were diagnosed as having benign disease based on a review of medical records and follow-up procedures at 12 months post-bronchoscopy (described in more detail in Additional file 1). Bronchoscopy was considered “diagnostic” when clinical samples collected at the time of the bronchoscopy procedure yielded a confirmed lung cancer diagnosis via cytology or pathology. The study was approved by the institutional review board (IRB) at each of the participating medical centers (the ethics committees and the study protocol numbers for each of the centers are listed separately; Additional file 2), and all patients signed informed consent prior to enrollment.
Physicians at each of 25 participating medical centers (see Additional file 3) were instructed to collect normal-appearing bronchial epithelial cells (BEC) from the right mainstem bronchus (or the left side if any abnormalities were observed on the right) during bronchoscopy using standard bronchoscopic cytology brushes. Immediately after collection, the cytology brushes were cut, placed in an RNA preservative (Qiagen RNAProtect, Cat. 76526), and stored at 4°C. Specimens were then shipped at 4–20°C to a central laboratory for further processing.
BECs were separated from cytology brushes using a vortex mixer and were then pelleted and processed using QIAzol lysis reagent (Qiagen). RNA was isolated by phenol/chloroform extraction and purified on a silica membrane spin-column (Qiagen miRNeasy kit, Cat. #217004) according to the manufacturer’s recommendations. RNA was analyzed on a NanoDrop ND-1000 spectrophotometer (Thermo Scientific) to determine concentration and purity, and RNA integrity (RIN) was measured on a 2100 Bioanalyzer (Agilent Technologies). Each sample was then stored at −80°C until further processing on microarrays.
Total RNA (200 ng) was converted to sense strand cDNA, amplified using the Ambion WT Expression kit (Life Technologies Cat. #4440536), and labeled with the Affymetrix GeneChip WT terminal labeling kit (Affymetrix Cat. #900671), as described in more detail in Additional file 1. The labeled cDNA was hybridized to Gene 1.0 ST microarrays (Affymetrix Cat. #901085) and analyzed on an Affymetrix GeneChip Scanner. Individual CEL files for each of the patient samples were normalized using the standard Affymetrix Gene 1.0 ST CDF and RMA.
A gene expression classifier was derived in a multi-step process. Initial modeling used the training data to select genes that were associated with three clinical covariates (gender, tobacco use [current/former smoking status], and smoking history [pack-years]) in order to identify gene expression correlates of these clinical variables. Lung cancer-associated genes were then selected, and finally a classifier was derived for predicting the likelihood of lung cancer based on the combination of the cancer genes, the gene expression correlates, and patient age. All aspects of this classifier development procedure were determined using cross-validation and using only data from the training set samples.
Clinical Factor Gene Expression Correlates (CFGC)
Covariates of lung cancer in this study population, including sex (male/female), smoking status (current/former), and pack-years (<10/>10), were modeled to identify gene expression correlates for the clinical factors. Empirical Bayes t-tests were used to identify genes whose expression was significantly associated with each of the clinical factors. Next, the significant genes were used to build three models, one for predicting each clinical factor, using penalized logistic regression (LASSO). Finally, the predicted values from the gene expression models for gender (GG), smoking status (GS), and pack-years (GPY) were computed, yielding genomic sex, genomic smoking status, and genomic pack-year measures for each patient. These three genomic measures were used as new covariates, both to help select genes with lung cancer-associated expression and as predictors in the lung cancer classifier (described below).
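The CFGC step can be illustrated with a minimal sketch, which fits an L1-penalized (LASSO) logistic regression to one binary clinical factor and returns the predicted probability as a "genomic" covariate. This is not the authors' implementation: the proximal-gradient solver, the function names (`fit_lasso_logistic`, `genomic_score`), the penalty value, and the toy expression values are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(value, amount):
    # Proximal operator of the L1 penalty: shrink toward zero.
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_lasso_logistic(X, y, lam=0.05, step=0.5, n_iter=500):
    """Fit logistic regression with an L1 penalty on the gene coefficients.

    X: list of samples, each a list of gene expression values.
    y: list of 0/1 labels for one clinical factor.
    The intercept is left unpenalized (an assumption of this sketch).
    """
    n, p = len(X), len(X[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(n_iter):
        preds = [sigmoid(intercept + sum(c * v for c, v in zip(coefs, row)))
                 for row in X]
        errors = [ph - yi for ph, yi in zip(preds, y)]
        intercept -= step * sum(errors) / n
        for j in range(p):
            grad_j = sum(e * row[j] for e, row in zip(errors, X)) / n
            coefs[j] = soft_threshold(coefs[j] - step * grad_j, step * lam)
    return intercept, coefs

def genomic_score(row, intercept, coefs):
    # Predicted probability, used as the genomic correlate (e.g. GG, GS, GPY).
    return sigmoid(intercept + sum(c * v for c, v in zip(coefs, row)))

# Hypothetical centered expression values for two genes in eight patients;
# gene 0 tracks the clinical factor, gene 1 does not.
toy_expr = [[1.0, 0.1], [1.2, -0.1], [0.8, 0.1], [0.9, -0.1],
            [-1.0, 0.1], [-1.1, -0.1], [-0.9, 0.1], [-1.0, -0.1]]
toy_factor = [1, 1, 1, 1, 0, 0, 0, 0]
toy_intercept, toy_coefs = fit_lasso_logistic(toy_expr, toy_factor)
```

In this toy example the L1 penalty keeps the coefficient of the uninformative gene at exactly zero, which is the property that makes LASSO useful for building a compact correlate from a larger list of significant genes.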
Selection of lung cancer genes
A logistic regression model with lung cancer status (1 = cancer-positive and 0 = cancer-negative) as the dependent variable was fit to the training data using the CFGCs and patient age as predictors. This model served as the “baseline” for subsequent gene expression analysis.
Next, an empirical Bayes linear model was fit using gene expression values as the independent variable and the logistic regression baseline model residuals as the dependent variable. The residuals from this baseline model are a measure of patient cancer status that could not be predicted on the basis of clinical factors or their genomic correlates alone. That is, the empirical Bayes linear model was used to select genes with predictive potential for lung cancer that is independent of, or additive to, that represented by clinical covariates. We note that a gene associated with both clinical factors and cancer could still be selected if the cancer association retained significance in this model. The top lung cancer-associated genes from this analysis were grouped using hierarchical clustering. To reduce the number of genes, for each cluster we selected a small number (2–4) of genes whose average was highly correlated with the average of all genes in the cluster. Subsequent modeling used these “reduced” cluster mean expression values rather than individual gene expression values. Cross-validation was used to select which cluster means were independently and significantly associated with lung cancer in the context of the other clusters. Overall, this served to select clusters that cumulatively provided the best classifier performance, and specific genes that best represented each of these clusters in a parsimonious manner. Functional analysis of genes within each of the cancer clusters was performed using DAVID to identify biological terms describing the cancer-associated genes in the classifier.
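The cluster-reduction step can be sketched as follows. The text does not state how the 2–4 representative genes were chosen, so the exhaustive search over subsets shown here is an assumption, as are the function names (`pearson`, `representative_genes`) and the toy expression values.

```python
import math
from itertools import combinations
from statistics import mean

def pearson(x, y):
    # Pearson correlation; assumes neither vector is constant.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def representative_genes(cluster, min_size=2, max_size=4):
    """Pick the subset of min_size..max_size genes whose mean expression is
    most correlated with the mean of all genes in the cluster.

    cluster: dict mapping gene name -> list of expression values per sample.
    Returns (tuple of gene names, correlation with the full-cluster mean).
    """
    genes = sorted(cluster)
    n_samples = len(cluster[genes[0]])
    cluster_mean = [mean(cluster[g][i] for g in genes) for i in range(n_samples)]
    best_r, best_subset = None, None
    for k in range(min_size, min(max_size, len(genes)) + 1):
        for subset in combinations(genes, k):
            subset_mean = [mean(cluster[g][i] for g in subset)
                           for i in range(n_samples)]
            r = pearson(subset_mean, cluster_mean)
            if best_r is None or r > best_r:
                best_r, best_subset = r, subset
    return best_subset, best_r

# Hypothetical cluster of five co-expressed genes measured in five samples.
toy_cluster = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G2": [1.2, 1.9, 3.1, 4.2, 4.8],
    "G3": [0.8, 2.2, 2.9, 3.9, 5.3],
    "G4": [1.1, 2.1, 3.3, 3.8, 5.0],
    "G5": [0.9, 1.8, 2.7, 4.1, 4.9],
}
toy_subset, toy_r = representative_genes(toy_cluster)
```

The mean of the returned subset would then stand in for the full cluster mean in later modeling. An exhaustive search is only practical for small clusters; a greedy selection would be a reasonable alternative for larger ones.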
Lung cancer classifier
A lung cancer classifier was developed using lung cancer status as the outcome variable and a) the cancer-associated gene expression cluster means, b) patient age, c) genomic gender (GG), d) genomic smoking status (GS), and e) genomic pack-years (GPY) as predictors. The model was fit by penalized logistic regression; the penalization factor (lambda) was 0 for patient age and the clinical factor gene expression correlates and 10 for each of the gene expression cluster means. The resulting model score is on a 0 to 1 scale. A score threshold for predicting lung cancer status was established to achieve a sensitivity of approximately 90% for patients with a non-diagnostic bronchoscopy. To evaluate the benefit of the gene expression classifier over clinical factors alone, a “clinical model” was generated that included age, gender, smoking status, and pack-years (all determined clinically) in a logistic regression model to predict lung cancer status. The difference in performance between the complete gene expression classifier and the clinical factors classifier was assessed by comparing the AUCs of the two models in the training set.
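Setting a score threshold for a target sensitivity can be sketched as below. The decision rule (a score at or above the threshold is called positive), the choice of the highest threshold that still meets the target, and the names `threshold_for_sensitivity` and `sensitivity_at` are assumptions of this illustration; the scores are invented.

```python
import math

def threshold_for_sensitivity(scores, has_cancer, target=0.90):
    """Return the highest threshold at which at least `target` of the
    cancer-positive patients have score >= threshold."""
    cancer_scores = sorted(
        (s for s, c in zip(scores, has_cancer) if c), reverse=True
    )
    if not cancer_scores:
        raise ValueError("need at least one cancer-positive patient")
    # Number of cancer-positive patients that must be called positive;
    # the small epsilon guards against floating-point round-up.
    needed = max(1, math.ceil(target * len(cancer_scores) - 1e-9))
    return cancer_scores[needed - 1]

def sensitivity_at(scores, has_cancer, threshold):
    # Fraction of cancer-positive patients scoring at or above the threshold.
    cancer_scores = [s for s, c in zip(scores, has_cancer) if c]
    return sum(s >= threshold for s in cancer_scores) / len(cancer_scores)

# Hypothetical classifier scores: ten cancer-positive, three cancer-negative.
toy_scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.20,
              0.50, 0.30, 0.10]
toy_cancer = [1] * 10 + [0] * 3
toy_threshold = threshold_for_sensitivity(toy_scores, toy_cancer, target=0.90)
```

In the setting described above, only patients with a non-diagnostic bronchoscopy would be passed to such a function, since that is the population in which the threshold was set.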
Analysis of an independent test set
Data from a prior study were used as an independent test set to assess the performance of the locked classifier derived in this study. In that study, BECs were collected at bronchoscopy from patients undergoing bronchoscopy for suspicion of lung cancer, and RNA was analyzed on microarrays (Affymetrix HG-U133A). CEL files from that study (n = 163) were re-normalized to produce gene-level expression values using Robust Multiarray Average (RMA) in the Bioconductor R package (version 1.28.1). This processing used the Entrez Gene-specific probeset chip definition file (CDF) in place of the standard U133A CDF provided by Affymetrix in order to facilitate cross-platform analyses. Analyses were performed using the R environment for statistical computing (version 2.9.2).
The classifier was applied to patients in the test set with two modifications to account for the difference in microarray platforms. First, the HG-U133A RMA expression values were adjusted by adding a gene-wise constant, defined separately for each gene as the difference between the mean of the training set samples and the mean of the test set samples (training mean minus test mean). This procedure shifted the mean of each gene’s expression levels in the test set to the mean observed in the training set. Second, for the classifier genes for which a corresponding HG-U133A probeset was not available (LYPD2 and RNF150), the gene’s mean expression value in the training set was used for all of the test set samples.
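The two cross-platform adjustments can be sketched in a few lines. The data layout (dictionaries keyed by gene name), the function name `shift_to_training_means`, and the numeric values are assumptions for illustration; only the gene symbol LYPD2 comes from the text.

```python
from statistics import mean

def shift_to_training_means(test_expr, train_means):
    """Align test-set expression with training-set gene means.

    test_expr: dict gene -> list of test-set values (one per sample);
               genes without a probeset on the test platform are absent.
    train_means: dict gene -> mean expression in the training set,
                 covering every gene used by the classifier.
    """
    n_samples = len(next(iter(test_expr.values())))
    adjusted = {}
    for gene, train_mean in train_means.items():
        values = test_expr.get(gene)
        if values is None:
            # No probeset on the test platform: use the training mean
            # for every test sample.
            adjusted[gene] = [train_mean] * n_samples
        else:
            # Gene-wise constant: training mean minus test mean.
            offset = train_mean - mean(values)
            adjusted[gene] = [v + offset for v in values]
    return adjusted

# Hypothetical values: GENE_A is measured on both platforms, LYPD2 only
# in the training set.
toy_test = {"GENE_A": [5.0, 7.0, 9.0]}
toy_train_means = {"GENE_A": 8.0, "LYPD2": 6.5}
toy_adjusted = shift_to_training_means(toy_test, toy_train_means)
```

After the shift, each measured gene has the same mean in the test set as in the training set, while sample-to-sample differences within the test set are preserved. A gene filled in with the training mean is constant across test samples and therefore contributes nothing to discrimination between them.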
Classifier accuracy was assessed using standard measures of prediction accuracy: the area under the curve (AUC), sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV). Cross-validation, using a 10% sample hold-out set, was used in the training set to estimate the performance of the prediction classifiers generated using these approaches. These performance estimates were used to guide the development of the classifier discovery procedure. A final model was set prior to performing a one-time analysis of the test set. Fisher’s exact test was used to calculate statistical significance of all categorical variables (i.e., sex, smoking status, race, mass size, and mass location) and a t-test was used for continuous variables (i.e., age and smoking history).
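For reference, the accuracy measures named above can be computed as in the following sketch. The function names, the decision rule (score at or above the threshold is positive), and the toy labels and scores are assumptions; the AUC is computed as the probability that a randomly chosen cancer-positive patient scores higher than a randomly chosen cancer-negative patient, with ties counted as one half.

```python
def accuracy_measures(y_true, scores, threshold):
    """Sensitivity, specificity, PPV and NPV at a given score threshold.

    Assumes both classes are present and that at least one patient is
    called positive and at least one negative.
    """
    calls = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, c in zip(y_true, calls) if t == 1 and c == 1)
    fn = sum(1 for t, c in zip(y_true, calls) if t == 1 and c == 0)
    fp = sum(1 for t, c in zip(y_true, calls) if t == 0 and c == 1)
    tn = sum(1 for t, c in zip(y_true, calls) if t == 0 and c == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def auc(y_true, scores):
    # Rank-based AUC (equivalent to the Mann-Whitney U statistic, scaled).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = cancer) and classifier scores for six patients.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_model_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_measures = accuracy_measures(toy_labels, toy_model_scores, threshold=0.5)
toy_auc = auc(toy_labels, toy_model_scores)
```

Sensitivity and specificity depend only on the threshold, whereas PPV and NPV also depend on the prevalence of cancer in the population being tested, so values estimated in one cohort do not transfer directly to a cohort with a different cancer rate.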