Tobacco use induces anti-apoptotic, proliferative patterns of gene expression in circulating leukocytes of Caucasian males

Background Strong epidemiologic evidence correlates tobacco use with a variety of serious adverse health effects, but the biological mechanisms that produce these effects remain elusive. Results We analyzed gene transcription data to identify expression spectra related to tobacco use in circulating leukocytes of 67 Caucasian male subjects. Levels of cotinine, a nicotine metabolite, were used as a surrogate marker for tobacco exposure. Significance Analysis of Microarray and Gene Set Analysis identified 109 genes in 16 gene sets whose transcription levels were differentially regulated by nicotine exposure. We subsequently analyzed this gene set by hyperclustering, a technique that allows the data to be clustered by both expression ratio and gene annotation (e.g. Gene Ontologies). Conclusion Our results demonstrate that tobacco use affects transcription of groups of genes that are involved in proliferation and apoptosis in circulating leukocytes. These transcriptional effects include a repertoire of transcriptional changes likely to increase the incidence of neoplasia through an altered expression of genes associated with transcription and signaling, interferon responses and repression of apoptotic pathways.


Background
Gene expression profiling has become a powerful approach to the study of molecular pathophysiology and is a potentially useful diagnostic tool in multiple fields [1]. Oncologists have applied gene expression profiling to predict breast cancer aggressiveness [2], and microarraydriven approaches have been used to analyze cardiovascular diseases such as hypertension, heart failure, cardiac rejection, and atherosclerosis [3][4][5]. Ideally, gene expression profiling is performed on the specific cell type and tissue of interest, i.e. the tumor, myocardium, or atheroma. However, sampling target tissues from humans is often problematic, and data derived from tissues not routinely available to clinicians limits the diagnostic utility of this approach.
For the study of biological processes that involve an inflammatory response, gene expression profiles can be obtained from circulating leukocytes [6]. Due to the ease of sampling, gene expression profiling of circulating leukocytes has been applied to the study of cancer [7], atherosclerosis [8,9], and systemic lupus erythematosus [10]. These studies demonstrate the utility of transcriptional analysis of peripheral blood in the study of disease states having a systemic, inflammatory component. Tobacco use, whether by smoking or chewing, is associated with the development of many diseases. People who smoke more than 20 cigarettes per day have a 3-6 fold increased incidence of myocardial infarction [11] and increased overall rates of cardiovascular mortality compared to those who have never smoked [12]. The risk of developing lung cancer is 20-fold increased in cigarette smokers [8], and smokers are at increased risk of developing chronic obstructive pulmonary disease, multiple cancers (e.g. esophageal, bladder, and leukemia), pneumonia, osteoporosis, and periodontal disease [13]. Despite these major adverse health effects, more than 20% of American adults identify themselves as active smokers [14].
The mechanistic link between tobacco smoking and related diseases remain incompletely understood. To date, there have been numerous reports analyzing the effect that exposure to cigarette smoke has on the gene expression profiles of various cell types [15][16][17][18][19][20][21][22]. However, despite this detailed analysis, very little consensus amongst findings has been reported, even when the same cell type has been studied [16]. This lack of significant overlap in conclusions may be the result of the considerable heterogeneity in methodology as well as the relatively small (on average 5-10 test subjects) sample populations in each study. Furthermore, many of these reports rely on the in vitro exposure of cells to cigarette smoke condensate, raising the obvious issue of physiological relevance amongst these various studies.
Here we report a novel method for analyzing the in vivo effects of tobacco use on gene expression in circulating leukocytes. The purpose of this study is not to identify biomarkers associated with tobacco use; rather, our approach is aimed at identifying changes in genes and gene sets that result from tobacco use and applying this information to identify potential cellular pathways associated with tobacco-dependent pathology. Our results indi-cate that tobacco use affects pathways that control cell death, response to stress, macromolecular metabolism and the inflammatory cascade, providing new insights into the systemic effects of smoking that may underlie tobacco-related diseases.

Subject Population
Subjects between the ages of 18 and 50 years (inclusive) referred to UNC Hospitals were considered for enrollment in this University of North Carolina Institutional Review Board-approved study (IRB 04-MED-471). Exclusion criteria included current cancer treatment, pregnancy, lymphoma, leukemia, chronic immunosuppressive therapy, infection with HIV or HCV, history of solid organ transplant, and anemia (i.e. conditions which might alter peripheral blood counts or patterns of gene expression). After obtaining informed consent for a one-time blood donation, subjects were interviewed for pertinent medical information, including a detailed history of tobacco use, family history of heart disease and diabetes. Blood cell counts including a white blood cell differential analysis was performed to ensure consistency in cell subtype number between study populations.

Blood Withdrawal and Processing
Blood (30 ml) was drawn early in the day from subjects fasted for at least 8 hours to minimize the signals associated with nutritional and diurnal cycles from the microarray data. Processing was begun within 15 minutes of the time of blood draw. Eight ml were collected into a tube containing EDTA and proteinase inhibitors (Becton, Dickinson and Co., Cockeysville, MD) to provide a sample of plasma for cotinine assays. The balance of blood was collected into Na-EDTA Vacutainer tubes (Becton, Dickinson and Co., Cockeysville, MD). Whole blood was treated with 10 volumes of carbonate-buffered 150 mM NH 4 Cl to lyse red blood cells. The remaining leukocytes were washed and concentrated by centrifugation [23,24]. RNA and DNA were recovered from leukocytes using a modified one-step acid guanidinium isothiocyanate-phenolchloroform extraction (RNA-STAT60, Tel-Test, TX). RNA was subsequently post-purified using the RNeasy Mini-kit (Qiagen, Valencia, CA). RNA quantity, purity, and integrity were assessed by spectrophotometry and microcapillary electrophoresis on an Agilent BioAnalyzer 2100. Complete processing of samples occurred within 2 hours, exceeding the standards set by the Consortium for Expression Profiles in Sepsis [25]. Plasma cotinine levels were determined by competitive ELISA using the Serum Cotinine Assay Kit (BioQuant; San Diego, CA) essentially as described by the manufacturer.

Statistical Methods
Microarray data were normalized via the loess local intensity normalization [7,29], and probes were filtered for features having a normalized intensity of < 30 aFU in either channel. Probes were removed if < 70% of the data were present across all samples. Missing data points were imputed using the k nearest-neighbors algorithm (k = 10). 18,375 probes passed these filters, and were subsequently used for analysis. Scripts written in the R Statistical Language and Environment ("R"; Version 2.2.1, build r36812, release date 2005-12-20.) and Perl (ActiveState Perl 5.8.1, build 807, release date 2003-11-6) were used to standardize (μ = 0, σ = 1) each sample in the data set.

Hyperclustering
A median-centered gene list was used for cluster analysis to identify relationships between subject samples (arrays). The clustering file was then used as the basis for a new preclustering file to incorporate gene annotation data. Genes were assigned to GO and TRANSFAC categories using the GATHER web interface [42]. Categories showing statistical enrichment (p value < 0.01) were identified, and each gene in the pre-clustering file was annotated as to its membership in the appropriate category. The TRANSFAC predictions of transcription factor binding sites were designated in the pre-clustering file by the value representing the median-centered mean fold change expressed as the Log 2 of the ratio of each sample to the reference for each gene. This method of indicating membership was chosen to reflect a relationship between expression level (as measured by microarray) and presence or absence of transcription factor binding sites. Gene membership in GO categories was indicated by a binary value of either 1.00 or 0.00 as a placeholder for the expression ratio. Blue color was added after the fact to heat maps indicating Gene Ontology membership to avoid confusion with expression values. The annotated pre-clustering file was then clustered on only the Y axis (genes) to preserve relationships among arrays. This technique, which we have designated "Hyperclustering," allows both the gene expression data and various other forms of annotation to be represented as a single heat map, effectively illustrating functional relationships among genes.

Subject Selection for Gene Expression Analysis
Initial analysis of the gene transcription data from a cohort of 171 individuals revealed strong signals related to the race and gender of the subject (unpublished observations). Similar signals have been described in other microarray experiments. These signals can hinder attempts to identify signals related to the biological effect being studied [48]. For this reason, we selected the largest cohort in our dataset (Caucasian males) to maximize the statistical power of our analysis. We adopted a case-control approach to our study design and data analysis. Selected subject demographics are presented in Table 1.

Tobacco Use Determination
Self-reported tobacco use history is notoriously inaccurate [49][50][51]. For purposes of this study, we defined tobacco use status by the subject's plasma cotinine concentration. Cotinine, the principle metabolite of nicotine, is a reliable surrogate marker of tobacco use [52,53]. It has a plasma half-life of approximately 24 hours (as opposed to nicotine's in vivo half-life of 30 minutes) and tends to reach steady state levels that vary by only 15%-20% in people with regular smoking habits [52]. As seen in Figure 1, the distribution of plasma cotinine is similar in both the Caucasian male subpopulation under study and a larger cohort of 171 subjects, with strong bimodal peaks near 0 ng/mL and 150 ng/mL. Cutoffs of plasma cotinine for the definition of active tobacco users and non-users were set at > 100 ng/mL and < 50 ng/mL, respectively, based on previously reported values [52,53].
Using these criteria, 24 subjects were classified as tobacco users and 38 as non-tobacco users, with 5 subjects having cotinine levels that fell between 50 and 100 ng/mL. These 5 intermediate subjects were removed from further consideration. Comparing each subject's plasma cotinine values with their self-reported tobacco use status revealed overall consistent results (i.e. a high cotinine value for subjects who self-reported that they were active tobacco users). Nevertheless, there were notable exceptions. Seven subjects reported that they were non-tobacco users, yet had plasma cotinine levels > 100 ng/mL. Errors in this dimension could be explained by subject misrepresentation or failure of the subjects to disclose nicotine replacement therapy as part of a smoking cessation plan (use of nicotine patches or gum). Interestingly, 3 subjects identified themselves as active smokers, yet had very low plasma cotinine levels. Rapid metabolism of nicotine, smoking of a small number of cigarettes daily, or the use of extremely low-nicotine smoking products could all account for this discrepancy. This discrepancy in self-reported tobacco use and plasma cotinine levels did not appreciably alter the results of our studies (data not shown). All subjects were categorized based only on plasma cotinine levels only. The 2 subject groups will henceforth be referred to as "high cotinine" (i.e. tobacco users) and "low cotinine" (i.e. non-tobacco users). Using this criterion, those subjects reporting to be "smokers" but who had low plasma cotinine levels were included in the low cotinine group while subjects with high cotinine levels who denied smoking were included in the high cotinine group. To ensure that patient co-morbidities did not influence the gene expression profile, we performed principal components analysis (PCA) on the expression values of genes identified in this paper using the combined significant gene list and visualized in the context of COPD, diabetes, CAD class, and smoking status (Additional File 1). As expected, the top component of variation appears to be associated only with smoking status.

Transcriptional Signals of Tobacco Use
The subjects were stratified based upon the results of the cotinine assay, and differential gene expression was determined by SAM. We identified 38 genes as being differentially expressed (8 genes up-regulated, 30 genes downregulated in the high-cotinine group) at an 11.7% FDR ( Table 2). Notable among this list were genes involved in apoptosis, cell cycle regulation, and oncogenesis.
Visual inspection of the SAM-identified genes revealed that a number of differentially expressed genes are involved in the cell cycle control Gene Ontologies. CTCF was down regulated in the high cotinine group. Mutations in this gene have been associated with a variety of cancers [54]. Furthermore, CTCF plays an important role in the regulation and differentiation of human myeloid leukemia cells, adding another possible underlying mechanism of leukemiagenesis in tobacco users [55]. Conversely, we found that SASH1 (which is implicated in tumorogenesis of colorectal and breast cancer) was up regulated [56]. Interestingly, CX3CR1 was significantly down regulated in the high cotinine group. As CX3CR1 is up-regulated in atherosclerotic lesions [57], we expected it to be up-regulated in circulating leukocytes of tobacco users due to the increased incidence and severity of CAD in this population (reviewed by Njolstad [11]). However, Barlic, et al., showed that macrophage up-regulation of CX3CR1 leads to retention of those cells in vessel walls [57]. As the kinetics of the up-regulation of this gene are ill-defined, and it is not yet clear whether circulating monocytes differentially express CX3CR1 prior to tissue macrophage transformation, considerably more study will be necessary to elucidate what role it may play in the pathogenesis of smoking-related atherosclerotic disease.
Further analysis identified genes involved in apoptotic pathways. The pro-apoptotic genes C1D, MTCBP-1, CTCF, IKIP, MAF, and YWHAQ were all significantly down regulated in the high cotinine group. C1D (also known as SUNCOR) is representative of this group. C1D is a multifunctional nuclear protein with DNA-binding properties. When C1D is experimentally over-expressed it activates DNA-PK, inducing apoptosis [58]. On the other hand, the c-terminal modulator protein (CTMP, also known as THEM4) was significantly over-expressed in the high cotinine population. CTMP protein stimulates the phosphorylation of AKT/PKB, increasing glucose uptake and blocking apoptosis [22]. The relative mean fold change was modest for most of these genes ( Table 2); nevertheless, in subjects with high plasma cotinine the overall expression pattern of these genes is anti-apoptotic compared to low cotinine subjects. The combination of increased cell cycle activity, resistance to apoptotic triggers, increased expression of oncogenes, and decreased expression of tumor suppressor genes in circulating leukocytes suggests a mechanism responsible for the low-level, systemic, increased risk of oncogenesis in patients who use tobacco products.
Testing for differential expression of individual genes does not take advantage of our knowledge of the underlying relationships. Therefore, additional power can be gained by testing for differential expression of gene sets that underlie a common biological process [37,38,59]. This idea motivated the development of techniques that pair local statistics of individual gene expression with global statistics based on membership in defined pathways and functional groups. One such algorithm, Gene Set Analysis (GSA), was implemented using the Molecular Signatures database (MSigDB). The GSA algorithm identified 16 gene sets at a p-value < 0.0001 and FDR of 0%. The top three MSigDB pathways were "Death Pathway," Histogram of plasma cotinine concentration  (Table 3). Although many of the genes comprising these sets did not reach statistical significance individually, taken as a group they were highly significant. Genes related to apoptosis and type I interferon response were common elements in all of these pathways. Among genes involved in the MSigDB "Death Pathway," expression of BIRC3 and TRAF2 (anti-apop-totic genes) were up regulated while CASP9, FADD, and STK17A (pro-apoptotic genes) were down regulated in the high cotinine group. This overall expression pattern is indicative of an anti-apoptotic phenotype, which characterizes virtually all cancers. These observations suggest that transcriptional profiles associated with tobacco use may indicate pre-cancerous tendencies. The 71 genes present in the top 3 pathways (Table 3) were added to the

Pattern Identification via the Hyperclustering Technique
Differentially expressed genes were hyperclustered (see Materials and Methods) and visualized ( Figure 2) using the pooled gene list. The subjects with the highest mean levels of cotinine were clearly separated from the subjects with the lowest mean cotinine levels using this technique. Moreover, genes were clustered into functional groups based on their expression patterns, membership in Gene Ontologies (Table 4, labeled A-G), and presence of predicted transcription factor binding sites. This produced 5 physiologically relevant clusters. The 'Stress' cluster is comprised of stress-responsive genes involved in signal transduction (CX3CR1 and ITGB1). The 'Macromolecular Metabolism' cluster is made up of metabolic genes (HIPK1, SUMO2, SULF2, and FKBP3). The third cluster, 'Transcription and Signaling', contains genes associated primarily with G protein signaling and transcriptional regulation (RASGEF1A, RAB2, ARHGAP1, PPP1R12B, CREBBP, and GNG2). 'Cell Death and Apoptosis' is a cluster of genes associated with apoptosis and its regulation. The fifth cluster, 'Interferon' is defined by genes that potentially contain an interferon-stimulated response element-binding site or are responsive to type-1 interferons.
The utility of the hyperclustering technique is readily apparent: a single image indicates the relationships among the genes, lending physiological relevance to a data set. A case in point is the 'Interferon' cluster, comprised of genes that are strongly up regulated in approximately half of the subjects with the highest cotinine levels.
The genes in this cluster (IFI44, IFIT1, USP18, and HERC5. Figure 2) are interferon responsive genes, and are members of the gene class forming the early response to type-I interferons, indicative of a cellular response to viral agents or very specific forms of genotoxicity. Our findings are consistent with those of Grumelli, et al. who demonstrated that lymphocytes isolated from lung samples of patients with smoking-related lung damage showed an increase in expression of multiple interferon-inducible proteins [60]. These results indicate that induction of interferon-dependent transcription pathways appear sys-temically in some tobacco users. Only half of the tobacco users have this expression pattern; the mechanisms of which are unknown, but worthy of future investigation. It is tempting to speculate that these patterns of systemic interferon-responsive induction identify a group of tobacco users who may develop early and severe disease.
Longitudinal studies designed to track the patterns of gene expression over time in cohorts of tobacco users and nonusers will be necessary to unambiguously determine the meaning of these observations.

Real time PCR verification of differentially expressed genes
Quantitative real time PCR was used for both technical (microarray) and biological verification. Four genes selected from SAM and one gene from GSA: CX3CR1, SASH1, HRASLS3, PTGDR, and CDKN1C, respectively, were used for technical verification (Figure 3, left panel) on samples randomly selected from the low and high cotinine subject population (Caucasian males). The up or down regulation of these genes, irrespective of their method of identification (SAM or GSA) was consistent with the microarray analysis. Furthermore, the relative fold changes determined via quantitative real time PCR were either equal to or greater than the fold change measured by the microarray analysis, and significantly different between the low and high cotinine subjects (P < 0.05). Analysis using subjects excluded from the microarray analysis (Caucasian females) biologically validated the cotinine-dependent change in expression of two genes, CDKN1C and SASH1 (Figure 3, right panel). RPS29 was used as a negative control gene and was not found to be differentially expressed either by microarray or real time PCR analysis.

Conclusion
In this study we demonstrated that groups of genes in circulating human leukocytes are affected by tobacco use in vivo. We identified genes and their relationships using a combination of testing individual genes (SAM), testing gene sets (GSA), and high throughput annotation (GATHER). Hyperclustering using Gene Ontologies and transcription factor binding sites associated with these genes illuminated the functional significance of the differentially regulated genes. The resulting gene expression  [71] Genes involved in signaling via Fas and DR3, 4, and 5. < 0.0001 0 METASTASIS_ADENOCARC_DN (c2:1553) [72] Genes involved in metastasis of solid tumors. < 0.0001 0 DAC_IFN_BLADDER_UP (c2:1304) [73] Interferon responsive genes upregulated by DAC treatment. < 0.0001 0 Top 3 gene sets (71 total genes) identified by GSA comparing gene expression profiles of subjects with high plasma cotinine versus low plasma cotinine, showing the names of differentially expressed gene sets as defined by Gene Set Enrichment Analysis [70] with accompanying p-value and FDR.
Hyperclustering of cotinine responsive genes  Table 4 for detailed category information). spectra revealed novel and under-recognized molecular pathways in the pathophysiology of diseases commonly associated with tobacco use. Genomic signals in circulating leukocytes characteristic of cellular metabolism, transcription and signaling, apoptosis, response to stress, and the interferon response were all correlated with nicotine exposure. These results strongly suggest that tobacco use promotes a pro-carcinogenic environment, predisposing individuals to develop cancers in a variety of organ systems.
Interestingly, some genes that have previously been linked to smoking were not differentially expressed in our 2 subject groups [61][62][63]. For example, neither CYP1B1 (a cyto-chrome P450 enzyme playing an important role in chemical carcinogenesis) nor SOD2 (which destroys toxic radicals normally produced within cells) had an expression profile that differed significantly between high and low cotinine groups. Although several previous reports identified these genes as being affected by smoking, design and subject pool differences used in the present study could explain the absence of these genes from our profile. CYP1B1 is expressed to a greater degree in the females than in males and our data set is all male [64]. SOD2 gene expression declines with age [65]. The mean age of one of the studies reporting differential regulation of SOD2 was 27 years while the mean age of our study subjects is 46.5 years, which may explain why the SOD2 * Node is not a child of the parent node for this group gene expression ratios between the groups in our study did not vary significantly.
A significant link has been established between smoking and the development of blood-borne cancers such as acute myelogenous leukemia (AML) and acute lymphocytic leukemia (ALL) [66,67]. Exposure to compounds derived from tobacco use is typically highest in the oral and nasal cavities, the laryngotracheobronchial tree, and the urinary system, putting these tissues at the greatest risk of developing tumors [68]. Nevertheless, given chronic exposure to carcinogens, blood tissues are likewise at an increased risk of carcinogenesis [69]. Sandler, et al., observed a clear dose response to smoking, with heavy smokers at the highest risk of developing leukemia [66]. The causative mechanism for this observed increase in leukemia among smokers is unknown. Our results identify highly relevant, differentially expressed genes that may serve as the basis for future experiments aimed at addressing the molecular etiology of AML and ALL in smokers. Moreover, these gene signals were detected in an easily obtainable sample of peripheral blood.
We found a correlation between tobacco use and increased expression of interferon-inducible genes in circulating leukocyte populations. Strong induction of interferon-responsive gene expression was seen in only a subset of tobacco-using subjects, arguing that interferon induction is not a direct effect of tobacco use. The mechanism of induction of these genes is not clear from our data. Previous studies have found a strong correlation between the parenchymal destruction associated with end-stage emphysema and the presence of interferon and interferon-inducible genes in the lung [60]. Intriguingly, 5 of the 6 subjects (83%) with a diagnosis of COPD in this study demonstrated the high-interferon response phenotype. Our observation of elevated peripheral interferon response gene expression may reflect a systemic manifestation of a destructive pulmonary inflammatory response. These observations may provide evidence of a systemic immune basis for smoking-related lung parenchymal destruction. Alternatively, the expression of interferonresponsive genes in the periphery may be secondary to the upper and lower respiratory tract infections to which smokers are prone.
Hyperclustering revealed 5 distinct, physiologically relevant gene groups in peripheral leukocytes affected by tobacco use in vivo. Furthermore, these gene groups belong to pathways and regulatory systems important to the etiology of smoking-related diseases. These novel results enhance our understanding of how tobacco use affects patterns of gene expression in leukocytes, and provide a starting point for elucidating the molecular mechanisms of tobacco-related neoplasia, atherosclerosis, and immune dysfunction. The hyperclustering visualization facilitated interpretation of microarray data by fusing the expression data with functional annotation derived through robust statistical methodology (GSA and GATHER) prior to cluster analysis. This technique is a visual representation that combines gene expression data and any form of additional annotation. Gene expression profiling of readily obtainable peripheral blood samples identified genes that regulate response to stress, macromolecular metabolism, transcription and signaling, interferon response, and cell death and resistance to apoptosis. This profile may identify some novel targets for therapeutic intervention for both smoking-related diseases and, potentially, for smoking cessation.
Histogram of relative expression of selected genes using real time PCR Figure 3 Histogram of relative expression of selected genes using real time PCR. Technical verification (left) of differentially expressed genes identified in the subject population (Caucasian males) by SAM/GSA (n = 20): CDKN1C, HRASLS3, PTGDR, CX3CR1, and SASH1. Biological verification (right) of two selected genes using independent samples not included in our subject population (Caucasian females, n = 10): CDKN1C and SASH1. Data is represented as the log base 2 relative change in gene expression (± standard error) and all expression normalized to low cotinine from the subject population samples (Caucasian males). The data labels represent the fold change in high versus low cotinine samples, all of which were statistically significant (P < 0.05). The fold change in the gene RPS29 was used as a negative control and was not significant (n.s.) between the high and low cotinine groups.