Prediction of the gene expression in normal lung tissue by the gene expression in blood

Background Comparative analysis of gene expression in human tissues is important for understanding the molecular mechanisms underlying tissue-specific control of gene expression. It can also open an avenue for using gene expression in blood (which is the most easily accessible human tissue) to predict gene expression in other (less accessible) tissues, which would facilitate the development of novel gene expression based models for assessing disease risk and progression. Until recently, direct comparative analysis across different tissues was not possible due to the scarcity of paired tissue samples from the same individuals. Methods In this study we used paired whole blood/lung gene expression data from the Genotype-Tissue Expression (GTEx) project. We built a generalized linear regression model for each gene using gene expression in lung as the outcome and gene expression in blood, age and gender as predictors. Results For ~18 % of the genes, gene expression in blood was a significant predictor of gene expression in lung. We found that the number of single nucleotide polymorphisms (SNPs) influencing expression of a given gene in either blood or lung, also known as the number of quantitative trait loci (eQTLs), was positively associated with efficacy of blood-based prediction of that gene’s expression in lung. This association was strongest for shared eQTLs: those influencing gene expression in both blood and lung. Conclusions In conclusion, for a considerable number of human genes, their expression levels in lung can be predicted using observable gene expression in blood. An abundance of shared eQTLs may explain the strong blood/lung correlations in the gene expression. Electronic supplementary material The online version of this article (doi:10.1186/s12920-015-0152-7) contains supplementary material, which is available to authorized users.


Background
Study of tissue specificity in gene expression is important for understanding tissue biology and can facilitate an identification of genes associated with risk of human diseases [1]. A number of studies on comparative analysis of the gene expression in different tissues were published [2][3][4]. Pan-tissue analysis of the gene expression has shown that, on the genome level, there is a strong positive correlation in expression, meaning that genes expressed at a low level tend to have low expression across multiple tissues [5][6][7].
Correlation on the gene level, e.g., correlation between expression of a given gene in blood and in lung tissue across different individuals, is poorly studied. The major reason for this is lack of paired data: samples from the different tissues of the same individual. Such samples were not available until very recently. Genotype-Tissue Expression Project (GTEx), launched by the NIH in 2010, has yielded paired gene expression data, allowing analysis of the gene expression in different tissues originating from the same individual.
Analysis of the inter-tissue correlations, especially between gene expression in blood and other tissues, has obvious practical significance. Blood is the most accessible human tissue and identification of the genes whose expression in blood mirrors the gene expression in other tissue may be important for cancer risk prediction based on the gene expression profiling in blood.
The goal of this study was: (i) To test the hypothesis that for some human genes, gene expression in blood mirrors the gene expression in lung; and (ii) To identify gene characteristics influencing blood/lung correlations in expression.

Gene expression in blood mirrors gene expression in lung
At the genome level (mean expression level of the gene in blood versus the mean expression level in lung), there was a strong positive correlation: Spearman's rank correlation coefficient (SRCC) = 0.90, N = 22,704, P = 2 · 10 −34 . Figure 1 shows the scatterplot of the mean expression in lung versus mean expression in blood. At the gene level (correlation between paired blood/lung samples), average SRCC was positive and significant: average SRCC = 0.06, N = 22,704, P = 3.1 · 10 −8 . P-value is shown for the testing null hypothesis that mean SRCC equals zero.

Gene specific linear regression models
A total of 22,704 gene-specific generalized linear regression models were built with gene expression in lung tissue as outcome and gene expression in blood, age and gender as predictors. For 3671 genes, at least one predictor was nominally significant. For 1216 genes, age was the only significant predictor; for 331 genes, gender was the only significant predictor; and for 1748 genes, gene expression in blood was the only significant predictor of the gene expression in lung. The list of genes whose expression in lung can be predicted by age, gender or expression level in blood can be found in Additional file 1.

Number of eQTLs and prediction efficacy
Positive correlation between gene expression in blood and the gene expression in lung can be driven by shared genetic polymorphisms influencing gene expression -eQTLs. According to GTEx, there are 235,576 eQTLs for blood and 222,038 for lung tissue. The number of eQTLs per gene varied from 0 to several hundred, e.g., MICA has 539 blood and 482 lung eQTLs. More than 80 % (189,869) of blood eQTLs also influence gene expression in lung (shared eQTLs). For 99.9 % of shared eQTLs the direction of the effect is the same in blood and lung, meaning that an allele that increases gene expression in blood also increases gene expression in lung. Therefore, more than 80 % of eQTLs are shared by blood and lung tissues and, even more importantly, essentially all of them have the same direction of the effect on gene expression.

eQTLs and inter-individual variation in expression
eQTLs can contribute to inter-individual variation in the gene expression. We divided all genes into 4 groups based on the number of the reported eQTLs: (1) no reported eQTLs; (2) 1 to 10 eQTLs; (3) 10 to 51; and (4) 52 or more. Variance was estimated in each group separately for lung and blood tissues (Fig. 2). The variance was highest in the 4 th group. For both tissue types, there was a significant positive association between the number of eQTLs and inter-individual variance in the gene expression.

Gene characteristics associated with the blood-based prediction efficacy
We tried to determine if factors other than number of eQTLs gene characteristics influence efficacy of the blood-based prediction of the gene expression in lung. We used the same gene characteristics as we used in our recently published paper on prediction of SNP reproducibility [8]. Table 1 shows the list of analyzed characteristics and corresponding statistics. For information about sources of the data we refer the reader to the Fig. 1 The scatterplot of the mean expression in blood versus mean expression in lung. Each dot represents a gene original publication [8]. Gene characteristics significantly associated with prediction efficacy of the gene expression in lung based on the gene expression in blood include nuclear localization, acetylation, methylation, ubiquitination, and the size of the coding region.

Discussion
Gene expression is controlled by a combination of general and tissue-specific transcription factors (TFs). Transcription initiation requires a formation of a multiprotein complex that includes general transcription factors (GTFs) [9]. GTFs are not tissue-specific: binding of GTFs to the promoter regions is required for initiation of transcription of the gene in any tissue. GTFs bind the promoter region of the gene in the sequencespecific manner and the binding efficacy is sequencedependent [10]. Genetic polymorphisms in the GTFs' binding sites modulate gene expression similarly across different tissue types. Tens of thousands of SNPs were detected in promoter regions of the human genes [11]. In fact, SNP density in regulatory regions of the human genes is higher compared to other regions [12], which can be explained by a higher G + C content in the regulatory regions. A higher G + C content is associated with elevated mutation rate, especially in CpG sites [13,14]. Promoter regions are not the only sites harboring regulatory elements modulating gene expression. 5′UTR regions often contain regulatory elements, not to mention   [15,16]. Our recent (February 25 th , 2015) access of dbSNP database indicated that there were almost 400 K SNPs in 5′ UTRs. The existence of multiple polymorphic sites in the human genome modulating gene expression level is supported by eQTL analyses. The first eQTL study based on the analysis of lymphoblastoid cell lines and a rather limited number of SNPs generated by HapMap project [17] has identified almost 52 K eQTLs with FDR < 0.1.
Our analysis demonstrated that: (i) for about 18 % of the human genes, the expression in lung can be predicted by an assessment of the gene expression in blood, and (ii) the efficacy of the prediction depends on the number of shared eQTLs. Different human tissues are expected to have exactly the same set of eQTLs, since they are the same (except somatic mutations) on the DNA level. However, functional effect of eQTLs can be tissue-specific because transcription factors are often tissue-specific.
We used P-values that were not adjusted for multiple testing. This suggests that some nominally significant correlations can be false-positives. Among 16,998 assessed genes the number of nominally significant blood/lung gene expression correlations was 2380. The expected number of correlations based on the type 1 error of 0.05 is 850 which is lower than the observed number (χ 2 = 793.5, df = 1, p = 1.2 − 56 ). Though this result clearly indicates the presence of true positives, we need to be cautious about using them to define the list of genes in the human genome whose expression in lung can be predicted by gene expression in blood. A comprehensive list of such genes cannot be generated using available data because the sample size is too small, and data on tobacco smoke exposure are lacking (see the study limitations section). The goal of this study was to demonstrate that (i) for a considerable number of human genes, their expression in lung could be predicted by the level of their expression in blood, and (ii) to identify factors influencing blood/lung correlations.
We found that the majority of eQTLs are shared and that an absolute majority of shared eQTLs has the same direction of effect in different tissues, meaning that if a given allele increases the expression of the gene in blood, it also will increase the expression of the same gene in lung. One can expect that an individual with an excess of up-regulating eQTLs will have elevated expression of the gene across different tissues.
In univariate analyses eQTLs were the most significant predictors of blood/lung correlation in gene expression. We found that nuclear localization, posttranslational modification of the gene product, and gene size are associated with the efficacy of the blood-based prediction of the gene expression in lung. We used a multivariable linear regression model to test if those characteristics are eQTLindependent. Minus LOG(P) for the statistical significance of the β for the gene expression in blood was used as outcome, and number of shared eQTLs, nuclear localization, gene size and posttranslational protein modification were used as predictors. Only number of eQTLs remained significant in the model: F = 336.5, df = 1, P = 3.6x10 −16 while nuclear localization, posttranslational protein modifications and gene size were not significant: corresponding P-values are 0.25, 0.19 and 0.06.
The results of this analysis suggest that shared eQTLs with the same direction of the effect drive blood/lung correlation in gene expression. Based on this observation, one can expect similar correlations for other tissue types. Unfortunately, GTEx database does not have enough paired samples to explore correlations between other tissues.

Study limitations
The conducted study has two major limitations.

A small sample size
GTEx data are unique since they provide gene expression in normal human tissues from the same individual. However, the number of available samples is relatively small (31 paired blood/lung samples). This suggests that the statistical power to detect statistically significant correlations between gene expression in blood and lung is limited. One can expect that the real number of genes whose expression in blood reflects their expression in lung is higher.

Lack of data on environmental exposures (smoking)
Exposure to tobacco smoke has been shown to influence gene expression in lung [18,19] and blood [20]. Smoking data for GTEx participants were not available at the moment of the analysis. It is difficult to predict the effect of smoking on blood/lung correlation in gene expression: for genes whose expression in blood and lung changes similarly in response to tobacco smoke exposure, the correlation can be stronger, while the correlation is expected to be weaker for genes that are tobacco smoke sensitive in one tissue but not in the other. The results of the conducted analysis, hovewer, suggest that genetic (eQTLs) rather than environmental (tobacco smoke) factors are major drivers of the positive association between blood and lung expression.

Conclusion
In conclusion we found that for about 20 % of the genes, expression level in lung can be predicted by assessment of the expression level in blood. Genes with a strong genetic component in the control of the expression level (number of eQTLs) show stronger blood/lung association in gene expression. The results of the conducted analysis also indicate that the expression levels of the genes with a strong genetic component in the control of the expression show a higher inter-individual variation in expression level. We believe that genes whose expression in lung can be predicted by assessment of the gene expression in blood can be a valuable resource for blood-based disease risk prediction models.

Methods
All data used in this study were acquired from publicly available sources and we have not conducted any research on human subjects/samples ourselves.

Data sources
We used GTEx data http://www.gtexportal.org/home/ to correlate gene expression in blood with the expression level of the same gene in lung from the same donor. For the correlation analysis we used non-parametric Spearman's rank test because for many genes the distribution of expression values deviates from the normal distribution. A total of 31 donors were identified for which premortal gene expression in whole blood and paired gene expression in normal lung is available. Donors with a history of lung or respiratory conditions (asthma and pneumonia) were excluded from the analysis. We used lung tissue because it has the largest number of paired samples. Table 2 shows basic demographic characteristics of the donors. Gene expressions in both blood and lung tissues were assessed by RNA sequencing with median sequencing depth of 82.1 million mapped reads per sample. Details on sequencing and data processing can be found in recent GTEx publication [21].

eQTLs
We used data on lung and blood eQTLs recently released by GTEx. eQTLs with reported statistical significance level <10 −5 were used in the analysis. eQTLs were subdivided into 3 categories: (1) Blood eQTLsthose influencing the expression level of the gene in blood tissue; (2) Lung eQTLsthose influencing the expression level of the gene in lung; and (3) Shared eQTLsthose influencing the expression level of the gene in blood and lung.
Efficacy of the prediction of the gene expression in lung based on the gene expression in blood To predict gene expression in lung, we built genespecific linear regression models with gene expression in lung as outcome and the gene expression in blood, age, and gender as predictors. Model coefficients for predicting variables were used to assess an efficacy of the prediction of the gene expression in lung based on the gene expression blood. We used -LOG(P blood ): where P blood is type one error for the difference of regression coefficient from 0 (null hypothesis). We did not use correlation coefficient because, for some genes, the correlation between gene expression in blood and lung was driven by gender or age-related effects.

Gene characteristics influencing correlation between gene expression in blood and lung
We tested the hypothesis that prediction efficacy of the gene expression in lung based on the gene expression in blood depends on gene characteristics. We used a set of the gene characteristics from our recently published paper on prediction of SNP reproducibility in GWASs [8].

Statistical analysis
Data analysis was performed in R 3.0.2 (The R Foundation for Statistical Computing). Normalized expression data was obtained from Affymetrix expression data from GTEx at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc=GSE45878 for 22,704 genes. A linear regression model was built for every gene using the "lm" function in the base R package, using expression levels in blood, and the age and gender of each individual into account as predictors.
For identification of gene characteristics associated with efficacy of prediction of the gene expression in lung based on expression level in blood, we used a nonparametric Kolmogorov-Smirnov test for binary traits, and a Spearman's rank correlation for non-binary traits.

Additional file
Additional file 1: Table S1. Genes whose expression in lung is predicted by the gene expression in blood in multivariate linear regression model with gene expression in lung as outcome, gene expression in blood, age and gender as predictors. We included all nominally significant (type1 error ≤0.05) genes. Entrez gene IDs are shown. Table S2. Genes whose expression in lung is significantly (P ≤ 0.05) predicted by age in multivariate linear regression model with gene expression in lung as outcome, gene expression in blood, age and gender as predictors. Table S3. Genes whose expression in lung is significantly (P ≤ 0.05) predicted by gender in multivariate linear regression model with gene expression in lung as outcome, gene expression in blood, age and gender as predictors.