The global landscape of intron retentions in lung adenocarcinoma
© Zhang et al.; licensee BioMed Central Ltd. 2014
Received: 9 December 2013
Accepted: 14 March 2014
Published: 20 March 2014
Skip to main content
© Zhang et al.; licensee BioMed Central Ltd. 2014
Received: 9 December 2013
Accepted: 14 March 2014
Published: 20 March 2014
The transcriptome complexity in an organism can be achieved by alternative splicing of precursor messenger RNAs. It has been revealed that alternations in mRNA splicing play an important role in a number of diseases including human cancers.
In this study, we exploited whole transcriptome sequencing data from five lung adenocarcinoma tissues and their matched normal tissues to interrogate intron retention, a less studied alternative splicing form which has profound structural and functional consequence by modifying open reading frame or inserting premature stop codons.
Abundant intron retention events were found in both tumor and normal tissues, and 2,340 and 1,422 genes only contain tumor-specific retentions and normal-specific retentions, respectively. Combined with gene expression analysis, we showed that genes with tumor-specific retentions tend to be over-expressed in tumors, and the abundance of intron retention within genes is negatively related with gene expression, indicating the action of nonsense mediated decay. Further functional analysis demonstrated that genes with tumor-specific retentions include known lung cancer driver genes and are found enriched in pathways important in carcinogenesis.
We hypothesize that intron retentions and consequent nonsense mediated decay may collectively counteract the over-expression of genes promoting cancer development. Identification of genes with tumor-specific retentions may also help develop targeted therapies.
As one of the leading causes of cancer-related mortality in the world, lung cancer accounts for approximately 12 percent of all cancer incidences and 17.6 percent of cancer deaths [1, 2]. Of them, lung adenocarcinoma accounts for more than 500,000 deaths per year worldwide and is the most common subtype of non-small cell lung cancer . Although the underlying mechanism of lung adenocarcinoma is still under investigation, studies showed that recurrent mutations in the epidermal growth factor receptor (EGFR) and the anaplastic lymphoma kinase (ALK) fusions could change the efficacy of treatment for patients with lung adenocarcinoma [4–8]. Genetic modifications in other genes, including targeted mutations in BRAF, AKT1, ERBB2 and PIK3CA, as well as ROS1- and RET-involved fusions, may also affect cancer therapy . In addition, a recent study has found frequent copy number changes in NKX2-1, TERT, PTEN, MDM2, CCND1, and MYC in lung adenocarcinoma , highlighting the role of various types of genetic alternations in carcinogenesis.
Alternative splicing in multiple-exon genes is prevalent in eukaryotes and it is actively involved in development, cell differentiation and disease. Approximately 90% of multi-exon human genes have splicing variants in different tissues and cell lines [10, 11]. Intron retention, or the maintenance of an intron in a mature mRNA transcript, is a less common type of alternative splicing  and can have large functional consequence by introducing premature mutations to the mature transcript. Although the impact of intron retentions has been less acknowledged, a recent report suggests that intron retention is one of the most predominant splice events in three breast cancer subtypes , and the retention of intron 4 in the wild-type cholecystokinin type 2 (CCK 2 ) receptor shows elevated expression associated with increased tumor growth in a few cancers .
The emergence of high-throughput sequencing technologies in the past few years has provided a new platform to perform large-scale transcriptome profiling at an affordable cost. Based on high-throughput sequencing, RNA-Seq can precisely measure mRNA expression and characterize gene isoforms [15, 16], and is commonly used to identify somatic mutations [17, 18], differentially expressed genes , fusion genes in tumor tissue [20–22], and allele-specific expression [23, 24]. Here in the present study, we exploited the rich information in RNA-seq data to investigate the potential role of intron retentions in lung adenocarcinoma. Using tumor and matched normal samples, we systematically identified genes with tumor-specific intron retentions. Further investigation suggests a potential protective role of intron retentions in carcinogenesis through the action of nonsense mediated decay (NMD).
Transcriptome sequencing data from five lung adenocarcinoma and their paired adjacent normal tissue specimens  were downloaded from European Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena/), using the accession number ERP001058. Reads from five patients (LC1, LC5, LC10, LC11, and LC12) were used in this study, and as described in the original study, all protocols were approved by the Institutional Review Board of Seoul National University Hospital (Approval # C-1111-102-387) and Seoul St. Mary’s Hospital (Approval # KC11TISI0678). 101-bp paired-end reads were generated by Illumina Hiseq 2000 sequencer for each sample.
To extract exon-intron junction sequences, human exon information was first downloaded from Ensembl database (release 69) . To assign exon-intron junction unambiguously, intersecting exons were excluded, resulting in 164,500 non-overlapping exons. Then exon-intron junctions were then determined and 101-bp sequences were extended in each direction for future mapping.
where R i is the retention abundance for gene i.
Gene expression was first calculated by using the RSEM program , which effectively uses ambiguously-mapping reads to estimate expression abundance. Next, EdgeR package was used to normalize the data by trimmed mean of M values (TMM) and identify differentially expressed genes . Genes at low expression level (≤ 1 transcript per million reads, TPM) were excluded and DEGs were defined as genes with a p-value < 0.05 after Benjamini-Hochberg adjustment .
Variants in tumor samples were first identified by SAMtools  for each patient, and only variants supported by at least three reads with base quality ≥20 were retained. Positions of those variants were then examined in normal samples to make sure they were also covered by reads from the corresponding normal samples and they were not variable in normal samples.
Gene ontology (GO)  information for query genes was assigned using bioconductor (http://www.bioconductor.org) package “org.Hs.eg.db”. Enrichment tests were performed by assuming a hypergeometric distribution using “topGO” package . KEGG (Kyoto Encyclopedia of Genes and Genomes) database  was used to retrieve pathway annotation information, and Fisher’s exact test was performed to evaluate the enrichment of a pathway. Multiple test correction was conducted using Benjamini-Hochberg method.
The data used in this study were whole transcriptome sequencing from tumor and adjacent normal tissues of five patients with lung adenocarcinoma, containing approximately 665 million short reads produced by Illumina HiSeq2000 sequencer, about 67 million per sample (Additional file 1: Table S1). Using Bowtie 2 aligner, about 647 million reads (~97%) can be mapped to human cDNAs. The remaining 18 million unmapped reads were further aligned to exon-intron junctions to identify potential intron retention events. On average, 67,466 and 63,297 retention events were found in each tumor and normal sample, respectively, with ~36,865 retentions in common (Additional file 1: Table S1).
Summary statistics for intron retentions in normal and tumor samples
Group-specific retention (GSR)a
Genes with GSRb
Group-specific genes (TPM > 1)d
Group-specific genes (TPM > 1 and GSR > 1)e
One possible explanation for the over-representation of intron retentions in tumors is the inhibition of nonsense mediated decay (NMD), which degrades transcripts with pre-mature codons  and is reported to be inhibited in tumor microenvironment . Therefore we investigated the expression pattern of 136 genes involved in NMD process (Additional file 3: Table S3). Among them, only three genes (CTIF, FAU, and RPS28) are significantly down-regulated in tumors, implying NMD may not be inhibited in lung adenocarcinoma and thus cannot explain the large amount of intron retentions in tumors.
It should be noted that the observed correlation could be simply explained by that retentions in over-expressed genes are preferentially identified due to their abundance. Therefore we estimated the intron retention abundance in each TSRG. The mean and median abundance of transcripts with intron retentions for a TSRG in tumor samples are 18% and 5%, consistent with previous observation that intron retentions comprise a minor fraction of splicing forms . Furthermore, we found that up-regulated TSRGs have lower percentage of intron retentions compared with down-regulated genes (4.5% versus 6.7%, median percentage, P-value < 2.2 × 10−16, Wilconxon ranksum test). As transcripts with premature stop codons tend to be degraded by NMD , the low level of intron retentions in up-regulated TSRGs and vice versa may suggest the presence of NMD. To validate this, we categorized 2,340 commonly expressed TSRGs as genes with in-frame retention (659) and genes with frame-shift retention (1681), and compared their expression level as well as retention abundance. Genes with in-frame retentions have higher expression than those with frame-shift retentions (1581 versus 1356, mean TPM), but it was not statistically significant (P-value = 0.4561, Wilconxon ranksum test). The retention level in genes with in-frame retentions is significantly higher compared with genes with frame-shift retentions (6.3% versus 4.8%, median percentage, P-value = 4.0 × 10−12, Wilconxon ranksum test), confirming NMD is active in tumor samples.
Enriched gene ontology categories in genes with multiple TSRs
Systemic lupus erythematosus
Extracellular matrix structural constitu…
Extracellular structure organization
Collagen fibril organization
VEGF signaling pathway
Platelet-derived growth factor binding
Extracellular matrix part
Extracellular matrix organization
Leading edge membrane
Proteinaceous extracellular matrix
We also conducted pathway analysis for TSRGs and five pathways were overrepresented (Table 2), including the VEGF (vascular endothelial growth factor) signaling pathway (hsa04370). This signal pathway contains several key mediators of angiogenesis and lymphangiogenesis in tumor development , and is often found highly expressed in tumors . Enriched intron retentions in these genes, again may activate the mRNA decay mechanism to offset the over-expression.
One plausible reason for intron retentions is that mutations occurred on the intron splicing sites which change the splicing signal and thus result in an unspliced intron. To explore the prevalence of splicing mutations in tumors, we used SAMtools to identify single nucleotide variants in tumors, and then filtered ones also variable in the matched normal samples. In total, only 27 tumor-specific variants were found to modify the splicing signal (Additional file 5: Table S5). Considering the large number of tumor-specific intron retentions (4,099), it seems that somatic mutations on splicing sites may have a negligible role in causing intron retentions. We also investigated the expression level of several trans-acting splicing activators, including Tra2[41, 42] and RNPS1, but none shows differential expression between tumors and normal samples.
By searching the COSMIC database (Catalogue of Somatic Mutations in Cancer, http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/), we found TSRGs include a substantial number of tumor genes, and some are also represented in the Cancer Gene Census , which catalogues genes with mutations that have been causally implicated in cancer. Examples include EGFR (epidermal growth factor receptor), KDR (kinase insert domain receptor), ATM (ataxia telangiectasia mutated), and ROS1 (c-ros oncogene 1, receptor tyrosine kinase). Furthermore, three genes were among the top 20 most frequently mutated genes in lung adenocarcinoma: EGFR (34%), ATM (5%) and KDR (5%). TSRG list in this study also targets other genes with a potential role in carcinogenesis, such as MUC16 (mucin 16, cell surface associated), expression of which was found to correlate with clinical outcome in adenocarcinomas , as well as RUNX1 (runt-related transcription factor 1), which binds to the core element of many enhancers and promoters and may have various roles in tumors [46, 47]. A close investigation further found reads across six exon-intron junctions in MUC16, and the expression of MUC16 is significantly elevated in tumors (p-value = 3.98 × 10−13 after Benjamini-Hochberg correction), but the abundance of intron retention is 3.4%, smaller than 4.5%, the median abundance of up-regulated TSRGs, implying the over-expression of MUC16 in lung adenocarcinoma may be related to the below average intron retention level. Finally, we also prioritized a list of TSRGs which contain multiple frame-shift retentions and were significantly over-expressed in tumor samples (Additional file 6: Table S6). These genes include driver genes such as EGFR, ROS1, and RUNX1, thus functional studies on them should help understand the role of intron retentions in lung tumor development.
Recent large-scale efforts from Cancer Genome Atlas Research Network have resulted in lung cancer candidate genes with somatic mutations and copy number alternations [3, 48]. However, variations at the mRNA level in these are not fully explored, though the diversity and functionality of tumor-specific transcripts have been highlighted [10, 49, 50]. Several processes could result in novel mRNA isoforms in tumors, including alterations in promoter usage, exon skipping, and splicing signals, which in consequence changes coding regions and the resulting proteins [51–53]. Thus it is essential to understand the contribution of cancer-related changes emerging at the stage of transcription. The rapid development of sequencing technology makes RNA-Seq a cost-effective way to characterize transcriptome and is therefore frequently used in biomedical studies. Here, we developed a bioinformatics pipeline that explores RNA-Seq data to identify intron retention events, a splicing form of less appreciation but be also important in cancer study [13, 54], and further compared their spectrum between lung adenocarcinoma and matched normal tissues. A prevalence of intron retentions was found in carcinoma samples, and over-expressed TSRGs tend to have lower retention abundance compared with under-expressed genes.
One important issue in identifying intron retentions is to distinguish potential contaminations from genomic DNAs or precursor mRNAs during the library preparation process. In order to remove false positive calls caused by contamination, we applied a simple and straightforward filter that requires a candidate intron retention event to be presented in at least two tumor samples and not in any normal sample, or verse visa. If one sample is contaminated and contains false intron retentions, such retentions are not expected to be found in other samples; if multiple samples were contaminated, falsely called intron retentions would be found in both tumor and normal samples, which will also be removed by the filter. However, this filter also removes intron retentions occurred in individual samples, thus the total number of TSRs or NSRs should be even larger than reported here.
The nature of our bioinformatics pipeline determines that it may have limited power in detecting intron retentions in genes with low expression level, partially accounting for the enrichment of intron retentions in over-expressed genes. However, our pipeline also filtered genes with very low expression, the abundance of intron retentions in tumor samples thus cannot be simply explained by the expression bias. Additionally, when focusing on genes with abundant expression, a reverse pattern was demonstrated as the abundance of intron retention is negatively correlated with gene expression, which is likely the result of NMD. Since a substantial proportion of cancer driver genes are over-expressed in tumors, identified intron retentions in those up-regulated genes may suggest a biological role to neutralize over-expression in tumors.
With respect to the mechanism of somatic intron retentions, the most intuitive explanation is that somatic mutations occur at splicing sites and alternate the splicing signal, therefore those splicing sites could not be properly recognized. However, no enrichment of somatic mutations was observed in this dataset (less than 1% of TSRs have somatic mutations in the splicing sites). We also interrogated the expression pattern of several splicing activators, again, no obvious pattern was found. Alternatively, some studies showed that intron retention pattern is different among various tissues [55–57], suggesting other factors, such as cellular environment may also function in promoting the process of intron retention. In addition, the observation of smaller size of retained intron in tumors compared to that in normal samples or non-retained introns is intriguing. Although explanations have been proposed for short retained introns , the difference between normal and tumor samples remains unexplained. Future work is therefore necessary to better understand the pattern observed here.
Among genes with tumor-specific retentions, genes with known driver functions in cancer were rediscovered, including EGFR, ROS1, ATM and KDR. Additionally, other growth factor genes were also found with retained introns in tumor samples, such as PDGFRB (platelet-derived growth factor receptor, beta polypeptide), TGFBI (transforming growth factor, beta-induced), EGF (epidermal growth factor), IGF2R (insulin-like growth factor 2 receptor), and ERBB2 (v-erb-b2 erythroblastic leukemia viral oncogene homolog 2), which are also involved in tumor evolution in various studies [58–62]. By detailed investigation, we found intron retentions within these genes all caused frame-shift changes, which tend to invoke NMD. It is well known that cancer driver genes, such as EGFR, are over-expressed or activated by mutations in tumors, further activating downstream pathways associated with cell growth and survival. Therefore intron retentions occurring in these over-expressed or highly mutable driver genes could be protective for the patient by triggering NMD, which in term reduces the expression level or copies of mutable mRNAs. Future validation studies and functional dissections, however, are still critical before we can draw the conclusion.
At the moment of this analysis, only a few studies focus on systematically characterizing the global pattern and contribution of intron retentions in tumorigenesis . Results in this study suggest a potential protective role of intron retentions in lung adenocarcinoma and may benefit further biomarker development. It would also be of interest to investigate the pattern of intron retentions in other cancer types.
Written informed consent was obtained from patients in the original study and data is released for public use.
We appreciate the helpful comments from two reviewers. QZ thanks Harvard Research Computing group for technical support in computing. This study was funded by the introduction of innovative R&D team program of Guangdong Province (NO. 2009010029), Industry, Education and Academy Cooperation Foundation of Guangdong Province (No.2011A090200114), and Shenzhen Enterprise Engineering Center Project (No. JC201005240008A).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.