Potential roles of microRNAs in regulating long intergenic noncoding RNAs
© Juan et al.; licensee BioMed Central Ltd. 2013
Published: 23 January 2013
Skip to main content
© Juan et al.; licensee BioMed Central Ltd. 2013
Published: 23 January 2013
Over 10,000 long intergenic non-coding RNAs (lincRNAs) have been identified in the human genome. Some have been well characterized and known to participate in various stages of gene regulation. In the post-transcriptional process, another class of well-known small non-coding RNA, or microRNA (miRNA), is very active in inhibiting mRNA. Though similar features between mRNA and lincRNA have been revealed in several recent studies, and a few isolated miRNA-lincRNA relationships have been observed. Despite these advances, the comprehensive miRNA regulation pattern of lincRNA has not been clarified.
In this study, we investigated the possible interaction between the two classes of non-coding RNAs. Instead of using the existing long non-coding database, we employed an ab initio method to annotate lincRNAs expressed in a group of normal breast tissues and breast tumors.
Approximately 90 lincRNAs show strong reverse expression correlation with miRNAs, which have at least one predicted target site presented. These target sites are statistically more conserved than their neighboring genetic regions and other predicted target sites. Several miRNAs that target to these lincRNAs are known to play an essential role in breast cancer.
Similar to inhibiting mRNAs, miRNAs show potential in promoting the degeneration of lincRNAs. Breast-cancer-related miRNAs may influence their target lincRNAs resulting in differential expression in normal and malignant breast tissues. This implies the miRNA regulation of lincRNAs may be involved in the regulatory process in tumor cells.
Deep sequencing data from the Encyclopedia of DNA Elements Consortium (ENCODE) suggests that 70-80% of the human genome can be transcribed, and non-protein-coding RNAs (ncRNA) exceed the number of protein-coding genes . The recent discovery of a large number of non-coding RNAs (ncRNAs) significantly enriches the portfolio of potential genetic factors. Rather than being transcriptional noise, many ncRNAs serve as master regulators that affect expression levels of dozens or even hundreds of target genes [2–7]. These regulatory RNAs integrate signals from both genetic and environmental factors, and therefore can play major roles in controlling biological processes . Most notably, a strong association of epigenetic marks with long intergenic non-coding RNAs (lincRNAs >200 nucleotides) in humans and mice was recently described [9–11]. The lincRNAs show evolutionary conservation and spatiotemporally restricted expression patterns, implying that they are functional and regulated. These lincRNAs are reported to regulate dosage compensation, imprinting, and development by establishing chromatin domains in an allele- and cell-type specific manner [12, 13]. According to an analysis on half-lives, lincRNAs are more stable than intron-derived long noncoding RNAs, as they are usually spliced, which also suggest widespread functionality .
In two recent studies, over 10,000 lincRNA regions were identified in human and mouse genomes. These regions were discovered based on epigenetic marks, where the promoter mark H3K4me3 is followed by the transcript mark H3K36me3. Comparative genomics analysis suggests that a significant portion of these regions do not have strong protein coding potentials, and do not have sequence homology with other known proteins . The K4-K36 signature indicates that the transcription of these non-coding RNAs is regulated in a similar fashion as protein-coding genes. Although the molecular mechanisms through which most lincRNAs function were unknown, diverse regulatory mechanisms have been reported for several well-characterized lincRNAs [15, 16]. By targeting the chromatin modification complexes and RNA-binding proteins (XIST, AIR) [17, 18], they can inhibit gene expression (HOTAIR, DBE-T) [12, 19], and control alternative splicing (MALAT-1) . In addition, some lincRNAs are found to be differentially expressed (HOX) or coactivate other proteins (SRA) in tumor cells [10, 21], and are reported to be strongly associated with tumorigenesis [22, 23].
MicroRNA (miRNA) is another type of noncoding RNA whose biological functions have been extensively studied. MiRNAs are ~22nt small non-coding RNAs  that promote mRNA degeneration and/or inhibit their translation by complementarily binding on the 3' un-translated region (3'UTR) of mRNAs  and attracting RNA-induced silencing complex (RISC). In the past decades, over one thousand miRNAs have been identified [25, 26]. They are broadly related with regular cell functions and diseases [27–30].
Despite the extensive investigation on the roles of miRNAs in regulating protein-coding genes, only a few isolated non-coding transcripts (CDR1 antisense and MEG3) are reported to be regulated by miRNA [31, 32]. Though CDR1 antisense is not an intergenic ncRNA, and MEG3 is indirectly regulated by miR-29, these discoveries encouraged us to explore the potential roles of miRNA in regulating non-specific lincRNA expression. Because the primary mechanism of miRNA regulation is by targeting RNA for degradation , they are likely to also broadly regulate lincRNA expression. In fact, 5% of genomic regions interacting with argonaute proteins are located in non-coding RNAs . The argonaute protein family is one of the major components of the RNA-induced silencing complex (RISC) that is responsible for gene silencing due to miRNA expression and RNA interference . In this study, we intend to generally investigate the potential roles of miRNAs in regulating the lincRNA expression.
GENCODE project and NONCODE database have both collected and integrated long noncoding RNA annotations[36, 37]. Based on GENCODE annotation, miRcode provides a map of possible miRNA targets on long noncoding RNAs by using multiple bioinformatics prediction software . However, in tissue-specific lincRNA studies, the ab initio approach is usually employed to reconstruct the annotation of lincRNAs from the RNA-seq data [2, 3, 39]. Considering that lincRNA shows a high tissue or cell-specific expression pattern , this approach is also appropriate for our study.
RNA-seq experiments sequence millions to billions of short RNA fragments in a single experiment, by which it measures the expression levels of the whole transcriptome, including noncoding RNAs, without relying on detailed annotation. To explore the potential miRNA-lincRNA expression relationship, we analyzed RNA-seq data from a group of normal and malignant breast tissues. Random primers were used in the reverse transcription step, and the strands of the transcripts were preserved to aid in identifying non-coding RNAs that are not poly-adenylated. Due to their small size (~22nt), the mature miRNAs were not included in the sequencing libraries. However, our data suggested that we can detect precursor miRNAs (80-120nt) from these RNA-seq data sets. For lincRNAs, the sequence fragments across two exons of the transcripts (junction reads) are able to reveal exon boundaries, especially novel ones. This feature helps us to understand the splicing patterns of unannotated lincRNAs. Several tools have been developed for mapping junction reads , detecting the splicing events, assembling isoforms and estimating expression [41, 42].
In this study, for each putative miRNA-lincRNA pair, (1) we required reverse correlation between the expression levels of pre-miRNA and lincRNA within the 20 examined tissue samples. This is due to the well-established roles of miRNAs in facilitating RNA degradation on protein coding genes; and (2) we also required seed sequence of the miRNA (positions 2-7 on the miRNA) should be present in the exon regions of identified lincRNA, and the surrounding sequences on the lincRNA should also favor miRNA targeting conditions. We further examined the evolutionary conservation of the identified miRNA seeds in lincRNAs.
Based on the RNA-seq data, the discovery of lincRNAs in breast tissue includes the following steps: (1) Epigenetic marks and comparative genetics. As reported previously, thousands of lincRNAs have been discovered by searching for the genomic regions with enhanced promoter marks (H3K4me3, or K4) following the transcript marks (H3K36me3, or K36). Our analysis started from 2,513 intergenic K4-K36 regions in the human genome (hg18) and 1,665 intergenic K4-K36 regions in the mouse genome (mm9) from previous studies [2, 3]. We further mapped the mouse intergenic K4-K36 regions to 1,502 genomic loci in human genome using the liftOver tool in the UCSC Genome Browser . LiftOver can convert genomic coordinates between selected assemblies of the same or different species. After merging the two sources of K4-K36 regions and removing the ones near or overlapping with known genes, 3,096 candidate lincRNA regions remained.
(3) Coding potential. We evaluated coding potentials of the identified lincRNA regions using the methods presented in [49, 50]. By modeling the mammalian Codon Substitution Frequency (CSF) of transcript regions and random genomic regions, a CSF score was calculated for each region to represent the codon substitution pattern of the region. A higher CSF score indicates higher potential for being a protein coding gene. The coding potentials of the lincRNA regions are slightly but significantly higher than the ones in random genomic regions (p < 5.4e-07, Wilcoxon test on CSF scores), while the coding potential of protein-coding genes is much higher than lincRNA regions (Figure 2b). For further analysis, we excluded 47 regions with high CSF scores >20; these regions may represent protein-coding genes that are not included in the current annotation database.
We used Scripture  for reconstructing lincRNA exon structures from RNA-seq data. Scripture scans read-enriched regions as putative exons and finds exon boundaries supported by the reads across potential junctions. Scripture identified 525 lincRNAs in the 2,073 candidate regions; this percentage is comparable with the lincRNAs identified by previous studies in other tissues. The genomic coordinates of these lincRNAs are listed in Additional file 1. The average lincRNA length is 1201.7 bases, 75.2% of which are shorter than 1000 bases. The longest lincRNA has 32,178 bases. The lincRNAs are composed by 7.12 exons on average. The mean exon length is 168.8 bases. Almost half (46.7%) of the exons are shorter than 100 bases, with the longest has 5,242 bases. The annotated lincRNAs show a very similar expression patterns comparing with exons and introns with protein-coding genes (Figure 2c). Although lincRNAs are less conserved comparing to protein coding genes, their exons are significantly more conserved than introns (p < 0.0035) and random genome regions (p < 1.35e-14) (Figure 2d). These results provide strong evidence that the lincRNA is functional.
We employed TargetScan  and PITA  to predict putative miRNA target sites on the exonic regions of lincRNAs by 7-mer and 8-mer seed matching (default parameters). We identified 44,887 putative binding sites that belong to 39,384 pairs of miRNA and lincRNA. All 525 lincRNAs and 677 miRNAs are involved in these predicted results. The same seed sequence could be shared in a miRNA family. A miRNA can also bind to multiple sites of a single lincRNA. We also downloaded conserved target prediction of miRNAs and genes for further analysis from the website of TargetScan, including 110,284 predicted pairs of 9,448 genes and 249 conserved human miRNAs.
We quantified the expression levels of mRNA, lincRNA and precursor miRNAs using RPKM measures (reads per kilo-base exon model per million mappable reads). In total, there are 15,381 genes, 303 lincRNAs and 286 miRNAs expressed more than 0.5 RPKM in more than 15 samples. Our further analysis focuses on these sets of genes whose expression levels are detectable.
A generalized linear model (GLM) was used for modeling the potential effects of miRNA in down-regulating the expression levels of genes and lincRNAs in tumor and normal breast samples. Among the expressed mRNA, lincRNAs and miRNAs, 38,828 pairs are predicted to have regulatory relationships by miRNA target prediction algorithms. Among these potential target pairs, 1,742 and 213 mRNA-miRNA and lincRNA-miRNA pairs also showed significant reverse correlation, as derived from the GLM model (FDR<0.2).
As a class of newly discovered non-coding RNA, genomic loci of lincRNAs are not well annotated. As an ab initio study, we inferred their sequences, strands, exon positions, and regulatory elements from RNA-seq data. At the current stage, the functions and regulatory mechanisms of most lincRNAs are unexplored. Several well-characterized lincRNAs are involved in the establishment of the chromatin state and it is believed that many lincRNAs execute their functions by closely associating with chromatin-modifying proteins . This suggests trans-acting may be the primarily method for the affects of lincRNAs on gene expression .
Several recent studies revealed that thousands of lincRNAs exist in the cell [2, 3, 37, 42]. As a transcript, the nature of lincRNAs are quite similar to mRNA, for instance, at least a fraction of lincRNAs are poly-adenylated and have exonic structures. This as evidence of the possibility that miRNA could bind to lincRNAs and trigger degradation. The repression of lincRNAs could be an unknown part of miRNA regulation. Since some lincRNAs have been observed to be differentially expressed in tumor cells compared to normal tissues , the miRNA-derived dysregulation of lincRNA expression can be another potential mechanism of tumorigenesis.
MiRNAs regulate degradation of its target mRNA through binding on the 3'-un-translated regions. Since the entire noncoding RNA is un-translated, the whole mature transcript can serve as potential miRNA target sites. We therefore predicted miRNA binding sites on the entire exonic regions of the identified lincRNAs.
One advantage of RNA-seq data is that it can simultaneously measure the expression level of both mRNAs/lincRNAs and precursor miRNAs in the same sample. This provides opportunity for revealing the relationship of the mRNA/lincRNA-miRNA in their expression levels. As a precursor stage of mature miRNA, pre-miRNA expression levels can be different from their final functional form. Therefore, major conclusions from this analysis are subject to further experimental validation.
Among 30,069 mRNA-miRNA pairs whose relationships were established based on target prediction alone, 31 pairs were experimentally validated, based on the records documented in the miR2Disease database and TarBase database [54, 55]. Ten of these 31 (32.3%) pairs were among the 1,742 mRNA-miRNA pairs showed significant reverse correlation. Reported in 5 previous studies [56–60], these validated mRNA-miRNA pairs involve 5 miRNAs (hsa-miR-155, hsa-miR-29c, hsa-let-7b, hsa-miR-17, and hsa-miR-222) and 10 genes. In addition, we found many miRNAs showed difference in expression relationship with their potential target lincRNAs in tumor and normal tissues (Figure 4). This suggests that miRNA-lincRNA target can be specific to different cellular states.
cDNA libraries from 10 normal breast tissues from the Susan G. Komen Tissue Bank at the IU Simon Cancer Center and 10 triple-negative breast cancer tumors were sequenced on an Applied Biosystems (ABI) SOLiD3 sequencer. We used a customized pipeline for RNA-seq data analysis, which includes three steps, QC filtering, sequence alignment, and gene expression quantification. 1) QC filtering. We first used SOLiD™ Instrument Control Software (ICS) and SOLiD™ Experiment Tracking System (SETS) software for the read quality recalibration. Sequences containing more than two 'N' or wildcards were discarded. Each sequence was scanned for low quality regions, and if a 5 base sliding window had an average quality score less than 20, the read would be truncated at that position. Any read with a length of less than 35 bases was discarded. Our experience suggests that such strategy effectively eliminates low quality reads while retaining high quality regions. 2) Alignment. We used BFAST  as our primary alignment algorithm, because it has high sensitivity for color space data . We used a TopHat-like strategy  to align the sequencing reads that cross splicing junctions. After aligning the sequence reads to a filtering index including repeats, rRNAs (ribosomal RNA), and other sequences not of interest, we conduct sequence alignment on three levels: genomic, known junctions, and novel junctions (based on the enriched regions identified in the genomic alignment). 3) Gene expression level quantification. The expression levels of protein-coding genes and pre-miRNAs were quantified based on the total number of RNA-seq reads falling into their genomic regions. The data were further normalized as RPKM (reads per kilo-base exon model per million mappable reads) based on their length and sequencing depth.
In potential lincRNA regions, we systematically searched regions with short read enrichment as putative lincRNA exons. A putative exon must contain at least 8 reads. If the distance between two exons was less than 10 bases, the two exons would be merged together. To define the exact exon boundaries, we searched for potential splicing sites around the putative exons. Splicing donor/acceptor (GT-AG or CT-AC for RNAs on Watson or Crick strand respectively) sites within 25 bases from the 5'- and 3'- ends of two putative exons would be considered as possible splicing sites. A lincRNA junction library was constructed by connecting all the possible donor and acceptor sites within the 100,000 bases span. The minimum intronic length is set to be 70 bases. After aligning the remaining unmapped reads to the new lincRNA junction library, we used Scripture  for predicting the lincRNA exonic structures with default settings. Similar as for mRNAs and pre-miRNAs, the expression levels of lincRNAs were further quantified by the total number of sequencing reads falling into the lincRNA exonic regions. RPKM values were calculated for each lincRNA.
We calculated the conservation score across placental mammals using Siphy  on various genomic regions, including exons and introns for protein coding genes, lincRNA exons and introns, random genomic regions, and predicted miRNA binding sites on lincRNAs. We used the as the conservation score, which models the overall mutation rates across placental mammals . The input of Siphy is a multiple alignment file (MAF) and a placental mammals model file. All known gene annotations (RefSeq), genome sequences (hg18 and mm9), liftOver tool, multiple alignment files, and placental mammals' model file were downloaded from the UCSC Genome browser.
We used the Codon Substantial Frequency (CSF) score to measure the coding potential of candidate lincRNA regions [49, 50, 63]. We set a 90 bases (30 Amino Acids) sliding window across the candidate lincRNA regions. The maximum score of any window of all six possible reading frames were used as the CSF score of the whole region . A higher CSF score indicates higher potential for encoding protein-coding genes.
In which denotes the number of reads that fall into the regions of gene/lincRNA; denotes pre-miRNA expression, represented by the natural logarithm value of reads number normalized by total mapped reads, ; is a categorical variable having two possible values, indicating normal or tumor status; is an interaction factor that represents the influence of the biological conditions to miRNA regulation patterns; is the natural logarithm value of the total reads number as an offset; are intercept and slopes of the predictors and their combinations; denotes the variation that cannot be explained by the independent factors.
To allow over-dispersion, in practice, we employed a quasi-Poisson distribution. For the significance tests on the global gene/lincRNA-miRNA expression correlation, we employed a t-test on the miRNA expression coefficient . If a gene/lincRNA-miRNA pair did not show significance for , which means the relationship between miRNA and its target is not affected by the cellular states (tumor or normal), we conducted GLM model by removing the interaction term.
To correct for multiple hypothesis testing, false discovery rate (FDR) was calculated based on Benjamini-Hochberg correction . All statistical models and tests (GLM, Wilcoxon test, Chi-square test) were implemented in R: http://www.r-project.org/.
This work was supported by grant from the National Natural Science Foundation of China (60901075, 60973078), the Natural Science Foundation of Heilongjiang Province of China (LC2009C35), the Special Research Foundation for technological Innovation Talents of Harbin (2011RFLXG011), and funds from the United States National Cancer Institute grants CA113001.
This article has been published as part of BMC Medical Genomics Volume 6 Supplement 1, 2013: Proceedings of the 2011 International Conference on Bioinformatics and Computational Biology (BIOCOMP'11). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/6/S1. Publication of this supplement has been supported by the International Society of Intelligent Biological Medicine.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.