Identification and analysis of miRNAs in human breast cancer and teratoma samples using deep sequencing
© Nygaard et al; licensee BioMed Central Ltd. 2009
- Received: 24 July 2008
- Accepted: 09 June 2009
- Published: 09 June 2009
MiRNAs play important roles in cellular control and in various disease states such as cancers, where they may serve as markers or possibly even therapeutics. Identifying the whole repertoire of miRNAs and understanding their expression patterns is therefore an important goal.
Here we describe the analysis of 454 pyrosequencing of small RNA from four different tissues: Breast cancer, normal adjacent breast, and two teratoma cell lines. We developed a pipeline for identifying new miRNAs, emphasizing extracting and retaining as much data as possible from even noisy sequencing data. We investigated differential expression of miRNAs in the breast cancer and normal adjacent breast samples, and systematically examined the mature sequence end variability of miRNA compared to non-miRNA loci.
We identified five novel miRNAs, as well as two putative alternative precursors for known miRNAs. Several miRNAs were differentially expressed between the breast cancer and normal breast samples. The end variability was shown to be significantly different between miRNA and non-miRNA loci.
Pyrosequencing of small RNAs, together with a computational pipeline, can be used to identify miRNAs in tumor and other tissues. Measures of miRNA end variability may in the future be incorporated into the discovery pipeline as a discriminatory feature. Breast cancer samples show a distinct miRNA expression profile compared to normal adjacent breast.
- Breast Cancer
- Support Vector Machine
- Hidden Markov Model
- Hairpin Structure
MicroRNAs (miRNAs) have rapidly emerged as an important class of short endogenous RNAs that act as post-transcriptional regulators of gene expression by base-pairing with their target mRNAs. The approximately 22 nucleotide (nt) long mature miRNAs are processed sequentially from longer hairpin transcripts by the RNase III ribonucleases Drosha and Dicer [2, 3]. To date more than 9539 miRNAs have been annotated in vertebrates, invertebrates and plants, of which 706 are human according to miRBase release 13.0 (March 2009) [4, 5]. Recent bioinformatic predictions combined with array analyses, small RNA cloning and Northern blot validation indicate that the total number of miRNAs in vertebrate genomes is significantly higher than previously estimated and may be in the thousands [6–8].
Several papers have already described the usefulness of miRNAs as diagnostic molecules, e.g. in cancer [9, 10], and their potential as therapeutics is being explored. One of the obvious and important goals for understanding more precisely the role and importance of miRNAs in different cellular contexts is to identify all miRNA species of a given organism and their expression profiles. The diminishing costs of high-throughput (HT) sequencing techniques are making them increasingly popular for such discovery and profiling efforts [12, 13]. As a consequence, large amounts of data will be generated, and appropriate bioinformatics methods are needed to deal with them.
We developed a pipeline combining exact and probabilistic methods to analyse 454 small RNA data for the purpose of identifying putative new miRNAs. This task can be divided into two objectives: finding and quantifying expressed genomic regions giving rise to small RNA reads, and scoring these regions as potential new miRNAs. Our approach to the first part of this problem was to retain as much sequence information as possible, despite possible sequencing errors and redundant mapping, thus increasing the amount of available data. For the second objective, we trained a Support Vector Machine (SVM) for reliable classification of potential miRNAs.
The pipeline was used to analyze deep sequencing data generated from four different human tissue samples: Breast cancer, normal adjacent breast, and two teratoma cell lines. We chose to analyze breast cancer associated miRNAs, as these represent an important case for finding miRNA based biomarkers for cancer diagnosis. The discovery of novel miRNAs, as well as understanding the expression of already known miRNAs in these tissues, is therefore of medical interest. The two teratoma cell lines were included in the analysis with the aim of identifying novel miRNAs. Given that teratoma can develop into many different tissue types, we hypothesized that these samples could potentially express different miRNAs than normal samples and thus be a good source of new miRNAs.
In several papers it has been observed that the 5' end of metazoan mature miRNAs is more precisely defined than the 3' end [14–17]. Recently Seitz et al. reported the first systematic analysis of this phenomenon in flies, showing that the population of sequences derived from known miRNAs varies significantly less in the 5' ends compared to the 3' ends. Furthermore, they showed that the observed 5' precision is not caused by precise processing by the two endonucleases Drosha and Dicer, but by an event selecting precise 5' ends at or after the 2'-O-methylation of the 3' end and Argonaute2 loading of the miRNA guide strand. These results have yet to be confirmed in a systematic way in organisms other than flies, so we investigated whether the results could be confirmed by our data.
Small RNA fractions were obtained from tissue samples of breast cancer (BC), normal adjacent breast tissue (BN), and two teratomas (CRL-7826 and CRL-7732); see Methods for details. Using 454 pyrosequencing we obtained between 64894 and 302556 sequence reads from each sample. For BC, RNA up to a length of 100 nt was extracted with the aim of identifying miRNA precursors as well as the mature product. No such precursors were found (data not shown), so for the remaining samples an upper size limit of 40 nt was used. All analyses of known miRNAs were performed with reference to miRBase 10.1 [4, 5] unless otherwise stated.
Using a hidden Markov model for cDNA-insert recognition
Mapping the reads
Given the short length and functional redundancy of miRNAs, it is not surprising that many known mature miRNA sequences map to more than one place in the genome. Of the 564 human mature miRNA sequences in miRBase 10.1, we found that 462 (82%) mapped uniquely to one place in the genome (data not shown). As a compromise between the conflicting interests of accuracy of mapping and retaining information, we chose to keep reads with up to five equally good matches. This retained 98% of the known miRNAs, and 89% of all the mapped sequence reads.
Differential expression in BC and BN samples
Differentially expressed miRNAs in BN and BC.
Among the miRNAs overexpressed in the normal breast compared to breast cancer samples, miR-22 has previously been reported as highly expressed in mammary progenitor cells. Our findings are therefore consistent with previous reports, as well as adding new miRNAs to the repertoire of miRNAs showing different expression profiles for breast cancer versus normal breast samples.
Identifying new miRNAs
Using an SVM for miRNA recognition
To identify new miRNAs in the data, we first predicted the secondary structure around a genomic match using RNAfold [34–36]. The structure prediction was done in asymmetrical windows of 15 bases to one side of the match and 60 to the other. These window lengths were chosen as the combination that generated hairpin structures for most of the known miRNAs (data not shown).
The predicted structures were then scored using an SVM trained to recognise miRNA precursor hairpins, an approach that has previously been used successfully for miRNA discovery [37–41]. Our SVM was trained on 15 different sequence and structure features, describing both the mature miRNA and its precursor (see Lindow et al., 2007 and Methods for details). The SVM was trained using known miRNAs from miRBase [4, 5] as positive examples, ensuring that miRNA-family members were kept together to avoid overfitting. In generating the negative training set, we wanted to mimic the actual task that the final SVM would be presented with: separating true miRNAs from (fragments of) various other transcripts present in the sequencing data. We therefore sampled the negative set from a combination of sources: mRNA, non-miRNA ncRNA, and random genomic locations. To make the SVM more specific for distinguishing between genuine miRNA hairpin structures and miRNA-like structures, we constrained the sampled structures by requiring that their sequence/structure features be within specific quantiles of the distributions observed for known miRNAs (detailed in Methods).
By training on these sets, we obtained a sensitivity of 80% and a specificity of 98% on an independent test set. Measures of sensitivity and, in particular, specificity are of course completely dependent on the test data used. Given the difficulty of our training and test sets, we expect the specificity on the actual data to be higher. A high specificity is particularly important in an HT analysis setting, where even a seemingly good specificity may generate many false positives.
Determining expression requirements
The use of imperfect and non-unique matches increases the number of mappings to the genome, and therefore also the risk of generating false predictions. To take this into account, we examined how to incorporate the different types of matches into an expression requirement for novel miRNA loci. There is some variation in the exact mature miRNA excised from a particular miRNA precursor (discussed below), so to evaluate expression we generated overall loci of the genomic matches, merging overlapping sequence matches into the same potential new miRNA. To avoid 'locus-walking', i.e. sequentially overlapping matches expanding a locus beyond what is reasonable for a mature miRNA, we restricted these loci to two-base overhangs compared to the match representing the most abundant read (see Methods for details).
Since the aim was to identify new miRNAs, we explored the ratio between recovered known miRNAs and the total number of recovered loci at different expression thresholds (Figure 4b). The greatest increase in this ratio was observed when going from a threshold of two to three reads for a locus, with perfectly matching reads generally having higher ratios.
To balance high recovery of miRNAs with the greater miRNA/total loci ratio obtained by requiring perfect matching, we chose to require at least one perfectly matching read for a candidate miRNA locus, and a minimum total expression (perfect or imperfect matches) of three reads. The reads could be mapped either uniquely or redundantly. This gave only a 2% loss in recovered known miRNAs compared to not requiring any perfect matches, but a three-fold increase in the miRNA/total loci ratio. All loci that passed these criteria were considered likely miRNA candidates, but for a locus to be considered a reliable de facto miRNA we additionally required that perfect matching reads were observed in at least two tissues.
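The expression criteria above can be written down compactly. The following Python fragment is only a minimal sketch of the rule as stated in the text, not the code used in the study; the function name, the tuple layout and the tissue labels are assumptions made for illustration, and the SVM score filter is deliberately left out.

```python
def classify_locus(reads, min_total=3, min_tissues=2):
    """Apply the expression requirements to one locus.

    reads: list of (tissue, is_perfect_match, read_count) tuples
           (hypothetical layout chosen for this sketch).
    """
    total = sum(count for _, _, count in reads)
    # Tissues in which at least one perfectly matching read was observed.
    perfect_tissues = {t for t, perfect, count in reads if perfect and count > 0}

    # A candidate needs >= 3 reads in total and >= 1 perfectly matching read.
    if total < min_total or not perfect_tissues:
        return "rejected"
    # A reliable novel miRNA needs perfect matches in >= 2 tissues.
    if len(perfect_tissues) >= min_tissues:
        return "novel miRNA"
    return "candidate"
```

For example, a locus with two perfect reads in BC and one perfect read in BN would be labelled a novel miRNA, whereas a locus with one perfect and two imperfect reads from BC alone would remain a candidate.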
Novel miRNAs and miRNA candidates.
Annotation of the miRNA candidates.
None of the five remaining novel miRNAs were found to be part of a cluster (no other miRNAs within 10 Kb up- and downstream). Expression and conservation for these loci were generally low, probably reflecting that most highly expressed or conserved miRNAs have been identified by now. Three loci were intronic (12783, 49828, 53356), one of these (53356) overlapping repeat annotation as well. In addition to being intronic to one gene, locus 53356 was found to also overlap the 5' UTR of an antisense gene (PNMA3, [Genbank: NM_013364]), suggesting that antisense transcription might play a part in regulation of these overlapping genes.
One locus, 19011, only overlapped repeat annotation, but was part of a ~600 base pair highly conserved block, which might be transcribed as part of the 5' UTR for the nearby gene CREBBP [Genbank: NM_004380]. Two mRNAs ([Genbank: U47741], [Genbank: U85962]) encompassing the region seem to confirm this. The repetitive CGG unit of the mature sequence was also found in the sequence of locus 53356 and candidate locus 52275.
The fifth new miRNA, locus 41039, overlapped coding exon annotation. Approximately 75 bases downstream of this locus an evolutionarily conserved secondary structure is predicted by EvoFold, indicative of other ncRNA or structure based regulation in the area.
Additional candidate miRNA loci
The remaining seven candidate loci were all represented by at least three reads, but did not fulfill our expression requirements (expression in at least two tissues) for a reliable new miRNA. Additional data will be required to confirm these as true miRNAs. Five of the candidate loci were intronic (6219, 21361, 25697, 19702, 52275), with three of them overlapping repeat annotation as well (25697, 19702, 52275). The last two candidates (32226, 23602) overlapped exons, though in the case of locus 23602 in the antisense direction. Conservation of the non-exonic candidate loci was low, with the exception of locus 52275.
Alternative precursors for known miRNAs
A mature miRNA sequence may be encoded by more than one hairpin precursor locus, e.g. the mature miR-124 is encoded by three distinct loci. Our data suggested that two known single-locus miRNAs, miR-151 and miR-500, may be encoded by more than one locus in the genome: reads corresponding to these miRNAs could be mapped both to their official, miRBase annotated precursor, and to alternative predicted hairpin structures elsewhere in the genome. In such cases short-read data alone cannot identify the true precursor with certainty, but the following features should be noted:
In contrast to the official mir-151 locus, the predicted alternative precursor showed only little conservation. Furthermore, while most reads mapped equally well to both places, there were 166 reads that mapped only to the official precursor, and only three that mapped exclusively to the alternative precursor. The data therefore lend more support to the official precursor, though miRNAs derived from the alternative precursor cannot be ruled out.
Mature miRNA end precision
The mature miRNA 5' end is less variable than the 3' end
Furthermore, the high 3' variability could not be immediately explained by 3'→5' degradation events as we found the variation to be broadly distributed on both sides of the most frequent 3' end (see Additional file 2).
Figure 6c shows the 5' versus 3' variability for individual miRNAs. Of the 219 miRNAs examined, only 7 (3%) showed most variability in the 5' end.
miRNA* 5' ends are also less variable than their 3' ends
In summary, our results on human miRNAs were consistent with those obtained for flies by Seitz et al., and support their notion that the precise 5' ends of both miRNA and miRNA* sequences are due to a narrowing selection on a more variable sequence population produced by Drosha and Dicer.
miRNA loci have less variable 5' ends than non-miRNA loci
Together these results suggest that even though the distributions overlap, the end variation measures for a given candidate locus have some discriminatory power, and could be incorporated into a probabilistic miRNA discovery pipeline, provided there are enough reads from a given locus. Five of our putative novel miRNA loci had ten or more reads, and for these we compared the end variation to the miRNA and non-miRNA distributions. Only locus 53356 (10 observed reads) had a 5' end deviation above what we observed for the known miRNAs. This suggests that it may be a less reliable candidate, though having more reads available for the end deviation calculations would be preferable.
ncRNA with miRNA-like sequence features
Kawaji et al. recently described a number of specific small RNA species derived from longer ncRNAs, in particular tRNAs, which seem to be processed in a tissue specific manner. It is interesting in this connection that when inspecting the 32 non-miRNA loci with 5' end variability less than 0.1 in our data, almost half (15) were annotated as tRNA derived, supporting the notion that these are non-random subspecies of longer tRNA transcripts. While none of these had SVM scores indicative of a miRNA-like precursor (unsurprising given their tRNA origin), we observed that a number of high scoring hairpins were predicted in other ncRNAs, with read patterns sometimes consistent with that observed for miRNAs. For example, the chromosome 17 cluster of five repetitive C/D box snoRNA U3 genes was strongly represented by a read of approximate length 22, derived from the 3' portion of the snoRNA gene (Additional file 3). Highly expressed reads from predicted hairpins were also observed in pseudogenes for rRNAs: though a diffuse pattern of reads was observed, there were some dominant species of reads (Additional file 3). It would be interesting in future studies to see if hairpin structures inside other ncRNA genes are targeted capriciously by the miRNA processing machinery. Such ncRNA genes or pseudogenes could then easily be recruited as new miRNAs during evolution.
We have analyzed small RNA sequencing data from human breast cancer tumor samples, normal adjacent breast, and two teratoma cell lines, with the aims of evaluating differential miRNA expression between breast cancer and normal adjacent breast, and of identifying novel miRNAs. Several differentially expressed miRNAs were identified, adding to the growing evidence for miRNA involvement in cancer.
To identify novel miRNAs we developed a pipeline which incorporates a hidden Markov model to extract the actual cDNA from the sequencing construct, non-heuristic mapping of the reads to the genome allowing both sequence variation and mapping to several places in the genome, and a support vector machine to score predicted hairpins. Using this pipeline we identified two putative alternative loci for known miRNAs, and 11 new miRNAs. Six of these have in the meantime been independently identified by others and included in miRBase.
Inspecting the read sequences derived from mature miRNA and miRNA* pairs, we found that the 5' ends were significantly less variable than the 3' ends. Our observations support previous results in flies suggesting that the low 5' variability is due to a selection on the 5' end sequences after Drosha and Dicer processing of the precursor miRNA. Furthermore, when inspecting reasonably expressed miRNA loci vs. non-miRNA loci, we found that the 5' end variability had some discriminatory power. As the depth of sequencing improves with the advent of still more powerful HT sequencing technologies, we envision that this feature might be integrated in future miRNA discovery pipelines.
Five different human breast cancer (BC) tissue samples (about 200 mg in total) and their corresponding normal adjacent tissues (BN) were obtained from the MAMBIO repository at Herlev University Hospital, and stored at -80°C until RNA purification and fractionation. The collection of patient samples for the MAMBIO repository was approved by the Science Ethics Committee for the former Københavns Amt and by the Danish Data Protection Agency (Datatilsynet).
The two teratoma cell lines, CRL-7826 and CRL-7732 were purchased from ATCC. The cells were grown to near confluence before total RNA extraction.
Preparation of RNA
Tissues were ground under liquid nitrogen. Small RNA (sRNA) species smaller than 200 nt were enriched with the mirVana miRNA isolation kit (Ambion, Austin, Texas, USA). RNA from the different samples was pooled into a BC and a BN library. RNA from CRL-7826 and CRL-7732 was extracted by guanidinium isothiocyanate/phenol:chloroform extraction (Trizol). The sRNAs were then separated on a denaturing 12.5% polyacrylamide (PAA) gel. The population of miRNAs with a length of 15–30 and 30–100 bases (breast cancer samples) or length 15–40 (normal breast, teratoma) was obtained by passive elution of the RNAs from the gel. The sRNAs were then precipitated with ethanol and dissolved in water.
For cDNA synthesis the sRNAs were first poly(A)-tailed using poly(A) polymerase followed by ligation of a RNA adapter to the 5'-phosphate of the sRNAs. First-strand cDNA synthesis was then performed using an oligo(dT)-linker primer and M-MLV-RNAse H- reverse transcriptase. The resulting cDNAs were then PCR-amplified to about 20 ng/μl using Taq polymerase.
The fusion primers used for PCR amplification were designed for amplicon sequencing according to the instructions of 454 Life Sciences. The correct size ranges (cDNA + flanks) were obtained by separate purification on 6% PAA gels. For pool formation the purified cDNAs were mixed in a molar ratio of 3 + 1. The concentration of the cDNA pool was 11 ng/μl dissolved in 25 μl water.
Sequencing using 454 technology
Amplicons from all preparations were sequenced using the Genome Sequencer 20 (GS20; Roche) according to the protocol provided by Margulies et al., resulting in the following number of reads for each sample: BC: 302556, BN: 136139, CRL-7826: 69013, CRL-7732: 64894.
Hidden Markov model
We built a profile HMM with states corresponding to the expected flank sequences around the cDNA insert. The cDNA insert itself was modeled by a single state with fixed, uniform emission probabilities. The model was initialized with a 0.02 probability of mutation or indels in any position. A random subset of 10000 sequences was chosen and scored with the initial model. The score was calculated as $\log(P_{\text{model}} / P_{\text{background}})$, where $P_{\text{model}}$ is calculated with the forward algorithm, and $P_{\text{background}}$ is the probability given a uniform background model. Sequences with positive score were then used to train the final model. By inspection of the score distribution and sequences, a score cut-off above which all sequences had recognizable flanking sequences was chosen. All sequences were scored by the model, and for those that passed the score cut-off, the cDNA inserts were extracted using labels predicted by the Viterbi algorithm. Inserts shorter than 18 bases were subsequently discarded, due to the difficulties of mapping such short sequences.
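To make the scoring step concrete, the fragment below sketches a log-odds score of a read under a small HMM against a uniform background. It assumes the score is the log ratio of the forward probability to the background probability, which is consistent with positive scores marking reads that the model explains better than background. The function names and the dictionary-based model layout are assumptions for this illustration, and plain probabilities are used because the example sequences are short; a real implementation would work in log space or with scaling.

```python
import math


def forward_probability(seq, states, start, trans, emit):
    """P(seq | model), summed over all state paths (forward algorithm)."""
    # Initialisation with the first symbol.
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    # Recursion over the remaining symbols.
    for sym in seq[1:]:
        f = {s: emit[s][sym] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())


def log_odds(seq, states, start, trans, emit, alphabet_size=4):
    """log(P_model / P_background) with a uniform background model."""
    p_model = forward_probability(seq, states, start, trans, emit)
    p_background = (1.0 / alphabet_size) ** len(seq)
    return math.log(p_model / p_background)


# Toy model (hypothetical): a flank state that strongly prefers 'A',
# followed by an insert state with uniform emissions.
STATES = ["flank", "insert"]
START = {"flank": 1.0, "insert": 0.0}
TRANS = {"flank": {"flank": 0.5, "insert": 0.5},
         "insert": {"flank": 0.0, "insert": 1.0}}
EMIT = {"flank": {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
        "insert": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}
```

With this toy model a read starting with the expected flank, such as AAACGT, scores above zero, while a read lacking it, such as CGTCGT, scores below zero.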
Mapping sequences to the genome
We used the suffix array based program Vmatch to map the read sequences to the genome, requiring a minimum of 90% identity over the full length alignment. For each read we selected the set of genomic matches having maximal identity for the given read. Reads mapping to more than five places with this maximal identity were discarded from further analysis.
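The match selection rule can be illustrated as follows. This is a minimal sketch operating on already computed alignments, not on Vmatch output directly; the tuple layout and the function name are assumptions made for the example.

```python
from collections import defaultdict

MIN_IDENTITY = 0.90  # minimum identity over the full-length alignment
MAX_PLACES = 5       # reads with more best-scoring matches are discarded


def select_best_matches(matches, min_identity=MIN_IDENTITY,
                        max_places=MAX_PLACES):
    """Keep, per read, only the matches with maximal identity.

    matches: iterable of (read_id, chrom, start, strand, identity) tuples
             (hypothetical layout).
    Returns {read_id: [match, ...]} for reads mapping to at most
    `max_places` equally good places.
    """
    by_read = defaultdict(list)
    for m in matches:
        if m[4] >= min_identity:
            by_read[m[0]].append(m)

    kept = {}
    for read_id, hits in by_read.items():
        best = max(h[4] for h in hits)
        best_hits = [h for h in hits if h[4] == best]
        if len(best_hits) <= max_places:
            kept[read_id] = best_hits
    return kept
```

A read with one perfect match and one 95% match keeps only the perfect match, and a read with six equally good matches is dropped entirely.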
Reads that had successfully been mapped to the genome a maximum of 5 places were annotated according to overlap with known annotations, in the following prioritized order:
- miRNA (human miRBase 10.1 coordinates from miRBase [4, 5, 45, 46])
- Other ncRNA (the sno/miRNA track downloaded from the UCSC genome browser, hg18 [47–49], and the Rfam, rnaDB, joneseddy, and noncode tracks from ncRNA.org v.2.0)
- Exon (Known Genes exon entries from the UCSC genome browser)
- Intron (reads contained within the Known Genes from the UCSC genome browser, but not in exons as described above)
- Repeat (the repeatmasker, microsatellite, and simplerepeat tables from the UCSC genome browser)
Mapped reads not overlapping any of these features were annotated as unknown.
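The prioritized annotation amounts to taking the first matching class in a fixed order. A minimal sketch, in which the class labels and the function name are chosen for illustration only:

```python
# Annotation classes in decreasing priority, following the order in the text.
PRIORITY = ["miRNA", "other ncRNA", "exon", "intron", "repeat"]


def annotate(overlaps):
    """Return the highest-priority class among those a mapped read overlaps.

    overlaps: set of class labels (hypothetical representation).
    """
    for cls in PRIORITY:
        if cls in overlaps:
            return cls
    return "unknown"
```

A read overlapping both an intron and a repeat is thus annotated as intron.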
For assessment of conservation, the conservation scores from the 'Vertebrate Multiz Alignment & PhastCons Conservation (28 Species)' track [52–54] of the UCSC genome browser were used, and the average calculated over all base positions in the mature sequence.
The Z-test described by Kal et al. was used to compare relative expression values for BN and BC. Only reads of length 19–24 were included in the analysis. Fold change was calculated based on the normalized (ppm) counts. All statistical tests were performed in R.
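As an illustration of the quantities involved, the sketch below computes ppm-normalized counts, fold change and a Z statistic for one miRNA in two libraries. It assumes that the Z-test takes the form of a pooled two-proportion test; this form, the function names and the counts used in the usage note are assumptions for the example, and the study itself performed the tests in R.

```python
import math


def ppm(count, library_size):
    """Read count normalized to parts per million of the library."""
    return 1e6 * count / library_size


def fold_change(x1, n1, x2, n2):
    """Ratio of the ppm-normalized counts of library 1 over library 2."""
    return ppm(x1, n1) / ppm(x2, n2)


def z_statistic(x1, n1, x2, n2):
    """Pooled two-proportion Z statistic (assumed form of the test).

    x1, x2: reads of one miRNA in each library
    n1, n2: total number of reads considered in each library
    """
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)  # pooled proportion under the null hypothesis
    se = math.sqrt(p0 * (1.0 - p0) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se
```

With hypothetical counts of 100 reads out of 1000 in one library and 50 out of 1000 in the other, the fold change is 2 and the Z statistic is a little above 4.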
Constructing genomic miRNA loci
To identify miRNAs among the sequenced reads, we grouped all genomic matches with read lengths of 19–24 nt (reads outside this range were ignored) into genomic loci based on their locations. Starting with the genomic match having the highest measured read abundance, we assigned this genomic match and all matches contained within +/- 2 nt to the same locus. This procedure was repeated iteratively for the remaining genomic matches, always selecting the remaining genomic match with the highest read abundance for the next locus. The genomic matches in a constructed miRNA locus represent a set of sequence variants originating from the same putative mature miRNA sequence.
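The greedy locus construction can be sketched as below. The sketch interprets "contained within +/- 2 nt" as lying entirely inside the seed match extended by two bases on each side, on the same chromosome and strand; this interpretation, the dictionary layout (half-open coordinates) and the function name are assumptions made for illustration.

```python
def build_loci(matches, overhang=2, min_len=19, max_len=24):
    """Greedily group genomic matches into loci.

    matches: list of dicts with keys chrom, strand, start, end (end
             exclusive) and count (read abundance); hypothetical layout.
    Returns a list of loci, each a list of matches with the seed first.
    """
    # Matches outside the 19-24 nt range are ignored.
    pool = [m for m in matches if min_len <= m["end"] - m["start"] <= max_len]
    pool.sort(key=lambda m: m["count"], reverse=True)

    loci = []
    while pool:
        seed = pool[0]  # most abundant remaining match
        members, rest = [], []
        for m in pool:
            same_strand = (m["chrom"], m["strand"]) == (seed["chrom"], seed["strand"])
            contained = (m["start"] >= seed["start"] - overhang
                         and m["end"] <= seed["end"] + overhang)
            (members if same_strand and contained else rest).append(m)
        loci.append(members)
        pool = rest
    return loci
```

Because every locus is anchored on its most abundant match, a chain of successively shifted matches cannot extend the locus beyond the two-base overhang.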
Resolving miRNA precursor candidates into SVM features
For each constructed miRNA locus, we examined the secondary structure by extracting two genomic sequences around the genomic match with highest abundance in the locus. The first extracted sequence started 15 bases 5' of the match and extended 60 bases 3' of the match; the second sequence had the extension lengths reversed. Each of these was treated independently in the following analysis. Each potential precursor sequence was folded with RNAfold [34–36], and the structure processed and evaluated as described in Lindow et al., 2007, calculating a number of attributes describing both sequence and structural features. In addition to the features described there, we also determined the miRNA arm and the length of the longest bulge found in the calculated miRNA:miRNA* duplex.
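The two asymmetric windows can be extracted as in the following sketch. It assumes a plus-strand match given in 0-based, half-open coordinates on a chromosome held as a Python string; minus-strand matches would additionally require reverse complementation, and the folding step itself is not shown. The function name is hypothetical.

```python
def candidate_precursors(chrom_seq, start, end, short=15, long=60):
    """Return the two candidate precursor sequences for one match.

    chrom_seq: chromosome sequence as a string (plus strand)
    start, end: match coordinates, 0-based with end exclusive
    """
    # Window 1: 15 bases 5' of the match, 60 bases 3' of it.
    first = chrom_seq[max(0, start - short):end + long]
    # Window 2: extension lengths reversed.
    second = chrom_seq[max(0, start - long):end + short]
    return first, second
```

For a 22 nt match away from the chromosome ends, both windows are 97 nt long; each would then be folded and scored independently.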
miRNA precursor classification
The known human miRNAs from miRBase 10.0 were used as positive examples for the SVM, excluding those where the mature sequence was annotated as shorter than 19 or longer than 24 bases. Based on the annotated mature miRNA coordinates, we constructed miRNA precursors by extension with 15 and 60 bases as described above. (Since we do not know in advance which arm of the precursor hairpin a novel miRNA will be on, this folding was done in both directions.) MiRNAs that did not fold into hairpin structures using these settings were discarded. The miRBase family annotation was used to ensure that family members were kept together during training.
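Keeping family members together means that all miRNAs of one family end up in the same cross-validation fold, so that near-identical sequences never appear on both the training and the evaluation side. One possible way to build such folds is sketched below; the balancing strategy, the function name and the input layout are assumptions for this illustration and are not taken from the study.

```python
from collections import defaultdict


def family_folds(family_of, k=10):
    """Distribute miRNAs over k folds without splitting any family.

    family_of: {mirna_id: family_id} with string identifiers
               (hypothetical layout).
    Returns a list of k lists of miRNA ids.
    """
    families = defaultdict(list)
    for mirna, fam in family_of.items():
        families[fam].append(mirna)

    folds = [[] for _ in range(k)]
    # Place the largest families first, each into the currently smallest fold.
    for fam in sorted(families, key=lambda f: (-len(families[f]), f)):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(sorted(families[fam]))
    return folds
```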
The negative sets were made by random sampling of precursors from three different sequence sets: A) the full human genome (hg18, March 2006 assembly); B) an ncRNA set made by concatenating the non-miRNA sequences from the 'rfamFull' and 'joneseddy' genome tracks from ncrna.org; C) a random subset of about 9000 mRNA sequences from the 'human mRNA track', table all_mrna, via the UCSC genome browser. From each set 3000–4000 hairpin structures were sampled randomly, while requiring that the values for all SVM features were within the range observed for the miRBase miRNAs. A further 600–1000 hairpins were sampled from each set requiring the values to be between the 0.01 and 0.99 quantiles of the miRNA distributions, and 100–500 hairpins were sampled requiring values within the 0.1 and 0.9 quantiles.
We used the R e1071 library implementation of an SVM with a radial kernel, using ten-fold cross-validation and evaluation on an independent test set. A locus was assigned the highest score obtained by any of its reads.
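The quantile-constrained sampling of negatives can be illustrated as follows. This is a minimal sketch assuming that each hairpin is represented by a dictionary of numeric features and that a hairpin qualifies when every feature lies inside the chosen quantile interval of the known miRNAs; the function names, the interpolation rule for the quantiles and the feature names in the usage note are assumptions.

```python
import random


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def feature_bounds(positives, q_low, q_high):
    """Per-feature (low, high) bounds taken from the known miRNAs."""
    names = positives[0].keys()
    return {n: (quantile([p[n] for p in positives], q_low),
                quantile([p[n] for p in positives], q_high))
            for n in names}


def sample_negatives(candidates, bounds, n, seed=0):
    """Randomly sample up to n candidates whose features all lie in bounds."""
    eligible = [c for c in candidates
                if all(lo <= c[f] <= hi for f, (lo, hi) in bounds.items())]
    rng = random.Random(seed)
    return rng.sample(eligible, min(n, len(eligible)))
```

Calling `feature_bounds(positives, 0.0, 1.0)` reproduces the full-range constraint, while `(0.01, 0.99)` and `(0.1, 0.9)` give the two stricter tiers, which yield negatives that look progressively more miRNA-like.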
miRNA end precision
The same measure was used with signed distances $(x_i - x_a)$ instead to infer the directionality of the dispersion relative to the annotation. For comparisons of miRNA-miRNA* end precision, the WMAD was calculated relative to the respective sequences with highest read abundance.
Thanks to Louise Christiansen for help with the sample collection, and to Marianne Fregil for excellent technical assistance. AJ, ML and AK were supported by a grant from the Novo Nordisk Foundation.
- Lee Y, Ahn C, Han J, Choi H, Kim J, Yim J, Lee J, Provost P, Rådmark O, Kim S, Kim VN: The nuclear RNase III Drosha initiates microRNA processing. Nature. 2003, 425 (6956): 415-9. 10.1038/nature01957.View ArticlePubMedGoogle Scholar
- Hutvágner G, McLachlan J, Pasquinelli AE, Bálint E, Tuschl T, Zamore PD: A cellular function for the RNA-interference enzyme Dicer in the maturation of the let-7 small temporal RNA. Science. 2001, 293 (5531): 834-8. 10.1126/science.1062961.View ArticlePubMedGoogle Scholar
- Ketting RF, Fischer SE, Bernstein E, Sijen T, Hannon GJ, Plasterk RH: Dicer functions in RNA interference and in synthesis of small RNA involved in developmental timing in C. elegans. Genes Dev. 2001, 15 (20): 2654-9. 10.1101/gad.927801.View ArticlePubMedPubMed CentralGoogle Scholar
- miRBase. [http://microrna.sanger.ac.uk/]
- Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ: miRBase: tools for microRNA genomics. Nucleic Acids Res. 2008, D154-8. 36 DatabaseGoogle Scholar
- Bentwich I, Avniel A, Karov Y, Aharonov R, Gilad S, Barad O, Barzilai A, Einat P, Einav U, Meiri E, Sharon E, Spector Y, Bentwich Z: Identification of hundreds of conserved and nonconserved human microRNAs. Nat Genet. 2005, 37 (7): 766-70. 10.1038/ng1590.View ArticlePubMedGoogle Scholar
- Berezikov E, Guryev V, Belt van de J, Wienholds E, Plasterk RHA, Cuppen E: Phylogenetic shadowing and computational identification of human microRNA genes. Cell. 2005, 120: 21-4. 10.1016/j.cell.2004.12.031.View ArticlePubMedGoogle Scholar
- Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature. 2005, 434 (7031): 338-45. 10.1038/nature03441.View ArticlePubMedPubMed CentralGoogle Scholar
- Croce CM: MicroRNAs and lymphomas. Ann Oncol. 2008, 19 (Suppl 4): iv39-40.PubMedGoogle Scholar
- Calin GA, Cimmino A, Fabbri M, Ferracin M, Wojcik SE, Shimizu M, Taccioli C, Zanesi N, Garzon R, Aqeilan RI, Alder H, Volinia S, Rassenti L, Liu X, Liu CG, Kipps TJ, Negrini M, Croce CM: MiR-15a and miR-16-1 cluster functions in human leukemia. Proc Natl Acad Sci USA. 2008, 105 (13): 5166-71. 10.1073/pnas.0800121105.View ArticlePubMedPubMed CentralGoogle Scholar
- Elmén J, Lindow M, Schütz S, Lawrence M, Petri A, Obad S, Lindholm M, Hedtjärn M, Hansen HF, Berger U, Gullans S, Kearney P, Sarnow P, Straarup EM, Kauppinen S: LNA-mediated microRNA silencing in non-human primates. Nature. 2008, 452 (7189): 896-9. 10.1038/nature06783.View ArticlePubMedGoogle Scholar
- Morin RD, O'Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu AL, Zhao Y, McDonald H, Zeng T, Hirst M, Eaves CJ, Marra MA: Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res. 2008, 18 (4): 610-21. 10.1101/gr.7179508.View ArticlePubMedPubMed CentralGoogle Scholar
- Glazov EA, Cottee PA, Barris WC, Moore RJ, Dalrymple BP, Tizard ML: A microRNA catalog of the developing chicken embryo identified by a deep sequencing approach. Genome Res. 2008, 18 (6): 957-64. 10.1101/gr.074740.107.View ArticlePubMedPubMed CentralGoogle Scholar
- Lau NC, Lim LP, Weinstein EG, Bartel DP: An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science. 2001, 294 (5543): 858-62. 10.1126/science.1065062.View ArticlePubMedGoogle Scholar
- Ruby JG, Jan C, Player C, Axtell MJ, Lee W, Nusbaum C, Ge H, Bartel DP: Large-scale sequencing reveals 21U-RNAs and additional microRNAs and endogenous siRNAs in C. elegans. Cell. 2006, 127 (6): 1193-207. 10.1016/j.cell.2006.10.040.View ArticlePubMedGoogle Scholar
- Ruby JG, Stark A, Johnston WK, Kellis M, Bartel DP, Lai EC: Evolution, biogenesis, expression, and target predictions of a substantially expanded set of Drosophila microRNAs. Genome Res. 2007, 17 (12): 1850-64. 10.1101/gr.6597907.View ArticlePubMedPubMed CentralGoogle Scholar
- Landgraf P, Rusu M, Sheridan R, Sewer A, Iovino N, Aravin A, Pfeffer S, Rice A, Kamphorst AO, Landthaler M, Lin C, Socci ND, Hermida L, Fulci V, Chiaretti S, Foà R, Schliwka J, Fuchs U, Novosel A, Müller RU, Schermer B, Bissels U, Inman J, Phan Q, Chien M, Weir DB, Choksi R, De Vita G, Frezzetti D, Trompeter HI, Hornung V, Teng G, Hartmann G, Palkovits M, Di Lauro R, Wernet P, Macino G, Rogler CE, Nagle JW, Ju J, Papavasiliou FN, Benzing T, Lichter P, Tam W, Brownstein MJ, Bosio A, Borkhardt A, Russo JJ, Sander C, Zavolan M, Tuschl T: A mammalian microRNA expression atlas based on small RNA library sequencing. Cell. 2007, 129 (7): 1401-14. 10.1016/j.cell.2007.04.040.View ArticlePubMedPubMed CentralGoogle Scholar
- Seitz H, Ghildiyal M, Zamore PD: Argonaute loading improves the 5' precision of both MicroRNAs and their miRNA* strands in flies. Curr Biol. 2008, 18 (2): 147-51. 10.1016/j.cub.2007.12.049.
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer MLI, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437 (7057): 376-80.
- Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. 1998, Cambridge, UK: Cambridge University Press.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-10.
- Kurtz S: The Vmatch large scale sequence analysis software. 2007, [http://vmatch.de]
- Bar M, Wyman SK, Fritz BR, Tewari M: MicroRNA Discovery and Profiling in Human Embryonic Stem Cells by Deep Sequencing of Small RNA Libraries. Stem Cells. 2008, 26 (10): 2496-2505. 10.1634/stemcells.2008-0356.
- Kal AJ, van Zonneveld AJ, Benes V, van den Berg M, Koerkamp MG, Albermann K, Strack N, Ruijter JM, Richter A, Dujon B, Ansorge W, Tabak HF: Dynamics of gene expression revealed by comparison of serial analysis of gene expression transcript profiles from yeast grown on two different carbon sources. Mol Biol Cell. 1999, 10 (6): 1859-72.
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. 1995, 57: 289-300.
- Volinia S, Calin GA, Liu CG, Ambs S, Cimmino A, Petrocca F, Visone R, Iorio M, Roldo C, Ferracin M, Prueitt RL, Yanaihara N, Lanza G, Scarpa A, Vecchione A, Negrini M, Harris CC, Croce CM: A microRNA expression signature of human solid tumors defines cancer gene targets. Proc Natl Acad Sci USA. 2006, 103 (7): 2257-61. 10.1073/pnas.0510565103.
- Sempere LF, Christensen M, Silahtaroglu A, Bak M, Heath CV, Schwartz G, Wells W, Kauppinen S, Cole CN: Altered MicroRNA expression confined to specific epithelial cell subpopulations in breast cancer. Cancer Res. 2007, 67 (24): 11612-20. 10.1158/0008-5472.CAN-07-5019.
- Meng F, Henson R, Lang M, Wehbe H, Maheshwari S, Mendell JT, Jiang J, Schmittgen TD, Patel T: Involvement of human micro-RNA in growth and response to chemotherapy in human cholangiocarcinoma cell lines. Gastroenterology. 2006, 130 (7): 2113-29. 10.1053/j.gastro.2006.02.057.
- Iorio MV, Ferracin M, Liu CG, Veronese A, Spizzo R, Sabbioni S, Magri E, Pedriali M, Fabbri M, Campiglio M, Ménard S, Palazzo JP, Rosenberg A, Musiani P, Volinia S, Nenci I, Calin GA, Querzoli P, Negrini M, Croce CM: MicroRNA gene expression deregulation in human breast cancer. Cancer Res. 2005, 65 (16): 7065-70. 10.1158/0008-5472.CAN-05-1783.
- Iorio MV, Visone R, Di Leva G, Donati V, Petrocca F, Casalini P, Taccioli C, Volinia S, Liu CG, Alder H, Calin GA, Ménard S, Croce CM: MicroRNA signatures in human ovarian cancer. Cancer Res. 2007, 67 (18): 8699-707. 10.1158/0008-5472.CAN-07-1936.
- Hurteau GJ, Carlson JA, Spivack SD, Brock GJ: Overexpression of the microRNA hsa-miR-200c leads to reduced expression of transcription factor 8 and increased expression of E-cadherin. Cancer Res. 2007, 67 (17): 7972-6. 10.1158/0008-5472.CAN-07-1058.
- Blenkiron C, Goldstein LD, Thorne NP, Spiteri I, Chin SF, Dunning MJ, Barbosa-Morais NL, Teschendorff AE, Green AR, Ellis IO, Tavaré S, Caldas C, Miska EA: MicroRNA expression profiling of human breast cancer identifies new markers of tumor subtype. Genome Biol. 2007, 8 (10): R214. 10.1186/gb-2007-8-10-r214.
- Ibarra I, Erlich Y, Muthuswamy SK, Sachidanandam R, Hannon GJ: A role for microRNAs in maintenance of mouse mammary epithelial progenitor cells. Genes Dev. 2007, 21 (24): 3238-43. 10.1101/gad.1616307.
- Hofacker I, Fontana W, Stadler P, Bonhoeffer S, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatshefte f Chemie. 1994, 125: 167-188. 10.1007/BF00818163.
- Zuker M, Stiegler P: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981, 9: 133-48. 10.1093/nar/9.1.133.
- McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29 (6–7): 1105-19. 10.1002/bip.360290621.
- Pfeffer S, Sewer A, Lagos-Quintana M, Sheridan R, Sander C, Grässer FA, van Dyk LF, Ho CK, Shuman S, Chien M, Russo JJ, Ju J, Randall G, Lindenbach BD, Rice CM, Simon V, Ho DD, Zavolan M, Tuschl T: Identification of microRNAs of the herpesvirus family. Nat Methods. 2005, 2 (4): 269-76. 10.1038/nmeth746.
- Sewer A, Paul N, Landgraf P, Aravin A, Pfeffer S, Brownstein MJ, Tuschl T, van Nimwegen E, Zavolan M: Identification of clustered microRNAs using an ab initio prediction method. BMC Bioinformatics. 2005, 6: 267. 10.1186/1471-2105-6-267.
- Xue C, Li F, He T, Liu GP, Li Y, Zhang X: Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics. 2005, 6: 310. 10.1186/1471-2105-6-310.
- Hertel J, Stadler PF: Hairpins in a Haystack: recognizing microRNA precursors in comparative genomics data. Bioinformatics. 2006, 22 (14): e197-202. 10.1093/bioinformatics/btl257.
- Ng KL, Mishra SK: De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics. 2007, 23 (11): 1321-30. 10.1093/bioinformatics/btm026.
- Lindow M, Jacobsen A, Nygaard S, Mang Y, Krogh A: Intragenomic matching reveals a huge potential for miRNA-mediated regulation in plants. PLoS Comput Biol. 2007, 3 (11): e238. 10.1371/journal.pcbi.0030238.
- Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D: Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2006, 2 (4): e33. 10.1371/journal.pcbi.0020033.
- Kawaji H, Nakamura M, Takahashi Y, Sandelin A, Katayama S, Fukuda S, Daub CO, Kai C, Kawai J, Yasuda J, Carninci P, Hayashizaki Y: Hidden layers of human small RNAs. BMC Genomics. 2008, 9: 157. 10.1186/1471-2164-9-157.
- Griffiths-Jones S: The microRNA Registry. Nucleic Acids Research. 2004, 32: D109-D111. 10.1093/nar/gkh023.
- Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ: miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006, 34 (Database issue): D140-4. 10.1093/nar/gkj112.
- The UCSC Genome Browser. [http://genome.ucsc.edu/]
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12 (6): 996-1006.
- Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ: The UCSC Genome Browser Database. Nucleic Acids Res. 2003, 31: 51-4. 10.1093/nar/gkg129.
- ncRNA.org. [http://www.ncrna.org/]
- rnaDB.org. [http://rnadb.org]
- Felsenstein J, Churchill GA: A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996, 13: 93-104.
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005, 15 (8): 1034-50. 10.1101/gr.3715005.
- Yang Z: A space-time process model for the evolution of DNA sequences. Genetics. 1995, 139 (2): 993-1005.
- R Development Core Team: R: A language and environment for statistical computing. 2008, Vienna, Austria: R Foundation for Statistical Computing, [http://www.R-project.org]
- Dimitriadou E, Hornik K, Leisch F, Meyer D, Weingessel A: e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. 2006.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1755-8794/2/35/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.