Five different human breast cancer (BC) tissue samples (about 200 mg in total) and their corresponding normal adjacent tissues (BN) were obtained from the MAMBIO repository at Herlev University Hospital, and stored at -80°C until RNA purification and fractionation. The collection of patient samples for the MAMBIO-repository was approved by the Science Ethics Committee for the former Københavs Amt and by the Danish Data Protection Agency (Datatilsynet).
The two teratoma cell lines, CRL-7826 and CRL-7732 were purchased from ATCC. The cells were grown to near confluence before total RNA extraction.
Preparation of RNA
Tissues were ground under liquid nitrogen. Small RNA (sRNA) species smaller than 200 nt were enriched with the mirVana miRNA isolation kit (Ambion, Austin, Texas, USA). RNA from the different samples was pooled into a BC and a BN library. RNA from CRL-7826 and CRL-7732 was extracted by guanidinum isothiocyanate/phenol:chloroform extraction (Trizol). The sRNAs were then separated on a denaturing 12,5% polyacrylamide (PAA) gel. The population of miRNAs with a length of 15 – 30 and 30 – 100 bases (breast cancer samples) or length 15–40 (normal breast, teratoma) was obtained by passive elution of the RNAs from the gel. The sRNAs were then precipitated with ethanol and dissolved in water.
For cDNA synthesis the sRNAs were first poly(A)-tailed using poly(A) polymerase followed by ligation of a RNA adapter to the 5'-phosphate of the sRNAs. First-strand cDNA synthesis was then performed using an oligo(dT)-linker primer and M-MLV-RNAse H- reverse transcriptase. The resulting cDNAs were then PCR-amplified to about 20 ng/μl using Taq polymerase.
The fusion primers used for PCR amplification were designed for amplicon sequencing according to the instructions of 454 Life Sciences. The correct size ranges (cDNA + flanks) were obtained by separate purification on 6% PAA-gels. For pool formation the purified cDNAs were mixed in a molar ratio of 3 +1. The concentration of the cDNA pool was 11 ng/μl dissolved in 25 μl water.
Sequencing using 454 technology
Amplicons from all preparations were sequenced using the Genome Sequencer 20 (GS20; Roche) according to the protocol provided by Marguiles et al. , resulting in the following number of reads for each sample: BC: 302556, BN: 136139, CRL-7826: 69013, CRL-7732: 64894.
The sequence data is freely accesible and can be downloaded from http://people.binf.ku.dk/~krogh/bmc454paper/. Novel miRNAs are being submitted to miRBase .
Hidden Markov model
We built a profile HMM with states corresponding to the expected flank-sequences around the cDNA insert. The cDNA insert itself was modeled by a single state with fixed, uniform emission probabilities. The model was initialized with a 0.02 probability of mutation or indels in any position. A random subset of 10000 sequences was chosen and scored with the initial model. The score was calculated as
, where P
is calculated with the forward algorithm , and P
is the probability given a uniform background model. Sequences with positive score were then used to train the final model. By inspection of the score distribution and sequences, a score cut-off above which all sequences had recognizable flanking sequences was chosen. All sequences were scored by the model, and for those that passed the score cut-off, the cDNA inserts were extracted using labels predicted by the Viterbi algorithm . Inserts shorter than 18 bases were subsequently discarded, due to the diffculties of mapping such short sequences.
Mapping sequences to the genome
We used the suffix array based program Vmatch  to map the read sequences to the genome requiring a minimum of 90% identity over the full length alignment. For each read we selected the set of genomic matches having maximal identity for the given read. Reads mapping more than five places with this maximal identity were discarded from further analysis.
Reads that had successfully been mapped to the genome a maximum of 5 places were annotated according to overlap with known annotations, in the following prioritized order:
MiRNA (Human miRBase 10.1 coordinates from miRbase [4, 5, 45, 46]. Other ncRNA (the sno/miRNA track downloaded from the UCSC genome browser, hg18 [47–49], and the Rfam, rnaDB, joneseddy, and noncode tracks from ncRNA.org v.2.0 ). Exon (Known Genes exon entries from the UCSC genome browser). Intron (reads contained within the Known Genes from the UCSC genome browser, but not in exons as described above). Repeat (the repeatmasker, microsatellite, and simplerepeat tables from the UCSC genome browser).
Mapped reads not overlapping any of these features were annotated as unknown.
Reads were also mapped against human piRNAs contained in XMLpiRNAV2.zip from rnaDB.org  using Vmatch  requiring exact matches.
For assessment of conservation, the conservation scores from the 'Vertebrate Multiz Alignment & PhastCons Conservation (28 Species)' track [52–54] of the UCSC genome browser was used, and the average calculated over all base positions in the mature sequence.
The Z-test described in  was used to compare relative expression values for BN and BC. Only reads of length 19 – 24 were included in the analysis. Fold change was calculated based on the normalized (ppm) counts. All statistical tests were performed in R .
Constructing genomic miRNA loci
To identify miRNAs among the sequenced reads, we grouped all genomic matches with read lengths between 19 – 24 nt (reads outside this range are ignored) into genomic loci based on their locations. Starting with the genomic match having highest measured read abundance, we assigned this genomic match and all matches contained within +/- 2 nt to the same locus. This procedure was repeated iteratively for the remaining genomic matches, always selecting the remaining genomic match with highest read abundance for the next locus. The genomic matches in a constructed miRNA locus represent a set of sequence variants originating from the same putative mature miRNA sequence
Resolving miRNA precursor candidates into SVM features
For each constructed miRNA locus, we examined the secondary structure by extracting two genomic sequences around the genomic match with highest abundance in the locus. The first extracted sequence started 15 bases 5' of the match and extended 60 bases 3' of the match – the second sequence had the extension lengths reversed. Each of these was treated independently in the following analysis. Each potential precursor sequence was folded with RNAfold [34–36], and the structure processed and evaluated as described in , calculating a number of attributes describing both sequence and structural features. In addition to the features described in , we also determined the miRNA arm and the length of the longest bulge found in the calculated miRNA:miRNA* duplex.
miRNA precursor classification
The known human miRNAs from miRBase 10.0 were used as positive examples for the SVM, excluding those where the mature sequence was annotated as shorter than 19 or longer than 24 bases. Based on the annotated mature miRNA coordinates, we constructed miRNA precursors by extension with 15 and 60 bases as described above. (Since we do not know in advance which arm of the precursor hairpin a novel miRNA will be on, this folding was done in both directions). MiRNAs that did not fold into hairpin structures using these settings were discarded. The miRBase  family annotation was used to ensure that family members were kept together during training.
The negative sets were made by random sampling of precursors from three different sequence sets: A) the full human genome (hg18, March 06 assembly). B) a ncRNA set made by concatenating the non-miRNA sequences from the 'rfamFull' and 'joneseddy' genome tracks from ncrna.org . C) A random subset of about 9000 mRNA sequences from the 'human mRNA track', table all mrna, via the UCSC genome browser. ¿From each set 3000 – 4000 hairpin structures were sampled randomly, while requiring that the values for all SVM features were within the range observed for the miRBase miRNAs. A further 600 – 1000 hairpins were sampled from each set requiring the values to be between the 0.01 and 0.99 quantiles of the miRNA distributions, and 100 – 500 hairpins were sampled requiring values within the 0.1 and 0.9 quantiles.
We used the R e1071 library  implementation of an SVM with radial kernel, using ten-fold cross-validation and evaluation on an independent test set. A locus was assigned the highest score obtained by any of its reads.
miRNA end precision
For miRBase mature miRNAs, reads mapping to the annotated mature region relaxed by +/- 4 nucleotides in both ends were analyzed. We only examined miRNAs having mapped reads with a summed expression count of at least 3. As a dispersion measure of the mature miRNA end precision we used the weighted mean absolute deviation (WMAD) with weights defined by the expression counts of the reads. Let x
denote the annotated mature miRNA end position (e.g. 5' end), and for each read r
mapping to the region we denote the expression count c
and the end position x
The same measure was used with signed distances (x
) instead to infer the directionality of the dispersion relative to the annotation. For comparisons of miRNA-miRNA* end precision, the WMAD was calculated relative to the respective sequences with highest read abundance.