Integrative analysis of next generation sequencing for small non-coding RNAs and transcriptional regulation in Myelodysplastic Syndromes

Background Myelodysplastic Syndromes (MDSS) are pre-leukemic disorders with increasing incident rates worldwide, but very limited treatment options. Little is known about small regulatory RNAs and how they contribute to pathogenesis, progression and transcriptome changes in MDS. Methods Patients' primary marrow cells were screened for short RNAs (RNA-seq) using next generation sequencing. Exon arrays from the same cells were used to profile gene expression and additional measures on 98 patients obtained. Integrative bioinformatics algorithms were proposed, and pathway and ontology analysis performed. Results In low-grade MDS, observations implied extensive post-transcriptional regulation via microRNAs (miRNA) and the recently discovered Piwi interacting RNAs (piRNA). Large expression differences were found for MDS-associated and novel miRNAs, including 48 sequences matching to miRNA star (miRNA*) motifs. The detected species were predicted to regulate disease stage specific molecular functions and pathways, including apoptosis and response to DNA damage. In high-grade MDS, results suggested extensive post-translation editing via transfer RNAs (tRNAs), providing a potential link for reduced apoptosis, a hallmark for this disease stage. Bioinformatics analysis confirmed important regulatory roles for MDS linked miRNAs and TFs, and strengthened the biological significance of miRNA*. The "RNA polymerase II promoters" were identified as the tightest controlled biological function. We suggest their control by a miRNA dominated feedback loop, which might be linked to the dramatically different miRNA amounts seen between low and high-grade MDS. Discussion The presented results provide novel findings that build a basis of further investigations of diagnostic biomarkers, targeted therapies and studies on MDS pathogenesis.


Background
Myelodysplastic Syndromes (MDS) are a group of heterogeneous hematopoietic stem cell disorders, which often lead to acute myeloid leukemia (AML). This group of diseases is most common in the growing demographic of the late sixties-early seventies [1]. In the United States the estimated number of new cases per year is about 40,000-76,000 with an attached cost of about 30.000 USD per person and year.
MDS is characterized by ineffective bone marrow hematopoiesis, leading to cytopenias [2], with a highly variable disease progression that ranges from a slow development over many years to a rapid progression to AML within a few months. Patients can be classified into risk groups, primarily based on bone marrow myeloblast counts [3,4]. These include refractory anemia (RA), describing an early disease stage (low-grade MDS) and the refractory anemias with excess of blasts (RAEB1, RAEB2), which represent the later stages of the disease (high-grade MDS). While the median survival times are relatively long in the low and intermediate-1 classes, 97 and 63 months respectively, they are considerably shorter in the later classes with 26 for the intermediate-2 and only 11 months in the high risk group [5]. Current treatment options are rare and show only limited success. They mainly include allogeneic stem cell transplantation, treatment with hypomethylating agents and Lenalidomide.
There is increasing evidence that dysregulation of a number of different molecular pathways is involved from the disease onset, however, clearly defined mechanisms remain elusive [6]. The accumulation of cellular death is a common trait for the early stage of MDS [7,8]. It is thought to counteract the proliferation of dysfunctional cells and is the key characteristic of ineffective hematopoiesis and marrow failure [9,10]. With the continued expansion of diseased cells, genetic damage accumulates and contributes to disease progression, which may result in the transformation to AML. The later stages of MDS have been implicated with angiogenesis and reduced apoptosis [11][12][13][14][15].
Recent studies have suggested that small non-coding RNAs (sRNAs), in particular microRNAs (miRNAs), contribute to the pathogenesis and progression of MDS [16,17]. However, very limited information on sRNA expression has been reported for MDS to date. To overcome this bottleneck, we performed high-throughput next generation sequencing of small RNAs (RNA-seq) in primary marrow cells of low-and high-grade MDS patients, together with matched controls. The relatively new technology of RNA-seq [18] is the method of choice for sensitive global detection of different sRNAs across an unparalleled dynamic range, and we detected sRNAs with read counts from ten to one million reads. The data obtained here suggest important roles for Piwi-interacting RNAs (piRNA), transfer RNAs (tRNA) and miRNAs, including many known and novel micro-RNAs star (miRNA*). Further functional analysis of miRNA/miRNA* showed that these species regulate disease stage-specific molecular functions and pathways, in particular, those known to be deregulated at the gene expression level. In addition, integrative bioinformatics modeling of our experimental data and bioinformatics databases identified the disease stage-specific regulation of the polymerase II promoter by miRNAs and transcription factors (TFs). This suggested a feedback loop that might contribute to the attenuation of miRNA expression in high-grade MDS.

Patient samples
Samples were obtained from patients presenting at The Methodist Hospital. The use of marrow samples was approved by The Methodist Hospital Institutional Review Board. All research described conformed to the Helsinki Declaration.
High throughput small RNA sequencing and data analysis RNA in the 18-30 bp range was isolated from a 15 percent urea-PAGE gel, and ligated to Solexa SRA5' and SRA 3' adapters, according to the standard protocol (available: http://www.illumina.com). Briefly, the SRA5' adapter was ligated to the 5' end of the selected RNAs. The ligation products were gel purified and SRA3' adapters ligated to their 3' ends. The resulting products were also gel purified, reverse transcribed and amplified with primers containing sequences complementary to the SRA5' and SRA3' adapters, after which they were gel purified again. The size and quality of the resulting libraries were verified using an Agilent DNA1000 Bioanalyzer chip (Agilent) and sequenced on a Solexa GAIIx, using PhiX as a loading control and analyzed with the standard Illumina Pipeline version 1.4. This produced approximately 13 million reads per lane.
In our analysis we used the s_x_sequence.txt files, containing 64 bit quality-scored output per-lane. The first 20bases of these reads were parsed in Mysql database tables, and further analyses utilized the MySQL database engine.
At this stage, the database was employed to identify and count distinct reads and to export this information into fasta formatted output files (Additional files 1, 2, 3). The results were used to map each small RNA to its matching position in the human genome. A variety of algorithms exists to perform this task including ELAND, which is provided with the Solexa GAIIx. However, a particular fast and memory efficient algorithm that outperforms other approaches is Bowtie [19]. This algorithm allows filtering alignments based on mismatches and can omit reads matched to multiple positions on the reference. The human genome version GRCh37 was downloaded from the NCBI website and converted into a bowtie index file. All distinct reads were aligned to this reference sequence. We allowed for at most two mismatches and only considered reads that aligned to at most 25 positions in the genome (parameter setting v = 2 and m = 25). With this parameter set, on average, 70 percent of the short sequence reads from all three lanes had positive matches to genome coordinates, about 21 percent did not match any genome position and about 10 percent had more than 25 matches.
A number of different databases were used as annotation basis for the aligned next generation sequencing reads. Information on sequences and genome positions of miRNAs were obtained from miRBase version 14. However, since our sample preparation and sequencing protocol is not specific for miRNAs, we downloaded information on other small RNAs from the UCSC genome browser. This contains genome positions for different small RNAs, including but not limited to tRNAs, rRNAs, scRNAs, suRNAs and srpRNA in the repeatmasker track, as well as positions of known exons. The sequences of known human piRNAs were searched and downloaded from the NCBI http://www.ncbi.nlm. nih.gov.
The implemented annotation algorithm first checked if a read falls into a known miRNA loci (compare Figure 1). Unmatched reads were further aligned to primary miRNA sequences and perfect matches registered. If no match was identified, known loci for other small RNAs were searched in the following order rRNA, scRNA, sRNA, srpRNA, simple repeat and other RNAs. If a read was still uncharacterized, it was aligned against all piRNA sequences and matches returned for perfect alignments. Finally, if none of the above criteria was satisfied, positions for all human exons were first checked, if no match was identified reads were classified as unknown. The number of sequenced reads that annotated with a known RNA locus were used to represent its expression.
The read counts for miRNA and miRNA* were compared for the RA, RAEB2 and controls and significant differential expression defined following the example in [20]. We required that the ratio R of read counts in two different cells was within R 1 > 1.5 ∨ R 2 < 0.67 and the read count difference D within D 1 > 100 ∨ D 2 < -100. Consequently, over expression was defined byR 1 and D 1 and under expression byR 2 and D 2 .
Exon array profiling and data analysis A total of 50ng RNA was extracted from each analyzed sample. We used primer provided from NuGEN and followed the manufacturer's protocol for the first strand cDNA synthesis. For RNA primer annealing, their mixtures were incubated for 2 minutes at 65°C and cooled to 4°C. After cooling, cDNA synthesis cycle followed; 4°C for 1 minute, 25°C for 10 minutes, 42°C for 10 minutes, 70°C for 15 minutes, and again 4°C for 1 minute. The second stranded reaction followed immediately. After mixing the first strand solution with second strand cDNA synthesis reaction solution, the entire mixture was incubated in the thermocycler as follows: 4°C for 1 minute, 25°C for 10 minutes, 50°C for 30 minutes, 70°C for 5 minutes, 4°C. Then, using the Agencourt ® RNA-Clean ® beads, the entire cDNA was purified according to the manufacturer's protocol. For the sense transcript cDNA generation, WT-Ovation™ Exon Module (NuGEN) was used. Based on the instructions in the manufacturer's manual, 3 μg of each cDNA was mixed with the provided primers and incubated for 5 minutes at 95°C and cooled to 4°C. After mixing with enzyme NGS data analysis pipeline used for this study. In A) we show the annotation of a sequence read. It was detected about 18000 times in RAEB2 and aligned at nine different positions, spread over six chromosomes, on the human genome (green). A single alignment position is shown (red) with the used annotation hierarchy (blue). The purple callbox, details the matched loci for miRNA let-7a-1, its full primary sequence (top), its mature sequence (middle) and the aligned short read (bottom). The brown callbox shows all nine annotations, including a number of miRNAs from the has-let-7 family as well as a piRNA. In B) we compare the total RNA content measured from our high-throughput sequencing and annotation steps, on the left results for the RAEB2, in the middle results for RA and on the right results for control. solution, the entire reaction mixture was incubated as follows: 1 minute at 4°C, 10 minutes at 30°C, 60 minutes at 42°C, 10 minutes at 75°C, and cooled to 4°C. Then the ST-cDNA was purified with the QIAGEN DNA clearing kit. After the purification, fragmentation reaction was carried out using FL-Ovation™ cDNA Biotin Module V.2 according to the recommended methods. Briefly, 5 μg of cDNA was mixed with the provided enzyme mix and incubated 30 minutes at 37°C and 2 minutes at 95°C. Then the reaction was cooled to 4°C. Next, the reaction was subjected to the labeling reaction as suggested by the manufacturer. The fragmented cDNA was mixed with labeling reaction mix and incubated at 37°C for 60 minutes and 70°C for 10 minutes. Then, the reaction was cooled to 4°C and used immediately for array hybridization. For the array hybridization, instead of recommended by Affimatrix, we used the standard array protocol provided by the NuGEN exon module. For hybridization, Chips were incubated in Gene Chip Hybridization Oven 640 and underwent the washing and staining processes according to the FS450_0001 fluidic protocol. Then, the array was scanned using Gene Chip Scanner 3000 (GCS3000).
The exon arrays for control, RA and RAEB2 were loaded into the Partek Genomics Suite 6.5. The Robust Multi-array Analysis (RMA) algorithm was used for initial intensity analysis [21] (Additional file 4). We generated gene expression estimates by averaging the intensities of all exons in a gene. Differential expression was defined as discussed for the NGS analysis above.

Integrated target genes for MDS
In an earlier study Pellegatii and colleagues [22] used an Affymetrics Human Genome U133 Plus 2.0 GeneChip to assay consistently differentially expressed genes in hematopoietic stem cells (HSC) of 183 patients compared to 17 HSC of normal controls. This identified 534 probesets for RA and 4670 from RAEB2 patients. We matched these probesets to gene symbols and identified their corresponding transcript IDs on the Exon Gene-Chip. For the RA gene list, 69 probesets did not have annotated gene symbols, 103 had no corresponding transcripts and for 431 matching IDs were found. For the RAEB2 gene list, 807 probesets had no annotation, 1009 had no matching transcripts and for 3661 matching IDs were found. Altogether, this created a target gene space of 4092 probesets that were further analyzed by our bioinformatics modeling approach.
Secondary structure and location of novel miRNA* sequences The secondary structures for all miRNAs with stem-loop sequences deposited in miRBase were calculated using the Matlab Bioinformatics toolbox (version R2009a).
The locations of mature miRNAs were identified as perfect alignments between the stem-loop and mature miRNA sequence. We calculated the locations of novel miRNA* sequences based on the genome coordinates of aligned small RNA reads. We note that due to mismatches in the miRBase alignments, e.g. between the miRNA stem-loop and the human genome, some derivations between the small RNA sequencing reads and the deposited stem-loop sequences may exist. All information was visualized using the tool VARNA [23].
Prediction of miRNA-mRNA and miRNA*-mRNA pairs Information on miRNA target genes was obtained from two popular and publicly available miRNA target prediction databases. We retrieved flat files for all predicted human miRNA targets available in miRanda [24] and targets conserved over different mammalian species from targetscan [25]. In order to reduce the number of false positive predictions we considered only targets predicted by both algorithms, which resulted in about 110.000 miRNA-mRNA pairs.
In theory the majority of miRNA* are degraded in the cell. Therefore, we restricted our analysis to sequences with minimum read counts of 100. In each case, we define a 7-mer nucleotide sequences based on the small RNA read with the highest copy number throughout the control, low and high risk MDS samples. The nucleotides at positions two to eight were extracted and transformed into the RNA alphabet. The seed regions were checked for overlap with other known miRNA and miRNA* sequences and the targetscanS algorithm was used to predict miRNA*-mRNA pairs, if the seed sequence was previously unreported. In general, this algorithm performs target predictions based on perfect and conserved matches between the genes untranslated region (UTR) and the first six nucleotides of the seed sequence. It further requires that the seed region is followed either by the nucleotide A (known as a t1A anchor) or that the position eight of the alignment contains a perfect Watson-Crick pairing. On contrast, if the seed sequences matched with a previously reported miRNA or miRNA*, we used the target prediction strategy as reported above.

Prediction of transcription factor target genes
The flat files FACTOR and GENE of the commercially available database TRANSFAC v2008_2 [26] were downloaded and parsed into a MySQL database. The FAC-TOR and GENE flat files contain information on transcription factor proteins and genes regulated by transcription factors, respectively. A total of 2362 regulating factors for the human species (Homo Sapiens) were extracted and 70 entries, that did not describe proteins, but other regulatory factors were omitted. A large fraction (about 77 percent) of the remaining 2292 transcription factor proteins were mapped to Uniprot [27], either by external database ID's, or exact matches between protein names. With these accessions the protein coding gene IDs, as well as other information was downloaded automatically via a MATLAB based data retrieval algorithm implemented for this study. The transcript and probeset annotation files for the Affymetrix GeneChip Human Exon 1.0 ST Array were downloaded from the manufacture's website http://www. affymetrix.com and parsed into MySQL tables. Transcript IDs for 98 percent of the human transcription factor coding genes were extracted based on direct matches between gene names.
Genes that can potentially be up regulated when the transcription factor protein binds to a specific site in its promoter region are called transcription factor target genes. We extracted all target genes for human transcription factor proteins by joining a number of database tables. This revealed 3296 gene targets for the 2292 transcription factor proteins. We used direct matches between the target gene names, as well as additional entries, to identify corresponding transcripts on the Affymetrix GeneChip. This resulted in matches for 83 percent of the target genes.

Functional analysis for miRNA and miRNA* targets
The functional analysis of miRNA and miRNA* were performed by means of their predicted target genes. However, since the pools of potential target genes are large and suffer from high false positive rates, we selected only a limited set of genes for functional analysis. Therefore, we defined a threshold T describing the number of different miRNA or miRNA* that regulate a gene. Similar to many biological phenomena such functions are described by power laws (see Figure 2) and we aimed to select T in the exponential part of the function. This ensured that the selected genes were targeted by a large number of different miRNAs. We further tried to select at most 100 genes for the analysis. In each case, the selected target genes were imported into Ingenuity Pathway Analysis (IPA) version 8.5 and analyzed using the IPA Core Analysis algorithm.

Data integration model and detection of important gene regulators
The proposed data integration model assumed that the mRNA amount present in a cell at any given time is linearly depended on the concentration of transcriptional acting TFs and post-transcriptional acting miRNAs. Therefore, gene expression was modeled as a linear combination of these factors plus random noise, which can be expressed following a standard regression model [28] where y i is the expression of gene i, i = 1,..., G with G being the number of genes under study, (b 0 ,..., b N ) are the regression coefficients to be estimated by our model, Figure 2 Threshold for miRNA/miRNA* target gene selection. This figure describes the number of genes (x-axis) that are targeted by different miRNA*s (y-axis), for the example of RA cells. In this particular case, we selected the threshold T to be 13 miRNAs and 93 different genes were selected for functional analysis.
N sums up the number of TFs and miRNAs observed in the cells under study, ε is the noise term which is assumed an independent Gaussian random variable with expectation zero and variance s 2 , x p i was defined as where x p i is a factor associating gene i with regulator p, g p is a regulation characteristic and δ p the expression level of regulator p. The association x p i was determined by miRNA and TF target prediction and x p i was set to one if gene i was a target of regulator p, otherwise x p i was set to zero. Transcription factors generally contribute to transcription and hence higher target genes levels, therefore, g p was set to one if p was a TFs. On contrast, miRNAs are known to post-transcriptionally degrade mRNAs, hence g p was set to minus one if p was a miRNA. The expression levels δ p were determined by experiments as discussed earlier. Note that all expression values were normalized to controls and standardized to mean zero and standard deviation one.
The above regression problem was solved using the recently proposed cyclical coordinate descent algorithm, which is based on an elastic net penalty [29]. This algorithm is particularly fast and the elastic net penalty is most appropriate to handle large and sparse problems (compare Additional file 5 Figure S1) of correlated inputs. In addition, it has the beneficial property of shrinking a number of predictor values b p to exactly zero, hence integrating an effective variable selection approach, otherwise computationally expensive [30]. Note, that the penalty is weighted and that these weights were determined by cross validation.

Results and Discussion
Defining the small RNAome of Myelodysplastic Syndromes by next generation sequencing We performed high-throughput next generation sequencing of small RNAs (RNA-seq) on primary cells from control, low-grade (RA) and high-grade (RAEB2) MDS patients on an Illumina Genome Analyzer IIx (see Methods). This resulted in about thirteen million short sequence reads (length 38 bp) per sample. We implemented an annotation algorithm that integrates knowledge from diverse biological databases to characterize each RNA-seq read ( Figure 1). In brief, all reads were trimmed (length 22 bp) and aligned against the current version of the human genome (GRCh37), using the publicly available software Bowtie [19]. We allowed for at most two mismatches between the reference and read sequences. Since, the analyzed reads were relatively short and we allowed mismatches, a large number aligned to multiple genome positions (green part Figure 1). Consistent with previous analyses, we decided to discard reads having more than 25 alignment positions [31]. For annotation, we matched small sequencing reads to a set of small RNAs that included miRNAs from miRBase [32], a number of other small RNAs, including tRNAs and rRNAs, from the RepeatMasker track of UCSCs genome browser [33], as well as piRNAs from the NCBI database http://www.ncbi.nlm.nih.gov (blue callout box Figure 1). This mapping showed that the composition of the small RNAome was dramatically different from the analyzed samples, suggesting a shift in the regulation of small RNA targets during the progression of this disease.
First, the relative amounts of tRNA to rRNA were significantly larger in RAEB2 compared to RA and control (36 vs. 1.6 and 1). Since tRNAs are vital building blocks for protein synthesis and required during translation, this may indicate an increased regulation of translation at this disease stage. A recent study based on tRNA microarrays reported a 20-fold elevation of tRNAs in tumor samples versus normal samples [34]. In addition, tRNAs have been shown to inhibit cytochorme c activated apoptosis [35,36]. Taken together, the high tRNA content may contribute to the two well known characteristics of high-grade MDSs, decreased apoptosis (in contrast to low-grade MDS) and high rate of leukemia transformation. To our knowledge, this novel finding has not been reported for MDS, highlighting the combined use of next generation sequencing and the proposed annotation methodology.
Next, the obtained sequencing data demonstrated the first evidence of piRNA expression in marrow cells, and particular enrichment in low-grade MDS. Piwi-interacting RNAs are a relative newly defined class of none coding RNAs with length from 26 to 32nt [37,38]. In RA their expression increased, accounting for about nine percent of total sRNA counts, compared to about two and one percent in RAEB2 and controls, respectively. The biogenesis of piRNA is not fully understood today, but increasing evidence pinpoints that PIWI proteins are required for the accumulation of piRNAs [39][40][41][42]. In accordance with this concept, our exon array data showed that piwil1 and piwil2, two of the four human PIWI coding genes, were significantly up-regulated in RA, compared to control and high-grade MDS cells. Furthermore, recent studies have indicated that the PIWI-piRNA complex may have a role in post-transcriptional silencing damaged DNA fragments [39,43,44] and that interrupting PIWI-piRNA formation can lead to DNA double strand breaks [45]. Altogether, these findings suggest that piRNA might be used as diagnostic markers for low-grade MDS, however, further studies of their role in MDS pathogenesis are warranted.
Finally, we found an increased regulatory role of miRNAs in cells of RA and RAEB2 patients. In low-grade MDS miRNAs represented about 35 percent of the total sRNAs, an almost 4-fold increase compared to control, highlighting their role in disease pathogenesis. Similarly, miRNA percentages were elevated to about 14 percent in RAEB2 compared to control, although at a lower extent (two-fold increase). Of note, miRNAs are currently the most widely studied species of sRNAs and they are known to influence mRNA levels as well as translation. Due to their profound effects, the above findings, and taken into account insufficient literature on miRNAs in MDS, we decided to further investigate and discuss their roles in MDS.
Sequencing of additional RNAomes is required to confirm the observed trends over a larger patient population.
Detailed characterization of expressed miRNA loci and identification of novel miRNA* In the analyzed samples, reads were found at 246 different full-length primary miRNA sequence loci. These included matches at 173 different mature miRNA sites in RA, 93 in controls and 79 in RAEB2. Expression varied between samples and was generally more elevated in RA compared to RAEB2 (compare Figure 3 and Additional file 6 Tables S1,S2 and S3). The miRNA hsa-mir-125b-2 was an exception and more elevated in RAEB2 (read counts: 264 RAEB2, 87 RA and zero in controls). A single miRNA, hsa-mir-720 (fold change 10), was significantly down-regulated in RA and no copies were detected in RAEB2. Furthermore, a total of 58 miRNAs were only expressed in RA (Additional file 6 Table S4), hsa-mir-191 was unique to controls and hsa-mir-9-3 was only detected in RAEB2.
A number of high-throughput sequencing studies have recently reported the detection of miRNA*, often with higher copy numbers than their mature counterparts [46,47]. These studies further suggest that miRNA* associate with the effector complex AGO1 and regulate target gene expression. However, their roles in MDS have never been studied and we found reads matching to miRNA* motifs on 68 loci in RA, 55 in control and 24 in RAEB2 cells. In addition, multiple reads matched to uncharacterized positions on 59 different primary miRNA sequences. Interestingly, no miRNA* motifs had been reported for these loci before. Therefore, we visualized the secondary structure for their primary sequence, the location of the mature sequence and the reads clustered at uncharacterized loci (see Figure 4 Methods and Additional file 6 Table S5). Our bioinformatics analysis showed that most uncharacterized reads aligned on the miRNA* arm, opposite to the mature sequence. This has led to the definition of 59 previously unreported miRNA* candidates, of which 20 seed sequences have previously been associated in the targetscan database [48], but which did not exist in the miRBase version (v14) used for this study. We classified the remaining 39 motifs as novel miRNA* sequences (miRNA**) and folding information with locations on the miRNA arms are given in Additional file 6 Table S5.
Considering all samples together, significant expression was detected (read count at least 100) for 128 miRNA*, including 123 miRNA* in RA, 72 in control and 31 in RAEB2. Interestingly, in our RNA-seq data either the miRNA or the miRNA* (including miRNA**) arms were expressed at many miRNA loci (Additional file 5 Figure S2), suggesting a non-random and selective expression of the two different miRNA arms. Importantly, we found that 24 miRNA* were only expressed in RA, hsa-mir-24-1* was unique to control (copy number: 119) and no miRNA* was uniquely expressed in RAEB2. These miRNA* can potentially be used as biomarkers to diagnose low-grade MDS, which has significant overlapping morphologic and clinical features with reactive cytopenias, and is consequently very difficult to diagnose. However, further validation in additional patients and with different methods is needed to confirm these findings. Details for the ten miRNA* with the greatest fold changes in RA are given in Table 1 further information can be found in Additional file 6 Tables S1 and S4.

Functional roles of miRNA and miRNA* in Myelodysplastic Syndromes
In order to identify biological functions that might contribute to low-grade MDS, and can be modulated by the detected miRNA/miRNA*, we first identified target genes for 91 miRNA and 104 miRNA* that were highest expressed in RA, compared to RAEB2 and control marrow cells. The total number of uniquely regulated mRNAs was 7021 for miRNA* and 4665 for miRNA (see Methods). To select high confidence targets, each gene was further ranked according to the number of miRNAs or miRNA* that potentially control its expression or translation (see Methods). This was necessary to counteract the high false positive rates of in-silico miRNA target predictions, which for example do not consider tissue specificity. From this ranking two gene sets (Table 2), the first consisting of 74 genes controlled by 19 miRNAs and the second consisting of 93 genes regulated by at least 14 miRNA*, were selected to compare significantly enriched molecular and cellular functions (Methods). Interestingly, four out of the top five functions, with the smallest p-values, overlapped. These included "Cell Death", "Cellular Development", "Cell Cycle" and "Gene Expression" ( Table 2). The high compatibility suggested that the detected miRNA* fulfill similar roles to their mature counterparts, providing further evidence of their selectivity and biological importance.
To study the overall role of miRNA/miRNA* in RA and RAEB2 cells, their target genes were combined for further analysis. In RA, we included 94 genes regulated by at least 27, and in RAEB2 a total 83 genes targeted by at least three different miRNA/miRNA*. The difference in the required number of regulating miRNA/ miRNA* were attributed to the higher number of differentially expressed miRNA in RA (compare Additional file 5 Figure S3). Next, we identified significantly enriched molecular and cellular functions (Methods) and compared results with a recent large scale gene expression study of 183 MDS patients [22].
In both disease grades the selected genes were enriched for the molecular function of "Cell Death" (RA: 9.86E-06, RAEB2: 1.75E-04). This is in agreement with the above study, which identified apoptosis as the main deregulated process in low-grade MDS.
In addition, cell cycle regulatory genes were among the indentified target genes for both, RA and RAEB2. In accordance with the study cited above, we found that the "G2/M phase" (RAEB2:1.55 E-3) and "DNA damage checkpoint" (RAEB2: 6.67E-3) were exclusively regulated in RAEB2. On contrast the "G1 phase" (6.17E-06) was exclusive to RA.
These findings showed that miRNA/miRNA* interfere with molecular functions and pathways known to be deregulated at the transcriptomic level, as reported in the  List of ten miRNA* (see Additional file 6 Table S4 for folding information) that were detected with the largest fold changes in control and low-grad cells. We show the fold change, p-value (measuring if the number of down regulated target genes is greater than expected by chance) and target genes with regulation (bold arrows mark significant and italic non-significant regulation). We assessed the significantly down regulated genes for functional enrichment and pathways. cited gene expression study (some additional information is given in Additional file 7). In the following we proposed a bioinformatics modeling approach to further elucidate the effects of miRNA/miRNA* on the MDS transcriptome.

Computational modeling of transcriptome regulation in Myelodysplastic Syndromes
In the recent years it has become increasingly evident that miRNAs and TFs coordinate to regulate mRNA levels [49]. Consequently, we proposed a bioinformatics model that accounts for both effects. It integrated miRNA expression levels measured by next generation sequencing, gene expression measured by exons arrays, as well as data of a recently published gene expression microarray study [22]. All datasets were linked using a number of publicly and commercially available bioinformatics databases (Methods). In particular, we focused on the regulation of genes consistently differentially expressed over a large patient pool, that can be influenced by miRNAs/miRNAs* and TFs detected in our samples. The general workflow is illustrated in Figure 5 and we briefly describe the main aspects below (more information is given in the Methods section and Additional file 5 Figure S4). The analysis started with miRNA profiling in samples of RA and RAEB2 patients by next generation sequencing, as discussed earlier.
In addition, we measured gene expression and splice form variations using the Affymetrix GeneChip Human Exon 1.0 ST Array. In an earlier study the bone marrow of 55 RA and 43 RAEB patients were compared against 17 controls and genes collectively differentially expressed explored [22]. These differentially expressed genes were merged with the exon array profiling (Additional file 5 Figure S5) and a set of 385 RA and 2795 RAEB2 genes was constructed.
However, 1073 genes could not be associated with an expressed miRNA nor a TF, and thus potential secondary targets were omitted from further analysis.
The obtained expression levels for all miRNA/ miRNA*, TF and genes were normalized to their respective controls and then standardized to a mean of zero and a standard deviation of one.
To develop a bioinformatics model for gene expression regulation, we assumed that the mRNA amount, present in a cell at any time, is linearly dependent on its positive acting TFs and negative acting miRNAs [50,51]. Hence, the mRNA amounts can be modeled as a linear combination of the standardized expression levels of miRNAs and TFs. Note that all expression measures for genes, miRNA and TF were acquired from marrow cells of the same patients, whereas the other mentioned studies relied on expression levels from multiple studies of different tissues.
The resulting model for RA consisted of 1640 equations to represent each RA gene and 415 predictors (regulators, e.g. miRNA and TFs). For RAEB2 we used 1216 equations and 290 predictors.
In spite of the huge variable space, we were interested to determine how much each regulator contributes to the expression of the analyzed genes. This is a particular large regression problem and our input data, similar to other biological measurements, was highly correlated. In addition, the average number of miRNA and TF regulators per gene was small compared to the variable space (see Additional file 5 Figure S6), leading to a set of sparse equations, which posed another algorithmic difficulty.
To overcome these issues, we applied the recently proposed elastic net algorithm [29] that is specifically equipped to handle large, correlated and sparse problems. In addition, its regularization term was designed to shrink a numbers of predictors to exactly zero. This eliminates variables (miRNAs and TFs) without importance, and directly incorporates a feature selection procedure, which is otherwise computationally expensive.
In RA this strategy identified 349 variables, out of 415, with coefficients different from zero. Similarly, for RAEB2 it selected 197 out of the 290 possible variables. In order to rule out the possibility that these results are purely dependent on the expression levels of the regulators, or the number of regulated genes, we calculated a series of correlation coefficients. With Pearson Correlation Coefficients of 0.003 and 0.067 for the expression and 0.062 and 0.007 for the number of regulated genes, there were no correlations found for the low-and highgrade MDS, respectively.
The selected variables for RA included 119 miRNA*, 90 miRNA and 140 TF. In addition to the increased expression of miRNA* in RA and their potential to regulate lowgrade MDS associated biological functions and pathways, the large selection of miRNA* provides further mathematical evidence for their regulatory importance.
To identify important miRNA/miRNA* and TFs, all regulators were ranked based on the aberration of their regression coefficients from zero ( Figure 6). A large deviation, in positive or negative direction, is synonymous with a large influence on gene expression.
In RA, two subtype-specific expressed miRNAs were selected as most dominant regulators. Whereas the differentially expressed target genes of hsa-mir-1977** regulate hematopoiesis and apoptosis, hsa-miR-130a has previously been associated with the regulation of angiogenesis and platelet physiology [52,53]. The transcription factor E2F1 ranked three and is known to regulate S-phase dependent apoptosis in MDS [54,55]. Similar, Figure 5 Transcriptome analysis pipeline. Pipeline for the integrative analysis of the MDS transcriptome, further described in the text and Additional file 5 Figure S4. eight out of 13 TF within the top 20 have previously been associated with "Hematological Disease" or "Hematopoiesis".
For RAEB2, the proposed pipeline selected 46 miRNA*, 76 miRNA and 84 TFs as influential. The 20 highest ranked regulators included 16 TFs, of which 12 have previously been associated with either "Hematological Disease" or "Hematopoeisis". The top ranked TF, AP-2β, has a known role in the development of metastatic phenotypes as well as apoptosis [56]. The highest ranked miRNAs were hsa-miR-122 and hsa-miR-20b, both expressed moderately and not linked to the RAEB2 phenotype.
In conclusion, the ranking of miRNAs and TFs with known and important relation to MDS shows the power of our approach. While a few TF have already been extensively investigated in MDS, an in-depth understanding of miRNA regulation remains elusive. We are planning to further study the functions of the novel miRNAs hsa-mir-1977** and hsa-miR-130a in primary cells to confirm our findings and illustrate their roles in MDS.

Key functions regulated by miRNAs and TFs in Myelodysplastic Syndromes
In order to identify molecular processes influenced by the above regulators, we first annotated the target genes of highly ranked miRNAs/miRNA* and TFs (e.g. absolute regression coefficients greater than one) with prefiltered (e.g. having less than 500 genes) gene ontologies [57]. Then each biological process was ranked according Figure 6 MDS transcriptome regulators. Top 20 regulators determined by the proposed modeling approach. The y-axis shows the regression coefficients and the x-axis lists the regulator names. We named TF with their transfac accession and the corresponding protein name. The miRNAs are named with their miRBase accession and we marked previously known miRNA* with a single and novel miRNA* with two stars. In addition, we indicate the rounded regression coefficients on the respective regulator bars.
to the number of involved target genes. Further, genes differentially expressed in each process term were identified and overlaid with the above ranking onto Figure 7.
Some highly regulated processes, such as angiogenesis, were shared between low-and high-grade MDS. Moreover, our model indicated a few biological processes that are highly regulated in both disease subtypes, but different in the levels of their expression. For example "nuclear mRNA splicing, via spliceosome", "G1/S transition of mitotic cell cycle" or "protein import into the nucleus, docking ". Rationally, such processes are potential keys that can define functional differences in MDS subtypes.
Of particular interest was the process "negative regulation of transcription from RNA polymerase II promoters" (GO:0000122), which was the most regulated process in both MDS grades. This pathway prevents or reduces transcription of different RNAs, including miRNAs. Figure 7 MDS regulated biological processes. Illustration of biological processes that are highly regulated by influential miRNAs and TFs, as selected by our in-silico model. The left figure shows results for the low risk and the right figure for the high risk grade. In both graphs the xaxis describes the regulated process. The y-axis shows, in the black bar, the number of selected miRNA and TF that regulate a certain processes. In the red bar the number of down-and in green bar the number of up regulated genes are shown.
In RA, the majority of the differentially expressed genes in this term were down regulated (Figure 7), hence promoting transcription. By contrast in RAEB2, the majority of differentially expressed genes were up regulated, leading to a reduced RNA production.
Therefore, these results are in agreement with our earlier findings that some miRNAs were only detected, or had higher copy numbers, in RA compared to RAEB2.
Altogether, these results suggested that the differences in miRNA expression between RA and RAEB2, and potentially their downstream targets, might be the result of RNA polymerase II promoter regulation. In RA, this would indicate a potential feedback system in which expressed miRNA and TF down regulate "GO:0000122". In turn, this could increase expression of RNA and hence accumulate miRNAs. By contrast in RAEB2, the selected miRNA and TF up regulate "GO:0000122". This drives the cell to reduce RNAs synthesis and consequently decreases their overall amount.
Thus, the discussed feedback loops are a potential explanation for the high amounts of miRNA seen in RA and the much lower amount in RAEB2, two obvious discoveries from the RNA-seq analysis described above. Further studies to investigate the role of this pathway in MDS are warranted.

Conclusions
In this paper we presented the first systematic profiling for small RNAs in Myelodysplastic Syndromes using next generation sequencing on the current Illumina Genome Analyzer IIx platform. A custom data analysis pipeline that handled raw reads, sequence alignment, data storage as well as integrative read annotation was implemented. The analysis showed that the small RNAome in low-grade MDS (RA) was enriched for piRNAs, potentially protecting DNA from the accumulation of mutations, a mechanism not observed in high-grade MDS (RAEB2). By contrast, tRNAs were enriched in RAEB2, which might contribute to the characteristic reduction in apoptotic cell death at this disease stage. In both grades a number of differentially expressed miRNAs and miRNA* were detected and 48 previously unreported miRNA* exposed. In all analyzed cells, miRNA reads were often found for either the mature or the star sequence, indicating selective expression of miRNA and miRNA*. Subsequent functional analysis of target genes showed that both miRNA species (i.e. miRNA and miRNA*), regulate similar MDS stage specific molecular functions and pathways indicating that miRNA* also play important regulatory roles on the MDS transcriptome. Using integrative bioinformatics modeling, we identified miRNA species and TFs that act as important regulators for a MDS transcriptome that is consistently deregulated over a large MDS patient pool. Further ontology analysis identified the geneontology process of "negative regulation of transcription from RNA polymerase II promoters" as highly controlled in both MDS grades. Additionally, our findings suggested a potential feedback loop, where specific miRNAs and TFs regulate their own expression by either enhancing polymerase II promoter function, as seen in RA, or repressing its function, as found in RAEB2. Further studies are warranted to experimentally substantiate our observation and to develop novel biomarkers for the diagnosis and treatment of MDS.