- Open Access
PDEGEM: Modeling non-uniform read distribution in RNA-Seq data
© Xia et al.; licensee BioMed Central Ltd. 2015
- Published: 29 May 2015
RNA-Seq is a powerful new technology to comprehensively analyze the transcriptome of any given cells. An important task in RNA-Seq data analysis is quantifying the expression levels of all transcripts. Although many methods have been introduced and much progress has been made, a satisfactory solution remains be elusive.
In this article, we borrow the idea from the Positional Dependent Nearest Neighborhood (PDNN) model, originally developed for analyzing microarray data, to model the non-uniformity of read distribution in RNA-seq data. We propose a robust nonlinear regression model named PDEGEM, a Positional Dependent Energy Guided Expression Model to estimate the abundance of transcripts. Using real data, we find that the PDEGEM fits the data better than mseq in all three real datasets we tested. We also find that the expression measure obtained using PDEGEM showed higher correlation with that obtained from alterative assays for quantifying gene and isoform expressions.
Based on these results, we believe that our PDEGEM can improve the accuracy in modeling and estimating the transcript abundance and isoform expression in RNA-Seq data. Additionally, although the stacking energy and positional weight of the PDEGEM are relatively related to sequencing platforms and species, they share some common trends, which indicates that the PDEGEM could partly reflect the mechanism of DNA binding between the template strain and the new synthesized read.
The PDEGEM model can be freely downloaded at: http://www.math.pku.edu.cn/teachers/dengmh/PDEGEM.
- Transcript Abundance
- Read Count
- Positional Weight
- Sequencing Platform
- Sequencing Preference
The transcriptome is the set of all RNA molecules in a cell including mRNA, rRNA, tRNA, microRNA and other non-coding RNA. Quantifying the expression level of mRNAs in a given cell is a fundamental problem in transcriptome research. Microarray was one of the most popular technologies to quantitatively measure gene expression in the past decade [1–4]. Despite its wide range of applications and successes, there are some key limitations in the microarray technology, for instance, poor estimation on low-expressed genes because of the effects of the cross-hybridization and background signal [5–8]. In addition, designing the array relies on the known sequence and annotation information, therefore novel transcripts are not able to be discovered and measured. An ultra-high-throughput sequencing-based technology, known as RNA-Seq , showed that it could overcome these difficulties , , [10–12]. Traditionally, for RNA-Seq data, reads are mapped to the reference genome or transcriptome or de novo assembled together to produce a genomescale transcriptional profile [13–15]. The profile can be summarized by a sequence of "counts", which stand for the number of reads whose mapping starts at that position . A basic question is how to use these counts to quantify the gene expression for each transcript. From our experience in analyzing microarray data, we believe the key to more accurate and effective expression quantification is to establish an appropriate statistical model. Mortazavi et al. proposed to use Reads Per Kilobase per Million mapped reads (RPKM) to represent the expression level of the transcript . The RPKM method gives full consideration to the transcript length and the number of mapped reads and has been widely used due to its simplicity. This method, however, ignores the variability of read coverage within each transcript hence introduced inaccuracy . Since then, more sophisticated models have been proposed for expression quantification in RNA-Seq. Marioni et al. proposed to use a constant rate Poisson model , Risso et al. proposed a method to correct GC-content bias  and Vardhanabhuti et al applied a Bayesian model . Li et al proposed a variable rates Poisson model, termed mseq, to fit the read count data. They applied two algorithms, Poisson-Linear model and MART, to implement the variable rates Poisson model. The mseq method takes into account the read count bias due to the sequence preference, therefore, it could fit the real data better.
Despite the improvement in mseq, it is oversimplified to just consider the neighborhood nucleotide sequence information alone. They also discussed the linear effect of dinucleotides in the Supplementary Material, which showed little improvement than the single-nucleotide model. However, we believe that a better approach is to consider the binding interaction between two adjacent nucleotides more precisely. An existing model that considers interaction between adjacent nucleotides is the Positional Dependent Nearest Neighborhood (PDNN) model , . The PDNN model was originally designed to model the probe-probe interaction that frequently observed in the microarray data and the results are quite favorable. The fundamental idea of the PDNN model lies that the binding affinity of a probe can be approximated as a weighted sum of its stacking energy , , . Inspired by the success of PDNN, we believed that the sequence contents may play an important role in affecting the base-level read counts in RNA-Seq experiments. Then we developed the PDEGEM, a Positional Dependent Energy Guided Expression Model to take into account the sequencing preference. The sequencing preference may lie in two aspects. The first one is the nucleotides before the starting position of reads, which could either affect the break point in random broken, or influence the amplification efficiency. The second part is the nucleotides after the starting points of reads, which may affect the binding affinity between the template strain and the generated strain. In this study, we mainly focused on the transcript abundance and isoform expression estimation, thus we only considered single-end reads, and the model could be easily extended to handle paired-end reads.
Positional Dependent Energy Guided Expression Model
We use a nonlinear PDNN model to characterize the non-uniformity of the read distribution due to systematic biological properties.
where w k is the positional weight factor depending on the position of the nucleotide in the sequence. The term b k , bk+1 is the neighborhood nucleotides that represent A, C, G, and T. is the stacking energy indicating the interaction of two adjacent nucleotides, which is the same as the stacking energy used in the PDNN model. N is the number of nucleotides nearby that were considered (N/2 nucleotides upstream and N/2-1 nucleotides downstream, also including the given nucleotide itself) to evaluate the sequencing preference.
where ν i = log(μ i ), μ i is the expression of the ith transcript and α is a constant. μ ij is the rate of the Poisson distribution and μ ij = w ij μ i , where w ij is the sequencing preference, which may depend on the surrounding sequence and the stacking energy between two adjacent nucleotides. In this model, by default, N is chosen to be 40 (Details see Additional files). As a result, our model uses 39 positional weight parameters and 16 stacking energy parameters and 1 constant totally 39 + 16 + 1 = 56 parameters, which is highly dependent on N, to model the sequencing preference of a given nucleotide.
where d is the deviance of the fitted model, and d0 is the deviance of the null model. The null model of PDEGEM is when we assume equal positional weight and use the initial stacking energy provided by PDNN model, i.e., . We apply cross validation to estimate R2 by dividing the transcripts into five parts, with the first four parts to estimate the parameters, while using the left one part to calculate the R2. The higher R2 is, the better the model could fit the data, thus the more reasonable the model is.
We perform a optimization strategy in order to get the optimal positional weight and the stacking energy
Step 1. is initialized, where L i is the length of the transcript.
Step 2. For each position on a transcript, initialize the stacking energy , to get an stacking energy vector, where N is the number of nucleotides surrounding the position. Then we merge the stacking energy vector of all positions on all transcripts in the training set to get a stacking energy matrix. Initialize positional weight to make each position the same initialized weight. Then we use Newton Method with the penalty function to obtain the optimal positional weight w = (w1, w2,..., wN −1). Next, utilize the optimal vector of positional weight w and take the stacking energy provided by PDNN as the initial value. Then we use Newton Method to iterate stacking energy ∈ = (∈1, ∈2,..., ∈16) and obtain an optimal vector of stacking energy E, which represents the energy weight of dinucleotides AA, AC, ..., T T .
Step 4. Utilize the optimal vector of positional weight and stacking energy in step 2 as the initial value of step 2. Alternate implementation of step 2 to step 3 until both positional weight and stacking energy minimize the objective equation. In the article, we use a threshold (0.1) and stop the iteration if the distance between adjacent two iterations is less than the threshold.
Where L is the length of the transcript, and n j (j = 1, 2,..., L) is the read count of the jth position.
Illustration of 3 datasets
Synthetic spike-in RNA-Seq
Two samples from human kidney
Two samples from human liver
Three samples from mouse liver
Three samples from mouse Muscle
Two samples from human Brain
This dataset consists of three sub datasets, named Wold data , Burge data , and Gimmond data  for short, which are originally utilized to validate the performance of the mseq method. We use these datasets to compare the performance of PDEGEM with the Poisson-Linear model and MART method implemented in mseq. The first two sub datasets are sequenced by Illumina's Solexa platform, while the third dataset is generated with ABI's SOLiD platform. In Wold data, the length of reads is 25 bp and the reads came from three mouse tissues: brain (w1), liver (w2) and skeletal muscle (w3). The Burge data consists of human tissues, mammary epithelial and breast cancer cell lines. They are divided into three groups. Group 1 (b1) is adipose, brain and breast. Group 2 (b2) is made of colon, heart and liver. Lymph node, skeletal muscle and testes are in group 3 (b3). The length of the read in Burge data is 32 bp. The Grimmond data is generated from two cell lines: embryonic stem cells (EB, g1) and undifferentiated mouse embryonic stem cells (ES, g2). The length of read in Grimmond data is 35 bp. In order to acquire a high qualitative dataset, some reads are truncated into 30 bp or 25 bp.
This dataset is a synthetic spike-in RNA-Seq dataset created by the External RNA Control Consortium (ERCC) , which is designed to develop a set of RNA standards and includes 96 spike-in transcripts. The reads are generated using Illumina GAII, with 36 bp in length. We chose this dataset (library 6) because the transcripts had known initial concentration and they were also detected in the 100% ERCC RNA-Seq experiment. Thus taking the initial concentration as the golden standard, we can apply Spearman's rank correlation coefficients to compare the concordance between the initial concentration of transcripts and their abundance estimated using various RNA-Seq transcript abundance estimation methods.
This dataset is originated from the Marioni et al. study , which consists of two samples from human kidney and two samples from liver tissues. We choose this dataset because these samples are profiled by both RNA-Seq and Affymetrix Microarray. The read length of Illumina RNA-Seq samples is 36 bp. We use the transcript abundance measured by microarray as the gold standard and compared with the transcript abundance obtained from RNA-Seq data estimated by different transcript abundance estimation methods.
This dataset is originated from the Mortazavi et al. study , which consists of three mouse tissues: liver, skeletal muscle and brain are sequenced on the Solexa platform. We choose this dataset because these samples are profiled by both RNA-Seq and Affymetrix Microarray . The read length of Solexa RNA-Seq samples is 25 bp. We used the isoform expression measured by microarray as the gold standard and compare with the isoform expression estimation obtained from RNA-Seq data estimated by different methods.
Extracting the count data from the original reads
To evaluate the goodness-of-fit of mseq and PDEGEM, we apply the same strategy with mseq to compare R2 of these two models. For each of the 8 sub-dataset in Data Set 1, we utilize the 100 highest expressed single-isoform transcripts selected by mseq as the training set to train the parameters.
For Dataset 2, we first map the reads of Spikein dataset to the reference using bowtie version 0.12.5 with the -best option and allow at most two mismatches (-v 2 option) . For this dataset, we first remove all positions with zero read count and we truncate the 20 bp at either beginning or end of transcript.
For the RNA-Seq data in Dataset 3, we also apply bowtie version 0.12.5 with the -best option and allow at most two mismatches (-v 2 option) to map reads to Refseq transcripts (UCSC hg19) and collect the read count of each position of each transcript. Also, the 100 highest expressed single-isoform transcripts are chosen as the training set.
For the RNA-Seq data in Date Set 4, we apply SeqMap  with two mismatches and map all sequencing reads to the mouse reference sequences (mm9, NCBI, Build 37) and download gene annotations from the RefSeq mouse mRNA database(mm9, NCBI, Build 37)
Goodness-of-fit of PDEGEM and mseq in Dataset 1
R2 for 3 models in 8 RNA-Seq samples in Dataset 1.
Trends of the stacking energy and positional weight in Dataset 1
Comparison in synthetic RNA-Seq data
Utilizing the synthetic spike in RNA-Seq data in Data Set 2, we further evaluate the performance of RPKM, MART, and PDEGEM. We compared the Spearman's rank correlation coefficients between the transcripts abundance estimated by these models from the RNA-Seq data and the "true" transcript abundance indicated by the experimental transcript concentration. Presumably, the higher the correlation is, the more accurate the method is in estimating the transcript abundance from RNA-Seq reads. There are 96 transcripts in the spike-in dataset, of which six do not have reads mapped to. Thus the 90 transcripts left were used in the following analysis. We again chose the 40 highest transcripts measured experimentally as the training set to train the positional weight and stacking energy, and these parameters are used to measure the gene expression of the other 50 transcripts.
Consistency between transcript abundance estimated by different methods and gold standards in Dataset 2.
Comparison in the Marioni dataset
First, four methods (RPKM, Poisson-Linear, MART, and PDEGEM) were chosen to estimate the transcript abundance of the four RNA-Seq datasets in Dataset 3. After that, we used the transcript abundance measured by the Affymetrix GeneChip as the gold standard to assess the performance of these four methods.
R2 for 3 models in Dataset 3.
Consistency between transcript abundance estimated by different methods and gold standards in Dataset 3.
Comparison in the Mortazavi dataset
Pearson's correlation coefficients of %ASex.
PCC1 by uniform model
PCC by MART
Sequencing preference has been recognized as an important factor in transcript abundance estimation. We believe that the stacking energy between neighborhood nucleotides can be utilized to better model sequencing preference than in the Poisson-Linear model. Therefore, we borrow the idea from the PDNN model that is first introduced to model the binding affinity between the target sequence and probe on the microarray. Combining the ideas of these two models, we construct PDEGEM for analyzing the RNA-Seq data. We assume that the starting position of a read is related to two main effects. The first one is the effect of RNA amplification, including random broken and insufficient amplification, which is presented by the nearby nucleotides prior to the starting point of the reads. The second one is the related sequencing procedure, which is presented by the nucleotides after the starting point of the reads. As for Illumina Solexa platform, it utilizes sequencing by synthesis procedure to produce RNA-Seq reads. The affinity between the template strain and the generated strain may play an important role in generating the reads.
As for sequencing technology and species, we found that different platforms or species generate slightly different positional weight and stacking energy in our model. Similar difference has been noticed in microarray data. For stacking energy, except for AA, AC, CC, and GC, the other dinucleotides show the similar trend in both Illumina and SOLiD platforms(Figure 1). The stacking energies are slightly different between the Wold Data and Burge Data that generated from the same sequencing platform Illumina. This may due to the fact that the RNA samples were extracted from different species (mouse and human), which shows that both sequencing platforms and species could influence the sequencing preference through stacking energy. As for positional weight, the results show that the nucleotides in the middle portion of the sequence have larger effects on the sequencing preference. Besides, the Wold data and Burge data generated with Illumina show almost the same trend of positional weight (Red and green lines in Figure 2), regardless of the species. However, it's significantly different for Gimmond data that are generated from the SOLiD sequencing platform (Blue lines in Figure 2). Our results shows that the sequencing platforms may have larger effect on positional weight than species. The same trend is shown in Dataset 2 and Dataset 3 (Details see Additional file 2). Although the parameters in the model are slightly related to both sequencing platforms and species, they share some common trends, which are relatively conservative across sequencing platforms and species. Thus, it nevertheless reflects that the DNA binding mechanism can be identified between the template strain and the new synthesized read. In this study, we only observe that sequencing preference are slightly related to sequencing platforms and species, while the detailed mechanism remains to be further explored with more refined biological experiments.
Furthermore, we also attempt to use 20, 60, 80, and 120 nucleotides to characterize the sequencing preference (See Additional file 1 for more details). Take the RNA-Seq data w1 in Wold Data of Dataset 1 for example, the R2 are almost the same as that of using 40 nucleotides. Besides, the stacking energy and positional weight are also very similar, as shown in Supplementary Figure S1. Therefore in order to reduce memory consumption, we chose to use 40 nucleotides in the PDEGEM. The detailed comparison is shown in the Supplementary Figure S1. In addition, the positional weight and stacking energy calculated from the spike-in RNA-Seq data generated from Illumina in Dataset 2 and four RNA-Seq samples generated with Illumina in Dataset 3 showed that they also share the similar trend as those of Wold data and Burge data in Data Set 1, which confirm the biological significance of the positional weight and stacking energy. More details could be found in Supplementary Figure S2 and Supplementary Figure S3.
In this study, we first calculate R2 through cross-validation and the R2 of PDEGEM increases significantly compared with Poisson-Linear model and MART. Also, PDEGEM is developed to estimate the gene expression at transcript level. By comparing with RPKM, Poisson-Linear model and MART, our method shows higher goodness-of-fit measured by R2 with less parameters and achieves higher Spearman's rank correlation coefficients with gene expression measured by microarray or other experiments. We believe that the improvement in R2 and transcript abundance estimation are the result of the more refined sequencing preference model.
In this model, we use Newton Method to solve the optimization equation. The convergence of Newton Method is time-consuming and this method may appear singular for particularly large amount of data. An alternative method that could be considered is the coordinate descent method. Though our model is based on single-end RNA-Seq protocol, paired-end RNA-Seq data can also be analyzed with PDEGEM.
PDEGEM characterize the non-uniformity of read distribution in RNA-Seq data. Through applying the model to RNA-Seq data generated with Illumina and SOLiD platforms, the results show that PDEGEM is relatively conservative to the sequencing platforms and species. PDEGEM combines two ideas, one is Poisson linear model which considers the sequencing preference, and the other is the PDNN model which considers the binding affinity to predict gene expression in microarray. The experimental evaluation of PDEGEM has illustrated that it can achieve higher goodness-of-fit and more accurate prediction of transcript abundance compared with other methods, which will surely benefit the downstream investigations such as detecting differentially expressed genes and gene set enrichment analysis.
Publication of this article has been funded by the National Natural Science Foundation of China (Nos. 31171262, 31428012,31471246), and the National Key Basic Research Project of China (No.2015CB910303). We thank Dr. Ming Hu and Dr. Sam Lee for providing help in preprocessing the RNA-seq data and Huaying Fang for technical support.
This article has been published as part of BMC Medical Genomics Volume 8 Supplement 2, 2015: Selected articles from the 4th Translational Bioinformatics Conference and the 8th International Conference on Systems Biology (TBC/ISB 2014). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/8/S2.
- Lockhart D, Winzeler E: Genomics, gene expression and dna arrays. NATURE-LONDON. 2000, 827-836.Google Scholar
- Stamatoyannopoulos J: The genomics of gene expression. Genomics. 2004, 84 (3): 449-10.1016/j.ygeno.2004.05.002.View ArticlePubMedGoogle Scholar
- Zhang L, Wu C, Carta R, Zhao H: Free energy of dna duplex formation on short oligonucleotide microarrays. Nucleic Acids Res. 2007, 35 (3): e18-e18. 10.1093/nar/gkl1064.View ArticlePubMedGoogle Scholar
- Chittur S: Dna microarrays: tools for the 21st century. Combinatorial chemistry and high throughput screening. 2004, 7 (6): 531-537. 10.2174/1386207043328454.View ArticlePubMedGoogle Scholar
- Royce T, Rozowsky J, Gerstein M: Toward a universal microarray: prediction of gene expression through nearest-neighbor probe sequence identification. Nucleic Acids Res. 2007, 35 (15): e99-10.1093/nar/gkm549.View ArticlePubMedPubMed CentralGoogle Scholar
- Okoniewski M, Miller C: Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinformatics. 2006, 7 (1): 276-10.1186/1471-2105-7-276.View ArticlePubMedPubMed CentralGoogle Scholar
- Li J, Jiang H, Wong W: Method modeling non-uniformity in short-read rates in rna-seq data. Genome Biol. 2010, 11 (5): R25-View ArticleGoogle Scholar
- Holt R, Jones S: The new paradigm of flow cell sequencing. Genome Res. 2008, 18 (6): 839-846. 10.1101/gr.073262.107.View ArticlePubMedGoogle Scholar
- Marioni J, Mason C, Mane S, Stephens M, Gilad Y: Rna-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18 (9): 1509-1517. 10.1101/gr.079558.108.View ArticlePubMedPubMed CentralGoogle Scholar
- Shendure J, Ji H: Next-generation dna sequencing. Nature biotechnology. 2008, 26 (10): 1135-1145. 10.1038/nbt1486.View ArticlePubMedGoogle Scholar
- Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch B, Siddiqui A: mrna-seq whole-transcriptome analysis of a single cell. Nat Methods. 2009, 6 (5): 377-382. 10.1038/nmeth.1315.View ArticlePubMedGoogle Scholar
- Marguerat S, Bähler J: Rna-seq: from technology to biology. Cellular and Molecular Life Sciences. 2010, 67 (4): 569-579. 10.1007/s00018-009-0180-6.View ArticlePubMedGoogle Scholar
- Hu M, Zhu Y, Taylor J, Liu J, Qin Z: Using poisson mixed-effects model to quantify transcript-level gene expression in rna-seq. Bioinformatics. 2012, 28 (1): 63-68. 10.1093/bioinformatics/btr616.View ArticlePubMedGoogle Scholar
- Hernandez D, François P, Farinelli L, Østerås M, Schrenzel J: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 2008, 18 (5): 802-809. 10.1101/gr.072033.107.View ArticlePubMedPubMed CentralGoogle Scholar
- Mizrachi E, Hefer C, Ranik M, Joubert F, Myburg A: De novo assembled expressed gene catalog of a fast-growing eucalyptus tree produced by illumina mrna-seq. Bmc Genomics. 2010, 11 (1): 681-10.1186/1471-2164-11-681.View ArticlePubMedPubMed CentralGoogle Scholar
- Mortazavi A, Williams B, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Risso D, Schwartz K, Sherlock G, Dudoit S: Gc-content normalization for rna-seq data. BMC Bioinformatics. 2011, 12 (1): 480-10.1186/1471-2105-12-480.View ArticlePubMedPubMed CentralGoogle Scholar
- Vardhanabhuti S, Li M, Li H: A hierarchical bayesian model for estimating and inferring differential isoform expression for multi-sample rna-seq data. Statistics in Biosciences. 2012, 1-19.Google Scholar
- Nielsen H, Gautier L, Knudsen S: Implementation of a gene expression index calculation method based on the pdnn model. Bioinformatics. 2005, 21 (5): 687-688. 10.1093/bioinformatics/bti078.View ArticlePubMedGoogle Scholar
- Zhang L, Miles M, Aldape K: A model of molecular interactions on short oligonucleotide microarrays. Nature biotechnology. 2003, 21 (7): 818-821. 10.1038/nbt836.View ArticlePubMedGoogle Scholar
- Sugimoto N, Nakano S, Katoh M, Matsumura A, Nakamuta H, Ohmichi T, Yoneyama M, Sasaki M: Thermodynamic parameters to predict stability of rna/dna hybrid duplexes. Biochemistry. 1995, 34 (35): 11 211-11 216. 10.1021/bi00035a029.View ArticleGoogle Scholar
- SantaLucia J: A unified view of polymer, dumbbell, and oligonucleotide dna nearest-neighbor thermodynamics. Proceedings of the National Academy of Sciences. 1998, 95 (4): 1460-1465. 10.1073/pnas.95.4.1460.View ArticleGoogle Scholar
- Wang E, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore S, Schroth G, Burge C: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456 (7221): 470-476. 10.1038/nature07509.View ArticlePubMedPubMed CentralGoogle Scholar
- Cloonan N, Forrest A, Kolle G, Gardiner B, Faulkner G, Brown M, Taylor D, Steptoe A, Wani S, Bethel G: Stem cell transcriptome profiling via massive-scale mrna sequencing. Nat Methods. 2008, 5 (7): 613-619. 10.1038/nmeth.1223.View ArticlePubMedGoogle Scholar
- Jiang L, Schlesinger F, Davis C, Zhang Y, Li R, Salit M, Gingeras T, Oliver B: Synthetic spike-in standards for rna-seq experiments. Genome Res. 2011, 21 (9): 1543-1551. 10.1101/gr.121095.111.View ArticlePubMedPubMed CentralGoogle Scholar
- Pan Q, Shai O, Misquitta C, Zhang W, Saltzman AL, Mohammad N, Babak T, Siu H, Hughes TR, Morris QD, et al: Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Molecular cell. 2004, 16 (6): 929-941. 10.1016/j.molcel.2004.12.004.View ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biol. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.View ArticlePubMedPubMed CentralGoogle Scholar
- Jiang H, Wong WH: Seqmap: mapping massive amount of oligonucleotides to the genome. Bioinformatics. 2008, 24 (20): 2395-2396. 10.1093/bioinformatics/btn429.View ArticlePubMedPubMed CentralGoogle Scholar
- Jiang H, Wong W: Statistical inferences for isoform expression in rna-seq. Bioinformatics. 2009, 25 (8): 1026-1032. 10.1093/bioinformatics/btp113.View ArticlePubMedPubMed CentralGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.