Volume 8 Supplement 3
Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2014): Medical Genomics
Hybrid de novo tandem repeat detection using short and long reads
 Guillaume Fertin^{1},
 Géraldine Jean^{1},
 Andreea Radulescu^{1}Email author and
 Irena Rusu^{1}
DOI: 10.1186/175587948S3S5
© Fertin et al.; 2015
Published: 23 September 2015
Abstract
Background
As one of the most studied genome rearrangements, tandem repeats have a considerable impact on genetic backgrounds of inherited diseases. Many methods designed for tandem repeat detection on reference sequences obtain high quality results. However, in the case of a de novo context, where no reference sequence is available, tandem repeat detection remains a difficult problem. The short reads obtained with the secondgeneration sequencing methods are not long enough to span regions that contain long repeats. This length limitation was tackled by the long reads obtained with the thirdgeneration sequencing platforms such as Pacific Biosciences technologies. Nevertheless, the gain on the read length came with a significant increase of the error rate. The main objective of nowadays studies on long reads is to handle the high error rate up to 16%.
Methods
In this paper we present MixTaR, the first de novo method for tandem repeat detection that combines the highquality of short reads and the large length of long reads. Our hybrid algorithm uses the set of short reads for tandem repeat pattern detection based on a de Bruijn graph. These patterns are then validated using the long reads, and the tandem repeat sequences are constructed using local greedy assemblies.
Results
MixTaR is tested with both simulated and real reads from complex organisms. For a complete analysis of its robustness to errors, we use short and long reads with different error rates. The results are then analysed in terms of number of tandem repeats detected and the length of their patterns.
Conclusions
Our method shows high precision and sensitivity. With low false positive rates even for highly erroneous reads, MixTaR is able to detect accurate tandem repeats with pattern lengths varying within a significant interval.
Keywords
tandem repeat secondgeneration sequencing thirdgeneration sequencing de Bruijn graphBackground
Tandem repeats are one of the most studied genome rearrangements [1–3]. Frequently located in genes or regulatory regions, tandem repeats (TR) play an important role in gene expression, genome evolution and transcriptional regulation [2–5]. They represent an important source of genome variation, help determine individuals inherited traits and are involved in genetically inherited diseases [6].
A TR is a sequence from a genome made of several (not necessarily exact) copies of the same pattern, located next to each other. In the case where each copy is identical to the pattern, the sequence is called an exact tandem repeat (ETR). Otherwise, it is called an approximate tandem repeat (ATR). Depending on the pattern length, TR are classified into microsatellites (110 bp), minisatellites (10100 bp) and satellites (>100 bp) [7].
The problem of TR detection has been the subject of many studies due to the TR important roles. The search of TR in genomes can be done in two contexts: either in the case where a reference sequence exists or in the de novo context (i.e. without a reference sequence).
TR detection on reference sequences of genomes, chromosomes or other types of assembled DNA sequences can be performed by a significant number of existing tools (see review [8] for examples). Most of these software contain two main steps. At first, potential TR are searched with either an exhaustive [9, 10] or a heuristic algorithm [11]. Some of these search methods are specifically designed for ETR such as BWtrs [9], while others can search for both ETR and ATR such as TRF [11] and mreps [10]. A filtering step is then executed for identifying the biologically significant repeats. Simple methods such as a length threshold [10] or more complex ones such as statisticsbased models [9] may be used. For sake of simplicity we call all these methods TR sequence search tools.
In the case of a de novo context, TR detection remains a difficult problem [12]. A sequencing platform produces numerous possible overlapping sequences, called reads, from the initial DNA sequence (called the target DNA sequence). Developed in 2005, the secondgeneration sequencing (SGS for short) technologies were developed to lower the sequencing time and cost (see [13] for a description). However, the resulting reads have short length (100 bp to 250 bp) and high error rate. One of the first sequencing technology, Sanger, can have an error rate as low as 0.001%, while reads from SGS technology can have up to 2.8% errors [14]. The loss of information, due both to the short length of the reads and the error rate, is counterbalanced with high values for the coverage depth, that is, the average number of reads covering an arbitrary position in the target DNA sequence. In this context, obtaining a correct de novo assembly of SGS data sets from complex genomes represents a difficult challenge [15]. Indeed, the numerous repeated sequences occurring in a genome are a main cause of errors in an assembly [16–18].
In order to surpass these challenges, several de novo assemblers were developed to improve the repeat assembly (see [19–21] for examples). Their procedures are usually executed at the end of the assembly process and are especially developed for repeat assembly by using pairedend information [12]. Most SGS platforms propose pairedend sequencing protocols which provide pairs of reads. The reads from a pair are separated on the target DNA sequence by a known approximate distance called insert size (usually ranging from 500 bp to 1 kbp). The insert size being much longer than the read length, the pairedend reads are used to span regions that contain long repeats. The de novo assemblers using this type of data are able to assemble a part of the repeats of the original genome, but in their final output many repeats are still missing [12]. Our previous study [22] focuses on these missing repeats by retrieving ETR left unresolved by the assemblers. The algorithm presented in this study, called DExTaR, improves the detection of ETR after a de novo de Bruijn assembly.
The read length limitation of SGS was tackled by the Pacific Biosciences (PacBio) technologies [23]. As part of the thirdgeneration sequencing platforms, PacBio long reads have a variable length ranging from 1 kbp to 20 kbp. This range of lengths can easily span most repeats. Unfortunately, the read length expansion came with a significant increase in the error rate. The mean error rate is of approximately 16% [24]. Therefore, most of the research work focusing on PacBio reads uses the higher quality of SGS short reads, either for correcting the PacBio reads [25–28] or for assembling in a de novo context [29, 30]. The problem of detecting TR sequences by using the long PacBio reads has been analysed in the context of an existing reference sequence [31]. Also, studies show that complex regions of genomes can be reconstructed using PacBio reads when finishing genomes after a global assembly [32–34]. In the remaining of this article we will use the term short reads to describe SGS reads and long reads for the reads generated by the Pacific Bioscience technologies.
In this paper we present the first de novo method for TR detection which uses both short and long reads and performs only local assemblies. Our hybrid algorithm, called MixTaR, combines the highquality of short reads and the large length of long reads for an efficient TR detection. By detecting both ETR and ATR, MixTaR goes further that DExTaR [22] in the TR detection. Moreover, unlike DExTaR, MixTaR does not need a previous global assembly.
MixTaR starts by building a de Bruijn graph from the short reads. The de Bruijn graph is constructed from the overlaps between fragments of reads of a certain length k, called kmers. Due to this fragmentation of the reads, the de Bruijn graph allows a detailed analysis of complex regions of the target DNA sequence. In particular, a sufficiently long ETR appears in the de Bruijn graph as a cycle. Therefore, we analyse the cycles in the de Bruijn graph for possible TR candidates. This step can only provide potential TR patterns due to the reduced length of the short reads. Thus, we use the long reads to verify the existence of TR containing these patterns and to deduce the length of these TR. The corrected TR sequence is then obtained by a local greedy assembly of a selected set of short reads.
Methods
In this paper, we present MixTaR, our solution to the DE NOVO HYBRID TANDEM REPEAT DETECTION problem, defined as follows.
DE NOVO HYBRID TANDEM REPEAT DETECTION
Input: A set SR of short reads and a set LR of long reads, both obtained from the sequencing of a DNA sequence D.
Requires: The set of TR from D involving the reads in SR and LR and their reverse complements.
For the second step we add to the set LR the reverse complements of the long reads and we use the set to validate these potential TR patterns. Because of the significant length of the long reads, most of the TR from D are spanned by at least one read in LR. The main difficulty of using long reads is to deal with their error rate. Thus, we start by searching approximate copies of each potential TR pattern in each read from LR. We then validate a potential TR pattern if we identify at least two approximate copies of it located next to each other in at least one read from LR.
The third and final step is to identify the exact sequence of the TR containing the validated patterns. To do so, we use again the set SR. The short reads overlapping a validated pattern are used in a local greedy assembly, from which we obtain the precise sequence of the TR containing the pattern.
Definitions for the main notions used in the paper are given in the next paragraphs, followed by a detailed description of each step.
Definitions
Let s be a string of length s over an alphabet Σ. For an integer i with 1 ≤ i ≤ s, s[i] is the letter at position i in s and s[i, l] is the substring of s of length l that starts at position i (and thus ends at position i + l − 1). The prefix and suffix of length l of the string s are denoted as Pref (s, l) and Suff (s, l). For two strings s_{1} and s_{2}, we denote their concatenation as s_{1} + s_{2}. The alignment score between s_{1} and s_{2} is obtained from an alignment [35] of s_{1} and s_{2} by adding 1 for each matching position, and 1 for each mismatch, insertion and deletion. The maximum semiglobal alignment score denoted as sgA_{ max } (s_{1}, s_{2}) represents the highest alignment score obtained from the possible alignments of s_{1} and a substring of s_{2}. A maximum overlap alignment score oA_{ max }(s_{1}, s_{2}, l_{ min }) between s_{1} and s_{2} is the highest alignment score obtained from the possible alignments of Pref (s_{1}, l) and Suff (s_{2}, l), where l ≥ lmin. A string s_{1} of length l occurs in a string s_{2} if there is a position 1 ≤ i ≤ (s_{2} − l + 1) for which s_{1} = s_{2}[i, l]. The number of occurrences occ(s_{1}, s_{2}) of a string s_{1} in a string s_{2} is the number of positions i for which s_{1} = s_{2}[i, l]. In the case of a set of strings, the number of occurrences occ(s, ζ) of a string s in the set ζ of strings is equal to ∑_{ t∈ζ } occ(s, t).
De Bruijn graph. A read is a string over the alphabet Σ = {A, C, G, T}. Let R be a nonempty set of reads of length l, together with their reverse complements, obtained after sequencing a DNA sequence D. We denote δ the coverage depth used for obtaining R. Given a positive integer k ≤ l, a kmer is a substring r[i, k] of a read r ∈ R such that 1 ≤ i ≤ l − k + 1. We denote S(k, r) the set of kmers of a string r. In the case of the set R of reads, the set S(k, R) of kmers of R is ∪ _{ r ∈ R } S(k, r). The de Bruijn graph [36]G^{ k } (R) is the directed graph constructed from S(k, R) as follows. Each kmer from S(k, R) is represented as a vertex v in G^{ k } (R). We consider the notation v for both the vertex and its corresponding kmer sequence. An arc α = (v_{ i }, v_{ j }) from v_{ i } to v_{ j } is built between two kmers iff Suff (v_{ i } , k − 1) = Pref (v_{ j } , k −1) and occ(v_{ i } +v_{ j } [k], R) ≥ 1. Because of this arc construction, each read in R corresponds to an oriented path in G^{ k } (R). The frequency of an arc α = (v_{ i }, v_{ j }) in G^{ k } (R) is f (α, R) = occ(v, R)/δ where v = v_{ i } + v_{ j } [k]. Indeed, we consider that the frequency of an arc is equal to the number of times the arc has to be traversed for assembling D.
Tandem Repeat. A pattern p is a string of length p ≥ 2 over the alphabet Σ = {A, C, G, T}. An exact tandem repeat (ETR) of the pattern p is a DNA sequence ε consisting of two or more consecutive copies of p, each copy being identical to p. Otherwise, if the copies of p in ε are not identical to each other but the maximum alignment score between each copy of p and p remains above a specific threshold, then ε is an approximate tandem repeat (ATR) of the pattern p. A tandem repeat (TR) of a pattern p is either an ETR or an ATR of p.
Let P = {p_{1}, p_{2}, . . . , ${p}_{{n}_{p}}$} be the list of copies of p in ε. When p has at least one complete copy in ε, its last copy ${p}_{{n}_{p}}$ can be partial, meaning that ${p}_{{n}_{p}}$ can be a copy (exact or approximate depending on the type of TR of ε) of a prefix of p. The copy number cn_{ p } of p in ε is the decimal number cn_{ p } = (n_{ p } − 1) + ${p}_{{n}_{p}}$/p and ε is then defined by the pair (p, cn_{ p }). In this article, we consider that for every 1 ≤ i ≤ n_{ p }, p_{ i } is primitive [37], meaning that p_{ i } does not contain itself an ETR. Moreover, when searching for TR in a target DNA sequence, we look for maximal TR in the sense that cn_{ p } is maximal.
Property 1 (Arc frequency property) Let v_{ 1 } = ε[1, k] and ve = Suf f (ε, k) be the first and last kmers of the ETR ε = (p, cnp) with ε ≥ p + k. In the cycle formed by ε in G^{ k } (ε), the frequency of each arc of the path from v_{ 1 } to v_{ e } is the same. If × denotes this frequency, then the frequency of each arc of the path from v_{ e } to v_{ 1 } is equal to × − 1.
Property 1 is respected for both p ≤ k and p > k as illustrated in Figure 2 for two different values of k. In both cases, the number of vertices in the cycle is equal to p due to the unique presence of each kmer in a de Bruijn graph.
Now that the definitions used in this paper are presented, we describe in detail the three main steps of MixTaR pipeline (Figure 1).
Pattern detection
SR is the set of short reads together with their reverse complements obtained after sequencing the DNA sequence D. As mentioned before, the ETR from D with length at least p + k (where p is the pattern of the ETR) form in a de Bruijn graph G^{ k } (SR) elementary cycles respecting Property 1. These ETR can represent substrings of longer ATR of the same pattern p in D. In this case, approximate copies of p are located next to the ETR in D and the ETR is considered as internal to a longer TR. In the following, these TR that contain an internal ETR forming a cycle in the de Bruijn graph are named robust TR. Our algorithm MixTaR first searches for ETR, then for the robust TR containing them.
Cycle search algorithm We consider that each elementary cycle in de Bruijn graph represents a potential ETR from D. Thus, after constructing the de Bruijn graph G^{ k } (SR) for a specific value of k, we start searching for elementary cycles in G^{ k } (SR). The sequencing errors in the set SR may introduce erroneous vertices in G^{ k } (SR). In order to eliminate them, we consider in our search only the vertices for which occ(v, SR) ≥ σ, where σ is a parameter with a value depending on δ, the coverage depth of SR. Moreover, we sort the list of vertices in G^{ k } (SR) in a descending order of their number of occurrences in SR.
A de Bruijn graph of a complex organism has a significant size, with hundreds of thousands of cycles. Hence, in order to detect a maximum number of elementary cycles in a limited amount of time, we use one of the most efficient cycle detection methods, namely Johnson's algorithm [38]. Johnson's algorithm explores the graph from every vertex v and returns the cycles that contain v and that are not yet detected. To limit this exploration we introduce three parameters: η, Λ_{ max } and λ_{ max }. From each vertex v we start by searching the cycles of maximal length of Λ_{ max }. After exploring η arcs from v and if there are still arcs to be explored, the algorithm searches for the remaining cycles from v of maximal length of λ_{ max } (λ_{ max } ≤ Λ_{ max }).
As mentioned before and explained in the next paragraph, after analysing a cycle c of l vertices we can obtain a pattern p of length 2 ≤ p ≤ l. Therefore, in order to obtain patterns with significant lengths, we have to maximize the number of cycles potentially containing an ETR of maximal length of Λ_{ max }. For this, we consider that the arcs of the cycles containing an ETR have a high value for their frequency in SR. Thus, from each vertex we start by exploring its output arcs in descending order of their frequency in SR. In this way, we traverse the arcs having the highest frequency in the first part of our cycle search. The parameters Λ_{ max }, η and λ_{ max } are set depending on the complexity of G^{ k } (SR) and on the running time allowed by the user for the cycle search.
Each vertex v from which we start the exploration of the graph has to satisfy an additional condition: occ(v, LR) >0. Because of the order in which vertices are considered, the vertex v from which the algorithm searches from cycles is also the vertex with the highest number of occurrences in SR from the cycles obtained for v. We then consider that we have for v the highest chance to find an errorless occurrence in LR between the vertices of the detected cycles from v. This additional condition is used in the second step of MixTaR, to validate the obtained patterns using the set LR.
Cycle analysis. Each cycle detected by Johnson's algorithm is analysed in order to find a potential ETR. The difficulty is that in a de Bruijn graph G^{ k } (SR), a cycle is formed by a repeat, but not necessarily by an ETR. For identifying the cycles formed by an ETR, we determine the ones that respect Property 1. Let c be a cycle in G^{ k } (SR) such that c is formed by an ETR ε from the target DNA fragment D. We consider v_{ 1 } = ε[1, k] and v_{ e } = Suf f (ε, k) to be the first and last kmer of ε as presented in Property 1. The flanking regions of ε in D create two arcs α_{ in }= (x, v_{ 1 }) and α_{ out } = (v_{ e }, y) is G^{ k } (SR), such that x[1] + ε + y[k] occurs in D. We consider that these two arcs mark the beginning and the end of ε.
In the case of a complex D, ε or substrings of ε can appear several times at different locations in D. In the following we call these sequences additional interspersed repeats (AIR) of ε. In a de Bruijn graph, each vertex represents a unique kmer. Since the AIR and ε have common kmers, their corresponding paths in G^{ k } (SR) share common vertices and thus parts of c. This fact has two consequences for c. The first one is that, as is the case for ε, the flanking regions of the AIR can create additional input and output arcs from the vertices of c. Since the AIR occur in D, they are spanned by reads in SR. Thus, the second consequence is that the arcs of the ETR in common with an AIR have a frequency corresponding to both the ETR and the AIR. Therefore, in order to detect if a cycle from G^{ k } (SR) contains an ETR we have to remove the impact of the potential AIR from the frequencies of the cycle arcs. The method, described in Algorithm 1 detailed below, is applied for each detected cycle c.
Algorithm 1: Cycle frequency cleaning
Input: The sets Vc and Ac of vertices and arcs of a cycle c; the sets A_{ in } − {α_{ in }} and A_{ out } − {α_{ out }} of input and output arcs of the vertices in c
 1
f_{ acc } ← 0;
 2
for i ← 1 to Ac do
 3
if i = Ac then
 4
αi ← (vi, v_{ 1 }), αi ∈ Ac;
 5
else
 6
αi ← (vi, vi+1), αi ∈ Ac;
 7
f_{ arc }[αi] ← f (αi, SR);
 8
f_{ input }[vi] ← $\sum _{\begin{array}{c}q=\left(x,{v}_{i}\right)\hfill \\ q\in {A}_{out}\left\{{\alpha}_{in}\right\}\hfill \end{array}}}f\left(q,SR\right)$
 9
f_{ output }[v_{ i }] ← $\sum _{\begin{array}{c}u=\left({v}_{i},y\right)\hfill \\ u\in {A}_{out}\left\{{\alpha}_{out}\right\}\hfill \end{array}}}f\left(u,\phantom{\rule{2.36043pt}{0ex}}SR\right)$
 10
f_{ acc } ← max(f_{ acc } + f_{ input }[vi] − f_{ output }[vi], 0);
 11
f_{ arc }[α_{ i }] ← f_{ arc }[α_{ i }] − f_{ acc };
 12
f_{ output }[v_{ i }] ← max(−f_{ acc }, 0);
 13
f_{ input }[v_{ i }] ← 0;
 14
if f_{ acc } >0 then
 15
for i ← 1 to Ac do
 16
if i = A_{ c } then
 17
α_{ i } ← (v_{ i }, v_{ 1 }), αi ∈ A_{ c };
 18
else
 19
α_{ i } ← (v_{ i }, v_{ i }+_{1}), α_{ i } ∈ A_{ c };
 20
f_{ acc } ← max(f_{ acc } − f_{ output }[vi], 0);
 21
f_{ arc }[α_{ i }] ← f_{ arc }[α_{ i }] − f_{ acc };
 22
return f_{ arc } [];
Let V_{ c } be the set of vertices of c and A_{ c } be the set of its arcs. We start by computing for each cycle c the set of input arcs A_{ in } = {(x, v) with x ∉ V_{ c } and v ∈ Vc} and the set of output arcs A_{ out } = {(v, y) with v ∈ Vc and y ∉ V_{ c }}. Unfortunately, the paths of the ETR and the AIR are not known. Thus, we suppose that each couple (α_{ in }, α_{ out }) ∈ A_{ in }xA_{ out } is a potential marker for the beginning and the end of an ETR ε in D, corresponding to the cycle c. Therefore, the arcs in A_{ in } − {α_{ in }} ∪ A_{ out } − {α_{ out }} are the endpoints of the paths defined by all the AIR of ε.
In the beginning, the frequency f_{ arc }[α] of an arc α ∈ Ac is initialized to f (α, SR) (computed from the number of occurrences of the corresponding (k+1)mer in SR as described before). We start traversing c and for each vertex v_{ i } ∈ V_{ c }, we compute the incoming frequency f_{ input }[v_{ i }] and the outgoing frequency f_{ output }[v_{ i }]. We then use a cumulative frequency f_{ acc } to remove the impact of the AIR from the arcs in c. Initially, the cumulative frequency f_{ acc } is equal to 0. For each vertex v_{ i } we add to f_{ acc } the difference between f_{ input }[v_{ i }] and f_{ output }[v_{ i }]. If f_{ acc } >0 then its value represents the frequency of the AIR, and thus in excess, of the arc α_{ i } = (v_{ i }, v_{i+1}) and we remove it from f_{ arc }[α_{ i }]. Otherwise, its absolute value corresponds to the frequency in excess of the arc α_{ i } = (v_{i−1}, v_{ i }) and it is removed in a second pass through the cycle. In this case, f_{ output }[v_{ i }] = f_{ acc } and we set f_{ acc } = 0. In both cases, we set f_{ input }[v_{ i }] = 0 since its impact is now included in f_{ acc }.
In order to deduce a potential ETR pattern from c, the remaining frequency of the arcs of c has to respect Property 1. In this case, we construct the ETR ε of the pattern p spelled by the cycle c in the following manner. Let l be the number of vertices of c, then ε = v_{ 1 } + v_{2}[k] + . . . + v_{ l }[k]. As mentioned before, l = p and thus p = ε[1, l].
Pattern validation
The cycle analysis we just described can only return potential patterns of ETR, and thus, of robust TR. However, some of the information given by the short reads is lost because of their hashing into kmers. As a consequence, the set of patterns obtained from the de Bruijn graph may contain false positives. Hence, for each obtained pattern we have to validate its presence in at least one TR from the target DNA fragment D. For this, we use the set LR because of the significant length of its reads.
With a variable length that can range from 1 kbp to 20 kbp [23], the reads in LR can span most TR in D. The hypothesis that we make is weaker: we assume that each pattern p of a TR has at least two neighbouring copies spanned by at least one read in LR. Thus, we validate a pattern if we identify at least two copies of it located next to each other in at least one long read. Otherwise the pattern is considered as a false positive and is discarded.
The difficulty raised by this approach is to correctly identify two neighbouring copies of a pattern in spite of the high error rate of long reads, which is of approximately 16%. A possible solution is the use of a long read correction procedure [25–28]. However, in case of complex genomes, even the most efficient methods are not able to correct all the errors from the long reads (see Results and discussion section for more details). Thus, our pattern search method allows the use of corrected and also noncorrected long reads.
In order to validate a pattern p, we start by searching an approximate copy of p in the reads in LR. Let c be a cycle detected at the previous step, and let v be the vertex of c having the highest number of occ in SR. As described in the previous step, v satisfies the condition occ(v, LR) >0. Let p be the potential pattern deduced from the analysis of c, i.e. p is the pattern of the ETR spelled by c. If p < k or if v = Suff (p, x) + Pref (p, k − x) for an integer 1 ≤ × < k, then v occurs in a string obtained by successive concatenations of p. Otherwise, v occurs directly in p. For a fast validation of the pattern p, we limit the search for copies of p to each long read r containing v. Let s be a string initially identical to p. If occ(v, s) = 0, we extend s by successive concatenations of p in such way that s is the shortest string for which occ(v, s) >0. The process is finite since v occurs in the ETR of p spelled by c.
Let pos be a position in a long read r such that r[pos, k] = v. We then try to identify an occurrence of s in r by searching for the alignment with the highest score between s and a substring of r around pos. Approximately 70% of the errors in LR are insertions [24]. Thus, we consider for our alignment a substring of length 2s from r and we compute the maximum semiglobal alignment score t = sgA_{ max }(s, r[pos − s, 2s]). We then use a parameter τ , which is a threshold for our score t. The value of τ is set according to the approximate percentage of errors of the long reads we used, raw or corrected. If t/s < τ , then p is considered as a false positive. Otherwise, either occ(p + p, s) >0 and the pattern p is validated or we repeat the process for s = s + p.
TR sequence assembly
Because of the high error rate of long reads, the partial TR detected in the second step of MixTaR contains a significant amount of erroneous bases. Thus, in the last step of our algorithm, we need to find the exact sequence for each TR. For this, we use once again the set SR. The TR of each pattern p are constructed by a local assembly of the set SR_{ p } of short reads overlapping p.
To construct the set SR_{ p } for each p we introduce two parameters γ and ϑ.
The smaller the length of the pattern p is, the higher is the probability that a read in SR overlaps p. Hence, the size of SR_{ p } and the running time of its assembly can be significant for short p. Let s be a string initially identical to p. In order to limit the number of short reads r used in the local assemblies, if s < γ, we extend s by successive concatenations of p such that s is the shortest string for which s ≥ γ. The value for γ depends on the length of the short reads in SR, and has to be small enough to maximize the number of reads spanning a part of a TR of p included in SR_{ p } while limiting its size.
We then construct the set SRp by searching the short reads overlapping s. MixTaR allows approximate overlaps since p can be a pattern of an ATR. For each short read r ∈ SR, we compute the overlap alignment score t = max(oA_{ max }(s, r, γ), oA_{ max }(r, s, γ)). If t ≥ ϑ, then r is added to SR_{ p }. The value for ϑ depends on the error rate of the reads in SR and on the difference allowed between a pattern and its copies for an ATR.
Once the set SRp is computed, we assembly it. For this, we use a greedy assembler. This choice was made because of its short running time needed in order to obtain satisfying results [15]. A greedy assembler computes a specific scoring function for the overlap between each pair of short reads. Then the short reads are assembled in larger sequences, called contigs, by successive assemblies of the short reads with the overlap with the highest score.
In each contig output by the greedy assembler, we may find several TR, and we have to identify their positions in the contig. To do so, we use a TR sequence search tool (see [8] for examples) which gives us the positions of the TR in each contig. The set of contigs is then analysed. In some cases, we can obtain contigs that do not contain any TR. Since these contigs are not significant for our TR detection, they are removed. In the case where a TR occurs at an end of a contig, we can suppose that this TR is not complete. Thus the contig becomes a seed. Each seed is then extended in order to obtain the complete sequence of the TR. The extension of the seeds is done using the greedy assembler with the set of reads $SR{{U}_{p}}_{\in \Im}SRp$, where $\Im $ is the set of all patterns validated during the second step of MixTaR. Once all the seeds are extended, we run the TR sequence search tool for a complete identification of the TR on the set of seeds. In the end, we output the TR obtained on the seeds along with the TR from the contigs.
Results and discussion
Experimental setup
In this section, we present the adaptations included to MixTaR for the real case of nonhomogeneous coverage depth data along with the tools and libraries used for the experiments.
For the first step of MixTaR (Step I. in Figure 1) we use the de Bruijn graph library GATBlib [39] for the construction of the de Bruijn graph. In this step, the result of the arc frequencies computations presented in Algorithm 1 depends on the homogeneity of the coverage depth used for obtaining the set SR. Let c be a cycle in G^{ k } (SR) formed by an ETR ε from the target DNA fragment D. In the real case of nonhomogeneous coverage depth of SR, the bases of ε are not always covered by the same number of short reads. Thus, the frequencies of the arcs in c may fluctuate. Therefore, Property 1 is verified using an interval rather than a specific value for the frequency of an arc. This interval is computed based on the coverage depth for the set SR and on the mean frequency of the arcs in c.
In the pattern validation step of MixTaR (Step II. in Figure 1), we compute the maximum semiglobal alignment score between the patterns and the long reads using the overlap alignment method proposed by the library SeqAn [40]. This method is also used at the end of our algorithm (Step III. in Figure 1) for computing the overlap between the patterns and the short reads. To compute the local greedy assemblies, we chose SSAKE [41] due to its low running time and simple setup needed to obtain satisfying results [17].
For our experiments we run the TR sequence search tool called mreps [10] in order to identify the TR from the contigs obtained with SSAKE. Mreps is a software tool designed for a fast identification in DNA sequences of both ETR and ATR with primitive patterns. We also used mreps for identifying TR in the reference DNA sequences of the tested organisms. Due to the significant number of TR in the tested organisms, mreps was parameterised each time to find TR with at least two complete copies.
Since the DNA strand of the contigs obtained with SSAKE, and thus of the TR detected, is not known in advance, when we detect a TR and its reverse complement, we consider that they represent the same TR. Thus, a detected TR can be represented either by its sequence or by both its sequence and its reverse complement. In our analysis we consider for the target DNA fragment D only its reference strand. Thus a TR has been correctly identified if we detected its sequence (alone or together with its reverse complement). Thus, each TR can be classified in:

True Positive (T P) if we detect the complete sequence (i.e. with the right copy number) of the TR from D;

True Positive Incomplete (T P_{ i }) if we detect only a part of the TR from D;

False Negative (F N) if we do not detect the TR from D;

False Positive (F P) if we detect the TR but its sequence does not occur in D.
The accuracy of our method is then measured using the two following statistics:

P_{ recision }= (T P + T P_{ i })/(T P + T P_{ i }+ F P) which measures the fraction of retrieved TR that are present in D;

Sensitivity = (T P + TP_{ i })/(T P + T P_{ i }+ F N) which measures the fraction of TR from D that are correctly identified.

Step 1: Pattern detection

the length k of the kmers for the de Bruijn graph construction.

the minimum number of occurrences σ of a kmer in SR. The value of this parameter depends on the coverage depth of SR.

the maximal length Λ_{ max }of cycles detected from each vertex v.

the maximal number of arcs η explored from each vertex v when looking for cycles of maximal length Λ_{ max }. The values of Λ_{ max }and η depend on the size of the de Bruijn graph and on the running time allowed by the user.

the maximal length λ_{ max }of cycles detected for a vertex v once the maximal number of arcs η was visited for v. The value of λ_{ max }(λ_{ max } ≤ Λ_{ max }) also depends on the size of the de Bruijn graph and on the running time allowed by the user.


Step 2: Pattern validation

the minimum alignment score τ allowed between a pattern and a long read. The value for this parameter depends on the error rate of the long reads.


Step 3: TR sequence assembly

the minimum length γ of the pattern used for selecting the short reads for the local assemblies. From our experiments, the best results for a length of short reads of 100 bp were obtained for γ = 10.

the minimum alignment score ϑ allowed between a pattern and a short read and also between the pattern and its copies. From our experiments, the best results were obtained for ϑ = 10.

In the following, we present the results obtained with MixTaR on both simulated and real data sets. As mentioned before, our algorithm is set to analyse all the cycles of maximum length λmax from the de Bruijn graph and only a part of the cycles of length between λmax and Λmax. Also, the length of a pattern deduced from a cycle is equal to the number of vertices of the cycle. Hence, for the TR detected with MixTaR we present two separate analyses. At first, we describe the results obtained on the TR with a pattern length of maximum λmax bp. Then we extend our analysis to the TR with a pattern length of maximum Λmax bp. These analyses are conducted on both the set of robust TR (those that the algorithm targets) and on the set of general TR from the chromosome. Some of the general TR do not have an internal ETR that forms a cycle in the de Bruijn graph and are therefore not directly targeted but are found by MixTaR.
Simulated data sets
For our experiments, we needed a complex organism presenting numerous TR. We chose Caenorhabditis elegans, which is a transparent nematode widely used as a model organism. As presented in [17], C. elegans is an organism for which assemblers need high computation time for a low percentage of correctly mapped contigs. This is mainly due to the important number of repeats, such as TR. Its genome was downloaded from GenBank [GenBank:GCA 000002985.3]. For our experiments, we used the first chromosome of C. elegans. This chromosome, of length of approximately 15 Mb, has a very complex structure and contains more TR than many complete genomes. We run mreps [10] to identify the TR from its reference sequence and we obtained a set of 39,006 TR with pattern p satisfying 2 ≤ p ≤ 100.
Pairedend short reads datasets with length of 100 bp and a coverage depth of 20x were obtained using the GemSIM read simulator [43]. To test the impact of the sequencing errors of the short reads on the results quality of MixTaR, the first dataset is errorless (SRNE, for Short Reads with No Error), while the second one (SRE, for Short Reads with Errors) simulates Illumina errors. Two other sets of pairedend short reads were obtained by correcting the set SRE: one (SRCE1, for Short Reads with Corrected Errors 1) by using the trimming tool proposed by SSAKE [41] and the other one (SRCE2 for Short Reads with Corrected Errors 2) by using the correcting tool proposed by ALLPATHS [44], one of the most efficient error correction methods for short reads [16].
Long reads datasets of coverage depth of 20x and 100x were obtained using the PacBio reads simulator PBSIM [42] (LRE x20 and x100, for Long Reads with Errors). Two other sets were obtained by correcting LRE x20 and x100 with LoRDEC (LRCE x20 and x100 Lond Reads with Corrected Errors) as presented before.
For the experiments on the first chromosome of C. elegans, we used the following values for the MixTaR parameters. Since the short reads have a length of 100 bp, for the de Bruijn construction we tested all the odd values for k ∈ [17, 47]. Due to the coverage depth of short reads (x20) we applied the cycle search algorithm for all kmers with a frequency of at least σ = 10. The de Bruijn graph for the first chromosome of C. elegans has a significant number of cycles, thus for the cycle search we set η = 10, 000 arcs for cycles of maximal length Λmax = 100 and then λmax = 20. For the minimum alignment score τ we used the value τ = 10 for the noncorrected long reads and τ = 20 for the corrected ones.
In the following paragraphs, the results obtained with MixTaR are analysed from two points of view: the percentage and the pattern length range of detected TR from the first chromosome of C. elegans.
Precision and sensitivity values for the detection of robust TR and of general TR with maximal pattern length of 100 bp from the first chromosome of C.
Robust TR  General TR  

Precision  Sensitivity  Precision  Sensitivity  
LRE x20  0.909  0.909  0.990  0.984 
LRE x100  0.892  0.908  0.988  0.984 
LRCE x20  0.909  0.912  0.993  0.984 
LRCE x100  0.883  0.911  0.987  0.984 
The results of mreps on the first chromosome of C. elegans showed 37,584 TR with p ≤ 20 and 39,006 TR with p ≤ 100. The results of MixTaR are even better on these sets of general TR, with more than 98% of TR detected and less than 1.3% of FP. This is due to the fact that the first chromosome of C. elegans contains a significant number of TR with similar patterns. By selecting the most significant short reads for the local assemblies, we are able to detect a high percentage of general TR without introducing errors. The precision and sensitivity values on the set of general TR with p ≤ 100 are given on the the third and fourth column of Table 1. Once again, we observe, overall, that the correction of the long reads does not have a true influence on the quality of the results. We can notice however than the number of FP obtained with the set LRCE x20 on the TR with maximal pattern length of 100 bp is at most 45% lower that the one obtained with the other three sets (see Figure 5). But this represents only about 0.6% of all the TR detected by MixTaR. Since in general the cost of correcting the reads in terms of running time and memory usage is significant, in the following we present the results obtained using the set of long reads noncorrected LRE with the lowest coverage depth, x20. The differences observed between the results obtained with this set of long reads and with the three others are similar to the ones presented in this paragraph.
Precision and sensitivity values for the detection of general TR with maximal pattern length of 100 bp from the first chromosome of C.
K = 17  K = 27  K = 37  K = 47  

Prec.  Sens.  Prec.  Sens.  Prec.  Sens.  Prec.  Sens.  
SRNE  0.990  0.984  0.994  0.980  0.996  0.948  0.997  0.884 
SRE  0.872  0.917  0.942  0.862  0.961  0.758  0.968  0.638 
SRCE1  0.984  0.939  0.994  0.906  0.997  0.777  0.997  0.579 
SECE2  0.998  0.818  0.999  0.764  0.999  0.669  0.999  0.565 
On Figure 8 we can also observe the effect of errorcorrection procedures on short reads. The number of FP can decrease by more than 90% only by trimming the reads and even with 100% with the error correction method proposed by ALLPATHS. But once again, the number of lost TP can be significant. By correcting the low quality parts of the short reads, the error correcting procedures can remove parts of correct TR. The arc frequency variance is then increased in the cycles formed by these TR, implying supplementary difficulties for MixTaR to correctly identify them.
The pattern length range of detected TR. In the results returned by mreps for 2 ≤ p ≤ 100, we observe that about 96% of the general TR of the first chromosome of C. elegans have 2 ≤ p ≤ 20. But the interval 21 ≤ p ≤ 100 is also very well covered, since the chromosome has at least one TR for most of the pattern lengths.
Real data sets
Legionella pneumophila is an intracellular parasite found in human monocytes and responsible of a severe pneumonia known as Legionnaires' disease. Several studies such as [45, 46] are focused on the biological role of TR from the genome of L. pneumophila and the Philadelphia strain.
For our experiments on the strain Philadelphia, we used a set of paired short reads obtained with Illumina technology with a coverage depth of approximately 190x [SRA:SRX258262]. In the following, we refer to this set of reads with SRRP for Short Reads Real for Philadelphia. Since no long reads sets are available for this strain, we simulated two sets of long reads with PBSIM with a coverage depth of 20x and 100x (LREP x20 and LREP x100 for Long Reads with Errors for Philadelphia). Two other sets of long reads were obtained by correcting LREP x20 and LREP x100 with LoRDEC using the set of real short reads and k = 19. These sets are denoted LRCEP x20 and LRCEP x100 for Long Reads with Corrected Errors for Philadelphia. The genome has a length of approximately 3.4 Mb [GenBank:GCA 000008485.1].
In order to test MixTaR with both long and short read real data sets, we also used for our experiments a second strain of L. pneumophila, the 130b strain. For our experiments, we used two sets of reads downloaded from the Sequence Read Archive at the NCBI: one set of paired short reads obtained with Illumina sequencing with a coverage depth of approximately 120x [SRA:ERX313832] and a set of long reads obtained with PacBio sequencing [SRA:ERX620205]. In the following, we refer to these two sets of reads with SRR130b (for Short Reads Real for 130b) and LRR130b (for Long Reads Real for 130b). A second set of long reads (LRRC130b, for Long Reads Real Corrected for 130b) was obtained by correcting the set LRR130b with LoRDEC using the set SRR130b and k = 19. The draft genome is constituted of 159 contigs [GenBank:GCA 000211115.2].
For the experiments on both strains of L. pneumophila that we present, we used the following values for the algorithm parameters. Due to the the short reads length of 100 bp we tested all the odd values for k ∈ [17, 47]. In the following paragraphs we present the results obtained for k = 17, the value for which we obtained the best results. Since the size of the genome is significantly smaller than the size of the first chromosome of C. elegans, the sizes of the set of kmers and of the de Bruijn graph are also smaller. Therefore we included in our cycle search all kmers with at least σ = 30 frequency. For the cycle search we set η = 10, 000 arcs for cycles of maximal length Λmax = 100 and then λmax = 30. For the minimum alignment score τ we used the value τ = 10 for the long reads noncorrected and τ = 20 for the corrected ones.
Precision and sensitivity values for the detection of robust TR and of general TR from the genome of L.
Robust TR  General TR  

Precision  Sensitivity  Precision  Sensitivity  
LREP x20  1  1  0.997  0.789 
LRCEP x20  1  0.941  0.993  0.639 
LREP x100  1  0.882  0.997  0.600 
LRCEP x100  1  0.941  0.992  0.590 
Biologically significant TR detected by MixTaR from the genome L.
Name  Gene  Pattern length  Copy number  LREP x20  LRCEP x20  LREP x100  LRCEP x100 

LPG0451  30 bp  5.9  Complete        
LPG0688  9 bp  6  Complete  Complete  Complete  Complete  
LPG1038  12 bp  4.17  Complete  Complete  Complete  Complete  
Lpms35  LPG1299  18 bp  3  Complete       
LPG1555  21 bp  2  Complete  Complete  Complete  Complete  
LPG1602  90 bp  9.2  Complete  Complete  Complete  Complete  
LPG1948  90 bp  7.08  Complete    Complete    
LPG1958  87 bp  13.59  Complete    Complete    
LPG2392  87 bp  6.49          
LPG2559  12 bp  4.08  Complete  Complete  Complete  Complete  
Lpms31  LPG2644  45 bp  19.44  Incomplete  Incomplete  Incomplete  Incomplete 
(11 copies)  (11 copies)  (11 copies)  (11 copies)  
Lpms3  LPG2793  96 bp  7.58  Complete       
Lpms01  LPG2854  45 bp  7.64  Incomplete       
(6 copies)  
Lpms_{1}3  LPG1488  24 bp  9.75  Incomplete  Incomplete  Incomplete  Incomplete 
(6 copies)  (6 copies)  (6 copies)  (6 copies)  
Lpms_{1}7  LPG0854  89 bp  2.28    Complete    Complete 
Lpms_{1}9  Intergenic  21 bp  4.05    Complete    Complete 
Precision and sensitivity values for the detection of robust TR and of general TR from the genome of L.
Robust TR  General TR  

Precision  Sensitivity  Precision  Sensitivity  
LRR130b  1  0.857  0.957  0.516 
LRRC130b  1  0.929  0.959  0.525 
Conclusion
This paper presents a new algorithm, MixTaR, that represents an efficient solution to the problem of de novo detection of TR. The method focuses only on the parts of the genome where potential TR can be located, and does not compute global assemblies. By mixing the quality of short reads with the length of long reads, we introduced a robust approach. We obtained high quality results on complex organisms, and using sets of reads with different error rates. Keeping low false positive rates, our method detects accurate TR with pattern lengths varying within a significant interval.
MixTaR identifies a significant number of TR on both real and simulated read sets. However, the complexity of the target DNA fragment influences the amount of general TR detected. As mentioned before, the initial target imposed by our approach is the set of robust TR. By assembling the short reads spanning the robust TR patterns, we are able to identify a significant amount of nonrobust TR. This is due either to the fact that they are located in the flanking regions of the robust TR, or to the similarity between the patterns of nonrobust and robust TR. As a consequence, the number of detected TR increases with the number of robust TR presented in the target DNA fragment. Thus, the performances of MixTaR increase with the complexity of the target DNA fragment.
Future improvements and extensions are planned to be included in our algorithm and in our study. Each tool that detects TR in a reference DNA sequence has its own definition of ATR and the sets of detected TR are different [8]. In this paper, we used mreps [10] for our analysis, but the study can be extended to other TR search tools, for instance, Tandem repeats finder [11].
Also, the results are presented in this paper from the point of view of the TR sequences. The algorithm and also the TR quality study can be extended to include the flanking regions of the TR. Thus, instead of only comparing TR sequences between our results and the target DNA sequence, we can also provide more information concerning the location of each TR.
List of abbreviations
 SGS:

Second Generation Sequencing
 TGS:

Third Generation Sequencing
 TR:

Tandem repeat
 ETR:

Exact tandem repeat
 ATR:

Approximate tandem repeat
 SR:

set of Short Reads obtained with a SGS technology
 LR:

set of Long Reads obtained with Pacific Bioscience sequencing technology.
Declarations
Declarations
Publication costs for this article were funded by bioinformatics theme from LINA UMR CNRS 6241, University of Nantes.
This article has been published as part of BMC Medical Genomics Volume 8 Supplement 3, 2015: Selected articles from the IEE International Conference on Bioinformatics and Biomedicine (BIBM 2014): Medical Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/8/S3.
Authors’ Affiliations
References
 Jurka J, Kapitonov VV, Kohany O, Jurka MV: Repetitive sequences in complex genomes: structure and evolution. Annual Reviews of Genomics Human Genetics. 2007, 8: 241259. 10.1146/annurev.genom.8.080706.092416.View ArticleGoogle Scholar
 Mayer C, Leese F, Tollrian R: Genomewide analysis of tandem repeats in Daphnia pulexa comparative approach. BMC Genomics. 2010, 11 (1): 27710.1186/1471216411277.PubMedPubMed CentralView ArticleGoogle Scholar
 Zhao Z, Guo C, Sutharzan S, Li P, Echt CS, Zhang J: GenomeWide Analysis of Tandem Repeats in Plants and Green Algae. G3: Genes Genomes Genetics. 2014, 4 (1): 6778.PubMedView ArticleGoogle Scholar
 Subramanian S, Mishra RK, Singh L: Genomewide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions. Genome Biology. 2003, 4 (2): 1310.1186/gb200342r13.View ArticleGoogle Scholar
 Verstrepen KJ, Jansen A, Lewitter F, Fink GR: Intragenic tandem repeats generate functional variability. Nature Genetics. 2005, 37 (9): 986990. 10.1038/ng1618.PubMedPubMed CentralView ArticleGoogle Scholar
 Fondon JW, Garner HR: Molecular origins of rapid and continuous morphological evolution. Proceedings of the National Academy of Sciences. 2004, 101 (52): 1805818063. 10.1073/pnas.0408118101.View ArticleGoogle Scholar
 Gelfand Y, Rodriguez A, Benson G: TRDBthe tandem repeats database. Nucleic Acids Research. 2007, 35 (suppl 1): 8087.View ArticleGoogle Scholar
 Lim KG, Kwoh CK, Hsu LY, Wirawan A: Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance. Briefings in Bioinformatics. 2013, 14 (1): 6781. 10.1093/bib/bbs023.PubMedView ArticleGoogle Scholar
 Pokrzywa R, Polanski A: BWtrs: a tool for searching for tandem repeats in DNA sequences based on the BurrowsWheeler transform. Genomics. 2010, 96 (5): 316321. 10.1016/j.ygeno.2010.08.001.PubMedView ArticleGoogle Scholar
 Kolpakov R, Bana G, Kucherov G: mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Research. 2003, 31 (13): 36723678. 10.1093/nar/gkg617.PubMedPubMed CentralView ArticleGoogle Scholar
 Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research. 1999, 27 (2): 573580. 10.1093/nar/27.2.573.PubMedPubMed CentralView ArticleGoogle Scholar
 Treangen TJ, Salzberg SL: Repetitive DNA and nextgeneration sequencing: computational challenges and solutions. Nature Reviews Genetics. 2012, 13 (1): 3646.Google Scholar
 Ansorge WJ: Nextgeneration DNA sequencing techniques. New Biotechnology. 2009, 25 (4): 195203. 10.1016/j.nbt.2008.12.009.PubMedView ArticleGoogle Scholar
 Hoff KJ: The effect of sequencing errors on metagenomic gene prediction. BMC Genomics. 2009, 10 (1): 52010.1186/1471216410520.PubMedPubMed CentralView ArticleGoogle Scholar
 Miller JR, Koren S, Sutton G: Assembly algorithms for nextgeneration sequencing data. Genomics. 2010, 95 (6): 315327. 10.1016/j.ygeno.2010.03.001.PubMedPubMed CentralView ArticleGoogle Scholar
 Salzberg SL, Phillippy AM, et al: GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research. 2012, 22 (3): 557567. 10.1101/gr.131383.111.PubMedPubMed CentralView ArticleGoogle Scholar
 Zhang W, Chen J, Yang Y, Tang Y, Shang J, Shen B: A practical comparison of de novo genome assembly software tools for nextgeneration sequencing technologies. PLoS One. 2011, 6 (3): 1791510.1371/journal.pone.0017915.View ArticleGoogle Scholar
 Misawa K, RF : A method for filtering short reads with tandem repeats for genome mapping. Genomics. 2013, 102 (1): 3537. 10.1016/j.ygeno.2013.03.002.PubMedView ArticleGoogle Scholar
 Zerbino DR, McEwen GK, Margulies EH, Birney E: Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet ShortRead de Novo Assembler. PLoS One. 2009, 4 (12): 840710.1371/journal.pone.0008407.View ArticleGoogle Scholar
 Wetzel J, Kingsford C, Pop M: Assessing the benefits of using matepairs to resolve repeats in de novo shortread prokaryotic assemblies. BMC Bioinformatics. 2011, 12 (1): 9510.1186/147121051295.PubMedPubMed CentralView ArticleGoogle Scholar
 Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences. 2011, 98 (17): 97489753.View ArticleGoogle Scholar
 Fertin G, Jean G, Radulescu A, Rusu I: DExTaR: Detection of exact tandem repeats based on the de Bruijn graph. IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2014, IEEE, 9093.View ArticleGoogle Scholar
 Carneiro MO, Russ C, Ross MG, Gabriel SB, Nusbaum C, DePristo MA: Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics. 2012, 13 (1): 37510.1186/1471216413375.PubMedPubMed CentralView ArticleGoogle Scholar
 Chaisson MJ, Tesler G: Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 2012, 13 (1): 23810.1186/1471210513238.PubMedPubMed CentralView ArticleGoogle Scholar
 Au KF, Underwood JG, Lee L, Wong WH: Improving PacBio long read accuracy by short read alignment. PLoS One. 2012, 7 (10): 4667910.1371/journal.pone.0046679.View ArticleGoogle Scholar
 Salmela L, Rivals E: LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014, 538Google Scholar
 Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, Wang Z, Rasko DA, McCombie WR, Jarvis ED, et al: Hybrid error correction and de novo assembly of singlemolecule sequencing reads. Nature Biotechnology. 2012, 30 (7): 693700. 10.1038/nbt.2280.PubMedPubMed CentralView ArticleGoogle Scholar
 Hackl T, Hedrich R, Schultz J, F¨orster F: proovread: largescale highaccuracy pacbio correction through iterative short read consensus. Bioinformatics. 2014, 30 (21): 30043011. 10.1093/bioinformatics/btu392.PubMedPubMed CentralView ArticleGoogle Scholar
 Deshpande V, Fung ED, Pham S, Bafna V: Cerulean: A hybrid assembly using high throughput short and long reads. Algorithms in Bioinformatics Lecture Notes in Computer Science. 2013, 8126: 349363. 10.1007/9783642404535_27.View ArticleGoogle Scholar
 Prjibelski AD, Vasilinetc I, Bankevich A, Gurevich A, Krivosheeva T, Nurk S, Pham S, Korobeynikov A, Lapidus A, Pevzner PA: ExSPAnder: a universal repeat resolver for DNA fragment assembly. Bioinformatics. 2014, 30 (12): 293301. 10.1093/bioinformatics/btu266.View ArticleGoogle Scholar
 Ummat A, Bashir A: Resolving complex tandem repeats with long reads. Bioinformatics. 2014, 30 (24): 34913498. 10.1093/bioinformatics/btu437.PubMedView ArticleGoogle Scholar
 Bashir A, Klammer AA, Robins WP, Chin CS, Webster D, Paxinos E, Hsu D, Ashby M, Wang S, Peluso P, et al: A hybrid approach for the automated finishing of bacterial genomes. Nature Biotechnology. 2012, 30 (7): 701707. 10.1038/nbt.2288.PubMedPubMed CentralView ArticleGoogle Scholar
 Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, Clum A, Copeland A, Huddleston J, Eichler EE, et al: Nonhybrid, finished microbial genome assemblies from longread SMRT sequencing data. Nature Methods. 2013, 10 (6): 563569. 10.1038/nmeth.2474.PubMedView ArticleGoogle Scholar
 Huddleston J, Ranade S, Malig M, Antonacci F, Chaisson M, Hon L, Sudmant PH, Graves TA, Alkan C, Dennis MY, et al: Reconstructing complex regions of genomes using longread sequencing technology. Genome Research. 2014, 24 (4): 688696. 10.1101/gr.168450.113.PubMedPubMed CentralView ArticleGoogle Scholar
 Waterman MS: Introduction to Computational Biology: Maps, Sequences and Genomes. 1995, CRC PressView ArticleGoogle Scholar
 Idury RM, Waterman MS: A new algorithm for DNA sequence assembly. Journal of Computational Biology. 1995, 2 (2): 291306. 10.1089/cmb.1995.2.291.PubMedView ArticleGoogle Scholar
 Stoye J, Gusfield D: Simple and flexible detection of contiguous repeats using a suffix tree. Theoretical Computer Science. 2002, 270 (1): 843856.View ArticleGoogle Scholar
 Johnson DB: Finding all the elementary circuits of a directed graph. SIAM Journal on Computing. 1975, 4 (1): 7784. 10.1137/0204007.View ArticleGoogle Scholar
 Drezen E, Rizk G, Chikhi R, Deltel C, Lemaitre C, Peterlongo P, Lavenier D: Gatb: Genome assembly & analysis tool box. Bioinformatics. 2014, 30 (20): 29592961. 10.1093/bioinformatics/btu406.PubMedPubMed CentralView ArticleGoogle Scholar
 Döring A, Weese D, Rausch T, Reinert K: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics. 2008, 9 (1): 1110.1186/14712105911.PubMedPubMed CentralView ArticleGoogle Scholar
 Warren RL, Sutton GG, Jones SJ, Holt RA: Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 2007, 23 (4): 500501. 10.1093/bioinformatics/btl629.PubMedView ArticleGoogle Scholar
 Ono Y, Asai K, Hamada M: PBSIM: PacBio reads simulatortoward accurate genome assembly. Bioinformatics. 2013, 29 (1): 119121. 10.1093/bioinformatics/bts649.PubMedView ArticleGoogle Scholar
 McElroy KE, Luciani F, Thomas T: GemSIM: general, errormodel based simulator of nextgeneration sequencing data. BMC Genomics. 2012, 13 (1): 7410.1186/147121641374.PubMedPubMed CentralView ArticleGoogle Scholar
 Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: de novo assembly of wholegenome shotgun microreads. Genome Research. 2008, 18 (5): 810820. 10.1101/gr.7337908.PubMedPubMed CentralView ArticleGoogle Scholar
 Coil DA, Vandersmissen L, Ginevra C, Jarraud S, Lammertyn E, Ann´e J: Intragenic tandem repeat variation between Legionella pneumophila strains. BMC Microbiology. 2008, 8 (1): 21810.1186/147121808218.PubMedPubMed CentralView ArticleGoogle Scholar
 Visca P, D'Arezzo S, Ramisse F, Gelfand Y, Benson G, Vergnaud G, Fry NK, Pourcel C: Investigation of the population structure of Legionella pneumophila by analysis of tandem repeat copy number and internal sequence variation. Microbiology. 2011, 157 (9): 25822594. 10.1099/mic.0.0472580.PubMedView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.