Suppose that we are given a set of paired-end sequencing data with mapping information, and the outputs of the proposed pipeline include both the number of sub-clones and the haplotypes of each sub-clone. The given data is first pre-processed: a read is retained if it brings at least one point mutations, while a read-pair is retained if it brings at least two point mutations. Each read-pair that passed the filter is then collapsed to a much shorter sequence, named VPE as in [45], by extracting the sites with point mutations from this read-pair, as shown in Fig. 1. A VPE consists of only the sites with variants from the corresponding read-pair. In the current stage of this research, the structural variations, including the allelic imbalance events on point mutations, are temporarily not considered. According to these reads (VPEs), the variant allelic frequency (VAF) of each variant is calculated. We adopt a model-free method to provide a priori of the clonal structure according to global variant allelic frequencies. A series of model-free method could achieve this. Here, we incorporate SciClone [3], a popular method, into the proposed pipeline.
MixSubHap consists of three major components: assembling local VPEs, expanding local contigs and iteratively stripping clonal read-depth. The flowchart is shown in Fig. 2.
-
Assembling local VPEs: As the first component, a divide-and-conquer strategy is adopted to assemble the VPEs to multiple groups of contigs. Different from assembling reads, VPEs are first clustered by starting sites, and then for each cluster, the VPEs with the same starting site are processed together to form a small set of contigs. Note that, multiple reads whose mapping positions are slightly different may collapse to the same VPE sequence if they bring the same variants. In addition, several constraints which imply the inheritance principle across sub-clones are applied. The details of these constraints are further discussed in Discussion section.
-
Expanding local contigs: To link the contigs across different clusters, an efficient greedy algorithm is conducted. An undirected weighted graph is established. Each vertex represents a variant site. An edge exists between two vertexes if at least one VPE supports the linkage of them. The edges are weighted according to the likelihoods of possible linkage modes. The greedy algorithm first generates a maximum spanning tree on the sub-graph that consists of the vertexes from the founding clone. Here, the estimation of founding clone is provided by SciClone. Moreover, the computation is also simplified by the inheritance constraint, which limits the linkage mode in all of the descendant sub-clones.
-
Stripping each clonal read-depth: For the variants do not occur in the founding clone, different genotypes are brought by different sub-clones. In these cases, the likelihoods are interfered by allelic clonal haplotypes. To filter out the bias on the likelihoods, a thickness stripping algorithm is designed. According to the VAFs and the estimations on clonal structure, the VPEs that have higher probabilities of sampling from other sub-clones are removed, and then the spanning tree is extended based on the corrected edge weights. The parental haplotypes guide the clonal haplotypes of its descendants. When the tree is traversed, the clonal haplotypes are then reconstructed.
Assembling local VPEs
Let V denote the set of given variant sites. VPEs are defined as the information which contain the base states of all the mutated sites along with their actual positions in reads and relative positions in set V. When extracting VPE information, it is necessary to ensure that the number of mutations brought by the read or read-pair is not less than two. Sequencing reads that contain only one mutation do not reflect the interrelationship between mutations and do not provide useful information for haplotype reconstruction, thus we do not take such information into consideration. It is assumed that each sequence library is well prepared and the insert-size obeys a normal distribution with a small variance. Therefore, libraries with different lengths are introduced, more variant sites can be extracted correctly.
In general, the rare somatic mutations on cancer susceptibility genes may reach 10,000, which is a huge challenge for the computational capability of ordinary computer. At this time, the computational complexity is too high due to the large amount of variants. Therefore, the MixSubHap uses the Dividing and Assembling strategy to process the VPE information. The strategy appropriately clusters the length of the division and designs the connection ways, so that the resulting short chains can accurately exhibit fragments of the sub-clonal haplotypes. To achieve this, the VPE is first mapped to the reference by mapping the base state of corresponding variant against the reference according to its actual position. After that, the aligned VPE are divided into M groups according to its starting position, then the algorithm processes group by group. Assume that the number of sub-clones I is known, then the number of haplotypes is determined, which is 2I, and the total number of variants is N. The process consists of the following steps:
-
Sorting the VPEs via the starting positions.
-
Constructing initial short chain groups. VPE alignments at the same starting position are processed together to form a group of short chains. Specifically, this is an integer programming problem, where the goal of the programming is that the cardinality of a set of short chains (the number of short chains in the group) is minimal, and the constraint of the programming is that the short-chain group must be able to support all the corresponding VPEs, carrying the maximum number of variant sites. During solving this, the greedy strategy is used to minimize the number of short chains, and at the same time to support all the VPE information of this group. In order to ensure that the ambiguity chain does not produce redundant, we add each VPE into a short chain group only when the VPE not contained in the existing short chains, otherwise, the VPE remains in the candidate set and the short chain group retains the same.
-
According to the principle of inheritance, arrange all short chain groups and keep qualified arrangement. Process VPEs whose starting sites are p (p∈N), then process group by group, until all VPEs are processed.
Until now, all the VPEs are rationally and effectively connected, and a large number of short chain groups carrying haplotype information are formed. Next, we will present a reasonable and efficient strategy to further construct the haplotypes of each sub-clone.
Expanding local contigs
According to the hypothesis of linear evolution mode between tumor sub-clones (see in Discussion), we know that, once the connection mode of a pair of variants from the founding clone S0 is fixed in the tumor evolutionary process, the differentiation of subsequent sub-clone Si, i∈{1,2,…,I−1} will inherit the same connection without any change. In the same way, once the connection between a pair of variants from sub-clone Si at any level is confirmed, it will not change in its descendant clone Sj(i<j). According to the linear evolution mode, the mutation sites in the clones are separated layer by layer. Thus, the parental clonal haplotype structure can be used as a known condition to guide the construction of the descendant clonal haplotypes.
In order to recognize the clonal haplotypes efficiently and accurately, MixSubHap algorithm first clusters all the variant sites according to the VAF of each site. The variants in the same cluster are from the same sub-clone [3], and then the clustering results are considered as the basis to initialize the clonal haplotypes. The clustering method used in this paper is SciClone version 1.1.0 [3], which is reported to be relatively accurate in clustering the somatic mutations by clonal structure.
MixSubHap algorithm mainly generates a maximum spanning tree based on the short chain groups we obtained. Let M be the set of all variants, and 〈pi,pj〉 represent two adjacent allelic sites, pi,pj∈M. \( \text {Let} \, H^{p_{i},p_{j}}_{S_{k}} \) represent the connection mode between the pith variant site and the pjth variant site from the sub-clone Sk. A stands for the same base as the reference genome, while B stands for a base different from the reference genome, namely B stands for a mutation. Therefore, for any two sites from the same sub-clone, possible values for \( H^{p_{i},p_{j}}_{S_{k}} \) are {(A,A), (B,B)}or{(A,B), (B,A)}. For the founding clone S0, according to the short chain groups, we can get many different values of \( H^{p_{i},p_{j}}_{S_{0}} \) and add these different values respectively to the corresponding coverage of variant pair 〈pi,pj〉. In order to separate the sub-clones layer by layer, variant pair 〈pi,pj〉 from every sub-clone and the corresponding coverage level c〈i,j〉 are used to estimate the clonal haplotypes. For variant pairs, the coverage of each pair and the probability of various possible connection modes between the two variants are calculated together. According to the probabilities of various connection patterns, the corresponding undirected weighted graphs are established. When calculating the coverage level c〈i,j〉 of allelic sites 〈pi,pj〉, only short chain groups are considered. Let F(i,j) represent the set of short chains containing both allelic sites 〈pi,pj〉. |F(i,j)| represents the number of chains in this set. We set the connection probability of {(A,A), (B,B)} to be positive and set the connection probability of {(A,B), (B,A)} to be negative. Let G represent a weighted undirected graph where the variant sites are the vertexes of the graph. Here, we set the coverage threshold Cov, with a default value of 2. If |F(i,j)|≥Cov, there will be an edge between variants 〈pi,pj〉 and the weight of edge will be the probability of the connection. The formula of weight is:
$$W_{\left\langle i,j\right\rangle} = f\left(b_{p_{i}},b_{p_{j}},A\right)- f\left(b_{p_{i}},b_{p_{j}},B\right) $$
$$N^{b_{p_{i}},b_{p_{j}}}(p_{i},p_{j}) = \sum_{r \in F(i,j)}I\left(r(p_{i},p_{j})=\left(b_{p_{i}},b_{p_{j}}\right)\right) $$
Where, \( N^{b_{p_{i}},b_{p_{j}}}(p_{i},p_{j}) \) represents the number of chains across allelic sites 〈pi,pj〉 corresponding to the connection mode \( (b_{p_{i}},b_{p_{j}}) \). \(b_{p_{i}} \) and \( b_{p_{j}} \) respectively indicate the base states of the variant site pi and pj. r(pi,pj) represents alleles for variant site pi and pj. There are four kinds of joint states for allele pi and pj, \( \text {where} \; (b_{p_{i}},b_{p_{j}}) \in \left \lbrace (A, B), (B, A), (A, A), (B, B) \right \rbrace \). I(.) is the indicator function. When \(\phantom {\dot {i}\!} r(p_{i}, p_{j}) = (b_{p_{i}},b_{p_{j}}) \) is true, the function value is 1, 0 otherwise. For variant site pi, consider the sequencing error and alignment error of each site ε.
Let
$$N^{A}(p_{i},p_{j}) = N^{A,A}(p_{i},p_{j})+N^{B,B}(p_{i},p_{j}) $$
$$N^{B}(p_{i},p_{j}) = N^{A,B}(p_{i},p_{j})+N^{B,A}(p_{i},p_{j}) $$
Two connection probabilities of the paired variant sites 〈pi,pj〉 are,
$$ {\begin{aligned} f(b_{p_{i}},b_{p_{j}},A) = \frac{ \left((1-\epsilon)^{2} + \epsilon^{2}\right) \times N^{A}(p_{i},p_{j}) + 2\epsilon \times (1-\epsilon) \times N^{B}(p_{i},p_{j}) }{ \left| F(i,j) \right| } \end{aligned}} $$
$$ {\begin{aligned} f(b_{p_{i}},b_{p_{j}},B)=\frac{ \left((1-\epsilon)^{2} + \epsilon^{2}\right) \times N^{B}(p_{i},p_{j}) + 2\epsilon \times (1-\epsilon) \times N^{A}(p_{i},p_{j}) }{ \left| F(i,j) \right| } \end{aligned}} $$
Where W〈i,j〉>0 indicates that the connection mode of allelic sites 〈pi,pj〉 is \( H^{A,A}_{S_{0}}/ H^{B,B}_{S_{0}}\); otherwise, W〈i,j〉<0 indicates that the connection mode of allelic sites 〈pi,pj〉 is \( H^{A,B}_{S_{0}}/ H^{B,A}_{S_{0}}\). The greater the absolute value of W〈i,j〉, the higher the reliability of the corresponding connection pattern.
After constructing the undirected weighted graph G of the founding clone is constructed, the number of vertexes in the graph is Nv. All the vertexes in the graph G are all derived from variant sites in the founding clone S0. The algorithm selects a vertex, whose base state is known, as the starting point sp for constructing the sub-clonal haplotypes and generating the initial maximum spanning tree T corresponding to graph G. The processing steps are as follows, as shown in Fig. 3:
-
Find all variant sites connected to sp from graph G;
-
Select the corresponding edge of max(|W〈i,j〉|) as one edge of the maximum spanning tree T;
-
According to the positive or negative of the edge weight, the state of another variant connected to sp can be determined and these variants will be considered as new known vertexes;
-
Conduct the second step successively for other unkonwn variants and repeat adding border process to maximum spanning tree T till all the variants have appeared in T or cannot be added any more.
In actual, due to the sparseness of the founding clone variants and the limitation of the read length of the second generation sequencing data, the generated undirected weighted graph G is often not a connected graph, but consists of several mutually disconnected subgraphs. MixSubHap algorithm generates an equal number of subtrees on these subgraphs and identifies the base states, thus guides the later extension of the spanning tree.
Stripping each clonal read-depth
After the initial maximum spanning tree T is built, the connection patterns of partial variant sites on the haplotypes have been determined. However, there are a large quantity of variant sites not included in the initial maximum spanning tree T. The connection modes of these variant sites are relatively complex, including three types of linkages (B,A),(A,A),(A,B) with varying proportion of each according to the sub-clone proportion in the tumor sample. So we adopt the thickness stripping strategy to strip the read depth level by level from the founding clone to the uppermost layer descendant sub-clone. The remaining sub-clones can be processed in accordance with the method of constructing tree in the founding clone.
Thickness stripping strategy refers to finding the separation point that identifies the current sub-clone and divides all the clones into two parts in the direction of alleles. If sub-clone Si is the first sub-clone which two variants have been mutated, this sub-clone should be considered as the demarcation line and the upper parental clone of this sub-clone should be separated at a mixed ratio. With the process of evolution, the variant allelic frequency is decreasing, so the initial value of sub-clone proportion should be estimated according to the mean of allelic frequency of each sub-clone. After separation, there are only two types of connection ways between allelic sites for remaining sub-clones and the weights of the two ways are the same in positive and negative. So the connection between the allelic sites can be clearly judged. For the portion of sub-clones to be stripped, they can be separated from the mixed sequencing data according to the mixed ratio of the sub-clones. Set the number of clones to be Ns, and then the thickness stripping formula is:
$$P^{A}(p_{i},p_{j}) = \left| F(p_{i},p_{j})\right| \times \left(\sum\limits_{0 \leq i \leq S_{k}} r_{i} + \sum\limits_{S_{k+1} \leq i \leq S_{N_{s}}} \frac{1}{2} r_{i}\right) $$
$$P^{B}(p_{i},p_{j}) = \left| F(p_{i},p_{j})\right| \times \left(\sum\limits_{S_{k+1} \leq i \leq S_{N_{s}}} \frac{1}{2} r_{i}\right) $$
Adjust the coverage of the allelic site 〈pi,pj〉, we have
$$\hat{N^{A}}(p_{i},p_{j}) = N^{A}(p_{i},p_{j}) - P^{A}(p_{i},p_{j}) $$
$$\hat{N^{B}}(p_{i},p_{j}) = N^{B}(p_{i},p_{j}) - P^{B}(p_{i},p_{j}) $$
After separating the data, the new undirected weighted graph G′ is established by using the same method for the undirected weighted graph G and the weight calculation formula becomes
$$\begin{aligned} W^{\prime}\left\langle p_{i},p_{j} \right\rangle = f^{\prime}\left(b_{p_{i}},b_{p_{j}},A\right) - f^{\prime}(b_{p_{i}},b_{p_{j}},B) \end{aligned} $$
$$\begin{aligned} f^{\prime}(b_{p_{i}},b_{p_{j}},A) = \frac{ \left((1-\epsilon)^{2} + \epsilon^{2}\right) \times \widehat{N^{A}}(p_{i},p_{j}) + 2\epsilon \times (1-\epsilon) \times \widehat{N^{B}}(p_{i},p_{j})} { \widehat{N^{A}}(p_{i},p_{j}) + \widehat{N^{B}}(p_{i},p_{j}) } \end{aligned} $$
$$\begin{aligned} f^{\prime}(b_{p_{i}},b_{p_{j}},B) = \frac{ \left((1-\epsilon)^{2} + \epsilon^{2}\right) \times \widehat{N^{B}}(p_{i},p_{j}) + 2\epsilon \times (1-\epsilon) \times \widehat{N^{A}}(p_{i},p_{j})} { \widehat{N^{A}}(p_{i},p_{j}) + \widehat{N^{B}}(p_{i},p_{j}) } \end{aligned} $$
Where \(\phantom {\dot {i}\!} \left | W^{\prime }\left \langle p_{i},p_{j} \right \rangle \right | > \delta \) represents an effective edge between 〈pi,pj〉 allelic sites. When adding it to the graph G′, the corresponding weight is \(W^{\prime }\left \langle p_{i},p_{j} \right \rangle \). The default value of δ is 0.1. When the graph G′ is constructed, it can basically contain all the variation sites from the reference sequence. Then the maximum spanning tree T is extended to T′ according to the graph G′ following the steps in section Expanding local contigs.
In order to finally reconstruct the haplotype that contains the variants as many as possible, it is necessary to ensure that the extended tree T′ contains more variants. Lower coverage and higher threshold of coverage will cause some variation sites be left out. Thus, we automatically adjust to lower coverage threshold and edge weight threshold to ensure lower false-negative as much as possible. We adopt depth-first traversal of all vertexes in extended tree T′ and sort according to the relative positions in the order of reference sequence. After sorting, the state set of vertexes is a haplotype of the last sub-clone. Assuming that all variation sites are heterozygous, another haplotype of the last sub-clone is easily obtained. According to the linear evolution relationship among the sub-clones, the haplotypes of the remaining sub-clones are obtained by the following formula.
$$ h_{2i,j}=\left\{ \begin{array}{rcl} A, {h_{2(i+1),j} = A}\\ B, {h_{2(i+1),j} = B \ and \ sub(j) \leq i}\\ A, {h_{2(i+1),j} = B \ and \ sub(j) > i} \end{array} \right. $$
Where, i represents the label of sub-clone and j represents variation site number. sub(j) represents the sub-clone label of the jth variation site, h2i,j represents the base state of the jth site from father chain on the ith sub-clone. By the above formula, we can find base state of the site from one haplotype corresponding to the paired haplotype.
$$ h_{2i,j}=\left\{ \begin{array}{rcl} h_{2i,j} \oplus 1, {sub(j) \leq i}\\ h_{2i,j}, {i < sub(j) \leq I-1} \end{array} \right. $$
The construction process is shown in Fig. 4.