 Research article
 Open Access
 Published:
CLImATHET: detecting subclonal copy number alterations and loss of heterozygosity in heterogeneous tumor samples from wholegenome sequencing data
BMC Medical Genomics volume 10, Article number: 15 (2017)
Abstract
Background
Copy number alterations (CNA) and loss of heterozygosity (LOH) represent a large proportion of genetic structural variations of cancer genomes. These aberrations are continuously accumulated during the procedure of clonal evolution and patterned by phylogenetic branching. This invariably results in the emergence of multiple cell populations with distinct complement of mutational landscapes in tumor sample. With the advent of nextgeneration sequencing technology, inference of subclonal populations has become one of the focused interests in cancerassociated studies, and is usually based on the assessment of combinations of somatic singlenucleotide variations (SNV), CNA and LOH. However, cancer samples often have several inherent issues, such as contamination of normal stroma, tumor aneuploidy and intratumor heterogeneity. Addressing these critical issues is imperative for accurate profiling of clonal architecture.
Methods
We present CLImATHET, a computational method designed for capturing clonal diversity in the CNA/LOH dimensions by taking into account the intratumor heterogeneity issue, in the case where a reference or matched normal sample is absent. The algorithm quantitatively represents the clonal identification problem using a factorial hidden Markov model, and takes an integrated analysis of read counts and allele frequency data. It is able to infer subclonal CNA and LOH events as well as the fraction of cells harboring each event.
Results
The results on simulated datasets indicate that CLImATHET has high power to identify CNA/LOH segments, it achieves an average accuracy of 0.87. It can also accurately infer proportion of each clonal population with an overall Pearson correlation coefficient of 0.99 and a mean absolute error of 0.02. CLImATHET shows significant advantages when compared with other existing methods. Application of CLImATHET to 5 primary triple negative breast cancer samples demonstrates its ability to capture clonal diversity in the CAN/LOH dimensions. It detects two clonal populations in one sample, and three clonal populations in one other sample.
Conclusions
CLImATHET, a novel algorithm is introduced to infer CNA/LOH segments from heterogeneous tumor samples. We demonstrate CLImATHET’s ability to accurately recover clonal compositions using tumor WGS data without a match normal sample.
Background
Cancer is a disease that develops through accumulation of various genetic variability [1]. The clonal principles of tumor progression states that once initialization of a single founder cell is activated, various factors associated with carcinogenesis permit the descendants of the founder cell to resist apoptosis and undergo clonal growth, accompanying with acquirement of genomic alterations through multiple rounds of clonal expansion, and ultimately to invade neighboring tissues and metastasize to distant organs [1–5]. A tumor is thus heterogeneous and a mixture of multiple cell populations, and each population is characterized by a distinct complement of genomic aberrations. Genomic aberrations consist of somatic mutations such as singlenucleotide variations (SNV), copy number alterations (CNA), loss of heterozygosity (LOH), and more complicated structure variations (SVs) including inversion, translocation and etc., of which CNA and LOH are two frequently observed features of cancer genomes, and accurate detection of these aberrations is a crucial step to identify cancercausing genes [6, 7].
Cancerassociated studies have been greatly promoted by continuous advances in experimental technologies for landscape of cancer genomes [8–12]. With the advance of sequencing technologies, highthroughput DNA sequencing presents an unprecedented advantage in deconvolving intratumor heterogeneity and detecting subclonal aberrations compared with arraybased technologies. Wholegenome sequencing (WGS) of tumor samples is now a generally adopted approach for comprehensive analysis of structural and nucleotidelevel aberrations that underpin tumor progression [13]. However, analysis of WGS data is generally complicated by several issues. For example, tumor sample is usually impure and mixture of cancerous and normal cells [9, 14]. The fraction of cancerous cells is usually represented as tumor purity. Low tumor purity will significantly diminish sequencingderived signals, making it complicated to distinguish between aberrant and normal regions. Another intractable issue associated with cancer genomes is aneuploidy, caused by deletions or duplications of genomic segments or entire chromosomes in cancer genomes, and the average copy number of cancer genome is usually unknown [15–17]. One other more complex issue is the intratumor heterogeneity that results from the ongoing subclonal evolution [18], and the underlying clonal architecture is usually unavailable when dealing with patients’ tumor samples.
In this study, we present CLImATHET, a novel algorithm based on the framework of CLImAT [19], for inferring subclonal CNA and LOH segments by taking into account the intratumor heterogeneity issue, in the case where a reference or matched normal sample is absent. The model jointly explores both read counts and allelic read depths at known SNPs across the whole genome from tumor WGS data (Fig. 1a), and quantitatively represents the copy number profiles of multiple subclonal populations, which is one novelty of CLImATHET. For each aberration event, we assume observed signals result from the joint effects of three distinct populations of cells: normal cells, tumor cells harboring the event and tumor cells without the aberration (Fig. 1b). The cellularity of CNA/LOH event is defined as the proportion of cells harboring the event. We further assume that multiple cooccurring events can be designated to one of a finite number of clonal populations. We adopt a factorial hidden Markov model (HMM) with two underlying Markov chains to delineate genomic aberrations and clonal clusters (Fig. 1c). In the factorial HMM, one Markov chain depicts genome aberrations and another represents the corresponding clonal clusters. Furthermore, the structure of the factorial HMM is automatically optimized by using an embedded model selection module based on Bayesian information criterion (BIC), which is another novelty of CLImATHET.
In contrast to related approaches in the literature, our method presents several distinct characteristics. APOLLOH [13] is designed to infer LOH from tumor sequence data, however it does not take into account the issue of tumor heterogeneity. PurBayes [20] uses somatic singlenucleotide variants derived from paired tumornormal samples to estimate tumor cellularity and subclonality, and is only applicable to diploid tumor samples due to no integration of tumor ploidy and CNA information. THetA [21] is designed to infer cancer subclones by addressing a maximum likelihood mixture decomposition problem based on paired tumornormal samples, but it cannot automatically determine the number of underlying subclonal populations. A recently developed method, TITAN [22], is designed to jointly call CNA and LOH segments from WGS data of tumornormal paired samples. TITAN introduces a delicate factorial HMM to infer tumor subclones and performs an exhaustive search for the number of clonal populations up to 5 to find the optimal model structure. Our approach differs from TITAN in two aspects: 1) CLImATHET uses only single tumor sample and no reference or matched normal sample is required, and 2) CLImATHET learns the model structure under a BIC. Another method called OncoSNPSEQ [23] accounts for intratumor heterogeneity by employing a nonclustering approach to model the distinct cell populations [22].
We perform a comprehensive evaluation of CLImATHET using simulated and real WGS data. By quantitative benchmarking on simulated datasets, we demonstrate that CLImATHET has the capacity to accurately infer cellularity of clonal clusters and subclonal CNA/LOH segments, and shows significant advantages when compared with existing approaches. We apply CLImATHET to 5 primary triple negative breast cancer (TNBC) samples to show its ability to identify clonal diversity in the CNA/LOH dimensions.
Methods
CLImATHET pipeline
The pipeline of CLImATHET is illustrated in Fig. 1a. The inputs to the model are extracted from WGS data using our previously published tool DFExtract [19]. Our analysis covers more than 2.6 million known SNPs along the whole genome. Copy number data of N SNPs is represented by read counts d _{1:N }, meanwhile B allele frequency is represented by Ballele read depth b _{1:N } and total read depth T _{1:N }. Following the procedures adopted in CLImAT [19], read counts are obtained by counting the reads within a 1000bp window centered at each SNP, and further processed to correct GCcontent and mappability bias. For B allele frequency, quantile normalization of read depths is automatically performed to eliminate allelic bias based on selection of optimal threshold by a grid search. CLImATHET then jointly analyze d _{1:N }, b _{1:N } and T _{1:N } using an integrated HMM, and iteratively examine the model complexity using BIC for different number of clonal clusters, and finally output clonal/subclonal CNA and LOH segments as well as the cellularity of each clonal cluster.
The statistical models in CLImATHET
It is intractable to precisely depict the genomewide aberration status of tumors containing multiple subclones, therefore we adopt a simple assumption that the observed signals at a genomic locus are generated from underlying three types of cell populations: normal (nontumor) cells, tumor cells with normal genotype, and tumor cells harboring the aberration event of interest (Fig. 1b). Thus, cell populations can be ultimately divided into two parts at a genomic locus: one with normal genotype and relative abundance of (1–β), and another harbors the aberration event with cellularity of β. To avoid extensive local parameters, we further assume all the aberration events can be clustered into a finite number of K groups, of which the kth group corresponds to the kth clonal population with respective cellularity of β _{ k } (β _{ 1 } ˂β _{ 2 } ˂…˂β _{ K }), and tumor purity is equal to β _{ K }. Based on these assumptions, the model parameters can be effectively inferred by borrowing statistical approaches.
We define a set of states to depict copy number status of tumor genomes (Table S1, Additional file 1). Each state is represented by copy number, tumor genotypes derived from the normal genotypes by the deletion or duplication of alleles and aberration type. We only consider copy number status up to 7 copies, which is empirically based on the fact that most existing CNAcalling methods use this value as an upper bound [9, 19, 23]. Based on the emission models proposed in our previous study [19], we extend the conditional probabilities of read counts and Ballele read depth by introducing the contributions of multiple clonal populations. Given the aberration state c and the kth clonal cluster, we assume Ballele read depth is binomial distributed with the conditional probability defined as follows:
In formula (1), we only consider heterozygous genotypes in each aberration state. The mean of B allele copy number z _{ c,k } and total copy number y _{ c,k } are defined as:
where n _{ s } and u _{ s } deonte the copy number and expected B allele frequency (BAF) of normal cells respectively, n _{ c } and u _{ c } represent the copy number and BAF of tumor cells respectively. In addition, we assume read counts is negative binomial (NB) distributed with the conditional probability defined as follows:
where λ is mean read counts associated with normal copy, and λ _{ c,k } is formulated as:
The meanings of all other parameters involved in above formulas are the same as the ones described in CLImAT, and we do not go into detail here.
The conditional probabilities described above rely on two latent variables, aberration state c and clonal cluster k, therefore we implement CLImATHET as a factorial HMM with C × K hidden states by combining tumor genotypes and clonal clusters, where C is the number of aberration states defined in Table S1 and K is the number of clonal clusters. The HMM thus has two underlying Markov chains with one chain depicting aberration state sequence and another delineating corresponding clonal clusters (Fig. 1c). We employ expectation maximization (EM) algorithm [24] to learn the model parameters θ = (π, A, β, λ, p), where π is the initial state probability distribution, A is the state transition matrix, β is the cellularity of all clonal clusters, λ denotes the copy neutral read counts, and p refer to the success probability as a parameter of NB distributions. For the expectation step of EM algorithm, we calculate the expectation of the partial loglikelihood function of read counts and Ballele read depth respectively, and formulated as:
where γ _{ i,c,k } is the posterior probability of the ith SNP being in aberration state c and the kth clonal cluster, and is calculated using forwardbackward algorithm [25]. For the maximization step, we use NewtonRaphson method [26] to update all model parameters until predefined convergence criterion is met. We define relative increment of the value of the loglikelihood function from iteration n1 to n as follows:
where LL _{ n } is the value of the loglikelihood function in the nth iteration. If the value of Inc is less than a specific threshold (1 × 10^{−4}), then the parameter updating procedure is stopped.
Given the number of clonal clusters K, the EM algorithm is implemented as follows: 1) start with initial parameters θ ^{0} = (π ^{0},A ^{0},β ^{0},λ ^{0},p ^{0}) and calculates the jointposterior probability γ _{ i,c,k }, 2) update θ ^{1} = (π ^{1},A ^{1},β ^{1},λ ^{1},p ^{1}) using Newton–Raphson method, 3) repeat steps (1) and (2) until a specified number of iterations are reached or the convergence criterion is met. The converged parameters \( \widehat{\theta}=\left(\widehat{\pi},\widehat{A},\widehat{\beta},\widehat{\lambda},\widehat{p}\right) \) in the last iteration of the training process will be output as the optimal estimators. Furthermore, we perform a grid search of the initial parameters θ ^{0} = (π ^{0},A ^{0},β ^{0},λ ^{0},p ^{0}) to find the globally optimal solution.
As a vital parameter of CLImATHET model, the number of clonal clusters is determined using BIC. The BIC of a model is defined as follows:
where \( \widehat{L} \) is the maximized likelihood of the model, α > 0 is the regularizing term, m is the number of free parameters to be estimated and N is the number of SNPs. Our goal is to find an optimal value of K that leads to the model with the minimum value of BIC. A feasible solution is to perform an exhaustive search for possible values but it is practically intractable. Alternatively, CLImATHET starts with the initial assumption of tumor homogeneity (K = 1) and then iteratively increases clonal cluster number by one (K = K + 1) until the BIC value of the model no longer decreases. The optimal model is thus the one in the final iteration. For the computational convenience, we do not directly calculate the BIC of the model, but the difference of BIC between two adjacent models. Suppose that the number of clonal clusters is n in the nth iteration, and the BIC difference between the current model and the previous model is defined as follows:
where \( \Delta {\widehat{L}}_n \) and Δm _{ n } are the increment of the likelihood and the number of free parameters from model n1 to model n, respectively. Δm _{ n } is measured by formula as follows:
The iteration is stopped once the BIC difference is greater than zero. After the model learning is finished, aberration information including copy number, tumor genotype, aberration type and clonal cluster for each SNP can be inferred from the hidden state with the maximum posterior probability. At the same time, segmentation of all SNPs based on hidden states is performed to output clonal/subclonal CNA and LOH segments.
Similar to the method described in CLImAT, we define a reliability score for the ith segment as follows:
where b _{ ij } and T _{ ij } are Ballele read depth and total read depth of the jth heterozygous SNP in the ith segment, and d _{ ij } is the corresponding read counts, \( {\overline{b}}_{ij} \) and \( {\overline{d}}_{ij} \) denotes the expected Ballele read depth and read counts associated with aberration state c and the kth clonal cluster, respectively. Furthermore, the scores for all segments are scaled to 0 ~ 100.
Implementation of CLImATHET
An efficient implementation of CLImATHET by using C/Matlab is freely available at GitHub [27], our previously published tool DFExtract [19] is used to prepare input files for CLImATHET.
Datasets
Real dataset
WGS data from 5 primary TNBC samples described in a previous study [28] are analyzed in this study. Each sample was sequenced at approximately 30× coverage on the Life/ABI SOLID sequencing platform.
Simulated dataset
To simulate tumors containing multiple clones, we first define one normal genome (denoted as n) and four tumor genomes (denoted as a, b, c and d), and then generate mixtures by mixing five genomes at predefined proportions. As illustrated in Figure S1 (Additional file 2), a is the main clone and the genome is generated by defining a number of segments along the reference, meanwhile each segment is specified by a copy number state including total copy number and major allele copy number. Genome b and c are constructed by introducing new aberrations into genome a, and genome d corresponds to the subclones deriving from the second clonal expansion based on b. We use a normal sample HCC1954BL downloaded from CGHub [29] to generate sequencing data of the simulated genomes by following these steps: 1) For each segment of the genome, reads aligned to the region are randomly and repeatedly sampled from BAM file of the sample HCC1954BL according to the copy number of the segment, 2) nucleotide sequences of the sampled reads are properly modified to match BAF values of the SNPs within each segment, and 3) the processed reads are merged to generate BAM files using SAMtools [30]. We generate mixtures of genomes by sampling reads from the BAM files at predefined proportions, and reads are sampled to 30× coverage at all mixtures. By this way, 20 simulated tumor samples (Table S2, Additional file 3) are generated to evaluate the performance of CLImATHET in detecting clonal and subclonal CNA and LOH segments.
For each simulated sample, the underlying cellularity of clonal clusters are computed based on the relationship between the tumor genomes making up the mixture. For example, the mixture “a_010b_030n_060” is made up by genomes a, b and n with respective proportions of p_{a} = 0.1, p_{b} = 0.3 and p_{n} = 0.6, thus there will be two clonal clusters with the underlying cellularity equal to 0.3 (p_{b}) and 0.4 (p_{a} + p_{b}); similarly, the mixture “a_020b_035c_025n_020” will yield three clonal clusters with cellularity of 0.35 (p_{b}), 0.25 (p_{c}) and 0.8 (p_{a} + p_{b} + p_{c}), while the mixture “a_010b_010d_025n_055” contains three clonal clusters with cellularity of 0.25 (p_{d}), 0.35 (p_{b} + p_{d}) and 0.45 (p_{a} + p_{b} + p_{d}).
Competitive methods
Two advanced methods, OncoSNPSEQ [23] and CLImAT [19], are adopted to make comparison between the performance of CNA and LOH detection algorithms. When running OncoSNPSEQ on the simulated samples, the simulated SNP sites are used to prepare the input file during the preprocessing step, and we run the main program with the option “chr_range [1:22], −normalcontamination, −maxnormalcontamination 0.8, −seqtype illumina” on homogeneous samples, and with an additional option “tumourheterogeniety” on heterogeneous samples. As OncoSNPSEQ generally output multiple solutions per sample, and in this case we select the one associated with the maximum likelihood as the optimal solution. For CLImAT, we use the default configuration on all samples.
Performance evaluation
For the simulated samples, copy number and genotype profiles of all segments predefined in simulation experiment are used as the golden standard for evaluation. Accuracy is calculated for each method by comparing the predictions with the ground truth. We consider a segment to be accurately identified only if any predicted segment covers the 75% size of the segment and have exact matching of total and major copy number with the segment. Accuracy is defined as the proportion of accurately identified segments among all segments.
Results
In this section, we perform a comprehensive assessment of CLImATHET on both simulated and real datasets in terms of inferring cellularity of clonal clusters and identifying CNA/LOH segments.
Results on simulated data
We employ a simulation study to evaluate the accuracy of our estimates of cellularity of clonal clusters, tumor purity and predictions of CNA/LOH segments. Our simulated data is generated based on a real normal sample HCC1954BL downloaded from CGHub as described in the Methods. We run CLImATHET, OncoSNPSEQ and CLImAT to infer CNA/LOH segments and associated cellularity.
To assess the accuracy of cellularity estimations of CLImATHET, we compare the estimated cellularity to the ground truth cellularity for each clonal cluster. For each sample, the underlying cellularity of clonal clusters are computed based on the tumor genomes making up the mixture (more details in the Methods). CLImATHET is able to correctly infer the number of clonal clusters and corresponding cellularity for all the simulated mixtures. An example of cellularity estimations on a simulated sample “a_035b_015d_025n_025” with three clonal populations is shown in Fig. 2. CLImATHET correctly identify three clonal clusters and infer their cellularity (0.27, 0.41 and 0.72), simultaneously assign correct clonal cluster to 85% of all segments. OncoSNPSEQ infers one clonal cluster with cellularity of 0.7, and CLImAT estimates the tumor purity as 0.68. Moreover, we compute the Pearson correlation coefficient (r) and mean absolute error (MAE) for cellularity estimations, and the results show that the estimated cellularity of CLImATHET has a highly significant positive correlation (r > 0.99, pvalue < 5 × 10^{−5}, MAE < = 0.02) with the underlying cellularity across all simulated samples (Fig. 3a, b and c), which demonstrates CLImATHET’s ability to precisely recover clonal architecture. We further assess the accuracy of tumor purity estimations using the same metrics, and strongly significant correlation (r = 0.997, pvalue = 5.14 × 10^{−22}, MAE = 0.023) for all the mixtures relative to the ground truth tumor purity is observed for CLImATHET (Fig. 3d). OncoSNPSEQ accurately estimates the tumor purity for 90% of the samples (r = 0.878, pvalue = 3.71 × 10^{−7}, MAE = 0.053), and CLImAT also achieves good performance (r = 0.992, pvalue = 9.75 × 10^{−18}, MAE = 0.058).
Next, we proceed to assess the abilities of CLImATHET, OncoSNPSEQ and CLImAT in inferring tumor genotypes. We consider a segment to be accurately identified if and only if both the total and major copy numbers are accurately identified, and the sizes of the segments are not considered. For each simulated sample, total and major copy number profiles of all segments predefined in simulation experiment are used as the golden standard for evaluation. CLImATHET is able to accurately infer the total and major copy numbers of each segment for all the simulated mixtures. An example result from the simulated sample “a_035b_015d_025n_025” is shown in Fig. 4. The results show that CLImATHET infers the correct tumor genotype for 86% of all segments, while OncoSNPSEQ and CLImAT correctly identify the 61 and 68% of the segments, respectively. OncoSNPSEQ presents a relatively lower performance because we adopt strict evaluation strategy. For a general evaluation, we use metric accuracy, which is defined as the proportion of accurately identified segments, to compare the performance of the different methods. The results on simulated data are shown in Fig. 5. For the homogeneous tumor samples (Fig. 5a), there is only one clonal cluster with cellularity equal to tumor purity, all methods show high accuracy when tumor purity is larger than 0.4. For the heterogeneous tumor samples (Fig. 5b and c), OncoSNPSEQ and CLImAT achieve respective mean accuracies of 0.56 and 0.74, and CLImATHET accurately identify both the total and major copy numbers with a mean accuracy of 0.88. The performance of OncoSNPSEQ is declining on some samples, possibly the reads coverage of these samples is not deep enough to enable OncoSNPSEQ’s accurate inference of local parameters [23].
Results on real data
We also examine CLImATHET on 5 primary triple negative breast cancer samples, which are sequenced at approximately 30× coverage and also assayed by Affymetrix SNP6.0 array [28]. ASCAT [31] is a widely used software to analyze SNParray data, therefore we use it to infer the tumor purities.
The subclonal prediction results of CLImATHET on sample SA223 are shown in Fig. 6. CLImATHET identify one subclonal cluster with cellularity of 0.66, which is in good concordance with the tumor purities estimated by CLImAT, APOLLOH [13] and ASCAT (Table 1). This population presents copy neutral LOH spanning chromosomes 3p(22.3–11.1), 5q, 13q(13.3–33.2), 15q and 16q(13–24.3), and amplified heterozygous regions on chromosomes 1p, 1q(25.3–44), 5p, 6, 7, 9p, 17q, 18 and 20. ASCAT infers the measure of goodness of fit as 0.82 for sample SA223, showing that there may be aberration events represented in subclonal populations and not well interpreted by the models. Interestingly, CLImATHET identify another subclonal cluster with cellularity of 0.39, and copy neutral LOH regions on chromosomes 1q, 9q(21.2–34.3) and 22q, and segmental amplifications on chromosomes 3p(26.3–22.3), 3q(11.2–27.3), 4, 10, 11 and 12 are represented in the population. Figure 7a shows the loglikelihoods and BIC differences of sample SA223 under each iteration, and CLImATHET selects K = 2 as the optimal number of clonal populations since the BIC difference becomes to be positive when the number of clonal populations increases to 3.
For sample SA227, CLImATHET predicts it as heterogeneous with three distinct clonal populations, and the subclonal prediction results are shown in Fig. 8. One estimated subclonal cellularity of 0.44 is in accordance with the tumor purities estimated by CLImAT and ASCAT (Table 1). Copy neutral LOH on chromosome 5q and amplified LOH on chromosomes 3p, 8p and 9 are represented in this population. One other clonal cluster presents a similar cellularity of 0.5, which may correspond with the relatively higher measure of goodness of fit inferred as 0.9 by ASCAT. The genome of this clonal population is featured by a number of segmental amplifications across chromosomes 1, 2q, 3q, 5p, 6, 12, 17q, 18 and 20q, and copy neutral LOH regions spanning chromosomes 13 and 15. The third clonal cluster has a relatively lower cellularity of 0.21, and aberration events mainly include copy neutral LOH regions located at chromosome 11q and amplified LOH regions on chromosomes 4p, 11p, 16p and 22q. Figure 7b shows the loglikelihoods and BIC differences of sample SA227, and CLImATHET selects K = 3 as the optimal number of clonal populations.
For samples SA220, SA224 and SA225, the measures of goodness of fit returned by ASCAT are 0.95, 0.94 and 0.98 respectively, indicating these samples are probably homogeneous. Accordingly, CLImATHET analysis of these samples shows that there is also no significant evidence for existence of subclonal events, and the inferred cellularity of the single clonal population is 0.62, 0.52 and 0.75 respectively, which are in good concordance with the tumor purities predicted by CLImAT and ASCAT (Table 1). The copy number estimation results of these samples are shown in Figure S24 (Additional file 2).
Discussion
CLImATHET is a novel statistical method for inferring subclonal CNA and LOH segments from WGS data of heterogeneous tumor samples. It is developed based on our previously published algorithm CLImAT, and take into account the issue of intratumor heterogeneity. Compared with CLImAT that only handles homogeneous tumor samples, CLImATHET can efficiently deal with both homogeneous and heterogeneous tumor samples. The read counts and read depths data generated from multiple cell populations is quantitatively represented, and proper emission models are adopted in the HMM. Furthermore, we integrate a BIC module into the CLImATHET framework to effectively determine the underlying number of subclonal populations. These features enable CLImATHET’s advantages in handling complex WGS data. First, the appropriate decomposition of mixed signals improves the CNA/LOH detection performance. Second, CLImATHET is more sensitive to the aberration events represented in minor cell populations when compared to existing methods as shown in simulated study. Third, CLImATHET is able to accurately infer the cellularity each subclonal cluster. Finally, the evaluation results on both simulated and real WGS data demonstrate the advantages of our algorithm.
CLImATHET also has limitations due to its adopted modeling assumptions. The assumption that only one aberrant genotype exists at each genomic locus will not hold if distinct subclonal populations have different aberrant genotypes at the same locus. However, it is difficult to distinguish among multiple subclones that have variable genotypes. The emission models of CLImATHET need to be extended to account for these situations, and the joint analysis of read counts and read depths signals may output multiple solutions. Identifying distinct haplotypes harboring linked mutations might be a good way to identify multiple mutations at the same locus in different populations.
Conclusions
In summary, we demonstrate that CLImATHET represents remarkable advantages in inferring subclonal CNA/LOH segments from tumor WGS data. The high performance of CLImATHET benefits from delicate representation of copy number profiles of distinct cell populations, and efficient statistical methods for inference of the global parameters. We expect that CLImATHET will complement the arsenal of bioinformatics tools developed for investigating the role of tumor tumourigenesis and progression.
Abbreviations
 BAF:

B allele frequency
 BIC:

Bayesian information criterion
 CNA:

Copy number alterations
 EM:

Expectation maximization
 HMM:

Hidden Markov model
 LOH:

Loss of heterozygosity
 NB:

Negative binomial
 SNV:

Singlenucleotide variants
 SV:

Structure variations
 TNBC:

Triple negative breast cancer
 WGS:

Wholegenome sequencing
References
Nowell PC. The clonal evolution of tumor cell populations. Science. 1976;194(4260):23–8.
Greaves M, Maley CC. Clonal evolution in cancer. Nature. 2012;481(7381):306–13.
Yates LR, Campbell PJ. Evolution of the cancer genome. Nat Rev Genet. 2012;13(11):795–806.
Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458(7239):719–24.
Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. 2011;144(5):646–74.
BentiresAlj M, Gil SG, Chan R, Wang ZC, Wang Y, Imanaka N, Harris LN, Richardson A, Neel BG, Gu H. A role for the scaffolding adapter GAB2 in breast cancer. Nat Med. 2006;12(1):114–21.
Hicks J, Krasnitz A, Lakshmi B, Navin NE, Riggs M, Leibu E, Esposito D, Alexander J, Troge J, Grubor V. Novel patterns of genome rearrangement and their association with survival in breast cancer. Genome Res. 2006;16(12):1465–79.
Park PJ. Experimental design and data analysis for array comparative genomic hybridization. Cancer Invest. 2008;26(9):923–8.
Li A, Liu Z, LezonGeyda K, Sarkar S, Lannin D, Schulz V, Krop I, Winer E, Harris L, Tuck D. GPHMM: an integrated hidden Markov model for identification of copy number alteration and loss of heterozygosity in complex tumor samples using whole genome SNP arrays. Nucleic Acids Res. 2011;39(12):4928–41.
Metzker ML. Sequencing technologiesthe next generation. Nat Rev Genet. 2010;11(1):31–46.
McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A, Shapero MH, de Bakker PI, Maller JB, Kirby A. Integrated detection and populationgenetic analysis of SNPs and copy number variation. Nat Genet. 2008;40(10):1166–74.
Shendure J, Ji H. Nextgeneration DNA sequencing. Nat Biotechnol. 2008;26(10):1135–45.
Ha G, Roth A, Lai D, Bashashati A, Ding J, Goya R, Giuliany R, Rosner J, Oloumi A, Shumansky K. Integrative analysis of genomewide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triplenegative breast cancer. Genome Res. 2012;22(10):1995–2007.
Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30(5):413–21.
Mitelman F. Recurrent chromosome aberrations in cancer. Mutat Res Rev Mutat Res. 2000;462(2):247–53.
Storchova Z, Pellman D. From polyploidy to aneuploidy, genome instability and cancer. Nat Rev Mol Cell Biol. 2004;5(1):45–54.
Storchova Z, Kuffer C. The consequences of tetraploidy and aneuploidy. J Cell Sci. 2008;121(23):3859–66.
Navin N, Krasnitz A, Rodgers L, Cook K, Meth J, Kendall J, Riggs M, Eberling Y, Troge J, Grubor V. Inferring tumor progression from genomic heterogeneity. Genome Res. 2010;20(1):68–80.
Yu Z, Liu Y, Shen Y, Wang M, Li A. CLImAT: accurate detection of copy number alteration and loss of heterozygosity in impure and aneuploid tumor samples using wholegenome sequencing data. Bioinformatics. 2014;30(18):2576–83.
Larson NB, Fridley BL. PurBayes: estimating tumor cellularity and subclonality in nextgeneration sequencing data. Bioinformatics. 2013;29(15):1888–9.
Oesper L, Satas G, Raphael BJ. Quantifying tumor heterogeneity in wholegenome and wholeexome sequencing data. Bioinformatics. 2014;30(24):3532–40.
Ha G, Roth A, Khattra J, Ho J, Yap D, Prentice LM, Melnyk N, McPherson A, Bashashati A, Laks E. TITAN: inference of copy number architectures in clonal cell populations from tumor wholegenome sequence data. Genome Res. 2014;24(11):1881–93.
Yau C. OncoSNPSEQ: a statistical approach for the identification of somatic copy number alterations from nextgeneration sequencing of cancer genomes. Bioinformatics. 2013;29(19):2482–4.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B Methodol. 1977;1977:1–38.
Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989;77(2):257–86.
Ypma TJ. Historical development of the NewtonRaphson method. SIAM Rev. 1995;37(4):531–51.
CLImATHET. https://github.com/USTCHILAB/CLImATHET. Accessed 8 Dec 2015.
Shah SP, Roth A, Goya R, Oloumi A, Ha G, Zhao Y, Turashvili G, Ding J, Tse K, Haffari G. The clonal and mutational evolution spectrum of primary triplenegative breast cancers. Nature. 2012;486(7403):395–9.
TCGA Mutation/Variation Calling Benchmark 4 at GDC. https://gdc.cancer.gov/resourcestcgausers/tcgamutationcallingbenchmark4files. Accessed 12 Mar 2017.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
Van Loo P, Nordgard SH, Lingjærde OC, Russnes HG, Rye IH, Sun W, Weigman VJ, Marynen P, Zetterberg A, Naume B. Allelespecific copy number analysis of tumors. Proc Natl Acad Sci. 2010;107(39):16910–5.
Acknowledgements
This manuscript was prepared using a limited access dataset obtained from British Columbia Cancer Agency Branch (BCCA) and does not necessarily reflect the options or views BCCA.
Funding
This work was supported by grants from National Natural Science Foundation of China (61571414, 61471331 and 31100955).
Availability of data and materials
Our tool can be download from https://github.com/USTCHILAB/CLImATHET.
Authors’ contributions
AL conceived the study. ZY designed and implemented the CLImATHET algorithm, and analyzed the results. ZY wrote the paper, AL and MW improved the manuscript. All authors approved the final manuscript for publication.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
This study was approved by the Research Ethics Board of University of Science and Technology of China.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Authors and Affiliations
Corresponding author
Additional files
Additional file 1: Table S1.
This file contains Supplementary Table S1, and provides the definition of copy number aberration states in CLImATHET. (PDF 104 kb)
Additional file 2:
This file contains Supplementary Figures. (PDF 1702 kb)
Additional file 3: Table S2.
This file contains Supplementary Table S2. (XLS 26 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Yu, Z., Li, A. & Wang, M. CLImATHET: detecting subclonal copy number alterations and loss of heterozygosity in heterogeneous tumor samples from wholegenome sequencing data. BMC Med Genomics 10, 15 (2017). https://doi.org/10.1186/s1292001702554
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1292001702554
Keywords
 Copy number alteration
 Loss of heterozygosity
 Wholegenome sequencing
 Intratumor heterogeneity
 Hidden Markov model
 Bayesian information criterion