Method overview
AmpliSolve consists of two main programs written in C++: AmpliSolveErrorEstimation and AmpliSolveVariantCalling. AmpliSolveErrorEstimation requires the availability of a set of normal samples processed with the deep sequencing platform and panel of choice. Here, we focus on the Ion Torrent Personal Genome Machine (PGM) and a custom AmpliSeq panel, a technology known to have relatively high rates of sequencing error compared to others. The program uses the normal samples to infer position-specific, nucleotide specific and strand-specific background sequencing error levels (noise) across the targeted regions. Execution of AmpliSolveErrorEstimation is performed only once per panel design. Error estimates are then used as input to the AmpliSolveVariantCalling program for SNVs’ detection. The procedures for in-silico noise estimation and SNV identification are described below. In Fig. 1 we present a graphical overview of the AmpliSolve computational workflow.
In-silico identification of the background sequencing error
Our strategy for estimating background error levels, implemented in the AmpliSolveErrorEstimation program, is based on the assumption that alternative alleles observed at VAF < 5% in normal samples are, in the majority of cases, the result of sequencing errors (see Additional file 2: Figure S1 for the distribution of non-reference allele frequencies in normal samples showing the separation between heterozygous germline variants and lower frequency ‘noise’ variants). Accordingly, we utilize a set of normal samples to estimate background noise in our custom panel. Notably, we estimate error levels separately for each genomic position, each nucleotide (alternative allele) and each of the two (forward and reverse) strands. In particular, for each genomic position we generate six error estimates (i.e. two each for the three alternative alleles). Error estimates are fed to a Poisson model, which is then used to calculate the p-value of the observed substitutions representing true variants versus them being noise. The detailed implementation is as follows. We first extract “raw” counts for each position, alternative allele and strand from the BAM files [15] of a set of N normal samples using the ASEQ software [16]. We run ASEQ with the quality parameters suggested by the authors of a previous study based on Ion AmpliSeq data [17], namely: minimum base quality = 20, minimum read quality = 20 and minimum read coverage = 20. At every genomic position, we estimate the background error s separately for each alternative allele α and strand (+ or -) by calculating the fraction of reads carrying the alternative allele on a given strand across all normal samples. More specifically we use the following formula:
$$ {s}^{\alpha, +/-}=\frac{Er^{\alpha, +/-}}{{Er d}^{+/-}}+C $$
(1)
with
$$ {Er}^{\alpha, +/-}=\sum \limits_{i=1}^N{R}_i^{\alpha, +/-} $$
(1a)
and
$$ {Erd}^{+/-}=\sum \limits_{i=1}^N{RD}_i^{+/-} $$
(1b)
We denote with Riα,+ and Riα,- the number of reads supporting the alternative allele α on the forward and reverse strand, respectively, in normal sample i. We denote with RDi+ and RDi− the total number of reads (read depth) at the genomic position of interest on the forward and reverse strand, respectively, in normal sample i. Summations are taken over all normal samples utilized for the error estimation. C in eq. (1) is a constant pseudo-count parameter that is introduced to mitigate the problem of positions in which the alternative allele read count might be underestimated (e.g. due to a relatively low read depth at a given position in the normal samples). In the Results section we test values of C in the range from 10− 5 to 2•10− 2.
When calculating the summations in (1a) and (1b), we apply two filters that aim to increase the quality of the error estimation at specific positions and for specific alternative alleles α at that position. First, at a given position, samples for which an alternative allele α has VAF > 5% are not considered at that position for that particular allele. This is because a frequency greater than 5% is likely to indicate, in that sample, either the presence of a real variant (i.e. a single nucleotide polymorphism) or a particularly noisy ‘read-out’. Second, samples that at a given position have coverage lower than a predefined threshold either on the forward or on the reverse strand are not considered for computing Erα, + /− (1a) and Erd+/− (1b) for any allele α at that position. In the following we use a threshold of 100 reads which, in our case, typically excludes 5% of sites per sample (see Additional file 3: Figure S2); however, this parameter can be adjusted depending on the study design. After applying these filters, positions and alternative alleles for which 2/3 or more of the normal samples cannot be used for calculating the summations in (1a) and (1b) are considered non-callable. We note that among non-callable cases there may be positions with alleles that are either frequent in the population or simply over-represented in the specific set of normal samples used for the error estimation. However, given that AmpliSolve main goal is the identification of somatic mutations this does not constitute a major limitation.
SNV detection using a cumulative Poisson distribution
Given a sample of interest, for every alternative allele α featuring a non-zero strand-specific (+ or –) variant read count kα, + /−, the AmpliSolveVariantCalling program uses a Poisson model to calculate the probability that kα, + /− or more variant reads are produced by sequencing errors, i.e. the p-value. Only positions that, in the sample of interest, have read depth on each strand higher than a pre-assigned threshold RDmin are considered (in the following, we set RDmin = 100 unless otherwise specified). At these positions, the calculated p-value is a function of the normal sample-based sequencing error sα, + /− from the previous section and of both the number of variant reads kα, + /− and the strand-specific read depth K+/− in the sample of interest (K+/− > RDmin). In particular:
$$ p- value\left( variant\ \right|\ {k}^{\alpha, +/-},{K}^{+/-},{s}^{\alpha, +/-}\Big)=1-{e}^{-K\ast s}\sum \limits_{i=0}^{k-1}\frac{{\left(K\ast s\right)}^i}{i!} $$
(2)
Where, for better readability, on the right side of the equation we have omitted all α symbols for k and s, as well as, +/− symbols for k, K and s. We observe that K*s is the expected number of random substitutions for a depth of coverage K or the mean of the Poisson distribution. Note that p-values are not corrected for multiple testing. The strand-specific p-values are finally converted to quality scores using the formula Q = -10*log10(p-value). In its output, for all positions in the panel carrying substitutions with Q score equal to or greater than 5 on both strands, AmpliSolve reports the average Q score between the two strands. Vice versa, positions with no substitutions or with substitutions with associated Q score lower than 5 on one or both strands are not reported in AmpliSolve’s output. All reported SNVs are further tested for and potentially assigned one or more of the following warning flags:
-
a)
‘LowQ’ if the Q score is lower than 20 in at least one of the two strands.
-
b)
‘LowSupportingReads’ if the SNV is supported by less than 5 reads per strand in the tumor samples being analysed.
-
c)
‘AmpliconEdge’ if the SNV is located within overlapping amplicon edge regions, which may result in sequencing artifacts.
-
d)
‘StrandBias’ if the SNV is associated to a strand-bias. We apply Fisher’s exact test to each SNV under the null hypothesis that the number of forward and reverse reads supporting the variant should be proportional to the total number of reads sequenced in the forward and reverse strands, respectively. The flag is assigned to substitutions for which the p-value of the Fisher’s exact test is lower than a pre-defined value SBth. In the following we set SBth = 0.05 (unless otherwise specified).
-
e)
‘HomoPolymerRegion’ if the SNV is located within a homopolymer region using the same criteria as in [18].
-
f)
‘PositionWithHighNoise’ if the SNV is supported by more than 5 reads per strand but the associated VAF is lower than the maximum VAF at this position across all normal samples in the training set.
If no warning is issued, AmpliSolve assigns a ‘PASS’ flag to the SNV.
Performance measures
To assess AmpliSolve’s success in detecting SNVs, we use a number of performance metrics:
-
1.
Sensitivity or True Positive Rate (TPR) = TP / (TP + FN)
-
2.
Precision or Positive Predictive Value (PPV) = TP/(TP + FP)
-
3.
False Discovery Rate (FDR) = 1-PPV=FP / (FP + TP)
-
4.
Harmonic mean of Precision and Sensitivity (F1) = 2*TP / (2*TP + FP + FN)
Where, TP is the number of True Positive predictions, FN is the number of False Negative predictions and FP is the number of False Positive predictions.
Clinical data used in this study
For the development and evaluation of AmpliSolve, we have access to an extensive collection of clinical samples from castration-resistant prostate cancer (CRPC) patients, part of which had been already presented in previous publications [17, 19,20,21]. The collection comprises 184 germline samples (white blood cells, buccal swabs or saliva) and more than 450 liquid biopsy plasma samples (note that for some patients there are multiple liquid biopsies and a small minority of liquid biopsy samples has no matched normal). In practice for this study, we rely on all 184 normal samples but only use 96 liquid biopsy samples for which results from digital droplet PCR (ddPCR) assays are available (see below). For 5 additional patients we have access to 10 solid tumor samples from metastatic sites (1, 2 and 3 samples from respectively 1, 3 and 1 patients) and their associated 5 germline samples. For the available samples, we have the following data:
-
a)
For all samples (germlines, liquid biopsies and solid tumors), we have Ion Torrent sequencing data obtained using a custom Ion AmpliSeq panel of 367 amplicons spanning 40,814 genomic positions at around 1000-1500x coverage. The panel targets both intronic and exonic regions in chromosomes 8, 10, 14, 17, 21 and X including commonly aberrated genes such as PTEN, CYP17A1, FOXA1, TP53, SPOP as well as the androgen receptor (AR) gene, which is one of the main drivers of CRPC, and the drug target CYP17A1. More details about the sequencing protocol, data processing and additional information about the application of our custom Ion AmpliSeq panel in CRPC diagnostic studies can be found in [17, 19]. These papers also include a description of a variant caller that we used as starting point for developing AmpliSolve. We call variants in these Ion AmpliSeq data with our program AmpliSolve.
-
b)
For the 10 solid tumor samples and 5 matched germline samples, in addition to Ion Torrent data, we have Illumina Whole Genome Sequencing (WGS) data at around 80-100x (tumor) and 30x (germline) coverage. We call variants in WGS data according to a previously established pipeline [22], which we describe in the next section.
-
c)
For 96 liquid biopsy samples, we have results from ddPCR assays to screen 3 clinically relevant SNVs in the AR gene. These SNVs have been linked to resistance to targeted therapy in CRPC patients, namely: 2105 T > A (p.L702H), 2226G- > T (p.W742C) and 2632A > G (p.T878A). ddPCR in the plasma samples was performed using 2–4 ng of DNA, using Life Technologies Custom Taqman snp genotyping assay (product codes AH0JFRC, C_175239649_10 and C_175239651_10, respectively). Following droplet generation (AutoDg, Bio-Rad) and PCR, samples were run on the Bio-Rad QX200 droplet reader and analyzed using the QuantaSoft software.
WGS variant calling pipeline
We used Illumina WGS data to generate a benchmark set of calls (“ground-truth”) against which AmpliSolve performance is evaluated. WGS data have been processed using standard tools, such as Skewer [23] for adapter trimming, BWA-MEM [24] for mapping and Picard [25] for duplicate removal. In order to call SNVs we run a previously developed pipeline [22] that utilizes jointly Mutect [26] and Platypus [18] (throughout the manuscript this pipeline is denoted as MutPlat). Briefly, we first run Mutect (default parameters) on each paired tumor-normal samples. Then, we use Mutect’s calls as priors for Platypus and jointly call variants on all tumors and matched normal samples of a patient (further details are provided in Additional file 1: Supplementary Methods). Our ground-truth set of calls consists of both germline and somatic mutations extracted as explained below.
Germline variants are identified as those variants called in the normal (GT = 0/1 or 1/1) and that, additionally, have either a PASS filter flag or don’t have a PASS filter flag (could have e.g. ‘badReads’) but are present in 1000 genomes (phase 3 release) [27]. For AmpliSolve validation purposes we consider only tumor samples but include both germline and somatic SNVs. By including germline SNVs, in particular, we are able to test a higher number of low VAF mutations than would be possible when considering only somatic mutations. This is due to a combination of somatic deletions and germline DNA contamination in the tumor samples. Indeed, while somatic deletions cause loss of some germline SNPs in tumor DNA, germline DNA contamination (i.e. < 100% tumor purity) means that these mutations may still be present in the tumor samples, albeit with a lower VAF. Note that if somatic deletions occur in high tumor purity samples and/or the sequencing coverage is not deep enough, germline variants may have no supporting read at all in the WGS tumor data. We keep also these limiting cases as part of our ground-truth set of variants as they might be (and sometimes are) detectable in the targeted Ion AmpliSeq data.
To call a somatic SNV we require all of the following criteria to be met: i) Platypus filter: PASS, alleleBias, Q20, QD, SC or HapScore, ii) at least 3 reads supporting the variant in the tumor, iii) at least 10 reads covering the position in the germline and no support for the variant in the germline (NV = 0 and genotype GT = 0/0), iv) SNV not present in the 1000 genomes database.
SNV callers tested for comparison
On WGS and ctDNA samples, we compare AmpliSolve to SiNVICT [12], a tool that has been shown to be effective in detecting mutations at very low VAF in Ion Torrent data. We run SiNVICT (version 1.0) with default parameters and with no additional data pre-processing steps. We split the tumor samples (10 metastatic solid tumors plus 96 ctDNA samples) into 3 batches of similar size and we run SiNVICT simultaneously on all samples from each batch. SiNVICT applies a number of post-processing filters and calls variants at 6 different confidence levels, with level 6 assigned to variants that pass all filters. On ctDNA samples, we additionally compare AmpliSolve to deepSNV [9], a state-of-the-art method for calling low allele frequency variants in deep sequencing data (although originally designed to detect sub-clonal mutations on Illumina rather than Ion AmpliSeq data). We run deepSNV (version 1.21.3) with default parameters following the available vignette in the R package.
How to run AmpliSolve
AmpliSolve two modules, AmpliSolveErrorEstimation and AmpliSolveVariantCalling, can be downloaded from github (https://github.com/dkleftogi/AmpliSolve). Additional requirements include running versions of the programs Samtools [15], ASEQ [16] and the Boost libraries for C++. Here we provide a brief description of how to run AmpliSolve, however, more detailed information and a number of examples are available on the github page.
For a given amplicon panel, error estimation at each genomic position, for each alternative nucleotide and for each of the two strands requires availability of amplicon-based data from N normal sample files. Although we don’t enforce a minimum value for N, values below 10 are likely to give low-quality error estimations. In general, we suggest using as many normal samples as possible when training the error matrix for your panel. If normal samples are not available, AmpliSolveErrorEstimation assigns a constant error rate to all positions, nucleotides and strands in the panel. The default constant error is 0.01 but the user can specify a different value if needed (e.g. for different sequencing platforms). Note that AmpliSolveErrorEstimation does not take bam files as input but rather bam-derived read count files. Read count files can be obtained by running the program ASEQ [16]. Once the read count files have been produced, the user needs to set the value of the C pseudo-count parameter (eq. (1)). The choice of C will depend on the trade-off between precision and sensitivity the user is interested in. Users can refer to the benchmarking experiments performed in this paper. In general, values of C between 0.001 and 0.01 should suit most applications.
Once the error matrix has been calculated, it can be fed to the AmpliSolveVariantCalling program together with read count files for the tumor samples again to be produced by running ASEQ. Note that AmpliSolveVariantCalling does not require matched normal-tumor samples for calling SNVs. In fact, AmpliSolveVariantCalling calls all variants it can find in the tumor sample, including germline variants. To separate germline from somatic variants users will need to run AmpliSolveVariantCalling on a matched normal sample and take the difference between the two output files. Command-line syntax for running AmpliSolveErrorEstimation and AmpliSolveVariantCalling is provided on github.