Comparative assessments of indel annotations in healthy and cancer genomes with next-generation sequencing data

Background Insertion and deletion (indel) is one of the major variation types in human genomes. Accurate annotation of indels is of paramount importance in genetic variation analysis and investigation of their roles in human diseases. Previous studies revealed a high number of false positives from existing indel calling methods, which limits downstream analyses of the effects of indels on both healthy and disease genomes. In this study, we evaluated seven commonly used general indel calling programs for germline indels and four somatic indel calling programs through comparative analysis to investigate their common features and differences and to explore ways to improve indel annotation accuracy. Methods In our comparative analysis, we adopted a more stringent evaluation approach by considering both the indel positions and the indel types (insertion or deletion sequences) between the samples and the reference set. In addition, we applied an efficient way to use a benchmark for improved performance comparisons for the general indel calling programs Results We found that germline indels in healthy genomes derived by combining several indel calling tools could help remove a large number of false positive indels from individual programs without compromising the number of true positives. The performance comparisons of somatic indel calling programs are more complicated due to the lack of a reliable and comprehensive benchmark. Nevertheless our results revealed large variations among the programs and among cancer types. Conclusions While more accurate indel calling programs are needed, we found that the performance for germline indel annotations can be improved by combining the results from several programs. In addition, well-designed benchmarks for both germline and somatic indels are key in program development and evaluations.


Background
Insertion and deletion (indel) is the second-largest genetic variation type in human genomes. On average, one healthy human genome differs from the reference genome at about 566,000 sites with indel lengths ranging from 1 to 1000 base pairs (bps) [1]. Typically, small indels are termed for insertions/deletions of shorter than 50 bps while longer ones are considered as structural variants (SVs) [2,3]. Besides contributing to genetic variations in healthy population, deleterious indels in both coding and non-coding regions can lead to various types of diseases. For example, coding indels were identified in breast cancer development genes, including AKT1, BRCA1 and CDH1, and the fragile X syndrome is caused by a large insertion in 5′UTR of the FMR1 gene [4,5]. Several databases with annotated indels have been developed to document these variants, including dbSNP, dbVar, and COSMIC (the Catalogue Of Somatic Mutation In Cancer) [6][7][8].
Detection of genomic variations including indels represents one of the most important aspects in human genome analysis. Mills et al. reported 2 million unique indels in their updated analysis of 79 genomes in 2011 [9,10]. The indel set from these 79 genomes is commonly used as a reference for indel analysis since these indels were annotated with Sanger sequencing data, which reported a 97.2% validation rate [10]. There are also studies focused on somatic indels in cancer genomes. For example, Niu et al. analyzed 4201 non-frame-shift indels and identified more than 6000 mutation clusters on protein 3-dimentional (3D) structures across 19 cancer types [11]. Besides somatic coding indels, non-coding indels also play important roles in cancer genomes. Imielinski et al. found that non-coding somatic indels tend to be enriched in lineage-defining genes in multiple cancer genomes [12].
Next-generation sequencing (NGS) technology has reduced the sequencing cost and produced more genome sequence data. A number of programs have been developed for both germline indel and somatic indel identification from NGS data [25][26][27]. Current indel calling programs use different algorithms to distinguish sequence errors or alignment errors from real indel variations [28]. General indel calling programs are classified into five major groups: alignment-based methods, split read mapping methods, paired end mapping methods, haplotype based methods, and machine learning-based approaches [25,28]. A list of indel calling programs with variant types that can be detected and the corresponding algorithms are shown in Table 1 [13,[15][16][17][18][19][20][22][23][24].
Alignment-based methods, including Dindel, GATK_ UG, SAMTools and Varscan, use information from the mapping step and identify indels with statistical models [25]. These alignment-based programs differ in the statistical models and processing details [18]. The indel sizes from these alignment-based programs are constrained by the length of sequence reads. Consequently the medium sized indels and large insertions are hard to detect since the workflow relies on the initial alignments [28]. Split read mapping methods, such as Pindel, rely on the discordant reads in the alignment step and can be used to annotate medium sized indels. These methods usually do not use statistical approaches to filter variants [16]. The haplotype-based methods, such as GATK_HC and Platypus, collect candidate haplotypes and identify the variants based on the realignment results on haplotypes [25]. Paired-end read mapping methods compare the real and expected distances between paired-end reads to identify potential indels. However the exact indel sequences are usually hard to annotate. They are considered more accurate for medium sized indels but not for small indels. Machine learning methods need training data to predict true indels [25,28]. Due to these issues or constraints, paired-end read mapping and machine learning-based methods are not included in this study.
Besides general indel calling programs, there are tools designed for detecting germline/somatic variants from cancer genomes. Almost all somatic indel calling programs can detect single nucleotide variants, some of them can also detect SVs [29]. Majority of these programs use tumor-normal paired sample data to identify somatic variants, while others can predict with only tumor samples [30]. For programs based on the tumornormal paired data, the general core algorithms include joint genotype analysis, allele frequency analysis, heuristic threshold, haplotype analysis, and machine learning  [30]. In this study, we selected Varscan2, GATK Mutect2, Strelka and Strelka2 for comparative somatic indel analysis based on their good performances reported by several groups [21,29,[31][32][33][34] (Table 1). In general, performance evaluations for somatic indel identification can be done with simulation data and/or real sequence data [31,33]. While the simulation data can help test different features such as variant allele fractions [33], comparison of indel annotation methods with real NGS data can provide useful guidance for their application in variant analysis in disease genomes. Even though currently there is no gold standard for evaluating somatic indel variants from cancer genomes, several existing databases can provide some useful information [31]. For instance, the annotated indels in GATK Resource Bundle and dbSNP can be used to check false positive cases and indels in COSMIC can be used to evaluate positive cases, respectively [7,8,31,35]. However, caution should be taken when using these databases for evaluation purpose as both databases contain only partial data. Accurate annotation of indels is of paramount importance in studying genetic variations and in identifying disease associated indels [36][37][38]. To test the consistency or differences among the general indel calling programs, Hasan et al. performed a comparative analysis by using the sequences of chromosome 11 from 78 samples of the 1000 Genomes Project and showed that 78-89% of the benchmark indels are not identified in a sample by any program and only a very small number of indels are identified by all seven programs [25]. However, the results do not accurately reflect the performance of each program as well as the common indels predicted from different programs. First, they compared the indels from individual genome samples to the pooled indel dataset of 79 genomes. Rare and low frequency variants account for a large proportion of indels and the pooled indel set includes all of them, but an individual sample may contain only a small subset of the pooled indel set [1,39,40]. Figure 1a shows a schematic example to explain the potential pitfalls of comparing individual samples with a pooled reference set from multiple samples. In this study we applied a pooled-sample based method for more accurate comparative analysis since indels from multiple samples from one program are pooled together to compare with the pooled benchmark indels (Fig. 1b). In addition, we expanded the comparison with the whole genome sequences instead of only one chromosome.
Unlike SNPs, indels are more complicated in that there are two different indel types, insertion and deletion. Moreover, for a coding indel, it can be a frame-shift (FS) or non-frame-shift (NFS) indel. Consequently, the way to compare the indels can affect the number of true positives and false positives. Previous studies used a position range of i ± 5 (where i is the indel position) to determine if an indel is the same one as that in the reference set [41]. However, this approach has several disadvantages. First, the indel types, insertion or deletion, are not considered separately. An insertion and a deletion at the same genome position are two different indels, not the same indel. Secondly, for coding indels, 1 bp difference in a position may result in a totally different protein sequence due to an open reading frame shift. In light of these issues, we adopted a modified approach by considering indel types (insertion or deletion) as well as positions, which is especially important in germline indel analysis.

Datasets
We used the same dataset as Hasan et al., which consists of 78 samples from the 1000 Genomes Project (https ://www.inter natio nalge nome.org/) covering five super populations (EUR, EAS, SAS, AMR, and ARF) and 26 sub-populations (three from each sub-populations) to evaluate general indel calling programs [25]. The benchmark is a set of small indels identified by Mills et al. [10]. For somatic indel program evaluation, we used a total of 30 tumor-normal paired data, including 10 colon cancer, 10 breast cancer, and 10 bladder cancer samples. The cancer genome sequencing data were downloaded from TCGA with dbGap ID phs000178.v11.p8. A total of 4970 indels from the latest version of COSMIC (v90) were downloaded for somatic indel evaluations [8].

Evaluation methods
For germline indels from healthy genomes, they are mainly genetic variants with the type and position of the indels presumably conserved in sub-populations or super populations. In other words, they are less random compared with somatic variants and usually do not lead to diseases. Therefore, when evaluating germline indels from healthy genomes, we only counted the indels that are located at the same positions with the same insertion or deletion sequences between the samples and the reference as positive identifications. Since somatic indels from cancer genomes are less conserved than the germline indels, we used the typical range of i ± 5 in positions along with the indel types, either insertion or deletion, for comparative evaluation.
Recall, precision and F measure are calculated for performance evaluations (Eqs. 1-3): where TP represents true positive, FP represents false positive, and FN represents false negative. As mentioned in the Background section, for germline indels, the TP, FP and FN are identified by a pooled sample-based method (Fig. 1b). For somatic indel evaluation, the predicted indels are compared with the annotated indels in the COSMIC database as potential somatic indels (the indel types are classified using the indel labels downloaded from COSMIC). To identify potential false somatic indels, we compared the predictions with the indel set from the GATK Resource Bundle, which is considered as a standard germline indel set for human reference GRCh38 [35].

General indel calling programs Overall analysis of the predicted indels
The number of true positive and false positive indels from healthy genomes by different programs is listed in Table 2. SAMTools calls the largest number of indels, with Platypus ranks the second. The number of the TP indels varies by programs. Dindel has the highest recall (0.78) but with a low precision (0.24). Varscan, which calls the least number of indels, has the highest precision (0.56) as well as the best F value (0.48). GATK_UG and GATK_HC have the second-best F value with relatively good recall and precision. Among all the programs, GATK_HC calls the longest indel with 616 bps. The length distribution is shown in Fig. 2a (percentages) and Additional file 1: Table S1 (counts) with the benchmark as a reference. SAMTools has the largest number of short indels for length between 1 and 20 bps, which is not surprising since it calls much more indels than any other programs ( Table 2 and Additional file 1: Table S1). Pindel predicts the largest number of indels longer than 50 bps, largely because Pindel uses an algorithm that tends to call longer indels. In terms of mid-length indels between 20 and 50 bps, GATK_HC has the largest number in each category. Percentage-wise, Platypus, Varscan, GATK_UG and SAMTools predict relatively more short indels compared to other three programs. We also compared the programs in terms of indel types, insertion and deletion ( Fig. 2b and Additional file 1: Table S2). SAMTools has a higher percentage of deletion types while GATK_UG has more insertion types in terms of the ratio when compared with the benchmark. Dindel has the most similar insertion/deletion ratio (56.2%/43.8%) to the benchmark (57.6%/42.4%) and it has the highest TP rate for both insertion and deletion types (Additional file 1: Table S2).
In coding regions, indels can be grouped into FS and NFS types. An NFS indel consists of a multiple of three base pairs, introducing an insertion or deletion of one or more amino acids while keeping the other part of the protein sequence unchanged. In contrast, an FS indel changes the reading frame starting from the site of insertion/deletion, which can produce different protein sequences from the indel position. FS indels can also lead to premature termination and the mRNA molecules can be subjected to a surveillance pathway called non-sensemediated mRNA decay (NMD) [42]. The proportion of NFS and FS coding indels from each program is shown in Fig. 2c and Additional file 1: Table S3. GATK_UG, Pindel and Varscan show similar FS/NFS ratios to that of the benchmark while Pindel, SAMTools, and Platypus have a much higher percentage of FS coding indels.

Pare-wise comparisons
To check the similarity or difference of indels predicted by two different programs, the overlapped indels from two programs are compared with the benchmark indels. The recall and precision values are presented in Table 3, showing a trade-off between recall and precision. When a program is paired with Varscan or Pindel, it usually achieves high precision with smaller number of FPs while having low recall at the same time since these are the two programs that call the lowest number of total indels. The indels from Varscan, GATK_UG and Dindel are highly similar. About 94% of indels from Varscan are also annotated by GATK_UG (898,482 out of 957,841) or Dindel (903,756 out of 957,841).

Combination of indels from different programs
The results from individual programs have shown that there are a large number of false positive indel predictions from the NGS data (Table 2). While false negatives may represent missed opportunities, false positives can result in wrong conclusions and are costly in real applications. We hypothesize that by selecting the consistent indel annotations from different programs, we may be able to remove majority of the false positives while retaining most of the true positives. The underlying idea is that in general, unlike false positives, true indels can be identified by different prediction algorithms. The ones that are program specific have a higher probability to be false positives. In a previous study, Hasan et al. showed that only a very small number of indels were called by all seven programs [25]. But as discussed in Background, that conclusion is a result from their approach by comparing the indels from individual samples to the pooled benchmark dataset, which may produce a large number of false negatives. We adopted a pooled sample method for a more   Figure 3 shows a schematic example to explain the differences by counting the overlaps or consistent indels between the two approaches. Among the seven indels called by both caller 1 and caller 2 with the pooled sample method, five of them are true positive indels. However, the single sample approach only identifies two true positives, resulting a very low TP rate from the overlapped indels (Fig. 3). Table 4 shows the averages of TP indels, FP indels, recall, precision and F values for all possible combinations including individual programs. The results from Hasan et al. show that only a small proportion of TP indels (1.51%) are called by all seven programs [25]. With our pooled sample approach, we found that 476,253 indels are called by all seven programs. Among these indels, 326,184 can be found in the reference set, representing a TP rate of 25.6%.
Among all the possible combinations, including the individual programs, a five tool combination of GATK_ UG, GATK_HC, Pindel, SAMTools and Dindel has the highest F value (0.53). Dindel has the highest recall (0.78, Table 2) and a combination of three tools (GATK UG, Pindel and SAMTools) has the highest precision (0.69). On average, a combination of 2 or 3 programs has the highest average F values (Table 4). Table 5 lists top three combinations of two and three programs ranked by F values. As shown in Tables 4 and 5, adding more programs can remove more false positives than true positives and a   combination of three programs seems to have a good balance of recall and precision. Figure 4 shows an example of indels called by 3 programs: GATK_UG, GATK_HC and Dindel. There are large overlaps among the TP indels either for all indels (Fig. 4a) or for coding indels only (Fig. 4b), while the disagreement among the FP indels are much bigger. Therefore, if a low number of false positives is preferred in an application, results from more programs can be used and combined.

Somatic indel calling programs
Unlike general indel calling approaches, the majority of somatic indel annotations need both normal and diseases genome samples and thus are more complicated. Different methods or algorithms have been developed for somatic indel identifications (Table 1). In this study, we applied four somatic indel calling programs to three types of cancers. As discussed in Background, there are no benchmark sets available to assess the true positive or false positives for cancer somatic genome indels. But for comparison purposes between programs and cancer types, we can use the COSMIC database with annotated somatic cancer indels and GATK Resource Bundle as potential false positives (or germline indels) to see how much they agree or differ with each other. Since the COSMIC indel set represents only a small portion of real cancer population indels, a small number of indels in COSMIC does not necessarily indicate a large number of false positives from a program. Similarly, an indel found in the germline indel set does not necessarily mean it is a true false positive since there is a single cancer sample vs. pooled germline samples problem. Nevertheless, the comparative analysis can provide some insights about these somatic indel calling programs and the similarity or differences among different cancer types. The number of potential true positive and false positive indels called by four programs are shown in Table 6. GATK Mutect2 calls the largest number of indels independent of cancer types and it has the largest overlap with the COSMIC indels and relatively low number of potential germline indels among the four programs. Strelka2 has the smallest numbers of indels for bladder and breast cancer types while Varscan2 calls the lowest number of indels in colon cancer. In terms of cancer types, colon cancer has more indels than the other two cancer types. The number of indels in bladder cancer is much less than the other two types. Taken together, GATK Mutect2 has a better coverage of somatic indels in all three cancer types with relatively low number of germline indels, or potential false positives. Strelka has the second largest number of total indels and COSMIC indels, however, the number of potential germline indels is also high.
As for the length distribution of the somatic indels, GATK Mutect2 calls the longest somatic indel (245 bps) in a cancer genome and identifies more longer indels ( Fig. 5a and Additional file 1: Table S4). It has 202 indels  longer than 50 bps. However, no other programs identify any indels of length 50 or more. The length distributions in terms of cancer types also vary. Even though colon cancer has the largest number of indels, breast cancer has more longer indels ( Fig. 5b and Additional file 1: Table S4). In healthy genomes, there are more germline deletions (57.75%) than germline insertions (42.25%) (Additional file 1: Table S5) while in cancer indel database COSMIC, the ratios are slightly different with 34.39% of insertions and 52.00% of deletions, with the remaining cases assigned as complex indels (13.61%) (Additional file 1: Table S5) [39]. Except for GATK Mutect2 in bladder and breast cancer genomes, all other programs detect relatively low number of insertions. It is not clear if cancer genomes have relatively fewer insertions or the programs have difficulty in identifying somatic insertions. As for coding indels, germline coding indels has slightly more NFS indels (51.63%) than the FS indels (48.37%) (Additional file 1: Table S6). It is not surprising that the number of short FS coding indels is smaller than expected (2 to 1 ratio if there is no selection) in healthy genomes, as FS indels are more deleterious than NFS indels, which are more likely to be removed from the population during evolution. FS indels found in healthy individuals generally are less deleterious and contribute to phenotypic diversity through different ways [39]. In COSMIC cancer indel database, FS indel is the dominant coding indel type (81.05%). Except for Strelka2 in balder cancer, all other programs predict more FS indels than NFS indels in all three cancer types. It should be pointed out that the total numbers of coding indels predicted by Varscan2 and Strelka2 are rather small (Additional file 1: Table S6).
When the somatic indels from different programs are compared, the number of similar indels from different programs or the overlapped indels are much smaller especially when more programs are considered ( Fig. 6 and Table 7). This is quite different from the germline indels by the general indel annotation programs especially the comparison criteria are not as stringent as those used for germline indel comparisons (Table 4), in which there are a large number of indels called by all the programs, especially for the true positive indels. Table 7 and Fig. 6 show that when all four programs are used, there are only 22, 36, 161 indels in the bladder, breast, and colon cancel samples respectively. These results suggest that the agreement among different programs is low and it might not be practical to use multiple programs in order to remove false positives in cancer samples as we showed in the germline indel cases since it also dramatically decrease the total number of indels as well as true positives.

Discussion
Accurate annotation of indels in both healthy and cancer genomes is important for downstream analysis in biological and medical applications. A number of programs have been developed for identifying indels from both healthy genomes for germline indels as well as cancer genomes for somatic indels with NGS data. Comparative analysis and evaluation can provide useful information about each program's performance. The best available benchmark for large-scale germline indels so far is the pooled sample indels [10]. One previous comparative study applied this pooled benchmark set and evaluated seven general indel calling programs using chromosome 11 of 78 samples. However, the comparison was carried out between a single sample and the pooled benchmark, which is problematic as shown in Figs. 1 and 3. It may also explain why the study finds little overlap when the indels from all seven tools are combined [25]. In this study we carried out an improved approach to assess the general indel calling programs using the whole genome NGS sequences instead of using one chromosome sequences. More importantly, we adopted a pooled sample vs. pooled benchmark comparison, which provides more accurate assessment of programs' performances. The new method greatly reduced the number of false negative cases by correctly recognizing the true positives (Figs. 1, 3). Last but not the least, we adopted a stringent indel comparison approach by considering the exact indel position as well as the indel types, which was not considered in previous studies. It should be noted that even though we applied a pooled sample approach, the comparison is not error free since the samples and the genomes in the benchmark set are different. There are some sample specific indels in both the test set and the benchmark set. Nevertheless, our approach makes the best use of the reference set and provides more accurate performance evaluations.
These general indel calling programs employ different prediction algorithms and predict different number of indels with different length and type distributions (Additional file 1: Tables S1-2). There is a tradeoff between the number of true positives and false positives. Some of them recognize a large number of true positive indels but at the same time output more false positive indels. We found that combing indels predicted from several different programs can achieve a good balance of TPs and FPs by removing a large number of false positives while keeping most of the true positives. The idea behind this is that if an indel is a true one, most programs are expected to find it no matter what algorithm is used. On the other hand, if an indel is a false one, it probably will only be predicted by one or a small number of programs. Our results show it is indeed the case and the best TP/FP balance is achieved with two or three different programs (Tables 4, 5).
In addition to germline indels, we also carried out program comparisons of somatic indel predictions using 30 cancer samples of three different types. Evaluating somatic indels is even more challenging because there is no benchmark that can be used for a systematic comparison and cancer indels are more random in terms of indel positions. Nevertheless, by using a common sample sets, we can evaluate the similarity/differences of indels from different somatic indel calling programs and among different cancer types. To get a sense of the potential number of true positive or false positive somatic indels, we compared the predicted indels with the cancer indels in COSMIC database (as potential true positives) and the germline indel set (as potential false positive somatic indels). While each program produces different number of indels with various ratios of indel types (Table 6 and  Additional file 1: Tables S4 and S5), there is a clear trend among different cancer types in general. Bladder cancer has the lowest number of predicted somatic indels and colon cancer has the largest number of predicted somatic indels (Tables 6 and Additional file 1: Table S5). Secondly, unlike the germline indels, the number of indels predicted by all programs is very small (Table 7), suggesting a low agreement among the programs even though the input sequences are the same. Thirdly, the programs identify a small number of insertions. This trend has also been reported by other case studies. For example, 2233 deletions and 544 insertions were identified from 21 breast cancer genomes by a modified Pindel program, and 680 deletions and 303 insertions were found from a skin cancer genome by Pindel, BWA and GROUPER  [16,[43][44][45]. In COSMIC database, there are also less insertions compared with deletions (Additional file 1: Table S5). On the other hand, Sathya et al. identified SNP and indel patterns from lung cancer genomes and found more insertions than deletions in both healthy genomes and lung cancer genomes using GATK-UG [46]. Whether the difference in the ratio of insertion and deletion in the cancer genomes is caused by the characteristics of the cancer genomes or by the algorithms used by the somatic variants calling programs remains to be further studied.

Conclusions
Our results show that a better balance between TP and FP can be achieved by combining results from a small number of programs for germline indel annotations. However, the low agreement among indel calling programs, especially for somatic indel identifications, calls for novel approaches for improving prediction accuracy with NGS data. In addition, the development of such approaches needs well-annotated indel reference sets.