Evaluation of in silico pathogenicity prediction tools for the classification of small in-frame indels

Background The use of in silico pathogenicity predictions as evidence when interpreting genetic variants is widely accepted as part of standard variant classification guidelines. Although numerous algorithms have been developed and evaluated for classifying missense variants, in-frame insertions/deletions (indels) have been much less well studied. Methods We created a dataset of 3964 small (< 100 bp) indels predicted to result in in-frame amino acid insertions or deletions using data from gnomAD v3.1 (minor allele frequency of 1–5%), ClinVar and the Deciphering Developmental Disorders (DDD) study. We used this dataset to evaluate the performance of nine pathogenicity predictor tools: CADD, CAPICE, FATHMM-indel, MutPred-Indel, MutationTaster2021, PROVEAN, SIFT-indel, VEST-indel and VVP. Results Our dataset consisted of 2224 benign/likely benign and 1740 pathogenic/likely pathogenic variants from gnomAD (n = 809), ClinVar (n = 2882) and DDD (n = 273). We were able to generate scores across all tools for 91% of the variants, with areas under the ROC curve (AUC) of 0.81–0.96 based on the published recommended thresholds. To avoid biases caused by inclusion of our dataset in the tools' training data, we also evaluated only those DDD variants not present in either gnomAD or ClinVar (70 pathogenic and 81 benign). Using this subset, the AUC of all tools decreased substantially, to 0.64–0.87. Several of the tools performed similarly; however, VEST-indel had the highest AUCs of 0.93 (full dataset) and 0.87 (DDD subset). Conclusions Algorithms designed for predicting the pathogenicity of in-frame indels perform well enough to aid clinical variant classification in a similar manner to missense prediction tools. Supplementary Information The online version contains supplementary material available at 10.1186/s12920-023-01454-6.


Background
Next generation DNA sequencing (NGS) is transforming healthcare by facilitating novel understanding of disease and uptake of precision medicine initiatives [1,2]. Genetic variation is widespread, with every individual carrying > 200 very rare coding variants [3], so molecular diagnosis of monogenic disorders requires expert clinical and scientific interpretation of variants detected by NGS. Classifying the pathogenicity of candidate causal variants is essential for robust diagnosis and management of genetic disorders. To this end, numerous in silico pathogenicity prediction algorithms have been developed and are widely used as evidence when interpreting genetic variants. The use of pathogenicity predictors is supported by current guidelines from the American College of Medical Genetics and Genomics (ACMG) and Association for Molecular Pathology (AMP) [4] and, more recently, the UK Association for Clinical Genomic Science (ACGS) [5], through the PP3/BP4 criteria. Pathogenicity prediction algorithms incorporate various lines of evidence to predict the impact of variation on protein function, including evolutionary inter-species sequence conservation [6], physico-chemical distances between amino acids [7] and integrated tests for identifying regulatory features [8]. Some also incorporate human variation and disease data [9] by querying gene- or variant-level understanding from open source [10] or proprietary [11] databases. These aggregated data are used to generate statistical prediction models, such as supervised machine learning classifiers [12], which produce a score used to assign pathogenicity status to a given variant.
Most pathogenicity predictors have been developed to predict the effect of missense substitutions [13,14], which are primarily caused by single nucleotide variants (SNVs) in the protein-coding regions of the genome. However, small insertions and deletions (indels) account for between 13 and 18% of all variation in the human genome [15,16], both within and outside protein-coding regions, and have been linked to numerous rare heritable diseases [17] as well as cancerous somatic mutations [18]. Approximately 40% of coding indels are in-frame [19], defined as having a nucleotide length (n) wholly divisible by three, resulting in the removal or addition of n/3 amino acids. Unlike frame-shifting indels, which are generally assumed to cause loss-of-function, the insertion or deletion of a small number of amino acids is likely to have a deleterious effect on a protein similar to that of substituting one amino acid for another. Indeed, missense variants and in-frame indels are frequently grouped together as "protein altering variants" and overall assumed to have "moderate" impact [20].
Numerous small in-frame indels have been shown to cause monogenic disease, most famously p.Phe508del in CFTR [21]. However, in general, the classification of in-frame indels has been much less well studied than that of missense and loss-of-function variants. To this end, we created a novel dataset of previously classified in-frame indels, constructed from three databases, two open source (gnomAD [22] and ClinVar [10]) and one managed access (the Deciphering Developmental Disorders (DDD) study [23]), and used this dataset to evaluate the performance of nine in silico prediction algorithms. We show that although the accuracy of pathogenicity classifiers varies across tools, overall their performance is comparable to that of tools designed for missense variants.

Benchmark dataset generation
Variants were retrieved from gnomAD (v3.1.1) [22], ClinVar [10], and the DDD study deposited in DECIPHER [23], all accessed 18 March 2021, before filtering for suitability for this study (Fig. 1). Briefly, variants in genome build GRCh38 were included if their length was evenly divisible by 3 and < 100 base-pairs. Assumed benign variants with a minor allele frequency of 1-5% were retained from the gnomAD population database, while variants classified as likely pathogenic (LP), pathogenic (P), benign (B) or likely benign (LB) were retained from the two clinical datasets. Variants identical in more than one database were retained from only one, using the preferential order of DDD, ClinVar then gnomAD, and variants with conflicting annotations between databases were removed. The resulting variants were annotated by the Ensembl Variant Effect Predictor (VEP) [20]. Those annotated as "inframe_insertion" or "inframe_deletion" with biotype "protein coding" and a single protein consequence per variant were selected (n = 3964; Table 1, Additional file 2: Table S1). A subset of potentially novel variants from the DDD study, which were not present in either ClinVar or gnomAD (n = 151), was used as an additional test set because these variants are unlikely to have been previously encountered by the tools.
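The core inclusion criteria above (length change divisible by three, under 100 bp, and a 1-5% MAF window for assumed-benign gnomAD variants) can be sketched in code. This is a minimal illustration only; the function names and the use of REF/ALT allele strings to derive indel length are assumptions, not the study's actual filtering pipeline:

```python
# Illustrative sketch of the variant inclusion criteria described above.
# Deriving indel length from REF/ALT allele strings is an assumption made
# for this example, not the actual pipeline used in the study.

def indel_length(ref: str, alt: str) -> int:
    """Net change in sequence length (base pairs) caused by the variant."""
    return abs(len(alt) - len(ref))

def is_small_inframe_indel(ref: str, alt: str) -> bool:
    """In-frame: length change > 0 and evenly divisible by 3; small: < 100 bp."""
    n = indel_length(ref, alt)
    return 0 < n < 100 and n % 3 == 0

def keep_assumed_benign_gnomad(ref: str, alt: str, maf: float) -> bool:
    """gnomAD variants were assumed benign if their MAF fell within 1-5%."""
    return is_small_inframe_indel(ref, alt) and 0.01 <= maf <= 0.05
```

For example, a 3 bp deletion passes the in-frame test (one amino acid removed), whereas a 2 bp deletion is frameshifting and is excluded.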

Tool selection and benchmarking
For inclusion in this study, pathogenicity prediction tools were identified from the literature and had to be either (i) accessible through a webserver or (ii) downloadable for use on a local server. We evaluated the performance of nine pathogenicity prediction tools, using their default classification threshold criteria: CADD [24], CAPICE [25], PROVEAN [26], FATHMM-indel [27], MutationTaster2021 [28], MutPred-Indel [29], SIFT-indel [12], VEST-indel [30], and VVP [31] (Table 2). Standard performance metrics (sensitivity, specificity, positive and negative predictive values) and the Matthews Correlation Coefficient (MCC) [32] were calculated for all tools. Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) were determined for all tools apart from SIFT-indel and MutationTaster2021, which produce only binary classifications. All of the above analyses were repeated using the DDD-only subset. We also considered the effect of indel length on the ability of software to classify variants by grouping variants into four bins of amino acid length (1, 2-4, 5-10 and 11+).
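The metrics named above all derive from a 2x2 confusion matrix. The sketch below is illustrative only; it assumes pathogenic is treated as the positive class (as in the study), but it is not the authors' actual evaluation code:

```python
import math

def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Benchmark metrics from a confusion matrix (pathogenic = positive class)."""
    sens = tp / (tp + fn)   # sensitivity (true positive rate)
    spec = tn / (tn + fp)   # specificity (true negative rate)
    ppv = tp / (tp + fp)    # positive predictive value
    npv = tn / (tn + fn)    # negative predictive value
    # Matthews correlation coefficient; defined here as 0 when a margin is empty
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Likelihood ratios, used to gauge ACMG/ACGS evidence strength (Table 3)
    lr_pos = sens / (1 - spec) if spec < 1 else float("inf")
    lr_neg = (1 - sens) / spec if spec > 0 else float("inf")
    return {"sensitivity": sens, "specificity": spec, "ppv": ppv,
            "npv": npv, "mcc": mcc, "lr+": lr_pos, "lr-": lr_neg}
```

MCC is used alongside sensitivity and specificity because it remains informative when the pathogenic/benign classes are imbalanced, as in the DDD-only subset.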

Benchmark datasets contained a good balance of pathogenic and benign insertions and deletions
Our dataset consisted of 3964 small in-frame indels from 1820 genes, including 1246 insertions and 2718 deletions from gnomAD (n = 809), ClinVar (n = 2882) and DDD (n = 273) (Fig. 1). Of these, 2224 were B/LB and 1740 were P/LP, ranging in size from 1-48 amino acids for insertions and 1-66 amino acids for deletions (Fig. 2). The longest pathogenic and benign deletions were 32 and 66 residues, and the longest pathogenic and benign insertions were 28 and 48 residues, respectively. Variants were distributed across the 1820 protein-coding genes (variants per gene: mean = 2.18, SD = 3.64, min = 1, max = 66). The proportion of benign/pathogenic variants varied across genes linked with monogenic disease: some genes had almost exclusively benign variants in our dataset, while others, where rare deleterious variants are known to cause developmental disorders, contained almost exclusively pathogenic variants.
Performance was generally high across all tools using our full dataset, but some tools performed substantially worse using a smaller, novel variant dataset
For the full dataset, 3615-3963 (91-99%) of variants were classified by each tool and 3522 (89%) were classified by every tool. Of the latter, 556 (15.8%) were universally categorised correctly by all nine tools as pathogenic (n = 179, 5.1%) or benign (n = 377, 10.7%). Sensitivity and specificity ranged from 0.30 to 0.99 and from 0.61 to 0.97, respectively (Table 3, Fig. 3A). For the smaller DDD-only novel dataset, 143-151 (95-100%) variants were classified by each tool and 141 (93%) were classified by every tool. Of these, 14 (9.9%) were universally categorised correctly by all nine tools as pathogenic (n = 8, 5.7%) or benign (n = 6, 4.2%). Sensitivity ranged from 0.24 to 0.97 and specificity ranged from 0.14 to 0.80 (Table 3, Fig. 3B). Sensitivity decreased for most tools between the full dataset and the DDD-only subset, apart from FATHMM-indel, which remained the same (0.94), and CADD and SIFT-indel, which increased from 0.49 to 0.64 and from 0.82 to 0.86, respectively; MutationTaster2021 showed the largest decrease in sensitivity, from 0.98 to 0.72. Specificity decreased for all tools between the two datasets, with CADD and SIFT-indel decreasing the least (from 0.92 to 0.80 and from 0.61 to 0.51, respectively) and VVP decreasing the most (from 0.67 to 0.14). These observations were recapitulated in the MCC metric, where VVP and MutationTaster2021 decreased the most (by 0.48 and 0.44) and CADD and SIFT-indel decreased the least (by 0.02 and 0.04, respectively). PROVEAN, VEST-indel and FATHMM-indel showed similar performance in the DDD-only subset, with MCC metrics of 0.51, 0.53 and 0.51, respectively.

Tool performance was generally independent of indel length
We investigated the tools' performance for insertions and deletions separately, and whether their performance was influenced by indel length (grouped into bins of 1, 2-4, 5-10 and 11+ amino acids inserted/deleted). We observed very little difference in performance between groups of variants (Additional file 1: Figs. S2 and S3 show this in more detail), despite an increase in the proportion of benign variants with increasing indel length.
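The length stratification above can be expressed as a simple binning function; this is an illustrative sketch, with bin labels taken directly from the text:

```python
def length_bin(aa_len: int) -> str:
    """Map the number of amino acids inserted/deleted to one of the four
    bins used when stratifying tool performance by indel length."""
    if aa_len <= 1:
        return "1"
    if aa_len <= 4:
        return "2-4"
    if aa_len <= 10:
        return "5-10"
    return "11+"
```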

Table 3 Performance metrics for all indel pathogenicity prediction tools tested
Findings from the entire dataset are included in the top table, and just the novel (DDD-only) subset in the bottom table. Relative strength of likelihood ratios for application of 'strong' or 'moderate' evidence under the ACMG/ACGS variant classification criteria [33] is denoted as: $ high relative strength, + medium relative strength. Additional file 1: Fig. S1 shows ROC-AUC curves.

Discussion
We tested the performance of nine pathogenicity prediction tools on a dataset of 3964 in-frame indels and a smaller subset of 151 novel, clinically classified indels that are not readily accessible from public databases. We show that the performance of these tools is generally good across a range of indel lengths, with AUCs of 0.81-0.93. As expected, most tools performed less well in the smaller novel subset, with AUCs of 0.64-0.87, which likely reflects the use of publicly accessible datasets in the tools' classification method or training data.
Of the nine tools tested, MutationTaster2021 had the highest sensitivity and specificity when tested on all variants, but also showed the greatest decrease in sensitivity when tested on the DDD-only dataset. Since gnomAD variants were used as benign training cases and ClinVar and HGMD [11] as pathogenic training cases [28], this may reflect some overfitting [34] and potentially suggests lower performance for previously unobserved variants. FATHMM-indel, CAPICE, VEST-indel and PROVEAN performed comparably well, although PROVEAN and VEST-indel classified fewer variants than CAPICE and FATHMM-indel. It should be noted that some tools (e.g. CADD, CAPICE, MutationTaster2021, PROVEAN, VVP) were not designed specifically for use with in-frame indels and were trained primarily on SNVs, whilst other tools (e.g. VEST-indel, FATHMM-indel, MutPred-Indel, SIFT-indel) were optimised particularly for the classification of indels. We have previously demonstrated that standard pathogenicity predictors such as SIFT and PolyPhen-2 classified missense variants with AUCs of 0.85-0.87 for a publicly accessible "open" dataset and 0.70-0.72 for a restricted access "clinical" dataset [34], which is comparable to the performance of the indel pathogenicity predictors tested here. However, the newer meta-predictors REVEL [35] and ClinPred [36] produced AUCs of 0.97-0.99 and 0.81-0.82 for open and clinical datasets of missense variants [34], respectively, outperforming all the indel pathogenicity prediction tools tested here. Nonetheless, similar to many missense pathogenicity predictors, the likelihood ratios calculated for in-frame indel predictors using our dataset (Table 3) support their use at either 'supporting' or 'moderate' strength towards the PP3 and BP4 criteria of the ACMG/ACGS recommendations, although none of the tools reach the moderate threshold in the DDD subset [33,37].
We found that the pathogenicity predictor tools varied substantially in their input requirements and ease of use. For example, seven of the tools tested require variants to be uploaded in VCF format as input, and five of these also offer a downloadable command line interface (Table 2). Tools with these two features are typically well suited for integration into analysis pipelines; however, ease of installation, additional required software and metadata dependencies varied. For example, MutPred-Indel contained all the necessary metadata to make pathogenicity predictions, but required variants to be input in FASTA format, which is not routinely used in a clinical genetic testing setting, as well as installation of a specific version of MATLAB. Similarly, PROVEAN offered a command line option but also required local installations of NCBI-BLAST and the NCBI nr protein database. The requirement for advanced bioinformatics skills to operate a tool will adversely affect its utility, particularly for routine diagnostic use. In contrast, CAPICE used an Ensembl-VEP annotated TSV file as input, and we found it the easiest to install and quickest to use.
Like other comparable studies, we were limited by several factors. Firstly, the veracity of the variant classifications taken from ClinVar, the DDD study and gnomAD is uncertain, and our benchmark dataset may include some erroneous variant classifications. We tried to minimise this issue by incorporating data from three different databases and by using minor allele frequency thresholds for benign variants. However, the low number of variants in the DDD subset (n = 151) limits the comparison of tool performance metrics, and a larger dataset would provide a more accurate assessment. Secondly, unlike missense variants caused by SNVs, in-frame indels are comparatively rare and are harder to detect robustly using NGS, and thus our dataset is relatively small. Evaluation of tool performance versus indel length was further limited by the inverse correlation between variant frequency and length in the dataset, which reduces the interpretability of tool performance for larger indels. Although large (> 100 base-pair) in-frame indels exist, and may be either benign or pathogenic, these are difficult to detect using short-read NGS technologies, so they were largely absent from the databases used here and were excluded from our dataset. Finally, not all variants in our dataset were in genes linked with monogenic disease, particularly those from gnomAD, which potentially introduces a bias for tools that use gene-level data for classification. However, around 75% of genes present in our dataset contained variants from at least two of the databases, and a sensitivity analysis using only these variants produced similar results (data not shown).

Conclusions
We have shown that numerous in silico pathogenicity prediction tools perform well for in-frame indels using a benchmark dataset. We therefore suggest that genomic diagnostic laboratories should consider incorporating these tools, in the same manner as missense prediction tools, to aid variant classification. Our findings are consistent with previous studies [25,27,30] and, to the best of our knowledge, represent the largest independent assessment to date of pathogenicity predictors for in-frame indels.

Fig. 1
Fig. 1 Flowchart of dataset construction. We included in-frame indels from ClinVar, gnomAD and the DDD study (deposited in DECIPHER). SNV Single nucleotide variant, MAF Minor allele frequency

Fig. 3
Fig. 3 Performance of pathogenicity prediction tools for in-frame indels. Sensitivity (top) and specificity (bottom) of nine pathogenicity prediction tools based on classification of 3964 in-frame indels from the ClinVar, DDD and gnomAD databases (blue), as well as a DDD-only subset of 151 variants (red)

Table 1
Number of variants from each database included in our benchmark dataset

Table 2
Pathogenicity predictors and default or recommended classification thresholds used in this study. VCF Variant call format, VEP Variant effect predictor, TSV Tab separated values, CSV Comma separated values. TP True positive, FP False positive, TN True negative, FN False negative, LR+ Positive likelihood ratio, LR− Negative likelihood ratio, PPV Positive predictive value, NPV Negative predictive value, AUC Area under the curve, MCC Matthews correlation coefficient