Analytical validation of whole exome and whole genome sequencing for clinical applications

Background: Whole exome and whole genome sequencing (WES/WGS) are now routinely offered as clinical tests by a growing number of laboratories. As part of the test design process, each laboratory must determine the performance characteristics of the platform, test and informatics pipeline. This report documents one such characterization of WES/WGS.
Methods: Whole exome and whole genome sequencing were performed on multiple technical replicates of five reference samples using the Illumina HiSeq 2000/2500. The sequencing data were processed with a GATK-based genome analysis pipeline to evaluate intra-run, inter-run, inter-mode, inter-machine and inter-library consistency; concordance with orthogonal technologies (microarray, Sanger); and sensitivity and accuracy relative to known variant sets.
Results: Concordance with high-density microarrays consistently exceeds 97% (and typically exceeds 99%), and concordance between sequencing replicates also exceeds 97%, with no observable differences between flow cells, runs, machines or modes. Sensitivity relative to high-density microarray variants exceeds 95%. In a detailed study of a 129 kb region, sensitivity was lower, with some validated single-base insertions and deletions "not called". Different variants are "not called" in each replicate: of all variants identified in WES data from the NA12878 reference sample, 74% of indels and 89% of SNVs were called in all seven replicates; in NA12878 WGS data, 52% of indels and 88% of SNVs were called in all six replicates. Key sources of non-uniformity are variance in depth of coverage, artifactual variants resulting from repetitive regions and larger structural variants.

For WGS, where more data is available, VQSR is also used for insertions and deletions; for WES, fixed filters are applied to indels since fewer variants are available (as recommended in the GATK Best Practices).

Figure 1: Schematic of genome analysis pipeline
The specific metrics used in VQSR are summarized in Table 1. Following the GATK Best Practices, dbSNP, HapMap3.3, Mills/1000 Genomes indels and 1000 Genomes Illumina Omni2.5 SNP microarray datasets (all distributed as part of the GATK resource bundle) are used as the training data for VQSR. The products of VQSR are a VQSLOD score (the log odds ratio of being a true variant) for each variant and a set of VQSLOD score thresholds, or tranches, that aim to capture a specific proportion of true positives. Each successive tranche is less specific but more sensitive than the previous one, introducing additional true positive calls along with additional false positive calls. We set the PASSing threshold at 99.5%, i.e. we set the VQSLOD threshold to the value estimated by VQSR to capture that percentage of true positives. We have observed this threshold to offer a good compromise between precision and recall. In choosing a threshold below 100%, however, we accept a corresponding minimum false negative rate: at a 99.5% threshold, at least 0.5% of true variants are expected to be filtered.
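The relationship between a tranche target and a VQSLOD cutoff can be illustrated with a minimal sketch. It assumes the cutoff is simply the score that retains the requested fraction of variants at known truth sites; the function names and toy scores are hypothetical, and this is not the GATK implementation.

```python
import math


def vqslod_threshold(truth_scores, target_sensitivity):
    """Most stringent VQSLOD cutoff that still retains the target fraction
    of variants found at truth sites (assumed definition of a tranche)."""
    if not truth_scores:
        raise ValueError("need at least one truth-site score")
    if not 0.0 < target_sensitivity <= 1.0:
        raise ValueError("target_sensitivity must be in (0, 1]")
    ranked = sorted(truth_scores, reverse=True)
    # Number of truth sites that must pass. The small epsilon guards against
    # floating-point noise in products such as 0.995 * 200.
    n_keep = max(1, math.ceil(target_sensitivity * len(ranked) - 1e-9))
    return ranked[n_keep - 1]


def label_variants(scored_variants, threshold):
    """Tag each (variant_id, vqslod) pair as PASS or FILTERED."""
    return [
        (variant_id, "PASS" if score >= threshold else "FILTERED")
        for variant_id, score in scored_variants
    ]


# Toy truth-site scores, invented for illustration only.
toy_truth_scores = [float(i) for i in range(1, 11)]
toy_cutoff = vqslod_threshold(toy_truth_scores, 0.9)
toy_labels = label_variants([("var_a", 5.0), ("var_b", 1.5)], toy_cutoff)
```

Lowering the target sensitivity raises the cutoff, which is the precision and recall trade-off described above: fewer false positives pass, but a larger share of true variants is filtered.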
Genotype concordance (concordance), non-reference sensitivity (NRS), non-reference concordance (NRC) and precision are computed as described in the main text using a modified version of the GATK GenotypeConcordance walker. Site-level sensitivity and specificity are calculated using a modified version of the GATK VariantEval ValidationReport module.
Tables
Table 3 and Table 4 (at the end of the supplemental) show the different comparison types for each pair of WES and WGS technical replicates, respectively.
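The metric definitions themselves are given in the main text and are not reproduced here. As a rough illustration only, the sketch below assumes conventional definitions: concordance over sites called in both sets, NRS as the fraction of truth non-reference sites called non-reference, NRC as concordance after excluding sites where both sets are homozygous reference, and precision as the fraction of non-reference calls confirmed by the truth set. The genotype encoding and function names are hypothetical, and the modified GATK walker may differ in detail.

```python
def _ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def concordance_metrics(pairs):
    """Compute metrics from (eval_gt, truth_gt) pairs.

    Genotypes are alternate-allele counts: 0 hom-ref, 1 het, 2 hom-var,
    and None for a no-call.
    """
    both_called = [(e, t) for e, t in pairs if e is not None and t is not None]
    matches = [(e, t) for e, t in both_called if e == t]
    # Drop sites where both call sets are homozygous reference.
    non_ref_sites = [(e, t) for e, t in both_called if not (e == 0 and t == 0)]
    non_ref_matches = [(e, t) for e, t in non_ref_sites if e == t]
    # Truth non-reference sites, including those the evaluated set did not call.
    truth_non_ref = [(e, t) for e, t in pairs if t is not None and t > 0]
    detected = [(e, t) for e, t in truth_non_ref if e is not None and e > 0]
    # Non-reference calls in the evaluated set where the truth is known.
    eval_non_ref = [(e, t) for e, t in both_called if e > 0]
    confirmed = [(e, t) for e, t in eval_non_ref if t > 0]
    return {
        "concordance": _ratio(len(matches), len(both_called)),
        "nrs": _ratio(len(detected), len(truth_non_ref)),
        "nrc": _ratio(len(non_ref_matches), len(non_ref_sites)),
        "precision": _ratio(len(confirmed), len(eval_non_ref)),
    }


# Toy genotype pairs, invented for illustration only.
toy_pairs = [(0, 0), (1, 1), (2, 2), (1, 2), (0, 1), (None, 1), (1, 0), (0, 0)]
toy_metrics = concordance_metrics(toy_pairs)
```

Because homozygous-reference matches dominate most call sets, the overall concordance is usually much higher than NRC, which is why the two are reported separately.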

Comparison to Sanger Results in 129kb Targeted ASD Panel
To quantitatively assess the contribution of the different kinds of inter-replicate comparisons to the concordance, we performed a multiple linear regression analysis using the lm function in the R statistical package, regressing concordance on the five comparison kinds as binary variables, with the sample ID as a binary control variable, for all NA12878 and NA18507 replicate pairs. The specific model is (in R syntax):
    concordance ~ Sample + intra.run + inter.run + inter.machine + inter.mode + inter.library
We assumed that the behavior of concordance is linear across the narrow range of values observed in our data: WES (0.970-0.989) and WGS (0.989-0.990). The null model contains only the sample term, i.e. the null hypothesis is that concordance varies with the sample but not with any comparison type. The full model did not differ significantly from the null model for WES (F-statistic 0.72, p-value = 0.61), as assessed using an F-test via the anova function in the R statistical package. The full model did differ significantly from the null model for WGS (F-statistic 4.69, p-value = 0.016) at a threshold of 0.05.
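The nested-model comparison described above was run with lm and anova in R. To show what the F-statistic measures, here is a minimal standard-library Python sketch that fits a null model (intercept plus sample) and a model with one added comparison indicator by ordinary least squares. The data are toy values invented for illustration, the helper names are hypothetical, and the sketch stops at the F-statistic because the p-value requires the F distribution, which the Python standard library does not provide.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def residual_sum_of_squares(rows, y):
    """RSS of the least-squares fit; each row already includes the intercept."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve_linear(xtx, xty)
    return sum(
        (yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y)
    )


def nested_f_statistic(null_rows, full_rows, y):
    """F-statistic for comparing a null model against a larger nested model."""
    n = len(y)
    p0, p1 = len(null_rows[0]), len(full_rows[0])
    rss0 = residual_sum_of_squares(null_rows, y)
    rss1 = residual_sum_of_squares(full_rows, y)
    return ((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1))


# Toy data: (sample indicator, comparison indicator, concordance). Invented values.
toy = [
    (0, 0, 0.981), (0, 0, 0.983), (0, 1, 0.985), (0, 1, 0.987),
    (1, 0, 0.983), (1, 0, 0.985), (1, 1, 0.987), (1, 1, 0.989),
]
y = [c for _, _, c in toy]
null_rows = [[1.0, float(s)] for s, _, _ in toy]            # concordance ~ Sample
full_rows = [[1.0, float(s), float(x)] for s, x, _ in toy]  # ... + one indicator
f_stat = nested_f_statistic(null_rows, full_rows, y)
```

The same function handles the full model by adding one column per comparison indicator to `full_rows`; the numerator degrees of freedom then equal the number of added columns.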
Based on the significant difference for WGS, we further analyzed the WGS component parameters individually. We built four models of the form: concordance ~ Sample + <parameter> for each of intra.run, inter.run, inter.mode, and inter.library (note that there is no inter.machine comparison for WGS). Using the same null hypothesis, we evaluated the significance of each individual model using an F-test. The results are summarized in Table 2. Only the inter.library model differed significantly at a threshold of 0.0125 (0.05 Bonferroni corrected for the 4 tests).
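The corrected threshold follows from dividing the family-wise threshold by the number of tests, 0.05 / 4 = 0.0125. A minimal sketch of that rule is shown below; the model labels and p-values are placeholders invented for illustration and are not the values reported in Table 2.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold and a significance flag for each test."""
    threshold = alpha / len(p_values)
    flags = {name: p < threshold for name, p in p_values.items()}
    return threshold, flags


# Placeholder p-values for four hypothetical single-parameter models.
placeholder_p_values = {
    "model_1": 0.40,
    "model_2": 0.20,
    "model_3": 0.03,   # below 0.05 but not below the corrected threshold
    "model_4": 0.004,
}
corrected_threshold, significant = bonferroni(placeholder_p_values)
```

In this toy input, `model_3` would pass an uncorrected 0.05 threshold but fails the corrected one, which is the situation the correction is meant to guard against when several models are tested.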