Reference DNA and clinical samples for performance evaluation
The wet-lab protocol of ECNano was tested using the standard HG001 and HG002 DNA samples, which are commercially available, ensuring that the performance assessment of the wet-lab protocol was unrelated to the quality of input DNA. In addition, these two datasets are among the best-annotated human references, which allows precise variant calling evaluation of Clair-ensemble using ECNano data. To assess the reproducibility of ECNano workflow in actual clinical practice, ECNano was tested on three in-house clinical samples that require detection of different variant types and genotypes. The patients were diagnosed, and disease-causing pathogenic variants of the samples were previously identified using the next-generation sequencing approach. In all these six sequencing runs using the same ECNano protocol, we observed similar performance in terms of read length distribution, coverage profile at target positions, total throughput, and the precision and recall of small variant calls, which are the primary measurements of the protocol. Thus, we confirmed that our workflow, which is sensitive to pathogenic variant detection using clinical specimens, is robust and reproducible.
Stable, high-quality performance in long-read target capturing for MinION sequencing
Among the various target enrichment methods, using RNA or DNA biotinylated probes for hybridization-based target capture is stably applied with short-read sequencing for genetic screening in clinical genetics laboratories. The method is standardized and robust, and therefore highly reliable. ECNano therefore integrates hybridization-based medical exome capture with MinION sequencing, which can be easily incorporated into existing clinical practice as a stable enhancement of the genetic screening workflow. To ensure the workflow was very steady, we tested the balance between the capture efficiency of Agilent SureSelect Focused Exome probes and fragment length. The workflow was optimized to capture input DNA fragments of approximately 1000 bp. After MinION sequencing, the N50 of the captured reads was 1378 bp, and the average read length was 968 bp after minimum quality filtering and adapter trimming (Fig. 2).
Per-base accuracy is often constrained by sequencing technology. Low-quality reads with a mean Phred score below 3 were removed during base-calling. The average read Phred score in the tested sequencing library was Q11, and the best was Q24. Over 70.2% of the reads achieved Q10. Although the highly accurate base-calling model of base-caller Guppy was used, the error rate of ONT reads was still high, so conducting accurate variant calling is still challenging if the DoC is not high enough. The DoC per position deviated from the sequencing throughput. Depending on the quality and number of active pores available in the flowcell used, each sequencing run is expected to yield minimum 10Gbp throughput. Even with a sequencing run of only 10 Gbp throughput, ECNano achieved DoC at target positions of 133x, on average. About 98% of the target positions listed in the panel were covered by at least 30 reads, about 87% of the positions had 60× DoC, and about 55% of the positions had 100× DoC (Fig. 3). With long ONT reads, over 96% of reads could be uniquely mapped to the reference genome for variant calling. Out of the 72,417 target regions within the MES bed, on average, only 17 regions did not have any read covered. Most of these regions had over 80% GC content, so they might be difficult to capture.
The margins of the target exon were uniformly covered by long reads of ECNano. The uniformity of the sequencing data was over 98% in the target regions. For adjacent exons with a short intron between them, our data showed a broader covered region, and the intron positions were also covered (Fig. 4). This allows ECNano data to have high discovery potential for novel pathogenic variant detection in a broader range around these medically relevant positions. Although there were still considerable deviations in the DoC in the captured regions owing to variation in capture efficiency and PCR bias, in-depth normalization was well performed by Clair-ensemble to achieve precise variant calling.
Sufficient DoC is guaranteed at known pathogenic variant positions and nearby splice sites for accurate variant calling
Known pathogenic variant positions and nearby splice sites are most relevant to medical diagnosis. Since most of the recorded pathogenic missense variants were found in protein-coding sequences [23, 24], screening variants in exome regions is sufficient for predicting a large proportion of genetic diseases. It is essential to maintain sufficient DoC at these positions to guarantee that the variant detection is highly sensitive. To evaluate the performance of target enrichment at the known pathogenic variant positions, we examined 23,866 positions in the targeted region of ECNano, which are listed in the ClinVar database as pathogenic using one of the HG001 sequencing runs. Over 98.9% of these positions achieved above 30× DoC, which allows highly confident variant calling at these critical positions.
Since the average size of an exon is 164 bp [19, 20], a library with a read length of over 1000 bp can cover the entire exome region, as well as potential variants close to the splice sites. In total, 2,114,539 positions of potential splicing donors and receptors adjacent to the target region were extracted from the intropolis database [25]. The referenced study annotated potential donor and acceptor splice sites on the reference genome using 21,000 NGS transcriptome data. Over 90% of these positions (± 10 bp included) had at least 60 × DoC, and 98.5% had at least 30 × DoC. This coverage allows accurate identification of canonical pathogenic splice sites and their nearby variants, which are known to cause some monogenic disease and developmental disorders [26, 27].
Accurate variant calling and performance enhancement with Clair-ensemble in target enrichment datasets
To ensure consistent and fair evaluation of variant-calling performance, Clair-ensemble was evaluated using the two sets of standard HG001 and one set of HG002 ECNano target-enrichment data against the known variants in the GIAB truth set (Table in Additional file 1). Following the best practice in Luo et al. [16], positions in high confidence regions defined by GIAB were used for benchmarking. In total, 14,976,390 positions were included in the benchmarking, which was approximately 84% of the Agilent BED positions. We benchmarked the performance of Clair-ensemble against the original Clair, LongShot, and Medaka. They are listed by precision and efficiency in Table 1. In addition to different variant callers, another downsampling method using VariantBam before variant calling was tested.
All the evaluated tools, including the original Clair, LongShot and Medaka, are capable of SNP detection. With our high DoC ECNano data, the SNP calling accuracy of these tools does not significantly deviate from each other. However, the sensitivity is inferior with Medaka, in addition to its lengthy calling time, possibly because of the use of consensus calling. The performance of Indel prediction with Medaka was also dissatisfactory. The performance of LongShot and the original Clair with low-DoC WGS models was similar in SNP calling. However, LongShot does not provide a function for Indel detection. Although we still included LongShot in our benchmarking, it is not recommended in the ECNano workflow for clinical context since it is functionally incomplete. Among all the benchmarked tools, Clair-ensemble achieved the best precision and sensitivity, achieving an F1-score in the overall calling results of over 98%. Both the built-in downsample option and VariantBam were tested in Clair-ensemble. The calling results were similar with both resampling methods, with the built-in resampling method having slightly higher precision and overall performance. However, the use of the built-in method is preferred since no extra time or disk storage is required. It is expected that with runs of even higher throughput, the built-in resampling method will be much more effective than using VariantBam because of the increase in DoC variation with regions. The performance of the ECNano bioinformatics workflow was significantly better than the original Clair, especially for INDEL calling (Fig. 5; Additional file 1: Table S1). Accurate INDEL prediction with ONT data is difficult owing to random base shifts during base-calling. Although the sensitivity remained at a similar level, the ensemble method improved the precision in INDEL calling, which is preferred in clinical applications.
The performance of neural networks is biased towards the properties of its training dataset. These variances among callers or models are reduced with the ensemble method, achieving a more stable performance across datasets. Including multiple outputs for an ensemble, however, is time-consuming if the caller runtime is long. In a comparison of the calling speed, a single Medaka run on ECNano data took days, while the original Clair required only hours. The short processing time of Clair allows an ensemble of multiple calling results without requiring too much time.
Another important feature of Clair-ensemble is the per-positional resampling as the pre-processing function alongside the ensemble. The resampling function is implemented in a Clair pre-processing step, which ensures the calling DoC does not exceed the maximal allowed depth in the ultra-high DoC region, while no resampling is performed in regions with optimal or low DoC. Compared with the use of other global downsampling methods, such as VariantBam and Samtools, this is particularly effective in avoiding over-downsampling in low DoC regions. A huge variation in DoC is commonly observed in other target enrichment data and is also highly applicable for processing with Clair-ensemble. The resampling process in Clair-ensemble also preferentially retains higher quality bases for variant calling, and therefore further improves the precision of variant calling.
Practical application of ECNano on real patient samples
Since the whole wet-lab protocol optimization and variant-calling performance evaluation was completed with standard DNA samples, we also applied the complete workflow using three patient DNA samples to ensure that ECNano is practical in actual clinical settings. All samples were first sequenced using Illumina NGS whole-exome sequencing, and the pathogenic variant was known. The three patient samples involved different types of variants (shown in Fig. 6), including (1) a 10-base insertion in BCAP31; (2) a homozygous C > T SNP in SLURP1; and (3) two heterozygous C > T and T > C SNPs in UROC1. Using the standardized ECNano workflow, we obtained over 10 Gbp of base-called throughput with a single MinION flowcell and identified the target variants unambiguously. One of the three tested samples had a 10-based duplication prediction. Since the precision and sensitivity of INDEL-calling with ONT data is less promising than SNP-calling, the variant could still be called with high confidence. These test set results confirmed the robust discovery power of ECNano in actual clinical use.
Comparison to medical exome sequencing using the NGS approach
To evaluate the robustness of ECNano relative to other existing sequencing, we compared the reads coverage of ECNano with a published illumina NGS dataset generated using the same target capture panel [28]. While both ECNano and NGS generated less than 1% of undetected target regions (i.e. 0.35% of the target genes or regions in NGS and 0.56% in ECNano dataset with less than 20 reads covered), the DoC distribution of ECNano is of much higher evenness compared to that in NGS (Additional file 3a). We therefore concluded ECNano is able to achieve the same if not even better depth distribution for variant calling as in NGS with the same target enrichment method.