Sequential filtering for clinically relevant variants as a method for clinical interpretation of whole exome sequencing findings in glioma

Background In the clinical setting, workflows for analyzing individual genomics data should be both comprehensive and convenient for clinical interpretation. In an effort for comprehensiveness and practicality, we attempted to create a clinical individual whole exome sequencing (WES) analysis workflow, allowing identification of genomic alterations and presentation of neurooncologically-relevant findings. Methods The analysis workflow detects germline and somatic variants and presents: (1) germline variants, (2) somatic short variants, (3) tumor mutational burden (TMB), (4) microsatellite instability (MSI), (5) somatic copy number alterations (SCNA), (6) SCNA burden, (7) loss of heterozygosity, (8) genes with double-hit, (9) mutational signatures, and (10) pathway enrichment analyses. Using the workflow, 58 WES analyses from matched blood and tumor samples of 52 patients were analyzed: 47 primary and 11 recurrent diffuse gliomas. Results The median mean read depths were 199.88 for tumor and 110.955 for normal samples. For germline variants, a median of 22 (14–33) variants per patient was reported. There was a median of 6 (0–590) reported somatic short variants per tumor. A median of 19 (0–94) broad SCNAs and a median of 6 (0–12) gene-level SCNAs were reported per tumor. The gene with the most frequent somatic short variants was TP53 (41.38%). The most frequent chromosome-/arm-level SCNA events were chr7 amplification, chr22q loss, and chr10 loss. TMB in primary gliomas were significantly lower than in recurrent tumors (p = 0.002). MSI incidence was low (6.9%). Conclusions We demonstrate that WES can be practically and efficiently utilized for clinical analysis of individual brain tumors. The results display that NOTATES produces clinically relevant results in a concise but exhaustive manner.

provide additional information such as direct measurement of the mutational burden [11,12] and exploration of signatures of mutational processes [13,14]. Brain tumors have complex genetic landscapes [15][16][17]. Therefore, it is beneficial to gather the most comprehensive genomics information for each neurooncology patient. We hence advocate utilizing WES for neurooncological genomics analyses as it gathers comprehensive information with a lower cost than WGS and is technically less challenging to analyze and interpret.
The bioinformatics workflows for variant calling are well established but the clinical interpretation of the identified variants constitutes a bottleneck in the analysis [18]. In the clinical setting, the analysis workflow should produce results that are both exhaustive and suitable for clinical interpretation. Intending to be simultaneously comprehensive and practical, we created a clinical WES workflow tailored for neurooncology. This approach sequentially filters and presents layers of findings relevant to neurooncology (the layers being alterations that are detected in curated collections of clinically-relevant genes). This sequential filtering approach prioritizes highly relevant findings while still reporting less relevant but possibly important findings. This article presents our approach and provides results of the analysis of our findings on a sizable glioma cohort, demonstrating that our approach yields clinically relevant results.

Reads-to-variants workflow
The overview of the complete workflow is presented in Fig. 1. The reads-to-variants pipeline is presented below.

Personalized neurooncology report workflow
To produce comprehensive reports of WES results, we developed the reporting workflow NOTATES. NOTATES uses curated datasets of glioma-and cancerrelated variants and genes to sequentially report clinically relevant findings.
After a summary of somatic WES findings, the report contains the following sections: The contents of these sections are detailed in the Results section. NOTATES was written in R [24] and R Markdown.

Analyses and patients
Using NOTATES v1.5, 58 WES analyses from matched blood and tumor samples of 52 patients were analyzed: 47 primary and 11 recurrent diffuse gliomas. Overall, 47 grade IV (81.03%), 7 grade III (12.07%), and 4 grade II tumors (6.9%) were analyzed. Clinical details for all patients and analyses are presented in Additional file 2: Table S1. For each tumor specimen submitted for WES, sections were reviewed by a neuro-pathologist to confirm the diagnosis of diffuse glioma and specifically excise a region within the tumor sample containing only tumor tissue. DNA was extracted using the DNeasy Blood & Tissue Kit (QIAGEN).
All analyses of NOTATES results presented here were performed using R. Selected results were compared with results from the TCGA pan-glioma cohort [15].

Software availability
The reads-to-variants and reporting workflow NOTATES is available for non-commercial purposes on GitHub: https ://githu b.com/egeul gen/NOTAT ES.

Sequencing quality metrics
The median mean read depths were 199.88 for tumor and 110.955 for normal samples. The median percentages of reads with at least 25X coverage were 99.3% and 98.55% for tumor and normal samples, respectively. Detailed quality metrics are presented in Additional file 2: Table S2.
Considerable percentages of combined reported variants (in all patients) per each category did not have a record in ClinVar ("not reported") and for variants with a ClinVar record. The most frequent clinical significances were "Drug response" for "ACMG Incidental Findings" (37.3%), "Conflicting" for "Variants in Cancer Gene Census Genes" (5.17%), and "VUS" for "Variants in Cancer Predisposition Genes" (16.67%) and "Variants in DNA Damage Repair Genes" (3.82%) (Additional file 2: Fig.  S1). Very small fractions of reported variants per each category were reported as "Pathogenic" or "Likely Pathogenic": 2.38% for "ACMG Incidental Findings", 0.48% for "Variants in Cancer Gene Census Genes", 4.17% for "Variants in Cancer Predisposition Genes" and 1.91% for "Variants in DNA Damage Repair Genes" (Additional file 2: Fig. S1).

Somatic short variants
To filter out sequencing artifacts, raw somatic short variants (median = 14,000, range = 4068-55,533 per analysis) are similarly filtered following the GATK best practices recommendations to result in a median of 223 (range = 57-22,271) variants per analysis (Fig. 2b). For reporting, we further filter these "called" variants and only include variants that: • have tumor Variant Allele Frequency (VAF) > 5% • have non-synonymous impact • are not in FLAGS genes.
This filtering results in a median of 49.5 (range = 2-5646) variants per analysis. A median of 6 (range = 0-590) variants is in the reported categories: A median of 2 (range = 0-44) in "Variants in Established Glioma Genes", 0 (range = 0-28) in "Hotspot Variants in Cancer Gene Census Genes", 2 (0-309) in "Other Variants in Cancer Gene Census Genes", 0 (range = 0-57) in "Variants in DNA Damage Repair Genes" and 1 (range = 0-152) in "Variants in Important KEGG Pathway Genes". Figure 3 presents the reasoning behind the sequential filtering of somatic short variants. "Called" (sequencing artifacts excluded) somatic short variants are initially filtered according to the above-mentioned criteria, excluding an average of 78.08% (SD = 8.36%) of "called" variants ( Fig. 3a). An average of 2.91% (SD = 1.62%) of "called" variants were reported sequentially in the (1) "Glioma-related" subsection ("Variants in Established Glioma Genes"), (2) "Cancer-related" subsections ("Hotspot Variants in Cancer Gene Census Genes" and "Other Variants in Cancer Gene Census Gene") and (3) "Selected Gene Sets" subsections ("Variants in DNA Damage Repair Genes" and "Variants in Important KEGG Pathway Genes"). On average, 19.01% (SD = 7.63%) did pass the reporting filter but were not reported. By sequential filtering, a variant reported in a category is not reported in the following ones. A mean percentage of 31.28%

Tumor mutational burden and microsatellite instability
The TMB values of all tumors are presented in Additional file 2: Fig. S3A. TMB in primary gliomas (median = 3.2/ Mb) were significantly lower than the TMB in recurrent cases (median = 5.8/Mb. Wilcoxon, p = 0.002). The TMB values in different molecular subsets (devised based on WES findings) were also significantly different (Kruskal-Wallis, p = 0.0072. Additional file 2: Fig. S3B).
The TMB distribution of this glioma cohort was comparable to (i.e., not significantly different than) the TMB distributions of the TCGA-Glioblastoma multiforme (GBM) and TCGA-Low-grade Glioma (LGG) cohorts (t-test p = 0.7 and p = 0.37 for GBM and LGG, respectively. Additional file 2: Fig. S3C).
There were 4 cases (6.9%) that were predicted to have microsatellite instability and none of the cases were predicted to have POLE deficiency.

The most frequently reported alterations Germline variants
The top 10 genes that harbored a reported germline variant in each subsection are presented in Fig. 4a. A large proportion variants reported in the germline variants section had no clinical significance annotation in ClinVar.

Somatic short variants
The top 10 genes that harbored a reported somatic short variant per each subsection are presented in Fig. 4b. Overall, there was a positive correlation between the percentages of all the top reported genes between the current cohort (NOT) and the TCGA cohort (Spearman's ρ = 0.5, p < 0.001). Except for the subsection "Variants in DNA Damage Repair Genes", there were positive correlations of percentages of the top 10 genes between the current cohort (NOT) and the TCGA cohort per each subsection (Fig. 4b, top figure). It can be observed that the most frequent genes were observed in "Variants in Established Glioma Genes". The gene with the most frequent somatic short variants was TP53 (41.38%).

Somatic copy number alterations
Under "Established SCNAs in Glioma", the most frequently reported SCNA events were CDK6 amplification

Personalized neurooncology report
To communicate the findings of potential clinical relevance, we developed a comprehensive personalized neurooncology report (Fig. 5, Additional file 1).

Summary report of somatic WES findings
The initial page of the report summarizes somatic findings, including TMB, somatic short variants, and somatic copy number alterations on a single page (Fig. 5). The summary includes a description section, providing information on the indication for testing, treatment status, tumor sample type, normal sample type, and DNA extraction method. The summary indicates exome coverage, providing a high-level overview of the quality of the patient's exome data.
For TMB, a plot of all previously analyzed tumor samples' TMB values, along with the current tumor's TMB (circled in red) by molecular subset (devised based on WES-identifiable markers) is provided, and the TMB of the current is indicated. The MSI status of the tumor is also indicated in this section.
For somatic short variants, all somatic short variants reported in different categories (i.e., "Variants in Established Glioma Genes", "Hotspot Variants in Cancer Gene Census Genes", "Other Variants in Cancer Gene Census Genes", "Variants in DNA Damage Repair Genes" and "Variants in Important KEGG Pathway Genes") are presented in a table format, containing information on variant impact classification, protein change annotation, genome change annotation, and tumor VAF.
For SCNAs, a plot of copy number of segments by chromosome is provided as well as a table containing all gene-level SCNAs reported in each category (i.e. "Established SCNAs in Glioma" and "SCNAs in Cancer Gene Census Genes".

(I) Summary table of quality metrics
A summary of sequencing quality, including the number of lanes, read type, read length, the total number of reads, PF reads, aligned PF reads, PE aligned, mean coverage, and percentage of bases covered at 1X, 5X, 10X, 25X, 50X and 100X, is reported here.

(II) Tumor purity
The fraction of reads coming from cross-sample contamination, reflecting a measure of tumor purity, is calculated using the GATK-CalculateContamination tool and presented here. A purity/clonality estimate (reflecting normal contamination in the tumor sample) based on copy number alterations is presented under section "Tumor Heterogeneity Analysis".

Somatic SNV/indels
Somatic variants obtained via MuTect2 are filtered to have a variant allele frequency (VAF) of at least 5% and be non-synonymous variants. FLAGS [26] were excluded from the report. The variant subsections in this section also follow a sequential order.

a. Tumor mutational burden
In this subsection, the Tumor Mutational Burden (TMB) is reported. TMB is defined as the number of somatic mutations in the coding region per megabase, including SNVs and indels. This calculation is performed through: 1. keeping variants with VAF > 5% and.

keeping variants with a sequence depth > 20X in the tumor and > 10X in the normal.
Two scatter plots and a table summarize the median TMB values overall and for each molecular subset (devised based on WES-identifiable markers) for the current and all previously reported tumors. b. Microsatellite instability status The MSI status is predicted using the tool MSIpred [36]. Additionally, polymerase-epsilon deficiency is predicted based on the presence of (a) somatic SNVs/ Mb > 60 and (b) somatic indels in single sequence repeats/Mb < 0.18. The predicted MSI and polymerase-epsilon deficiency statuses are reported here. Contains possibly important somatic SNV/indels in selected KEGG [37] pathways (namely "Cell cycle", "mTOR signaling" and "Pathways in cancer").

. Established SCNAs in Glioma
This subsection presents SCNAs that are in a list of gene-level SCNAs curated because of their importance in gliomas (as reported in the aforementioned TCGA pan-glioma study [15]). g. SCNAs in Cancer Gene Census Genes This subsection lists SCNAs where the gene subject to copy-number alteration is listed in CGC. h. Broad SCNAs This subsection lists SCNA events that span over one or more cytobands. i. Plots of SCNA Segments by Chromosome This subsection displays SCNA plots (log 2 (Tumor/Normal) ratio vs. position) per all chromosomes.

Loss of Heterozygosity (LOH) events
For high confidence, only LOH events for which the absolute difference of B-allele frequencies (|BAF Tumor − BAF Normal |) is larger than 0.4 are reported.

a. LOH overview
All LOH events that pass the filter are reported here. b. LOH + somatic SNV/indel Here, alterations where a gene has LOH, and a somatic SNV/indel are reported. c. LOH events in CGC genes LOH events where the gene subject to LOH is listed in CGC are reported here.

Genes with double hit
A double hit strongly suggests a relevant tumor suppressor gene [42]. In this section, the list of genes with somatic SNV/indel as well as SCNA and/or LOH events are reported.

Tumor heterogeneity analysis
To estimate tumor purity as well as clonal/subclonal SCNAs, THetA [43] is used. In this section, the results of THetA are presented.

Mutational signatures
Somatic mutations in cancer genomes are caused by multiple mutational processes, each of which generates a characteristic mutational signature [14]. Analysis of mutational signatures is becoming routine in clinical cancer genomics as the detected signatures of mutational processes have implications for pathogenesis, classification, prognosis, and even treatment decisions [44,45].
Using DeConstructSigs [46], the weights of Mutational Signatures v3 (May 2019) from COSMIC [47] are estimated. This section presents the mutational signatures detected in this tumor.

pathfindR-KEGG pathway enrichment analysis
For studying mechanisms underlying oncological processes, KEGG pathway enrichment analyses are performed using the active-subnetwork-oriented enrichment approach of pathfindR [48].

a. Enrichment Results for High-impact Somatic SNV/ indels
Genes harboring any somatic non-synonymous variants with a VAF > 5% and not in FLAGS are used for analysis. b. Enrichment Results for High-impact SCNA Genes harboring homozygous deletion(Tumor/Normal ratio < 0.5) or multi-copy amplification(Tumor/Normal ratio > 1.5) are used for analysis.

Discussion
The reporting of findings of potential oncological relevance from NGS is rapidly expanding into the clinical area [1][2][3]. In this work, we aimed to present the efficiency and utility of our approach to analyze wholeexome sequencing data of individual gliomas and produce clinically interpretable reports of individual cancer genomes. The approach attempts sequential filtration of various layers of genetic information to assist in clinical decision-making.
It is established that individual tumors may harbor clinically relevant alterations which are not observed frequently in tumors of the same cancer type [49]. In our approach, alterations are prioritized from "highly likely" to "less likely" to be clinically relevant. This is done by sequentially filtering for (1) glioma-related alterations followed by (2) cancer-related alterations followed by (3) alterations in selected gene sets. Through sequential filtering, NOTATES greatly reduces the number of variants to be reported while still retaining the most clinically relevant variants as well as other variants of potential significance.
The clinical interpretation of germline variants in cancer is challenging. The sequential reporting of germline variants in NOTATES allows the clinician to identify any clinically relevant variants. The "ACMG Incidental Findings" section allows the identification of incidental variants, followed by "Variants in Cancer Gene Census Genes" and "Variants in Cancer Predisposition Genes" allowing the identification of cancer-related variants. "Variants in DNA Damage Repair Genes" specifically lists germline variants in DNA damage repair genes, which are important in gliomas because numerous studies have provided evidence that DNA repair deficiency was a central theme in gliomagenesis, a finding also reported in our previous study [50,51]. Most reported germline variants were not included in ClinVar. As previously reported, the prevalence of "pathogenic" / "likely-pathogenic" germline variants in the ACMG Secondary Findings v2.0 list was low [52] whereas the prevalence of such variants in the cancer-related subsections were relatively higher (among variants with clinical significance annotation in ClinVar).
For somatic SNV/indels, the subsection "Variants in Established Glioma Genes" contains the most likely glioma-specific drivers. Overall, a third of the somatic SNVs reported were in this subsection per tumor. The two following subsections contain somatic variants in CGC genes, pointing to possible oncogenic alterations that are not tumor-type-specific. Hotspot alterations were infrequent but a third of the reported variants per tumor were alterations in CGC genes. The median VAF of the glioma-specific alterations (reported under "Variants in Established Glioma Genes") was relatively higher than that of alterations reported in the other subsections, emphasizing the importance of this subsection.
TMB and the predicted MSI status, which are both predictive biomarkers for systemic cancer immunotherapy [56][57][58], are included in the report as well. Rather than only providing a hard cut-off value, we provide a plot and a table summarizing the TMB status of all reported gliomas, which enables the clinician to evaluate the TMB status in the relevant context. The TMB distribution of this glioma cohort was similar to those of the TCGA glioma cohorts. As expected, the median TMB value for recurrent tumors was higher than the primary tumors. The TMB values of different glioma molecular subsets were also different. Along with TMB, we also predict MSI status and possible POLE deficiency. As previously reported, the incidence of MSI in diffuse gliomas was low [59][60][61].
Because NOTATES allows the identification of specific genetic alterations indicating differing clinical outcomes in gliomas, the findings in the NOTATES report reflect the severity of the tumor. For example, if a mutation in IDH1/IDH2 is detected, this indicates a better prognosis [62,63], whereas H3-K27M or G34 mutations imply worse disease outcome [64,65]. Similarly, IDH-wild-type gliomas with EGFR amplifications and/or chromosome 7 amplifications and chromosome 10 loss can be molecularly defined as GBM, conferring worse prognosis [66,67]. In addition to specific genetic alterations, NOTATES calculates TMB and evaluates the presence of MSI, further aiding the clinical assessment because these are both predictive biomarkers for systemic cancer immunotherapy [56][57][58].
It is important to emphasize that all findings presented in the NOTATES report complement each other. For example, a high TMB, predicted MSI, somatic variants in mismatch repair genes and mismatch repair deficiency-related mutational signatures will all support highly likely mismatch repair deficiency in a tumor, indicating a higher chance of response to immunotherapy.
Identification of clinically relevant findings from the vast amount of data produced by WES is a substantial challenge [49,68]. In this work, we aimed to propose a solution to this issue by presenting our approach for reporting of genomic findings from WES data of individual gliomas. Using curated resources, NOTATES investigates and presents various forms of findings of potential clinical importance: germline short variants, somatic short variants, somatic copy-number alterations, loss-of-heterozygosity events, tumor mutational burden, microsatellite instability, and mutational signatures. The NOTATES report is formatted to provide a coherent overview of clinically-relevant genomic findings, enabling the adaptation of WES to the clinical setting. For this purpose, NOTATES utilizes curated sets of relevant genes and databases that collect knowledge about cancer alterations and their relationships to tumor formation and clinical utility and reports the findings in a sequential manner according to clinical relevance. The results in this work demonstrate that NOTATES successfully captures glioma-specific alterations while also reporting possibly relevant cancerrelated alterations. The comprehensive report contains the most clinically important findings that may aid in clinical decision-making.