Exonic variants undergoing allele-specific selection in cancers

Allelic imbalance (AI) in tumors is caused by chromosomal and sub-chromosomal gains and losses. We evaluated AI at 109,086 germline exonic SNP loci in four cancer types, and identified a set of SNPs that demonstrate strong tumor allele specificity in AI events. Further analyses demonstrated that these alleles show consistently different frequencies in the cancer population compared to the healthy population and are significantly enriched for predicted protein-damaging variants. Moreover, genes harboring SNPs that demonstrate allele specificity are enriched for cancer-related biological processes and are more likely to be essential in cancer cells. In summary, our study provides a unique and complementary method to identify genes and variants that are relevant to carcinogenesis.


Background
Somatic DNA alterations are crucial for the acquisition of tumor-related traits. One class of alterations, allelic imbalance (AI), occurs when a segment of one parental chromosome increases or decreases in copy number relative to the other. If the parental homolog with the resulting larger copy number-referred to herein as the "promoted" homolog-carries a genetic variant that is more advantageous to tumor growth than that carried by the other homolog, then cells promoting the advantageous allele gain a selective advantage. The resulting tumor can then be observed to harbor more copies of this allele than its counterpart. That is, AI can be viewed as a readout for allelic selection, thereby nominating candidate genes and alleles of importance (Fig. 1a). We stress that "promotion" here refers to an allele possessing a higher fraction than its other parental counterpart, regardless of mechanism. The counterpart may be deleted, the promoted allele may be duplicated, or both. Knudson's "two-hit" hypothesis is a specific case of the allelic promotion mechanism, wherein a deleterious variant is promoted via somatic loss of the wild-type.
With the advent of "next generation" sequencing (NGS) [1], it is now possible to interrogate the entire tumor genome in an unbiased manner. The Cancer Genome Atlas (TCGA) provides a large collection of whole-exome sequencing (WES) data from both tumor samples and the matched normal samples. These data sets enable a nearcomprehensive view of somatic AI in germline exonic polymorphisms across thousands of patient samples. The global characterization of these classes of germline exonic variation now allows an agnostic and systematic search for alleles demonstrating preferential tumor enrichment.
Genomic studies of large tumor sample sets typically focus on recurrence as a signature of a driver/ causal status. Recurrence is usually measured either for point mutations and indels [2], or for structural amplifications and deletions [3]. Here, we offer an alternative and complementary approach that exploits information from inherited variants coupled with copy number alterations. Using AI as a lens through which to view selection, we hypothesize that common germline exonic variants advantageous to tumor growth will produce a statistical signal of preferential promotion across the patient population. Alleles observed to be promoted in a statistically recurrent manner across independent tumors provide evidence of cancer relevance.
Here, we provide a generalized framework in which to investigate selection of germline exonic alleles in areas of AI. Since selection implies function, we aimed to identify novel genes and variants displaying a signature of selection, thereby implying importance for tumor-related properties. Toward this end, as a pilot demonstration we analyzed data from TCGA across four tumor types: breast cancer, colon cancer, lung cancer and prostate cancer, as these comprise the most common cancer types in terms of new cases each year.

Bayesian model
The Bayesian model is fit using the L-BFGS-B algorithm [4]. The hyper-parameters used in the model were inspired by prior studies [5]. The boundaries and prior distributions for each parameter are provided in Supplementary Table S1 (Additional file 1: Table S1). We conducted the analysis in R using package emdbook (version 1.3) and package stats (version 3.5). The convergence is assessed based on whether the reduction in the objective is within this factor of the machine tolerance (1 × 10 -8 ).

Determine the allelic coverage for germline coding SNPs
We chose 1,695,264 coding SNPs from dbSNP144. For each SNP i we retrieved the allelic coverage N ija and N ijb of the A allele and B allele in tumor sample j from paired exome-sequencing data respectively using SAMtools. Similarly, we let N ij0a and N ij0b denote the Fig. 1 Illustration of the study scheme. a Overview of the analysis. Whole-exome sequencing from patients is used to identify heterozygotes and assess AI. To generate the null distribution of the selection parameter π, the identity of the promoted allele (A or B) is swapped repeatedly, with 50% probability each time, and the resulting π recomputed each time. Venn diagrams showing the numbers of AI alleles identified in four cancer types at the significance level of b 0.05; c 0.01 and d 0.001 coverage of the A allele and B allele of SNP i in matched normal sample j0.
The germline genotypes are determined by N ij0a and N ij0b from the matched normal sample. We call a given locus i as heterozygous if the total coverage (N ij0a + N ij0b ) is greater than 20 and B-allele frequency, BAF ranges from 0.2 to 0.8, where

Determination of significantly imbalanced alleles
We determined significance of the allelic selection of a SNP i in a cancer population using a permutation test. For each permutation, with 50% probability, all paternal and maternal alleles are swapped on the chromosomal arm of interest. This is achieved by relabeling all A alleles as B and all B alleles as A. Since phase is preserved in allelic imbalance events in the original data (indeed either the paternal or maternal chromosome acquires the amplification or deletion event), it will be preserved for each permutation since all alleles are swapped (or not). Effectively, the only change is which parental chromosome acquires the amplification/deletion. No computational phasing is necessary.
We computed the null distribution of π i from 1000 iterations of permutation in the corresponding TCGA population for 56,677 SNP loci. To determine if an estimated π i is significant, we compared it to the null distribution obtained from the pooled permutation-based estimates of π . Specifically, we obtained an empirical P-value for each observed value of the test statistics abs(0.5 -π i ) using the empPvals function from R package "qvalue". The null distribution used in empPvals was derived from the pooled 1000 × 56,677 values of abs(0.5 -π ) obtained from the permutations. Finally, we used the function qvalue from the R package "qvalue" to derive q-values from the empirically-derived P-values. It should be noted that this approach was shown by Storey et al. [6] to be equivalent to directly thresholding the test statistics themselves and utilizing an analogous FDR estimator.

Association between allelic selection and protein-damaging variants
We stratified the exonic alleles according to "Combined Annotation Dependent Depletion" (CADD). CADD is suitable in our case because it is derived by contrasting variants that survived natural selection with simulated mutations, which is consistent to the processes of somatic evolution. Alleles with CADD larger than 10 are classified as "deleterious" variants. We then compared the fraction of deleterious variants within subgroups of variants under different magnitudes of somatic selection. We compared the fraction of deleterious variants in selected-for/against alleles in each cancer type using a hypergeometric test.

Association between allelic selection and germline risk variants
We used Eigenstrat [7] to identify TCGA individuals of European ancestry. We compared the frequencies of the selected-for and selected-against alleles in European populations from the 1000 genomes project and the corresponding cancer population from TCGA. For each cancer type, we then evaluated the fraction of the alleles with significantly higher or lower frequencies in cancer within subgroups of alleles under different magnitudes of somatic selection ( π i ). The differences in the fractions of the alleles were assessed using a hypergeometric test.

A bayesian approach to evaluate allelic imbalance in four cancer population
To reveal the landscape of selective allelic imbalance in the exons, we choose the 1,695,264 biallelic exonic SNPs from the dbSNP database [9] (ref dbSNP144, Methods). We removed variants with very low population allele frequencies (minor allele frequency less than 0.005), which yielded a set of 155,702 SNPs. After filtering, we kept 56,677 SNPs for further analyses (Additional file 3: Fig.  S1). For each SNP i under consideration, for each patient j with a normal-cell heterozygous genotype we retrieved the base-level coverage of both alleles (A and B) from the corresponding tumor sequences Y ij = {y Aij , y Bij } , and the matched normal sequences Y ′ ij = {y ′ Aij , y ′ Bij } , respectively. Using data across all individuals j, we then applied a Bayesian-based approach to estimate an SNP-specific parameter ( π i ), which represents the tumor preference of the B-allele over the A-allele. Thus, π i serves as a surrogate for the somatic allele-specific selection pressure in cancer. When π i is statistically larger than 0.5, this is evidence that the B-allele of the SNP is under positive selection ("selected-for"); π i being statistically smaller than 0.5 is evidence that the B-allele is under negative selection or "selected-against" (Fig. 1a and "Methods"). To address the confounding effects during the library preparation and the mapping of sequences, we introduced two SNPspecific parameters, δ i and ϕ i , which correspond to the base-calling error and mapping bias toward the reference alleles, respectively. We normalized the allelic counts, Y ij and Y ′ ij for the cross-individual variation of coverage as described previously and yielded the normalized coverage for SNP i and sample j, k ij = K Aij + K Bij . The observed allelic coverage follows a beta-binomial distribution with parameters p i and θ i , where p i is the mean of the beta prior and θ i is the dispersion factor [10].
In order to model p_i, for each SNP, we define In the tumors, π i is a real number between 0 and 1; in normal tissues, where both alleles are equally represented, π i = 0.5. Then

Thus
The posterior probability of the model parameters is given as: After further quality filtering (Additional file 3: Fig. S1), we were left with 56,677 SNPs to test for tumor allelic preference as follows. Each π i is estimated from the observed data using the maximum a posteriori probability (MAP) estimate. To determine the significance of the selection pressure ( π i ) on the variants in a cancer population, we performed permutation tests by randomly swapping the alleles between the paternal and maternal chromosome arms. This permutation procedure destroys any correlation between allele and promotion status, while retaining linkage disequilibrium structure. Based on the null distribution of simulated π i values from 1000 rounds of permutation (Method), we obtained the alleles under somatic selection in each cancer type at significance levels of 0.05, 0.01 and 0.001.

Variants under significant allele-specific selection in cancers
From the four cancer types, we identified 88 to 3,310 unique putatively selected-for alleles corresponding to the significance levels of 0.001 (N = 88), 0.01 (N = 752) and 0.05 (N = 3310), respectively. The unique putatively selectedagainst alleles in the four cancer types range from 204 (significance level 0.001) to 5228 (significance level 0.05) ( Table 1 and Additional file 4: Data S1).
In order to control for the false discovery rate, we also identified the variants under allele-specific selection based on q-values [11,12] of 0.1. Using this threshold, we found 7 variants that undergo significant allele-specific selection in breast cancer and 21 in lung adenocarcinoma (Table 2). However, using false discovery rate will omit some variants undergoing true somatic selection [13]. Hence, to evaluate the landscape of the somatic selection in cancer exomes, we retained the alleles with different levels of significance in the remainder of our analysis. The landscape of alleles under somatic selection are highly specific to the cancer types. Our data show no selected variants common to all four types at any significance level from 0.001 to 0.05. Most of the significant AI variants are cancer type-specific (Fig. 1b, c).

Protein-damaging variants undergo significant somatic selection in Cancer
We classified the exonic SNPs into deleterious loci and non-deleterious loci based on the Combined Annotation Dependent Depletion (CADD) score [14]. CADD assesses variants according to their likelihood of being deleterious to humans on the population level. Interestingly, the fraction of deleterious alleles is significantly higher in both selected-for alleles or selected-against alleles, as compared to all exonic variants under consideration ( Fig. 2a and Table 3). This suggests that variants that have not withstood the evolutionary selective pressure across millions of years are more likely to confer a relative advantage or disadvantage (as opposed to being neutral) to the tumor cell when promoted, as compared to variants that have withstood such selection (see "Discussion" for further elaboration on this point). As for specific cancer types, at the significance level of 0.05, the fraction of deleterious alleles ranges from 48.1% (PRAD) to 51.6% (BRCA) in the selected-for alleles; and 44.9% (LUAD) to 50.3% (BRCA) in the selected-against alleles. The enrichment of deleterious alleles in the selected-for alleles and selected-against alleles are statistically significant (P < 0.05) in all cancer types except for PRAD.

The selected-for alleles in cancers are enriched among cancer patients
We reasoned that many alleles under somatic selection are likely to be associated with either advantageous or disadvantageous traits in the cancer cells, and therefore such alleles should appear at altered frequency in the cancer population at germline level. To verify this, we compared the allele frequencies of the alleles under somatic selection between the cancer patients (European ancestry, TCGA) and a control population of 699 individuals (European ancestry, 1000 Genomes Project) (Additional file 5: Fig. S2). Of the selected-for alleles (at the 0.05 significance level) in BRCA and COAD, 14.5% and 31.9% are significantly (q < 0.05) more frequent in the corresponding TCGA cohort than in the control population. Conversely, the alleles with significantly higher frequencies (q < 0.05) in the corresponding cancer populations are significantly enriched in the somatically selected-for alleles for both cancers (hypergeometric test P < 0.05, Additional file 6: Table S3 and Additional file 7: Table S4). On the other hand, alleles with lower frequencies in the corresponding TCGA cohort are significantly enriched in the selected-against alleles for LUAD (P = 1.443 × 10 -22 , Additional file 8: Table S5) and PRAD (P = 4.226 × 10 -3 , Additional file 9: Table S6). Overall, the alleles with higher frequencies in cancer populations tend to be enriched in the selected-for alleles (P = 3.56 × 10 -7 ) whereas the alleles with lower frequencies in cancer tend to be enriched in the selected-against alleles (P = 3.19 × 10 -6 , Fig. 2b, c).

Genes affected by allele-specific selection
We next investigated the genes that are affected by allelespecific selection. At a significance level of 0.05, there are  A recent study reports pathogenic germline variants in 10,389 tumors [15]. The authors found 10 tumor suppressor genes harboring pathogenic/likely pathogenic alleles whose wild-type complement was lost in LOH events. Of these 10 predisposition genes, our analysis revealed two affected by somatic selection: ATM and BRCA2. We further hypothesized that genes harboring selectedfor alleles may be enriched for genes essential for cancer cell survival and proliferation. To this end, we exploited the CERES score [16], which estimates dependency levels of genes from CRISPR-Cas9 essentiality screens. In the genes carrying selected-for alleles, the CERES scores skew lower (lower CERES scores indicate stronger genetic dependency) with increasing significance of π i , and significantly deviates from the background distribution (Kolmogorov-Smirnov P = 0.00704) (Fig. 3a). However, in the genes carrying selected-against alleles, no significant tendency is observed (Fig. 3b).
We retrieved the protein-protein interaction networks (PPI) for the genes affected by somatic selection in the four cancer types, based on which we identified 3 modules with significantly higher burdens of somatic selection (Fig. 4a-c and Additional file 10: Table S7). We observed that the modules are enriched for pathways with known functional implications in cancer. In particular, NOTCH signaling pathway (BRCA, LUAD, q < 0.05), JAK/STAT signaling pathway (LUAD, COAD, q < 0.05), toll-like receptor signaling (LUAD, COAD, q < 0.05) and apoptosis (LUAD, COAD, q < 0.05) are overrepresented in somatically selected genes in multiple cancer types (q < 0.1).

Discussion
Imbalance between human paternal and maternal alleles presents in various biological processes. At the transcriptomic level, allelic imbalance manifests as allele-specific expression, often from imprinting or allele-specific binding at the epigenomic level. In the cancer genome, allelic imbalance can result from frequent somatic copy number alteration and has been reported in many cancers for its biological and phenotypic implications. In tumors, allelic imbalance of a functional variant can alter proliferation capacity and fitness, subjecting the cell to different selection pressures. As a result of such selection, alleles conferring cancer fitness will be promoted over the other. Therefore, allelic imbalance can serve as important evidence for pathogenicity of germline variants that complements population frequency-based association [17].
While many studies address allelic imbalance on the transcriptomic level, or focus on specific genes or loci [18], less is known about the global somatic landscape of allelic imbalance on the DNA level. In this study, we evaluated the allelic imbalance of exonic alleles in four TCGA cancer cohorts, thereby revealing the landscape of somatic selection on germline variants. Our data demonstrate that somatic selection of the exonic alleles are associated with functional impact.
We report here 28 alleles that show signals of somatic selection at the q < 0.1 level, although there are likely many more that undergo selection but do not meet this threshold owing to the large number of statistical tests performed. For this reason, we considered all alleles achieving nominal significance and examined them collectively for enrichment in various functional categories. Interestingly, among the 28 identified alleles, most are synonymous variants. We deliberately included synonymous variants in our study, since synonymous mutations are estimated to represent 6-8% of all driver substitution mutations in cancer [19].
Our findings confirm that functionally deleterious alleles are subject to stronger somatic selective pressure in cancers. Intriguingly, both somatically selectedfor and somatically selected-against alleles had higher CADD scores than alleles showing no signals of selection (Fig. 2a). Since CADD scores reflect the likelihood that an allele is deleterious over human evolutionary time in the germline, there might be several reasons for this observation. The selected-for alleles may include those that would not withstand (human population) purifying selection because they confer increased susceptibility to cancer. Selected-against alleles, on the other hand, may include those alleles that would not withstand (human population) purifying selection because they compromise cellular processes that are crucial for both normal cellular function and function in the tumor. It would follow, then, that increased dosage of the wild type may provide an advantage in the tumor environment.
Somatic selection of a given allele was also associated with germline susceptibility. We found that somatically selected-for alleles are more frequent in cases than ancestry-matched controls. Similarly, selected-against alleles are more frequent in controls. These observation highlights the potential of somatic selection as an indicator of risk or protective status on a population level. We restricted this analysis to individuals of European ancestry since other populations had very small numbers in the cohort. These results should be tested in non-European individuals.  On the gene level, we found that those harboring selected-for alleles tend to be genes that are essential for cancer cells, as assessed by CRISPR-Cas9 screens. In general, the genes affected by allele-specific selection in the tumor genome are enriched for known cancerpredisposing genes as well as tumor-related biological pathways. For instance, the NOTCH signaling pathway is involved in many cancers as it determines the fate of cells and is a favorable therapeutic target [20,21]. The toll-like receptor pathways play a critical role in immune responses and thereby mediate the apoptosis of the cancer cells. Most of the genes affected by somatic selection of alleles are cancer type-specific. However, we also observed some genes that undergo consistent somatic selection in multiple cancer types. For example, prune homolog 2 with BCH domain (PRUNE2) is a known tumor suppressor in PRAD. In our analysis, PRUNE2 [22,23] is affected by four different, but consistently selected-against, alleles in four cancer types. Another tumor suppressor, microcephalin 1 (MCPH1), from which we identified six selected-against alleles, acts as G2/M checkpoint and promotes apoptosis in response to DNA damage [24][25][26]. Nevertheless, the findings of alleles under somatic selection are limited by the sample size and tumor purity. Variants with few heterozygotes in the cancer population tend to be less significant. Moreover, higher levels of normal-cell contamination would push the observed B allele frequency (BAF) toward 0.5, which could result in a loss of power owing to estimated values of π being less significantly different from 0.5. In addition, the somatic selection may act on haplotypes rather than SNPs, which is not considered in the current analysis.

Conclusions
In summary, we have described a statistical approach to reveal somatic allelic selection in the exomes of four cancer types and thereby suggest cancer-related genes and loci. These results together underscore the complexity of somatic selection in the process of clonal evolution. Since somatic selective processes in cancer differ from those at the germline level, evaluation of the allelic selection at the somatic level provides additional evidence to prioritize cancer-related genes. Our analysis is constrained to exonic SNPs, but many functional variants located outside the exons are also subject to allelic selection. With the rapidly growing volume of cancer genome sequencing data, revealing the landscape of allelic selection on the whole-genome, pan-cancer level is also foreseeable. In addition, the method can be applied to other NGS data sets such as RNA sequencing and ChIP-sequencing to suggest alleles of importance to the relevant biology.