Impact of mutational signatures on transcription factor binding motifs
We computed the motif alteration signature using the conditional probability between the transcription factor alteration probability and the trinucleotide mutational signature as described in Eq. 5 and Eq. 6. The predicted alteration probabilities for all TF motifs and all 30 COSMIC signatures are given in Additional file 6: Figure S6. The motif alteration results across all 512 JASPAR motifs are shown in Fig. 2a for motif creation (top) and motif disruption (bottom, see Additional file 1: Figure S1 for a more detailed representation including signature names and information on TF families.). To capture similar patterns between transcription factor motifs, we applied a PAM clustering to the set of 512 motifs. The silhouette coefficient shows a local maximum at kclust=4 indicating that the optimal number of clusters is 4.
The four clusters show a completely different behavior under the mutation signatures; whereas cluster 3 shows a low sensitivity to any of the mutational signatures (with some exceptions), cluster 4 on the other hand seems to be strongly impacted both by motif creation and disruption. Interestingly, comparing the disruption and creation heatmaps, we observe that the creation sensitivity for cluster 4 is high over all signatures, whereas we observe two groups of signatures for the disruption, with different impacts. We then studied the composition of TF motifs of each cluster. As expected, none of the clusters is dominated by a single TF family (Fig. 2b). However, Fig. 2b shows that some motif families are preferentially associated to one of the clusters. Forkhead motifs are in their vast majority associated with cluster 4 and show a high sensitivity to all mutational signatures. We also observe obvious mutual exclusivity between the cluster 1 and 2 versus cluster 3 and 4 for most of the TF families. SOX-related factors and C/EBP-related family, on the other hand, seem to be distributed across several clusters. For SOX-related motifs, we observe that these motifs form 2 major subgroups as shown in Fig. 2c. SOX-related factors associated with PAM cluster 1 contain an extended binding sequence AACAATKGCAKCAKTGTT whereas those associated with PAM cluster 2 and 4 contain a shorter version AACAATG binding motif.
After this global analysis of the alteration patterns across all 512 motifs, we next focused on specific motifs related to transcription factors which play a role in cancer. We extracted from COSMIC cancer gene census the list of transcription factors to investigate how the binding motifs of transcription factors mutated in the cancer genome are affected by somatic mutations. We computed the differential motif alteration probabilities of 30 signatures for 40 TF motifs with the strongest differential impact (Additional file 2: Figure S2). This map displays two broad groups of transcription factors with opposite behavior. Some interesting patterns can be observed. For example, HNF1A, a liver-specific transcription factor, appears to have a strong excess of creation upon signature 12, which is specifically found in liver cancer. Given the high expression of HNF1A in liver tissue, this excess of new binding sites could result in some novel, functional binding sites in liver cancer.
In order to capture the complexity of the relation between the mutational signatures and the binding motifs, we applied SOM clustering to group motifs showing similar behaviors. We performed SOM clustering over the 192 possible mutational transitions (taking both creation and disruption into account). We also combined these mutational probabilities into the 30 mutational signatures.
The SOM clustering provides a clear picture on the similarity of the alteration behavior of transcription factors across all trinucleotide mutational probability and the 30 mutational signatures. Transcription factors of the same transcription factor families often share similar binding sequence and their motif alteration behavior should also be similar. We observe that overall, transcription factors of the same family are well clustered together and globally share similar motif alteration probability patterns (Fig. 3a,b). For example, the SOX transcription factor family with SOX2, SOX4, SOX8, SOX9, SOX11, SOX10, SOX11, SOX17, and SOX21 are clustered together. However, for specific signatures, differences between related transcription factors do exist. In Fig. 3a, SOX10 appears to have a very similar creation probability to its neighbor SOX2 for signature 10, but shows a much lower creation probability in signature 23 (Fig. 3b).
Overall, by coloring the cells according to the family of the TF, we observe a global clustering of motifs from the same structural class (Fig. 3c,d). However, this does not hold true for all families; we highlighted some members of the GATA-family in Fig. 3c which appear to be dispersed across the SOM map.
Correlation Between Motif Creation and Disruption Signature
The previous results indicate that for many transcription factors, there is a compensating effect of creation and disruption. Further investigating this relationship we found indeed that the alteration probability between creation and disruption event in all signatures have globally a strong correlation between the creation and disruption (Fig. 4a). We found this correlation to be strongly related to the trinucleotide diversity of the motif matching sequences. To quantify this diversity, the trinucleotide content of each of the matching sequences is determined and the entropy is computed for each TF motif. The similarity between creation and disruption probability of each motif is evaluated using the absolute relative difference:
$$ {}\left| RD(s,tf)\right|=\frac{2\left|{Pr(s,tf,create) - Pr(s,tf,disrupt)}\right|}{ Pr(s,tf,create) + Pr(s,tf,disrupt)} $$
(12)
where s represents a signature and tf a binding motif. The scatter plot in Fig. 4b shows an inverse relation between the complexity of the binding sequences (as measured by the entropy of these sequences) and the difference between creation and disruption. Hence, the more complex the motif content, the stronger the creation and the disruption signature probability are correlated.
This effect is illustrated using Hoxd8 and PAX9 as an example. To ensure that the motif width does not bias the results, both selected motifs have a width of 17 nucleotides. As illustrated in Fig. 4d and Fig. 4e, Hoxd8 and PAX9 have very different trinucleotide content. This is obvious looking at the motif logo. The motif alteration probability with respect to the 96 trinucleotide mutation is shown below the logos. For Hoxd8, the motif alteration probability concentrates on trinucleotide mutations related to TAA, AAT, TTA, and ATT where the overlap of these trinucleotide mutation creation and disruption only occurs on A[T >A]A, A[T >A]T, T[T >A]A, and T[T >A]T.
On the other hand, the alteration probability of PAX9 is distributed along all the 96 trinucleotide mutations, resulting in the creation and disruption probability to be strongly correlated along the 96 trinucleotide mutations. This translates into a high similarity across the 30 mutational signatures. This is shown in Fig. 4c for Hoxd8 and PAX9 across the 30 mutational signatures.
Association of Deamination Signature and TFBS Creation
We next sought to validate our predictions using independent data. In [14], Zemojtel et al. described a set of transcription factors whose binding sites are frequently created as a result of CpG deamination process during evolution. These transcription factors include: c-Myc(Myc), Nfya, Nfyb, Oct4(POU5F1B), PAX5, Rxra, Usf1, and YY1. Given that some of the mutational signatures in cancer are related to CpG deamination, we sought to verify if the same transcription factors are impacted by this process due to somatic mutations. The creation probability of these transcription factor motifs are plotted in Fig. 5a over all 30 mutational signatures.
Signature 1 was described as being related to deamination of 5-methylcytosine [15] and ranks as the second highest signature in terms of creation frequency of these motifs in Fig. 5a. Signature 14 shows a very similar mutational profile with strong C →T bias. In addition, it is interesting to note that the POU5F1B motif has a high creation probability compared to other TFs across all signatures. This phenomenon is due to the high genomic frequency of motifs which differ from POU5F1B binding motifs by one mutation, which gives rise to a large number of k-mers which closely resemble the POU5F1B binding domain and serve as a substrate for TF motif creation events.
If we extend this single TFs to the family they belong to, we also observe a much higher creation probability for the other members of the POU-family, compared to the families of the other impacted TFs (Fig. 5b).
Mutation associated Mechanisms
There are three main molecular mechanisms leading to single nucleotide mutations in cancer: i) defective DNA mismatch repair (MMR); ii) APOBEC activity; and iii) transcription-coupled nucleotide excision repair (NER). In the catalogue of mutational signatures, several signatures can be related to each of these processes. We wanted to investigate which transcription factor motifs are most impacted by these three different molecular mechanisms.
For each of these mechanisms, we summarized the alteration results for all signatures annotated to the same mechanism; APOBEC is related to signatures 2 and 13, MMR to signature 6,15,20 and 26, and NER to signatures 4,7,11 and 22. We displayed the differential alteration probabilities (creation - disruption) in Fig. 6a. Interestingly, a number of transcription factor motifs show reverse behaviors with respect to these three mechanisms. For example, Foxd3 shows a much higher creation probability for APOBEC and NER related signatures, whereas the opposite holds for MMR signatures.
As an example, it was shown that a mutation associated with the APOBEC signature leads to the creation of a MYB binding site [16]. To understand if this single event leading to a driver mutation might result from a more general impact of the APOBEC signature on the binding sites of MYB, we indeed observed that the two APOBEC signatures (signature 2 and 13) show the highest bias towards MYB motif creation compared to all other signatures (P-value = 0.02, Fig. 6b). Hence, this single driver event might result from an elevated creation rate of potential MYB binding sites under these APOBEC signatures.
Comparison with the PCAWG Dataset
We then sought to validate our predictions using a dataset of observed SNV in cancer genomes. We used a dataset of 2708 whole-genome sequencing covering 40 cancer subtypes. The idea of the validation is to compare the predicted frequencies of motifs alteration that are explained purely by the mutational signature and its biases towards certain trinucleotides, with the frequencies of motif alteration observed when considering a SNV dataset in cancer genomes. If the predicted and the observed frequencies are comparable, then most of the signal of motif creation or disruption can be explained by the impact of mutational signatures. If on the other hand we observe a difference, this discrepancy could be attributed to additional effects in the real dataset, such as positive or negative evolutionary pressure in the cancer dataset leading to an increased frequency of motif creation or disruption. Hence, our goal is to highlight such potential effects and to use our signature based prediction as a baseline.
In order to compute a predicted alteration probability per patient, we used the signature exposure of each patient, and performed a linear combination based on the exposure values. In order to take into account the fact that there is heterogeneity between tumors even within a subtype, we used a clustering of samples based on their exposure to mutational signatures to split each subtype. While some subtypes are rather homogeneous i.e. lie mostly within one cluster, some others are spread across many clusters (see Additional file 3: Figure S3). The differential alteration probability is computed for each motif and each cluster within the subtypes by first subtracting the motif disruption and creation probability for each TF and then taking the median of the alteration difference by cancer entity and TF family. In Fig. 7a, we display the results for all TF families containing at least 10 motifs and each tumor subtype containing at least 50 samples.
For the 10 TF families, we observed that some appear to have a global excess of predicted disruption events over all cancer subtypes (bHLH, Ets-related), while other show the opositive effect (NK-related, SOX-related,...). Beyond these general trends, we have also observed that some cancer subtypes show a different trend. For example, Esophageal Adenocarcinoma (ESAD) shows an excess of bHLH-type motif creations, while melanoma (MELA) is predicted to contain an excess of Tal-related disruptions as opposed to all other cancer types. Looking even more in detail, differences within the tumor subtypes were obvious, between the samples belonging to the different clusters defined previously through the UMAP analysis. A prominent example was the alteration probabilities of Ets-related motifs in Melanoma. Melanoma shows by far the highest bias toward disruptions of Ets-motifs. However, this only holds true for the melanoma samples belonging to cluster 3 (brown in the figure). The other melanoma samples (cluster 1 and 21) show a balance between creation and disruption. These 3 melanoma clusters differ in their exposure profile (Fig. 7b). Indeed, cluster 3, which displays the high Ets disruption bias, is highly exposed to Signature 7, which is related to UV-induced mutations, and displays indeed the second highest disruption probability for Ets-motifs (Fig. 7c). The relation between UV-induced signatures and Ets binding sites has been described previously [17]. In this study, an increase of mutations in binding sites for TFs was described, in particular for Ets binding sites and UV induced mutations. Hence, it appears that our model is capable of capturing this effect, despite the fact that we focus on motifs occurrences and not actual binding sites.
Finally, we compared the predictions with the observed alterations across the PCAWG dataset. To perform this comparison, we first determined the observed motif creation and disruption probabilities of all TFs across all samples within a given cohort using the motif alteration calling pipeline described in the method section. This pipeline predicts, for each SNV, whether it disrupts or creates potential binding motifs, and yields for each patient and each motif a creation and disruption count for each TF motif. We computed the log-ratio of the number of observed creation/disruption events as determined by the pipeline, by summarizing all samples belonging to a tumor subtype and UMAP cluster. The comparison of this observed alteration bias with the alteration bias predicted by our model is displayed in Fig. 8. Overall, we found a very good correlation between the model prediction and the observation. The correlation between both was significant for all TF families, with some differences. For example, the correlation is very high for Ets-related factors, for which in most cases, the direction of the bias was concordant between model prediction and observed alteration counts. In other cases however, despite a good correlation, the direction of the alteration did not coincide well. For example, for RXR-related factors, we observe a global excess of motif disruptions, even for samples for which our model predicts an excess of creations. The prediction made for melanoma samples in cluster 3 of a strong excess of disruptions over creations is confirmed in the observed alterations.