Sex-specific recombination patterns predict parent of origin for recurrent genomic disorders

Background: Structural rearrangements of the genome, which generally occur during meiosis and result in copy number variants (CNVs; deletions or duplications ≥ 1 kb), underlie genomic disorders. Recurrent pathogenic CNVs harbor similar breakpoints in multiple unrelated individuals and are primarily formed via non-allelic homologous recombination (NAHR). Several pathogenic NAHR-mediated recurrent CNV loci demonstrate biases for parental origin of de novo CNVs. However, the mechanism underlying these biases is not well understood.
Methods: We performed a systematic, comprehensive literature search to curate parent of origin data for multiple pathogenic CNV loci. Using a regression framework, we assessed the relationship between parental CNV origin and the male to female recombination rate ratio.
Results: We demonstrate a significant association between sex-specific differences in meiotic recombination and parental origin biases at these loci (p = 1.07 × 10⁻¹⁴).
Conclusions: Our results suggest that parental origin of CNVs is largely influenced by sex-specific recombination rates and highlight the need to consider these differences when investigating mechanisms that cause structural variation.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12920-021-00999-8.


Background
Genomic disorders are caused by pathological structural variation in the human genome, usually arising de novo during parental meiosis [1][2][3][4]. The most common pathogenic rearrangements are copy number variants (CNVs), i.e. deletions or duplications of > 1 kb of genetic material [3,5,6]. The clinical phenotypes of genomic disorders are varied. They include congenital dysmorphisms; neurodevelopmental, neurodegenerative, and neuropsychiatric manifestations; and even more common complex phenotypes such as obesity and hypertension [7][8][9][10][11][12]. CNVs have been observed in 10% of sporadic cases of autism [13,14], 15% of schizophrenia cases [15,16], and 16% of cases of intellectual disability [17]. These and other associations highlight the importance of structural variation to human health and the need to understand the factors influencing how such variants arise.
There is an intense interest in understanding the mechanisms by which CNVs form [18,19]. In several regions of the genome, de novo CNVs with approximately the same breakpoints recur in independent meioses (recurrent CNVs) [1,20]. The presence of segmental duplications flanking these intervals is a hallmark feature of recurrent CNVs. It is hypothesized that misalignment and subsequent recombination between non-allelic low copy repeat (LCR) segments within the segmental duplication regions is the formative event giving rise to the CNV [21,22], so-called non-allelic homologous recombination (NAHR). Risk factors that may favor NAHR have been investigated and include sequence composition and orientation of the LCRs themselves [21,23] as well as the presence of inversions at the locus [24,25].
Parental sex bias for the origin of recurrent de novo CNVs remains unexplained. De novo deletions at the 16p11.2 and 17q11.2 loci are more likely to arise on maternally inherited chromosomes [26][27][28][29]. Deletions at the 22q11.2 locus show a slight maternal bias as well [30]. In contrast, deletions at the 5q35.3 locus (Sotos syndrome [MIM: 117550]) display a paternal origin bias [31,32]. Deletions at the 7q11.23 locus (Williams syndrome [MIM: 194050]) do not show a bias in parental origin [24]. While it has been suggested that sex-specific recombination rates might influence sex biases in NAHR [26], this hypothesis has not been formally tested.
The majority of recurrent CNVs are thought to form during meiosis when homologous chromosomes align and synapse during prophase I [33]. It is well established that meiosis differs significantly between males and females. In males, spermatogonia continuously divide and complete meiosis throughout postpubescent life, with all four products of meiosis resulting in gametes. In contrast, in human females, oogonia are established in fetal life and enter a prolonged stasis in prophase I of meiosis until they complete meiosis upon ovulation and fertilization [34]. Additionally, in female meiosis, only one of the four products of meiosis results in a gamete. Sexual dimorphism in meiosis extends to the patterns and processes of recombination [33]. Here we ask whether local sex-specific rates of meiotic recombination can predict the parental bias for the origin of recurrent de novo CNVs.

Determination of parental origin for 3q29 deletion
Study subject recruitment. This study was approved by Emory University's Institutional Review Board (IRB00064133). Individuals with a clinically confirmed diagnosis of 3q29 deletion were ascertained through the internet-based 3q29 registry (https://3q29deletion.patientcrossroads.org/) as previously described [103]. Blood samples were obtained from 14 families. SNP genotyping was performed on 12 of the 14 families (10 full trios, 2 mother-child pairs) using the Illumina GSA-24 v3.0 array. For 2 full trios (6 samples), parent of origin was determined from whole-genome sequence data generated on Illumina's NovaSeq 6000 platform. Quality control was performed with PLINK 1.9 [104] and our custom pipeline (Additional file 1: Supplemental Methods).
Parental origin analysis. Parental origin of the 3q29 deletion was determined for all 14 families using PLINK 1.9 [104]. SNPs located within the 3q29 deletion region (chr3:196029182-197617792; hg38) were isolated for analysis and the pattern of Mendelian errors (MEs) was analyzed. The parent with the most MEs was considered the parent of origin for the 3q29 deletion (Additional file 1: Supplemental Methods). The mean age of fathers in our 3q29 cohort was collected from self-reported data in conjunction with the Emory University 3q29 project (http://genome.emory.edu/3q29/) and compared to the U.S. average (NCHS; https://www.cdc.gov/nchs/index.htm) via a two-tailed two-sample t-test using R [105].
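The Mendelian-error logic described above can be illustrated with a minimal sketch. It is not the PLINK 1.9 workflow or the custom pipeline used in the study. It assumes that a hemizygous proband appears homozygous within the deletion, that genotypes are given as two-letter strings, and that an error is attributed to a parent when that parent carries none of the alleles seen in the proband. The function names and toy genotypes are hypothetical.

```python
def parent_incompatible(child, parent):
    """True if the parent carries none of the alleles observed in the child."""
    return not (set(child) & set(parent))


def count_mendelian_errors(trio_genotypes):
    """Count SNPs at which the child's genotype is incompatible with each parent.

    trio_genotypes: iterable of (child, father, mother) genotype strings,
    e.g. ("BB", "AA", "AB"). Assumed to be restricted to the deletion region.
    """
    errors = {"father": 0, "mother": 0}
    for child, father, mother in trio_genotypes:
        if parent_incompatible(child, father):
            errors["father"] += 1
        if parent_incompatible(child, mother):
            errors["mother"] += 1
    return errors


def parent_of_origin(errors):
    """The parent with more Mendelian errors is taken as the parent of origin."""
    if errors["father"] == errors["mother"]:
        return None  # uninformative
    return "paternal" if errors["father"] > errors["mother"] else "maternal"


# Toy genotypes (hypothetical): the proband never shows an allele that only
# the father could have transmitted, so errors accumulate against the father.
toy_snps = [
    ("BB", "AA", "AB"),
    ("AA", "AB", "AA"),
    ("AA", "BB", "AB"),
    ("BB", "BB", "BB"),
]
toy_errors = count_mendelian_errors(toy_snps)
print(toy_errors, parent_of_origin(toy_errors))
```

In real data, missing calls and genotyping error would need to be handled before counting, which this sketch omits.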

Calculation of recombination rates and ratios
Male and female recombination rates (cM/Mb) were obtained from the deCODE sex-specific maps, which are based on over 4.5 million crossover recombination events from 126,427 meioses, with an average resolution of 682 base pairs [106]. The deCODE data are publicly available as recombination rates calculated for physical genomic intervals, each bounded by two SNP markers (Additional file 1: Supplemental Methods). Therefore, to calculate the average male and female recombination rates, each bounded recombination rate was weighted by the total number of base pairs contained within the respective SNP marker interval. Weighted rates were then averaged across the CNV interval for males and females separately. The recombination rate ratio for each CNV interval was then calculated by dividing the weighted average male recombination rate by the weighted average female recombination rate (Additional file 1: Figure S1). To account for slight differences in the recombination rate ratios calculated for the different LCR22 intervals at the 22q11.2 locus, we used an adjusted recombination rate ratio composed of the weighted recombination rate ratios calculated for each LCR22 interval. Weights were based on the estimated population prevalence of the different 22q11.2 deletion intervals (Additional file 1: Table S3) [107].
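The length-weighted averaging described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: it assumes the marker-bounded intervals have already been restricted to the CNV interval, and the function names and toy rates are hypothetical.

```python
def weighted_mean_rate(intervals):
    """Base-pair-weighted mean recombination rate (cM/Mb).

    intervals: list of (start_bp, end_bp, rate_cM_per_Mb), one entry per
    SNP-marker-bounded interval inside the CNV region.
    """
    total_bp = sum(end - start for start, end, _ in intervals)
    if total_bp <= 0:
        raise ValueError("intervals must span at least one base pair")
    weighted_sum = sum((end - start) * rate for start, end, rate in intervals)
    return weighted_sum / total_bp


def male_to_female_ratio(male_intervals, female_intervals):
    """Ratio of the weighted average male rate to the weighted average female rate."""
    return weighted_mean_rate(male_intervals) / weighted_mean_rate(female_intervals)


def prevalence_weighted_ratio(ratios, prevalences):
    """Combine per-interval ratios (e.g. LCR22 intervals) using prevalence weights."""
    total = sum(prevalences)
    return sum(r * w for r, w in zip(ratios, prevalences)) / total


# Toy values (hypothetical), not deCODE data.
male = [(0, 1000, 2.0), (1000, 4000, 0.5)]
female = [(0, 1000, 1.0), (1000, 4000, 2.0)]
toy_ratio = male_to_female_ratio(male, female)
toy_adjusted = prevalence_weighted_ratio([0.5, 1.0], [0.75, 0.25])
print(toy_ratio, toy_adjusted)
```

Weighting by interval length keeps short, high-rate intervals from dominating the average, which matters because marker spacing in the map is uneven.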

Logistic regression analysis
Parental origin data were curated for CNVs at 24 loci from 77 independent studies; only independent samples were included in the analysis (duplicate or overlapping samples were removed). For each CNV locus, the male to female recombination rate ratio was calculated as described above. A logistic regression model was fitted to the data with the natural log-transformed male to female recombination rate ratio as the predictor and parental origin (paternal vs. maternal) as the response variable. We performed a secondary analysis stratified by deletions and duplications. See Table 2 and Additional file 4: Table S4 for the data calculated and used in the logistic regression analyses.
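A model of this form can be sketched with the standard library alone. The example below fits logit P(paternal) = b0 + b1 · ln(ratio) to per-locus counts by Newton-Raphson. It is an assumption-laden illustration: the software used for the published fit is not stated in this section, the function name is hypothetical, and the counts are toy values chosen so that the exact solution (b0 = 0, b1 = 2) is known.

```python
import math


def fit_logistic(loci, iterations=25):
    """Fit logit P(paternal) = b0 + b1 * ln(ratio) by Newton-Raphson.

    loci: list of (male_to_female_ratio, n_paternal, n_maternal).
    Returns (b0, b1).
    """
    xs = [math.log(ratio) for ratio, _, _ in loci]
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, (_, n_pat, n_mat) in zip(xs, loci):
            n = n_pat + n_mat
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = n_pat - n * p          # score contribution
            w = n * p * (1.0 - p)          # information contribution
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy counts (hypothetical): paternal fraction 0.2, 0.5, 0.8 at ratios 0.5, 1, 2.
toy_loci = [(0.5, 20, 80), (1.0, 50, 50), (2.0, 80, 20)]
b0, b1 = fit_logistic(toy_loci)
print(round(b0, 6), round(b1, 6))
```

In practice a statistics package would also report standard errors and p-values; this sketch returns only the point estimates.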

Linear regression analysis
For linear regression, locus-specific estimates for parental origin were derived by combining the data from all published studies for a given locus. To reduce the uncertainty introduced by small sample sizes, only loci with more than 10 observations were included. The natural log-transformed ratio of combined paternal to maternal origin counts for each locus was regressed on the natural log-transformed average male to female recombination rate ratio for that locus' CNV interval. Each locus was weighted by its sample size.
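The weighted regression above can be illustrated with a closed-form weighted least-squares fit. This is a sketch under stated assumptions: the function name is hypothetical, the counts are toy values, and loci with a zero paternal or maternal count are skipped because their log ratio is undefined (how the study handled such loci is not described in this section).

```python
import math


def weighted_linear_fit(loci, min_n=10):
    """Regress ln(paternal/maternal) on ln(male/female ratio), weighted by n.

    loci: list of (male_to_female_ratio, n_paternal, n_maternal).
    Only loci with more than min_n observations are used.
    Returns (intercept, slope, r_squared).
    """
    kept = [
        (math.log(ratio), math.log(n_pat / n_mat), n_pat + n_mat)
        for ratio, n_pat, n_mat in loci
        if n_pat + n_mat > min_n and n_pat > 0 and n_mat > 0
    ]
    total_w = sum(w for _, _, w in kept)
    x_bar = sum(w * x for x, _, w in kept) / total_w
    y_bar = sum(w * y for _, y, w in kept) / total_w
    s_xx = sum(w * (x - x_bar) ** 2 for x, _, w in kept)
    s_yy = sum(w * (y - y_bar) ** 2 for _, y, w in kept)
    s_xy = sum(w * (x - x_bar) * (y - y_bar) for x, y, w in kept)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    r_squared = (s_xy * s_xy) / (s_xx * s_yy)
    return intercept, slope, r_squared


# Toy counts (hypothetical); the last locus has only 5 observations and is dropped.
toy_loci = [(0.5, 20, 80), (1.0, 50, 50), (2.0, 80, 20), (4.0, 3, 2)]
intercept, slope, r_squared = weighted_linear_fit(toy_loci)
print(intercept, slope, r_squared)
```

The returned r_squared is the weighted proportion of variance explained, the same kind of quantity as the variance estimate reported from this analysis.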

Parent of origin of 3q29 deletion
We determined parent of origin in 12 full trios in which the proband had a de novo 3q29 deletion; in 2 additional families for which only proband and maternal DNA samples were available, parent of origin was inferred. For the 12 families evaluated by SNP array, the number of Mendelian errors for the presumed inherited (intact) parental allele was zero in all cases, and the mean number of Mendelian errors for the presumed de novo parent-of-origin allele was 41 (range 27-66). For the two trios evaluated with sequence data, Mendelian errors were 20- to 33-fold higher for the de novo parent of origin than for the parent transmitting the intact allele. In these 14 families, 13 deletions (92.9%) arose on the paternal genome, a significant departure from the null expectation of 50% (p = 0.002, binomial exact). Considering only full trios, 11 of 12 (91.7%) deletions arose on paternal haplotypes (p = 0.006, binomial exact), indicating a paternal bias in the origin of the 3q29 deletion (Additional file 1: Table S5). We examined the age distribution of fathers in our cohort; the mean age was 34 years (median = 34 years), not significantly different from the 2018 U.S. national average of 31.8 years (p = 0.08, two-tailed two-sample t-test). These data indicate that the bias in the 3q29 sample is unlikely to be due to oversampling of older fathers (Additional file 1: Table S5).
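The binomial exact tests reported above can be checked with a short standard-library sketch. The text does not state whether the tests were one- or two-sided; the sketch assumes a two-sided test against a null of 50%, which reproduces the reported values after rounding. The function name is hypothetical.

```python
from math import comb


def binomial_exact_two_sided(k, n, p=0.5):
    """Two-sided exact binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than the
    observed count k under the null success probability p.
    """
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


p_all_families = binomial_exact_two_sided(13, 14)   # 13 of 14 paternal
p_full_trios = binomial_exact_two_sided(11, 12)     # 11 of 12 paternal
print(round(p_all_families, 3), round(p_full_trios, 3))  # 0.002 0.006
```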

Meiotic recombination and parental origin
We tested the hypothesis that sex-dependent differences in meiotic recombination could explain the parental biases observed for recurrent genomic disorder loci mediated by NAHR. We determined the paternal and maternal origin counts of the CNVs curated from the literature search. Of the 1977 CNVs, 870 were paternal in origin and 1107 were maternal in origin. We calculated the average male and female recombination rates (cM/Mb) across the CNV intervals at all 24 loci using recombination rates published by the deCODE genetics group [106] (Additional file 1: Figure S2-S12). We fit a simple logistic model to the data, with the male-to-female recombination rate ratio as the predictor and parental origin as the response variable (Table 2; Additional file 4: Table S4). Our data reveal that the sex-dependent recombination rate ratio significantly predicts parental de novo origin of a given CNV (p = 1.07 × 10⁻¹⁴, β = 0.6606, 95% CI = (0.4980, 0.8333), OR = 1.936) (Fig. 1). In other words, for a given region, the higher the male recombination rate is relative to the female rate, the more likely a CNV formed in that region will be paternal in origin. Stratified analyses of deletions and duplications separately lead to a nearly identical model (Additional file 1: Figure S13-S14, Table S6-S7). Simple linear regression on the subset of CNV loci with more than 10 samples shows a striking correlation between relative recombination rates and parental origin, with relative recombination rates explaining 85% of the variance in parental bias (Additional file 1: Figure S15 and Table S8). Our logistic model can be used to predict paternal origin rates for any locus with estimable recombination in males and females, and we have done so (Additional file 1: Table S9). CNVs at the 15q13.3 and 17q23 loci are both predicted to have a paternal origin approximately 60% of the time, while CNVs at the 16p11.2 distal locus are predicted to have a maternal origin 76% of the time (Additional file 1: Table S9). If correct, our model would predict that these loci exhibit a bias in parental origin.
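The reported slope can be translated into odds with a brief sketch. Because the predictor is the natural log of the male to female ratio, the odds ratio of 1.936 corresponds to an e-fold increase in that ratio, and a k-fold increase multiplies the odds of paternal origin by k^β. The function names are hypothetical, and the model intercept is not given in this section, so it is left as a required argument rather than guessed.

```python
import math

BETA = 0.6606              # reported slope on ln(male/female recombination ratio)
BETA_CI_95 = (0.4980, 0.8333)


def odds_multiplier(fold_change, beta=BETA):
    """Factor by which the odds of paternal origin change when the
    male/female recombination ratio is multiplied by fold_change."""
    return fold_change ** beta


def paternal_probability(ratio, intercept, beta=BETA):
    """Predicted probability of paternal origin for a given ratio.

    intercept is NOT reported in the text and must be supplied by the caller.
    """
    log_odds = intercept + beta * math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


odds_ratio = odds_multiplier(math.e)       # per e-fold increase in the ratio
per_doubling = odds_multiplier(2.0)        # per doubling of the ratio
or_ci = tuple(math.exp(b) for b in BETA_CI_95)
print(round(odds_ratio, 3), round(per_doubling, 2), or_ci)
```

Under these reported coefficients, doubling the male to female ratio raises the odds of paternal origin by a factor of roughly 1.58.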

Discussion
Parent of origin bias for de novo events at recurrent CNV loci has been well documented but has lacked a compelling explanation. Our analysis of data gathered on 1977 CNVs from 77 published reports demonstrates that sex-specific variation in local meiotic recombination rates predicts parent of origin at recurrent CNV loci. Human male and female meiotic recombination rates and patterns differ greatly across the broad scale of human chromosomes. Recombination events are nearly uniformly distributed across the chromosome arms in females but tend to be clustered closer to the telomeres in males [108]. We note that this pattern has been previously recognized [26]. Here we have formally tested the hypothesis that recombination variation drives parent of origin variation using a rigorous statistical framework (Fig. 1) and provided an estimate for the variance in parent of origin bias that is due to sex-specific recombination rates (Additional file 1: Figure S15).
Investigations into the mechanism by which recurrent CNVs arise have focused on LCRs and their makeup [1,109]. These regions are composed of units of sequence repeats that vary in orientation, percent homology, length, and copy number. Consequently, LCRs are mosaics of varying units, imparting complexity to LCR architecture [23]. The frequency of NAHR events mediated by LCRs is a function of these characteristics and other features of the genomic architecture [21]. Specifically, the rate of NAHR is known to correlate positively with LCR length and percent homology and to decrease as the distance between LCRs increases [19,21]. However, because LCRs are challenging to study with short-read sequencing technology, the population-level variability of these regions is not well described [110]. Recent breakthroughs with long-read sequencing and optical mapping have revealed remarkable variation in LCRs [111][112][113], and haplotypes with higher risks for CNV formation have now been identified [114].
LCRs are substrates for NAHR [1], and thus are subject to the recombination process. Local recombination rates may influence how likely an NAHR event will happen between two LCRs. Therefore, when analyzing LCR haplotypes and their susceptibility to NAHR, one would need to take into account sex differences in recombination. For example, at loci with maternal biases, specific risk haplotypes may be required for males to form CNVs and vice versa. Greater enrichment of GC content, homologous core duplicons or the PRDM9 motifs, or other recombination-favoring factors may also be required [1,19].
Variation in recombination rates between sexes is well established [108,[115][116][117][118]. Prediction of individual risk may also need to consider individual variation in meiotic recombination, particularly due to heritable variation and the presence or absence of inversion polymorphisms [117,119]. Variants in several genes, including PRDM9, have been shown to affect recombination rates and the distribution of double-stranded breaks in mammals [120,121]. Common alleles in PRDM9 are evidenced to affect the percentage of recombination events within individuals that take place at hotspots [120], and variants in RNF212 are associated with opposite effects on recombination rate between males and females [116,121]. The unexplained variance in our study may be due to these additional factors, which are rich substrates for future study.
Many human genetic studies have observed correlations between inversion polymorphisms and genomic disorder loci [25,122]. Because these inversions are copy-number neutral and often located in complex repeat regions [123], they can be difficult to assay with current high-throughput strategies and their true impact remains to be explored. One model proposes that during meiosis these regions may fail to synapse properly and increase the probability of NAHR [124,125]. Another theory suggests that formation of inversions increases directly oriented content in LCRs, leading to an NAHR-favorable haplotype [126]. Supporting these theories, inversion polymorphisms have been identified at the majority of recurrent CNV loci [24,25,30,122,124,126,127]. At the 7q11.23, 17q21.31, and 5q35 loci [24,25,127], compelling data implicate inversions as a highly associated marker of CNV formation. However, heterozygous inversions are known to suppress recombination, perturbing the local pattern of recombination and altering the fate of chiasmata [119]. The analysis presented here strongly suggests that recombination is the driving force for CNV formation, giving rise to an alternate explanation for the association between inversions and CNVs: both are the consequence (and neither one the cause) of recombination between non-allelic homologous LCRs. Inversions and CNVs appear to be associated because both are initiated by aberrant recombination. Viewing the system in this manner also explains the frequency of individual inversions at CNV loci. Inversions arise via rare aberrant recombination, like CNVs, but are subsequently driven to higher frequency by natural selection, because they act to suppress recombination and "save offspring" from deleterious genomic disorders.
Of course, frequent mutations leading to inversions and the details of LCR structure such as relative orientation and homology within a genomic region may promote or impede CNV formation in a locus-specific manner [128][129][130]. Further exploration of this relationship with improved genomic mapping can test these alternative models [131]. One testable prediction of the model described here is that inversions should be at higher frequency at loci giving rise to highly deleterious CNVs, as opposed to loci harboring recurrent benign CNVs.
To our knowledge, this study is the first comprehensive investigation of parental origin of recurrent, NAHR-mediated CNV loci. Investigations of predominantly nonrecurrent CNVs show paternal bias [132][133][134]. Unlike recurrent CNVs, nonrecurrent CNVs are mostly formed via non-homologous end joining (NHEJ) and replicative mechanisms [1,135,136]. The standing hypothesis is that replication-based mechanisms of nonrecurrent CNV formation, which are known to accumulate errors in male germlines, contribute to this bias [132]. Our study reinforces the idea that the factors influencing recurrent CNVs differ from those impacting nonrecurrent CNVs. Future genome-wide analyses with larger sample sizes can further help refine our understanding of the divergent forces at play affecting recurrent and nonrecurrent CNV formation.
We conducted a comprehensive literature search at 38 loci and ultimately identified 1977 samples for analysis. We note that the majority of the data come from 7 well-studied loci (Table 1). While we thoroughly curated the data in a systematic way, it is possible that our data are subject to publication bias, where loci that exhibit parent of origin biases are more likely to have parental origin reported. Further exacerbating potential publication bias, genetic testing for the affected patient (and even more so for the parents) can be difficult to obtain due to concerns about insurance coverage, potential future discrimination, and privacy [137][138][139][140]. However, we note that individuals with CNVs are generally not ascertained or recruited under the expectation that recombination affects parent of origin, and therefore any potential publication or ascertainment bias is unlikely to confound the results of our analysis. Analysis of a larger cohort of CNV loci, including benign CNVs, will give greater insight into how recombination, and sex differences in recombination, influence parent of origin in CNVs.
Our estimates of recombination rates summarize CNV-scale (broad-scale) patterns of recombination, rather than fine-scale patterns near the sites of the recombination events that form these CNVs, namely the LCRs. For example, local sex-specific hotspots within LCRs could be the underlying drivers behind the correlation between recombination rates and parental origin. Given the nature of repetitive regions like LCRs and our inability to adequately interrogate them with current sequencing technologies, accurate recombination data across and within the LCR regions are not available. In other words, the data are currently insufficient to conclude whether or not these broad-scale patterns are tightly correlated with fine-scale recombination rates in and around the LCRs. The best available data in the field allow us to infer the following: broad-scale patterns of recombination tightly predict patterns of parental origin.

Conclusions
In this study, we determined that male and female differences in meiotic recombination rates significantly predict parent of origin for recurrent CNV loci. Combining the sex-specific recombination landscape and the mechanistic factors underlying it with a more detailed understanding of existing structural factors at genomic disorder loci can be expected to help guide standards used to identify and perform genetic counseling for individuals at risk of genomic rearrangement.
Additional file 1: Figure S1. Schematic of recombination rate calculation method; Figures S2-S12. Recombination rates of 24 loci analyzed; Figure S13. Logistic regression with deletions only; Figure S14. Logistic regression with duplications only; Figure S15. Linear regression with combined CNV parent of origin data; Table S3. LCR22 recombination rate data; Table S5. Demographic data for 3q29 cohort; Table S6. Summarized data for logistic regression with deletions only; Table S7. Summarized data for logistic regression with duplications only; Table S8. Sensitivity analysis results for linear regression analysis with deletions and duplications combined;