Genome-wide copy number variations in a large cohort of bantu African children

Yilmaz, Feyza; Null, Megan; Astling, David; Yu, Hung-Chun; Cole, Joanne; Santorico, Stephanie A.; Hallgrimsson, Benedikt; Manyama, Mange; Spritz, Richard A.; Hendricks, Audrey E.; Shaikh, Tamim H.

doi:10.1186/s12920-021-00978-z

Research article
Open access
Published: 17 May 2021

Feyza Yilmaz^1,2,
Megan Null³,
David Astling⁴,
Hung-Chun Yu²,
Joanne Cole^2,5,
Stephanie A. Santorico^3,5,6,
Benedikt Hallgrimsson⁷,
Mange Manyama⁸,
Richard A. Spritz^2,5,
Audrey E. Hendricks^3,5,6 &
…
Tamim H. Shaikh ORCID: orcid.org/0000-0002-4264-4272^2,5

BMC Medical Genomics volume 14, Article number: 129 (2021)

Abstract

Background

Copy number variations (CNVs) account for a substantial proportion of inter-individual genomic variation. However, a majority of genomic variation studies have focused on single-nucleotide variations (SNVs), with limited genome-wide analysis of CNVs in large cohorts, especially in populations that are under-represented in genetic studies, including people of African descent.

Methods

We carried out a genome-wide copy number analysis in > 3400 healthy Bantu Africans from Tanzania. Signal intensity data from high-density (> 2.5 million probes) genotyping arrays were used for CNV calling with three algorithms: PennCNV, DNAcopy, and VanillaICE. Stringent quality metrics and filtering criteria were applied to obtain high-confidence CNVs.

Results

We identified over 400,000 CNVs larger than 1 kilobase (kb), for an average of 120 CNVs (SE = 2.57) per individual. We detected 866 large CNVs (≥ 300 kb), some of which overlapped genomic regions previously associated with multiple congenital anomaly syndromes, including Prader-Willi/Angelman syndrome (Type 1) and 22q11.2 deletion syndrome. Furthermore, several of the common CNVs (frequency ≥ 5%) seen in our cohort overlap genes previously associated with developmental disorders.

Conclusions

These findings may help refine the phenotypic outcomes and penetrance of variations affecting genes and genomic regions previously implicated in diseases. Our study provides one of the largest datasets of CNVs from individuals of African ancestry, enabling improved clinical evaluation and disease association of CNVs observed in research and clinical studies in African populations.

Background

Copy number variations (CNVs) are a class of structural variation resulting from loss or gain of genomic fragments ≥ 1 kilobase (kb). CNVs can arise from genomic rearrangements such as deletions, duplications, insertions, inversions, or translocations [1,2,3] and have been implicated in the etiology of Mendelian disorders as well as complex traits [4]. Several pediatric disorders resulting from CNVs, such as the 22q11 deletion syndrome, the Williams-Beuren syndrome (which results from a microdeletion in 7q11.23), and the 15q13.3 microdeletion syndrome, are characterized by the occurrence of multiple congenital anomalies, including intellectual and developmental disabilities, congenital heart defects, craniofacial dysmorphisms, or abnormalities in the development of other tissues and organs [5,6,7,8,9,10]. These types of CNVs can alter the copy number of dosage-sensitive genes or disrupt regulatory elements, which can result in the pathogenic outcomes observed in patients [11]. For instance, the 22q11.2 microdeletion region overlaps genes essential for cortical circuit formation, and aberrations in cortical anatomy are among the phenotypes observed in individuals with 22q11.2 deletion syndrome [12]. CNVs may also play a role in the etiology of common, complex diseases and traits, including diabetes, asthma, HIV susceptibility, cancer, and phenotypes in immune and environmental responses [13,14,15,16,17].

In addition to their role in disease, CNVs account for a high level of variation between healthy individuals, both within and between populations [1,2,3,18,19]. The 1000 Genomes Project was initiated to identify genetic variation in the human genome across diverse populations, and it has been instrumental in generating the largest catalog of genomic variants, including CNVs [20,21,22,23]. Nevertheless, CNVs remain largely understudied compared to single-nucleotide variations (SNVs) and are not commonly genotyped in microarray-based analyses of genome-wide variation and association with disease phenotypes [24]. In 2015, Zarrei and colleagues compiled a CNV map of the human genome and estimated that 4.8–9.5% of the human genome contributes to CNV [25]. Furthermore, they identified approximately 100 genes whose loss is not associated with any severe consequences [25]. However, the vast majority of CNV data derive from individuals of European descent residing in Western countries, which might cause incorrect clinical interpretation of genomic variants [26,27,28]. Recently, resources such as the Genome Aggregation Database (gnomAD) have reported structural variations, including CNVs, in large cohorts of individuals of both European and non-European ancestries [29]. Regardless, knowledge of the genomic landscape of CNVs remains incomplete, especially in understudied populations such as Africans.

Given the significant role of CNVs in health and disease, it is critical to have a set of reference CNVs observed in individuals from diverse populations. Such population-specific reference datasets will greatly improve clinical interpretation and can help refine genomic regions associated with disease [30]. A recent study by Kessler and colleagues [31] demonstrated how the lack of African ancestry individuals in variant databases may have resulted in the mischaracterization of variants in the ClinVar and Human Gene Mutation databases.

In this study, we detected CNVs in > 3400 healthy Bantu African children from Tanzania, using data from high-density (> 2.5 million probes) genotyping microarrays. We present a high-resolution map of CNVs ranging in size from 1 kb to 3 Mb (million bases), providing a useful resource of CNV genetic variation for individuals of African ancestry. Additionally, we observe large CNVs in genomic regions previously implicated in syndromes and developmental disorders.

Methods

Sample description – populations

Our study used a previously collected cohort of 3631 Bantu African children aged 3–21 years living in Mwanza, Tanzania, a region whose population is relatively homogeneous both genetically and environmentally [32]. The original study investigated the genetics of facial shape in children and adolescents, an age range chosen to minimize the potential and accumulating impact of the environment. Additionally, the majority of the sample was between the ages of 7 and 12 to minimize the effects of puberty. Other parameters collected for individuals in the study included height, weight, and BMI (Additional file 1). Individuals with a birth defect or a relative with an orofacial cleft were excluded [32]. The subjects were previously genotyped at the Center for Inherited Disease Research (CIDR) as part of the NIDCR FaceBase1 initiative. Genotyping with the Illumina HumanOmni2.5Exome-8v1_A (also referred to as Infinium Omni2.5–8) beadchip array and quality control (QC) were described previously [32, 33]. We obtained deidentified signal intensity data (*.idat) files for all subjects in order to carry out copy number variation detection and analysis as described below.

CNV detection and analysis

Signal intensity data (*.idat) files were processed and normalized using Illumina GenomeStudio software. The FinalReport files were used as the raw data to perform CNV calling with three algorithms: PennCNV (version 1.0.1) [34], DNAcopy (version 1.46.0) [35], and VanillaICE (version 1.32.2) [36]. Both PennCNV and VanillaICE implement Hidden Markov Models (HMMs), whereas DNAcopy implements a Circular Binary Segmentation (CBS) algorithm. GC correction was performed for PennCNV using the built-in function, and the R/Bioconductor package ArrayTV (version 1.8.0) [37] was used to perform GC correction for DNAcopy and VanillaICE. The code used to run the algorithms is available on GitHub [38]. Individuals with a total number of CNVs ≥ 3 standard deviations above the cohort mean were removed from further analysis based on previously established criteria [39]. In all, 168 individuals were excluded: 70 duplicate samples, 97 individuals with a total number of CNVs ≥ 3 standard deviations above the cohort mean, and one individual who had 0 CNVs after the analysis pipeline thresholds described in Fig. 1 were applied. All subsequent analyses were performed on the remaining 3463 individuals, and all CNV coordinates are based on NCBI build37/hg19.
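The sample-level exclusion rule above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `flag_cnv_count_outliers`, the input dictionary, and the use of the sample standard deviation are assumptions made for illustration only.

```python
from statistics import mean, stdev


def flag_cnv_count_outliers(cnv_counts, n_sd=3.0):
    """Return IDs of samples whose CNV count is >= mean + n_sd * SD.

    cnv_counts maps a (hypothetical) sample ID to its total number of CNVs.
    The sample standard deviation is used here; the source does not say
    whether the sample or population SD was applied.
    """
    values = list(cnv_counts.values())
    cutoff = mean(values) + n_sd * stdev(values)
    return {sample for sample, n in cnv_counts.items() if n >= cutoff}
```

Samples returned by this function would be dropped together with duplicates and samples with no remaining calls, consistent with the reported arithmetic of 3631 − 168 = 3463 individuals.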

CNV calls with fewer than five probes and < 1000 bases in size were removed, followed by DNAcopy calls with a log-ratio between -0.1 and 0.1 (a threshold determined from a plateau plot in the DNAcopy R package that shows the copy number across the genome) and PennCNV calls with a confidence score < 10 (the threshold recommended by the developers of PennCNV) (Fig. 1). We used the intersect function in BEDTools v2.25 [40] to determine the proportion of overlap between CNV coordinates and genomic elements. CNV calls from two or more algorithms that overlapped by 50% or more were considered concordant and included in further analyses. Next, CNV calls overlapping the centromere or telomere, or overlapping segmental duplications by ≥ 50%, were removed.
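The concordance step can be sketched in Python as follows. This is an illustration under stated assumptions, not the published pipeline: the `Call` record and function names are hypothetical, the overlap is measured as the fraction of each call covered by a call from another algorithm (the source does not say whether the 50% criterion was reciprocal), and calls are required to be of the same type, which is also an assumption.

```python
from typing import NamedTuple


class Call(NamedTuple):
    chrom: str
    start: int
    end: int
    cn_type: str    # "loss" or "gain"
    algorithm: str  # e.g. "PennCNV", "DNAcopy", "VanillaICE"


def overlap_fraction(a, b):
    """Fraction of call a covered by call b (0.0 on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    return max(shared, 0) / (a.end - a.start)


def concordant_calls(calls, min_fraction=0.5):
    """Keep calls supported by a same-type call from a different algorithm."""
    kept = []
    for a in calls:
        for b in calls:
            if (a.algorithm != b.algorithm
                    and a.cn_type == b.cn_type
                    and overlap_fraction(a, b) >= min_fraction):
                kept.append(a)
                break
    return kept
```

The pairwise loop is quadratic in the number of calls, which is acceptable for a per-individual illustration; an interval tool such as BEDTools is the practical choice at cohort scale.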

PennCNV calls with copy numbers of 0 and 1 were annotated as copy number loss, 2 as diploid copy number, and 3, 4, 5 and 6 as copy number gain; VanillaICE calls with copy numbers of 1 and 2 were annotated as copy number loss, 3 and 4 as diploid copy number, and 5 and 6 as copy number gain; DNAcopy segments with log-ratio ≥ 0.1 were annotated as copy number gain, and log-ratio ≤ -0.1 as copy number loss.
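The annotation rules above amount to a lookup per algorithm. The following Python sketch restates them; the names `PENNCNV_TYPE`, `VANILLAICE_TYPE`, and `dnacopy_type` are hypothetical, and returning `None` for DNAcopy segments inside the ±0.1 band is an assumption reflecting that such segments were filtered out.

```python
# Copy number value reported by each algorithm -> annotation used in the text.
PENNCNV_TYPE = {0: "loss", 1: "loss", 2: "diploid",
                3: "gain", 4: "gain", 5: "gain", 6: "gain"}
VANILLAICE_TYPE = {1: "loss", 2: "loss", 3: "diploid", 4: "diploid",
                   5: "gain", 6: "gain"}


def dnacopy_type(log_ratio, threshold=0.1):
    """Annotate a DNAcopy segment by its log-ratio.

    Segments strictly between -threshold and +threshold return None,
    since the text describes them as removed rather than annotated.
    """
    if log_ratio >= threshold:
        return "gain"
    if log_ratio <= -threshold:
        return "loss"
    return None
```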

CNV calling with PennCNV from genotype data using high-density SNP arrays often results in the artificial splitting of larger CNVs (i.e., > 500 kb) into multiple smaller CNVs [34]. Therefore, we merged adjacent CNVs of the same type (i.e., loss or gain) in the same individual using an approach described previously [34]. Briefly, for three adjacent genomic regions A, B, and C, where A and C represent two CNVs of the same type separated by a region B, the length of B was divided by the total length of all three segments (A + B + C). If this fraction was ≤ 15%, the three regions were merged into one CNV. This approach was used to generate a list of CNVs that passed quality metrics and filtering criteria in individual samples from the Bantu cohort (Additional file 2).
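The gap-fraction rule can be written as B / (A + B + C) ≤ 0.15. A minimal Python sketch follows; the function name and tuple representation are hypothetical, the input is assumed to hold calls of one type on one chromosome in one individual, and applying the rule left to right so that a merged segment can absorb further neighbours is an assumption about how the rule is iterated.

```python
def merge_adjacent_cnvs(cnvs, max_gap_fraction=0.15):
    """Merge same-type CNVs separated by a relatively small gap.

    cnvs: list of (start, end) tuples for one individual, one chromosome,
    and one CNV type. Two neighbours A and C separated by gap B are merged
    when len(B) / len(A + B + C) <= max_gap_fraction.
    """
    merged = []
    for start, end in sorted(cnvs):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = max(start - prev_end, 0)               # length of B
            total = max(end, prev_end) - prev_start      # length of A + B + C
            if gap / total <= max_gap_fraction:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged
```

For example, two calls spanning 0–400 kb and 450 kb–1 Mb are separated by a 50 kb gap, which is 5% of the 1 Mb span, so they would be merged into a single CNV.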

In silico quality assessment of CNVs

To assess the quality of CNV calls in the Bantu population, we compared the overlap of CNVs in the Bantu population with the Database of Genomic Variants (DGV) Gold Standard (GS) variants [41]. DGV GS variants are a curated set of variants from a select number of studies with high resolution and high quality, which were evaluated for accuracy and sensitivity. Therefore, an overlap with DGV GS variants indicates that our CNV calls are likely true positives. To assess whether the overlap was more than expected by chance, we permuted the genomic locations (n = 1000) using the shuffle function in BEDTools v2.25 [40]. Permutation tests were performed within each chromosome with the same number and size distribution of CNVs observed in the Bantu population as recommended for genomic elements that are unevenly distributed across the genome [42].
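The logic of the permutation test can be sketched in plain Python for a single chromosome. This is an illustration only, not a substitute for the BEDTools shuffle workflow: all function names are hypothetical, placement is uniform along the chromosome with no excluded regions, and the empirical p-value formula (k + 1) / (n + 1) is a common convention that the source does not state.

```python
import random


def overlaps_any(start, end, reference):
    """True if [start, end) overlaps any (start, end) interval in reference."""
    return any(start < r_end and r_start < end for r_start, r_end in reference)


def count_overlapping(cnvs, reference):
    """Number of CNVs that overlap at least one reference variant."""
    return sum(overlaps_any(s, e, reference) for s, e in cnvs)


def permutation_p_value(cnvs, reference, chrom_length, n_perm=1000, seed=0):
    """Empirical p-value for observing at least as many overlaps by chance.

    Each permutation places every CNV at a random position on the same
    chromosome, preserving the number and sizes of the CNVs.
    """
    rng = random.Random(seed)
    observed = count_overlapping(cnvs, reference)
    at_least = 0
    for _ in range(n_perm):
        shuffled = []
        for s, e in cnvs:
            size = e - s
            new_start = rng.randint(0, chrom_length - size)
            shuffled.append((new_start, new_start + size))
        if count_overlapping(shuffled, reference) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)
```

Restricting the shuffle to the original chromosome mirrors the within-chromosome design described above, which keeps the null distribution comparable when variants are unevenly distributed across chromosomes.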

CNV regions (CNVRs)

CNV regions (CNVRs) were generated by merging all overlapping CNVs of the same type (i.e. loss or gain) from multiple individuals in our cohort, using the merge function in BEDTools v2.25 [40]. This resulted in a list of loss-only and gain-only CNVRs, which were further merged into overlapping CNVRs of all types (Additional file 3).
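Conceptually, building CNVRs is an interval-merging operation. The Python sketch below shows the idea for calls of one type pooled across individuals; the function name is hypothetical, and treating intervals that merely touch as mergeable is an assumption about the tool's default behaviour rather than something stated in the text.

```python
def merge_overlapping(intervals):
    """Merge overlapping (chrom, start, end) intervals into regions.

    Intervals are sorted by chromosome and start, then swept once; an
    interval that starts at or before the end of the current region
    extends that region.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Running this once on losses and once on gains would give the type-specific CNVRs, and running it again on their union would give the combined CNVRs described above.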

Comparison to other CNV datasets

We compared Bantu CNVRs to variants obtained from DGV (release date 2020-02-15) [41], the Genome Aggregation Database (gnomAD v2.1) [29, 43], African CNVRs [44], and CNVs identified in low-mappability regions [45]. The DGV CNV dataset was downloaded from the DGV website [46]. The gnomAD SV 2.1 sites BED file was downloaded from the Broad Institute website [47] and filtered by SV Type and SV Filter; only SVs of type “DEL”, “DUP”, or “CN” with a “PASS” SV Filter were included. The CNV dataset for low-mappability regions was obtained from the additional material of the publication by Monlong and colleagues [45]. CNVs obtained from tumor samples were excluded. CNVRs were generated using an approach similar to that described above and were then compared with the list of Bantu CNVRs to identify overlap.

CNV blocks

We generated a list of ‘CNV blocks’ from a set of unrelated individuals in our cohort (unrelated individuals are defined in Ref. 32) to obtain a more accurate count of the number of times any given CNV was observed. First, all overlapping CNVs localizing to a given genomic region were aligned as shown (Fig. 2a,b). The largest region encompassed by these overlapping CNVs (A-D in Fig. 2) was segmented by the start and end coordinates of individual CNV calls (A-K in Fig. 2), which resulted in multiple CNV blocks (A-E, E-C, C-J in Fig. 2, Additional file 4). An example of CNV blocks is shown in Fig. 2b. We then counted the number of times each CNV block was observed in unrelated individuals in our cohort. Based on these counts, CNV blocks were categorized into four groups: CNV blocks observed in ≥ 5% (common CNV blocks), ≥ 1% and < 5% (low frequency CNV blocks), ≥ 0.1% and < 1% (rare CNV blocks), and < 0.1% (very rare CNV blocks) of individuals.
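The segmentation into CNV blocks can be sketched in Python as below. This is a simplified illustration, not the authors' implementation: function names are hypothetical, each input interval is assumed to come from a different unrelated individual, and a block observed in exactly 0.1% of individuals is placed in the rare category, which is an assumption about how the boundary is handled.

```python
def cnv_blocks(cnvs):
    """Split overlapping CNVs into blocks bounded by all start/end points.

    cnvs: list of (start, end) calls in one region, one per individual.
    Returns (block_start, block_end, n_calls_covering_block) tuples,
    omitting stretches that no call covers.
    """
    breakpoints = sorted({p for s, e in cnvs for p in (s, e)})
    blocks = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        count = sum(s <= left and right <= e for s, e in cnvs)
        if count:
            blocks.append((left, right, count))
    return blocks


def frequency_class(count, n_individuals):
    """Assign a block to one of the four frequency groups."""
    freq = count / n_individuals
    if freq >= 0.05:
        return "common"
    if freq >= 0.01:
        return "low frequency"
    if freq >= 0.001:
        return "rare"
    return "very rare"
```

Counting per block rather than per original call means that a region shared by many partially overlapping CNVs is tallied once per individual carrying it, which is the motivation given above for the block representation.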

CNVs in regions associated with disease

To assess which CNVs from our cohort overlap genes associated with developmental disorders, we identified overlap (at least 1 bp) of our common (≥ 5%), low frequency (≥ 1% and < 5%), and rare (≥ 0.1% and < 1%) CNV blocks with genes catalogued in the Developmental Disorders Genotype–Phenotype Database (DDG2P [48]; Additional file 5), which is compiled based on known implication in disease etiology. The following “STATUS” categories were included in the analysis: Confirmed developmental disorder (DD) Gene, Probable DD Gene, Possible DD Gene, and Both DD and IF (incidental finding). We determined the degree of overlap between each CNV block and gene using a bi-directional approach: first we calculated how much of the CNV block overlapped with the gene (CNVvsGeneOverlap% in Additional file 6) and then how much of the gene overlapped with the CNV block (GenevsCNVOverlap% in Additional file 6).
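The bi-directional overlap can be made concrete with a short Python sketch. The function name and the half-open coordinate convention are assumptions; the two returned values correspond in spirit to the CNVvsGeneOverlap% and GenevsCNVOverlap% columns.

```python
def bidirectional_overlap(block, gene):
    """Return (% of CNV block covered by gene, % of gene covered by block).

    block and gene are (start, end) tuples on the same chromosome,
    treated as half-open intervals.
    """
    b_start, b_end = block
    g_start, g_end = gene
    shared = max(min(b_end, g_end) - max(b_start, g_start), 0)
    block_pct = 100 * shared / (b_end - b_start)
    gene_pct = 100 * shared / (g_end - g_start)
    return block_pct, gene_pct
```

Reporting both directions distinguishes a small CNV block lying entirely within a large gene (high block percentage, low gene percentage) from a large block that spans a whole gene (the reverse).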

To assess whether large CNVs from our cohort overlap loci associated with genomic disorders, we first generated a list of 866 large CNVs (≥ 300 kb) observed in our cohort (Additional file 7). We then determined the proportion of overlap of these CNVs with known CNVs previously implicated in the etiology of syndromes and genomic disorders catalogued in the DatabasE of genomiC variation and Phenotype in Humans using Ensembl Resources (DECIPHER) [49, 50] (Additional file 8). DECIPHER is an expert-curated database of microdeletion and microduplication syndromes in developmental disorders.