High throughput exome coverage of clinically relevant cardiac genes
© Manase et al.; licensee BioMed Central Ltd. 2014
Received: 12 August 2014
Accepted: 26 November 2014
Published: 11 December 2014
Given the growing use of whole-exome sequencing (WES) for clinical diagnostics of complex human disorders, we evaluated coverage of clinically relevant cardiac genes on WES and factors influencing uniformity and depth of coverage of exonic regions.
Two hundred and thirteen human DNA samples were exome sequenced via Illumina HiSeq using different versions of the Agilent SureSelect capture kit. 50 cardiac genes were further analyzed including 31 genes from the American College of Medical Genetics (ACMG) list for reporting of incidental findings and 19 genes associated with congenital heart disease for which clinical testing is available. Gene coordinates were obtained from two databases, CCDS and Known Gene and compared. Read depth for each region was extracted from the exomes and used to assess capture variability between kits for individual genes, and for overall coverage. GC content, gene size, and inter-sample variability were also tested as potential contributors to variability in gene coverage.
All versions of capture kits (designed based on Consensus coding sequence) included only 55% of known genomic regions for the cardiac genes. Although newer versions of each Agilent kit showed improvement in capture of CCDS regions to 99%, only 64% of Known Gene regions were captured even with newer capture kits. There was considerable variability in coverage of the cardiac genes. 10 of the 50 genes including 6 on the ACMG list had less than the optimal coverage of 30X. Within each gene, only 32 of the 50 genes had the majority of their bases covered at an interquartile range ≥30X. Heterogeneity in gene coverage was modestly associated with gene size and significantly associated with GC content.
Despite improvement in overall coverage across the exome with newer capture kit versions and higher sequencing depths, only 50% of known genomic regions of clinical cardiac genes are targeted and individual gene coverage is non-uniform. This may contribute to a bias with greater attribution of disease causation to mutations in well-represented and well-covered genes. Improvements in WES technology are needed before widespread clinical application.
KeywordsExome sequencing Coverage Congenital heart disease Cardiac Genomics
Next generation sequencing (NGS) has changed the diagnostic landscape for complex human disorders. Sequencing costs, particularly for exome sequencing, have plummeted over the last decade , allowing for the rapid uptake of technology in both the research and clinical realms. NGS platforms capitalize on parallel processing to generate hundreds of megabases of sequence from a single run allowing for deep coverage of vast regions . A greater understanding of the limitations and potential pitfalls of sequencing technology is essential before clinical uptake of this technology . Whole exome sequencing (WES) represents one NGS platform that has been effective in the study of rare Mendelian diseases and has become widely popular as an efficient method of identifying genetic variants within an individual’s genes. Currently WES has two major limitations: lower accuracy in nucleotide calls compared with Sanger sequencing and the lack of uniform coverage of exonic regions. The considerably higher rate when compared to Sanger sequencing of both false-negative and false-positive calls, can be minimized by increasing the depth of sequencing . However, variant identification in whole exome sequencing is still limited by the non-uniformity of overall coverage which in turn is determined by sequence capture and read depth. Previous studies report considerable variability in uniformity and depth of coverage across the exome including base-to-base and gene-to-gene coverage -. Although newer versions of capture kits are designed to target a greater number of bases, it is unclear if capture of exonic loci is uniform across genes. In 2013, the American College of Medical Genetics (ACMG) published recommendations for reporting of incidental findings in clinical exome sequencing. Since a large number of genes for cardiac disorders are included in the ACMG reporting list , we assessed targeted capture and coverage of clinically relevant cardiac genes with whole exome sequencing and investigated factors contributing to heterogeneity in coverage.
Whole exome sequencing and candidate gene selection
Clinically relevant cardiac gene list
American College of Medical Genetics
▪ Hypertrophic cardiomyopathy
MYBPC3, MYH7, TNNT2, TNNI3, TPM1, MYL3, ACTC1, PRKAG2, GLA, MYL2, LMNA
▪ Dilated cardiomyopathy
▪ Catecholaminergic polymorphic ventricular tachycardia
▪ Arrhythmogenic right ventricular cardiomyopathy
PKP2, DSP, DSC2, TMEM43, DSG2
▪ Romano–Ward long QT syndrome
KCNQ1, KCNH2, SCN5A
▪ Brugada syndrome
▪ Familial hypercholesterolemia
LDLR, APOB, PCSK9
▪ Ehlers–Danlos syndrome, vascular type
▪ Marfan syndrome
FBN1, TGFBR1, TGFBR2, SMAD3, ACTA2, MYLK, MYH11
▪ Loeys–Dietz syndromes
▪ Familial thoracic aortic aneurysms and dissections
▪ Genes associated with congenital heart disease
BRAF, NOTCH1, CFC1, NRAS, CHD7, PTPN11, GATA4, RAF1, HRAS, SOS1, JAG1, TBX1, KRAS, TBX5, MAP2K1, ZIC3, ELN, NKX2-5, NF1
Exonic regions represented in reference datasets
Agilent capture kits are primarily designed to target genes based on the National Center for Biotechnology Information Consensus Coding Sequence (CCDS) dataset. The CCDS database  is built by consensus among four major collaborating partners: European Bioinformatics Institute, National Center for Biotechnology Information (NCBI), Wellcome Trust Sanger Institute, and the UCSC -. The University of California, Santa Cruz (UCSC) Known Gene dataset houses a larger database that encompasses not only all the CCDS transcripts but also any protein coding genes that are substantiated by a transcript in GenBank mRNA or NCBI RefSeq and have a UniProt protein. Alternative splicing isoforms are also included in Known Gene if they represent a UniProt protein and have a transcript . We therefore compared how well each gene was represented in both datasets by first obtaining gene coordinates for all protein coding regions of every available transcript in BED format from the CCDS and the Known Gene databases using the UCSC table browser . BEDTools  was then used to collapse coordinates to unique locations in order to avoid overlap and also ensure all CCDS coordinates were contained within the Known Gene coordinates. For each gene the exonic regions included within the CCDS database were represented as a percentage of the Known Gene superset for that gene.
Cardiac gene coverage
Coverage across the 50 cardiac genes was analyzed as a combination of sequence capture and read depth. The read depth at each nucleotide was extracted from each subject’s exome BAM file using SAMtools mpileup . Subsequent data handling was performed with custom scripting using the Python programming language (v2.7.1) . The read depth or coverage was the number of independent times a particular nucleotide is represented in a collection of raw sequences, and was expressed numerically as 1X, 2X, 30X, etc. . The read depth in every captured location was averaged across the samples sequenced for each capture kit. This produced an average capture per coordinate within both the CCDS and Known Gene datasets for each kit that was used in subsequent analyses.
Gene size and GC content
Gene size was calculated as the sum of unique bases of each candidate cardiac gene from the CCDS dataset. GC content was analyzed using BEDTools’ nuc function, along with UCSC’s human genome version 19 FASTA file, to output counts for each of the four nucleotides in a given region.
The proportion of bases in Known Gene regions captured by different versions of CCDS-based capture kits was compared. The comparison was performed across the entire targeted region as well as a gene by gene comparison to identify which genes had the highest discordance between databases. Actual coverage of the targeted region at a minimum read depth of 3X (the minimum depth required to identify a heterozygous mutation) and at an optimal read depth of 30X or higher (the minimum requirement for a confident genotype call) was then determined and influence of capture kit version and observed sequencing depth on coverage was analyzed. Uniformity of coverage was assessed across the candidate genes by calculating the median coverage of each gene as well as the variability in base to base coverage within each gene. Inter-sample variability was assessed for all bases within the CCDS targets by plotting the read depth confidence interval across all samples. Finally, the association between gene size and GC content on coverage of candidate cardiac genes was analyzed using linear regression. All statistical analysis was performed using the R statistical programming package (v3.0.1) .
207 participants underwent whole exome sequencing between April 2012 and January 2013. 196 participants had a diagnosis of CHD and 17 had no structural heart defects. The average age at enrollment was 14.7 ± 11.7 years. 109 participants (53%) were male. Racial distribution included 85% Caucasians, 9% Asians, 1% African American, and 5% of mixed or unknown ethnicity. Six additional exome datasets were obtained from the Wellcome Trust Sanger Institute HapMap data with no information on disease status available.
Cardiac gene sequences represented in CCDS vs. known gene
The Agilent capture kits are designed to capture sequence data derived from the CCDS dataset. For the 50 cardiac genes, this amounts to a total of 166,271 bp of unique protein coding bases. For the Known Gene dataset, this amounts to 300,468 bp unique protein coding bases. Additional file 1: Table S1 shows the proportion of Known Gene regions that are also included in the CCDS. Overall only 55% of the regions within the Known Gene dataset for all genes were represented in the CCDS dataset. At individual gene level, this ranged from 10.1% (KRAS) to 96.8% (APOB).
Capture of cardiac genes by kit version
Uniformity of coverage of cardiac genes
Association of gene size and GC content with genomic coverage
The increasing use of next generation sequencing for research and clinical genetics and the move towards scanning the genome or exome not only for variation in targeted genes of relevance to the disease under study, but also for variant-detection in other disease-associated genes has intensified the focus on the technical accuracy and validity of the technology . Coverage analysis of 50 clinically important cardiac genes using data from research based WES demonstrated considerable base-by-base and gene-by-gene variability in coverage within and across the candidate genes. Sources of variability included incomplete targeting of exonic regions, low capture efficiency, low sequencing depth, larger gene size and high GC content within the target region. Despite improvement in median coverage across the exome with newer capture kits and with higher sequencing depths, the coverage of individual genes was non-uniform. Overall, these results highlight the importance of assessing adequacy of coverage of target genes when analyzing whole exome data for disease genes and the need to improve current technology before whole exome sequencing can be widely deployed towards clinical cardiac diagnostics.
The first important finding was incomplete capture of clinically important cardiac genes with whole exome sequencing. Currently, exome sequencing capture kits are primarily designed to capture regions identified within the CCDS database. However, unlike the Known Gene database, the CCDS does not contain all possible transcripts, thereby restricting target regions to just over 50% of known genomic regions. Missing exonic nucleotides not captured with these kits may result in non-detection of variants in alternate isoforms. The difference in exonic region definition varies greatly at the gene level as shown in Additional file 1: Table S1 where the number of bases per gene for the 50 cardiac genes in the CCDS dataset can range from 10% to 96% of their size in the Known Gene dataset. Given that some known transcripts identified within the Known Gene dataset have yet to be approved in the CCDS dataset, there is a potential bias in gene discovery towards genes that are already well defined within the CCDS. Further increase in target regions is necessary to enhance more uniform and complete capture across the exome.
The most important metric of efficiency for a capture experiment is the proportion of targeted DNA inserts that are specifically hybridized and recovered from the capture. Our results showed that, for the 50 genes studied, minimum coverage (3X) was obtained to a high degree (92% to 98%) for all the capture kit versions within the CCDS. Higher depth of sequencing can improve coverage across difficult targets and as expected, the proportion of the genomic regions with high read depth was improved when intended sequencing depth was also higher with optimal coverage (30X) reached for 82.0% of the targets that were sequenced using the Agilent V3 kit that had an observed average depth of 80X. Despite overall improvement in coverage, there remained significant gene-to-gene variability in coverage. Ten of the 50 cardiac genes (including 6 on the ACMG list) had a median coverage that was less than optimal (<30X), increasing the risk of under-detection of true positives (Type II error). Thus a high intended depth of sequencing can improve the efficiency of capture but it increases costs and did not increase the number of genomic regions covered. Improving coverage will therefore require improvement in the capture technology and probe capture design to capture previously missing regions of the exome.
Gene-to-gene variability in coverage was not accounted for by inter-sample variability, which was consistently low across all 50 genes, and only modestly related to the size of the exonic regions examined for each gene. It did however correlate with GC content with lower coverage in regions that showed a high GC content. This limitation in sequencing regions with high GC content has been identified previously ,. Targeted re-sequencing and modified sequencing techniques, such as oligonucleotide-selective sequencing, may provide a more efficient strategy to deal with GC rich genes . Therefore, a preliminary screen for the amount GC content of candidate genes prior to sequencing is recommended.
In summary, these results highlight the importance of assessing adequacy of coverage of candidate or target genes in genomic analyses to avoid false negative results. When comparing cohorts sequenced using different capture kits or sequencing platforms, coverage analysis plays an even more significant role. Underreporting of variants due to lower coverage in one cohort can result in false ascertainment of enrichment of variants in the well covered cohort resulting in a Type II error. Accurate assessment of coverage between cohorts should also look beyond the commonly reported mean/median depth per base. This type of measure for overall coverage gives no indication of the uniformity of coverage across the gene. Base-by-base read depth comparison should be assessed to ensure cohorts are comparable at specific variant sites.
A limitation of our study is that we did not assess if structural variation, such as common and rare copy number variants (CNVs), contribute to the discrepancy in coverage when analyzing a small subset of genes especially in the presence of population specific CNVs. Another limitation is that we did not study the impact of coverage on variant detection sensitivity. However, several other studies have already evaluated variant detection sensitivity with exome data and identified numerous factors that influence variant detection sensitivity including coverage, quality of alignment and mapping, duplicates, strand bias, and type of algorithm used ,-. A recent study also found that a mean on-target read depth of 17-37X is required to identify 90% of heterozygous SNVs in the targeted regions, depending on the uniformity of read coverage . Therefore strategies to enhance capture and coverage as well as the sensitivity in variant detection are needed.
In conclusion, non-uniformity of coverage is a limitation of exome sequencing and may introduce bias into gene and variant discovery. As the application of exome sequencing expands into the clinical domain, careful interpretation of findings in the context of coverage is essential. Improvements to sequencing capture technology to expand the targeted regions and advances in bioinformatic approaches to account for differential coverage are essential to improve clinical implementation. From a research perspective, care should be taken when assessing overall mutation burden or identifying casual variants or genes. There is a bias towards well-covered and easy to sequence genes and a risk of under-reporting of mutation burden in poorly covered genes.
American College of Medical Genetics
Congenital heart disease
Copy number variant
Next generation sequencing
Whole exome sequencing.
The authors acknowledge the following individuals and organizations: SickKids Labatt Family Heart Centre Biobank Registry for access to DNA samples; Mina Safi, Alan Fung, and Steven Habbous for assistance with data acquisition. The work was supported by funding from the Heart and Stroke Foundation of Ontario Chair in Cardiovascular Sciences (S.M), Heart and Stroke Foundation of Canada Research Fellowship (A.K.M), Ontario Ministry of Research and Innovation International Strategic Opportunities Program (S.M), McLaughlin Centre Accelerator Grant in Genomic Medicine (S.M), SickKids Labatt Family Heart Centre Innovation Funds (D.M, L.D, S.M), the Roma and Marvin Auerback Endowment funds (S.M).
- Dewey FE, Pan S, Wheeler MT, Quake SR, Ashley EA: DNA sequencing clinical applications of New DNA sequencing technologies. Circulation. 2012, 125: 931-944. 10.1161/CIRCULATIONAHA.110.972828.View ArticlePubMedPubMed CentralGoogle Scholar
- Majewski J, Schwartzentruber J, Lalonde E, Montpetit A, Jabado N: What can exome sequencing do for you?. J Med Genet. 2011, 48: 580-589. 10.1136/jmedgenet-2011-100223.View ArticlePubMedGoogle Scholar
- Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE: Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009, 461: 272-276. 10.1038/nature08250.View ArticlePubMedPubMed CentralGoogle Scholar
- Dorn C, Grunert M, Sperling SR: Application of high-throughput sequencing for studying genomic variations in congenital heart disease. Brief Funct Genomics. 2014, 13: 51-65. 10.1093/bfgp/elt040.View ArticlePubMedGoogle Scholar
- Strom SP, Lee H, Das K, Vilain E, Nelson SF, Grody WW, Deignan JL: Assessing the necessity of confirmatory testing for exome-sequencing results in a clinical molecular diagnostic laboratory. Genet Med. 2014, 16: 510-515. 10.1038/gim.2013.183.View ArticlePubMedPubMed CentralGoogle Scholar
- Parla JS, Iossifov I, Grabill I, Spector MS, Kramer M, McCombie WR: A comparative analysis of exome capture. Genome Biol. 2011, 12: R97-10.1186/gb-2011-12-9-r97.View ArticlePubMedPubMed CentralGoogle Scholar
- Dias C, Sincan M, Cherukuri PF, Rupps R, Huang Y, Briemberg H, Selby K, Mullikin JC, Markello TC, Adams DR: An analysis of exome sequencing for diagnostic testing of the genes associated with muscle disease and spastic paraplegia. Hum Mutat. 2012, 33: 614-626. 10.1002/humu.22032.View ArticlePubMedPubMed CentralGoogle Scholar
- Dewey FE, Grove ME, Pan C, Goldstein BA, Bernstein JA, Chaib H, Merker JD, Goldfeder RL, Enns GM, David SP: Clinical interpretation and implications of whole-genome sequencing. JAMA. 2014, 311: 1035-1045. 10.1001/jama.2014.1717.View ArticlePubMedPubMed CentralGoogle Scholar
- Sulonen A-M, Ellonen P, Almusa H, Lepistö M, Eldfors S, Hannula S, Miettinen T, Tyynismaa H, Salo P, Heckman C: Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol. 2011, 12: R94-10.1186/gb-2011-12-9-r94.View ArticlePubMedPubMed CentralGoogle Scholar
- Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, McGuire AL, Nussbaum RL, O’Daniel JM, Ormond KE: ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med. 2013, 15: 565-574. 10.1038/gim.2013.73.View ArticlePubMedPubMed CentralGoogle Scholar
- Papaz T, Safi M, Manickaraj A-K, Ogaki C, Kyryliuk JB, Burrill L, Dodge C, Chant-Gambacort C, Walter L-L, Rosenberg H: Factors influencing participation in a population-based biorepository for childhood heart disease. Pediatrics. 2012, 130: e1198-e1205. 10.1542/peds.2012-0687.View ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.View ArticlePubMedPubMed CentralGoogle Scholar
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M: The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20: 1297-1303. 10.1101/gr.107524.110.View ArticlePubMedPubMed CentralGoogle Scholar
- Coffey AJ, Kokocinski F, Calafato MS, Scott CE, Palta P, Drury E, Joyce CJ, LeProust EM, Harrow J, Hunt S: The GENCODE exome: sequencing the complete human exome. Eur J Hum Genet. 2011, 19: 827-831. 10.1038/ejhg.2011.28.View ArticlePubMedPubMed CentralGoogle Scholar
- Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ: The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009, 19: 1316-1323. 10.1101/gr.080531.108.View ArticlePubMedPubMed CentralGoogle Scholar
- Harte RA, Farrell CM, Loveland JE, Suner M-M, Wilming L, Aken B, Barrell D, Frankish A, Wallin C, Searle S: Tracking and coordinating an international curation effort for the CCDS project. Database. 2012, 2012: bas008-10.1093/database/bas008.View ArticlePubMedPubMed CentralGoogle Scholar
- Farrell CM, O’Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, Diekhans M, Barrell D, Searle SM, Aken B: Current status and new features of the consensus coding sequence database. Nucleic Acids Res. 2014, 42: D865-D872. 10.1093/nar/gkt1059.View ArticlePubMedGoogle Scholar
- Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D: The UCSC known genes. Bioinformatics. 2006, 22: 1036-1046. 10.1093/bioinformatics/btl048.View ArticlePubMedGoogle Scholar
- Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC table browser data retrieval tool. Nucleic Acids Res. 2004, 32: D493-D496. 10.1093/nar/gkh103.View ArticlePubMedPubMed CentralGoogle Scholar
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26: 841-842. 10.1093/bioinformatics/btq033.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.View ArticlePubMedPubMed CentralGoogle Scholar
- VanRossum G, Drake FL: The Python Language Reference. 2010, Python Software Foundation, Amsterdam, NetherlandsGoogle Scholar
- Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP: Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 2014, 15: 121-132. 10.1038/nrg3642.View ArticlePubMedGoogle Scholar
- R Development Core Team: R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2012.
- Wooderchak-Donahue WL, O’Fallon B, Furtado LV, Durtschi JD, Plant P, Ridge PG, Rope AF, Yetman AT, Bayrak-Toydemir P: A direct comparison of next generation sequencing enrichment methods using an aortopathy gene panel-clinical diagnostics perspective. BMC Med Genomics. 2012, 5: 50-10.1186/1755-8794-5-50.View ArticlePubMedPubMed CentralGoogle Scholar
- Myllykangas S, Buenrostro JD, Natsoulis G, Bell JM, Ji HP: Efficient targeted resequencing of human germline and cancer genomes by oligonucleotide-selective sequencing. Nat Biotechnol. 2011, 29: 1024-1027. 10.1038/nbt.1996.View ArticlePubMedPubMed CentralGoogle Scholar
- Liu X, Han S, Wang Z, Gelernter J, Yang B-Z: Variant callers for next-generation sequencing data: a comparison study. PLoS One. 2013, 8: e75619-10.1371/journal.pone.0075619.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang Q, Jia P, Li F, Chen H, Ji H, Hucks D, Dahlman KB, Pao W, Zhao Z: Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Med. 2013, 5: 91-10.1186/gm495.View ArticlePubMedPubMed CentralGoogle Scholar
- Rieber N, Zapatka M, Lasitschka B, Jones D, Northcott P, Hutter B, Jager N, Kool M, Taylor M, Lichter P, Pfister S, Wolf S, Brors B, Eils R: Coverage bias and sensitivity of variant calling for four whole-genome sequencing technologies. PLoS One. 2013, 8: e66621-10.1371/journal.pone.0066621.View ArticlePubMedPubMed CentralGoogle Scholar
- Meynert AM, Bicknell LS, Hurles ME, Jackson AP, Taylor MS: Quantifying single nucleotide variant detection sensitivity in exome sequencing. BMC Bioinformatics. 2013, 14: 195-10.1186/1471-2105-14-195.View ArticlePubMedPubMed CentralGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.