The potential discovery of CNVs associated with HNPCC/LS would represent a significant advance in the search for genetic loci associated with disease expression. In the current study, using Illumina SNP arrays, we identified two CN gains (7q11.21 and 16p11.2) with significant frequency differences between cases and controls. The CN gain on chromosome 16p11.2 could not be validated with a CN assay and was not evident when the dataset was re-analysed with a newer version of the Nexus software (v6.1); it is therefore considered a false positive observation from our primary analyses. False CNV calling may be caused by intensity fluctuations on SNP arrays, which have been shown to arise from the GC content of the probed sequences, the position of the SNP in the probe and the algorithms used to analyse array signals. It is likely that the detected CN gain at chromosome 16p is an artefact of some or all of these phenomena, and this is supported by the exclusion of these probes when the Nexus 6.1 linear correction was applied.
The CN gain on chromosome 7q11.21 could not be validated by TaqMan assay but was still evident when re-analysed with Nexus v6.1 (18% of cases still had a CN gain, while none of the controls displayed a CN gain in the same region). Unlike the TaqMan® Pre-Designed and Custom Plus assays, the Custom assay design used for these validations does not undergo genome quality checks and is designed on a masked sequence provided by the customer. The Custom assay for the CN gain on chromosome 7q11.21 showed poor reproducibility (low confidence scores and inconsistent calls when repeated), which may or may not be a result of DNA sequence-specific complications. These results are evidence of locus-specific, elevated rates of false detection on both platforms used; because all sample concentrations were equilibrated and pipetting between plates was consistent, a technical cause of this inconsistency could not be identified. Owing to the difficulty of designing CN assays in the two regions, only one CN assay was designed in each region. This is a possible limitation of the validation attempt, as two assays within the segment and one assay outside the segment (as a negative control) would have been optimal.
Another resource available for validating the CN gains at 7q11.21 and 16p11.2 was a set of Affymetrix 2.7 M array results from another project (unpublished data) that included 30 of our cases. Neither of the two regions is covered by this array, and on further enquiry Affymetrix informed us that probe performance over these regions was not optimal. The Applied Biosystems Custom Plus assay design service was also unable to design suitable assays for these regions, which perhaps indicates a similarly reduced capacity for optimal data acquisition and may explain the poor data obtained.
The CN gain on chromosome 7 is located in a chromosomal region where there are no annotated genes, miRNAs or CpG islands, but the CN gain is downstream of a CpG island (CpG: 139) and upstream of the gene LOC643955 (function unknown). The importance of these intergenic regions is poorly understood, but they may be involved in regulating the expression of upstream or downstream genomic regions or be in linkage disequilibrium with disease-associated regions. CNVs in the region have previously been reported in control populations. Chromosome 7q11-21 has previously been associated with cancer [58, 59] and, interestingly, both regions identified in the current study (7q11.21 and 16p11.2) have been found as CN gains in small bowel adenocarcinomas, which raises the question of whether this is evidence in favour of the findings of the current study or calls into question the stringency of the analysis that reported it.
It has been suggested that the overall CNV burden creates a differing sensitised background during development, leading to different thresholds of disease. In the current study we observed that HNPCC/LS cases have a greater overall CNV burden and a greater unique/rare CNV burden than controls. This is consistent with previous reports for other complex genetic disorders. For example, individuals with schizophrenia have a greater genomic burden of structural variation than controls, and rare CNVs have been observed in schizophrenia patients but not in controls, supporting a disease model incorporating the effects of multiple, rare, highly penetrant variants. Few studies have investigated germline CNVs and cancer risk, but the total number of germline CNVs has been found to be higher in patients with Li-Fraumeni syndrome than in controls. A large CNV burden has also been positively correlated with the severity of childhood disabilities. In the current study, the high overall CNV burden in HNPCC/LS patients could be due to their MMR deficiency arising from mutations in MMR genes, supporting the idea that deficiency of MMR occurs first and the adenoma evolves from the MMR-deficient cell. We therefore tested the overall difference in CNV burden between MMR+ LS patients and MMR- HNPCC patients. The total and average CNV lengths did not differ between the two groups, but the number of CNV events did. Interestingly, Nexus software analysis suggested that MMR- HNPCC cases had a greater unique/rare CNV burden than MMR+ probands, which could indicate deficient DNA repair in these patients despite the negative mutation screen in the MMR genes known to be associated with the disease. Because our clinical cohort represents a highly ascertained population that underwent CNV analyses as a result of a clinical/molecular diagnosis of HNPCC/LS, the subjects are possibly enriched for rare CNVs.
However, we only cautiously suggest this interpretation, due to the described challenges with validating Nexus results.
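The three burden measures compared here (number of CNV events, total CNV length and average CNV length) can be made concrete with a short sketch. This is only an illustration under stated assumptions: the tuple layout `(chrom, start, end)`, the function names `cnv_burden` and `group_means`, and the toy coordinates in the usage note are hypothetical and are not taken from the Nexus or QuantiSNP output of this study.

```python
from statistics import mean


def cnv_burden(calls):
    """Summarise one sample's CNV calls.

    `calls` is a list of (chrom, start, end) tuples with end > start
    (hypothetical layout, chosen for illustration only).
    """
    lengths = [end - start for _chrom, start, end in calls]
    total = sum(lengths)
    return {
        "n_events": len(lengths),                              # number of CNV events
        "total_length": total,                                 # summed CNV length (bp)
        "mean_length": total / len(lengths) if lengths else 0.0,
    }


def group_means(samples):
    """Average each burden metric over a non-empty group of samples.

    `samples` maps a sample identifier to its list of CNV calls.
    """
    burdens = [cnv_burden(calls) for calls in samples.values()]
    keys = ("n_events", "total_length", "mean_length")
    return {key: mean(b[key] for b in burdens) for key in keys}
```

With toy input such as `group_means({"s1": [("7", 100, 300), ("16", 0, 100)], "s2": []})`, the group averages are one event, 150 bp total length and 75 bp mean length per sample. Whether a difference between two groups is significant would still need a formal test, which this sketch does not attempt.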
We took a rigorous and conservative analytical approach to maximise CNV call reliability by calculating NSR, setting the minimum number of probes to 5 and using two different algorithms to identify significant CNV differences between cases and controls. The use of more than one CNV calling algorithm has been applied in several studies to improve the rates of reproducibility and positive prediction [19, 66–69]; however, it invariably increases the overall false positive rate. Accordingly, we sought to control our positive prediction rate by considering only those regions that satisfied dual algorithm detection at the respective significance thresholds as qualifiers of association with LS/HNPCC. Nevertheless, our findings should be interpreted with caution, as the two software programs used for analysis differed considerably in the CNV frequencies detected in cases and controls in the association analysis and in the total length, average length and number of CNVs called. Reassuringly, the discrepancies we observed are consistent with the results of other recent studies that have attempted to use convergence across multiple algorithms to identify valid CNV calls [67–69]. These discrepancies are due to the differing sensitivities of the algorithms to the inherent variations in relative fluorescence between co-assayed genomic loci on SNP arrays.
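The dual algorithm requirement can be sketched as a simple convergence filter. The sketch below is a minimal illustration and not the procedure applied in this study: the `Call` record, the function names and, in particular, the 50% reciprocal overlap criterion (`min_overlap`) are assumptions introduced for the example. Only the minimum of 5 probes comes from the text above.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """One CNV call (hypothetical record layout for illustration)."""
    chrom: str
    start: int
    end: int
    state: str      # "gain" or "loss"
    n_probes: int


def reciprocal_overlap(a, b):
    """Smaller of the two fractions of each call covered by the other."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def convergent_calls(calls_a, calls_b, min_probes=5, min_overlap=0.5):
    """Keep calls from algorithm A that algorithm B also supports.

    A call is kept when both calls contain at least `min_probes` probes,
    agree on state, and overlap reciprocally by at least `min_overlap`
    (the overlap criterion is an assumption, not the study's rule).
    """
    kept = []
    for a in calls_a:
        if a.n_probes < min_probes:
            continue
        for b in calls_b:
            if (b.n_probes >= min_probes and a.state == b.state
                    and reciprocal_overlap(a, b) >= min_overlap):
                kept.append(a)
                break
    return kept
```

Raising `min_probes` in such a filter is one way to explore how a stricter probe count changes the set of convergent calls, at the cost of discarding shorter CNVs.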
The two algorithms used in the current study differ in several respects. Nexus uses a proprietary CBS-based algorithm to divide chromosomal data into segments whose median LRR values differ significantly from those of adjacent segments. CNVs are defined using numerous one-size-fits-all, user-defined thresholds (see Methods), and CNV call reliability may therefore fluctuate with data quality. The Nexus algorithm considers only single samples for CNV calling and does not draw on collective data for greater call confidence. Conversely, QuantiSNP uses an HMM in which aberrations are defined as excursions from the null state that satisfy multiple parameters learnt from the input data, and confidence is heightened if the aberrations are detected in multiple samples. Additionally, there are only two user-defined thresholds (the characteristic length parameter (2 Mb default) and the Log Bayes Factor), which serve to reduce the false positive error rate at different stages of the analysis.
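The fixed-threshold calling step described for segmentation-based approaches can be illustrated with a minimal sketch. It does not reproduce the proprietary Nexus algorithm: segmentation is assumed to have been done already, and the cut-off values of ±0.2 and the function name `classify_segment` are hypothetical placeholders rather than the settings reported in the Methods.

```python
from statistics import median


def classify_segment(lrr_values, gain_cutoff=0.2, loss_cutoff=-0.2, min_probes=5):
    """Label one pre-segmented region from its median log R ratio (LRR).

    The cut-offs are fixed, illustrative values (assumptions); they do not
    adapt to the noise level of the sample, which is the limitation
    discussed in the text.
    """
    if len(lrr_values) < min_probes:
        return "no_call"        # too few probes to support a call
    m = median(lrr_values)
    if m >= gain_cutoff:
        return "gain"
    if m <= loss_cutoff:
        return "loss"
    return "neutral"
```

Because the same cut-offs are applied to every sample, a noisy array whose segment medians drift slightly past a cut-off is called in exactly the same way as a clean array with a genuine shift, and no confidence measure is attached to the call itself.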
The challenge seems to be a combination of the inherent inaccuracy of measuring signal intensity using genotyping data from SNP arrays and systematic differences between statistical algorithms. Accordingly, the higher Type I error rate observed with Nexus in the current study may be due to the lack of control for false positives, the rigidity of its user-defined thresholds, which do not adapt in line with data quality, and a lack of confidence testing of aberrations (that is, the only significance testing is applied at the segmentation stage, not when the LRR ± cut-offs indicate CN gain or loss). In recent comparisons [34, 70], both programs utilised in the current study performed well compared with other algorithms, and we used settings consistent with those previously reported. The recent study by Kim et al. suggests that convergent CNV calls across at least three algorithms should be obtained before undertaking association analysis, as only ~10% of CNVs called using two algorithms were verified by a third. Such low convergence likely reflects a combination of Type I and Type II error across the discovery and validation analyses. Kim et al. do, however, show that validity can be increased by tightening the CNV filtering criterion to require at least 7 probes, suggesting that better validity might have resulted in our study from applying a third algorithm and requiring called CNVs to contain at least 7 consecutive probes.
New software for analysing CNVs is being rapidly developed [34, 70] and, as no gold standard has yet been established, CNV analysis remains challenging and the results difficult to interpret. Other possible limitations of our study are the control population, the modest sample size and the potential for false negative results due to strict analytical parameters. The controls were healthy individuals at the time of sampling but may develop cancer in the future, which would be expected to reduce the power of our analyses. However, all controls were aged >55 years, which reduces the potential impact of misclassification bias.
The genomic region on chromosome 7q11.21 requires further investigation to establish whether it is associated with the investigated disease and should not be dismissed because of its location in an intergenic region. The HNPCC/LS cases have a greater burden of CNVs across their genomes than controls, which supports the notion of higher genomic instability in these patients due to an inadequate DNA repair process. The technology is improving rapidly, but until next-generation sequencing is available and widely used in clinical diagnostic testing, inspecting the overall CNV burden in individuals with a clinical diagnosis of HNPCC/LS could become a rapid and cost-efficient screening method for identifying families for genetic testing. Future research should explore the identified candidate locus on chromosome 7q11.21 further and consider whether high CN at this locus increases the risk of disease development in the context of HNPCC/LS families.