A novel SNP analysis method to detect copy number alterations with an unbiased reference signal directly from tumor samples
© Lisovich et al; licensee BioMed Central Ltd. 2011
Received: 11 July 2010
Accepted: 26 January 2011
Published: 26 January 2011
Genomic instability in cancer leads to abnormal genome copy number alterations (CNA) as a mechanism underlying tumorigenesis. Using microarrays and other technologies, tumor CNA are detected by comparing tumor sample CN to normal reference sample CN. While advances in microarray technology have improved detection of copy number alterations, the increase in the number of measured signals, noise from array probes, variations in signal-to-noise ratio across batches and disparity across laboratories leads to significant limitations for the accurate identification of CNA regions when comparing tumor and normal samples.
To address these limitations, we designed a novel "Virtual Normal" algorithm (VN), which allowed for construction of an unbiased reference signal directly from test samples within an experiment using any publicly available normal reference set as a baseline thus eliminating the need for an in-lab normal reference set.
The algorithm was tested using an optimal, paired tumor/normal data set as well as previously uncharacterized pediatric malignant gliomas for which a normal reference set was not available. Using Affymetrix 250K Sty microarrays, we demonstrated improved signal-to-noise ratio and detected significant copy number alterations using the VN algorithm that were validated by independent PCR analysis of the target CNA regions.
We developed and validated an algorithm to provide a virtual normal reference signal directly from tumor samples and minimize noise in the derivation of the raw CN signal. The algorithm reduces the variability of assays performed across different reagent and array batches, methods of sample preservation, multiple personnel, and among different laboratories. This approach may be valuable when matched normal samples are unavailable or the paired normal specimens have been subjected to variations in methods of preservation.
DNA copy number alterations (CNA) including sequence amplifications and deletions can cause oncogene activation or reduced tumor suppressor gene function associated with the emergence of cancer [1, 2]. Recent advances in microarray technology including high density SNP (single nucleotide polymorphism) profiling have provided a novel approach to evaluate CNA across the genome of patient tissue specimens as a potential diagnostic tool for tumor classification. However, detection of a copy number amplification or deletion is critically dependent on: 1) interrogation of SNPs and/or nonpolymorphic markers at a frequency sufficient to identify changes in specific genes, and 2) accurate detection of gene copy number in tumor tissues . Evolution of arrays with increasing densities of SNPs has mitigated the first factor but the ability to accurately detect CNA remains a formidable challenge. Specifically, the signal-to-noise ratio of copy number intensity from tumor samples is markedly decreased (or inferior) compared to that obtained from ideal samples such as homogeneous cell lines or blood samples. These differences are likely associated with cellular heterogeneity e.g. normal cells mixed with tumor cells, infiltrating inflammatory cells and/or the presence of proliferating, apoptotic or necrotic cells. The variability is compounded further by sample preservation e.g. formalin-fixed paraffin embedded (FFPE) exhibit higher variability than frozen samples. Thus, there are substantial methodological hurdles to be resolved before CNA values from SNP arrays can be integrated into diagnostic evaluation of tumor samples.
An ideal experimental comparison for tumor copy-number evaluation comprises paired, normal tissue samples processed simultaneously with the tumor specimens thereby eliminating technical variability from personnel and batch effects. One additional advantage to paired comparisons is the suppression of CN aberrations present in both normal and tumor tissues. In practice however, paired tissue samples are rarely available and acquisition of paired normal tissue does not typically fall within the purview of therapeutic surgical intervention. Consequently, normal CNA profiles are often derived from a variety of normal specimens accumulated in a laboratory and often include archived FFPE specimens. Otherwise, investigators must rely on data from other laboratories employing the same CNA platform and/or by comparison to a publicly accessible database. The variability inherent in these other data sources further complicates interpretation of tumor sample CNA results. To address this issue, we have developed a SNP CN detection algorithm that derives a reliable copy number reference signal specifically from tumor samples thereby eliminating the need for a normal reference set. Validation tests indicated that this approach achieves a signal-to-noise ratio surpassing that obtained by other techniques including paired normal reference sets.
Reducing the noise level by exploiting similarities between the test and reference set has been successful in at least two empirical studies [4, 5]. In one study , the entire data set (test as well as normal data if available) is acquired in the lab to build a reference set. While noise reduction is achieved, this approach suppresses the CNA common for all samples in a data set, making it impractical for systematic CN aberration analysis. The second employs a CN detection package in CNAG , selecting the best from the pool of normal samples based on similarity of correlation function between a particular test sample and a given set of normal samples. While the noise level is somewhat reduced, testing this approach in CNAG as well as in our own implementation of a similar selection algorithm (data not shown) shows that sometimes, the reference set selected using this method may be very small (up to 1 - 2 samples) and therefore any CNA present in a normal reference will introduce false CNA in the tumor CN signal. Nevertheless, exploiting the natural similarity between test and reference set is an attractive approach if implemented carefully.
Its high frequency component is highly correlated with the one common for all tumor samples in the given batch and therefore the noise in (log2) ratio between given tumor raw CN and the reference signal is reduced.
The reference signal does not contain the CN aberrations common for the part or the whole tumor sample set, which means that when each individual tumor sample is compared to the reference signal, the aberration will be picked up for this particular tumor.
Unbiased Reference Signal Restoration
Depending on the signal-to-noise ratio between averages, we can employ two basic strategies for detecting and correcting for shifted segments. If the signal-to-noise ratio is high enough so that the false positive (FP)/false negative (FN) detection is low, we can use the direct segment detection, thus achieving the highest possible resolution. Shifting the segments back is a problem requiring detection of points of discontinuity of signal and calculating the mean over the segment. There are very few efficient segmentation algorithms available [6–8], which preserve a segment mean and therefore can be used to accomplish the task. The performance of a segmentation algorithm in this case should be higher than on a single sample, as the ratio of averages would not include the instrument noise.
If the noise level is too high so that FP/FN segment detection is likely, then instead of direct segment detection, the low-pass filtering or window smoothing is used for detecting the shift pattern, followed by correcting the original average intensity signal for an abovementioned shift pattern. With a proper choice of a smoothing window width, the problem of FP/FN detection presented in the first method is eliminated. In practice, the width of a smoothing window is comparable with one used for smoothing in standard methods (e.g., 10 - 30 data points). For comparatively rare CN events (i.e. less than 25 -30% of test samples), the smoothing described above does not substantially alter the final spatial resolution, as the CN aberrations will be suppressed by the same 25 - 30% thus preserving the high frequency component.
Tumor DNA was obtained from a de-identified series of 25 pediatric malignant glioma tissue samples (Children's Oncology Group ACNS0423 study). Tissue accrual for the current study was coordinated by the Pediatric Branch of the Cooperative Human Tissue Network (CHTN). This study was approved by Institutional Review Board (IRB) for the University of Pittsburgh. Patients included children with malignant gliomas arising outside the brainstem treated with surgery, field irradiation, and chemotherapy using temozolomide administered on a daily schedule during irradiation, and in conjunction with lomustine after irradiation. Eligibility for the study required institutional IRB approval as well as central review of the histopathology, and confirmation of a diagnosis of glioblastoma, anaplastic astrocytoma, or gliosarcoma.
Samples were processed and data was collected using the Affymetrix Mapping 250K Sty microarray chip using standard procedures  that have been optimized in the UPCI Clinical Genomics Facility (detailed methods are described in Additional file 1). All data have been deposited in the Gene Expression Omnibus (Accession number GSE25589). Due to the specifics of the sample collection protocol, paired normal samples (blood DNA) were not available for analysis. Also, because of the sample preservation method (FFPE), using the HapMap data as a normal reference set proved difficult due to the very high noise level, likely from the difference in preservation protocols between the glioma samples (FFPE) and HapMap (frozen). We therefore used an in-house processed FFPE reference set, which was used to validate the results obtained through the VN approach, as described. We believe that having a sample set without an ideal reference set is a universal problem in cancer translational research and highlights the potential usefulness and practicality of the VN method we describe.
Generating the VN Reference Signal
In preparation for generating the VN reference signal, the intensities of test and reference samples were preprocessed within the aroma.affymetrix framework  using the latest CRMAv.2 method [11, 12]. During the preprocessing step, the sample intensities were corrected for allelic crosstalk, normalized for nucleotide-position probe sequence effects, summarized and finally corrected for the PCR fragment length. We implemented our own GC content correction using the approach described by Diskin and colleagues , modifying it slightly so the regions of high amplification/deletion were also included into the GC wave parameter estimation. All processing was done on the total intensities and aimed to total raw CN estimates, but the results are also applicable for allele-specific analysis.
Let I t i (p) and I r j (p) be the signal intensities for test (t) and reference set (r), where i = 1.N and j = 1. M is a sample number for test (i) and reference (j) sets respectively and p is a genomic position. For the segmentation-based version of the algorithm, the VN reference signal is generated as follows:
where I t avg and I r avg are intensity medians computed for each chromosomal position p over the whole test and reference sets respectively
Perform the GC correction on CNavg signal using the modified PennCNV .
where S(p) is a piece-wise constant function of the chromosomal position and M(CNavg s (p)) is a mean of (log2) ratio of two averages from (2) at the segment s for the range of chromosomal positions p.
where p is a chromosomal position, I vn (p) is a virtual reference intensity signal, I t avg is an average intensity of tumor samples and S(p) is a segmented average CN from (2)
where the rawCNi is a raw CN signal for sample I, Ii is an intensity for tumor I and Ivn is an intensity of the modified tumor intensity average signal (the virtual reference) from equation (4).
where F is a filtering operator (moving Gaussian smoothing window for example) and w is a width of smoothing window.
The trade-off of this approach is that the loss of resolution due to the filtering leaves the narrow peaks in the virtual reference signal uncorrected, which in turn suppresses the peaks in each corresponding test sample exhibiting such a peak. In practice we found that generating the virtual reference based on a set of normals from the same lab requires the same filtering window width as when using a standard approach. Moreover, the peak suppressing effect largely goes away when the CN aberration event is not present in all test samples, which is a typical situation. In practice, the peak suppression of very small CN regions did not exceed 15% - 20% and in most instances was undetectable, which allows using the noise reduction obtained through the VN to full advantage.
Methods for Implementation and Validation
The algorithm was implemented in R as an add-on to the aroma.affymetrix extendable open source CNV analysis package . The main workflow remained comparable to the conventional scenario of using an unpaired reference set, with the major exception relating to the method of obtaining the reference signal. Instead of using the average of a normal reference set signal, the VN reference is obtained as described above. Another difference with the aroma.affymetrix workflow is a correction for GC content which is implemented based on ideas from the PennCNV package  with a modification allowing the inclusion of the amplified/deleted regions into the GC curve fitting process, thus improving the robustness of correction in case the samples have a high percentage of CN aberrations. The VN reference generation algorithm also uses this version of the GC correction. While the authors of the aroma.affymetrix package suggest that for high-density chips (250K and above) GC correction is not necessary, we found it to be very beneficial in our case, especially during processing of FFPE samples using the frozen reference set as a template for VN.
We assessed the validity of this method in several ways. First, using a data set from nine tumor/normal pairs (available from Affymetrix; see Additional file 1), we compared the validity and performance of the method with the standard unpaired and paired approach as outlined below. This data set was chosen for a preliminary test due to the high signal-to-noise ratio as well as known SNP CN aberration positions. Next, the level of noise and the quality of segmentation of the glioma samples under consideration were assessed using the HapMap normal samples as a reference set. A conventional CN analysis using the tumor and in-house lab FFPE sample set (5 normal FFPE blood samples) was then used to validate the VN algorithm on the HapMap results. We also constructed a second VN reference signal using the same normal in-lab sample set as a template and the results were compared to assess the usability of a VN with Hapmap. Further, the filtering version of the VN algorithm was applied to generate a reference signal based on the HapMap template and the gain in signal-to-noise ratio was gauged.
To validate the VN results directly, genomic DNA was analyzed using Applied Biosystems TaqMan® Copy Number Assays (chromosome 4: Hs04800686_cn; chromosome 15: Hs05319670_cn, Hs01691525_cn) with the Applied Biosystems StepOnePlus system. Each sample was analyzed in quadruplicate as per the manufacturer's instructions (ΔΔCT method). The data was imported into Applied Biosystems CopyCaller™ software and the copy number for chromosomes 4 and 15 was calculated. The results are shown as an average of three independent experiments +/- S.E.M.
Validation using Tumor/Normal Paired Samples
Application to Pediatric Glioma Using FFPE Samples
Comparison of a standard deviation (SD) and signal to noise (S/N) ratio for a raw CN signal obtained using different types of references. Here SD is an average of all SDs obtained for each raw CN segmented region in log2 scale, and S/N is computed as a ratio of the difference between the raw CN values corresponding to CN = 3 and CN = 2 in log2 scale and the SD value defined above
FFPE Tumor vs:
Blood normals (frozen)
S/N ratio for CN = 3 (log2)
Application to higher density arrays
Initially we were motivated to develop this algorithm for the particular study on the Affymetrix 500K platform due to the absence of the adequate normal FFPE reference set, i.e. under the most unfavorable conditions. However, the algorithm is generic enough to be applied to a wide range of platforms which utilize the normal sample set to generate a reference signal. In particular, it was successfully tested during the renal carcinoma studies (frozen tumor and normal samples) performed on the Affymetrix Genome-Wide 6.0 platform (see Additional file 1, Figure S7 A, B and C). Here we show the exemplary raw CN and segmentation processing results for a renal carcinoma sample on Chr. 2 which utilize three different sources of the reference signal: a) HapMap normal set (Additional file 1, Figure S7a); b) in-lab normal set (Additional file 1, Figure S7b) and c) VN using HapMap as a template (Additional file 1, Figure S7c). The reference signal generated using VN algorithm allowed to achieve the best signal-to-noise ratio compared to the traditional methods even when in-lab normal reference set with matching preservation protocol was available.
We have demonstrated that it is possible to construct a reference signal directly from the tumor sample intensities using the normal sample reference set as a template and that such a reference has the properties to minimize the noise in the resulting raw CN signal. The proposed algorithm allows minimizing the negative effect of inter-batch or inter-lab variability and is generic enough to be applicable to a variety of platforms which utilize the reference signal for processing. When the test sample preservation protocol differs from the one for the available normal reference set or when the in-lab reference set is unavailable, the described approach could likely be an ideal option to produce reliable total CN estimates. We did not investigate the applicability of the method for allele-specific CN estimates, but it's likely it could be used in such instances as well.
We have discussed the tradeoffs between segmentation based and filter based VN versions. Our primary goal was to accommodate the different signal-to-noise ratios between two average intensities so to achieve the maximum spatial resolution while minimizing FP/FN events. We believe, that using the wavelet based filtering similar to CDF 5/3 [15, 16], would help to achieve an optimal compromise between a low FN/FP rate as for filtering based VN and the highest possible spatial resolution. We are working on incorporating such filtering as an option for the VN algorithm.
The recent trend in CNV detection algorithms is to use the total and allele-specific CN estimates and the genotyping call results together to obtain the more reliable CN event detection . We plan to work on such extension of the aroma.affymetrix package including the VN into the processing pipeline in a near future.
Finally, during evaluation, the VN algorithm demonstrated a better signal-to-noise ratio than paired analysis. Still, in most cases, the paired analysis is preferable as it allows exclusion of CN aberrations that are not disease-specific. However, in rare cases, the paired normal may have disease specific aberrations if obtained in close proximity to the tumor. We suggest that using both methods applied to the same dataset would provide more information on the disease related CN aberrations, than each method applied alone.
The algorithm was implemented in R language as an add-on to the open source aroma.affymetrix package but can be easily adapted for use with other packages. The algorithm source code as well as the GC correction module along with necessary modifications in aroma.affymetrix package is readily available for download at the following URL - http://lnx02.dbmi.pitt.edu/download/Aroma%20Extensions.zip.
Availablility and Requirements
Project name: Virtual Normals
Project home page: http://lnx02.dbmi.pitt.edu/download/Aroma%20Extensions.zip
Operating system(s): Platform independent.
Programming language: R
Other requirements: R 2.11 or higher, aroma.affymetrix 1.5.0 or higher.
License: Please contact corresponding author (RWS) for commercial license. No license required for academic use.
Any restrictions to use by non-academics: License needed for commercial use.
- (algorithm described herein) CN:
copy number alterations
formalin-fixed paraffin embedded sample preservation protocol
GC content: guanine-cytosine content
- GC wave:
variation in hybridization intensity associated with the GC content
This work was supported by the American Cancer Society [RWS] [RSG-05-246-01-GMC] and the National Institutes of Health [RWS] [CA132385; GM087798; CA148629] and [IFP] [NS37704]. Funding was also provided to RWS from the St. Baldrick's Foundation and the National Childhood Cancer Foundation.
- Zhao X, Li C, Paez JG, Chin K, Janne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C, et al: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res. 2004, 64 (9): 3060-3071. 10.1158/0008-5472.CAN-03-3308.View ArticlePubMed
- Freire P, Vilela M, Deus H, Kim YW, Koul D, Colman H, Aldape KD, Bogler O, Yung WK, Coombes K, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PLoS ONE. 2008, 3 (12): e4076-10.1371/journal.pone.0004076.PubMed CentralView ArticlePubMed
- Gai X, Perin JC, Murphy K, O'Hara R, D'Arcy M, Wenocur A, Xie HM, Rappaport EF, Shaikh TH, White PS: CNV Workshop: an integrated platform for high-throughput copy number variation discovery and clinical diagnostics. BMC Bioinformatics. 2010, 11: 74-10.1186/1471-2105-11-74.PubMed CentralView ArticlePubMed
- Yamamoto G, Nannya Y, Kato M, Sanada M, Levine RL, Kawamata N, Hangaishi A, Kurokawa M, Chiba S, Gilliland DG, et al: Highly sensitive method for genomewide detection of allelic composition in nonpaired, primary tumor specimens by use of affymetrix single-nucleotide-polymorphism genotyping microarrays. Am J Hum Genet. 2007, 81 (1): 114-126. 10.1086/518809.PubMed CentralView ArticlePubMed
- Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, et al: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res. 2005, 65 (14): 6071-6079. 10.1158/0008-5472.CAN-05-0465.View ArticlePubMed
- Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics. 2008, 24 (16): i139-145. 10.1093/bioinformatics/btn272.View ArticlePubMed
- Pique-Regi R, Monso-Varona J, Ortega A, Seeger RC, Triche TJ, Asgharzadeh S: Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics. 2008, 24 (3): 309-318. 10.1093/bioinformatics/btm601.PubMed CentralView ArticlePubMed
- Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23 (6): 657-663. 10.1093/bioinformatics/btl646.View ArticlePubMed
- Lyons-Weiler M, Hagenkord J, Sciulli C, Dhir R, Monzon F: Optimization of the Affymetrix GeneChip Mapping 10K 2.0 Assay for routine clinical use on formalin-fixed paraffin-embedded tissue. Diagnostic Molecular Pathology. 2008, 17 (1): 3-13.PubMed
- aroma.affymetrix. [http://www.aroma-project.org/]
- Bengtsson H, Ray A, Spellman P, Speed TP: A single-sample method for normalizing and combining full-resolution copy numbers from multiple platforms, labs and analysis methods. Bioinformatics. 2009, 25 (7): 861-867. 10.1093/bioinformatics/btp074.PubMed CentralView ArticlePubMed
- Bengtsson H, Wirapati P, Speed TP: A single-array preprocessing method for estimating full-resolution raw copy numbers from all Affymetrix genotyping arrays including GenomeWideSNP 5 & 6. Bioinformatics. 2009, 25 (17): 2149-2156. 10.1093/bioinformatics/btp371.PubMed CentralView ArticlePubMed
- Diskin SJ, Li M, Hou C, Yang S, Glessner J, Hakonarson H, Bucan M, Maris JM, Wang K: Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 2008, 36 (19): e126-10.1093/nar/gkn556.PubMed CentralView ArticlePubMed
- Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M: PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007, 17 (11): 1665-1674. 10.1101/gr.6861907.PubMed CentralView ArticlePubMed
- Vonesch C, Unser M: A fast thresholded landweber algorithm for wavelet-regularized multidimensional deconvolution. IEEE Trans Image Process. 2008, 17 (4): 539-549. 10.1109/TIP.2008.917103.View ArticlePubMed
- Vonesch C, Unser M: A fast multilevel algorithm for wavelet-regularized image restoration. IEEE Trans Image Process. 2009, 18 (3): 509-523. 10.1109/TIP.2008.2008073.View ArticlePubMed
- Yau C, Holmes CC: CNV discovery using SNP genotyping arrays. Cytogenet Genome Res. 2008, 123 (1-4): 307-312. 10.1159/000184722.View ArticlePubMed
- The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1755-8794/4/14/prepub