General Approach
Reducing the noise level by exploiting similarities between the test and reference set has been successful in at least two empirical studies [4, 5]. In one study [4], the entire data set (test as well as normal data if available) is acquired in the lab to build a reference set. While noise reduction is achieved, this approach suppresses the CNA common for all samples in a data set, making it impractical for systematic CN aberration analysis. The second employs a CN detection package in CNAG [5], selecting the best from the pool of normal samples based on similarity of correlation function between a particular test sample and a given set of normal samples. While the noise level is somewhat reduced, testing this approach in CNAG as well as in our own implementation of a similar selection algorithm (data not shown) shows that sometimes, the reference set selected using this method may be very small (up to 1 - 2 samples) and therefore any CNA present in a normal reference will introduce false CNA in the tumor CN signal. Nevertheless, exploiting the natural similarity between test and reference set is an attractive approach if implemented carefully.
Methodological Considerations
Consider a scenario where the reference signal is computed as an average (median) of tumor intensity signals instead of traditional average of normal reference samples intensities. Such signal would preserve the high frequency component common for all tumor samples in a given batch and therefore if used as a denominator in computing the (log2) ratio in place of the standard reference would decrease the noise this high frequency component is responsible for. However this reference signal would carry the CN variations common for the given tumor sample set and therefore, if used unmodified, would suppress these common variations for each particular tumor sample where they are present and cause the false CN variations for the rest of the tumors. To overcome this problem, we can compute the (log2) ratio between this synthetic reference signal and normal reference set based signal (treating these two references as a test/normal pair) and then employ the mean preserving segmentation or filtering technique to detect the common CN variations. If, then, we use the information about the position and amplitude of the common CN variations to modify the tumor based reference signal in such a way that these common CN variations go away when tested against the traditional reference signal, we would obtain a synthetic reference signal which has two key properties:
-
a)
Its high frequency component is highly correlated with the one common for all tumor samples in the given batch and therefore the noise in (log2) ratio between given tumor raw CN and the reference signal is reduced.
-
b)
The reference signal does not contain the CN aberrations common for the part or the whole tumor sample set, which means that when each individual tumor sample is compared to the reference signal, the aberration will be picked up for this particular tumor.
The whole process is illustrated on the Figure. 1, where Figure 1A shows the raw CN signal for a single FFPE tumor sample obtained using the traditional reference signal based on frozen HapMap normal set, the Figure 1B shows the raw CN segmentation results for log2 ratio between the tumor and normal based reference signals and Figure 1C shows the log2 ratio between the final transformed synthetic reference and normal based reference signals. Note how the common tumor CN variations got removed from the synthetic reference set. The Figure 1D shows the raw CN and the segmentation results for this tumor sample using the synthetic reference and illustrate significant (> 2 times) noise reduction compared with traditional approach (Figure 1A). The original normal reference set is used here as a template to construct an optimal reference signal and will be referred to further as a template reference set when applicable.
Unbiased Reference Signal Restoration
Depending on the signal-to-noise ratio between averages, we can employ two basic strategies for detecting and correcting for shifted segments. If the signal-to-noise ratio is high enough so that the false positive (FP)/false negative (FN) detection is low, we can use the direct segment detection, thus achieving the highest possible resolution. Shifting the segments back is a problem requiring detection of points of discontinuity of signal and calculating the mean over the segment. There are very few efficient segmentation algorithms available [6–8], which preserve a segment mean and therefore can be used to accomplish the task. The performance of a segmentation algorithm in this case should be higher than on a single sample, as the ratio of averages would not include the instrument noise.
If the noise level is too high so that FP/FN segment detection is likely, then instead of direct segment detection, the low-pass filtering or window smoothing is used for detecting the shift pattern, followed by correcting the original average intensity signal for an abovementioned shift pattern. With a proper choice of a smoothing window width, the problem of FP/FN detection presented in the first method is eliminated. In practice, the width of a smoothing window is comparable with one used for smoothing in standard methods (e.g., 10 - 30 data points). For comparatively rare CN events (i.e. less than 25 -30% of test samples), the smoothing described above does not substantially alter the final spatial resolution, as the CN aberrations will be suppressed by the same 25 - 30% thus preserving the high frequency component.
Data Set
Tumor DNA was obtained from a de-identified series of 25 pediatric malignant glioma tissue samples (Children's Oncology Group ACNS0423 study). Tissue accrual for the current study was coordinated by the Pediatric Branch of the Cooperative Human Tissue Network (CHTN). This study was approved by Institutional Review Board (IRB) for the University of Pittsburgh. Patients included children with malignant gliomas arising outside the brainstem treated with surgery, field irradiation, and chemotherapy using temozolomide administered on a daily schedule during irradiation, and in conjunction with lomustine after irradiation. Eligibility for the study required institutional IRB approval as well as central review of the histopathology, and confirmation of a diagnosis of glioblastoma, anaplastic astrocytoma, or gliosarcoma.
Samples were processed and data was collected using the Affymetrix Mapping 250K Sty microarray chip using standard procedures [9] that have been optimized in the UPCI Clinical Genomics Facility (detailed methods are described in Additional file 1). All data have been deposited in the Gene Expression Omnibus (Accession number GSE25589). Due to the specifics of the sample collection protocol, paired normal samples (blood DNA) were not available for analysis. Also, because of the sample preservation method (FFPE), using the HapMap data as a normal reference set proved difficult due to the very high noise level, likely from the difference in preservation protocols between the glioma samples (FFPE) and HapMap (frozen). We therefore used an in-house processed FFPE reference set, which was used to validate the results obtained through the VN approach, as described. We believe that having a sample set without an ideal reference set is a universal problem in cancer translational research and highlights the potential usefulness and practicality of the VN method we describe.
Generating the VN Reference Signal
In preparation for generating the VN reference signal, the intensities of test and reference samples were preprocessed within the aroma.affymetrix framework [10] using the latest CRMAv.2 method [11, 12]. During the preprocessing step, the sample intensities were corrected for allelic crosstalk, normalized for nucleotide-position probe sequence effects, summarized and finally corrected for the PCR fragment length. We implemented our own GC content correction using the approach described by Diskin and colleagues [13], modifying it slightly so the regions of high amplification/deletion were also included into the GC wave parameter estimation. All processing was done on the total intensities and aimed to total raw CN estimates, but the results are also applicable for allele-specific analysis.
Algorithm Description
Let I
t
i
(p) and I
r
j
(p) be the signal intensities for test (t) and reference set (r), where i = 1.N and j = 1. M is a sample number for test (i) and reference (j) sets respectively and p is a genomic position. For the segmentation-based version of the algorithm, the VN reference signal is generated as follows:
Compute the average signal for tumor test set and the normal reference set separately, so
(1a)
and
(1b)
where I
t
avg and I
r
avg are intensity medians computed for each chromosomal position p over the whole test and reference sets respectively
Compute the (log2) ratio of two averages producing the raw CN average signal
(2)
Perform the GC correction on CNavg signal using the modified PennCNV [13].
Perform the segmentation of the average signal using one of the segmentation algorithms which preserve the mean level of each signal [6–8], so
(3)
where S(p) is a piece-wise constant function of the chromosomal position and M(CNavg
s
(p)) is a mean of (log2) ratio of two averages from (2) at the segment s for the range of chromosomal positions p.
Using the segmentation results, modify the average tumor set intensity signal in such a way that the regions of the segmentation results deviating from the baseline (zero for log2 ratio) would get shifted back to the baseline, producing the virtual reference signal:
(4)
where p is a chromosomal position, I
vn
(p) is a virtual reference intensity signal, I
t
avg is an average intensity of tumor samples and S(p) is a segmented average CN from (2)
Use the resulting modified tumor average intensity signal as a new (virtual) reference when generating the raw CN signal for each tumor sample, i.e.
(5)
where the rawCNi is a raw CN signal for sample I, Ii is an intensity for tumor I and Ivn is an intensity of the modified tumor intensity average signal (the virtual reference) from equation (4).
As was mentioned before, one of the major drawbacks of using the segmentation technique in obtaining the reference signal is that the erroneous (false positive) detection of a segment would lead to a corresponding false peak in all samples run against such a reference set. A more conservative approach for eliminating such a possibility involves filtering the log2 ratio and using this filtered signal in place of a segmented signal. In this case the expression for S(p) (3) is given by the formula:
(3a)
where F is a filtering operator (moving Gaussian smoothing window for example) and w is a width of smoothing window.
The trade-off of this approach is that the loss of resolution due to the filtering leaves the narrow peaks in the virtual reference signal uncorrected, which in turn suppresses the peaks in each corresponding test sample exhibiting such a peak. In practice we found that generating the virtual reference based on a set of normals from the same lab requires the same filtering window width as when using a standard approach. Moreover, the peak suppressing effect largely goes away when the CN aberration event is not present in all test samples, which is a typical situation. In practice, the peak suppression of very small CN regions did not exceed 15% - 20% and in most instances was undetectable, which allows using the noise reduction obtained through the VN to full advantage.
Methods for Implementation and Validation
The algorithm was implemented in R as an add-on to the aroma.affymetrix extendable open source CNV analysis package [10]. The main workflow remained comparable to the conventional scenario of using an unpaired reference set, with the major exception relating to the method of obtaining the reference signal. Instead of using the average of a normal reference set signal, the VN reference is obtained as described above. Another difference with the aroma.affymetrix workflow is a correction for GC content which is implemented based on ideas from the PennCNV package [14] with a modification allowing the inclusion of the amplified/deleted regions into the GC curve fitting process, thus improving the robustness of correction in case the samples have a high percentage of CN aberrations. The VN reference generation algorithm also uses this version of the GC correction. While the authors of the aroma.affymetrix package suggest that for high-density chips (250K and above) GC correction is not necessary, we found it to be very beneficial in our case, especially during processing of FFPE samples using the frozen reference set as a template for VN.
The segmentation of the resulting raw CN signal as well as a segmentation-based version of the VN algorithm was implemented using the Sparse Bayesian Learning-based GADA package [7]. The block-scheme of the algorithm is presented in Figure 2.
We assessed the validity of this method in several ways. First, using a data set from nine tumor/normal pairs (available from Affymetrix; see Additional file 1), we compared the validity and performance of the method with the standard unpaired and paired approach as outlined below. This data set was chosen for a preliminary test due to the high signal-to-noise ratio as well as known SNP CN aberration positions. Next, the level of noise and the quality of segmentation of the glioma samples under consideration were assessed using the HapMap normal samples as a reference set. A conventional CN analysis using the tumor and in-house lab FFPE sample set (5 normal FFPE blood samples) was then used to validate the VN algorithm on the HapMap results. We also constructed a second VN reference signal using the same normal in-lab sample set as a template and the results were compared to assess the usability of a VN with Hapmap. Further, the filtering version of the VN algorithm was applied to generate a reference signal based on the HapMap template and the gain in signal-to-noise ratio was gauged.
To validate the VN results directly, genomic DNA was analyzed using Applied Biosystems TaqMan® Copy Number Assays (chromosome 4: Hs04800686_cn; chromosome 15: Hs05319670_cn, Hs01691525_cn) with the Applied Biosystems StepOnePlus system. Each sample was analyzed in quadruplicate as per the manufacturer's instructions (ΔΔCT method). The data was imported into Applied Biosystems CopyCaller™ software and the copy number for chromosomes 4 and 15 was calculated. The results are shown as an average of three independent experiments +/- S.E.M.