In this study, we evaluated three commonly used normalization approaches in batch effect correction on Illumina Infinium methylation data. We first used the technical replicates included in Dataset 1 to assess the reduction of technical artifacts from these normalizations. Compared to raw average β, we observed each of the normalizations reduced the errors measured by average absolute deviation from zero (Figure 1D and 1E). Lumi outperformed QNβ and ABnorm for this dataset with minor batch effects. However, when a dataset had more obvious batch effects such as Dataset 2 and 3, lumi normalization became less effective, leaving a larger portion of batch effects intact. In this case, QNβ and ABnorm appeared more effective. ABnorm was less aggressive than QNβ as it performed at signal intensities.
Unlike gene expression microarray where normalization is routine practice, normalization for Infinium methylation data has been controversial. One reason is that for a particular CpG the same dye is used for both methylated and unmethylated probes and some believe no dye bias adjustment is needed . Additionally, methylation measure is a ratio between methylated and unmethylated signal within a sample and it is presumed less affected by technical issues. The concern as to the appropriateness of normalization assumptions for methylation data is also raised . However, growing evidence suggests that technical issues can affect methylated and unmethylated signals differently and the magnitude of these effects is too large to be ignored. For the datasets (Dataset 2 and 3) in this study, we not only observed the clear technical artifacts from batches, but also some variations between samples in the same batch that were not explained by underlying biology. This was further evidenced by the clear differences between technical replicates in Dataset 1. When M-A plots were investigated, these systematic biases showed an intensity dependent manner (Additional file 1). All these suggest the necessity of normalization. In evaluating methylation profiles of multiple individuals from the same or different tissues, a study showed that the majority of CpG methylation patterns were conserved , which implies that the quantile normalization assumption that the majority of markers in an experiment are not differentially expressed and should have similar distributions largely holds for methylation data, particularly when it is conducted in the same or similar tissue type. This was also supported from our presented data. For example, in Dataset 1, the variation of each CpG across 83 individuals was very small (standard deviation < 0.05 for over 95% of 27,578 CpGs). When differential methylation was conducted between case and control samples, only a small proportion (2.2%) of CpGs was with p value < 0.01. Even for the case of tumor and normal samples where a large difference is expected in Dataset 3, the differential CpGs accounted for < 10% of total (after normalization and batch correction). QNβ is an aggressive normalization approach yet could be very effective if the underlying assumption of β value distribution across samples hold, particularly when technical artifacts are severe. Normalization at signal intensities would correct the root cause of signal intensity bias and lead more accurate adjustment as the transformation from signal intensities to average β is a non-linear process and the sources of the bias become obscure once transformed. The differences between lumi and ABnorm are that lumi first conducts color channel adjustment (smooth quantile) and then pools methylated and unmethylated probes together for quantile normalization, assuming the pooled probes have similar distribution across samples. ABnorm assumes the distribution of A or B signals should be similar across samples. Although lumi normalization performed better for the dataset with minor batch effects such as for Dataset 1, it was the least effective to remove obvious distribution bias as seen in Dataset 2 and 3. ABnorm, on the other hand, could effectively correct the distribution biases for both datasets and its performance was between QNβ and lumi.
In spite of the clear benefit from each of the normalizations, it is worth pointing that normalization can be a double-edged sword, which not only does good but also harm when used improperly. Investigators should make sure the samples to be normalized together are "similar" enough (i.e., the same or similar type of tissue or patients in the similar characteristics) so that biological signals will not be compromised through normalization. Evaluation of the known "positive" (hypermethylated) and "negative" (no methylation) CpG methylation patterns in the data may help assess the effectiveness and potential bias from normalization . For example, we applied this approach to Dataset 3 where many genes are known to be hypermethylated in prostate cancer compared to normal prostate. Eighty-five CpGs from 14 hypermethylated genes [28, 29] and 358 CpGs from a list of house-keeping genes whose CpGs are presumably not methylated  were selected for differential testing. Normalization and normalization/EB correction significantly increased the numbers of differentially methylated positive CpGs compared to unnormalized data (up to 60% increase) yet the numbers of significant negative CpGs remained in the expected type I error rate (< expected 18 and p values were uniformly distributed, Additional file 2).
Batch effects appear a common phenomenon in high throughput genomic data as described recently . The batch affected probes or markers tend to behave differently under different conditions and most global normalizations are not able to correct such irregular probes. For example, after normalization, 24-46% of total CpGs remained associated with batch effects in Dataset 2 and 3 depending on which normalization was used (Table 1), which makes further batch removal mandatory. Several batch correction methods have been developed for gene expression microarray, which include the EB approach applied in this work, distance weighted discrimination (DWD) , mean-centering , geometric ratio-based method , and surrogate variable analysis (SVA) . The performance of these methods in gene expression microarray has been extensively evaluated . The similar characteristics of batch effects in methylation to gene expression suggest that these methods can be applied to methylation array. Their usage is mainly dictated by their flexibility of handling the unique methylation data and performance in gene expression microarray. Methylation average β values are bounded from 0 to 1. Algorithms that generate data out of the bounds after batch correction are problematic. Besides the factor of study interest, methylation can be affected by other factors such as age and gender. These factors may need to be taken into consideration while batch correction is performed for accurate adjustment. Batch correction methods that can not incorporate these factors are less ideal. None of the aforementioned methods except the EB approach have this flexibility. The EB approach has been shown outperforming all other methods in terms of precision and accuracy  because it models both additive and multiplicative effects. What makes EB algorithm more attractive is that it corrects batch effects by borrowing information across CpGs and experimental conditions, which leads to better estimation of batch parameters and more stable adjustment, particularly when the number of samples in each batch is small. This is particularly useful to the Illumina Methylation BeadChip data, as we often see batch effects occurring at chip level, 12 samples a unit. Our data indicated that it not only successfully removed the remaining batch effects but also significantly increased biological signal detection. Even for the data with minor batch effects, it can further reduce technical variability on the top of normalization (Figure 1F). As it is an R function, it can be easily integrated with other packages such as "lumi" and other normalization procedures.
For an illustration, we applied the DWD approach to our Dataset 2 and 3. As expected, we found that it not only corrected batch effects but also removed certain biological signals. The numbers of differentially methylated CpGs associated with the study of interest were all lower compared to the EB corrected data regardless of normalization method used (Additional file 3). Similar to gene expression data, the DWD could change the scale of the methylation data, making the normalized data not directly comparable. It needed to be run two batches at a time and it would be cumbersome when multiple batches are involved.
It should be emphasized that a good study design is critical for evaluation and correction of batch effects. Samples from different study groups should be randomly or evenly allocated to different batches. If batches and study groups are totally confounded, i.e., one group of samples allocated on one batch and another group of samples on another, it would be impossible to separate batch from biological effects. Normalization might be the only option that can be exercised. With balanced design, batch factor and covariates such as sample group of study can be incorporated for more accurate adjustment.
The procedure of normalization and batch correction in this work is sequential, i.e., normalization is conducted first and batch correction is added subsequently. A comprehensive model that incorporates normalization, batch effect correction, and other important factors such as tissue heterogeneity might lead to a better result as proposed for gene expression microarray . Although this evaluation was based on the Human Methylation27 platform, we expect these principles should be applicable to the new release of the 450 k Infinium Methylation Beadchip as both platforms use the similar Infinium methylation assay. To our experience, this bead based platform is sensitive to batch effects.