Evaluation of the imputation performance of the program IMPUTE in an admixed sample from Mexico City using several model designs

Background: We explored the imputation performance of the program IMPUTE in an admixed sample from Mexico City. The following issues were evaluated: (a) the impact of different reference panels (HapMap vs. 1000 Genomes) on imputation; (b) potential differences in imputation performance between single-step and two-step (phasing and imputation) approaches; (c) the effect of different INFO score thresholds on imputation performance; and (d) imputation performance in common vs. rare markers.
Methods: The sample from Mexico City comprised 1,310 individuals genotyped with the Affymetrix 5.0 array. We randomly masked 5% of the markers directly genotyped on chromosome 12 (n = 1,046) and compared the imputed genotypes with the microarray genotype calls. Imputation was carried out with the program IMPUTE. The concordance rates between the imputed and observed genotypes were used as a measure of imputation accuracy and the proportion of non-missing genotypes as a measure of imputation efficacy.
Results: The single-step imputation approach produced slightly higher concordance rates than the two-step strategy (99.1% vs. 98.4% when using the HapMap phase II combined panel), but at the expense of a lower proportion of non-missing genotypes (85.5% vs. 90.1%). The 1000 Genomes reference sample produced similar concordance rates to the HapMap phase II panel (98.4% for both datasets, using the two-step strategy). However, the 1000 Genomes reference sample substantially increased the proportion of non-missing genotypes (94.7% vs. 90.1%). Rare variants (<1%) had lower imputation accuracy and efficacy than common markers.
Conclusions: The program IMPUTE had an excellent imputation performance for common alleles in an admixed sample from Mexico City, which has primarily Native American (62%) and European (33%) contributions. Genotype concordances were 98.4% or higher using all the imputation strategies, in spite of the fact that no Native American samples are present in the HapMap and 1000 Genomes reference panels. The best balance of imputation accuracy and efficacy was obtained with the 1000 Genomes panel. Rare variants were not captured effectively by any of the available panels, emphasizing the need to be cautious in the interpretation of association results for imputed rare variants.


Background
Genome-wide association studies (GWAS) are a convenient and powerful tool for the identification of common genetic variants associated with complex diseases [1][2][3][4][5]. In recent years, high-density GWAS have proven successful in identifying loci predisposing to a variety of complex diseases, e.g., type 1 and type 2 diabetes, obesity, inflammatory bowel disease, prostate cancer and breast cancer [5,6]. The recent successes of GWAS have mainly been possible due to the rapid advancement in high-throughput SNP genotyping technologies (e.g., Affymetrix and Illumina platforms), which assay a large number of SNPs (between 100,000 and 2,500,000) across the human genome [7][8][9]. However, despite recent improvements, the coverage of most of the genotyping platforms remains relatively inadequate, in comparison with the total number of SNPs described in the genome. Furthermore, rare variants are typically not included in these genotyping arrays and a fraction of the typed SNPs are eliminated from further analyses, due to genotyping problems, leading to the loss of statistical power in association studies [10][11][12][13].
To overcome the aforementioned limitations of GWAS genotyping platforms, a variety of imputation methods have been developed. These methods infer missing or untyped SNP genotypes based on the genotypes at nearby typed SNPs, using the pattern of linkage disequilibrium (LD) observed in reference samples. Imputation methods have been extensively used to predict the genotypes of untyped markers by combining reference panels of individuals genotyped at a dense set of SNPs with a study sample genotyped at a subset of the SNPs [14,15]. The main challenge of imputation, however, lies in the selection of an appropriate reference panel relevant for the study populations. Although this is straightforward in samples with ancestry matching that of the available reference panels (e.g., European or East Asian ancestry), this is not the case for samples that are not well represented in the reference panels (e.g. Native American samples or admixed samples). One of the proposed solutions to the latter scenario is to include mixtures of the available reference panels for imputation. It has been described that this strategy results in good imputation accuracy [16].
In the present study, we employed the HapMap and the recently available 1000 Genomes reference panels to evaluate the performance of the imputation program IMPUTE in an admixed sample from Mexico City. The following issues were evaluated in this project: (a) the impact of different reference panels (HapMap and 1000 Genomes) on imputation; (b) potential differences in imputation performance between single-step and two-step (phasing and imputation) approaches; (c) the impact of different INFO score thresholds on imputation performance; and (d) imputation performance in common vs. rare markers.

Study participants and Genotyping
A total of 1,310 individuals from Mexico City (967 with type 2 diabetes and 343 with normal glucose tolerance) were analyzed in this study. Informed consent was obtained from each participant, and the research was approved by the ethical research boards of the Medical Center 'Siglo XXI' and the University of Toronto. Genotyping of the sample was then carried out in the microarray analysis facility located in the Centre for Applied Genomics (Toronto, ON, Canada), using the Affymetrix Genome-wide Human SNP array 5.0 (Affymetrix, Santa Clara, CA, USA), and following standard protocols. Further details about participant recruitment and quality control measures can be found elsewhere [30].

Reference panels for imputation
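The following reference panels were used for the present study: the HapMap phase II combined panel, the HapMap phase II combined panel plus the HapMap phase III Mexican-American (MXL) sample, and the 1000 Genomes Phase I panel (June 2011 release).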

Imputation using IMPUTE
The programs IMPUTE v1 and v2 were employed for imputation of untyped markers. IMPUTE v1 [19] was used for analysis with the HapMap phase II combined and the HapMap phase II combined + HapMap phase III Mexican-American reference datasets and IMPUTE v2 [20] was used with the HapMap Phase II combined and the 1000 Genomes Phase I (June 2011 release) reference panels. With IMPUTE v1 we performed phasing and imputation in a single analytical step. With IMPUTE v2 we used a two-step approach, phasing the study sample first and performing imputation using the reference samples in a second stage. In order to evaluate the performance of the imputation, we randomly masked 5% of the markers directly genotyped on chromosome 12 (n = 1,046) and compared the imputed genotypes with the Affymetrix genome-wide Human SNP array 5.0 genotype calls. For analysis using IMPUTE v1, chromosome 12 was divided into chunks of 15 Mb length (chunk size specified using the -int option). Each chunk was then directly imputed with the following settings: buffer = 250 kb, k = 40, iter = 30, burnin = 10, Ne = 11418, using the said reference panels.
The -buffer option helps to avoid edge effects when imputing in relatively small chunks.
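The masking step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `mask_markers`, the fixed random seed and the made-up marker IDs are hypothetical and are not taken from the study.

```python
import random


def mask_markers(marker_ids, fraction=0.05, seed=0):
    """Randomly choose a fraction of markers to hide before imputation.

    Returns (kept, masked). The masked markers are removed from the input
    to the imputation program and later compared with the array calls.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_masked = round(len(marker_ids) * fraction)
    masked = set(rng.sample(marker_ids, n_masked))
    kept = [m for m in marker_ids if m not in masked]
    return kept, sorted(masked)


# Hypothetical marker IDs, used only to exercise the function.
markers = [f"rs{i}" for i in range(1, 1001)]
kept, masked = mask_markers(markers)
print(len(kept), len(masked))
```

Sampling without replacement guarantees that the masked and retained sets do not overlap, so every masked marker is genuinely untyped from the point of view of the imputation run.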
For analysis using IMPUTE v2, chromosome 12 was broken into smaller chunks of ~5 Mb each, and we also used a buffer region of 250 kb. Phasing of GWAS data in each chunk was subsequently performed to produce the best-guess haplotypes (using the -phase and -include_buffer_in_output flags with the following IMPUTE v2 settings: k = 80, iter = 30, burnin = 10, Ne = 11500). Imputation from the best-guess haplotypes was then carried out, for each chunk, using the aforementioned reference panels. The differences in program versions and imputation settings between the one-step and two-step approaches are primarily due to the fact that the imputations were done at different times.
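The chunking scheme can be illustrated with a short sketch that splits a chromosome span into ~5 Mb core intervals and shows the 250 kb flanking buffer around each one. The function name `chunk_intervals` and the 12 Mb toy span are assumptions made for illustration; in practice the buffer is applied by the imputation program itself through its buffer option rather than computed by hand.

```python
def chunk_intervals(start_bp, end_bp, chunk_bp=5_000_000, buffer_bp=250_000):
    """Split [start_bp, end_bp] into consecutive core chunks.

    Returns a list of (core_lower, core_upper, buf_lower, buf_upper) tuples,
    where the buffered interval extends the core by buffer_bp on each side,
    clipped to the overall span.
    """
    chunks = []
    lower = start_bp
    while lower <= end_bp:
        upper = min(lower + chunk_bp - 1, end_bp)
        buf_lower = max(start_bp, lower - buffer_bp)
        buf_upper = min(end_bp, upper + buffer_bp)
        chunks.append((lower, upper, buf_lower, buf_upper))
        lower = upper + 1
    return chunks


# Toy 12 Mb span (not the real length of chromosome 12).
chunks = chunk_intervals(1, 12_000_000)
for core_lower, core_upper, buf_lower, buf_upper in chunks:
    print(f"core {core_lower}-{core_upper}  with buffer {buf_lower}-{buf_upper}")
```

Markers in the buffer contribute haplotype information to the chunk but lie outside its core interval, which is how edge effects at chunk boundaries are reduced.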

Evaluation of imputation performance
We report the concordance rate between the imputed and observed genotypes for the masked SNPs as a measure of imputation accuracy and the proportion of non-missing genotypes under a given INFO score threshold as a measure of the imputation efficacy. The program Gtool (http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html) was used for this purpose, using the default INFO score threshold value of 0.9 to export the IMPUTE data to PLINK format. With this threshold, imputed markers with INFO scores < 0.9 were labeled as missing genotypes. Then, PLINK's --merge command, along with the --merge-mode 7 option, was used to evaluate the genotype concordance. We also used PLINK to obtain information on the proportion of non-missing genotypes for each of the four imputation strategies (IMPUTE v1: HapMap phase II combined and HapMap phase II combined + MXL; IMPUTE v2: HapMap phase II combined and 1000 Genomes).
We also evaluated the imputation performance (accuracy and efficacy) at different INFO score threshold values (0.8, 0.7, 0.6 and 0.5), in addition to the default threshold value of 0.9. This analysis was carried out only for the two-step imputation method based on the HapMap phase II combined reference sample.
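The two performance measures and their dependence on the threshold can be sketched as follows. This is a minimal illustration, not the Gtool/PLINK workflow used in the study: the function name `evaluate`, the (genotype, score) representation and the toy data are assumptions, and the per-call score simply stands in for the confidence measure that is thresholded.

```python
def evaluate(imputed, observed, threshold=0.9):
    """Return (concordance, non_missing_proportion) for masked genotypes.

    imputed:  list of (genotype, score) pairs; calls with score below the
              threshold are treated as missing.
    observed: list of array genotypes in the same order.
    """
    called = [(g, o) for (g, score), o in zip(imputed, observed)
              if score >= threshold]
    non_missing = len(called) / len(observed)
    if not called:
        return float("nan"), non_missing
    concordance = sum(g == o for g, o in called) / len(called)
    return concordance, non_missing


# Invented toy data: five masked genotypes for illustration only.
imputed = [("AA", 0.99), ("AG", 0.95), ("GG", 0.85), ("AG", 0.60), ("AA", 0.92)]
observed = ["AA", "AG", "AG", "AA", "AA"]

for thr in (0.9, 0.8, 0.7, 0.6, 0.5):
    conc, nm = evaluate(imputed, observed, thr)
    print(f"threshold {thr}: concordance {conc:.3f}, non-missing {nm:.3f}")
```

Concordance is computed only over the calls that pass the threshold, so lowering the threshold can only add calls to the denominator; whether accuracy drops depends on how reliable the added low-confidence calls are.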
Finally, we explored the effect of allele frequency on imputation performance. The INFO scores based on the two-step imputation method using the HapMap phase II combined and the 1000 Genomes (June 2011 release) reference panels were compared for different allele frequency categories, grouping markers in 5% bins. Histograms were generated to show the distribution of the INFO scores for each bin and the distribution of the differences in INFO scores, and we estimated the correlation between the INFO scores for the two imputation approaches. We also did a more detailed analysis of imputation accuracy and efficacy for markers in the following allele frequency categories: <1%, 1-5% and 45-50%, using the two-step imputation method and the 1000 Genomes reference panel.
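Grouping markers into 5% allele frequency bins and averaging their INFO scores can be sketched as below. The sketch assumes that the bins refer to minor allele frequency (0-50%), rounds frequencies to the nearest 0.1% to avoid floating-point edge effects, and uses invented values; the function name `mean_info_by_bin` is hypothetical.

```python
from collections import defaultdict


def mean_info_by_bin(markers):
    """Average INFO score per 5% minor allele frequency bin.

    markers: iterable of (allele_frequency, info_score) pairs.
    Returns {(lower_pct, upper_pct): mean_info} for the non-empty bins.
    """
    totals = defaultdict(lambda: [0.0, 0])  # bin index -> [sum, count]
    for freq, info in markers:
        maf = min(freq, 1 - freq)              # fold onto the minor allele
        per_mille = int(round(maf * 1000))     # frequency in units of 0.1%
        idx = min(per_mille // 50, 9)          # ten 5% bins; 50% joins the last
        totals[idx][0] += info
        totals[idx][1] += 1
    return {(5 * i, 5 * i + 5): s / n for i, (s, n) in sorted(totals.items())}


# Invented (frequency, INFO) pairs for illustration only.
toy = [(0.004, 0.42), (0.03, 0.70), (0.02, 0.60), (0.47, 0.97), (0.52, 0.99)]
for (lo, hi), mean_info in mean_info_by_bin(toy).items():
    print(f"{lo}-{hi}%: mean INFO {mean_info:.3f}")
```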

Results
The concordance rates and the proportion of non-missing genotypes obtained with the four imputation strategies evaluated in this study are shown in Table 1. For this analysis, imputed genotypes with INFO scores lower than 0.9 were defined as missing genotypes. The concordance rate was used as a measure of the imputation accuracy and the proportion of non-missing genotypes as a measure of imputation efficacy. The concordance rate was consistently high (>98%) for all the imputation strategies, but there were differences between methods in imputation efficacy. The single-step strategy produced slightly higher concordance rates than the two-step strategy (e.g. 99.1% vs. 98.4%, respectively, when using the HapMap phase II combined reference sample), but at the expense of a lower proportion of non-missing genotypes (85.5% vs. 90.1%, respectively). The inclusion of the HapMap phase III Mexican American sample as a reference sample, in addition to the HapMap phase II combined sample, produced a marginal improvement both in concordance rate and proportion of non-missing genotypes (99.4% vs. 99.1% for concordance, and 85.9% vs. 85.5% for the proportion of non-missing genotypes, using the single-step approach). For the two-step approach, using the 1000 Genomes reference sample did not alter the concordance rate with respect to the HapMap phase II combined sample (98.4% for both datasets). However, the use of the 1000 Genomes panel produced a substantial increase in the proportion of non-missing genotypes (94.7% vs. 90.1%, respectively).
Table 1 Concordance rate and proportion of non-missing genotypes (using an INFO score threshold of 0.9) for chromosome 12 markers in the studied reference panels
Figure 1 illustrates the concordance rates and the proportion of non-missing genotypes obtained using various INFO score thresholds. This analysis provides an indication of how the selection of confidence thresholds affects the accuracy and efficacy of the imputations. We restricted this evaluation to the two-step protocol using the HapMap phase II combined sample. As expected, lowering the INFO score thresholds resulted in progressively reduced concordance rates and higher proportions of non-missing genotypes. Using a threshold of 0.9, the concordance rate was 98.4% and the proportion of non-missing genotypes 90.1%. Using a much less conservative threshold of 0.5, the concordance rate was still quite high (95.5%) and the proportion of non-missing genotypes went up to 99.6%.
Figure 2 depicts the average INFO scores for different allele frequency bins, using the two-step imputation methods based on the HapMap phase II combined and the 1000 Genomes phase I (June 2011 release) reference panels. This figure provides information about imputation quality across the allele frequency spectrum, based on the two reference panels. The average INFO scores obtained for the 1000 Genomes panel are substantially higher, irrespective of the allele frequencies, than those obtained for the HapMap phase II combined panel. It is also evident in the plot that rare alleles (frequencies < 5%) have considerably lower INFO scores than common alleles. In addition to average imputation qualities, it is also relevant to explore the distribution of INFO scores in each frequency bin. This is depicted in Figures 3A (for the HapMap Phase II combined reference sample) and 3B (for the 1000 Genomes phase I panel). These figures show that for most frequency bins, the majority of the untyped SNPs have INFO scores higher than 0.9, with decreasing proportions of markers in the lower INFO score categories. However, for rare markers, particularly those with frequencies < 1%, the distribution is considerably wider, and the mode of the distribution does not correspond to the INFO score > 0.9 category, but to lower INFO score values.
Additionally, the plots demonstrate that using the 1000 Genomes sample as a reference sample shifts the distributions to the right in all the frequency bin categories. Markers imputed using the 1000 Genomes reference sample tend to have INFO scores higher than those imputed using the HapMap Phase II combined reference panel for all the frequency bins. This is also evident in Figure
Figure 1 Proportion of non-missing genotypes versus concordance rates using different INFO score thresholds. This analysis was performed for the HapMap phase II combined reference sample based on the two-step imputation method.
We compared in more detail the imputation accuracy and efficacy for markers in the following allele frequency categories: <1%, 1-5% and 45-50%, using the two-step imputation method and the 1000 Genomes reference panel. For this analysis, instead of using the overall imputation concordance based on the three possible genotypes, we focused our attention on the concordance and missingness rates for the heterozygotes. The reason for employing this strategy is that an analysis based on overall imputation concordance may give misleading results for rare markers: the overall concordance rate may be high for these markers, but the concordance rates for heterozygotes and minor allele homozygotes may be much lower than the overall concordance rates. For the imputed markers in the 45-50% allele frequency bin, using an INFO threshold of 0.9, the concordance rate for the heterozygotes was 97.5% and the proportion of non-missing genotypes 90.2%. For the markers in the 1-5% bin, the concordance rate dropped to 85.4% and the proportion of non-missing genotypes to 85.1%. For rare markers (<1%), the drop was even more pronounced: the concordance rate was only 60.6% and the proportion of non-missing genotypes was 78.1%.
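The reason overall concordance can mislead for rare markers is easy to see with a small worked example. The sketch below uses invented counts (not study data) for a single rare marker, and a Hardy-Weinberg calculation as a baseline; the function names `concordance` and `hom_major_freq` are hypothetical.

```python
def concordance(observed, imputed, genotype=None):
    """Fraction of matching calls, optionally restricted to one observed genotype."""
    pairs = [(o, i) for o, i in zip(observed, imputed)
             if genotype is None or o == genotype]
    return sum(o == i for o, i in pairs) / len(pairs)


def hom_major_freq(maf):
    """Expected major-allele homozygote frequency under Hardy-Weinberg."""
    return (1 - maf) ** 2


# Invented toy marker: 980 major homozygotes and 20 heterozygotes observed;
# the imputation recovers all homozygotes but only 12 of the heterozygotes.
observed = ["hom_major"] * 980 + ["het"] * 20
imputed = ["hom_major"] * 980 + ["het"] * 12 + ["hom_major"] * 8

overall = concordance(observed, imputed)
het_only = concordance(observed, imputed, genotype="het")
baseline = hom_major_freq(0.01)

print(f"overall concordance:       {overall:.3f}")
print(f"heterozygote concordance:  {het_only:.3f}")
print(f"always-major-homozygote baseline at 1% frequency: {baseline:.4f}")
```

At a 1% allele frequency, a method that calls every genotype as the major homozygote would already be expected to agree with about 98% of observed genotypes, which is why the heterozygote-specific rate is the more informative measure for rare markers.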
The results described above are based on markers located on chromosome 12. In order to evaluate the generalizability of these results, we also masked 5% of genotyped markers on chromosome 22, and on the HLA region, which spans approximately 5 megabases on chromosome 6 and has been under selective pressure in different population groups [31][32][33]. These analyses were carried out with the two-step imputation method using the HapMap and 1000 Genomes reference panels. For chromosome 22, using the HapMap reference panel, the concordance rate was 97.6% and the proportion of non-missing genotypes 83.2%, and using the 1000 Genomes reference panel, the concordance rate was 97.3% and the proportion of non-missing genotypes 89.9%. For the HLA region, using the HapMap reference panel, the concordance rate was 99.35% and the proportion of non-missing genotypes 97.4%, and with the 1000 Genomes reference panel, the concordance rate was 99.5% and the proportion of non-missing genotypes 99.05%.

Discussion
In recent years, imputation has become a key tool in the success of genome-wide association studies. Genotype imputation has been shown to increase the power of genetic association studies by boosting the number of SNPs to be tested for association and facilitating the detection of rare variants in addition to common variants [14,19,34,35]. Furthermore, imputation aids in fine-mapping studies of the disease-associated region, thus increasing the chance of identifying additional candidate SNPs [36]. Finally, genotype imputation enables meta-analyses that combine results across studies based on different genotyping platforms [37,38]. This approach has been effective in identifying novel associations in different traits [39][40][41][42][43][44].
However, an important concern with respect to imputation lies in the selection of an appropriate reference panel. Most of the GWAS to date have been conducted in populations well represented by the available reference panels (e.g. European or East Asian populations), and used only one relevant reference population during the imputation process [6,43,45,46,47]. However, for populations that are phylogenetically distant from the samples present in the reference panels, the selection of a suitable reference panel for imputation becomes less clear. In this situation, differences in the pattern of LD between the study and reference populations may affect imputation accuracy. Different approaches have been suggested for this particular scenario. For example, Huang et al. [16] explored imputation accuracy in the samples of the HGDP-CEPH panel, which is a worldwide collection of individuals from different locations, using the HapMap II reference panels. The authors found that for most of the studied samples, mixtures from at least two HapMap reference samples maximized imputation accuracy [16]. Another study showed that using tag SNPs from all the HapMap reference populations combined captured common variation in African American, Latino and Hawaiian samples more effectively than tag SNPs obtained from the individual HapMap reference samples [48]. This 'cosmopolitan' approach to imputation, combining reference haplotypes from all the reference populations available, is the strategy currently recommended by the most widely used imputation packages, such as IMPUTE [19,20] and MACH [21].
African American and Hispanic/Latino populations pose unique challenges for imputation. These populations are the result of recent admixture between continental groups (primarily European, Native American and West African populations) and admixture proportions show substantial geographic variation [49][50][51]. Several studies have evaluated imputation performance in recently admixed populations. In a recent GWAS of coronary heart disease and its risk factors in a large African American sample [52], a high imputation concordance (95.6%) was obtained when SNPs were imputed using a combined reference panel of haplotypes from the HapMap phase II CEU and YRI panels. In another study in African Americans [53], the highest imputation yield and coverage were attained using the two HapMap reference panels (CEU and YRI) separately and then merging the results. Another approach for imputation in African American populations has been recently suggested by Paşaniuc et al. (2011) [54]. This strategy, termed 'local ancestry aware imputation', uses local ancestry to guide the choice of reference haplotypes for imputation and shows marginal improvement in imputation accuracy in the admixed sample. However, this approach will be more difficult to implement in Hispanic/Latino populations, due to the lack of reference data for the relevant Native American parental populations, which is key to obtaining accurate estimates of local ancestry. In the study by Huang et al. (2009), using combinations of two (European and East Asian) or three HapMap reference samples (East Asian, European and West African) produced the highest imputation accuracies (>95%) for two Native American samples (Pima and Maya) and a sample from Colombia [16].
A recent study [55] showed that, when performing imputation in a Hispanic sample from San Francisco with the program IMPUTE v2 and the HapMap II reference panel, using local haplotype weights based on a coalescent method provided lower error rates (7.8%) than using no weighting (8.9%), or a global weighting method based on empirical estimates of ancestry (9.0%) [56]. It is important to note that most of the aforementioned studies have used the HapMap II panel as the reference dataset for imputation. However, the recent progress of the 1000 Genomes project (http://www.1000genomes.org/) has provided the scientific community with much more complete reference panels, both in terms of the number of markers and the number of populations. Importantly, the reference databases are updated on a regular basis. For this reason, it is currently recommended to perform the imputation in two stages: pre-phasing the study genotypes to estimate haplotypes, and then imputing untyped genotypes in a separate run. This substantially reduces imputation time with respect to single-step approaches at the expense of a small loss in accuracy. An important advantage of this approach is that, as new reference data become available, it is only necessary to repeat the imputation step.
In this study, we evaluated the imputation performance of the widely used program IMPUTE in an admixed sample from Mexico City using different imputation strategies (single-step vs. two-step imputation) and reference panels (HapMap and 1000 Genomes). We have previously described that this sample primarily has Native American (62%) and European contributions (33%), with a low proportion of African ancestry (5%) [30]. Importantly, there are no Native American reference samples in the HapMap or 1000 Genomes datasets, so it is of relevance to test the relative imputation performance of these reference panels in the Mexican sample. In an analysis of imputed markers on chromosome 12, we observed that for this sample there are only minor differences in imputation accuracy between the single-step and two-step approaches (Table 1). The concordance rate of the single-step approach is only slightly higher than that of the two-step approach (99.1% vs. 98.4%, respectively, when using the HapMap phase II combined reference sample). In contrast, the imputation efficacy (i.e. proportion of non-missing genotypes) was higher for the two-step than the single-step imputation approach (90.1% vs. 85.5%, respectively). Therefore, our study confirms the two-step approach as the preferable imputation strategy, because it provides flexibility and faster imputation times, while providing an overall similar imputation performance to the single-step approach.
As expected, adding the HapMap Phase III Mexican American sample from LA to the HapMap Phase II combined reference sample produced marginal increases in both accuracy (99.4% vs. 99.1%) and efficacy (85.9% vs. 85.5%) (Table 1). We also anticipated that reducing the threshold of the imputation confidence scores (the INFO score measures) when calling the imputed genotypes would result in lower imputation accuracy and higher proportions of non-missing genotypes. The reductions observed in imputation accuracy were relatively minor, from 98.4% with an INFO score threshold of 0.9 to 95.5% with an INFO score threshold of 0.5 (Figure 1). This relatively small reduction in overall imputation accuracy is primarily due to the fact that most genotypes (and markers) have very high INFO scores. Therefore, adding the relatively small percentage of genotypes with lower INFO scores (and lower concordance rates) does not produce a major shift in overall imputation accuracy. Of all masked markers on chromosome 12, 61.5% had INFO scores higher than 0.9, 15.4% had INFO scores between 0.8 and 0.9, 6.5% had INFO scores between 0.7 and 0.8, 6.2% had INFO scores between 0.6 and 0.7, 3.8% had INFO scores between 0.5 and 0.6, and 6.7% had INFO scores lower than 0.5 (see also the discussion below about the relationship between imputation efficacy and accuracy and allele frequency).
We also examined the potential improvement in imputation performance obtained with the recently available 1000 Genomes panel (June 2011 release), with respect to the HapMap panel, using the two-step imputation protocol. The 1000 Genomes panel is a much more comprehensive and powerful resource for imputation, comprising more than 37 million autosomal SNPs present in 1,094 individuals from different populations around the world. Here, we show that for the Mexican sample the major improvement associated with the use of the 1000 Genomes reference panel is the substantial increase in imputation efficacy, in addition to the larger number of imputed markers (Table 1). Genotype concordances were similar for both reference datasets (around 98.4%). However, imputations with the 1000 Genomes panel resulted in 94.7% non-missing genotypes (employing an INFO score threshold of 0.9), in comparison with 90.1% for the HapMap phase II combined panel (using the same threshold). When the INFO scores are plotted for different allele frequency bins, either as an average (Figure 2) or as histograms of the individual scores (Figures 3A and 3B), it is evident that the confidence of the genotype calls is higher with the 1000 Genomes panel for all allele frequency categories. There is a high correlation between the INFO scores obtained with the 1000 Genomes and HapMap phase II reference panels (Figure 5), but the former are systematically higher than the latter (Figure 4).
The results described above are based on an analysis of markers on chromosome 12. An analysis of markers on chromosome 22 gives consistent results: the concordance rates using the HapMap phase II and 1000 Genomes reference panels are very similar (97.6% vs. 97.3%, respectively), but the proportion of non-missing genotypes is lower with the HapMap reference panel than with the 1000 Genomes panel (83.2% and 89.9%, respectively). Interestingly, in the HLA region on chromosome 6, which spans approximately 5 megabases (29-34 Mb) and has shown signatures of natural selection in previous studies [31][32][33], both the imputation accuracy (concordance) and the imputation efficacy (proportion of non-missing genotypes) were higher than those observed for chromosomes 12 and 22. When analyzing locus ancestry with a panel of Ancestry Informative Markers in the sample from Mexico City (data not shown), we observed that in a broad region of chromosome 6, including the HLA loci, there was an excess of European ancestry with respect to the rest of the genome, in both type 2 diabetes patients and controls. This may be a potential explanation for the increased imputation accuracy and efficacy identified in the HLA region (i.e. both reference panels, HapMap and 1000 Genomes, have a good representation of European populations, but Native American populations are not well represented in these panels).
The imputation performance of the 1000 Genomes reference panel for rare variants is substantially better than that of the HapMap phase II panel. However, the average imputation confidence (INFO score) is considerably lower for rare variants than for common variants (Figures 2 and 3), irrespective of the reference panel. The rare alleles (<1%) present in the Mexican sample are not properly captured by any of the reference panels, in spite of the inclusion in the 1000 Genomes panel of dense data from another sample of Mexican ancestry from LA. This is also evident in a more detailed comparison of imputation accuracy and efficacy for heterozygotes in the following allele frequency categories: <1%, 1-5% and 45-50%. For common variants (45-50%), the imputation accuracy and efficacy were very high (>97% concordance and >90% non-missing genotypes). However, for rare variants (<1%), the proportion of missing genotypes was quite high (>21%), and importantly, even for the genotypes with high INFO scores (>0.9), there was a large proportion of discordant calls (>39%). It is important to note that our analyses were based on markers from a commercial microarray (in order to minimize genotyping errors, the program PLINK was used to merge the genotype calls obtained with two genotyping algorithms: BRLMM-P and Birdseed), and it is not clear to what extent these findings can be extrapolated to other scenarios (e.g. sequencing data). However, our results highlight the need to be cautious with the interpretation of the results for rare variants in GWAS in Hispanic samples.

Conclusions
We show that the program IMPUTE has an excellent imputation performance for common markers in an admixed sample from Mexico City, which has primarily Native American (62%) and European (33%) contributions. Genotype concordances for randomly masked markers are 98.4% or higher using different imputation strategies, in spite of the fact that no Native American samples are present in the HapMap and 1000 Genomes reference panels. In this sample, the best balance of imputation accuracy and efficacy was obtained with the 1000 Genomes panel (genotype concordance 98.4% and proportion of non-missing genotypes 94.7%). However, not unexpectedly, rare alleles (frequencies <1%) are not captured efficiently by any of the available panels.