An assessment of the portability of ancestry informative markers between human populations
© Myles et al. 2009
Received: 12 December 2008
Accepted: 20 July 2009
Published: 20 July 2009
Recent work has shown that population stratification can have confounding effects on genetic association studies and statistical methods have been developed to correct for these effects. Subsets of markers that are highly-differentiated between populations, ancestry-informative markers (AIMs), have been used to correct for population stratification. Often AIMs are discovered in one set of populations and then employed in a different set of populations. The underlying assumption in these cases is that the population under study has the same substructure as the population in which the AIMs were discovered. The present study assesses this assumption and evaluates the portability between worldwide populations of 10 SNPs found to be highly-differentiated within Britain (BritAIMs).
We genotyped 10 BritAIMs in ~1000 individuals from 53 populations worldwide. We assessed the degree to which these 10 BritAIMs capture population stratification in other groups of populations by use of the Fst statistic. We used Fst values from 2750 random markers typed in the same set of individuals as an empirical distribution to which the Fst values of the 10 BritAIMs were compared.
Allele frequency differences between continental groups for the BritAIMs are not unusually high. This is also the case for comparisons within continental groups distantly related to Britain. However, two BritAIMs show high Fst between European populations and two BritAIMs show high Fst between populations from the Middle East. Overall the median Fst across all BritAIMs is not unusually high compared to the empirical distribution.
We find that BritAIMs are generally not useful to distinguish between continental groups or within continental groups distantly related to Britain. Moreover, our analyses suggest that the portability of AIMs across geographical scales (e.g. between Europe and Britain) can be limited and should therefore be taken into consideration in the design and interpretation of genetic association studies.
Whole-genome association studies (GWASs) have proven extraordinarily successful in mapping loci that associate with common complex human diseases [for reviews see [1, 2]]. Whereas candidate gene and linkage analyses have identified a few dozen replicable associations between genetic markers and complex diseases, GWASs have provided compelling evidence for more than 150 gene-disease associations since their introduction in 2006. The presence of population stratification has presented one of the main statistical challenges in GWASs. Population stratification refers to differences in allele frequencies between cases and controls related to ancestry rather than disease status. Long before technologies for GWASs were available, it was recognized that differences in ancestry between cases and controls can present a substantial confounding effect in case-control studies. This is especially true in cases where disease risk differs between groups with different ancestry. For example, prostate cancer is more frequent in individuals of African ancestry compared to individuals of European ancestry, and previously significant genetic associations with prostate cancer become nonsignificant when correcting for these differences in ancestry. The presence of population stratification can inflate false positive rates or reduce power, and it has become standard practice to evaluate and correct for genetic ancestry in GWASs [for a review see ].
Currently there are two widely-used approaches for correcting for population stratification in GWASs: structured association (SA) and principal components analysis (PCA). SA uses the program STRUCTURE to estimate the number of subpopulations, k, and then for each individual assigns a probability of membership to each of the k subpopulations. It is then tested whether allele frequencies are dependent on phenotype within each of the k subpopulations. PCA reduces high-dimensional data to a small number of dimensions and uses the axes of variation, or eigenvectors, from these dimensions to calculate ancestry-adjusted genotypes and phenotypes. Both of these methods rely on inferences of ancestry from genome-wide SNP data. It has been shown that accurate estimates of individual ancestry can be obtained from a subset of SNPs from genome-wide data and these are referred to as ancestry-informative markers (AIMs) [for a review see ]. AIMs are characterized by substantially different allele frequencies between populations and can be used to estimate the proportion of an individual's ancestry that is derived from these populations. Before running a GWAS, AIMs can be used to match cases and controls, and outlier individuals whose ancestry is not typical of the population under study can be excluded. The main intention for the development of sets of AIMs, however, is to provide a set of markers that effectively control for population stratification in association studies in which samples have not been typed with genome-wide SNP arrays. These sets of AIMs are designed to capture all of the necessary ancestry information required to correct for stratification in candidate gene studies, in replication studies of GWASs, or in fine-mapping studies that focus on specific genomic regions identified from GWASs.
Sets of AIMs have been developed to distinguish among continental groups [12–16]. These sets of AIMs will be useful in controlling for stratification in admixed populations especially when mapping traits that are known to differ by continental ancestry, for example skin pigmentation . Most GWASs, however, have focused on samples of European ancestry and population stratification within Europe has therefore been assessed in detail [e.g. [18–20]]. From several genome-wide SNP data sets, sets of European AIMs have been developed [21–25] that distinguish stratification primarily along north-south and east-west gradients.
While European AIMs will be useful in studies that examine individuals of diverse European origin, many GWASs focus on cohorts of much more homogeneous ancestry (e.g. individuals from within a single country). It has been shown that even moderate levels of population stratification in relatively homogeneous populations can confound results in well-designed case-control studies [26, 27]. For example, spurious associations due to population stratification can arise if samples are drawn from two different cities within a country [e.g. Dresden and Munich in Germany; ] and can even appear in genetic isolates like the Icelandic  and Finnish populations . Despite these warnings, association studies that do not properly correct for population structure continue to be published [e.g. ].
Studies that do incorporate a correction for the confounding effects of population structure are to be commended. However, the correction for population structure is only as good as the markers chosen for the correction. In general, the precondition for the use of a set of AIMs is that the population under study has the same substructure as the population in which the AIMs were discovered. In several recent association studies, this precondition has been left unevaluated and potentially uninformative AIMs have been used to correct for population stratification. For example, Sulem et al.  tested for the presence of population stratification in Iceland with a set of AIMs that distinguish between European populations . To correct for population stratification in association studies in Asian populations, SNPs with high Fst between Asians and other continental groups have been employed [32, 33]. Similarly, correcting for population structure in Caucasians, Hu et al.  use 38 SNPs that are highly differentiated between continental groups. These studies demonstrate that the underlying assumption that AIMs are largely portable across geographical scales is pervasive.
To assess the portability of AIMs between populations and across geographic scales, we genotyped 10 SNPs found to be highly-differentiated within Britain (BritAIMs) in ~1000 individuals from 53 populations worldwide. Although these 10 BritAIMs do not constitute a complete set of AIMs that fully capture population structure within Britain, they are nevertheless useful for evaluating the portability of AIMs across geographic scales. We evaluated the usefulness of these BritAIMs as AIMs between and within different continents by comparing Fst values for the BritAIMs to Fst values from 2750 random markers typed in the same set of worldwide samples. Our results suggest that AIMs have limited portability between human populations and that caution is warranted in the use of AIMs discovered in a population whose substructure does not match the population in which they are being employed.
Table 1. Fst values of the 10 BritAIMs and the associated P values.
Genotype calls were made by visual inspection. None of the 10 SNPs were out of Hardy-Weinberg equilibrium after Bonferroni correction for multiple comparisons (53 populations × 10 SNPs = 530 comparisons). In the 7 cases of significant (P < 0.05 without Bonferroni correction) deviation from Hardy-Weinberg expectations for a SNP in a population, cluster plots were re-evaluated and no data were removed. The amount of missing data per SNP ranged from 0% – 6.5% with a mean of 3.1%. These data are accessible by request to the corresponding author or from the CEPH database .
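The quality-control step described above can be illustrated with a minimal sketch. It assumes a 1-degree-of-freedom chi-square test of Hardy-Weinberg proportions; the text does not state which test was actually used, so this is an illustration rather than the authors' procedure. The function name `hwe_chi2_pvalue` and the genotype counts are hypothetical. The Bonferroni threshold follows the 53 populations × 10 SNPs = 530 comparisons given in the text.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg proportions for one biallelic SNP.

    Assumption: the source does not say which HWE test was applied;
    an exact test would be preferable for small samples.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic sample: no deviation can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df
    return math.erfc(math.sqrt(chi2 / 2.0))

N_TESTS = 53 * 10  # populations x SNPs, as stated in the text
ALPHA = 0.05
BONFERRONI_THRESHOLD = ALPHA / N_TESTS

# Hypothetical genotype counts for one SNP in one population
p_value = hwe_chi2_pvalue(n_aa=10, n_ab=0, n_bb=10)
print(p_value < ALPHA, p_value < BONFERRONI_THRESHOLD)
```

A SNP can fall below the nominal 0.05 threshold in a given population while remaining above the Bonferroni-corrected threshold, which is the situation described for the 7 nominally significant cases.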
Fst was calculated according to equation 10 in Weir and Cockerham. Negative Fst values were set to 0. "Global Fst" for each of the 10 SNPs was calculated as the degree of differentiation among the 7 geographic regions represented in the CEPH-HGDP panel. Results remain unchanged when global Fst was calculated as the differentiation among all 53 populations rather than the 7 regions. We compared our observed Fst values for the 10 BritAIMs to an empirical Fst distribution from 2750 autosomal markers (2540 SNPs and 210 indels) typed in 927 individuals from the CEPH-HGDP panel. In cases where the same allele was fixed in all populations being compared, the SNP was considered non-informative and no Fst value was assigned. This resulted in different numbers of observations for the different empirical distributions, with a minimum of 2286 SNPs making up the empirical Fst distribution within Oceania. To allow for an unbiased comparison to the empirical distribution, Fst for the 10 BritAIMs was calculated from the same set of 927 individuals from which the empirical Fst distribution was calculated. For each BritAIM, a P value was calculated as the proportion of Fst values from the empirical distribution that were ≥ the observed Fst value. We use P < 0.05 as our threshold for "significance". It should be noted that "significant" therefore describes a value only in relation to the empirical distribution.
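The calculation described above can be sketched as follows. This is a single-locus, biallelic form of the Weir and Cockerham variance-component estimator, written under the assumption that it corresponds to what the text calls equation 10; the authors' exact implementation is not given. The function names `weir_cockerham_fst` and `empirical_p` are hypothetical. Negative estimates are set to 0 and SNPs fixed for the same allele in all populations receive no value, as stated in the text.

```python
def weir_cockerham_fst(sizes, freqs, hets):
    """Single-locus Weir-Cockerham theta for one biallelic SNP.

    sizes: number of individuals sampled in each population
    freqs: frequency of one allele in each population
    hets:  observed heterozygote frequency in each population
    Returns None when the same allele is fixed in every population.
    """
    if all(p == freqs[0] for p in freqs) and freqs[0] in (0.0, 1.0):
        return None  # non-informative SNP: no Fst value assigned
    r = len(sizes)
    n_total = sum(sizes)
    n_bar = n_total / r
    n_c = (n_total - sum(n * n for n in sizes) / n_total) / (r - 1)
    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / n_total
    s2 = sum(n * (p - p_bar) ** 2 for n, p in zip(sizes, freqs)) / ((r - 1) * n_bar)
    h_bar = sum(n * h for n, h in zip(sizes, hets)) / n_total
    pq = p_bar * (1.0 - p_bar)
    shared = pq - (r - 1) / r * s2
    # Variance components: a (among populations), b (among individuals
    # within populations), c (between alleles within individuals)
    a = (n_bar / n_c) * (s2 - (shared - h_bar / 4.0) / (n_bar - 1.0))
    b = n_bar / (n_bar - 1.0) * (shared - (2 * n_bar - 1) / (4 * n_bar) * h_bar)
    c = h_bar / 2.0
    total = a + b + c
    if total <= 0.0:
        return None
    return max(0.0, a / total)  # negative estimates are set to 0


def empirical_p(observed, empirical):
    """Proportion of empirical Fst values that are >= the observed value."""
    return sum(1 for f in empirical if f >= observed) / len(empirical)
```

With this definition, a BritAIM is "significant" only in the sense that fewer than 5% of the random markers show an Fst at least as large for the same set of populations.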
To assess the portability of AIMs, we used the Fst statistic to measure the degree of genetic differentiation within and between continental groups of ten BritAIMs, i.e. SNPs identified as "highly-differentiated" within Britain. Fst is a commonly employed and useful measure of allele frequency difference between populations and takes on values ranging from 0 (no difference) to 1 (fixed difference). SNPs with high Fst values are highly differentiated between populations and are therefore informative about population structure and are useful as AIMs [e.g. [23, 45]]. The list of 10 BritAIMs is presented in Table 1 along with Fst values and associated P values for the among-continent (i.e. global) and within-continent comparisons.
Table 2. Median Fst of the 10 BritAIMs within each continent and the associated P values.
The presence of population stratification is a potential source of false positives, and thus of spurious associations, in disease association studies. Recently, a number of studies have identified AIMs and have recommended their use to control for population stratification in association studies [12, 13, 21, 22, 25]. The sets of AIMs identified to date show large allele frequency differences either between continental groups (e.g. Africans and Europeans) or between populations within a continent (e.g. Europe). However, many association studies are conducted in relatively homogeneous populations. The medical genetics literature provides numerous examples where AIMs discovered at one geographic scale are used to correct for population stratification at finer geographic scales [e.g. [31–34]]. It remains unclear, however, whether AIMs identified as informative within Europe, for example, will also prove useful at finer geographical scales (e.g. within Britain). The present study takes a first step in addressing this issue by examining the portability of AIMs between populations and across geographic scales.
The 10 SNPs identified as highly-differentiated within Britain (BritAIMs) are not highly differentiated on a worldwide scale: none of the global Fst values for the BritAIMs lie in the upper tail of the empirical global Fst distribution (Figure 1). The median global Fst of the 10 BritAIMs is also not unusually high compared to the expectation at random (P = 0.692). Thus, AIMs identified at a fine geographic scale (i.e. within Britain) are not informative on a worldwide scale. This result was foreseeable since there is no a priori reason for expecting SNPs that differ dramatically in allele frequency within Britain to differ dramatically among continental groups.
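One way to obtain a P value for the median Fst of a marker set is by resampling from the empirical distribution. The sketch below is an assumption about how such a comparison could be made; the text does not describe the exact procedure behind the reported P values. The function name `median_fst_p`, the number of draws and the seed are hypothetical.

```python
import random
import statistics

def median_fst_p(observed_fsts, empirical_fsts, n_draws=10000, seed=1):
    """Resampling P value for the median Fst of a marker set.

    Assumption: random sets of the same size as the marker set are drawn
    without replacement from the empirical Fst values, and the P value is
    the proportion of draws whose median is >= the observed median.
    """
    rng = random.Random(seed)
    observed_median = statistics.median(observed_fsts)
    k = len(observed_fsts)
    hits = 0
    for _ in range(n_draws):
        draw = rng.sample(empirical_fsts, k)
        if statistics.median(draw) >= observed_median:
            hits += 1
    return hits / n_draws
```

A large P value under this scheme would mean that a random set of 10 markers typically shows a median Fst at least as high as that of the BritAIMs.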
Within continents, only 4 of the 10 BritAIMs have significantly high Fst values (Figure 2). These 4 BritAIMs are found within the top 5% of the empirical distributions from Europe and the Middle East. These two continental groups are assumed to be more closely related to Britons than the other continental groups included in the present study, and this result therefore suggests that some AIMs may be portable within a restricted geographic range. When the patterns of population differentiation for these 4 BritAIMs are examined in more detail, it is clear that the signal from rs1460133 is derived almost exclusively from the Basque who differ significantly from most of the other European populations for this SNP (Figure 3). Thus, while rs1460133 may be an informative marker for Basque ancestry, it does not show the dramatic gradient of allele frequency across the continent that is characteristic of other European AIMs . It is noteworthy that the median Fst for the 10 BritAIMs is significantly high within Europe (P = 0.039), but not within the other continental groups (Table 2). This observation also supports the notion that the BritAIMs as a set are at least somewhat ancestry informative across European populations.
Figure 4 provides a detailed view of the allele frequencies and population differentiation of the BritAIM that shows the highest Fst value within Europe, rs7696175. From the 53 × 53 matrix in Figure 4, it can be seen that the high Fst values for rs7696175 are not restricted to population pairwise comparisons within Europe: the French and Orcadians, for example, have substantially higher minor allele frequencies than several populations from the Middle East and Central South Asia. Similar plots are available for the remaining 9 BritAIMs (Additional File 1) in which it can be seen that the patterns of population differentiation are extremely varied across SNPs: many population-pairwise Fst values lie within the top 5% and even the top 1% of the empirical distribution. Thus the BritAIMs may be useful as AIMs in other groups of populations, but the patterns are often not systematic and their effectiveness in other samples would be difficult to predict a priori.
Previous studies have provided some evidence that AIMs may be portable between human populations. For example, microsatellite markers that are ancestry informative in one population are generally informative in others . Also, genomic regions showing large allele frequency differences between one set of continental groups are likely to be highly-differentiated between other continental groups . However, more recent studies that focus on the portability of AIMs across continental groups provide evidence against this notion. For example, Campbell et al.  previously noted that the use of 67 AIMs that were discovered to distinguish between African and European ancestry did not vary sufficiently among Europeans to allow detection of stratification. Similarly, Paschou et al.  found that SNPs chosen for ancestry inference in one continent perform no better than random SNPs in inferring ancestry in other continents. These studies have focused, however, on the portability of AIMs across broad geographic scales (i.e. between continental groups) and their conclusions have limited applicability to the design of association studies which usually focus on a more refined geographic scale.
Recently, Heath et al.  used PCA to assess population structure in ~6000 Europeans genotyped for ~130,000 SNPs and found that 5 of the 10 genomic regions containing the BritAIMs examined here were significantly associated with PC1 or PC2. It is worth noting that the two BritAIMs for which genotyping failed in the present study were in regions significantly correlated with PC2, and that 4 of the 5 remaining genomic regions containing BritAIMs neared significant correlation with PC1 or PC2 . These data suggest that, while AIMs discovered in a broad panel of Europeans may not perfectly capture ancestry information within Britain, there is substantial overlap among ancestry-informative genomic regions between the two geographic scales.
Local, geographically-restricted, natural selection at a locus generates large allele frequency differences between populations [49, 50]. Thus, AIMs are enriched in genomic regions that have been targeted by positive selection and are therefore likely to be in LD with adaptive functional alleles. Viewed from this evolutionary perspective, our finding that BritAIMs are not unusually differentiated between continental groups is not surprising: selection pressures that have generated allele frequency differences within Britain are unlikely to be shared across continental groups because local cultural and physical environments differ drastically at the continental scale. However, the BritAIMs' sharp gradients of allele frequencies across Britain are likely to have been caused by selection pressures shared by other European populations. For example, one BritAIM (rs1042712) is found near the lactase gene which shows a sharp gradient across Europe due to the action of positive selection for lactose tolerance . Thus, the portability of AIMs between populations will depend in part on the extent to which selection pressures have been shared between the populations. Without extensive population genetic analyses, this criterion will be difficult to evaluate.
The assumption that AIMs are portable across geographic scales is pervasive [31–34]. The data presented here suggest that there is an inevitable loss of power to detect population stratification when AIMs discovered in one population are used in another population. Practically, the assumption that the substructure of the population under study is adequately similar to the substructure of the population in which the AIMs were discovered is often difficult to evaluate. The present analyses suggest that the portability of AIMs is limited and that claims of association between genetic variants and phenotypes should be interpreted in accordance with the suitability of the selected AIMs used to correct for population stratification. As association analyses become increasingly common in populations for which genome-wide genotype data are sparse, we anticipate that this cautionary note will become increasingly important.
We thank Marc Bauchet for technical assistance. We acknowledge the Principal Investigators of the Wellcome Trust Case Control Consortium (WTCCC) for providing results prior to the publication of the main WTCCC manuscript. These include Mark McCarthy (type 2 diabetes), Jane Worthington (rheumatoid arthritis), Nilesh Samani (coronary artery disease), John Todd (type 1 diabetes), David Clayton (type 1 diabetes and analysis group), Peter Donnelly (analysis group) and Lon Cardon (analysis group). This work was supported by the German Bundesministerium für Bildung und Forschung (BMBF: NGFN2) and the Max Planck Society.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.