Worldwide population differentiation at disease-associated SNPs
© Myles et al. 2008
Received: 06 February 2008
Accepted: 04 June 2008
Published: 04 June 2008
Recent genome-wide association (GWA) studies have provided compelling evidence of association between genetic variants and common complex diseases. These studies have made use of cases and controls almost exclusively from populations of European ancestry and little is known about the frequency of risk alleles in other populations. The present study addresses the transferability of disease associations across human populations by examining levels of population differentiation at disease-associated single nucleotide polymorphisms (SNPs).
We genotyped ~1000 individuals from 53 populations worldwide at 25 SNPs which show robust association with 6 complex human diseases (Crohn's disease, type 1 diabetes, type 2 diabetes, rheumatoid arthritis, coronary artery disease and obesity). Allele frequency differences between populations for these SNPs were measured using Fst. The Fst values for the disease-associated SNPs were compared to Fst values from 2750 random SNPs typed in the same set of individuals.
On average, disease SNPs are not significantly more differentiated between populations than random SNPs in the genome. Risk allele frequencies, however, do show substantial variation across human populations and may contribute to differences in disease prevalence between populations. We demonstrate that, in some cases, risk allele frequency differences are unusually high compared to random SNPs and may be due to the action of local (i.e. geographically-restricted) positive natural selection. Moreover, some risk alleles were absent or fixed in a population, which implies that risk alleles identified in one population do not necessarily account for disease prevalence in all human populations.
Although differences in risk allele frequencies between human populations are not unusually large and are thus likely not due to positive local selection, there is substantial variation in risk allele frequencies between populations which may account for differences in disease prevalence between human populations.
A broadly accepted model for the genetic architecture of complex disease is the common disease – common variant (CDCV) hypothesis. This hypothesis proposes that risk alleles for common complex diseases should be common (i.e. ≥ 5%) and thus are likely old and found in multiple human populations, rather than being population specific [1–4]. From analyses of genome-wide polymorphism data from populations of African, Asian and European ancestry, it has been shown that common alleles in one population are frequently both shared and common among human populations [5–7]. However, a recent comprehensive study of 3,873 genes from African, Asian, Latino/Hispanic, and European Americans found that common alleles in one population were frequently not common in another population. Similarly, from a meta-analysis of disease-association studies, Ioannidis et al. (2004) argued that the frequencies of disease-associated alleles show "large heterogeneity between races". These observations suggest that the frequency of a risk allele discovered in one population is not always a strong predictor of the frequency of that risk allele in other populations. This raises the question of whether risk alleles discovered in one population account for disease prevalence across all human populations. Thus, it remains unknown how well the CDCV model accounts for disease prevalence across populations on a worldwide scale.
In addition to evaluating the extent to which disease-associated alleles differ in frequency between populations, it is of great interest to determine what evolutionary forces are responsible for the observed degree of population differentiation at disease-associated SNPs. Because disease is so tightly linked to survival and reproductive success, it follows that disease has likely been a strong selective force in human evolution. Moreover, alleles that cause disease in contemporary environments may have been positively selected in ancestral environments. For example, the thrifty gene hypothesis posits that populations whose ancestral environments were characterized by periods of feast and famine may have experienced selection for a "thrifty genotype" that promotes efficient fat and carbohydrate storage. Though formerly advantageous, thrifty genotypes may be causing obesity and type 2 diabetes in contemporary environments where food is often abundantly available. Previous studies have suggested that genes associated with complex diseases such as cardiovascular disease [11–14] and type 2 diabetes [15–17] have been targets of positive natural selection. If disease genes have often been targeted by selection, then identifying loci that have experienced selection may aid in disease-related research.
Local (i.e. geographically restricted) positive selection results in large allele frequency differences between populations (e.g. [19, 20]). The Fst statistic captures the difference in allele frequency between populations at any given SNP and ranges from 0 (no differentiation) to 1 (fixed difference between populations). Thus, when compared to a set of random SNPs in the genome, positively selected alleles tend to accumulate in the top tail of the Fst distribution [21–23]. It has previously been shown that local positive selection has had no widespread effect on disease allele frequency differences between populations: on average, disease-associated SNPs showed allele frequency differences between populations similar to those observed for random SNPs. Individually, however, several disease-associated alleles appear to have been driven to high frequency by positive selection in certain human populations and thus may be responsible for large differences in disease prevalence between populations [15, 25].
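To make the range of Fst concrete, the following is a minimal sketch of one common textbook form, Fst = (H_T - H_S) / H_T, for a single biallelic SNP. It assumes equal weighting of populations and ignores sample-size corrections; the estimator used in this study is not specified in the text above and may differ, and the function name is illustrative only.

```python
def fst_biallelic(freqs):
    """Unweighted Wright's Fst for one biallelic SNP.

    freqs: allele frequency of the same allele in each population.
    Assumption: populations are weighted equally and no sample-size
    correction is applied (a simplification, not the study's estimator).
    """
    if not freqs:
        raise ValueError("need at least one population")
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity in the pooled (total) population.
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        # Allele absent or fixed everywhere: report no differentiation.
        return 0.0
    # Mean expected heterozygosity within populations.
    h_s = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t


print(fst_biallelic([0.5, 0.5]))  # identical frequencies in both populations
print(fst_biallelic([0.0, 1.0]))  # fixed difference between populations
```

Identical frequencies give the lower bound of the statistic and a fixed difference gives the upper bound, matching the 0 to 1 range described above.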
The conclusions drawn from previous studies that have evaluated levels of population differentiation at disease-associated SNPs are limited for two reasons. First, these studies relied on many disease-gene associations that have not been successfully replicated and thus likely do not represent true associations. Second, previous studies made use of disease allele frequencies from a small number of populations (i.e. ≤ 4). To address the strength of the CDCV model on a worldwide scale and to evaluate the effects of local positive selection on worldwide risk allele frequencies, we present allele frequencies and levels of population differentiation across 53 populations for 25 SNPs which show replicated association with the following common complex human diseases: Crohn's disease, type 1 diabetes, type 2 diabetes, rheumatoid arthritis, coronary artery disease and obesity [17, 26–42]. These newly identified genetic variants came from recent genome-wide association (GWA) study data, which have revolutionized approaches for identifying disease loci.
Worldwide risk allele frequencies and global Fst for 25 disease-associated SNPs typed in the CEPH-HGDP panel. [Table: risk allele frequency by geographic region (including Central South Asia), with source references per SNP.]
All of the genotype calls were confirmed by visual inspection. After Bonferroni correction for 25 comparisons, 4 SNPs remained for which a population was out of Hardy-Weinberg equilibrium at P < 0.002. The genotype calls in these cases were re-confirmed by visual inspection of the cluster plots and no data were removed. The amount of missing data per SNP ranged from 2.0% to 5.4% with a mean of 3.3%. These data are accessible from the CEPH database or by request to the corresponding author.
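The reported threshold of 0.002 is what a Bonferroni correction gives for 25 comparisons if the nominal level is 0.05 (an assumption; the nominal level is not stated above). The sketch below shows that arithmetic together with a simple chi-square test of Hardy-Weinberg equilibrium from genotype counts. The chi-square form is also an assumption, since the text does not say which test was used, and the function names are illustrative.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square test for Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb: observed genotype counts in one population.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if any(e == 0 for e in expected):
        # Monomorphic sample: the test is uninformative.
        return 1.0
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


threshold = bonferroni_threshold(0.05, 25)
print(threshold)
print(hwe_chi2_pvalue(25, 50, 25) < threshold)  # counts match HWE expectations
print(hwe_chi2_pvalue(50, 0, 50) < threshold)   # no heterozygotes at all
```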
Global Fst, the degree of differentiation among the 7 geographic regions represented in the CEPH-HGDP panel, was calculated for each of the 25 SNPs. Results were largely the same when global Fst was calculated among all 53 populations. We obtained an empirical Fst distribution from 2750 autosomal markers (2540 SNPs and 210 indels) previously typed in 927 individuals from the CEPH-HGDP panel. Global Fst values for the disease-associated SNPs were calculated from the same set of 927 individuals to allow for an unbiased comparison to the empirical distribution. For each disease-associated SNP, a P value was calculated as the proportion of Fst values from the empirical distribution that were ≥ the observed Fst value. We found that global Fst is weakly but significantly correlated with global minor allele frequency (R² = 0.0152, P = 5.04 × 10⁻²³, see Additional file 1) and that the Fst distribution often differs significantly between minor allele frequency bins (Additional file 2). We therefore provide corrected P values (P cor) for each Fst value by comparing only to SNPs from the empirical distribution that fall into the same minor allele frequency bin.
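The empirical P value and its frequency-matched counterpart can be sketched as follows. The number of minor allele frequency bins (10 equal-width bins over 0 to 0.5) is an assumption made for illustration; the binning actually used is described only in the additional files. Function names and the toy background values are hypothetical.

```python
def empirical_p(observed, background):
    """Proportion of background values that are >= the observed value."""
    if not background:
        raise ValueError("empty background distribution")
    return sum(1 for value in background if value >= observed) / len(background)


def maf_bin(maf, n_bins=10):
    """Index of the equal-width bin containing a minor allele frequency.

    Assumption: n_bins equal-width bins spanning 0 to 0.5.
    """
    return min(int(maf * 2 * n_bins), n_bins - 1)


def corrected_p(observed_fst, observed_maf, background, n_bins=10):
    """Empirical P value using only background markers in the same MAF bin.

    background: list of (fst, maf) pairs for the random markers.
    """
    target = maf_bin(observed_maf, n_bins)
    same_bin = [fst for fst, maf in background if maf_bin(maf, n_bins) == target]
    return empirical_p(observed_fst, same_bin)


# Toy background of (Fst, MAF) pairs; values are made up for illustration.
background = [(0.05, 0.11), (0.10, 0.12), (0.20, 0.13), (0.30, 0.45)]
print(empirical_p(0.10, [fst for fst, _ in background]))
print(corrected_p(0.10, 0.12, background))
```

Restricting the comparison to one bin changes the denominator as well as the numerator, so the corrected value can be larger or smaller than the uncorrected one.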
To determine whether the mean global Fst of 0.100 for the 25 disease-associated SNPs is unusually high, this value was compared to a distribution of mean global Fst values from 25 SNPs sampled at random 10,000 times from the empirical distribution. We found that disease-associated SNPs are not more differentiated than random markers (P = 0.462, P cor = 0.500). This analysis was repeated for groups of SNPs associated with each of the diseases listed in Table 1. In no case were the disease-associated SNPs more differentiated than expected at random (P and P cor > 0.3 in every case).
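The resampling comparison described here can be sketched as below. Whether the 25 markers were drawn with or without replacement is not stated; the sketch assumes sampling without replacement within each draw, and the seed parameter is added only to make the example reproducible.

```python
import random


def resampled_mean_p(observed_mean, background, set_size=25,
                     n_draws=10_000, seed=0):
    """Proportion of random marker sets whose mean Fst is >= observed_mean.

    background: Fst values of the random markers.
    Assumption: each set is drawn without replacement from the background.
    """
    if len(background) < set_size:
        raise ValueError("background smaller than the set size")
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(background, set_size)
        if sum(sample) / set_size >= observed_mean:
            hits += 1
    return hits / n_draws


# Made-up background values, used only to exercise the function.
toy_background = [i / 1000 for i in range(1000)]
print(resampled_mean_p(0.5, toy_background, n_draws=1000))
```

The same function applies to the per-disease analyses by setting set_size to the number of SNPs associated with that disease.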
Global Fst provides a rough measure of the magnitude of allele frequency differentiation worldwide, but local positive selection acting at finer geographical scales will likely remain undetected using this measure. To examine the patterns of population differentiation at a more refined geographical scale, we calculated Fst for every pairwise comparison among the 53 populations and 7 geographic regions to produce 53 × 53 and 7 × 7 Fst matrices, respectively. Each Fst value was then compared to the corresponding empirical distribution of Fst values to generate a P value without correction for allele frequency.
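A pairwise Fst matrix of this kind can be assembled as in the sketch below, which uses the same simplified unweighted estimator, (H_T - H_S) / H_T, applied to two populations at a time. The estimator, the function names and the population labels are assumptions for illustration, not the study's implementation.

```python
def pairwise_fst(p1, p2):
    """Unweighted Fst between two populations at one biallelic SNP."""
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        # Allele absent or fixed in both populations.
        return 0.0
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_t - h_s) / h_t


def fst_matrix(freq_by_pop):
    """Square matrix of pairwise Fst values.

    freq_by_pop: mapping from population name to allele frequency.
    Returns (names, matrix) with rows and columns in sorted name order.
    """
    names = sorted(freq_by_pop)
    matrix = [[pairwise_fst(freq_by_pop[a], freq_by_pop[b]) for b in names]
              for a in names]
    return names, matrix


# Hypothetical populations and frequencies.
names, matrix = fst_matrix({"pop_a": 0.0, "pop_b": 1.0, "pop_c": 0.0})
for name, row in zip(names, matrix):
    print(name, row)
```

With 53 populations or 7 regions as keys, the same call yields the 53 × 53 or 7 × 7 matrix; each entry would then be compared with the empirical distribution for that same pair.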
The extent to which the CDCV hypothesis is applicable across human populations depends in part on the extent to which common risk alleles identified in one population are also common in other populations. The majority of disease association studies are conducted using case-control cohorts of European ancestry. The degree to which associations established in these studies can be extended to other populations remains an open question. In addition, it remains unclear how often differences in risk allele frequencies between populations are due to the action of local positive selection. The present study takes a first step in addressing these issues by quantifying the degree of allele frequency differentiation between worldwide populations for 25 SNPs associated with 6 common complex diseases.
Many of the disease-associated SNPs studied here show substantial heterogeneity in allele frequencies across human populations (Figure 1). In some cases, risk allele frequencies remain generally low or high across all 53 populations. However, in several cases risk allele frequencies vary across a large portion of the allele frequency spectrum. Maximum allele frequency differences between any 2 populations ranged from 0.10 to 1.0 across SNPs with a mean of 0.65 (Figure 2). For 7 of the 25 SNPs, the maximum allele frequency difference between any 2 populations was > 0.75. Thus, some risk alleles are found at substantially different frequencies between populations.
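The maximum allele frequency difference between any 2 populations is simply the highest population frequency minus the lowest, as the short sketch below shows. The SNP labels and frequencies are made up for illustration and are not data from this study.

```python
def max_pairwise_difference(freqs):
    """Largest allele frequency difference between any two populations."""
    freqs = list(freqs)
    if len(freqs) < 2:
        raise ValueError("need at least two populations")
    return max(freqs) - min(freqs)


# Hypothetical risk allele frequencies per population for two SNPs.
freqs_by_snp = {
    "snp_a": [0.10, 0.15, 0.20],
    "snp_b": [0.00, 0.40, 1.00],
}
diffs = {snp: max_pairwise_difference(f) for snp, f in freqs_by_snp.items()}
# Count SNPs whose maximum difference exceeds the 0.75 cut-off used above.
n_large = sum(1 for d in diffs.values() if d > 0.75)
print(diffs, n_large)
```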
To further quantify the allele frequency differences between populations for the disease-associated SNPs, we compared Fst values for the disease-associated SNPs to an empirical Fst distribution generated from 2750 random markers genotyped in the same samples. The average global Fst of the disease-associated SNPs is not unusually high compared to the empirical global Fst distribution. This is also the case when global Fst values were averaged across SNPs in each disease category. Thus, disease-associated SNPs do not show more population differentiation than random SNPs, in agreement with a previous study that examined a different set of disease-associated markers in a more limited set of populations.
Although disease-associated SNPs do not show high Fst as a set, individual disease-associated SNPs may be unusually differentiated. Previous studies have identified disease-associated loci that show evidence of local positive selection in the form of unusually large allele frequency differences between populations [14, 15, 17, 25, 51–53]. In some cases it is the protective allele [17, 53], and in others the risk allele, which appears to have been driven to high frequency by positive selection. Several of the disease-associated SNPs studied here show considerable worldwide population differentiation and have global Fst values within the top 10% of the empirical global Fst distribution (Figure 3). At a more refined geographical scale, the patterns of population differentiation are extremely varied across SNPs and many population-pairwise Fst values lie within the top 5% and even the top 1% of the empirical distribution (see Additional file 3). For example, the risk allele at SNP rs10761659 is absent in some African populations and is near or at fixation in a number of populations outside of Africa. The global Fst value for this SNP lies within the top 5% of the empirical distribution (Figure 3) and most population pairwise comparisons between Africans and non-Africans are highly significant (Figure 4). A type 1 diabetes-associated SNP, rs11171739, also shows high levels of differentiation between Africans and non-Africans, but in this case the risk allele is near fixation in Africans but is at low to intermediate frequency elsewhere in the world (Additional file 3). There are also cases in which a risk allele frequency is unusually high or low in only one or a few populations. For example, the risk allele at rs564398, a SNP associated with type 2 diabetes, is found at unusually low frequencies only in the Kalash of Pakistan and in Melanesians (Additional file 3). These SNPs may therefore turn out to have been the targets of local positive selection.
However, evidence for selection based on single marker Fst values should be interpreted with caution. A more in-depth investigation of the patterns of genetic variation in and around these loci and their effects on the phenotype is required before conclusions can be confidently drawn.
Regardless of whether large risk allele frequency differences between populations are the result of selection or genetic drift, these data provide several useful insights. First, it is reasonable to assume that, if a risk allele is fixed, absent, or close to either, it does not contribute to disease risk variation within that population. Thus, assuming that the risk conferred by these alleles is constant across populations (as may be the case for risk alleles found in genes related to fundamental biological activity, e.g. cyclin-dependent kinase function and risk of type 2 diabetes (T2D) and coronary artery disease (CAD)), our data suggest that the CDCV model does not necessarily extend across populations, since risk alleles discovered in a European population are sometimes absent, fixed or found at extremely low or high frequencies in other populations.
Second, combining evidence of selection and association may enhance power to identify genotype-phenotype relationships: a SNP with a large difference in risk allele frequency between populations is a strong candidate to explain large differences in disease prevalence between populations [15, 18]. However, despite the pattern observed for the Crohn's disease-associated SNP rs10761659 (Figure 4), there is no strong evidence to suggest that the risk of developing Crohn's disease differs dramatically between individuals of African and European ancestry. Future studies are required to determine the extent to which differences in risk allele frequencies between populations predict disease prevalence differences between populations.
Finally, power estimates for disease association studies rely on estimates of the risk allele frequency in a population. Inaccurate risk allele frequency estimates can result in overestimates of power and, consequently, in underpowered studies [57, 58]. Thus, these data can aid in the design of future association studies in populations for which allele frequency data are scarce.
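The dependence of power on risk allele frequency can be illustrated with a rough normal-approximation calculation for a two-sided allelic case-control test. Everything in this sketch is an assumption made for illustration: the multiplicative allelic odds ratio model, the use of the control frequency as the population frequency, the default significance level and the sample sizes. It is not the power calculation used in the cited studies.

```python
from statistics import NormalDist


def allelic_test_power(control_freq, odds_ratio, n_cases, n_controls,
                       alpha=5e-8):
    """Approximate power of a two-sided allelic test (normal approximation).

    control_freq: risk allele frequency in controls (0 < freq < 1).
    odds_ratio: assumed per-allele odds ratio.
    """
    if not 0.0 < control_freq < 1.0:
        raise ValueError("control_freq must be strictly between 0 and 1")
    p0 = control_freq
    # Risk allele frequency in cases implied by the allelic odds ratio.
    p1 = odds_ratio * p0 / (1.0 - p0 + odds_ratio * p0)
    # Each individual contributes two alleles.
    se = (p1 * (1.0 - p1) / (2 * n_cases)
          + p0 * (1.0 - p0) / (2 * n_controls)) ** 0.5
    z = abs(p1 - p0) / se
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return NormalDist().cdf(z - z_crit) + NormalDist().cdf(-z - z_crit)


# Same odds ratio and sample size, different risk allele frequencies.
for freq in (0.02, 0.10, 0.30):
    print(freq, allelic_test_power(freq, 1.3, 2000, 2000))
```

Under these assumptions, a study designed with a frequency estimate of 0.30 would have far less power than planned if the true frequency in the target population were 0.02.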
Some of the risk alleles studied here may not be disease causing, but instead may be in linkage disequilibrium (LD) with the disease causing allele. Although recombination hotspot locations are generally shared across human populations and there is substantial conservation of haplotype structure worldwide [49, 59], the extent of LD can vary markedly across populations [60–63]. Because LD breaks down differently in different populations, the risk alleles studied here may not be associated with disease across all human populations. Our analyses assume that the degree of LD between the genotyped risk allele and the true causal allele is conserved across populations. Our interpretations should be considered in light of this caveat.
Disease-association studies have primarily made use of case-control cohorts of European ancestry. Studies of worldwide patterns of genetic variation in disease-associated genes are essential to determine how transferable disease-gene associations are from one population to another. Moreover, disease-association studies in diverse populations are required in order to determine whether different alleles are responsible for disease prevalence in different populations. A strong focus on the genetics of disease in humans worldwide is an important step in addressing large disparities in the quality of health care between human populations.
Disease-associated SNPs do not differ in frequency more between human populations than random SNPs in the genome. This suggests that positive local selection has not had a strong effect on the frequencies of risk alleles in general. Individually, however, several disease-associated SNPs do show evidence of positive local selection. Regardless of whether the observed differences are due to drift or selection, worldwide variation in risk allele frequencies is considerable. Future studies are required to determine the extent to which this variation is responsible for differences in disease prevalence between populations.
We acknowledge the Principal Investigators of the Wellcome Trust Case Control Consortium (WTCCC) for providing results prior to the publication of the main WTCCC manuscript. These include Mark McCarthy (type 2 diabetes), Jane Worthington (rheumatoid arthritis), Nilesh Samani (coronary artery disease), John Todd (type 1 diabetes), David Clayton (type 1 diabetes and analysis group), Peter Donnelly (analysis group) and Lon Cardon (analysis group). We thank Kay Prüfer and Mehmet Somel for technical assistance; Naim Matasci, Matina Donaldson, David Hughes, Ed Green and Thomas Giger for useful discussions; and Kirk Lohmueller for comments.
This work was supported by the German Bundesministerium für Bildung und Forschung (BMBF: NGFN2) and the Max Planck Society.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.