
The Impact of Divergence Time on the Nature of Population Structure: An Example from Iceland




Published in the journal: PLoS Genet 5(6): e1000505. doi:10.1371/journal.pgen.1000505
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1000505

Summary

The Icelandic population has been sampled in many disease association studies, providing a strong motivation to understand the structure of this population and its ramifications for disease gene mapping. Previous work using 40 microsatellites showed that the Icelandic population is relatively homogeneous, but exhibits subtle population structure that can bias disease association statistics. Here, we show that regional geographic ancestries of individuals from Iceland can be distinguished using 292,289 autosomal single-nucleotide polymorphisms (SNPs). We further show that subpopulation differences are due to genetic drift since the settlement of Iceland 1100 years ago, and not to varying contributions from different ancestral populations. A consequence of the recent origin of Icelandic population structure is that allele frequency differences follow a null distribution devoid of outliers, so that the risk of false positive associations due to stratification is minimal. Our results highlight an important distinction between population differences attributable to recent drift and those arising from more ancient divergence, which has implications both for association studies and for efforts to detect natural selection using population differentiation.

Introduction

The Icelandic population has been sampled in many disease association studies [1]–[8]. Thus, there is a strong motivation to understand the structure of this population and the ramifications for association studies. A recent study of 40 microsatellite markers showed that the Icelandic population is relatively homogeneous, but that subtle subpopulation differences exist, inflating disease association statistics in simulated case-control studies [9]. Other studies of Icelandic population structure have focused on Y chromosome and mtDNA analyses [10]–[12]. Now, the availability of genotype data from a large number of Icelandic samples, based on densely distributed SNPs from all over the genome and collected in the course of genome-wide association studies, makes it possible to investigate Icelandic population structure in greater depth. In this study, we analyzed over 30,000 Icelandic samples that were genotyped using the Illumina 300 K chip.

In addition to providing a more detailed assessment of genetic differences between regional subpopulations, our analyses yield several new results. First, we show that with a sufficient amount of genotype data it is possible to distinguish regional geographic ancestries of individuals from Iceland, and to demonstrate a striking concordance between genetic relationships and Icelandic geography. Second, we show that population structure in Iceland is due to recent genetic drift, not to regional differences in the proportion of admixture from Norse and Gaelic ancestral populations [11]. Third, we show that allele frequency differences between regional subpopulations follow a null distribution that is devoid of highly differentiated SNPs, consistent with the young age of the Icelandic population. A noteworthy consequence is that there is minimal risk of confounding due to population stratification in association studies performed in Iceland. This is in stark contrast to differences among populations of European ancestry (e.g., as represented in European Americans [13],[14]), where, even in the face of low levels of aggregate population differentiation, confounding can arise from unusually differentiated loci that are the result of geographically restricted episodes of natural selection during much longer periods of population divergence. Indeed, a genetic comparison of Icelanders and Scots revealed an excess of highly differentiated variants, including variants for which the unusual extent of differentiation was genomewide-significant, suggesting the action of natural selection. Thus, both the curse of population stratification and the blessing of using unusually differentiated loci to detect natural selection are far more pertinent in populations with a subtle level of structure arising from ancient divergence than in populations such as that of Iceland whose subtle structure is the result of recent genetic drift.

Results

Genetic Relationships between 11 Regions of Iceland

During the past century, urbanization has led to considerable mixing of ancestry from the different regions of Iceland, particularly in the capital city of Reykjavik [9]. However, our aim here was to study the population structure as it existed prior to this mixing. To this end, our initial analyses focused on a subset of 877 Icelandic samples of over 30,000 that were genotyped on the Illumina 300 K chip. For each of 11 regions of Iceland, we chose up to 100 unrelated samples with majority ancestry from that region, based on genealogical information from their ancestors five generations back (Figure 1 and Table 1; see Materials and Methods).

Fig. 1. Map of 11 regions of Iceland, color-coded to match Figures 2 and 3.
The interior region is not numbered, as it is uninhabited. Sample sizes for each region are listed in Table 1.

Tab. 1. Data for Icelandic samples with majority ancestry from each of the 11 regions.
For each region, we list the total number of Icelandic samples with majority ancestry from that region, and the number of unrelated samples that were selected.

Principal components analysis (PCA) is a widely used tool for analyzing genetic data [15]–[18]. We ran PCA on genotype data from the 877 individuals using the EIGENSOFT software with default parameter settings [17]. A plot of the top two principal components is displayed in Figure 2A, revealing a striking concordance between the geographical orientation of the 11 regions (Figure 1) and the relative positions of each region on the PCA plot (Figure 2A). In both cases, we observe a ring-shaped topology with region numbers increasing in clockwise order and a central void corresponding to the unpopulated interior of Iceland. The top two PCs explain a modest proportion of the overall variance: 0.0027 for PC1 and 0.0022 for PC2, representing an excess of 0.0015 for PC1 and 0.0011 for PC2 above what would be expected by chance (Tracy-Widom P-values < 10⁻¹² in each case [17]), similar to previous results on European American data sets [13]. We note that these PCs are the result of genome-wide structure, as opposed to a small number of highly informative markers (see Text S1).

Fig. 2. PCA plots of (A) samples with most of their ancestry from 11 regions of Iceland and (B) samples with most of their ancestry from 11 regions of Iceland, together with a set of 250 randomly selected Icelandic samples.

To evaluate the use of dense genotype data to predict geographic ancestry in the Icelandic population, we randomly selected 250 additional Icelandic samples for which genotype data were available (see Materials and Methods). A PCA run with the 250 samples included (Figure 2B) indicates that these individuals trace their ancestry from all over Iceland, with an excess of individuals from the vicinity of region 4 (concordant with Table 1). We used the PCA results to predict the regional ancestry of each of the 250 samples and compared this with their true ancestry, which we defined as the region in which the greatest number of ancestors five generations back was born (see Materials and Methods). The ancestry predictions were correct for 47% of samples, correct to within a distance of one region for 74% of samples, and correct to within a distance of two regions for 93% of samples. The accuracy increased to 58% (87% to within one region, 97% to within two regions) when restricting to the 98 (of 250) samples with at least 16 of 32 ancestors from a single region. Our analyses demonstrate that dense genotype data can be used to distinguish, and to some extent predict, the regional geographic ancestry of individuals within Iceland. We note that a correlation between geography and genetic ancestry has also been observed in other parts of Europe [19]–[22].

A different way to examine the patterns of genetic variation in Iceland is through summary statistics such as FST, which reflects the proportion of the total genetic variation found in two populations that is explained by their division into separate populations [23],[16] (see Materials and Methods). FST values were computed for each pair of Icelandic regions, yielding an average of 0.0026 (Table 2). Both Figure 2A and Table 2 show that region 7 and particularly region 9 show the greatest divergence from the other regions, as well as the lowest heterozygosity, which suggests that these regions have been more influenced by genetic drift than the others. This finding is consistent with the small historical population sizes of these regions [24].

Tab. 2. Pairwise FST and heterozygosity estimates for 11 regions of Iceland.
Heterozygosity values are listed on the diagonal. Standard errors of FST estimates were equal to 0.0007 for all comparisons involving Region 1 and 0.0001 for all other comparisons.

Genetic Relationships between Iceland, Norway, and Scotland

The Icelandic population arose from the admixture of Norse and Gaelic ancestors around 1100 years ago, at the time of settlement [11]. Pairwise FST values between Iceland, Norway and Scotland were computed based on the 79,641 autosomal SNPs in the intersection of the Illumina 300 K and Affymetrix 6.0 chips, using genotype data from 30,244 Icelandic, 250 Norwegian and 445 Scottish samples (see Materials and Methods). The resulting FST estimates were 0.0016 between Iceland and Norway, 0.0020 between Iceland and Scotland, and 0.0013 between Norway and Scotland. The larger FST estimates separating Iceland and its two ancestral populations are consistent with previous analyses indicating that the Icelandic gene pool has experienced more recent drift than neighboring countries in northern Europe [12].

One possible explanation for the genetic differences observed between the 11 regions of Iceland is varying contributions from ancestral populations. To explore this possibility, we used genotypes from the 79,641 overlapping SNPs to project [17] the Norwegian and Scottish samples onto principal components computed using the subset of 877 Icelandic samples (Figure 3). This analysis is robust to the concern that projected samples may be affected by regression towards the mean (see Text S1, Figure S1, and Figure S2). The Norwegian and Scottish samples were tightly clustered near the origin, with each having a mean of 0.004 on PC1 and −0.005 on PC2. This indicates that the genetic differences between Icelandic subpopulations represented on the top two PCs are orthogonal to genetic differences between the Norwegian and Scottish ancestral populations. In other words, varying contributions from ancestral populations are not a major determinant of genetic differences between Icelandic regions. Rather, the most plausible source of these differences is genetic drift during the 1100 years that have passed since the settlement of Iceland.

Fig. 3. PCA plot of samples from Norway and Scotland projected onto PCs computed using samples with most of their ancestry from 11 regions of Iceland.

Estimating the Norse and Gaelic Contributions to Icelandic Ancestry

To obtain a direct estimate of Norse and Gaelic ancestry proportions in the Icelandic population, we modeled Icelandic allele frequencies as a linear combination of Norwegian and Scottish allele frequencies, accounting for the sampling error arising from the limited sample sizes (see Materials and Methods). While the Norwegian and Scottish samples may not perfectly represent the ancestral populations of Icelandic settlers—who derived from several parts of Norway, possibly other parts of Scandinavia, Scottish coastal regions and Ireland—we postulated that they were close enough to provide a reasonable admixture estimate. Based on the available data, the optimal linear combination yielded an estimate of 64% Norse and 36% Scottish ancestry, with a standard error of less than 2%. The FST between the optimal linear combination and the observed allele frequencies in Iceland was 0.0014, which may be in part due to inadequate sampling from the true ancestral populations, but is likely to be mainly due to recent genetic drift in the Icelandic gene pool.

The same computation was performed for each of the 11 Icelandic regions, yielding ancestry estimates that were not statistically different. For each region, the estimate of Norse ancestry was between 62% and 65%, with a standard error of less than 2% (except region 1, for which we obtained 61% with a standard error of less than 3%). This provides strong evidence that the proportions of Norse and Gaelic ancestry do not vary among Icelandic regions, supporting the notion that differences between Icelandic regions are due to recent genetic drift rather than varying contributions from ancestral populations.

A separate question is whether the proportion of Norse ancestry was greater among male settlers of Iceland than among female settlers, as previous studies based on Y-chromosome and mtDNA haplotypes have suggested [10],[11]. A comparison of ancestry estimates for X-chromosome vs. autosomal SNPs could potentially provide an answer to this question, since two-thirds of X-chromosome alleles (vs. one-half of autosomal alleles) are passed through the female line. We obtained an X-chromosome ancestry estimate of 63% Norse and 37% Scottish ancestry, with a standard error of 7%. The standard error was quite large—our analysis was limited to only 2,962 X-chromosome SNPs present on both the Illumina 300 K and Affymetrix 6.0 chips—and hence this analysis is inconclusive. Because ancestry differences between the X chromosome and autosomes would be expected to be much smaller than the underlying ancestry effects (for example, a 100% difference between the ancestry of male settlers and female settlers would lead to an X-chromosome vs. autosome ancestry difference of only 17%), our results do not contradict the hypothesis of a substantial ancestry difference between male and female settlers.
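The 17% figure quoted above can be checked with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes equal numbers of male and female settlers, so that autosomes draw half of their alleles from each sex while X chromosomes draw one-third from males and two-thirds from females.

```python
def expected_ancestry(male_norse, female_norse):
    """Return (autosomal, x_chromosome) Norse ancestry fractions.

    Assumption (not from the paper's methods): equal numbers of male and
    female settlers, so autosomal alleles come 1/2 from each sex and
    X-chromosome alleles come 1/3 from males and 2/3 from females.
    """
    autosomal = 0.5 * male_norse + 0.5 * female_norse
    x_chrom = (1.0 / 3.0) * male_norse + (2.0 / 3.0) * female_norse
    return autosomal, x_chrom


# Extreme case from the text: a 100% ancestry difference between the sexes
# (all male settlers Norse, no female settlers Norse).
auto, x = expected_ancestry(1.0, 0.0)
difference = auto - x  # 1/2 - 1/3 = 1/6, i.e. about 17%
```

Even this extreme scenario separates the two estimates by only about 17 percentage points, which is why a standard error of 7% on the X-chromosome estimate leaves more moderate scenarios unresolved.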

Distribution of Allele Frequency Differences between Icelandic Subpopulations

We evaluated whether there is an excess of common SNPs with large allele frequency differences between Icelandic subpopulations, using data from 14,313 individuals with majority ancestry from one of 11 Icelandic regions (Table 1). For each Icelandic region, we computed the distribution of allele frequency differences between that region and the union of all other regions, expressed as a χ² (1 d.o.f.) statistic under a model of neutral genetic drift. This computation accounts for related individuals (see Materials and Methods). P-P plots for each region r (r = 1, …, 11) are displayed in Figure 4. For each region, there was no excess of markers with large frequency differences versus other regions. Averaging across computations for each of 11 regions, 0.008% of markers had a P-value less than 0.0001, roughly matching the expected distribution. The most significant P-value was 3×10⁻⁶, a value that is not statistically significant after correcting for the number of SNPs and regions tested. These results are consistent with the hypothesis that the divergence time of Icelandic regions has been too short for differential selective forces to have had a significant impact on allele frequencies.
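To make the tail comparison concrete, the following sketch converts χ² (1 d.o.f.) statistics to P-values and reports the fraction falling below 0.0001. It is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and the statistics are simulated null draws (squared standard normals) rather than real genotype data.

```python
import math
import random


def chi2_1dof_pvalue(stat):
    """Upper-tail P-value of a chi-square statistic with 1 d.o.f.

    Uses the identity P(X > stat) = erfc(sqrt(stat / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))


def tail_fraction(stats, threshold=1e-4):
    """Fraction of statistics whose P-value falls below the threshold."""
    hits = sum(1 for s in stats if chi2_1dof_pvalue(s) < threshold)
    return hits / len(stats)


# Simulated null statistics (assumption: stand-in for real SNP statistics).
rng = random.Random(1)
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(200_000)]
null_fraction = tail_fraction(null_stats)
```

Under the null, about 0.01% of markers are expected below the 0.0001 threshold, so the observed 0.008% within Iceland is in line with the null expectation.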

Fig. 4. P-P plots of allele frequency differentiation between region r and the union of all other regions, for each value of r (1 to 11).

In a disease association study where cases and controls are drawn from distinct populations, there is a mathematical relationship between the distribution of allele frequency differences and the distribution of disease association statistics (see Materials and Methods). We obtained empirical agreement with this theoretical result by simulating a case-control study in which 100 unrelated samples with majority ancestry from region 4 were labeled as disease cases and 100 unrelated samples with majority ancestry from region 5 were labeled as controls. We computed Cochran-Armitage trend statistics and obtained a genomic control λ of 1.285, consistent with the predicted value of (1 + N·FST) = 1.28 given the FST of 0.0014 between the two regions (see Materials and Methods). After dividing the Cochran-Armitage trend statistics by the genomic control λ, the most significant association had a P-value of 3×10⁻⁶, which is not statistically significant after correcting for the number of SNPs tested. We repeated this analysis for all pairs of regions (4,5,6,8,10) with 100 unrelated samples available (see Table 1), and obtained similar results (minimum P-value of 4×10⁻⁷, which is not statistically significant after correcting for the number of SNPs and number of pairs of regions tested).
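The relationship between FST and the genomic control λ can be illustrated numerically. The sketch below is hedged: the function names are hypothetical, N is assumed to be the total number of samples (100 cases plus 100 controls), which reproduces the quoted 1.28, and λ is estimated in the usual genomic control way as the median statistic divided by the median of a χ² distribution with 1 d.o.f.

```python
import statistics

# Median of the chi-square distribution with 1 d.o.f.
CHI2_1DOF_MEDIAN = 0.4549364231195724


def predicted_lambda(n_samples, fst):
    """Predicted inflation 1 + N * FST (N assumed to be total sample count)."""
    return 1.0 + n_samples * fst


def genomic_control_lambda(stats):
    """Empirical lambda: median association statistic over the null median."""
    return statistics.median(stats) / CHI2_1DOF_MEDIAN


def correct_stats(stats):
    """Divide statistics by lambda; by convention never inflate when lambda < 1."""
    lam = max(1.0, genomic_control_lambda(stats))
    return [s / lam for s in stats]


# Region 4 versus region 5: 200 samples in total, FST = 0.0014.
lam_pred = predicted_lambda(200, 0.0014)
```

With these inputs `lam_pred` comes out at about 1.28, close to the empirical value of 1.285 reported above.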

A consequence of these findings is that whenever λ is close to 1 in a disease association study involving the Icelandic population, false positive associations due to population stratification can be conclusively ruled out. If λ is greater than 1, then dividing association statistics by λ will still prevent false positive associations. This is not the case in populations, such as European Americans, with a subtle level of structure arising from more ancient divergence [25].

Distribution of Allele Frequency Differences between Iceland and Scotland

We evaluated whether an excess of common SNPs with large allele frequency differences between Icelanders and Scots could provide evidence of population-specific natural selection. We used Icelanders and Scots (rather than Norwegians) in this analysis, because these samples were genotyped on the same chip under identical assay conditions, thus avoiding the effects of differential bias [26]. Indeed, tail distributions of comparisons between populations genotyped on different chips appear to be confounded by assay artifacts, precluding robust analyses of those comparisons (see Text S1). We used allele frequency differences between the Icelandic and Scottish samples at common SNPs to compute a χ² (1 d.o.f.) statistic for unusual population differentiation that accounts for the effects of neutral genetic drift (see Materials and Methods). A P-P plot of our results is displayed in Figure 5. In contrast to Figure 4, there is a substantial excess of markers in the extreme tail, with 0.018% of markers having a P-value less than 0.0001. We speculate that many of these markers are likely to have been under natural selection.

Fig. 5. P-P plot of allele frequency differentiation between Iceland and Scotland.
The nine SNPs from Table 3 are displayed as squares.

We found eight SNPs, representing two chromosomal regions, for which the evidence of unusual population differentiation was genomewide-significant (nominal P-value < 10⁻⁷, P-value < 0.03 after correcting for 284,191 common SNPs tested). Six of the SNPs lie in or near the TLR (toll-like receptor) genes TLR10 and TLR1, while the other two lie inside the NADSYN1 (NAD synthetase 1) gene (http://genome.ucsc.edu/) (Table 3). For each of these SNPs, the allele frequency difference between Icelanders and Scots was greater than 15% (Table S1), far in excess of typical allele frequency differences of about 3% that correspond to an FST value of 0.0020. Only two of the SNPs from Table 3 were present in Norwegian data based on the Affymetrix 6.0 chip (rs10024216 and rs11096957 in the TLR region), but for both of these SNPs, and also for rs7940244 in the NADSYN1 region (which was not genomewide-significant in the comparison of Icelanders and Scots), allele frequency differences between Norwegians and Scots were likewise greater than 15% (Table S1), ruling out an effect specific to Icelanders. We also report frequencies of these SNPs in HapMap populations [27] (Table S1). We note that both TLR and NADSYN1 were previously reported to be significantly differentiated among 12 British subpopulations analyzed by the WTCCC (nominal P-values of 10⁻¹² for TLR and 10⁻⁸ for NADSYN1) [28]. The WTCCC study has made an important and valuable contribution to research on natural selection by highlighting the potential utility of large sample sizes from very closely related populations for detecting signals of selection. However, the statistical test employed by those authors only evaluated whether frequency differences between the 12 subpopulations were different from zero, and not whether the amount of differentiation was in excess of what would be expected under neutral genetic drift (as inferred from genome-wide patterns).
As an illustration of this distinction, we observed that a total of 3,982 SNPs in our data set had frequency differences between Iceland and Scotland that were different from zero at the nominal P-value threshold of 10⁻⁷ used for the corresponding test in the WTCCC study. It is extremely unlikely that all of these SNPs were under selection. Thus, it is not possible to conclude whether the results of the WTCCC study represent genomewide-significant signals of selection. However, our findings support the hypothesis that selection did occur.

Tab. 3. List of markers whose unusual differentiation between Iceland and Scotland is genomewide-significant.
A total of 12 markers in the TLR region and 5 markers in the NADSYN1 region achieved a nominal P-value of 0.0001 or lower (data not shown). We list with an asterisk one additional marker whose differentiation is highly suggestive (see text). Gene names are listed for markers located between the transcription start and end sites of a gene.

In addition to the eight genomewide-significant signals, a highly suggestive signal of unusual differentiation was observed at the SNP rs13107325 (nominal P-value = 2×10⁻⁷, P-value = 0.06 after correcting for 284,191 common SNPs tested) (Table 3). This SNP is a missense coding SNP inside the SLC39A8 (solute carrier family 39 (zinc transporter), member 8) gene (http://genome.ucsc.edu/), and allele frequencies in HapMap [27] indicate that the minor allele of this SNP is private to populations of European ancestry (Table S1). Thus, although this SNP did not meet our strict criteria for genome-wide significance, it is an intriguing candidate for natural selection.
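The corrected P-values quoted in this section are consistent with multiplying each nominal P-value by the number of common SNPs tested. The sketch below assumes a Bonferroni-style correction of that kind, which the text does not name explicitly; the function and variable names are hypothetical.

```python
def bonferroni(p_nominal, n_tests):
    """Bonferroni-style corrected P-value, capped at 1 (assumed correction)."""
    return min(1.0, p_nominal * n_tests)


N_COMMON_SNPS = 284_191  # common SNPs tested, as stated in the text

# Genomewide-significance threshold: nominal 1e-7 corrects to about 0.028.
threshold_corrected = bonferroni(1e-7, N_COMMON_SNPS)

# rs13107325: nominal 2e-7 corrects to about 0.057, reported as 0.06.
slc39a8_corrected = bonferroni(2e-7, N_COMMON_SNPS)
```

Under this assumption the nominal threshold of 10⁻⁷ corresponds to a corrected value just under 0.03, and the SLC39A8 SNP falls just short of the conventional 0.05 level.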

Discussion

We analyzed the population structure of Iceland using dense genotype data to show that there are subtle but discernable genetic differences between individuals from different Icelandic regions, and that these differences are broadly consistent with the ring-shaped topology of the inhabited part of Iceland. The average pairwise FST of 0.0026 for the 11 regions we analyzed is similar to FST values between different European populations. However, it is important to point out that FST values in this study may be heavily dependent on the sampling scheme, and FST values of a similar magnitude might be observed within other European countries if analyzed at the same geographical resolution. Notably, Icelandic subpopulation differences are due to recent genetic drift and not to varying contributions from ancestral populations, as the subpopulations from each Icelandic region inherit roughly 64% Nordic and 36% Gaelic ancestry.

A consequence of the recent origin of the genetic differences between Icelandic subpopulations is that allele frequency differences follow the null distribution predicted by neutral drift. Thus, there is little risk of false positive associations due to population stratification in disease association studies, despite the fact that there are genuine differences between regions. The same conclusion may be expected for other populations whose structure has arisen from recent genetic drift [29]. On the other hand, such populations are not well-suited for the detection of regionally specific natural selection reflected in unusual differences between subpopulations. For that purpose, subtly structured populations whose structure is due to more ancient population divergence, with large population sizes minimizing subsequent genetic drift, offer the greatest promise. For example, European American subpopulations exhibit unusual differences at the LCT, HLA and OCA2 loci that lie outside the null distribution with genome-wide significance ([13] and A.L. Price, unpublished data). The distinction between population differences attributable to recent drift and those arising from more ancient divergence is also likely to be of interest in studies of other subtly structured populations [22],[28],[30].

For some diseases in Iceland, such as breast cancer, the geographical distribution of patients and their ancestors is not random [31]. Our results indicate that highly differentiated common variants are unlikely to be the cause of this phenomenon. Rare variants that have risen to higher frequency in certain regions of Iceland due to founder effects provide a more plausible explanation. An example in the case of breast cancer is the BARD1 Cys557Ser risk variant that rose in frequency in the easternmost county of Sudur-Mulasysla (Figure 1) due to a population bottleneck in that region [32]. A direction of research that is motivated by our findings is to investigate the extent to which rare variants, spread by recent founder effects, play a role in differences in disease prevalence among individuals with ancestry from different regions of Iceland.

Materials and Methods

Ethics Statement

This research was approved by the Data Protection Commission of Iceland and the National Bioethics Committee of Iceland. The appropriate informed consent was obtained for all sample donors.

Icelandic Data

DNA samples from 35,457 individuals residing in Iceland were genotyped using the Illumina 300 K chip in the course of disease association studies conducted by deCODE Genetics. The appropriate informed consent was obtained for all sample donors. Owing to the sensitive nature of genotype data, access to this data can only be granted at the headquarters of deCODE Genetics in Iceland. SNPs with >5% missing data were removed, leaving 292,289 autosomal SNPs for analysis. No linkage disequilibrium or low frequency SNP filters were applied. For each Icelandic sample genotyped, additional data were available from a genealogical database describing relatedness to other samples and listing the birth county in Iceland of each ancestor tracing back five generations [33]. This information was used to restrict some analyses to subsets of Icelandic samples (see below).

Samples with Ancestry from 11 Regions of Iceland

We grouped the 21 counties of Iceland into 11 regions, as previously described [9] (Figure 1). From the entire set of 35,457 individuals, we selected a subset of 14,313 individuals with majority ancestry from one of the 11 regions, based on having at least 16 of 32 ancestors (five generations back) from that region (Table 1a). The goal of this scheme was to choose a set of samples reflecting the population structure of Iceland prior to the large-scale migration that resulted from industrialization and urbanization during the past century. From this set of 14,313 individuals we selected a further subset of 885 individuals—with at most 100 individuals from each region—that were unrelated at a meiotic distance of four generations. Of the 885 individuals, 8 were removed as genetic outliers when we ran PCA [17]; Table 1b and subsequent analyses are based on the remaining 877 individuals. The size limit of 100 individuals was used to ensure a relatively even representation of regions for analyses that are sensitive to varying sample sizes from subpopulations. We note that region 1, which contains the capital city of Reykjavik, was heavily underrepresented as it had a small population prior to urbanization.

An Additional 250 Icelandic Samples

We randomly selected 250 samples from the 35,457 samples that were genotyped on the Illumina 300 K chip. Of these 250 samples, five overlapped the previous set of 877 samples; these were retained in the set of 250 additional samples but excluded from the set of original samples, in which only 872 samples were retained. We ran PCA on the combined set of 1,122 samples (Figure 2B) and used the 872 original samples to compute the average value of PC1 and PC2 for each region r. For each of the 250 additional samples, we computed the Euclidean distance between (PC1,PC2) for that sample and the average value of (PC1,PC2) for region r, and defined our prediction of regional ancestry as the value of r minimizing that distance. We defined true ancestry as the region in which the greatest number of ancestors five generations back was born. We compared predicted ancestry with true ancestry, both for the set of 250 samples and for a subset of 98 samples with majority ancestry from a single region. Given the low number of ancestors from region 1 (see Table 1), we merged region 1 with region 11 in these analyses (see Figure 1). This had little effect on our results, as only two of the 250 samples and none of the subset of 98 samples had the greatest number of ancestors from region 1. Thus, predicted ancestry P and true ancestry T each had values between 2 and 11. We considered our ancestry prediction to be correct if P = T, correct to within a distance of one region if P and T were at most one region apart around the ring of regions, and correct to within a distance of two regions if P and T were at most two regions apart (see Figure 1).
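The prediction rule described above amounts to a nearest-centroid classifier on (PC1, PC2). The sketch below is illustrative, with hypothetical function names and toy coordinates. It also assumes that the distance between regions is measured around the ring of regions 2 to 11, so that regions 11 and 2 count as neighbors; that reading follows from the ring-shaped map but is an assumption here.

```python
import math


def region_centroids(samples):
    """Average (PC1, PC2) per region from (region, pc1, pc2) reference samples."""
    sums = {}
    for region, pc1, pc2 in samples:
        acc = sums.setdefault(region, [0.0, 0.0, 0])
        acc[0] += pc1
        acc[1] += pc2
        acc[2] += 1
    return {r: (a[0] / a[2], a[1] / a[2]) for r, a in sums.items()}


def predict_region(pc1, pc2, centroids):
    """Region whose centroid is closest in Euclidean distance."""
    return min(
        centroids,
        key=lambda r: math.hypot(pc1 - centroids[r][0], pc2 - centroids[r][1]),
    )


def ring_distance(p, t, first=2, last=11):
    """Steps between regions p and t around the ring (assumed circular)."""
    n = last - first + 1
    d = abs(p - t) % n
    return min(d, n - d)


# Toy reference samples (made-up coordinates, for illustration only).
reference = [(2, 0.0, 1.0), (2, 0.2, 1.0), (3, 1.0, 0.0), (11, -1.0, 0.0)]
centroids = region_centroids(reference)
predicted = predict_region(0.9, 0.1, centroids)
```

A prediction would then be scored as correct when `ring_distance(P, T)` is 0, and correct to within one or two regions when it is at most 1 or 2.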

Samples from Norway and Scotland

The Icelandic population arose from the admixture of Norse and Gaelic ancestors. To represent the ancestral populations, 445 samples from Scotland were genotyped on the Illumina 300 K chip, and 250 samples from Norway were genotyped on the Affymetrix 6.0 chip. The appropriate informed consent was obtained for all sample donors. Illumina 300 K genotyping was conducted by deCODE Genetics, and Affymetrix 6.0 genotyping was conducted by Expression Analysis on behalf of Ulleval University Hospital in Oslo. SNPs with >5% missing data in either Norway or Scotland were removed, leaving 79,641 autosomal SNPs (that were genotyped on both chips) in the merged data set of samples from Iceland, Norway and Scotland.

Assessment of Nordic and Gaelic Ancestry in the Icelandic Population

Let Nj and pj denote total allele count and observed allele frequency in the Icelandic population, Nj1 and pj1 denote total allele count and observed allele frequency in ancestral population 1, and similarly Nj2 and pj2 in ancestral population 2, for SNP j. Let MIXα denote a synthetic population consisting of a linear combination of proportions α and (1−α) from ancestral populations 1 and 2, respectively, with allele frequency p = α pj1+(1−α) pj2 at SNP j. We estimate the FST between Iceland and MIXα using an estimator in which terms are subtracted in the numerator to adjust for the effects of sampling error (see Supp Note 10 of [34]). We note that linkage disequilibrium between SNPs may lead to suboptimal weighting, which will increase the variance but will not bias the estimate. We estimate FST for different values of α (on an evenly spaced grid from 0 to 1) and infer the ancestry proportion α that minimizes FST, as described previously [35],[36]. We compute the standard error of the ancestry estimate α via a bootstrap approach. We partition the set of SNPs into B disjoint blocks (e.g., B = 100), repeat the computation for SNPs in each block to obtain B different ancestry estimates, and compute the standard error as the standard deviation of these estimates divided by the square root of B. Standard errors of FST estimates are computed in the same way. We note that the computation of FST between two sampled populations corresponds to the special cases α = 0 and α = 1 of this estimator.
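The grid search over α and the block-based standard error can be illustrated with the following Python sketch. It is a rough illustration under stated assumptions, not the estimator of [34]: the exact form of `fst_vs_mixture` (a ratio of sums with denominator 2q(1−q), where q is the mean of the Icelandic and mixture frequencies) is assumed for this example, all function names are hypothetical, and the input data are simulated.

```python
import numpy as np


def fst_vs_mixture(p, n, p1, n1, p2, n2, alpha):
    """ASSUMED ratio-of-sums FST between a target population and MIX_alpha.

    p, p1, p2 are per-SNP allele frequencies; n, n1, n2 are per-SNP total
    allele counts. The numerator subtracts the expected sampling noise of
    each frequency; the denominator is an assumption of this sketch.
    """
    mix = alpha * p1 + (1.0 - alpha) * p2
    noise = (p * (1 - p) / (n - 1)
             + alpha ** 2 * p1 * (1 - p1) / (n1 - 1)
             + (1 - alpha) ** 2 * p2 * (1 - p2) / (n2 - 1))
    q = 0.5 * (p + mix)
    return np.sum((p - mix) ** 2 - noise) / np.sum(2 * q * (1 - q))


def estimate_alpha(p, n, p1, n1, p2, n2, grid=None):
    """Return (alpha, FST) at the grid point that minimizes FST."""
    grid = np.linspace(0.0, 1.0, 101) if grid is None else np.asarray(grid)
    fst = np.array([fst_vs_mixture(p, n, p1, n1, p2, n2, a) for a in grid])
    best = int(np.argmin(fst))
    return grid[best], fst[best]


def block_standard_error(p, n, p1, n1, p2, n2, n_blocks=100):
    """SD of per-block alpha estimates divided by sqrt(number of blocks).

    Blocks are contiguous index ranges, so SNPs are assumed to be sorted
    by genomic position.
    """
    blocks = np.array_split(np.arange(len(p)), n_blocks)
    est = [estimate_alpha(p[b], n[b], p1[b], n1[b], p2[b], n2[b])[0]
           for b in blocks]
    return np.std(est, ddof=1) / np.sqrt(n_blocks)


# Simulated, noise-free example: the target is exactly 60% population 1.
rng = np.random.default_rng(0)
m = 2000
p1 = rng.uniform(0.1, 0.9, m)
p2 = rng.uniform(0.1, 0.9, m)
p = 0.6 * p1 + 0.4 * p2
n = n1 = n2 = np.full(m, 1e6)  # very large allele counts: negligible noise

alpha_hat, fst_min = estimate_alpha(p, n, p1, n1, p2, n2)
se = block_standard_error(p, n, p1, n1, p2, n2, n_blocks=20)
print(alpha_hat, fst_min, se)
```

With real data the per-block estimates differ, and their spread is what the standard error summarizes.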

Our FST computations assume that allele frequencies are obtained from an unrelated set of individuals. If related individuals were used, the effects of sampling error would be underestimated. Unrelated individuals were used in all FST computations, except in analyses of the aggregate set of Icelandic individuals, which included some related pairs of individuals. In this analysis, we used a subset of 30,244 of the 35,457 Icelandic individuals genotyped, in which the most closely related samples were removed. In this case, the amount by which the estimated sampling error (equal to the reciprocal of N = 2×30,244) is inaccurate is expected to be far smaller than the precision of 0.0001 to which we report FST estimates, and hence negligible.

Distribution of Allele Frequency Differences

Under neutral drift, the difference (p1−p2) between observed allele frequencies of two populations at a given locus can be approximated as a normal distribution with mean 0 and variance p(1−p)(2FST+1/N1+1/N2), where FST is the genetic distance between the two populations, N1 and N2 are total allele counts in each population, and p is the ancestral allele frequency that can be approximated as the average of the two observed allele frequencies [37]. We note that this null model extends to the case of admixture, which simply scales FST by the square of the admixture coefficient. It follows that (p1−p2)²/[p(1−p)(2FST+1/N1+1/N2)] is χ² distributed with 1 degree of freedom (d.o.f.). In fact, one can simply compute (p1−p2)²/[p(1−p)] divided by its mean across SNPs, avoiding complications involving the effective sample size in the case of related samples. In these computations we excluded SNPs with minor allele frequencies p<0.05 to minimize deviations from the normality assumption. An excess of large values of the χ² statistic indicates deviations from the null model, suggesting the action of natural selection.
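As a rough illustration of the mean-scaled statistic described above, the following Python sketch computes (p1−p2)²/[p(1−p)] for each SNP, divides by the mean across SNPs, and checks the upper tail against the χ² (1 d.o.f.) expectation. The function name and the simulated frequencies are hypothetical, and the simulation parameters (FST = 0.001, 2,000 alleles per population) are chosen only for the example.

```python
import numpy as np


def scaled_freq_diff_stats(p1, p2, maf_min=0.05):
    """(p1 - p2)^2 / [p (1 - p)] divided by its mean across retained SNPs.

    p is the average of the two observed frequencies; SNPs whose minor
    allele frequency min(p, 1 - p) is below maf_min are excluded.
    Returns the scaled statistics and the boolean mask of retained SNPs.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p = 0.5 * (p1 + p2)
    keep = np.minimum(p, 1.0 - p) >= maf_min
    raw = (p1[keep] - p2[keep]) ** 2 / (p[keep] * (1.0 - p[keep]))
    return raw / raw.mean(), keep


# Simulate SNPs under the null model: normal differences with variance
# p(1-p)(2*FST + 1/N1 + 1/N2), using made-up values for FST, N1 and N2.
rng = np.random.default_rng(1)
m = 50_000
p_anc = rng.uniform(0.02, 0.98, m)
var_scale = 2 * 0.001 + 1 / 2000 + 1 / 2000
d = rng.standard_normal(m) * np.sqrt(p_anc * (1 - p_anc) * var_scale)
sim_p1, sim_p2 = p_anc + d / 2, p_anc - d / 2

stats, keep = scaled_freq_diff_stats(sim_p1, sim_p2)

# Under the null, about 5% of statistics should exceed 3.841, the 95th
# percentile of the chi-square distribution with 1 d.o.f.
tail_fraction = np.mean(stats > 3.841)
print(keep.sum(), tail_fraction)
```

A tail fraction well above 0.05 in real data would indicate an excess of large allele frequency differences relative to the null model.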

Relationship between the distributions of allele frequency differences and disease association statistics, if cases and controls are drawn from distinct populations. We provide a mathematical derivation for the result that a null distribution of allele frequency differences implies a null distribution of disease association statistics after correction by genomic control. We consider a hypothetical association study in which N/2 diploid disease cases are drawn from population 1 and N/2 diploid controls are drawn from population 2. Any instance of population stratification can be considered in this framework by defining population 1 and population 2 as appropriate admixtures of the underlying populations. For a given marker, let p1 and p2 denote observed frequencies in cases and controls and p be the mean of p1 and p2. It follows that the correlation between genotype and case-control status is equal to (p1−p2)/√[2p(1−p)], so that the Cochran-Armitage trend statistic [38], which equals N times the square of that correlation, is equal to N(p1−p2)²/[2p(1−p)]. Since (p1−p2) is normally distributed with mean 0 and variance p(1−p)(2FST+1/N1+1/N2), where N1 = N2 = N (see above), it follows that the Cochran-Armitage trend statistic has a χ² (1 d.o.f.) distribution scaled by (1+NFST). (See [39] for a related derivation.) This means that when the method of genomic control [40] is applied, the inflation factor λ is equal to 1+NFST, and that dividing association statistics by λ results in a χ² (1 d.o.f.) distribution. More generally, the fact that both the allele frequency difference statistic and the Cochran-Armitage trend statistic are proportional to (p1−p2)²/(p(1−p)) implies that the distributions of these two statistics are identical up to a constant scaling factor, even when allele frequency differences do not follow a null distribution.
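The relationship λ = 1+NFST can be checked numerically with a short simulation. The Python sketch below is illustrative only: it assumes the trend statistic takes the form N(p1−p2)²/[2p(1−p)], which follows from the correlation between genotype and case-control status when cases and controls are equally numerous, and it estimates λ with a common median-based rule (median statistic divided by the median of a χ² distribution with 1 d.o.f., about 0.4549). Function names and parameter values are hypothetical.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549  # approximate median of chi-square with 1 d.o.f.


def trend_statistic(p1, p2, n):
    """N (p1 - p2)^2 / [2 p (1 - p)] for N/2 cases and N/2 controls."""
    p = 0.5 * (p1 + p2)
    return n * (p1 - p2) ** 2 / (2.0 * p * (1.0 - p))


def inflation_factor(stats):
    """Median-based genomic control estimate of lambda (an assumed choice)."""
    return np.median(stats) / CHI2_1DF_MEDIAN


# Made-up study: N = 2000 individuals (1000 cases, 1000 controls), so each
# group contributes N alleles, and FST = 0.001 between the two populations.
N, FST = 2000, 0.001
rng = np.random.default_rng(2)
m = 200_000
p = rng.uniform(0.2, 0.8, m)

# Null model: p1 - p2 is normal with variance p(1-p)(2*FST + 1/N + 1/N).
d = rng.standard_normal(m) * np.sqrt(p * (1 - p) * (2 * FST + 2 / N))
stats = trend_statistic(p + d / 2, p - d / 2, N)

lam = inflation_factor(stats)
corrected = stats / lam
print(lam, 1 + N * FST, np.median(corrected))
```

In this simulation the estimated λ should lie close to 1+NFST = 3, and the corrected statistics have the median expected of a χ² (1 d.o.f.) distribution by construction.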


References

1. Grant SF, Thorleifsson G, Reynisdottir I, Benediktsson R, Manolescu A (2006) Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat Genet 38: 320-323.

2. Amundadottir LT, Sulem P, Gudmundsson J, Helgason A, Baker A (2006) A common variant associated with prostate cancer in European and African populations. Nat Genet 38: 652-658.

3. Helgadottir A, Thorleifsson G, Manolescu A, Gretarsdottir S, Blondal T (2007) A common variant on chromosome 9p21 affects the risk of myocardial infarction. Science 316: 1491-1493.

4. Gudbjartsson DF, Arnar DO, Helgadottir A, Gretarsdottir S, Holm H (2007) Variants conferring risk of atrial fibrillation on chromosome 4q25. Nature 448: 353-357.

5. Gudmundsson J, Sulem P, Steinthorsdottir V, Bergthorsson JT, Thorleifsson G (2007) Two variants on chromosome 17 confer prostate cancer risk, and the one in TCF2 protects against type 2 diabetes. Nat Genet 39: 977-983.

6. Thorleifsson G, Magnusson KP, Sulem P, Walters GB, Gudbjartsson DF (2007) Common sequence variants in the LOXL1 gene confer susceptibility to exfoliation glaucoma. Science 317: 1397-1400.

7. Thorgeirsson TE, Geller F, Sulem P, Rafnar T, Wiste A (2008) A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature 452: 638-642.

8. Gudbjartsson DF, Sulem P, Stacey SN, Goldstein AM, Rafnar T (2008) ASIP and TYR pigmentation variants associate with cutaneous melanoma and basal cell carcinoma. Nat Genet 40: 886-891.

9. Helgason A, Yngvadottir B, Hrafnkelsson B, Gulcher J, Stefansson K (2005) An Icelandic example of the impact of population structure on association studies. Nat Genet 37: 90-95.

10. Helgason A, Sigurðardottir S, Nicholson J, Sykes B, Hill EW (2000) Estimating Scandinavian and Gaelic ancestry in the male settlers of Iceland. Am J Hum Genet 67: 697-717.

11. Helgason A, Hickey E, Goodacre S, Bosnes V, Stefansson K (2001) mtDNA and the islands of the North Atlantic: estimating the proportions of Norse and Gaelic ancestry. Am J Hum Genet 68: 723-737.

12. Helgason A, Lalueza-Fox C, Ghosh S, Sigurðardottir S, Sampietro ML (2009) Sequences from first settlers reveal rapid evolution in Icelandic mtDNA pool. PLoS Genet 5: e1000343. doi:10.1371/journal.pgen.1000343

13. Price AL, Butler J, Patterson N, Capelli C, Pascali VL (2008) Discerning the ancestry of European Americans in genetic association studies. PLoS Genet 4: e236. doi:10.1371/journal.pgen.0030236

14. Tian C, Plenge RM, Ransom M, Lee A, Villoslada P (2008) Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet 4: e4. doi:10.1371/journal.pgen.0040004

15. Menozzi P, Piazza A, Cavalli-Sforza L (1978) Synthetic maps of human gene frequencies in Europeans. Science 201: 786-792.

16. Cavalli-Sforza LL, Menozzi P, Piazza A (1994) The history and geography of human genes. Princeton, NJ: Princeton University Press.

17. Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS Genet 2: e190. doi:10.1371/journal.pgen.0020190

18. Novembre J, Stephens M (2008) Interpreting principal component analyses of spatial population genetic variation. Nat Genet 40: 646-649.

19. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR (2008) Genes mirror geography within Europe. Nature 456: 98-101.

20. Lao O, Lu TT, Nothnagel M, Junge O, Freitag-Wolf S (2008) Correlation between genetic and geographic structure in Europe. Curr Biol 18: 1241-1248.

21. Heath SC, Gut IG, Brennan P, McKay JD, Bencko V (2008) Investigation of the fine structure of European populations with applications to disease association studies. Eur J Hum Genet 16: 1413-1429.

22. Jakkula E, Rehnstrom K, Varilo T, Pietilainen OP, Paunio T (2008) The genome-wide patterns of variation expose significant substructure in a founder population. Am J Hum Genet 83: 787-794.

23. Weir BS, Cockerham CC (1984) Estimating F-statistics for the analysis of population structure. Evolution 38: 1358-1370.

24. Jonsson G, Magnusson MS (1997) Hagskinna: Icelandic historical statistics. Reykjavík, Iceland: Hagstofa Islands.

25. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904-909.

26. Clayton DG, Walker NM, Smyth DJ, Pask R, Cooper JD (2005) Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat Genet 37: 1243-1246.

27. The International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851-861.

28. The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661-678.

29. Service S, DeYoung J, Karayiorgou M, Roos JL, Pretorious H (2006) Magnitude and distribution of linkage disequilibrium in population isolates and implications for genome-wide association studies. Nat Genet 38: 556-560.

30. Yamaguchi-Kabata Y, Nakazono K, Takahashi A, Saito S, Hosono N (2008) Japanese population structure, based on SNP genotypes from 7003 individuals compared to other ethnic groups: effects on population-based association studies. Am J Hum Genet 83: 445-456.

31. Gudmundsson J, Sulem P, Johannsson O, Sigurdsson H, Hrafnkelsson H (2004) Geographic stratification in the ancestry of breast cancer patients and carriers of the BRCA2-999del5 founder mutation in Iceland [Poster abstract]. Presented at the 54th annual meeting of the American Society of Human Genetics, Toronto, Canada.

32. Stacey SN, Sulem P, Johannsson OT, Helgason A, Gudmundsson J (2006) The BARD1 Cys557Ser variant and breast cancer risk in Iceland. PLoS Med 3: e217. doi:10.1371/journal.pmed.0030217

33. Helgason A, Palsson S, Gudbjartsson DF, Kristjansson T, Stefansson K (2008) An association between the kinship and fertility of human couples. Science 319: 813-816.

34. Keinan A, Mullikin JC, Patterson N, Reich D (2007) Measurement of the human allele frequency spectrum demonstrates greater genetic drift in East Asians than in Europeans. Nat Genet 39: 1251-1255.

35. Long JC (1991) The genetic structure of admixed populations. Genetics 127: 417-428.

36. Price AL, Patterson N, Hancks DC, Myers S, Reich D (2008) Effects of cis and trans genetic ancestry on gene expression in African Americans. PLoS Genet 4: e1000294. doi:10.1371/journal.pgen.1000294

37. Ayodo G, Price AL, Keinan A, Ajwang A, Otieno MF (2007) Combining evidence of natural selection with association analysis increases power to detect malaria-resistance variants. Am J Hum Genet 81: 234-242.

38. Armitage P (1955) Tests for linear trends in proportions and frequencies. Biometrics 11: 375-386.

39. Weir BS (1996) Genetic data analysis II: methods for discrete population genetic data. Sunderland, MA: Sinauer Associates.

40. Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55: 997-1004.

The article was published in the journal PLOS Genetics, 2009, Issue 6.