Natural Selection Reduced Diversity on Human Y Chromosomes

Download PDF České info

The human Y chromosome exhibits surprisingly low levels of genetic diversity. This could result from neutral processes if the effective population size of males is reduced relative to females due to a higher variance in the number of offspring from males than from females. Alternatively, selection acting on new mutations, and affecting linked neutral sites, could reduce variability on the Y chromosome. Here, using genome-wide analyses of X, Y, autosomal and mitochondrial DNA, in combination with extensive population genetic simulations, we show that low observed Y chromosome variability is not consistent with a purely neutral model. Instead, we show that models of purifying selection are consistent with observed Y diversity. Further, the number of sites estimated to be under purifying selection greatly exceeds the number of Y-linked coding sites, suggesting the importance of the highly repetitive ampliconic regions. While we show that purifying selection removing deleterious mutations can explain the low diversity on the Y chromosome, we cannot exclude the possibility that positive selection acting on beneficial mutations could have also reduced diversity in linked neutral regions, and may have contributed to lowering human Y chromosome diversity. Because the functional significance of the ampliconic regions is poorly understood, our findings should motivate future research in this area.

Published in the journal: . PLoS Genet 10(1): e32767. doi:10.1371/journal.pgen.1004064
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1004064

Summary

Introduction

The Y chromosome has often been used as a marker for studying human demographic history [1], but one implicit assumption in these analyses is that the Y chromosome is not affected by natural selection at linked sites [2]. However, formal tests of models of selection have been lacking. In part, this has been due to a paucity of resequencing data for many male human genomes, where autosomal, X, Y and mtDNA for the same individuals could be compared. Such data eliminate one source of sampling variance that could influence comparisons between genomic regions, and also allow for chromosome-wide estimates of genetic diversity on the Y, which is often ignored in whole-genome analyses [3]–[5]. Under simple neutral models with constant and equal male and female population sizes, diversity is expected to be proportional to the relative number of each chromosome in the population: X diversity is expected to be three-quarters autosomal diversity (because there are three X chromosomes for every four autosomes) and both the Y and mtDNA diversity are expected to be one-quarter autosomal diversity [6].

The Y chromosome does not undergo homologous recombination, except in the small pseudoautosomal regions [7]. In general, diversity is reduced in genomic regions or genomes with little or no recombination [8]–[11]. Similarly, previous studies of small segments of the human Y chromosome have found low levels of genetic diversity, but multiple theories exist to explain this reduction [12]–[16].

Because the Y chromosome is found only in males, low diversity on the Y could result from neutral processes if, for example, the effective population size of males is reduced relative to that of females. One factor that can reduce the male population size is high variance in the number of offspring. Differences in the variance in reproductive success between the sexes, will cause differences in effective population sizes, even when the actual number of males and females is approximately the same [4], [13]. Based on comparing patterns of genetic variation on the X chromosome and the autosomes, several recent studies have found evidence of sex-biased demographic processes during human history [3]–[5], [17]–[20], often suggesting that the effective population size of females was higher than that of males throughout recent human history (N_f>N_m, if N_f represents the effective number of breeding females and N_m represents the effective number of breeding males).

Alternatively, purifying selection acting to remove new deleterious mutations on the Y chromosome, will affect diversity at linked neutral sites through a process called background selection. Background selection refers to the reduction in genetic diversity at sites that are themselves neutrally evolving, but are linked to other sites where deleterious mutations occur [21]–[24]. Background selection may be particularly potent on the Y chromosome, because there is no recombination on the Y chromosome. As such, deleterious mutations in one area of the chromosome could reduce levels of genetic diversity across the entire chromosome [12], [14]–[16]. However, the strength of selection is also important. Several weakly deleterious mutations may interact resulting in a Hill-Robertson interference [25], whereby interference among linked sites weakens their effects on linked neutral sites [26]. Similarly, positive selection, acting on beneficial mutations is expected to decrease diversity at linked neutral sites. Given the unique gene content and lack of recombination on the Y chromosome, it is likely to have experienced a complex evolutionary history.

Here, using genome-wide analyses of X, Y, autosomal and mitochondrial DNA, in combination with extensive population genetic simulations, we show that low observed Y chromosome variability is not consistent with a purely neutral model. Instead, we show that models of purifying selection and background selection affecting linked neutral sites are consistent with observed Y diversity. Further, the number of sites estimated to be directly under purifying selection greatly exceeds the number of Y-linked coding sites, suggesting the importance of the highly repetitive ampliconic regions [27]–[29]. Because the functional significance of the ampliconic regions is poorly understood, our findings should motivate future research in this area.

Results/Discussion

Diversity across the entire human Y is extremely low

Analyzing complete genomic sequence data from 16 unrelated males (Table S1), we observe that normalized diversity on the human Y is extremely low compared to expectations from other genomic regions (Figure 1; Table 1). By analyzing resequencing data for the autosomes, X chromosome, Y chromosome, and mitochondria from the same individuals, we reduce sampling variance that might otherwise confound comparisons between regions of the genome. Here diversity is measured as the average pairwise differences per site, π, in the sample, and is normalized using divergence between humans and outgroup species (see Materials and Methods). The purpose of this normalization is to account for the possibility that different parts of the genome may have different mutation rates. The mutation rates could systematically differ across chromosomal types because the different chromosomes spend different amounts of time in the male and female germlines and the male germline has a higher mutation rate than the female germline [30]. Because the low diversity on the Y chromosome persists after this normalization, it cannot be explained by a correspondingly low mutation rate on the Y chromosome (Table S2; Figure S1). Further, the highly repetitive ampliconic regions of the Y were not assembled by Complete Genomics, and so are not analyzed here (Materials and Methods). Diversity on the Y chromosome is likely not being under-estimated due to the inability to call variants in haploid regions of the genome because diversity on the X measured in females, where the X is diploid, is nearly identical to diversity on the X measured in males, where the X is haploid (Figure S2). The pattern of reduced diversity on the Y chromosome is observed in both Africans and Europeans, suggesting that the effect is not population-specific, and holds regardless of whether the neutral sequence analyzed is near or far from genes (Table 1). Previous analyses of portions of the Y reported low Y diversity [12]–[16], but measuring divergence-normalized π per site at 0.0018 for Africans and 0.0024 for Europeans, we observe that chromosome-wide Y diversity is an order of magnitude lower than the equilibrium neutral expectation of one-quarter the autosomal level of diversity (Figure 1). Conversely, mitochondrial diversity is not reduced compared to expectations under neutrality (Figure 1). Additionally, our estimates of diversity on the X chromosome are consistent with previous estimates from Africans [5], [17] and Europeans [3], [5]. These trends held for all populations sampled in the public Complete Genomics data (Figure S3).

**Fig. 1. Observed and expected ratios of normalized X/Autosome, Y/Autosome, and mtDNA/Autosome nucleotide diversities.**

**Tab. 1. Observed and mean modeled estimates of neutral diversity.**

In contrast to diversity in other genomic regions, we observe that diversity is lower on the Y chromosome for the African populations in our sample than for the European populations in our sample (Table 1). Previous studies of Y chromosome diversity have also suggested that the difference in diversity on the Y is small between Africans and Europeans [31], [32], or that it may, as we observe, be higher in Europeans than some African populations [15], [33]. For example, haplotype diversity was found to be higher across Europeans than Africans (0.852 versus 0.841) [33]. Similarly, when the African populations are broken down into Sub-Saharan Africans versus North Africans (the Complete Genomics samples are Western/Northern Africans), European diversity falls in between these two, with European diversity on the Y chromosome actually higher than diversity in North Africans [33]. Other studies have observed slightly higher diversity in Africans than Europeans, but include a much more diverse group of Africans. For example, variation on the Y chromosome has been reported previously to be only slightly higher on the Y for African versus Non-African populations, even though the population of Africans is much more diverse (including Bakola from Cameroon, Dogon from Mali, Bantu from South Africa and Khoisan from Namibia and South Africa) [32] than the population we analyze. The uncorrected levels of diversity reported here for the Y chromosome (Table S2), differ from some previous studies [15], [31], [34], but are not directly comparable to these studies because: 1) they were based on genetic markers that were chosen specifically because they have high mutation rates [15], [31], [34]; and, 2) the populations are different than the ones available for this study [34]. The absolute number of SNPs identified here is not reduced relative to other sequencing platforms [35]. In fact, overall diversity is similarly observed to be low on the Y using this other technology, but a larger TMRCA is estimated [35], perhaps because the Y seems to harbor pockets of hidden diversity [36].

We next consider several possible models that could explain this unexpectedly low amount of diversity found on the Y chromosome relative to other genomic regions. Such models include differences in the variance in reproductive success between males and females, purifying selection on the Y chromosome, and positive selection on the Y chromosome.

Variance in reproductive success

In principle, a greater variance in male reproductive success than female reproductive success (N_f>N_m) could result in a lower than expected effective population size of the Y chromosome. In fact, previous studies have suggested that increased variance in offspring number has reduced the effective population size in human males versus females and might explain the reduced variability on the paternally inherited Y chromosome [4], [13]. To test the hypothesis that sex-biased demography explains the decreased Y chromosome diversity, we modeled increasingly skewed sex ratios using coalescent simulations, taking into account the complex demography of the populations analyzed here (Figure 1; Table S3; Methods). We use the case where N_m = N_f as the null model. As expected, decreases in the male effective population sizes (N_m/N_f<1) decrease expected Y diversity. However, we find that the reduction in the male effective population size required to explain the observed Y chromosome data, predicts levels of normalized autosomal, X and mtDNA diversity that are not consistent with the data in these markers (Figure 1; Table S3). This effect can also be illustrated by considering ratios of normalized diversity in each type of marker relative to autosomes. A skew in the sex ratio large enough to explain the observed reduction in Y/autosome diversity would also cause increases in X/autosome and mtDNA/autosome diversity that are incompatible with observations (Figure 1; Table S4). Thus, by analyzing all classes of genomic sequences, we are able to reject extreme sex-biased processes as the sole explanation for patterns of low observed Y variability.

Purifying selection

Natural selection has also been suggested to play a large role in reducing diversity on the Y chromosome [12], [14]–[16], and works within the context of the demographic history of the populations. Purifying selection can reduce genetic variation at linked neutral sites via a process called background selection, which has received extensive theoretical treatment in the literature [21], [22], [26], [37]–[41]. Purifying selection has already been documented for the mtDNA [42]. Due to the lack of homologous recombination throughout most of the Y chromosome, background selection is expected to have a particularly strong effect, severely reducing diversity on the Y chromosome. Two factors determine the overall effect of background selection on reducing neutral diversity in non-recombining regions: 1) The strength of selection, and 2) the number of sites subject to selection. At approximately 60 million base pairs, there are orders of magnitude more sites that may be subject to selection on the human Y chromosome than on the mtDNA. Selection may actually be quite weak on individual mutations that occur on the Y chromosome, but in the absence of recombination, if many sites are possible targets of this weak selection, this can lead to a strong reduction in diversity among Y chromosomes.

Here, we performed forward simulations with purifying selection to assess whether background selection could reduce diversity at neutral sites on the Y chromosome to the levels observed in our data. We study purifying selection under different assumptions of the variance in male reproductive success. We chose to use forward simulations, rather than using standard analytical background selection models, which assume the effect of background selection is a simple reduction in effective population size, for several reasons. First, the standard formulas were derived for equilibrium demographic models, but human populations have a more complex demographic history with unknown effects on the process of background selection. Second, many mutations have been shown to be weakly deleterious and may persist in the population due to genetic drift [37], [38]. The standard theory does not allow for this. Finally, simulations studies suggest that the standard theory can over-predict the reduction in genetic diversity due to background selection if there are many weakly selected linked mutations [26]. The forward simulations that we conducted address all of these concerns.

We first evaluated whether purifying selection acting only on new nonsynonymous mutations in the coding regions of the Y chromosome could reduce levels of genetic diversity at linked neutral sites to the levels detected in our observed Y chromosome data. To do this, we performed forward simulations using realistic demographic models for the populations where only new nonsynonymous mutations were subjected to purifying selection (see Methods). We find that models of selection acting only on coding sites cannot sufficiently reduce expected diversity at linked neutral sites through background selection on the Y chromosome. Under the assumption of equal sex ratios, regardless of the mean selection coefficient used, all models result in levels of diversity at linked neutral sites that are significantly higher than the observed values for both Africans (P<0.001) and Europeans (P<0.025, Figure S4).

In principle, models with a larger female effective population size could explain the low diversity observed on the Y chromosome. However, we have demonstrated that such models cannot match the levels of genetic diversity observed on the X chromosome, mtDNA, and Y chromosome together. However, sex-biased demography along with purifying selection acting on new nonsynonymous mutations in the coding regions of the Y chromosome could reduce levels of diversity at linked neutral sites. To evaluate the joint effects of sex-biased demography and purifying selection, we used levels of putatively neutral diversity (i.e., diversity far from genes) on the X chromosome and the autosomes to estimate the degree of sex-biased demography for the populations in our study (Table 2). We find that N_m/N_f = 0.335 in the African population which is concordant with estimates from previous studies [4], [20], [35]. Under an assumption of an extremely reduced male effective population size, relative to females (N_m/N_f = 0.335) which matches patterns of diversity on the X chromosome, predicted diversity at linked neutral sites, from models including purifying selection only on nonsynonymous mutations, is still significantly higher than the observations in Africans (P<0.001, Figure S4). In Europeans, we estimate that that N_m/N_f = 1 (Table 2). These results hold for a wide range of the mean strength of selection (Methods; Figure S5).

**Tab. 2. Estimates of <i>N</i><sub>m</sub>/<i>N</i><sub>f</sub> using X and autosomal genetic diversity far from genes.**

Estimating the number of sites under purifying selection on the Y chromosome

Given its unique structure, it is possible that purifying selection acts on more than just the nonsynonymous sites on the Y chromosome. Specifically, in addition to the approximately 100,000 single copy coding sites (predicted from annotated coding genes [43]; Methods), the Y also contains 5.7 Mb of highly repetitive ampliconic regions, composed of long palindrome “arms”, each with nearly-identical sequences [27], [28]. Genes in these ampliconic regions are expressed exclusively in the testis [27], [28], and so may be under selection related to male fertility. Further, it has been hypothesized that, in the absence of homologous recombination with the X, intra-chromosome pairing and the resulting gene conversion between palindrome arms may reduce the mutational load on the Y, and so these palindromes themselves, as a means of allowing intra-chromosome recombination, may be subjects of selection [27]–[29].

Thus, we developed a novel approximate likelihood approach to estimate the number of sites affected by purifying selection (L) required to reduce diversity at linked neutral sites to the low values observed on the Y (Methods). Simulations show that our method can accurately estimate L (Methods; Table S5). Assuming an equal sex ratio, the maximum likelihood estimate of the number of sites subjected to purifying selection on the Y is as much as 30 fold higher than the number of coding sites, for both Africans and Europeans (Figure 2). Relaxing the assumption of an equal sex ratio to allow many fewer males relative to females (to the ratio of the number of males to the number of females that fit neutral diversity on the X and autosomes, N_m/N_f = 0.335 [4], [20]), and to an extreme bias in male reproductive success of N_m/N_f = 0.1, slightly decreases the estimates of the number of sites directly affected by purifying selection. However, the estimate from the African sample is still significantly greater than the number of coding sites. Our results strongly support the hypothesis that at least some of the ampliconic regions evolve under the direct effects of purifying selection, where new mutations in these regions are deleterious.

Estimates of the number of sites affected by purifying selection (<i>L</i>) on chromosome Y. — **Fig. 2. Estimates of the number of sites affected by purifying selection (L) on chromosome Y.**

The above estimates assume that the selection coefficients of the deleterious mutations on the Y chromosome are the same as those estimated from nonsynonymous mutations on the autosomes, with appropriate re-scaling to account for the differences in Ne and ploidy on the autosomes and the Y chromosome (see Methods). However, it is possible that the strength of selection acting on noncoding mutations on the Y chromosome could be different than that acting on nonsynonymous mutations on the autosomes. It is unclear whether this difference in the strength of selection could bias our estimates of the number of sites directly under selection. To address this concern, we extended our approach to jointly estimate the number of sites directly affected by purifying selection (L) as well as the mean strength of selection (see Methods). Even when considering a range of different strengths of selection, we find that the estimates of the number of sites to be directly under the effect of purifying selection are largely insensitive to the mean strength of selection, and are still more than the number of X-degenerate coding sites (Figures S5 and S6). This suggests that content recruited to the Y chromosome after X–Y recombination was suppressed, including the high-copy-number ampliconic regions, as well as any transcription factor binding sites, may be subject to purifying selection that, due to the lack of homologous recombination, acts to reduce diversity on the human Y chromosome.

We found that a population expansion model matched the average observed levels of autosomal, X and mtDNA polymorphism in the African populations, and a bottleneck model matched the observed levels of polymorphism in the European population (Figure 1, Tables S4, S5 and S7). Several publications have documented various signatures of background selection throughout the genome [17], [44]–[47]. If background selection had reduced average levels of diversity across the genome (previous work suggests around a 6% reduction in diversity [24]), this would mean that the demographic parameters that fit the data were not truly reflective of population history, but instead reflected both population history and background selection. Thus, even if background selection is operating on the putatively neutral genomic regions we analyze here, the reduction in diversity on the Y chromosome is still too extreme to be consistent with that level of background selection. Rather, additional background selection, as we have modeled here, would be required.

Positive selection

Although models of purifying selection are consistent with the low observed diversity, it is also possible that positive natural selection may also be driving low diversity on the human Y via selective sweeps [48], [49], when neutral variation is removed due to the fixation of an advantageous mutation. Although it can be difficult to distinguish between genetic signals of background selection versus positive selection with few nucleotide polymorphisms, as is the case with the Y chromosome, we analyzed the data using two additional measures. First, we computed the folded site frequency spectrum for Y chromosome SNPs across all unrelated Y chromosomes in the Complete Genomics dataset (Figure S8). The abundance of low frequency SNPs is consistent with both positive selection and purifying selection (Figure S8), and the low overall number of SNPs makes further distinctions between the two models difficult. Second, we built a neighbor-joining tree for all unrelated Y haplotypes in the Complete Genomics dataset using phylip [50], then branch lengths were computed using a molecular clock in paml [51]. There is not an overarching star phylogeny, which would be indicative of a single selective sweep (Figure S9). While we cannot rule out such a scenario directly, we note that previous studies also found little or no evidence of selective sweeps [52] or gene-specific positive selection [53], [54] on the Y chromosome. However, one might conceive of a complex evolutionary history involving several instances of positive selection along different Y lineages that could result in the observed haplotype topology. Given recent findings of pockets of Y haplotype diversity, it is possible that recurrent positive selection may contribute to reduced Y diversity [36].

Conclusions

We observe that diversity across the entire human Y chromosome is extremely low. We find that neutral models with sex-biased demography may contribute to low Y diversity. However, models of extreme differences in reproductive success between males and females are insufficient as the sole explanation for patterns of genome-wide diversity. Alternatively, then, natural selection appears to be acting to reduce diversity on the Y. We show that models of purifying selection affecting Y chromosome diversity are consistent with low observed diversity, if purifying selection acts on more than the few coding regions left on the Y chromosome. Thus, our results suggest that selection may also act on the highly repetitive ampliconic regions, and support arguments for the functional importance of these regions [29]. Further strong purifying selection acting on the human Y is consistent with reports of the conservation of both the number and the type of functional coding genes on the Y chromosome in humans [12] and across primates [55], [56]. It is also possible that positive selection has been acting to reduce diversity on the Y chromosome, but this explanation would require multiple independent selective sweeps across populations.

Although positive selection is expected to confound evolutionary relationships, if purifying selection is the dominant force on the Y chromosome, the topology of the tree should remain intact, but the coalescent times are expected to be reduced. This means that the Y chromosome, keeping in mind that it is a single marker without recombination, may actually provide a more useful marker for inferring phylogeographic patterns than other markers. Indeed, recent resequencing efforts of the Y chromosome identified a single mutation that resolves a previously unresolved trifurcation of lineages, and reports monophyletic groupings of Y chromosomes from distinct populations [35]. While it a combination of factors influence genome-wide estimates of diversity, and variance in male reproductive success still affects patterns of autosomal, X, Y and mtDNA diversity, selection clearly affects levels of diversity on the Y, and so should be considered when drawing conclusions regarding demography and population history based on patterns of Y-linked markers.

Materials and Methods

Genomic data analysis

We analyzed unrelated, high quality, publicly available whole genomes generated by Complete Genomics assembly software version 2.0.0 [57] (Table S1). Next generation sequence data often suffer from sequence errors, assembly errors and missing information, and non-reference alleles will be less likely to be mapped [58]. However, the Complete Genomics dataset overcomes many of these errors by using very high coverage (>30X [57]). Additionally, to be conservative, we only consider sites with data called in all individuals in each population. We removed putatively functional and difficult to assemble regions including: RefSeq known genes, CpG islands, simple repeats, repetitive elements (RepeatMasker), centromeres, and telomeres, downloaded from the UCSC Genome browser [43], and filtered using Galaxy [59]. We also excluded the hypervariable regions on the mtDNA [60], which might inflate estimates of mitochondrial diversity, and analyzed only the X-degenerate regions of the human Y [27], because diversity might be reduced in the pseudoautosomal or ampliconic regions. Divergence was computed from number of nucleotide differences per site between pairwise human and chimpanzee reference sequence alignments for autosomes, X, and mtDNA downloaded from the UCSC genome browser [43], and for the Y from ref [28]. The total number of SNPs called on the Y chromosome in the Complete Genomics dataset does not appear to be lower than other chromosome-wide assessments of Y variation. Of the SNPs across 16 individuals that overlap between the 1000 genomes (252 SNPs) and Complete genomics dataset (6236), there are only 12 sites called in the 1000 genomes dataset that are not called in the Complete Genomics dataset; all of these are singletons, and many have missing data across several individuals (Table S7). Further, the geographic distribution of Y chromosome sampled for the Complete Genomics dataset does not appear to be wider for the European versus the African populations [61]. The per generation per site mutation rates estimated from human-chimpanzee alignments, assuming a divergence time of 6 million years and 20 years per generation, are 2.11×10⁻⁰⁸ for the autosomes, 1.65×10⁻⁰⁸ for chromosome X, and 3.42×10⁻⁰⁸ for chromosome Y. For mtDNA we use the mutation rate reported of 1.7×10⁻⁰⁸ for the mtDNA [62]. The recombination rates used were 1 cM/Mb and (2/3)*(1 cM/Mb), for the autosomes and X, respectively. Diversity is measured using, π, the average number of nucleotide differences per site between all pair of sequences. For the inference of the number of sites under selection, we summarize the genetic variation data by S, the number of segregating sites, because the distribution of S, conditional on the underlying genealogy, is known (Poisson, see below). We do not directly analyze the ampliconic regions, as they were not assembled in the Complete Genomics data.

All estimates of diversity, and human-chimpanzee divergence used for normalization are reported in Table S2. Human-orangutan estimates of divergence could not be used because no whole Y chromosome sequence currently exists for orangutan. Although the Y chromosome sequence was recently published for the rhesus macaque, the sequence has diverged and degraded so much between human and macaque that very little of the noncoding regions are alignable [55], preventing us from reliably correcting for divergence across all chromosome types using human-macaque divergence.

Modeling male and female effective population sizes

Population genetics parameters used in coalescent [63] and forward simulations [64] for Europeans and Africans are similar to previously published estimates [65], [66]. We use a simple model of drift, which assumes purely random (Poisson) variation in offspring numbers for both males and females, and non-overlapping generations. For Africans, the neutral model is of an expansion from 10,000 to 20,000 individuals 4,000 generations ago. For Europeans the neutral model is of a bottleneck from 10,000 to 1,000 individuals 1,500 generations ago, followed by an expansion to 10,000 individuals 1,100 generations ago (Table S6).

Neutral expectations under equal and skewed sex ratios were modeled using coalescent simulations implemented in ms [63], assuming the population-specific demographic models described above, and allowing for recombination on the autosomes and X chromosome, but not the Y or mtDNA.

The effective population sizes for each chromosome type (N_auto, N_chrX, N_chrY, and N_mtDNA), for given male and female effective population sizes (N_m and N_f) are (see e.g., ref [67]):

For a fixed ratio and males to females (R = N_m/N_f), and fixed total effective population size (N_auto), we then calculate the male and female effective population sizes as:

Using these equations we can use standard neutral coalescent simulations implemented in ms to simulate data for the four chromosome types, while varying R, but keeping N_auto constant. We keep N_auto constant to mimic the real data, as the demographic parameters were originally estimated from autosomal markers. Further details about the values used for simulations can be found in Table S8. Complete commands for ms simulations are given in Note S1.

Modeling purifying selection

We modeled purifying selection using forward simulations implemented in SFS_CODE [64]. The exact commands used in the SFS_CODE simulations are given in Note S1. Similar to the coalescent simulations, we modeled the African and European populations separately, using the population-specific demographic models described above, the Y chromosome per generation per base pair mutation rate, and sampling 8 chromosomes per simulation to match the sample size of our observed data. However, unlike ms, which scales parameters by the current population size and moves backward in time, SFS_CODE starts with the ancestral number of chromosomes and simulates a haploid population forward in time. Thus, when rescaling the effective population size from the autosomal estimates, for SFS_CODE we used the same diploid autosomal ancestral effective population size for both populations (N = 10,000). The Y chromosome effective size was then found using the same process described above for the neutral coalescent simulations.

Evaluating purifying selection on coding sites

To investigate purifying selection acting only on new nonsynonymous mutations, we simulated 60,041 nonsynonymous sites (90,062 coding sites are estimated from the union of all exons from X-degenerate, non-pseudoautosomal genes on the Y chromosome [43]) at which new mutations are expected to be subject to purifying selection. To assess the effect of background selection, each simulation also contained 500 kb of linked neutral sequence from which we calculated diversity.

The effect of background selection is a function of the distribution of selection coefficients for new, deleterious mutations, and can be modeled by varying the mutation rate, the number of sites affected by selection (L), and the selection coefficient acting on new mutations (s) [21]. When evaluating models with different strengths of purifying selection, we assumed that selection coefficients for the nonsynonymous sites were drawn from a gamma distribution. Previous studies found this distribution to fit the observed autosomal frequency spectrum well [37], [38], [68], [69], and there is little reason to believe that the shape of the gamma distribution varies across chromosomes. However, although the X -⁠ and Y-linked genes are often highly diverged in sequence and function, the remaining X-degenerate Y-linked genes are likely highly constrained in order to have survived on the Y [70]. Thus, it may not be precise to assume X-degenerate Y-linked genes evolve under similar selective constraints as autosomal genes. To address this, we investigate a wide range of scale parameters of the gamma distribution. For a fixed value of the shape parameter of the gamma distribution, the mean strength of selection can be changed by modifying the scale parameter of the gamma distribution. Thus, we fixed the shape parameter to 0.184 (as estimated by refs [37], [69]) and performed simulations using mean selection coefficients ranging from 0.0001 to 0.09 (Figure S4).

We ran 1,000 replicates for each set of selection parameters in each population. For each replicate we calculated π*, the simulated per site nucleotide diversity (average number of pairwise differences) normalized by the per site human-chimp divergence (0.02051; Table S1). The similarly calculated observed Y diversity is denoted π_obs. For each set of parameter values we then calculated P₁, the proportion of simulation replicates with π*>π_obs was used to calculate a 2-sided P-value by P₂ = 1−2×|P₂−0.5|. Models with could not be rejected and were considered to fit the observed data.

Estimating the number of sites under purifying selection

To estimate the number of sites directly affected by purifying selection on the Y chromosome (defined as L) from looking at the levels of diversity at linked neutral sites, we developed a novel approximate likelihood approach [65], [71], [72] using the observed number of segregating sites, S_obs, in neutral regions, as a summary statistic. We then define the likelihood function for L in a neutral region as:

where is the number of segregating sites in neutral regions of the observed data, is the sum of all the branch lengths of the genealogy in units of generations, and refers to all of the other fixed parameters in the model (e.g., the demographic history and distribution of selection coefficients). Under the infinite sites model, the conditional distribution of S_obs given T is Poisson (see e.g., [73]):

where μ is the neutral mutation rate per generation over the entire region. This relationship holds even if the underlying genealogy has been affected by natural selection or other non-stationary demographic processes, as long as the individual mutations being analyzed are neutral. Then, the number of sites affected by purifying selection, L, enters the likelihood function by the effect that selection has on the genealogy. is the distribution (density) of the sum of the branch lengths over the entire genealogy under the particular model of demography and selection, with L sites directly affected by purifying selection. This distribution is difficult to calculate directly, and in general, the integral given above cannot be solved analytically. However, it could be approximated using simulation approaches that keep track of the genealogy as part of a forward simulation method [74]. If we could simulate from, then the distribution of could be approximated as the sum:

However, even such an approach is cumbersome and slow because of the overhead involved in keeping track of a genealogy in simulations with multiple loci under selection. We instead employ an approximate approach using forward-simulations implemented in SFS_CODE [64]. For a simulation replicate producing variable sites, and with a simulated value of equal to T*,

Therefore, a simulation consistent estimator of can be obtained from the number of segregating sites in a simulated sample. In other words, if we simulate enough sites in each replicate, the total tree length can be approximated using the number of segregating sites (Table S5; Figure S5). The aforementioned integral in the likelihood function can therefore be approximated stochastically by simulating data sets using SFS_CODE, with S_i^*, = 1, 2,…k, segregating sites, and each with a neutral mutation rate of μ_sim, and then evaluating,

as an estimator of the likelihood function for L based on S_obs. The number of neutral base pairs on the Y chromosome with sufficient sequencing data was 7,758,906 and 7,974,045 bp for the African and European populations respectively. Assuming a neutral mutation rate of 3.42×10⁻⁰⁸ per base pair per generation, μ = 0.265 for the African population and μ = 0.273 for the European population. However, forward simulations of >7 Mb of sequence are extremely time consuming. Thus, for computational efficiency, we simulated 500 kb of neutral sequence, giving μ_sim = 0.0171. We accounted for the fact that we simulated fewer neutral sites than in the actual data by including the ratio of the two per region mutation rates (μ/μ_sim), in our likelihood function represented above. We chose to simulate 500 kb of neutral sequence because a region of that size is small enough to be computationally efficient while still allowing an accurate approximation of T (Figure S5). Using this method we optimized the likelihood function over a grid of values for L ranging from below the number of coding sites, 50 kb, to more than the number of ampliconic regions, 6 Mb.

The population scaled selection coefficient (Ns) acting on a particular deleterious mutation was drawn from a gamma distribution, with the parameters estimated in Boyko et al. [37], including the same shape parameter (0.184) used above. However, because the Boyko et al. [37] model was developed for the autosomes, and assumes semi-dominant effects, we rescaled the mean strength of selection for a haploid model to represent Y evolution. The scale parameter of the Boyko et al. model (8200) was divided by the ratio of the number of chromosomes used in the original model (51272) to the number of Y chromosomes used in our simulations (5000), then multiplied by 2 because the original model described the fitness of a mutation in the heterozygous state, and all mutations on the Y chromosome will immediately be exposed to selection. Thus, our model used the resulting scale parameter (1600).

We also jointly estimated the number of sites directly under selection (L) and the mean strength of selection by looking at neutral diversity levels on the Y chromosome. We employed an approximate likelihood approach similar to that described above. However, here we investigated a two-dimensional grid of different values for L and a grid of different scale parameters for the gamma distribution of selective effects. Because we kept the shape parameter fixed at 0.184, changing the scale parameter changed the mean strength of selection. We found that our estimates of L were largely insensitive to the mean strength of selection. The profile likelihood curve shown in Figure S7 is remarkably similar to the likelihood curve shown in Figure S6, when the mean strength of selection was held constant.

Asymptotic approximate 95% confidence intervals included all points in the log-likelihood curve that fell within 1.92 log-likelihood units from the MLE (Note S1; Figure S6). Linear interpolation was used to find the appropriate cutoff in between grid points. SFS_CODE commands used for this section are given in Note S1.

Performance of the approximate likelihood approach on simulated data

We performed simulations to evaluate the performance of our approximate likelihood approach to estimate L by simulating 1,000 Y chromosome datasets using SFS_CODE under models of African and European demographic history. No recombination was allowed on the Y chromosome. Each simulation replicate, or simulated dataset, included 7.5 Mb of neutral sequence (equivalent to the size of our observed data) linked to 2 Mb of sites (i.e., L = 2 Mb) where new mutations were subjected to purifying selection (with selection coefficients drawn from the gamma distribution as discussed in Methods). For each simulated region, the approximate likelihood approach was used to estimate L based on the number of segregating sites within the neutral region. The distribution of selection coefficients used in the inference procedure was the same distribution used to simulate the data. The mean and median of the maximum likelihood estimates (MLEs) as well as the coverage properties of the asymptotic 95% confidence intervals (CIs) are shown in Table S5. The asymptotic 95% CIs contain the true value of L 96.6% of the time for the African simulations and 98.3% of time for the European simulations (rather than 95% of the time), suggesting that they are slightly conservative.

Testing models of sex-biased demography and purifying selection

We repeated our analyses of whether purifying selection on coding sites can explain the low diversity on the Y chromosome and our estimation of the number of sites affected by purifying selection taking into account unequal male and female population sizes. We also evaluated whether the low diversity on the Y chromosome could be accounted for by purifying selection combined with unequal male and female population sizes. In particular, Hammer et al. [17] and Lohmueller et al. [20] estimate that there were roughly 2.63 females reproducing for each male that reproduces. In other words, N_m = 0.38N_f. Additionally, we performed our own estimate of N_m/N_f from the levels of diversity at putatively neutral sites (those >100 kb from genes) on the X chromosome and the autosomes and estimate N_m = 0.3352N (Table 2). We have shown (Figure 1) that demographic models with an autosomal ancestral effective population size of roughly 10,000 individuals fit the autosomal levels of diversity reasonably well (Table S3). We compute the effective population size of males under a skewed sex ratio by inputting the previously observed N_m/N_f ratio of 0.3352, and the autosomal size of 10,000 individual, in the equation [67]:

We then repeated the forward simulations and analyses described above using this value for N_m.

Supporting Information

Zdroje

1. JoblingMA, Tyler-SmithC (2003) The human Y chromosome: an evolutionary marker comes of age. Nat Rev Genet 4 : 598–612.

2. JoblingMA (2012) The impact of recent events on human genetic diversity. Philos Trans R Soc Lond B Biol Sci 367 : 793–799.

3. KeinanA, MullikinJC, PattersonN, ReichD (2009) Accelerated genetic drift on chromosome X during the human dispersal out of Africa. Nat Genet 41 : 66–70.

4. HammerMF, MendezFL, CoxMP, WoernerAE, WallJD (2008) Sex-biased evolutionary forces shape genomic patterns of human diversity. PLoS Genetics 4: e1000202.

5. GottipatiS, ArbizaL, SiepelA, ClarkAG, KeinanA (2011) Analyses of X-linked and autosomal genetic variation in population-scale whole genome sequencing. Nature Genetics 43 : 741–743.

6. CaballeroA (1995) On the Effective Size of Populations with Separate Sexes, with Particular Reference to Sex-Linked Genes. Genetics 139 : 1007–1011.

7. Helena MangsA, MorrisBJ (2007) The Human Pseudoautosomal Region (PAR): Origin, Function and Future. Curr Genomics 8 : 129–136.

8. AguadeM, MiyashitaN, LangleyCH (1989) Reduced variation in the yellow-achaete-scute region in natural populations of Drosophila melanogaster. Genetics 122 : 607–615.

9. BegunDJ, AquadroCF (1992) Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356 : 519–520.

10. StephanW, LangleyCH (1989) Molecular genetic variation in the centromeric region of the X chromosome in three Drosophila ananassae populations. I. Contrasts between the vermilion and forked loci. Genetics 121 : 89–99.

11. MoghadamHK, PointerMA, WrightAE, BerlinS, MankJE (2012) W chromosome expression responds to female-specific selection. Proc Natl Acad Sci U S A 109 : 8207–8211.

12. RozenS, MarszalekJD, AlagappanRK, SkaletskyH, PageDC (2009) Remarkably little variation in proteins encoded by the Y chromosome's single-copy genes, implying effective purifying selection. Am J Hum Genet 85 : 923–928.

13. WilderJA, MobasherZ, HammerMF (2004) Genetic evidence for unequal effective population sizes of human females and males. Mol Biol Evol 21 : 2047–2057.

14. WhitfieldLS, SulstonJE, GoodfellowPN (1995) Sequence variation of the human Y chromosome. Nature 378 : 379–380.

15. PritchardJK, SeielstadMT, Perez-LezaunA, FeldmanMW (1999) Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol 16 : 1791–1798.

16. MalaspinaP, PersichettiF, NovellettoA, IodiceC, TerrenatoL, et al. (1990) The human Y chromosome shows a low level of DNA polymorphism. Ann Hum Genet 54 : 297–305.

17. HammerMF, WoernerAE, MendezFL, WatkinsJC, CoxMP, et al. (2010) The ratio of human X chromosome to autosome diversity is positively correlated with genetic distance from genes. Nature Genetics 42 : 830–831.

18. EmeryLS, FelsensteinJ, AkeyJM (2010) Estimators of the human effective sex ratio detect sex biases on different timescales. American Journal of Human Genetics 87 : 848–856.

19. LabudaD, LefebvreJF, NadeauP, Roy-GagnonMH (2010) Female-to-male breeding ratio in modern humans-an analysis based on historical recombinations. American Journal of Human Genetics 86 : 353–363.

20. LohmuellerKE, DegenhardtJD, KeinanA (2010) Sex-averaged recombination and mutation rates on the X chromosome: a comment on Labuda et al. American Journal of Human Genetics 86 : 978–980.

21. CharlesworthB, MorganMT, CharlesworthD (1993) The effect of deleterious mutations on neutral molecular variation. Genetics 134 : 1289–1303.

22. HudsonRR, KaplanNL (1995) Deleterious background selection with recombination. Genetics 141 : 1605–1617.

23. CharlesworthB, CharlesworthD (2000) The degeneration of Y chromosomes. Philosophical Transactions of the Royal Society Biological Sciences 355 : 1563–1572.

24. CharlesworthB (2012) The role of background selection in shaping patterns of molecular evolution and variation: evidence from variability on the Drosophila X chromosome. Genetics 191 : 233–246.

25. McVeanGA, CharlesworthB (2000) The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation. Genetics 155 : 929–944.

26. KaiserVB, CharlesworthB (2009) The effects of deleterious mutations on evolution in non-recombining genomes. Trends in Genetics 25 : 9–12.

27. SkaletskyH, Kuroda-KawaguchiT, MinxPJ, CordumHS, HillierL, et al. (2003) The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423 : 825–837.

28. HughesJF, SkaletskyH, PyntikovaT, GravesTA, van DaalenSK, et al. (2010) Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature 463 : 536–539.

29. MaraisGA, CamposPR, GordoI (2010) Can intra-Y gene conversion oppose the degeneration of the human Y chromosome? A simulation study. Genome Biology and Evolution 2 : 347–357.

30. MakovaKD, LiWH (2002) Strong male-driven evolution of DNA sequences in humans and apes. Nature 416 : 624–626.

31. JordeLB, WatkinsWS, BamshadMJ, DixonME, RickerCE, et al. (2000) The distribution of human genetic diversity: a comparison of mitochondrial, autosomal, and Y-chromosome data. Am J Hum Genet 66 : 979–988.

32. WilderJA, KinganSB, MobasherZ, PilkingtonMM, HammerMF (2004) Global patterns of human mitochondrial DNA and Y-chromosome structure are not influenced by higher migration rates of females versus males. Nat Genet 36 : 1122–1125.

33. HammerMF, KarafetTM, ReddAJ, JarjanaziH, Santachiara-BenerecettiS, et al. (2001) Hierarchical patterns of global human Y-chromosome diversity. Mol Biol Evol 18 : 1189–1203.

34. WilderJA, MobasherZ, HammerMF (2004) Genetic evidence for unequal effective population sizes of human females and males. Mol Biol Evol 21 : 2047–2057.

35. PoznikGD, HennBM, YeeMC, SliwerskaE, EuskirchenGM, et al. (2013) Sequencing Y chromosomes resolves discrepancy in time to common ancestor of males versus females. Science 341 : 562–565.

36. MendezFL, KrahnT, SchrackB, KrahnAM, VeeramahKR, et al. (2013) An african american paternal lineage adds an extremely ancient root to the human y chromosome phylogenetic tree. Am J Hum Genet 92 : 454–459.

37. BoykoAR, WilliamsonSH, IndapAR, DegenhardtJD, HernandezRD, et al. (2008) Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genetics 4: e1000083.

38. AkashiH, OsadaN, OhtaT (2012) Weak selection and protein evolution. Genetics 192 : 15–31.

39. CharlesworthD, CharlesworthB, MorganMT (1995) The pattern of neutral molecular variation under the background selection model. Genetics 141 : 1619–1632.

40. HudsonRR, KaplanNL (1995) The coalescent process and background selection. Philosophical transactions of the Royal Society of LondonSeries B, Biological sciences 349 : 19–23.

41. NordborgM, CharlesworthB, CharlesworthD (1996) The effect of recombination on background selection. Genetical Research 67 : 159–174.

42. StewartJB, FreyerC, ElsonJL, LarssonNG (2008) Purifying selection of mtDNA and its implications for understanding evolution and mitochondrial disease. Nat Rev Genet 9 : 657–662.

43. FujitaPA, RheadB, ZweigAS, HinrichsAS, KarolchikD, et al. (2011) The UCSC Genome Browser database: update 2011. Nucleic Acids Research 39: D876–882.

44. HernandezRD, KelleyJL, ElyashivE, MeltonSC, AutonA, et al. (2011) Classic selective sweeps were rare in recent human evolution. Science 331 : 920–924.

45. LohmuellerKE, AlbrechtsenA, LiY, KimSY, KorneliussenT, et al. (2011) Natural selection affects multiple aspects of genetic variation at putatively neutral sites across the human genome. PLoS Genet 7: e1002326.

46. ReedFA, AkeyJM, AquadroCF (2005) Fitting background-selection predictions to levels of nucleotide variation and divergence along the human autosomes. Genome Res 15 : 1211–1221.

47. McVickerG, GordonD, DavisC, GreenP (2009) Widespread genomic signatures of natural selection in hominid evolution. PLoS Genet 5: e1000471.

48. KaplanNL, HudsonRR, LangleyCH (1989) The “hitchhiking effect” revisited. Genetics 123 : 887–899.

49. Maynard SmithJ, HaighJ (1974) The hitch-hiking effect of a favourable gene. Genet Res 23 : 23–35.

50. FelsensteinJ (1995) PHYLIP -⁠ Phylogeny Inference Package (version 3.2). Cladistics 5 : 164–166.

51. YangZ (2007) PAML 4: Phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution 24 : 1586–1591.

52. ChiaroniJ, UnderhillPA, Cavalli-SforzaLL (2009) Y chromosome diversity, human expansion, drift, and cultural evolution. Proc Natl Acad Sci U S A 106 : 20174–20179.

53. WilsonMA, MakovaKD (2009) Evolution and survival on eutherian sex chromosomes. PLoS Genet 5: e1000568.

54. GotoH, PengL, MakovaKD (2009) Evolution of X-degenerate Y chromosome genes in greater apes: conservation of gene content in human and gorilla, but not chimpanzee. J Mol Evol 68 : 134–144.

55. HughesJF, SkaletskyH, BrownLG, PyntikovaT, GravesT, et al. (2012) Strict evolutionary conservation followed rapid gene loss on human and rhesus Y chromosomes. Nature 483 : 82–86.

56. HughesJF, SkaletskyH, PyntikovaT, MinxPJ, GravesT, et al. (2005) Conservation of Y-linked genes during human evolution revealed by comparative sequencing in chimpanzee. Nature 437 : 101–104.

57. DrmanacR, SparksAB, CallowMJ, HalpernAL, BurnsNL, et al. (2009) Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327 : 78–81.

58. PoolJE, HellmannI, JensenJD, NielsenR (2010) Population genetic inference from genomic sequence variation. Genome Res 20 : 291–300.

59. BlankenbergD, Von KusterG, CoraorN, AnandaG, LazarusR, et al. (2011) Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol Chapter 19 : 1–21.

60. StonekingM (2000) Hypervariable sites in the mtDNA control region are mutational hotspots. Am J Hum Genet 67 : 1029–1032.

61. WeiW, AyubQ, ChenY, McCarthyS, HouY, et al. (2013) A calibrated human Y-chromosomal phylogeny based on resequencing. Genome Res 23 : 388–395.

62. IngmanM, KaessmannH, PaaboS, GyllenstenU (2000) Mitochondrial genome variation and the origin of modern humans. Nature 408 : 708–713.

63. HudsonRR (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18 : 337–338.

64. HernandezRD (2008) A flexible forward simulator for populations subject to selection and demography. Bioinformatics 24 : 2786–2787.

65. LohmuellerKE, BustamanteCD, ClarkAG (2009) Methods for human demographic inference using haplotype patterns from genomewide single-nucleotide polymorphism data. Genetics 182 : 217–231.

66. LohmuellerKE, BustamanteCD, ClarkAG (2010) The effect of recent admixture on inference of ancient human population history. Genetics 185 : 611–622.

67. Hartl DL, Clark AG (2006) Principles of Population Genetics. Sunderland, MA: Sinauer Associates.

68. Eyre-WalkerA, KeightleyPD (2009) Estimating the rate of adaptive molecular evolution in the presence of slightly deleterious mutations and population size change. Molecular Biology and Evolution 26 : 2097–2108.

69. Eyre-WalkerA, WoolfitM, PhelpsT (2006) The distribution of fitness effects of new deleterious amino acid mutations in humans. Genetics 173 : 891–900.

70. WilsonMA, MakovaKD (2009) Evolution and survival on eutherian sex chromosomes. PLoS Genetics 5: e1000568.

71. WeissG, von HaeselerA (1998) Inference of population history using a likelihood approach. Genetics 149 : 1539–1546.

72. WallJD (2000) A comparison of estimators of the population recombination rate. Molecular Biology and Evolution 17 : 156–163.

73. Wakeley J (2009) Coalescent Theory. Greenwood Village, CO: Roberts & Company.

74. WilliamsonS, OriveME (2002) The genealogy of a sequence subject to purifying selection at multiple sites. Mol Biol Evol 19 : 1376–1384.

75. PembertonTJ, WangC, LiJZ, RosenbergNA (2010) Inference of unexpected genetic relatedness among individuals in HapMap Phase III. American Journal of Human Genetics 87 : 457–464.