Correlated Evolution of Nearby Residues in Drosophilid Proteins

Download PDF České info

Here we investigate the correlations between coding sequence substitutions as a function of their separation along the protein sequence. We consider both substitutions between the reference genomes of several Drosophilids as well as polymorphisms in a population sample of Zimbabwean Drosophila melanogaster. We find that amino acid substitutions are “clustered” along the protein sequence, that is, the frequency of additional substitutions is strongly enhanced within ≈10 residues of a first such substitution. No such clustering is observed for synonymous substitutions, supporting a “correlation length” associated with selection on proteins as the causative mechanism. Clustering is stronger between substitutions that arose in the same lineage than it is between substitutions that arose in different lineages. We consider several possible origins of clustering, concluding that epistasis (interactions between amino acids within a protein that affect function) and positional heterogeneity in the strength of purifying selection are primarily responsible. The role of epistasis is directly supported by the tendency of nearby substitutions that arose on the same lineage to preserve the total charge of the residues within the correlation length and by the preferential cosegregation of neighboring derived alleles in our population sample. We interpret the observed length scale of clustering as a statistical reflection of the functional locality (or modularity) of proteins: amino acids that are near each other on the protein backbone are more likely to contribute to, and collaborate toward, a common subfunction.

Published in the journal: . PLoS Genet 7(2): e32767. doi:10.1371/journal.pgen.1001315
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1001315

Summary

Introduction

There has been an ongoing debate over the past few decades about the processes underlying protein evolution [1]–[5]. The neutral theory [1] posits that protein evolution is chiefly governed by the fraction of newly arising mutations that are not detrimental enough to be removed by natural selection. However, recent population genetic analyses of closely related Drosophila species suggest that protein divergence between species is substantially in excess of the neutral model's predictions [6], [7]. Intriguingly, this protein divergence excess is consistent with an important role for positive selection in protein evolution [5], [8], [9], although the contribution of weakly deleterious mutations to this pattern is still debated [10], [11].

The dramatic shift in our view of the processes driving protein evolution in Drosophila highlights the deficiency in our understanding of the mechanisms responsible for the observed protein divergence excess. One reason for this deficiency is the explicitly sequence-based nature of the population genetic analyses used to describe the excess divergence. These methods were developed for the analysis of linear sequences of independently evolving amino acids, and quite generally ignore the fact that most proteins fold into complex three-dimensional structures, held together by interactions between amino acids and between amino acids and the surrounding medium. Protein function depends critically on this folded structure, e.g. the arrangement of specific amino acids at the active site of an enzyme [12]. This is reflected in protein evolution; both the structure and the function of homologous proteins are remarkably conserved over long times, even while primary sequences substantially diverge [13]. The maintenance of protein structure is possible because evolution preserves structurally important interactions, such as favorable biochemical interactions between amino acids in physical contact [14]. This preservation of structurally important interactions affects sequence-based analyses; the preferred state and variability of an amino acid will depend on amino acids elsewhere in the protein [15].

This study is motivated by the desire to more closely integrate protein structure and function into sequence-based inferences of selection. Correlations between substitution patterns and protein structure have yielded insights over many years, from the slower divergence of protein active sites [1], [16] to recent results indicating a correlation between estimates of positive selection and secondary structure [17]. Work demonstrating the evolutionary consequences of interactions inferred from RNA structure [18]–[20] supported the application of sequence-based inference of functional interactions to proteins, where functional interactions are difficult to identify even when structure is known [21]. Under the assumption that functionally interacting residues coevolve, interactions can be identified if enough evolutionary trajectories can be sampled. In practice this has meant multi-alignments across many species of large protein families [22]–[25], but alignments within populations of the highly mutable HIV have also been used [26], [27]. These methods have been successfully used to identify pair-wise interactions between residues that contribute to protein function. As an example, the inclusion of interactions inferred from a multi-alignment was shown sufficient to produce a stable fold [28].

Here we develop a complementary approach intended to probe the level of influence interactions have on protein evolution. Instead of focusing on a single protein and specific pairs of interacting residues, we shall aggregate evolutionary information across proteins and use the increased statistical power to look for generic patterns. Specifically, we investigate the correlations in the substitution processes at residues a given distance from each other along the protein backbone, averaged over many proteins of D. melanogaster. Our rationale is as follows: residues that are near in the primary protein sequence are also likely to be near in the folded protein (Figure 1A) and therefore more likely to interact physically and/or belong to the same protein domain. Consequently, if correlated evolution in proteins is common, it should be detectable by an increase in evolutionary correlation between residues nearby in sequence, for which physical interaction in the folded protein is more likely. While we will be unable to identify particular interactions, our approach will be informative about the overall level of influence interactions have on the evolution of proteins.

**Fig. 1. Structural distance as a function of sequence distance and Drosophilid phylogeny.**

We find that amino acid substitutions cluster together on the protein sequence, i.e. amino acid substitutions are more frequent nearby other such substitutions. The strength of this effect decays exponentially with the separation between the residues along the protein sequence, with a characteristic length scale of about codons. We observe this clustering phenomenon in substitutions between D. melanogaster and several sister species (Figure 1B) as well as in polymorphisms within a Zimbabwean population sample of D. melanogaster. Clustering is absent when considering synonymous substitutions, implicating selection as the root cause. Furthermore, clustering is stronger between substitutions that arose along the same branch of the evolutionary tree than between substitutions that arose in different branches, and nearby derived alleles tend to cosegregate in our population sample. Additionally, pairs of substitutions within codons of each other that arose in the same lineage have a significant tendency to cause compensatory changes to the total charge of the protein. These lines of evidence lead us to conclude that epistasis between amino acid substitutions contributes significantly to clustering, and the substitution process as a whole.

Results

The 12 Drosophilid genomes resource [29] serves as the primary data source in this study. We used this resource to identify protein coding sequence substitutions between D. melanogaster (Dmel) and several sister Drosophilids: D. sechellia (Dsec), D. simulans (Dsim), D. yakuba (Dyak), D. erecta (Dere), D. ananassae (Dana) and D. pseudoobscura (Dpse) available at http://rana.lbl.gov/drosophila/ (Figure 1B). Substitutions were ascertained from nucleotide alignments of the reference genomes produced by the blastz algorithm [30], and available from UCSC [31] at ftp://hgdownload.cse.ucsc.edu/goldenPath/dm3/.

Our goal here is to understand how correlation between the substitution processes at different residues is affected by the distance between those residues along the protein sequence. To this end we introduce the conditional probability function (cPDF), which we denote and define as the probability of there being a substitution of type at sequence position conditioned on the existence of a substitution of type at sequence position . To assess, for example, whether the probability of a synonymous divergence (DS) is affected by the presence of a non-synonymous divergence (DN) some distance away, we can estimate and compare it to the overall level of synonymous divergence.

cPDFs are estimated from sets of aligned coding sequences by averaging over all instances of the focal substitution in the aligned sequences (Methods). Since we are particularly interested in the functional dependence of cPDFs on the distance from the focal substitution we will normalize cPDFs by their asymptotic value (Methods). Note that we will always be measuring distance in terms of codons rather than nucleotides, as this is the natural unit of distance in a gene. Figure 2A shows three of these normalized cPDFs, , , and , estimated from the species comparison of D. melanogaster and D. yakuba.

**Fig. 2. The frequency of additional substitutions near a focal substitution.**

Amino acid substitutions cluster along protein sequences

Amino acid substitutions are not distributed uniformly along the protein sequence. The cPDF for non-synonymous substitutions, , is significantly peaked around in every species comparison we consider. This peak describes the tendency of non-synonymous substitutions to ‘clump together’ on the protein sequence, a phenomenon we call clustering. The shape of the clustering peak is well-fit by a decaying exponential with a characteristic length scale of about 10 codons. In sharp contrast, the cPDFs involving synonymous substitutions, and , have no clustering peak, indicating that synonymous substitutions are distributed uniformly along the protein sequence. The difference between non-synonymous and synonymous clustering is highly significant, the sampling p-value is essentially zero (, chi-square test).

The magnitude of clustering is large. The nearest neighbor of a codon with a non-synonymous substitution is roughly twice as likely to also have such a substitution than would otherwise be expected. The impact of clustering extends well beyond the nearest neighbor, and is appreciable out to a distance of at least codons from a focal non-synonymous substitution. We quantify the total magnitude of clustering by defining the ‘clustering count’ as the difference between the expected number of substitutions of type in the codons downstream of a focal substitution of type and the expected number in a codon sequence segment distant from the focal substitution (Methods). More plainly, is the number of extra substitutions you find in the vicinity of an substitution because substitutions cluster instead of being distributed uniformly along the sequence. Graphically, is the area under the clustering peak (and above the asymptotic value) of the normalized cPDF , multiplied by the overall density of substitutions.

We are particularly interested in , which we will simply denote . The shape of is very consistent between the different species comparisons tested, but the clustering count is not because it depends not only on , but also on the density of substitutions between the species being compared. ranges from in the D. melanogaster versus D. sechellia alignment to in the D. melanogaster versus D. ananassae alignment, as seen in Figure 3A. increases linearly with (the fraction of substituted amino acids), this is consistent with a clustering pattern that remains constant as divergence increases with time.

**Fig. 3. The dependence of clustering count on sequence divergence.**

Clustering between nearby non-synonymous substitutions is strongly supported by the data, but it is not a priori clear whether it is the separation of amino acids along the protein backbone, or the distance in base pairs along the genome, that matters. To discriminate between these possibilities we repeated the correlation analysis including only those pairs of residues which spanned an intron. As a result the genomic separation between codons had a median increase of bp ( codons) and a minimum increase of bp ( codons), while separation between the encoded amino acids along the protein backbone was unchanged. As shown in Figure 2B, the cPDFs estimated from these intron-spanning pairs of codons correspond closely with those estimated within exons, when separation along the protein backbone (exonic distance) is used in the estimation. We conclude that the clustering length scale is set by the distance along the protein backbone, not along the genome.

Remarkably, the clustering between amino acid substitutions is not limited to substitutions between species. It is also apparent among polymorphisms within a population sample of D. melanogaster (Methods). Figure 2C shows the estimated cPDFs between synonymous and non-synonymous polymorphisms ( and ). The cPDFs estimated from polymorphisms are much noisier because our population sample sequencing spans only kb of coding sequence, as compared to Mb for the divergence data. Nevertheless, we find clustering between polymorphisms analogous to that between substitutions: non-synonymous polymorphisms cluster significantly (, chi-square test), while synonymous polymorphisms do not.

Factors influencing clustering

We tested for potential relationships between clustering and a number of genetic properties by estimating on subsets of the full set of coding sequences stratified by the property in question. Clustering is robust in the sense that it is not substantially affected by many of the properties we tested, such as chromosome (including autosome versus X), recombination rate and the level of gapping in the alignment (Figures S2, S3, S4). We did find a systematic relationship between the GC content of coding sequence and clustering; higher GC content correlates with stronger clustering (Figure S5).

A notable factor that influences clustering is the level of constraint under which a gene evolves, which we estimate by the fraction of substituted amino acids . Amino acid substitution are more clustered in constrained genes than they are in unconstrained genes, i.e. has a larger clustering peak when it is estimated from highly constrained (low ) coding sequences, see Figure 3B. In the inset of Figure 3B we have plotted the estimated from each subset of coding sequences against the average of that subset. It is useful to compare this plot to the one in Figure 3A, which also is a plot of versus . The difference between these plots is that in panel A effectively measures divergence time and scales linearly with , while in the inset of panel B tracks the level of constraint and is strongly sublinear in . In fact, once constraint relaxes past a certain point, becomes roughly constant. This relationship suggests that substitutions in constrained genes occur in tight clusters, and that as constraint lessens the additional substitutions which accrue do so uniformly along the sequence.

Potential non-selective causes of clustering

Clustered sequencing errors or mutation events

The sequencing and mutation processes both have the potential to produce a clustering signal. The frequency of sequencing error might autocorrelate along the sequence, for instance as a result of heterogeneity in read depth. If these clustered errors are interpreted as substitutions the result would be an artefactual clustering signal. Clustering in the mutational process, perhaps as a result of single mutational events altering several nearby codons, would be expected to introduce clustering into the substitution process. In fact, spatially correlated mutation events have been reported on length scales comparable to the clustering length scale we observe [32], [33].

There are two observations which contradict both sequencing error and mutation as the primary cause of clustering. First, the strong concordance between the clustering observed within exons and across introns in Figure 2B is incompatible with these mechanisms. Both of these processes would produce clustering which depended on genomic separation, not separation along the protein backbone. Second, both sequencing and mutation are insensitive to the codon structure in coding sequence. As a result, any clustering that is generated by these processes should affect synonymous and non-synonymous substitutions alike. This is inconsistent with our observations, we find clustering between non-synonymous substitutions to be substantial and clustering between synonymous substitutions (or between non-synonymous and synonymous substitutions) to be absent.

On this second point, it is important to be careful when making these comparisons, as the higher frequency of synonymous mutations has the potential to ‘drown out’ an equivalent level of clustering when considering cPDFs normalized to the background level of divergence. However that is is not the case here, non-synonymous clustering is an order of magnitude larger in absolute terms than any synonymous clustering that might exist (Figure S6). Furthermore, the clustering signal we observe is not driven by a small number of anomalous genes. We tested this by bootstrapping, i.e. repeating our analysis using data sets obtained by resampling with replacement from the full set of aligned coding sequences (Methods). The significance of the difference between non-synonymous and synonymous clustering is strongly supported by the bootstrap analysis (Figures S7, S8). We can assign a -value to this difference by sampling bootstrap distribution of the summary statistics and (Figure S8). We find that the boostrapped -values for observing greater than half (roughly the hypothesis that excess clustering is due to a codon-blind mechanism) is less than (Methods).

Local misalignment

Inadequacies in the alignment process also have the potential to introduce a spurious clustering signal. For instance, if sequence segments are incorrectly frameshifted the result would be artefactual stretches of predominantly non-synonymous substitutions. These stretches could lead to a non-synonymous only clustering signal consistent with our observations. This is of particular concern because the alignments we used are nucleotide alignments and hence did not account for the codon structure in the open reading frame.

Several lines of evidence argue against a substantial contribution from local misalignment to the clustering signal. (i) Non-synonymous clustering is just as strong when the two substitutions are separated by an intron. This is inconsistent with misalignment because local misalignment will affect stretches of sequence contiguous on the genome. Remember that the alignments used here are genome alignments, introns were not spliced out prior to alignment. (ii) The same clustering signal is observed when the analysis is performed on subsets of the coding sequences with and without alignment gaps (Figure S4). While not perfect, alignment gapping is a common proxy for alignment quality, and the particular concern of frameshifts is eliminated when considering gapless alignments. (iii) We repeated our analysis on an alternative set of alignments of Drosophilid coding sequences made publicly available at ftp://ftp.flybase.net/genomes/12_species_analysis/clark_eisen/alignments/ (specifically the masked, melanogaster group alignments using a guide tree) [29]. While the same reference genomes are used, the alignment method and ortholog selection is different, yet consistent clustering is observed in both cases (Figure S9).

Clustering of amino acid substitutions is due to selection

Non-selective mechanisms cannot account for both significant non-synonymous clustering and the absence of synonymous clustering. Having ruled out non-selective mechanisms, we now consider potential selective mechanisms that could cause amino acid substitutions to cluster. Perhaps the simplest explanation for clustering is that proteins have short segments, such as unstructured loops, that are under reduced purifying selection. These weakly constrained segments experience locally increased rates of amino acid substitution, which we then observe as clustering in both divergence and polymorphism data. There are also several ways in which positive selection could cause clustering. Clustering could be the result of localized ‘adaptive bursts’, i.e. functional modules in which multiple independently adaptive substitutions became available (perhaps due to a changed environment). Because amino acids close on the protein backbone are more likely to be in the same module, the resulting burst of adaptive substitutions would be clustered on the sequence. Amino acids that are close along the chain are also more likely to physically interact, even after protein folding. As a consequence, the fitness effect, and hence evolutionary fate, of nearby substitutions could be contingent on one another (i.e. epistasis). In particular we might imagine common compensatory interactions between nearby substitutions, although all synergistic interactions would contribute to clustering. Finally, another potential mechanism is hitchhiking. In this scenario mildly deleterious amino acid polymorphisms are driven to fixation by the selective sweep of a linked allele, resulting in clustered substitutions. We will now attempt to disentangle the relative contributions of these different selective scenarios.

Clustered substitutions tend to occur in the same lineage

We can polarize substitutions by the lineage on which they arose using an outgroup and then repeat our correlation analysis for pairs of substitutions which arose in the same lineage and for pairs which arose in different lineages (Methods). This allows us to begin to distinguish between potential selective mechanisms of clustering. If spatial heterogeneity in the strength of purifying selection is responsible for clustering we expect equal clustering within and between lineages, since in this case the presence of a substitution simply informs as to the level of constraint in that region of the protein sequence. In contrast, the alternative selective mechanisms (adaptive bursts, compensatory or synergistic mutations, and hitchhiking) are lineage-specific, they only apply when substitutions occur in the same lineage and therefore can only cause clustering between same-lineage substitutions.

We incorporate polarization into our analysis by extending the sequence features in our cPDFs with the specification of the species lineage on which a substitution arose, e.g. is a non-synonymous substitution in the Dmel, Dsec, Dsim, Dyak, Dere, Dana, Dpse lineage (Methods). The non-synonymous cPDFs estimated for substitutions in the same and different lineage than the focal substitution are shown in Figure 4A for each species comparison. Clustering between substitutions is always significant whether substitutions arose in the same lineage or in different lineages, but clustering between same-lineage substitutions is always significantly stronger (Table S1). We argued above that spatially heterogeneous purifying selection would cause equal clustering within and between lineages. If this is so, the excess clustering within lineages must be generated by one of the lineage-specific alternatives.

**Fig. 4. Lineage-specific clustering of amino acid substitutions.**

Excess lineage-specific clustering can be quantified with an extension of the clustering count . First we define the lineage-specific clustering count as an analog of with the difference that the cPDF from which derives is estimated using only substitutions in lineage . Therefore, is the increased number of -lineage DNs near a focal -lineage DN due to clustering. Next, the ‘lineage-specific excess clustering count’ is the portion of which is inconsistent with a lineage non-specific mechanism. We quantify this as the difference between the within -⁠ and between-lineage clustering over the first codons (Methods). This corresponds graphically to the area between those cPDFs (the red area in Figure 4A, ), multiplied by the density of substitutions in the lineage .

The lineage-specific excess appears to be a roughly constant fraction of the total lineage-specific clustering . The estimate of is plotted against the estimate of for both lineages of all our species comparisons in Figure 4B. This relationship is well-fit by a linear model, suggesting that approximately of clustering within a lineage is due to lineage-specific mechanisms, i.e. some combination of compensatory or synergistic mutations, adaptive bursts and hitchhiking. The D. simulans lineage is an outlier, Dsim is aberrantly high. This may be a consequence of details relating to this particular reference sequence: the D. simulans reference sequence has lower coverage and quality than the other reference sequences as well as being a ‘mosaic’ assembly constructed from multiple individuals [29]. The Dsim lineage is also picked out by the synonymous control, there is significant synonymous clustering in this lineage above that found in any other lineage we consider (Figure S10).

Nearby charge-altering substitutions tend to compensate each other

If compensatory mutations are contributing substantially to lineage-specific excess one might find evidence of this in a physical or biochemical quantity associated with the compensation. For example, changes in volume, hydrophobicity, charge, etc. might anti-correlate if the substitutions are compensatory. We tested several amino acid properties for such a relationship but found only one that exhibited the hypothesized behavior: nearby substitutions have a significantly increased probability to cause compensatory changes in charge, but only when they arise in the same lineage! We quantify this effect by estimating the fraction of substitutions which compensate the effect of a focal charge-altering substitution, as a function of distance from the focal substitution . In Figure 5 we see that the fraction of charge-compensating substitutions is significantly elevated near a focal charge-altering substitution, on roughly the clustering length scale of 10 codons. This compensation serves to partially conserve the total charge of the protein sequence within the clustering length scale.

**Fig. 5. Amino acid substitutions tend to conserve local charge.**

Local charge compensation is significant in every species comparison we considered, all p-values , chi-square test (Table S2). A measure of the magnitude of this effect is the fraction of charge-altering substitutions that that have their charge alteration compensated for by the net change in charge caused by the other substitutions within codons. This varies by lineage, but is always significant and increases with species divergence up to for the species comparison of D. melanogaster and D. pseudoobscura. Charge compensation is a lineage-specific effect, and it is responsible for a significant fraction of the lineage-specific excess we observe, roughly depending on lineage (Table S2). The observation of substantial charge compensation, and the lack of compensation of other amino acid properties, is consistent with previous observations which suggested charge compensation to be of greater significance in protein evolution than compensation of other amino acid characteristics [22], [34]. Interestingly, while substitutions in different lineages do not exhibit the local compensation phenomenon, they do show a weaker, but statistically significant, increase in the fraction of nearby changes which alter charge in the same direction, perhaps indicating convergent evolution (Table S2).

Nearby amino acid mutations cosegregate in a population

Non-synonymous polymorphisms cluster as well, and polymorphism data provides another avenue to distinguish between the possible selective mechanisms of clustering. Under a model of bursts of independent adaptive mutations, beneficial amino acid mutations can be incorporated sequentially, and would not be expected to segregate together in the population since beneficial mutations rapidly fix after arising. In contrast, if epistatic selection is driving the observed clustering we expect that a compensatory mutation will only be found on a chromosome that already carries the first mutation, i.e. we expect the derived states of nearby polymorphic sites to cosegregate. We can quantify this expectation by estimating the average polarized linkage disequilibrium [35], [36], i.e. the frequency of the doubly derived haplotype minus the product of the frequencies of the individual derived alleles averaged over all pairs of polymorphisms a distance apart. then indicates that derived alleles occur in coupling more often than would be expected if their fitnesses were independent. Consistent with the compensatory scenario, we find when estimated from amino acid polymorphisms within codons of each other, as seen in Figure 6.

**Fig. 6. Polarized linkage disequilibrium.**

We evaluate the significance of the cosegregation of nearby derived alleles by bootstrapping: we resample polymorphic sites from the full set of polymorphic sites in our population, pair them off into a number of pairs equal to the number of pairs of polymorphisms within codons of each other, and then estimate from this resampled ensemble. Repeating this process times yields a bootstrapped probability distribution which we compare to the estimated from the data, yielding a bootstrapping p-value of of observing an equal or greater by chance from our population sample. Again, only pairs of non-synonymous polymorphisms significantly cosegregate, supporting the contention that epistasis is responsible and arguing against purely genomic explanations. Although cosegregation is statistically significant, because our polymorphism data set is limited (compared to whole-genome comparisons of divergence) there is more uncertainty about these results, and it is worth noting that cosegregation does not seem to extend beyond three codons of separation.

Discussion

We have shown that the presence of an amino acid substitution substantially increases the probability of there being additional amino acid substitutions nearby in the protein sequence, with the strength of this effect decaying exponentially along the sequence with a characteristic length scale of ≈10 codons. This ‘clustering’ phenomenon is not observed for synonymous substitutions and is insensitive to the presence of intervening intronic sequence, strongly suggesting selection on proteins as the root cause. Both divergence between Drosophilids and polymorphisms within a population sample of D. melanogaster exhibit this effect. Clustering has a substantial lineage-specific component and nearby substitutions in the same lineage tend to conserve local charge, suggesting compensatory evolution plays a role.

While the results presented here are derived from Drosophila data, we expect that clustering obtains more generally. A recent study found that mutations identified as compensatory clustered near their associated deleterious mutations in eukaryotes, prokaryotes and viruses [37]. Similarly, nucleotide substitutions cluster within codons more often than expected in mammals and HIV, suggesting that two successive mutations are required for the incorporation of some fraction of amino acid substitutions [38], [39].

Origin of clustering

There are a number of selective mechanisms that could cause amino acid substitutions to cluster, and the clustering we observe most likely has multiple causes. We will now try to reconcile the various observations made above with the different mechanisms that have the potential to cause clustering, and estimate their respective contributions. Potential selective mechanisms of clustering can be grouped into two classes: (A) Heterogeneity in the strength of purifying selection acting within an ORF leads to variation in the density of substitutions and polymorphisms, resulting in clustering. (B) Novel protein variants are selected for and this adaptation leads to clusters of substitutions. The latter class of mechanisms comes in several flavors: (i) A localized adaptive burst in which several nearby substitutions independently sweep to fixation. This might be a consequence of changes in selective pressure on a protein domain that requires multiple adaptive substitutions to reach the new optimum [40]. (ii) A complex adaptation, in which several dependent substitutions are required to achieve the selected effect. This case includes scenarios of compensatory mutations, i.e. a second mutation is necessary to compensate deleterious side effects of the first [41], and evolutionary contingency, i.e. the first mutation is necessary for the second mutation to be beneficial [42]. (iii) Hitchhiking, the fixation of otherwise deleterious substitutions as a result of a selective sweep at a linked site [43], [44].

Purifying versus positive selection

Purifying selection prunes mutations that are detrimental, perhaps because they interfere with protein structure or stability. Given that protein structure is strongly conserved across different domains of life, it is reasonable to assume that purifying selection operates in a similar fashion on homologous regions of proteins in different branches of the Drosophila phylogeny. Adaptive evolution, however, depends on the ecological niche of the species and can depend strongly on previous substitutions in that species. Adaptive evolution is therefore expected to be lineage-specific, at least moreso than purifying selection.

We observed that clustering exists between pairs of amino acid substitutions in different lineages as well as in the same lineage, the latter being consistently greater (Figure 4). Clustering across lineages implies that a substitution found in one lineage is predictive of the local substitution rate independent of lineage, which we understand as a lineage-non-specific local increase in the substitution rate. This is most consistent with a class (A) mechanism such as locally relaxed purifying selection, e.g. in an unstructured loop of a protein.

The excess clustering within lineages must be caused by a lineage-specific mechanism such as the class (B) mechanisms described above. Purifying selection can of course also vary in a lineage-specific way. If mildly-deleterious substitutions were highly clustered, and a reduced effective population size rendered them effectively neutral, this could result in excess clustering in the lower population size lineage. However, this scenario is inconsistent with the fact that we observe excess clustering within all lineages, and that it is quantitatively similar between lineage pairs diverging from a common ancestor. Locus-specific variation in purifying selection is also possible, but in most cases will affect an entire gene (e.g. via duplication or transformation into a pseudo-gene) and therefore would not lead to clustering on short length scales. Given that excess lineage-specific clustering is a substantial fraction of the total clustering in every lineage, it does not seem likely that lineage-specific variation in the strength of purifying selection can account for it.

Adaptive mechanisms for clustered substitutions

We start by addressing the potential contribution of hitch-hiking to clustering. A selective sweep of a strongly beneficial substitution fixes a linked haplotype, converting a local snapshot of polymorphisms present in the population into substitutions. This hitch-hiking process does not affect the fixation probability of neutral (and perhaps synonymous) mutations [45], but is expected to increase the fixation probability of nearby deleterious non-synonymous substitutions. However, several observations argue against hitchhiking as the main contributor to clustering. First, hitch-hiking predicts that the length scale of clustering is given by the typical size of hitchhiked region [46]. This implies clustering dependent on separation along the DNA sequence rather than along the protein backbone, contrary to our observations (Figure 2B). Second, there is no correlation between clustering and the average recombination rate of a coding sequence, which would affect the size of hitchhiked regions (Figure S3). Finally, we can calculate a rough upper bound for the contribution of hitchhiking to lineage-specific clustering. Given a per-site heterozygosity , the expected population frequency of derived mutations per site is . Non-synonymous in D. melanogaster is per site [47]–[49] and thus per 4-fold codon (and slightly higher for 2-folds). Given this, the probability of finding a derived amino acid substitution within codons of a focal site is . This serves as a very generous upper bound on the contribution of hitch-hiking to , since only if the focal site is always adaptive and the observed variation always deleterious will this value be approached. This estimate suggests that the contribution of hitchhiking to lineage-specific clustering is minor, since this upper bound is less than the range over which we observe lineage-specific excess, from to depending on lineage (Figure 4B).

The two remaining adaptive scenarios, adaptive bursts and complex adaptations, are difficult to distinguish in part because the boundary between them is not sharply delineated. Certainly, different substitutions within codons in the same protein are never going to be completely independent. The question rather is whether one of the mutations ‘substantially’ affected the probability of the other. Do localized adaptive bursts, loosely defined as substitutions within codons which all independently improve fitness, dominate our clustering signal? Or are the interactions (epistasis) between nearby substitutions mainly responsible? We cannot fully exclude either scenario, but there is evidence that local interactions play at least a significant role. Mutations of independent beneficial effect would not be expected to compensate each others effect on total charge. This requires epistasis between the substitutions, and implies that complex adaptations are responsible for at least of lineage-specific excess. Secondly, independent beneficial mutations are expected to either fix sequentially or, if they do occur simultaneously, to generally segregate in repulsion [50]. This is inconsistent with the preferential cosegregation we observe between nearby derived alleles (Figure 6). Furthermore, charge compensation is only one of many relevant interactions, albeit the one we most readily ascertained from the primary sequence data. So the contribution of charge compensation is only a lower limit for the influence of complex adaptation on the substitution process.

While the possibility of interactions between amino acid substitutions has never been seriously questioned (and has recently been demonstrated in a number of concrete examples[42], [51]), the general importance of epistasis and compensation in evolution has been, and continues to be, controversial. We find evidence that a non-negligible fraction of substitutions are involved in patterns of adaptation suggestive of epistasis. If lineage-specific clustering is mostly due to epistasis, a scenario consistent with our results, we can use the lineage-specific excess to estimate the number of substitutions which owe their fixation to interactions with other substitutions. For example, the lineage-specific excess in the D. yakuba lineage is Dyak. If we attribute the entirety of this to epistasis we would conclude that of the substitutions on this lineage were contingent on another substitution. This estimate is clearly generous in the sense that we have not completely excluded the contribution of other processes, but it is also conservative in the sense that it only includes the effect of elevated local epistasis and excludes the contribution of long-range interactions.

To account for interactions between amino acids distant in the protein sequence but nevertheless in close vicinity in the folded protein, one would need to incorporate protein structure explicitly. However, the probability for any random pair of residues to be involved in such interaction decays rapidly with their separation along the protein backbone, likely to an asymptotic value. Hence, in our analysis we expect correlations between distant pairs to be lost in the background, with only the enriched short range interactions observable as excess clustering of substitutions. The presence of this local enrichment is the enabling factor behind our approach. In agreement with this interpretation, the inferred length scale of clustering of 10 codons is consistent with the size of secondary structure elements in proteins (e.g. 3 turns of an helix). While this manuscript was prepared for publication, another group also found clustering of positively selected amino acid substitutions [17]. Via a different approach, the authors show that the rate of evolution depends on elements of secondary structure and that nearby positively selected sites tend to cluster.

Finally, while we have focused on the mode of evolution responsible for lineage-specific excess, the clear clustering which occurs across lineages is notable in its own right. We attribute this clustering to spatially heterogeneous purifying selection. The clustering length scale is extremely consistent across all the species comparisons we considered and the polymorphism data (Figure 2). This suggests that models of protein evolution might be improved by incorporating correlation between the rate of amino acid evolution along the sequence (e.g. site-specific in PAML [52]). This is particularly true if the length scale we observe here can be shown to be consistent across phyla, demonstrating it as a generic property of proteins themselves.

Methods

We assign substitutions in coding sequence (CDS) on a codon-by-codon basis to pairwise alignments of the reference genome of D. melanogaster with the reference genomes of 6 other Drosophilids: D. sechellia, D. simulans, D. yakuba, D. erecta, D. ananassae and D. pseudoobscura. We use FlyBase release 5.26 gene models to identify the location of coding sequence in the D. melanogaster genome. Coding sequence substitutions are assigned only in the absence of gaps and ambiguous nucleotide. If the aligned codons encode different amino acids a non-synonymous substitution () is assigned, if the same amino acid, a synonymous substitution (). Substitutions are assigned in the context of an alignment between two lineages, which we can make explicit by writing where is the pairwise alignment between lineages and . We omit the alignment for notational convenience, the alignment under consideration will be clear by context, but it is worth remembering that the objects we define later depend implicitly on an alignment when they involve substitutions.

Substitutions are polarized into the lineage in which they arose by comparison to the closest available Drosophilid that is more distant from D. melanogaster than the one being aligned. Specifically, D. yakuba is used as the outgroup for the D. simulans and D. sechellia comparisons, D. ananassae for D. yakuba and D. erecta, D. pseudoobscura for D. ananassae and D. willistoni for D. pseudoobscura. Substitutions are assigned to a lineage if the assignment is unambiguous using standard parsimony criteria. A polarized into lineage is denoted . Not all substitutions can be polarized, represents the fraction of substitutions between the lineages and which are polarized. The species comparisons between D. melanogaster and either D. yakuba or D. erecta have the best properties for the analysis here: most coding sequence alignments pass quality checks and the number of substitutions, both polarizable and total, is high, as seen in Figure S1 and observed previously [53]. When we present results from just one species comparison it will be the D. melanogaster -⁠ D. yakuba comparison for this reason.

Having assigned and polarized substitutions, we study clustering between substitutions typed by synonymity, e.g. non-synonymous (DN) and synonymous (DS) divergent sites, by estimating the probability of finding a substitution (of some particular type) codons away from a focal substitution. This is formalized as a conditional probability distribution (cPDF), which we denote , where is the focal substitution type, and is the substitution whose frequency is measured at distance . and can be simply DN or DS, or in the later analysis a substitution polarized to a particular lineage.

The cPDF is calculated from a set of CDSs on which the presence/absence of substitutions and have been ascertained site-by-site. is the proportion of sites a distance downstream (coding sense) of a substitution of type , summed over all instances of in the data set. We must account for the decrease in the number of observations made as increases due to the finite length of coding sequences. To be precise, the cPDF is calculated as follows: Let us label individual CDSs with and index the codons in a CDS by , which ranges from to the length of the CDS, . We define an indicator variable for each CDS and substitution type . if codon of contains an substitution, and is otherwise. The cPDF is defined as,(1)where the Kronecker symbol if and 0 otherwise. These cPDFs generically go to an ‘asymptotic value’ , which is calculated ad-hoc by averaging over . This property allows us to separate the functional dependence of a cPDF on distance from its absolute value by introducing the ‘normalized’ cPDF .

The cPDF naturally generalizes to include polarization information. Polarized cPDFs are defined as above, with , , and an additional summation over the lineages, . A same-lineage cPDF enforces the same-lineage condition with a Kronecker delta , different-lineage with .

We define the clustering count as the sum of the difference between the cPDF and its asymptotic value over the first codons, i.e. the area between and the asymptotic value of the cPDF over . This is the difference between the expected number of substitutions within codons downstream of a focal substitution and the expected number in a codon sequence segment that is distant from the focal substitution. The choice of as the upper limit of the sum simply reflects the observation that significant clustering does not extend past this point. Additionally, this is a one-sided sum, ensuring that each pair is counted only once. We also define the lineage-specific excess clustering count for lineage , , in order to quantify the stronger clustering within a lineage. Lineage-specific excess is found by summing over the difference between the normalized same-lineage cPDF and the normalized cross-lineage cPDF and then ‘unnormalizing’,(2)(3)

The factor of , defined above, corrects for the fraction of substitutions that cannot be unambiguously polarized. Multiplying by roughly accounts for this by assuming that the polarized substitutions are representative of the unpolarized ones. When or are written without indices they should be assumed to refer to clustering, i.e. . Note that and depend on the pairwise alignment being considered via their dependence on the assignment of substitutions, as described at the beginning of the Methods.

Our definition of cPDFs implicitly involved the determination of the set of CDSs to be summed over. This set varies with the species comparison so we denote it . For the results presented here was the set of all CDSs for which the pairwise alignment between D. melanogaster and sister Drosophilid met several standards of quality: a CDS included in was required to have less than gapping in its pairwise alignment and less than amino acid substitution, the alignment could not contain out-of-frame gaps (gaps with size that is not a multiple of three), and the spliced transcript to which the CDS belongs could contain no extraneous stop codons. Furthermore, we often restrict to a specific subset in order to investigate the dependence of the clustering signal on various quantities, e.g. Figure 3B shows the cPDFs calculated using subsets of all CDSs ranked by .

Polymorphisms were identified in a Zimbabwean population sample of male D. melanogaster. We re-sequenced kb of coding sequence from genes in the highly recombining region of the X-chromosome (cytological positions 3C3 to 18F4) using standard methods reported previously [54]. Samples sizes ranged from to with a mean of . A subset of these sequences ( alleles for each of loci), were previously reported [54]. All new sequences have been submitted to GenBank, accession numbers are available in Table S3. Polymorphisms are assigned if more than one codon exists in the population sample at that site. Singletons are excluded from the analysis. A non-synonymous polymorphism is assigned if this set of codons encodes more than one amino acid, and a synonymous polymorphism if the number of codons exceeds the number of amino acids encoded. PN and PS assignment is not exclusive. Polarization into mutant/ancestral alleles is inferred by comparison to D. simulans (or D. sechellia when D. simulans is unavailable) at that site (i.e. standard parsimony criteria). cPDFs are constructed analogously to those involving substitutions.

All line plots presented are smoothed from the underlying data. We used a moving window averaging for smoothing, always with window size . The contribution from each data point to the smoothed average was weighted by the number of ‘trials’ from which the value was estimated.

Assessment of significance

We assess two ‘types’ of significance here, sampling significance and bootstrapping significance. The assessment of sampling significance is understood by recalling how cPDFs are estimated. is the mean of a set of trials which can have outcome either or (Bernoulli random variables). is the same, it is just an average over a cPDF for . Trials consist of selecting a focal substitution of type , looking away on the sequence, and recording the presence () or absence () of a substitution of type . So, assessing the significance of values of or is equivalent to assessing the significance of sums of Bernoulli random variables, for which we used chi-square tests.

Bootstrapping significance is also a measure of sampling significance, with the difference being that the effect of resampling is evaluated at the level of the largest unit in our analysis, the coding sequence. The probability distribution of a value of interest is constructed by resampling with replacement from the full set of coding sequences a ‘bootstrapped’ set of equal size, estimating the value of interest on that bootstrapped set, and repeating. Bootstrapping p-values are then determined from this estimate of the probability distribution. If the estimated distribution can be approximated as a gaussian, as is always the case here, the gaussian approximation is used to assign the p-value. A modification of this bootstrapping scheme was used for polymorphism cosegregation, and described there.

Supporting Information

Zdroje

1. KimuraM

1983 The neutral theory of molecular evolution. Cambridge Cambridge University Press

2. GillespieJH

1991 The Causes of Molecular Evolution. Oxford Oxford University Press

3. HeyJ

1999 The neutralist, the y and the selectionist. Trends in Ecology & Evolution 14 35 38

4. NeiM

2005 Selectionism and neutralism in molecular evolution. Molecular Biology and Evolution 22 2318 2342

5. SellaG

PetrovDA

PrzeworskiM

AndolfattoP

2009 Pervasive natural selection in the drosophila genome? PLoS Genet 5 e1000495 doi:10.1371/journal.pgen.1000495

6. SmithNGC

Eyre-WalkerA

2002 Adaptive protein evolution in drosophila. Nature 415 1022 4

7. FayJC

WyckoffGJ

WuCI

2002 Testing the neutral theory of molecular evolution with genomic data from drosophila. Nature 415 1024 1026

8. McDonaldJH

KreitmanM

1991 Adaptive protein evolution at the adh locus in drosophila. Nature 351 652 4

9. Eyre-WalkerA

2006 The genomic rate of adaptive evolution. Trends in Ecology & Evolution 21 569 575

10. OhtaT

1992 The nearly neutral theory of molecular evolution. Annual Review of Ecology and Systematics 23 263 286

11. HughesAL

2007 Looking for darwin in all the wrong places: the misguided quest for positive selection at the nucleotide sequence level. Heredity 99 364 373

12. BrandenC

ToozeJ

1999 Introduction to protein structure. New York Garland Science

13. ChothiaC

LeskAM

1986 The relation between the divergence of sequence and structure in proteins. EMBO Journal 5 823 26

14. Olivier LichtargeHRB

CohenFE

1996 An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 342 358

15. FitchW

MarkowitzE

1970 An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem Genet 4 579 593

16. ZvelebilM

BartonG

TaylorW

SternbergM

1987 Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J Mol Biol 195 957 961

17. RidoutK

DixonC

FilatovD

2010 Positive selection differs between protein secondary structure elements in drosophila. Genome Biology and Evolution 2010 166 179

18. KirbyDA

MuseSV

StephanW

1995 Maintenance of pre-mrna secondary structure by epistatic selection. PNAS 92 9047 9051

19. StephanW

1996 The rate of compensatory evolution. Genetics 144 419 26

20. MeerMV

KondrashovAS

Artzy-RandrupY

KondrashovFA

2010 Compensatory evolution in mitochondrial trnas navigates valleys of low fitness. Nature 464 279 282

21. WhisstockJC

LeskAM

2004 Prediction of protein function from protein sequence and structure. Quarterly Reviews of Biophysics 36 307 340

22. NeherE

1994 How frequent are correlated changes in families of protein sequences? PNAS 91 98 102

23. LocklessSW

RanganathanR

1999 Evolutionarily conserved pathways of energetic connectivity in protein families. Science 286 295 9

24. YeangCH

HausslerD

2007 Detecting coevolution in and among protein domains. PLoS Comput Biol 3 e211 doi:10.1371/journal.pcbi.0030211

25. BurgerL

van NimwegenE

2010 Disentangling direct from indirect co-evolution of residues in protein alignments. PLoS Comput Biol 6 e1000633 doi:10.1371/journal.pcbi.1000633

26. WangQ

LeeC

2007 Distinguishing functional amino acid covariation from background linkage disequilibrium in HIV protease and reverse transcriptase. PLoS ONE 2 e814 doi:10.1371/journal.pone.0000814

27. PoonAFY

SwensonLC

DongWWY

DengW

PondSLK

2010 Phylogenetic analysis of population-based and deep sequencing data to identify coevolving sites in the nef gene of hiv-1. MBE 27 819 832

28. SocolichM

LocklessSW

RussWP

LeeH

GardnerKH

2005 Evolutionary information for specifying a protein fold. Nature 437 512 8

29. ConsortiumDG

2007 Evolution of genes and genomes on the drosophila phylogeny. Nature 450 203 18

30. SchwartzS

KentWJ

SmitA

ZhangZ

BaertschR

2003 Human-mouse alignments with blastz. Genome Res 13 103 7

31. KarolchikD

BaertschR

DiekhansM

FureyTS

HinrichsA

2003 The ucsc genome browser database. Nucleic Acids Res 31 51 4

32. ColginLM

HackmannAFM

EmondMJ

MonnatRJ

2002 The unexpected landscape of in vivo somatic mutation in a human epithelial cell lineage. PNAS 99 1437 42

33. WangJ

GonzalezKD

ScaringeWA

TsaiK

LiuN

2007 Evidence for mutation showers. PNAS 104 8403 8

34. Fukami-KobayashiK

SchreiberD

BennerS

2002 Detecting compensatory covariation signals in protein evolution using reconstructed ancestral sequences. Journal of Molecular Biology 319 729 743

35. SlatkinM

2008 Linkage disequilibrium -⁠ understanding the evolutionary past and mapping the medical future. Nat Rev Genet 9 477 485

36. TakahasiKR

InnanH

2008 The direction of linkage disequilibrium: A new measure based on the ancestral-derived status of segregating alleles. Genetics 179 1705 1712

37. DavisBH

PoonAFY

WhitlockMC

2009 Compensatory mutations are repeatable and clustered within proteins. Proc Biol Sci 276 1823 7

38. BazykinGA

KondrashovFA

OgurtsovAY

SunyaevS

KondrashovAS

2004 Positive selection at sites of multiple amino acid replacements since rat-mouse divergence. Nature 429 558 62

39. BazykinGA

DushoffJ

LevinSA

KondrashovAS

2006 Bursts of nonsynonymous substitutions in HIV-1 evolution reveal instances of positive selection at conservative protein sites. PNAS 103 19396 401

40. OrrHA

2003 A minimum on the mean number of steps taken in adaptive walks. Journal of Theoretical Biology 220 241 247

41. KulathinalR

BettencourtB

HartlD

2004 Compensated deleterious mutations in insect genomes. Science 306 1553 4

42. WeinreichDM

DelaneyNF

DepristoMA

HartlDL

2006 Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312 111 4

43. SmithJM

HaighJ

1974 The hitch-hiking effect of a favourable gene. Genetical Research 23 23 35

44. RiceWR

1987 Genetic hitchhiking and the evolution of reduced genetic activity of the y sex chromosome. Genetics 116 161 167

45. BirkyCW

WalshJB

1988 Effects of linkage on rates of molecular evolution. PNAS 85 6414 6418

46. BartonNH

1995 Linkage and the limits to natural selection. Genetics 140 821 841

47. AndolfattoP

2005 Adaptive evolution of non-coding DNA in drosophila. Nature 437 1149 52

48. BegunDJ

HollowayAK

StevensK

HillierLW

PohYP

2007 Population genomics: Whole-genome analysis of polymorphism and divergence in drosophila simulans. PLoS Biol 5 e310 doi:10.1371/journal.pbio.0050310

49. ShapiroJA

HuangW

ZhangC

HubiszMJ

LuJ

2007 Adaptive genic evolution in the drosophila genomes. PNAS 104 2271 2276

50. HillWG

RoberstonA

1966 The effect of linkage on limits to artificial selection. Genetical Research 8 269 294

51. OrtlundEA

BridghamJT

RedinboMR