Analysis of Transposon Interruptions Suggests Selection for L1 Elements on the X Chromosome
It has been hypothesised that the massive accumulation of L1 transposable elements on the X chromosome is due to their function in X inactivation, and that the accumulation of Alu elements near genes is adaptive. We tested the possible selective advantage of these two transposable element (TE) families with a novel method, interruption analysis. In mammalian genomes, a large number of TEs interrupt other TEs due to the high overall abundance and age of repeats, and these interruptions can be used to test whether TEs are selectively neutral. Interruptions of TEs that are beneficial for the host are expected to be deleterious, and therefore underrepresented compared with interruptions of neutral TEs. We found that L1 elements in the regions of the X chromosome that contain the majority of the inactivated genes are significantly less frequently interrupted than L1s on the autosomes, while L1s near genes that escape inactivation are interrupted with higher frequency, supporting the hypothesis that L1s on the X chromosome play a role in its inactivation. In addition, we show that TEs are less frequently interrupted in introns than in intergenic regions, probably due to selection against the expansion of introns, but the insertion pattern of Alus is comparable to that of other repeats.
Published in the journal:
PLoS Genet 4(8): e1000172. doi:10.1371/journal.pgen.1000172
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pgen.1000172
Introduction
The activity of transposable elements (TEs) harms their hosts primarily through disrupting coding or other selectively important regions of the genome, and through illegitimate recombination between copies of the repeats. Despite their overall deleterious effect, it is becoming increasingly clear that a considerable fraction of TEs have been domesticated by their hosts, and have either gained a regulatory role [1],[2] or, less frequently, had their sequence become part of a gene [3],[4]. Functional repeats are typically identified in two ways: either through conservation of their sequence [5],[6], or through co-localization with regions with a known genomic function [7],[8]. However, these methods may not be able to identify many functional repeats in the genome. The initial results of the ENCODE project show that even though up to ∼5% of mammalian genomes might be functional, the sequence of many experimentally determined functional elements is not conserved across species, and such elements therefore cannot be identified by sequence conservation [9],[10]. Thus, in the absence of prior information on the functionality of a genomic region, the in-silico discovery of selectively important but not conserved repeats (or genomic regions) remains a challenge.
Analysis of transposon insertions offers at least a partial solution for the detection of such repeats or genomic regions. Recently, Simons et al. [11] identified almost 1000 large (>10 kb) regions that have remained free of transposon insertions in many mammalian genomes, and a considerable fraction of them show little or no sequence conservation. Since the probability of the random emergence of such a high number of transposon-free regions is extremely low, the authors concluded that the maintenance of such regions must involve selection against TE insertions, although their exact function remains unclear. Here we take a further step and use transposon interruptions to analyze the selective constraints on transposable elements themselves. We analyze two TE classes that have been hypothesized to have an epigenetic function, at least in some regions of the genome: L1 repeats in the inactivation of the X chromosome in females of placental mammals [12], and Alus, which accumulate near genes over evolutionary time [13],[14].
TEs on average cover more than 40% of mammalian genomes [15], and remain detectable in primate genomes for up to 200 million years [14]. Since most fixed TE insertions are neutral or nearly neutral, interrupting them by other, younger TEs is also likely to be selectively neutral for the host. In consequence, mammalian genomes contain many nested TE insertions (“TE clusters”), where older TEs are interrupted by younger ones (Figure 1). The analysis of TE clusters can provide information on the evolution of TEs, and has already been used to analyze the relative age of TE families [16]–[20]. Furthermore, analysis of TE interruptions can provide insights into the selective constraints on TEs. Insertion into TE sequences which are beneficial for the host would result in their disruption and loss of function, and individuals carrying such “knock-out” TEs would undergo negative selection and disappear from the population. Thus, analyzing TE interruptions offers a novel way of investigating selective pressures on TEs in mammalian genomes. Beyond the problem of functional but non-conserved regions, inference about conservation is complicated in many genomes by the high spatial heterogeneity of substitution rates in different parts of the genome [21],[22]. Factors that lead to such heterogeneous substitution rates are the large variability of chromosome size, for example in avian and reptile genomes where the size of chromosomes spans two orders of magnitude [23],[24], or the complex evolutionary history found on the sex chromosomes [25],[26]. In such organisms and genomic regions the analysis of TE insertions may be a valuable tool in the detection of functional repeats, supplementing standard methods based on sequence comparison.
In mammals, due to the XY sex determination system, one of the X chromosomes in females is inactivated during early embryogenesis [27],[28]. Inactivation evolved separately in marsupials and placental mammals [29]; in marsupials strictly the paternal X chromosome is inactivated [30], while in placental mammals the inactivated X chromosome is selected at random [27],[28]. In humans X chromosome inactivation is mediated by a 17 kb long non-coding RNA produced by the Xist gene [31], which appeared prior to the mammalian radiation [29],[32], but is absent in marsupials [30]. Inactivated genes are not evenly distributed on the human X chromosome, but instead are mostly located on the oldest evolutionary “strata” (S1–S3) of the chromosome [33] (see also Figure 2), which largely correspond to the opossum X chromosome [30]. The exact mechanism of inactivation is not known, but the higher than average abundance of L1s on the X chromosome [12],[34], particularly near inactivated genes, has led to the hypothesis that L1s have a role in the inactivation process, by serving as “way stations” for the spread of the inactivation signal. Recent computational analyses show that the inactivation status of X-linked genes can be predicted by the neighboring repeats [35],[36]. However, the sequence conservation of L1s on the X chromosome does not differ qualitatively from that on the autosomes, and it is also unclear whether the unique patterns of repetitive element distributions on the X chromosome are the cause or the consequence of inactivation (or both).
With more than one million copies, Alus are the most abundant TEs in our genome [14],[37]. They are primate specific, parasitize active L1s for replication [38], and insert primarily into gene-poor, AT-rich regions of the genome. However, the genomic distribution of Alus changes with their age; in contrast to the youngest insertions the vast majority of Alu repeats are present in GC and gene-rich regions of the genome [13],[14]. The high density of Alus near and within genes has led to the hypothesis that many of these insertions might be preferentially retained in the genome due to a not yet fully identified function [14],[39],[40].
In this paper we test whether L1s on the X chromosome and Alus near genes are less interrupted than expected by their genomic abundance. We interpret reduced amounts of interrupted TEs as a signature of selection for the integrity of the TE sequence in that region (selection against “knock-out” TEs).
Results
L1s on the X Chromosome
We investigated the selective constraints on L1 elements on mammalian X chromosomes by examining the frequency of interrupted L1s. We analysed the evolutionary strata of the X chromosome independently, to account for their different evolutionary histories and proportion of inactivated genes. The clusters of interrupted L1s were categorised into two groups, depending on whether L1s were interrupted by L1s, or by different types of repeats (Figure 2B). (We made this distinction because interrupting a TE by a similar TE may not result in loss of functionality of the locus.) The frequency of non-L1 interruptions changes across the human X chromosome: it is highest on the still recombining pseudoautosomal regions and the youngest evolutionary strata, and lowest on the oldest stratum, where the frequency of interrupted L1s is significantly lower than on the autosomes (p = 0.0012, Wald-Wolfowitz runs test [WWrt], Figure 2B). In contrast, there is no such trend in the clusters containing L1s interrupted by other L1s; the frequency of these interruptions in the oldest strata and in the pseudoautosomal regions is comparable to the genomic median (p<0.05 only for S4–S5, WWrt, Figure 2B).
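The text does not describe how the Wald-Wolfowitz runs test was applied to the interruption frequencies, so the following is only a generic illustration of the two-sample form of the test, not the authors' procedure. The function name `two_sample_runs_test` and the use of a normal approximation are assumptions made for this sketch.

```python
import math

def two_sample_runs_test(sample_a, sample_b):
    """Two-sample Wald-Wolfowitz runs test (normal approximation).

    Hypothetical helper: pools both samples, sorts them, and counts runs
    of identical sample labels. Few runs suggest the samples differ.
    Ties between samples are ordered arbitrarily; the approximation is
    only reasonable for larger samples.
    """
    pooled = sorted([(v, "a") for v in sample_a] + [(v, "b") for v in sample_b])
    labels = [label for _, label in pooled]
    # A new run starts wherever the label changes along the sorted sequence.
    runs = 1 + sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)
    n1, n2 = len(sample_a), len(sample_b)
    n = n1 + n2
    # Expected number of runs and its variance under the null hypothesis.
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    # Lower-tail p-value: probability of observing this few runs or fewer.
    p_lower = 0.5 * math.erfc(-z / math.sqrt(2))
    return runs, z, p_lower
```

Here `sample_a` and `sample_b` could stand for per-window interruption frequencies of two genomic regions; that pairing is an illustrative choice and is not taken from the text.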
In the opossum genome we found no significant differences in the frequency of interrupted L1 elements between the autosomes and the X chromosome (p = 0.23, Wilcoxon signed rank test [Wsrt], Figure 2D). In contrast, on Stratum 1 of the human X, L1s are approximately twofold less frequently interrupted compared with the autosomes (p<0.001, Wsrt, Figure 2E). The frequency of L1 interruptions declines with the distance of the repeats from the genes, both on autosomes and the X chromosome (p<0.001, Wsrt, Figure 2C).
There are large differences between the different L1 families: the relatively young, primate specific L1P families are interrupted on the S1 and S2 strata of the X chromosome at significantly lower rates than on the autosomes (Figure 2F, p<0.001, p = 0.017 and p = 0.26 for S1, S2 and S3 respectively, Wsrt, see also Figure 4B). However, the difference is small in comparison with the older L1M families, which were active mainly before the mammalian radiation, and are much less interrupted on the human X than on the autosomes (Figure 2G; p<0.001, p = 0.55 and p<0.001 for S1, S2 and S3 respectively, Wsrt; note that the gene density of S2 is higher than the genomic average). The 5′UTRs, ORFs and 3′UTRs of L1s are interrupted by non-L1 repeats at different frequencies, particularly in the case of primate specific L1s (Figure 2F and G; the frequency of interruptions is calculated per total amount (base pairs) of the L1s in each window, and does not simply reflect the different abundance of these regions). The pattern of interruption is qualitatively similar on the autosomes and X chromosome, and no specific regions within L1s are free of interruptions compared with the autosomes.
The inactivation of genes on the X is incomplete, and several genes escape inactivation (at least partly), even on the oldest stratum (S1) of the X chromosome. We compared the frequency of interruptions in S1 that are found within 100 kb of inactivated genes with that found within 100 kb of genes that escape inactivation. We find that the L1M repeat sequence in the vicinity of (and within) genes escaping inactivation is interrupted at significantly higher rates than L1Ms near genes that are subject to inactivation (p = 0.0019, Wsrt, Figure 3), but at a somewhat lower rate than on the autosomes (p<0.001, Wsrt, Figure 3), further corroborating the relationship between the presence of uninterrupted L1s and inactivation.
Frequency of Interruptions vs. Their Distance from Genes
Since the frequency of interrupted repeats shows a clear dependence on the distance of the interrupted repeat from genes (Figure 2C, Figure 4B, C), the reduced frequency of L1 interruptions on the X chromosome could be a simple by-product of a lower than average gene density on the oldest evolutionary strata. In addition, if L1s are the only or main repeat type involved in X inactivation, then only L1s, and not other non-LTR repeats, should show a reduced frequency of interruptions on the X chromosome. We tested these hypotheses by analyzing the frequency of the interruptions of the most abundant non-LTR repeat classes of the human genome (Figure 4) on each human chromosome, using the percentage of coding sequence in the euchromatic sequence as a covariate. With the exception of Alus, the frequency of interruptions of each type of TE correlates positively with the density of coding sequence on the chromosomes. This can be explained by the lower average distance of the repeats to coding regions. In the case of L1s the S1 and S2 regions of the X chromosome are clear outliers, indicating that the lower frequency of interruptions cannot be explained by the low density of genes on these strata (Figure 4), while L2s and MIRs do not show this effect.
Alu Repeats and the Frequency of Interruptions in Introns
A relatively large number (>21 000) of Alus are interrupted in the genome, and the vast majority of the interrupters are other Alus. Due to the target specificity of the L1 integrase which Alus use (TT|AAAAA), most Alus are interrupted in the polyA stretch of the linker region between the two Alu halves (Figure 5A). In contrast, interruptions of Alus by TEs other than Alus or L1s are spread out approximately evenly across the Alu consensus sequence (Figure 5A). The frequency of interruptions of Alus by other Alus increases nearer to genes and exons, while the frequency of Alus interrupted by non-Alus remains relatively constant (Figure 5B). We find a clear difference between intergenic and intronic Alus; in introns Alus are interrupted with a considerably lower frequency both by Alus (p<0.001, Wilcoxon matched pairs test) and by other repeats (p = 0.009, Figure 5B), suggesting that interrupting a fraction of Alus in introns is deleterious. However, this pattern is not restricted to Alus: a qualitatively similar trend is present for other repeats (Figure S1) and for the combined dataset of all TEs (p<0.001, Figure 5C), indicating an overall selection against disrupted repeats in introns that includes, but is not specific to, Alus.
Discussion
L1s on the X Chromosome
Recombination between the human X and Y chromosomes ceased gradually in the last ∼300 my, leading to two pseudoautosomal regions and five evolutionary strata (S1–S5) on the X (Figure 2A); the largest and oldest of them (S1) roughly corresponds to the opossum X chromosome (Figure 2). The distribution of inactivated genes is not random between the strata; on the oldest one (S1) the vast majority of the genes are inactivated, while on the youngest strata (S4–S5) most of the genes escape inactivation [33].
Our results support the hypothesis that L1 repeats have a role in the spread of the inactivation signal on the X chromosome. The low frequency of L1 interruptions in strata with a high number of inactivated genes suggests that there is selection against L1 interruptions in these regions, and that individuals with interrupted L1s near inactivated genes were removed from the population more efficiently than individuals carrying interrupted L1s on the autosomes, where such interruptions are likely to be neutral.
Theoretically the lower frequency of interruptions could also emerge as the result of the relaxation of selection on the X chromosome, due either to its reduced rate of recombination or to inactivation. A recent study has demonstrated that full length L1s are subject to negative selection in the human genome while truncated L1s are essentially neutral [41]. Weaker selection would lead to less efficient removal of “standalone”, in particular full length, repeats from the chromosome, and consequently result in a lower frequency of interrupted repeats. However, a process like this would influence all types of repeats, not just L1s, and other, equally old non-LTR repeats like L2s and MIRs are not less interrupted on the X chromosome than elsewhere in the genome (Figure 4). This is probably true for other chromosome-wide processes that are not specific for L1 elements, such as paternal transmission of Alu repeats [4],[42]. Additionally, ancient L1Ms, which have essentially no full length copies and on average have lower insertion length than L1Ps, show the strongest pattern.
The timing of the evolution of random, Xist-mediated inactivation [29],[43] is consistent with our results. In the opossum genome, where there is no random X inactivation and the Xist gene is absent [30], L1s show equally frequent interruptions on the X chromosome and on the autosomes (Figure 2D). However, in humans the L1M families, which were active before the mammalian radiation and present when Xist-mediated inactivation evolved, are less frequently interrupted on the X than on the autosomes (Figure 2). The difference in the frequency of interruptions between the S1 and the autosomes is much smaller for the primate specific L1P families, despite the strong accumulation of L1Ps on the X chromosome. Since most arguments about the putative L1 function in inactivation were based on the increased L1P abundance on the X chromosome [12],[34],[44], the small difference in the frequency of L1P interruptions between the autosomes and the S1 is surprising. One possibility is that the accumulation of L1s on the X is a consequence and not the cause of the inactivation, due to a so far unknown mechanism. On the other hand, the primate specific L1P families are relatively young and are interrupted at much lower frequency than the mammalian wide L1M families (Figure 2), and most L1P insertions are also considerably longer than L1Ms (∼1400 bp vs. ∼700 bp respectively). Due to the low frequency of interruptions (approximately 10% of L1P repeats are interrupted, while almost 40% of L1Ms) the amount of uninterrupted sequence may still be high enough to influence the spread of the inactivation signal; thus, despite the fact that on the S1 region we found no clear support for it, the conclusion that primate specific L1s have no influence on X inactivation is premature.
Gene Density, Distance from Genes and the Frequency of TE Interruptions
The frequency of interrupted repeats within a TE family is largely determined by the age of the family; older families had more time to accumulate insertions and are more interrupted than young ones; this has already been used by our group to determine the age of mammalian repeats [20]. However, besides age, gene density also has a significant influence on the frequency of interrupted repeats. We found that in the vicinity of genes TEs are likely to be more interrupted than in gene poor regions (Figure 2), and in consequence gene dense chromosomes have a higher frequency of TE clusters (Figure 4). We propose two mechanisms which can cause this pattern. In gene dense regions the likelihood that a new insertion into the euchromatic sequence will be deleterious due to the disruption of a coding or regulatory sequence is high, while inserting into another TE is likely to be neutral (with the exception of specific regions where TEs acquired some function, like exapted repeats or the X chromosome). This is likely to result in an increased frequency of interrupted repeats close to genes. A second mechanism that can result in the depletion of TE clusters in gene-poor regions is illegitimate recombination between repeats [45]–[49], particularly Alus. Illegitimate recombination is probably the key process behind the large spatial variability of TEs in the genome, and the distribution of Alus in particular is modified by its effects [45],[46],[48],[50]. Due to their large numbers Alus are the most frequent interrupters in the genome, and if the probability of ectopic recombination between Alu elements is larger than between the original, uninterrupted repeats, then illegitimate recombination between Alus also reduces the amount of TE clusters, because it results in deletions [51],[52] which can contain a part of the cluster.
Since the likelihood that deletions are neutral and will reach fixation is highest in gene poor regions, this process leads to a positive correlation between gene density and frequency of TE interruptions, similarly to the density of Alu repeats.
Alu Repeats and the Frequency of TE Interruptions in Introns
The vast majority of Alus are interrupted in their A-rich linker region that connects the two GC rich arms of the repeat (Figure 5A). This pattern can be easily explained by the insertion preference of the repeats; both Alus and L1s target TT|AAAAA sites, while interruptions by all other repeats with different target site specificity are not clustered at the linker region. The lower frequency of interrupted Alus and other TEs in introns than in intergenic regions suggests that intronic TE interruptions may be deleterious. This is in agreement with the findings that several Alu containing exons are alternatively spliced, and with suggestions that Alus may thus have a profound influence on the human transcriptome [53],[54]. In addition, a recent study by Gal-Mark et al. [55] has demonstrated that both arms of Alus are used in this process, and that experimentally increasing the distance between them results in deleterious splicing. Since the majority of Alus are interrupted exactly in the linker region between the two Alu arms (Figure 5A), these findings provide an elegant example of the loss of biological function due to interruption of a repeat. However, only 0.2% of Alus appear to be exonised [54], and the pattern we observe is not specific for Alus (Figure 5C, Figure S1); thus this mechanism alone is not sufficient to explain the low frequency of interruptions in introns, or the accumulation of Alus in gene-dense regions.
TEs show biases in their orientation in introns due to selection against Alus, L1s and in particular LTR insertions in the forward direction, because these repeats can interfere with transcription [56]. In introns, therefore, the neutrality of an insertion also depends on the orientation of the TE, and a fraction of TE insertions in the forward direction is likely to be deleterious, even if they interrupt other repeats. This process results in a lower frequency of interruptions in introns; however, it can account for less than half of the difference between the frequency of interruptions in introns and intergenic regions (Figure S2), because the bias in interruptions is much stronger than the bias in the number of TEs.
We see at least two additional processes that may lead to a reduced frequency of interrupted repeats in introns. First, introns may be selected for small size [57],[58], and since new TEs increase intron size they are weakly deleterious, even if they disrupt a transposable element. Thus, unlike in intergenic regions, an insertion into a transposable element is not neutral in introns, leading to a higher probability that such nested insertions will be lost during evolution. This hypothesis does not assume that Alus or other repeats have any specific benefit for the host, and it predicts that the effect will be the strongest in short introns of highly expressed genes, where selection for small intron size is the strongest [58]. This prediction is consistent with our findings: the difference in the frequency of interruptions between intronic and intergenic repeats is the largest in the vicinity of exons, and gradually declines with increasing distance from exons (Figure 5C).
An alternative hypothesis is that selection acts on some of the TEs themselves, and a fraction of the TEs within introns are beneficial for the host, most likely due to their effect on gene expression. This is consistent with recent findings, which indicate that a large number of TEs are involved in gene regulation through cis natural antisense transcripts, and that 98.2% of such repeats are present in introns [59].
An interesting pattern in the data is the much more pronounced accumulation of self-interrupted Alus near genes than of Alu sequence (Figure 5). The large scale shift in the frequency of self-interrupted repeats spanning more than 100 kb supports the studies showing that the Alu distribution in the genome is significantly shaped by illegitimate recombination [45],[46],[48]. Illegitimate recombination between nested Alus accounts for 8% of Alu-Alu recombination mediated deletions in the chimpanzee [50], but only 1.8% of Alus contain a self-insertion in the human genome, thus deletions caused by self-interrupted Alus are 4.3 times more frequent than between individual Alus. This is expected to reduce the density of nested Alus, particularly in gene poor regions, where such deletions are less deleterious and can reach fixation. The reason for the higher frequency of recombination between nested Alus is probably their length, because ectopic recombination depends on the length of a repeat [60], and tandemly repeated sequences are particularly prone to it (a nested Alu insertion contains 4 almost identical arms). Taken together, our data confirm that the variability in the abundance of Alu repeats in primate genomes is caused by the frequency of (illegitimate) recombination.
Conclusions
- In regions of the X chromosome which are subject to inactivation (strata 1 to 3), L1 elements, primarily L1Ms, are interrupted at lower frequency than on autosomes or other, more active regions of the X chromosome. Assuming that a lower than expected frequency of interruptions indicates selection, our analysis suggests that the ancient L1M repeats are utilized by the inactivation mechanism, while we found support for such a function for the primate specific L1Ps only on the S2. This is consistent with the phylogenetic distribution of X inactivation, which probably evolved before the mammalian radiation (and the appearance of the L1P clade).
- On the X chromosome of the opossum, which has no random inactivation, lacks Xist, and is largely homologous to the oldest evolutionary stratum of the human X, the pattern of L1 interruptions is similar to that on the autosomes.
- The frequency of interrupted TEs is highest near genes and exons, probably due to the higher likelihood of deleterious insertions in gene dense regions, and the more frequent loss of TE clusters from gene poor regions via non-homologous recombination between repeats.
- TEs are less interrupted in introns than in intergenic regions, possibly due to selection on intron size.
- The analysis of TE interruptions appears to be a useful method to gain insights into the selective constraints on genomes. The method is clearly not as informative as inference from sequence conservation; its main limitations are that it cannot provide information on individual TE copies, can be used only in repeat rich genomes, and identifying the real target of selection (e.g. intron size vs. repetitive elements) may need additional work. However, its major advantage is that it does not rely on any assumption on substitution rates, prior knowledge on functionality, or on the assumption that functional copies of TEs are conserved, which makes it a valuable tool for analyses where these assumptions are uncertain.
Materials and Methods
Data Sources
The following files were downloaded from the UCSC Genome Browser: the RepeatMasker annotation files for the human (hg18) and opossum (monDom4) genomes, which provide the coordinates of repetitive elements, and the UCSC known-gene file for the human genome that provides the genomic coordinates of genes. The coordinates of the evolutionary strata of the X chromosomes were taken from Carrel and Willard [26]. In the first step of the analysis we integrated these datasets, and determined the position (intergenic, intronic), distance (bp), and orientation (the same or opposite strand) of each transposable element in relation to its neighboring genes. The inactivation status of genes on the X chromosome is available from Carrel and Willard [33]; we used their coordinates in the hg18 draft (UCSC) of the human genome, and excluded genes that were not present in the databases of UCSC. Genes that showed activity in at least 30% of the cases were considered as escaping inactivation. The amount of coding sequence for each chromosome (Figure 5) was determined using the knownGenes dataset of UCSC, while the amount of euchromatic sequence was determined from the raw sequence files.
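The integration step described above (position, distance and orientation of each TE relative to its neighboring genes) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `annotate_te`, the half-open coordinate convention, and the simplification that any overlap with a gene body counts as "within_gene" (without separating introns from exons) are all assumptions of this sketch.

```python
def annotate_te(te_start, te_end, te_strand, genes):
    """Annotate one TE relative to the nearest gene on the same chromosome.

    Hypothetical helper. Coordinates are assumed half-open [start, end).
    genes: list of (start, end, strand) tuples.
    Returns None when no genes are supplied.
    """
    best = None
    for g_start, g_end, g_strand in genes:
        if te_start < g_end and te_end > g_start:
            dist = 0                      # TE overlaps the gene body
        elif te_end <= g_start:
            dist = g_start - te_end       # TE lies to the left of the gene
        else:
            dist = te_start - g_end       # TE lies to the right of the gene
        if best is None or dist < best[0]:
            best = (dist, g_strand)
    if best is None:
        return None
    dist, g_strand = best
    return {
        "position": "within_gene" if dist == 0 else "intergenic",
        "distance_bp": dist,
        "orientation": "same" if te_strand == g_strand else "opposite",
    }
```

Separating intronic from exonic overlaps would additionally require exon coordinates, which the known-gene annotation provides but which are omitted here for brevity.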
A large fraction of transposable element insertions are fragmented: most old repeats which originally inserted into the genome have been split into several fragments, either due to recombination and short insertions or due to insertions of other, younger transposable elements into their sequence. Defragmentation is the reconstruction of the original insertion from its fragments; we defragmented TEs using Transposon Cluster Finder (TCF), a program recently developed by our group [20], and identified interrupted transposons, i.e. TEs into which other, younger TEs have inserted. TCF supports two methods of identifying interruptions: its native algorithm (described in detail in [20]), and the defragmentation information (IDs) present in the RepeatMasker output. We used the native defragmentation algorithm of TCF in our analyses (the numbers of TE clusters found by these two methods are comparable, and they lead to similar conclusions), and determined the key characteristics of transposon clusters: the positions of interruptions in the consensus TE sequence, and the interrupting repeats.
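To illustrate what detecting an interruption involves, the following toy sketch flags a repeat that is flanked on both sides by fragments of the same older element whose consensus coordinates continue across the gap. It is not the TCF algorithm, which is described in [20]; the `Fragment` type, the function name `find_nested`, and the tolerance parameters `max_gap` and `cons_tol` are hypothetical and chosen only for this example.

```python
from typing import NamedTuple

class Fragment(NamedTuple):
    start: int        # genomic start
    end: int          # genomic end
    family: str       # repeat family name
    strand: str       # "+" or "-"
    cons_start: int   # start in the repeat consensus
    cons_end: int     # end in the repeat consensus

def find_nested(fragments, max_gap=50, cons_tol=50):
    """Return (left, interrupter, right) triples of candidate nested insertions.

    Assumes all fragments lie on one chromosome. A triple is reported when
    the outer fragments share family and strand, all three are adjacent in
    the genome, and the outer consensus coordinates are contiguous.
    """
    frags = sorted(fragments, key=lambda f: f.start)
    hits = []
    for left, mid, right in zip(frags, frags[1:], frags[2:]):
        same_te = left.family == right.family and left.strand == right.strand
        adjacent = (mid.start - left.end <= max_gap
                    and right.start - mid.end <= max_gap)
        if left.strand == "+":
            continuous = abs(right.cons_start - left.cons_end) <= cons_tol
        else:
            # On the minus strand the genomic-left fragment is the later
            # part of the consensus.
            continuous = abs(left.cons_start - right.cons_end) <= cons_tol
        if same_te and adjacent and continuous:
            hits.append((left, mid, right))
    return hits
```

A real defragmentation procedure must also handle interruptions by several elements, deletions inside the older repeat, and clusters nested more than one level deep, none of which this sketch attempts.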
Data Analysis
Since the probability of being interrupted depends on the length and density of the repeats (the likelihood that an individual TE insertion will be interrupted is higher for longer repeats), and both vary between chromosomes (for example, L1 insertions are longer on the X chromosome than on the autosomes [61]), we standardized the frequency of interruptions by the length of the repeats within the analyzed regions. The frequency of interruptions was calculated as the number of interruptions within the copies of a TE family, divided by the summed length of insertions of the same TE family in the analyzed region.
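The length-standardized frequency defined above amounts to a per-family ratio of two sums. A minimal sketch follows, assuming the input has already been reduced to one record per TE copy; the function name `interruption_frequency` and the record layout are illustrative choices, not part of the published method.

```python
from collections import defaultdict

def interruption_frequency(copies):
    """Interruptions per base pair of TE sequence, per family.

    copies: iterable of (family, length_bp, n_interruptions) records for
    the TE copies found in one analyzed region (hypothetical layout).
    """
    total_len = defaultdict(int)
    total_int = defaultdict(int)
    for family, length_bp, n_int in copies:
        total_len[family] += length_bp
        total_int[family] += n_int
    # Divide the number of interruptions by the summed insertion length.
    return {
        fam: total_int[fam] / total_len[fam]
        for fam in total_len
        if total_len[fam] > 0
    }
```

Dividing by summed length rather than by copy number is what makes regions with longer insertions, such as L1s on the X chromosome, comparable with regions dominated by short fragments.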
We also determined the frequency of interruptions across the repeat consensus sequence, and its dependence on the distance from exons. This was calculated as follows: for Figures 2C and 5B–C the amount of TE sequence falling into 5 kb bins counted from gene and exon boundaries was calculated. TEs falling into more than one bin were split, and only the fraction of the repeat overlapping with the bin was added to the amount of TE sequence in that bin. For Figures 2D–G and 3, the sequence of L1s was split into 300 bp long bins, or on the smaller evolutionary strata of the X chromosome into 600–900 bp long bins, along the consensus sequence, to examine the frequency of interruptions in different regions of repeats. The insertion profile of Alus across the consensus sequence (Figure 5A) was not standardized with Alu length, because Alus do not show biases in their sequence distribution comparable to L1s. The frequency of TE interruptions was determined separately for the clusters that did and did not contain a “self-insertion” (i.e. an L1 repeat interrupted by a younger L1).
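The splitting of TEs across 5 kb distance bins described above can be sketched as follows. The function name `add_to_distance_bins` and the representation of a TE as a half-open interval of distances from a gene or exon boundary are assumptions made for illustration.

```python
def add_to_distance_bins(bins, dist_start, dist_end, bin_size=5000):
    """Add a TE's sequence to distance bins, splitting it at bin edges.

    bins: dict mapping bin index -> accumulated TE base pairs.
    The TE is assumed to span distances [dist_start, dist_end) in bp from
    a gene or exon boundary; each bin receives only the overlapping part.
    """
    pos = dist_start
    while pos < dist_end:
        idx = pos // bin_size
        bin_end = (idx + 1) * bin_size
        overlap = min(dist_end, bin_end) - pos
        bins[idx] = bins.get(idx, 0) + overlap
        pos += overlap
    return bins
```

For example, a TE spanning distances 4000 to 11000 bp contributes 1000 bp to the first bin, 5000 bp to the second and 1000 bp to the third, so the total amount of TE sequence is preserved.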
The abundances of repetitive elements show large-scale correlations in the genome [13],[14]; for example, Alus are most abundant near genes, whereas L1s are most abundant in gene-poor areas, owing to differences between genomic regions in the rate of repeat removal by ectopic recombination and small deletions [46],[62]. The rate of repeat loss is in turn determined by recombination rate and the density of coding sequence, and also has a large effect on the frequency of interrupted repeats. To account for the combined effects of gene density, deletions (TE loss), and distance to genes, we included a covariate in the analysis, the ratio of coding to euchromatic sequence, which explains a large percentage of the variance (Figure 4).
The sizes of the oldest evolutionary strata (S1, S2, and S3) are very different: S1 and S3 are comparable to the small autosomes, but S2 is only 15.7 Mb long. Since the variability of the frequency of interruptions is expected to be higher for smaller genomic regions, we divided the genome into 207 non-overlapping 15.7 Mb windows and calculated the frequencies of interrupted L1s and the density of coding sequence in each. The regressions between the frequency of interruptions and the fraction of coding sequence (Figure S3) show higher variance than the plots containing data from the autosomes; nevertheless, the pattern is qualitatively similar, and S2 is significantly less interrupted than regions of comparable length on the autosomes (t-tests, p<0.001 for both L1P and L1M, using second-order polynomials to estimate regression residuals).
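The residual step of such a comparison can be sketched in plain Python. The code below fits a second-order polynomial by ordinary least squares and returns the residuals for each window; it is an illustration under stated assumptions, not the authors' analysis code, and it leaves out the subsequent t-test. The function names and the example values are hypothetical.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 via the normal equations."""
    # Power sums S[k] = sum(x**k) and moments T[k] = sum(y * x**k).
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 system, solved by Gauss-Jordan elimination with pivoting.
    m = [[S[i + j] for j in range(3)] + [T[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

def residuals(xs, ys):
    """Observed minus fitted values under the quadratic fit."""
    a, b, c = fit_quadratic(xs, ys)
    return [y - (a + b * x + c * x * x) for x, y in zip(xs, ys)]

# Made-up values: x = fraction of coding sequence per window,
# y = interruption frequency per window (arbitrary units).
coding_fraction = [0.0, 1.0, 2.0, 3.0, 4.0]
interruption_freq = [1.0, 6.0, 17.0, 34.0, 57.0]  # exactly 1 + 2x + 3x^2
print(residuals(coding_fraction, interruption_freq))
```

With real data, the residual of a window such as S2 would then be compared with the residuals of autosomal windows of the same length.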
Supporting Information
References
1. Medstrand P, van de Lagemaat LN, Dunn CA, Landry JR, Svenback D (2005) Impact of transposable elements on the evolution of mammalian gene regulation. Cytogenetic and Genome Research 110: 342–352.
2. van de Lagemaat LN, Landry JR, Mager DL, Medstrand P (2003) Transposable elements in mammals promote regulatory variation and diversification of genes with specialized functions. Trends in Genetics 19: 530–536.
3. Feschotte C, Pritham EJ (2007) DNA transposons and the evolution of eukaryotic genomes. Annual Review of Genetics 41: 331–368.
4. Jurka J, Kapitonov VV, Kohany O, Jurka MV (2007) Repetitive Sequences in Complex Genomes: Structure and Evolution. Annu Rev Genomics Hum Genet.
5. Bejerano G, Siepel AC, Kent WJ, Haussler D (2005) Computational screening of conserved genomic DNA in search of functional noncoding elements. Nature Methods 2: 535–545.
6. Lowe CB, Bejerano G, Haussler D (2007) Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. Proc Natl Acad Sci U S A 104: 8005–8010.
7. Jordan IK, Rogozin IB, Glazko GV, Koonin EV (2003) Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends in Genetics 19: 68–72.
8. Marino-Ramirez L, Lewis KC, Landsman D, Jordan IK (2005) Transposable elements donate lineage-specific regulatory sequences to host genomes. Cytogenetic and Genome Research 110: 333–341.
9. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799–816.
10. Margulies EH, Cooper GM, Asimenos G, Thomas DJ, Dewey CN (2007) Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res 17: 760–774.
11. Simons C, Pheasant M, Makunin IV, Mattick JS (2006) Transposon-free regions in mammalian genomes. Genome Research 16: 164–172.
12. Lyon MF (1998) X-chromosome inactivation: a repeat hypothesis. Cytogenetics and Cell Genetics 80: 133–137.
13. Pavlicek A, Jabbari K, Paces J, Paces V, Hejnar J (2001) Similar integration but different stability of Alus and LINEs in the human genome. Gene 276: 39–45.
14. IHGSC (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921.
15. Margulies EH, Maduro VVB, Thomas PJ, Tomkins JP, Amemiya CT (2005) Comparative sequencing provides insights about the structure and conservation of marsupial and monotreme genomes. Proceedings of the National Academy of Sciences of the United States of America 102: 3354–3359.
16. Bergman CM, Quesneville H, Anxolabehere D, Ashburner M (2006) Recurrent insertion and duplication generate networks of transposable element sequences in the Drosophila melanogaster genome. Genome Biology 7.
17. Kriegs JO, Matzke A, Churakov G, Kuritzin A, Mayr G (2007) Waves of genomic hitchhikers shed light on the evolution of gamebirds (Aves: Galliformes). BMC Evolutionary Biology 7.
18. Churakov G, Smit AFA, Brosius J, Schmitz B (2005) A novel abundant family of retroposed elements (DAS-SINEs) in the nine-banded armadillo (Dasypus novemcinctus). Molecular Biology and Evolution 22: 886–893.
19. Pace JK, Feschotte C (2007) The evolutionary history of human DNA transposons: Evidence for intense activity in the primate lineage. Genome Research 17: 422–432.
20. Giordano J, Ge Y, Gelfand Y, Abrusan G, Benson G (2007) Evolutionary history of mammalian transposons determined by genome-wide defragmentation. PLoS Computational Biology 3: e137.
21. Arndt PF, Hwa T, Petrov DA (2005) Substantial regional variation in substitution rates in the human genome: Importance of GC content, gene density, and telomere-specific effects. Journal of Molecular Evolution 60: 748–U728.
22. Webster MT, Axelsson E, Ellegren H (2006) Strong regional biases in nucleotide substitution in the chicken genome. Molecular Biology and Evolution 23: 1203–1216.
23. ICGSC (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432: 695–716.
24. Axelsson E, Webster MT, Smith NGC, Burt DW, Ellegren H (2005) Comparison of the chicken and turkey genomes reveals a higher rate of nucleotide divergence on microchromosomes than macrochromosomes. Genome Research 15: 120–125.
25. Ross MT, Grafham DV, Coffey AJ, Scherer S, McLay K (2005) The DNA sequence of the human X chromosome. Nature 434: 325–337.
26. Handley LL, Ceplitis H, Ellegren H (2004) Evolutionary strata on the chicken Z chromosome: Implications for sex chromosome evolution. Genetics 167: 367–376.
27. Plath K, Mlynarczyk-Evans S, Nusinow DA, Panning B (2002) Xist RNA and the mechanism of X chromosome inactivation. Annual Review of Genetics 36: 233–278.
28. Chow JC, Yen Z, Ziesche SM, Brown CJ (2005) Silencing of the mammalian X chromosome. Annual Review of Genomics and Human Genetics 6: 69–92.
29. Duret L, Chureau C, Samain S, Weissenbach J, Avner P (2006) The Xist RNA gene evolved in eutherians by pseudogenization of a protein-coding gene. Science 312: 1653–1655.
30. Mikkelsen TS, Wakefield MJ, Aken B, Amemiya CT, Chang JL (2007) Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature 447: 167–U161.
31. Brown CJ, Hendrich BD, Rupert JL, Lafreniere RG, Xing Y (1992) The human XIST gene: analysis of a 17 kb inactive X-specific RNA that contains conserved repeats and is highly localized within the nucleus. Cell 71: 527–542.
32. Chureau C, Prissette M, Bourdet A, Barbe V, Cattolico L (2002) Comparative sequence analysis of the X-inactivation center region in mouse, human, and bovine. Genome Research 12: 894–908.
33. Carrel L, Willard HF (2005) X-inactivation profile reveals extensive variability in X-linked gene expression in females. Nature 434: 400–404.
34. Bailey JA, Carrel L, Chakravarti A, Eichler EE (2000) Molecular evidence for a relationship between LINE-1 elements and X chromosome inactivation: The Lyon repeat hypothesis. Proceedings of the National Academy of Sciences of the United States of America 97: 6634–6639.
35. Carrel L, Park C, Tyekucheva S, Dunn J, Chiaromonte F (2006) Genomic environment predicts expression patterns on the human inactive X chromosome. PLoS Genetics 2: 1477–1486.
36. Wang Z, Willard HF, Mukherjee S, Furey TS (2006) Evidence of influence of genomic DNA sequence on human X chromosome inactivation. PLoS Computational Biology 2: 979–988.
37. Jurka J (2004) Evolutionary impact of human Alu repetitive elements. Current Opinion in Genetics & Development 14: 603–608.
38. Dewannieux M, Esnault C, Heidmann T (2003) LINE-mediated retrotransposition of marked Alu sequences. Nature Genetics 35: 41–48.
39. Polak P, Domany E (2006) Alu elements contain many binding sites for transcription factors and may play a role in regulation of developmental processes. BMC Genomics 7.
40. Hasler J, Strub K (2006) Alu elements as regulators of gene expression. Nucleic Acids Research 34: 5491–5497.
41. Boissinot S, Davis J, Entezam A, Petrov D, Furano AV (2006) Fitness cost of LINE-1 (L1) activity in humans. Proceedings of the National Academy of Sciences of the United States of America 103: 9590–9594.
42. Jurka J, Krnjajic M, Kapitonov VV, Stenger JE, Kokhanyy O (2002) Active Alu elements are passed primarily through paternal germlines. Theoretical Population Biology 61: 519–530.
43. Ng K, Pullirsch D, Leeb M, Wutz A (2007) Xist and the order of silencing. EMBO Reports 8: 34–39.
44. Lyon MF (2003) The Lyon and the LINE hypothesis. Seminars in Cell & Developmental Biology 14: 313–318.
45. Abrusan G, Krambeck HJ (2006) The distribution of L1 and Alu retroelements in relation to GC content on human sex chromosomes is consistent with the ectopic recombination model. Journal of Molecular Evolution 63: 484–492.
46. Sen SK, Han KD, Wang JX, Lee J, Wang H (2006) Human genomic deletions mediated by recombination between Alu elements. American Journal of Human Genetics 79: 41–53.
47. Brookfield JFY (2001) Selection on Alu sequences? Current Biology 11: R900–R901.
48. Hackenberg M, Bernaola-Galvan P, Carpena P, Oliver JL (2005) The biased distribution of Alus in human isochores might be driven by recombination. Journal of Molecular Evolution 60: 365–377.
49. Montgomery EA, Huang SM, Langley CH, Judd BH (1991) Chromosome rearrangement by ectopic recombination in Drosophila melanogaster: genome structure and evolution. Genetics 129: 1085–1098.
50. Han K, Lee J, Meyer TJ, Wang J, Sen SK (2007) Alu recombination-mediated structural deletions in the chimpanzee genome. PLoS Genetics 3: 1939–1949.
51. Symer DE, Connelly C, Szak ST, Caputo EM, Cost GJ (2002) Human L1 retrotransposition is associated with genetic instability in vivo. Cell 110: 327–338.
52. Gilbert N, Lutz-Prigge S, Moran JV (2002) Genomic deletions created upon LINE-1 retrotransposition. Cell 110: 315–325.
53. Lev-Maor G, Sorek R, Shomron N, Ast G (2003) The birth of an alternatively spliced exon: 3′ splice-site selection in Alu exons. Science 300: 1288–1291.
54. Sela N, Mersch B, Gal-Mark N, Lev-Maor G, Hotz-Wagenblatt A (2007) Comparative analysis of transposed element insertion within human and mouse genomes reveals Alu's unique role in shaping the human transcriptome. Genome Biology 8.
55. Gal-Mark N, Schwartz S, Ast G (2008) Alternative splicing of Alu exons - two arms are better than one. Nucleic Acids Research 36: 2012–2023.
56. Smit AFA (1999) Interspersed repeats and other mementos of transposable elements in mammalian genomes. Current Opinion in Genetics & Development 9: 657–663.
57. Urrutia AO, Hurst LD (2003) The signature of selection mediated by expression on human genes. Genome Res 13: 2260–2264.
58. Castillo-Davis CI, Mekhedov SL, Hartl DL, Koonin EV, Kondrashov FA (2002) Selection for short introns in highly expressed genes. Nat Genet 31: 415–418.
59. Conley AB, Miller WJ, Jordan IK (2008) Human cis natural antisense transcripts initiated by transposable elements. Trends Genet 24: 53–56.
60. Petrov DA, Aminetzach YT, Davis JC, Bensasson D, Hirsh AE (2003) Size matters: Non-LTR retrotransposable elements and ectopic recombination in Drosophila. Molecular Biology and Evolution 20: 880–892.
61. MGSC (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520–562.
62. Petrov DA, Sangster TA, Johnston JS, Hartl DL, Shaw KL (2000) Evidence for DNA loss as a determinant of genome size. Science 287: 1060–1062.
Article published in PLOS Genetics, 2008, Issue 8