Systematic identification of functional SNPs interrupting 3’UTR polyadenylation signals
Authors:
Eldad David Shulman aff001; Ran Elkon aff001
Authors place of work:
Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel
aff001
Published in the journal:
Systematic identification of functional SNPs interrupting 3’UTR polyadenylation signals. PLoS Genet 16(8): e32767. doi:10.1371/journal.pgen.1008977
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pgen.1008977
Summary
Alternative polyadenylation (APA) is emerging as a widespread regulatory layer since the majority of human protein-coding genes contain several polyadenylation (p(A)) sites in their 3’UTRs. By generating isoforms with different 3’UTR length, APA potentially affects mRNA stability, translation efficiency, nuclear export, and cellular localization. Polyadenylation sites are regulated by adjacent RNA cis-regulatory elements, the principals among them are the polyadenylation signal (PAS) AAUAAA and its main variant AUUAAA, typically located ~20-nt upstream of the p(A) site. Mutations in PAS and other auxiliary poly(A) cis-elements in the 3’UTR of several genes have been shown to cause human Mendelian diseases, and to date, only a few common SNPs that regulate APA were associated with complex diseases. Here, we systematically searched for SNPs that affect gene expression and human traits by modulation of 3’UTR APA. First, focusing on the variants most likely to exert the strongest effect, we identified 2,305 SNPs that interrupt the canonical PAS or its main variant. Implementing pA-QTL tests using GTEx RNA-seq data, we identified 330 PAS SNPs (called PAS pA-QTLs) that were significantly associated with the usage of their p(A) site. As expected, PAS-interrupting alleles were mostly linked with decreased cleavage at their p(A) site and the consequential 3’UTR lengthening. However, interestingly, in ~10% of the cases, the PAS-interrupting allele was associated with increased usage of an upstream p(A) site and 3’UTR shortening. As an indication of the functional effects of these PAS pA-QTLs on gene expression and complex human traits, we observed for few dozens of them marked colocalization with eQTL and/or GWAS signals. The PAS-interrupting alleles linked with 3’UTR lengthening were also strongly associated with decreased gene expression, indicating that shorter isoforms generated by APA are generally more stable than longer ones. Last, we carried out an extended, genome-wide analysis of 3’UTR variants and detected thousands of additional pA-QTLs having weaker effects compared to the PAS pA-QTLs.
Keywords:
Gene expression – Gene regulation – Alleles – Messenger RNA – Genome-wide association studies – Genomic signal processing – Polyadenylation – Single nucleotide polymorphisms
Introduction
The maturation of mRNA 3′ ends is a 2-steps process, termed cleavage and polyadenylation, that involves endonucleolytic cleavage of the nascent RNA followed by synthesis of a poly(A) tail on the 3′ terminus of the cleaved product [1]. Cleavage and polyadenylation sites (p(A) sites) are determined and controlled by adjacent RNA cis-regulatory elements, the principal among them is the polyadenylation signal (PAS) AAUAAA, typically located ~20-nt upstream of the p(A) site. There are more than 10 weaker variants to this canonical PAS, the main among them is AUUAAA [2]. Auxiliary elements include upstream U-rich and UGUA motifs and downstream U-rich and GU-rich elements, and the strength of a p(A) site is determined by these elements in a combinatorial manner [3]. Importantly, the majority of human protein-coding genes contain several alternative p(A) sites in their 3’UTR, making alternative polyadenylation (APA) a widespread regulatory layer that generates transcript isoforms with alternative 3′ ends, and correspondingly, different 3’UTR lengths [1, 4–6]. As 3′ UTRs contain cis-elements that serve as major docking platforms for microRNAs (miRNAs) and RNA binding proteins (RBPs), which are involved in various aspects of mRNA metabolism, 3′ UTR APA can affect post-transcriptional regulation in multiple ways, including the modulation of mRNA stability, translation efficiency, nuclear export and cellular localization [7–9]. Transcriptomic studies demonstrated that APA is globally modulated during multiple differentiation processes [4, 10–17] and in response to changes in cell proliferation state [18–20]. Yet, our current understanding of the impact of APA on gene regulation and of its biological roles is still very rudimentary.
Mutations in PAS and other poly(A) cis-elements, and the resulting alteration in gene expression have been shown to cause several human Mendelian diseases. Examples include the mutation in the 3’UTR of HBA2 (converting AATAAA to AATAAG) causing α-Thalassaemia [21], the mutation in the 3’UTR of HBB (AATAAA to AACAAA) causing β-Thalassaemia [22] and the mutation in the 3’UTR of FOXP3 (AATAAA to AATGAA) causing the IPEX syndrome [23] (for a thorough review see [24]). In addition, few common SNPs that regulate APA were found to affect the risk of complex diseases. This includes a risk SNP for systemic lupus erythematosus (SLE) located in the 3’UTR of IRF5. This SNP reduces the use of a proximal p(A) site, leading to the production of longer and less stable isoforms and consequently to reduced IRF5 levels [25]. Similarly, a polymorphic cis-element downstream to a PAS in the 3’UTR of ATP1B1 is associated with high blood pressure [26].
Given the high prevalence of APA, there are likely many more common SNPs that modulate APA and affect disease susceptibility. In this study, we systematically searched for SNPs that interfere with 3’UTR PAS signals in the human genome and implemented polyadenylation-QTL (pA-QTL) tests to examine their effect on APA. Analyzing GTEx RNA-seq data from twelve tissues we identified dozens of such SNPs that significantly affect the usage of their downstream p(A) sites. The intersection of these pA-QTLs with eQTLs and GWAS risk SNPs indicated their roles in the regulation of gene expression and their impact on various phenotypes.
Results
pA-QTL analysis of PAS SNPs
We first sought to systematically identify SNPs in the human genome that affect 3’UTR APA. To focus our analysis on the ones likely to exert the strongest effect on gene regulation (and thus on phenotype), we specifically searched for SNPs that interrupt the canonical PAS (AATAAA) or its main variant (ATTAAA). Using p(A) sites annotations from polyA DB [27], we overall detected 2,305 SNPs that interfere with these signals in a 40-nt window upstream of annotated 3’UTR p(A) sites in 1,936 unique genes (Fig 1A; S1 Table). We call such variants PAS SNPs. Each PAS SNP has one allele that preserves the PAS and another allele that interrupts it (called PAS Interrupting Allele (PIA)). Note that the PIA can be either the allele appearing on the genome reference sequence (the reference allele) or the alternative allele (Fig 1B). 68% of the PAS SNPs were located within the canonical AATAAA PAS (S1 Table). We associated each PAS SNP with its downstream p(A) site.
PAS SNPs potentially have a strong impact on cleavage and polyadenylation efficiency at their downstream p(A) sites. To detect such effects we implemented pA-QTL tests that are based on the estimation of p(A)-site usage from RNA-seq data. Each PAS SNP divides its 3’UTR into two segments: a common UTR (cUTR) which is upstream to the associated p(A) site and is common to the transcript generated by cleavage at this p(A) site and the transcripts generated by usage of more distal p(A) sites, and an alternative UTR (aUTR) which is included only in transcripts generated by more distal p(A) sites (Fig 2A). For each RNA-seq sample, we estimated the usage of each p(A) site associated with a PAS SNP by calculating the p(A) site Usage Index (pAUI), defined as the ratio (in log scale) between the number of reads mapped to the cUTR and the aUTR segments (Fig 2A). Having RNA-seq data from a large cohort of individuals for which genotype data is available too, enables to test for association between the PAS SNP alleles and the pAUI levels (Fig 2A). We carried out such pA-QTL tests on twelve selected tissues from the GTEx project (v7) [28] that have more than 130 RNA-seq samples with corresponding genotype data (Methods; S2 Table). Per tissue, and for each p(A) site linked with a PAS SNP, we examined variants within a window of +/- 10 kbp with respect to the p(A) site for association with pAUI levels using FastQTL [29] (Methods). Overall, 512 pA sites were significantly associated (FDR ≤ 5%) with a pA-QTL in at least one tissue (see examples in Fig 2B–2D; S3 Table).
Linkage disequilibrium (LD) patterns between adjacent genetic variants complicate the identification of causal variants. Therefore, next, we used CAVIAR [30] to detect pA-QTL associations that can be more confidently ascribed to the effect of the corresponding PAS SNPs (S1 Fig; Methods). Defining per pA site with a significant association a 95% credible set of variants, this analysis implicated the PAS SNP as a causal variant for 55%-70% of the observed pA-QTLs per tissue (Fig 3A). Hereafter, we refer to PAS SNPs as PAS pA-QTLs if they (a) obtained in the pA-QTL tests p-values below their pA site-level thresholds (see Methods) and (b) were included in CAVIAR’s credible sets. Overall, we detected 330 distinct PAS pA-QTLs, 70% of them interrupt a canonical PAS (and the rest interrupt the ATTAAA variant) (S3 Table). Most of these PAS pA-QTLs were detected in multiple tissues (Fig 3B and S3 Table). The p(A) loci in which the PAS SNP was not included in CAVIAR’s credible set indicate that there are additional mechanisms by which genetic variants affect APA other than disruption of PAS motifs (S1C Fig).
As the PAS-interrupting allele (PIA) of PAS SNPs is expected to reduce cleavage efficiency at its p(A) site, we anticipated that the pA-QTL PIAs will be associated with lower pAUI (reflecting 3’UTR lengthening) (Fig 2A). However, unexpectedly, we also identified PAS pA-QTLs whose PIA was associated with increased pAUI (Fig 4A and 4B). Interestingly, in some of these cases, we observed that PIA’s disruption of the cleavage at the corresponding p(A) site was associated with increased usage of an upstream (proximal) p(A) site (resulting in an increased pAUI and 3’UTR shortening). Yet, overall, the cases in which the PAI showed the expected 3’UTR lengthening effect were markedly more prevalent (84%-94%; Fig 4C). Furthermore, remarkably, for PAS pA-QTLs that were detected in multiple tissues, the effect of their PIA on 3’UTR length was highly consistent across all tissues (Fig 4D).
Colocalization of PAS pA-QTL and eQTL signals
APA can impact gene expression via regulation of transcript stability. Therefore, to test for possible functional effects of PAS SNPs on gene expression (e.g., through regulation of transcript stability), we examined overlaps between PAS pA-QTL signals detected by our analysis and eQTL signals identified by GTEx (Fig 5A). Overall, of the 330 PAS pA-QTLs detected by our analysis, 133 overlapped an eQTL in at least one tissue. However, given the high abundance of eQTLs in the human genome, a simple intersection between PAS pA-QTLs and eQTLs might result in many false hypotheses on the causal effect of PAS SNPs on gene expression [31]. Therefore, for loci in which a PAS pA-QTL intersected an eQTL, we used eCAVAIR [32] to seek further support for these two signals tagging the same causal variant (Methods; S2 Fig). Of the PAS pA-QTLs that overlapped an eQTL, 51 showed a high colocalization in at least one tissue (S3 Table).
Next, we examined a possible association between the effect of PIAs on 3’UTR length (lengthening or shortening) and gene expression (increased or decreased expression). As 3’UTR cis-regulatory elements mostly have destabilizing effects (e.g., ARE elements, microRNA binding sites), we expected that PIAs associated with 3’UTR lengthening will consequently be mainly associated with decreased expression level (Fig 5A). Yet, we detected few uncommon cases in which 3’UTR lengthening effect of a PIA was associated with increased gene expression level (see Fig 5B for one example). Yet overall, in line with our expectations, in all tissues 3’UTR lengthening was significantly associated with decreased expression than with increased one (Fig 5C). Notably, both patterns of association with expression level were highly consistent across different tissues (Fig 5D). For example, the PIA of the PAS SNP in the 3’UTR of UTP4 showed significant association with 3’UTR lengthening and decreased expression level in all twelve tissues, and the PIA of the PAS SNP in the 3’UTR of PPP2R1B showed a consistent association with 3’UTR lengthening and increased expression levels in six different tissues (Fig 5D). PIAs linked with 3’UTR shortening were not prevalent enough to allow robust statistical testing of their association with increased or decreased expression (Fig 5D).
Colocalization of PAS pA-QTL and GWAS signals
Next, we sought possible links between the PAS pA-QTLs we identified and human phenotypes by intersecting these variants with GWAS risk SNPs (Methods). We analyzed GWAS summary statistics from 124 studies (S4 Table). In total, 104 PAS pA-QTLs overlapped at least one GWAS SNP (70 of which had GWAS lead SNP with p-value < 5e-8 and the rest had 5e-8 ≤ p-value < 1e-5), in 237 (158 when considering only GWAS lead SNP with p-value < 5e-8) PAS pA-QTL-GWAS SNP pairs. However, similar to eQTLs, given the high abundance of GWAS SNPs, pA-QTLs and GWAS SNPs can simply co-occur by chance [31]. Therefore, here too, we applied eCAVIAR to gain higher confidence for the potential causal effect of the PAS pA-QTLs on the complex traits. Overall, 78 (40) PAS pA-QTL and GWAS signals (from 37 (18) distinct PAS pA-QTLs) showed marked colocalization (S3 Fig and S3 Table). Nine (six) of these PAS pA-QTLs in nine (six) distinct genes were also colocalized with eQTL signals, suggesting novel links between APA and phenotype which are mediated through modulation of gene expression. These cases included the PAS pA-QTL in the 3’UTR of BECN1, a key regulator of autophagy. This pA-QTL was colocalized both with eQTL signals for BECN1 and GWAS signals for type 2 diabetes (Fig 6A). Notably, autophagy abnormality has been recently associated with metabolic disorders, such as type 2 diabetes, and the BECN1 protein was shown to regulate insulin secretion. It was demonstrated that, in insulin-producing ß cells, excess autophagy degrades insulin-containing vesicles, resulting in decreased insulin contents and systemic glucose intolerance; whereas in insulin-responsive cells, activating autophagy decreases endoplasmic reticulum (ER) stress and improves insulin sensitivity [33]. Interestingly, the PIA of this PAS SNP is associated with 3’UTR lengthening, decreased BECN1 expression and decreased T2D risk (S3 Table). Other examples are the PAS pA-QTL in the 3’UTR of PPP2R1B that is colocalized with an eQTL for this gene and with GWAS signal of body fat distribution (Fig 6B; interestingly, PPP2R1B shows outstanding high expression in fat pad tissues (Entrez Gene page)); and the PAS pA-QTL in the 3’UTR of DIP2B that is colocalized with an eQTL for this gene and with GWAS signal of red blood cell width (Fig 6C). Of note, our analysis did not capture the known association between the PAS SNP in the 3’UTR of IRF5 and risk for SLE [25], because this locus likely contains several causal variants that act through different mechanisms, some of which have a stronger effect on IRF5 expression and SLE risk than the PAS pA-QTL (S4 Fig). Accordingly, this PAS pA-QTL showed colocalization with the GWAS signal only after increasing eCAVIAR’s setting for the number of putative causal variants in this locus to five (and with the eQTL signal after increasing the number of putative causal variants to three) (S5 Table).
pA-QTL analysis of 3’UTR SNPs
The above analyses were focused on SNPs that interrupt the canonical PAS signals, as those are expected to exert the strongest effect on APA. Yet, APA is regulated by additional 3’UTR cis-elements, including upstream U-rich and UGUA motifs, downstream U-rich and GU-rich elements and multiple RNA-binding sites [1]. Genetic variants that interfere with such auxiliary elements are expected too to modulate cleavage efficiency at their respective p(A) sites, albeit with weaker impact. Therefore, last, we extended our analysis and searched for 3’UTR SNPs that are associated with APA modulation without disrupting PAS elements. The pAUI, which is based on pA site annotations, allows a high-resolution quantification of the usage of its PAS pA site. To analyze the effect of other variants on APA and consequently on 3’UTR length, we turned to using the distal p(A) site usage indexes provided by the APA Atlas [34] (Methods), as indexes for the overall (relative) 3’UTR length in each RNA-seq sample. This genome-wide 3’UTR pA-QTL analysis (covering 25,293 3’UTRs from 22,903 genes) identified 27,994 pA-QTLs in 3,653 3’UTRs (3,501 genes) (and after excluding 3’UTRs containing PAS SNPs, 26,197 pA-QTLs in 3,577 3’UTRs of 3,429 genes). As anticipated, these pA-QTLs showed weaker effect sizes than the PAS pA-QTLs (Fig 7A). These additional pA-QTLs may modulate APA via multiple, direct and indirect, mechanisms. To further characterize one potential mechanism, we found 41 pA-QTLs (in 40 genes), supported by CAVIAR as likely tagging causal variants, that interfered with the GUGU motif, located within 20 nt downstream of their respective p(A) sites (Fig 7B; S6 Table). The effect sizes of these GUGU pA-QTLs were too markedly lower than those of the PAS pA-QTLs (S5 Fig). Unlike the sharp tendency for 3’UTR lengthening effect exerted by the PAS pA-QTLs (Fig 4C), as a set the GUGU-interfering pA-QTLs did not generally show a clear directional effect on 3’UTR length (Fig 7C). Yet, individually they showed a consistent effect across different tissues (Fig 7D). For example, the pA-QTL disrupting the GUGU motif in the 3’UTR of TXN was associated with 3’UTR lengthening in all twelve tissues (Fig 7D). Colocalization analysis of pA-QTL and eQTL signals supported some of these GUGU-interfering SNPs as variants underlying both associations (Fig 7D; S6 Table).
Discussion
Despite the emergence of APA as an important layer of gene regulation, potentially affecting the majority of human protein-coding genes, it remains largely unexplored, and its involvement in physiological and pathological processes is often overlooked. Here, we searched for SNPs that modulate APA. Focusing on SNPs likely to have the strongest effect, we identified 2,305 SNPs that interfere with the canonical PAS or its main variant. Seeking support for the regulatory impact of these SNPs on APA, we implemented pA-QTL tests using GTEx RNA-seq data from twelve tissues. Our analysis detected 330 PAS pA-QTLs with a significant effect in at least one tissue, which was supported by CAVIAR’s fine-mapping as tagging causal variants. As expected, for the vast majority of these pA-QTLs, the PAS-interrupting allele caused decreased cleavage at its p(A) site that resulted in 3’UTR lengthening (Fig 4C). Yet, interestingly, in ~10% of the cases, the effect of the PAS-interrupting allele was associated with 3’UTR shortening, due to elevated usage of a more proximal p(A) site (Fig 4A and 4B). This observation argues against a scanning mechanism as the mode of action of the polyadenylation machinery. The fact that reduced efficiency of a p(A) site augments cleavage and polyadenylation at an upstream p(A) site implies a thermodynamic competition between the alternative p(A) sites.
Our tests for the identification of PAS pA-QTLs were applied to each tissue separately. Yet, many of these variants showed a consistent effect across different tissues. Modern statistical methods strive to increase power by sharing information across conditions. Indeed, applying MASH, a recently developed method for joint multivariable analyses [35], to the set of PAS SNPs markedly increased the number of PAS SNPs detected as pA-QTLs across multiple tissues (S6A and S6C Fig). Nevertheless, attention should be paid not only to statistical significance but also to effect sizes. Notably, the PAS SNPs which, in the MASH analysis, had a significant effect on APA in all twelve tissues, showed high heterogeneity in the magnitude of their effect over the tissues (S6B Fig), similar to results recently observed for eQTLs [35].
APA potentially affects post-transcriptional gene regulation in multiple ways, including modulation of mRNA stability, translation efficiency, nuclear export, and cellular localization. However, two recent studies failed to detect a large effect of APA on transcript stability or translation efficiency [36, 37]. We sought an indication of the functional impact of the PAS SNPs on gene expression. To this goal, we intersected the PAS pA-QTLs identified by our analysis with GTEx eQTLs and found that 133 of them, in at least one tissue, were also detected as eQTLs of the same gene. Noting that given the high abundance of eQTLs in the human genome, such overlaps are likely to occur by chance, we further used eCAVIAR to examine colocalization of the pA-QTL and eQTL signals. This analysis indicated 51 PAS SNPs as underlying causal variants for the association with both APA and gene expression. As cis-regulatory elements embedded within 3’UTRs mainly carry destabilizing roles (e.g., miRNA binding sites, AU-rich elements (AREs)) [38], we expected that 3’UTR lengthening effects of the PAS-interrupting alleles will be mostly linked with reduced gene expression. We indeed observed such link in all twelve tissue (Fig 5C). These results indicate the regulatory impact of APA on gene expression, where shorter isoforms generated by APA are generally more stable than longer ones. Nevertheless, the impact of 3’UTR APA on transcript stability is more complicated than mere inclusion/exclusion of regulatory elements in/from the 3’UTR, since the efficiency of mRNA targeting by such elements can be also affected by their location, as was demonstrated for miRNA target sites: sites located at the start or end of the 3’UTRs are more efficient than those located in the middle [39]. Thus, APA can modulate the activity of miRNA target sites by changing their location relative to the transcript’s 3’ end [40].
Over the last decade, GWAS studies discovered thousands of SNPs associated with common diseases and traits (The GWAS catalog already reports >70k tag SNPs [41]). Yet, the mechanism of action of most of the genetic variants identified by GWAS is currently unknown. Functional interpretation is hindered by the fact that the vast majority (>90%) of these SNPs map to noncoding regions of the genome [42]. While disruption of enhancer elements regulating gene transcription emerges as the main mode of action of risk SNPs [42], marked fractions of traits’ heritability are not accounted for by SNPs that map to transcriptional regulatory elements (e.g., putative enhancers and promoters) [43]. This indicates that other modes of action mediate the impact of noncoding genetic variants on human traits. Modulation of APA can be an important additional mechanism, and in this study, we identified 104 PAS pA-QTLs that overlapped GWAS SNPs (in 237 pA-QTL-GWAS trait pairs). However, similar to eQTLs, GWAS SNPs are too highly abundant in the human genome, and thus such overlaps may occur just by chance. To gain higher confidence in the potential causal effect of the PAS pA-QTLs on the complex human traits, we applied eCAVIAR, and detected marked colocalization of the pA-QTL and GWAS signals for 78 pairs in 37 distinct 3’UTR loci. Our analysis mainly focused on PAS SNPs as they are expected to have the strongest impact on APA (and consequently, on gene expression and human traits). Nevertheless, our extended, analysis of 3’UTR pA-QTLs suggests that thousands of additional variants effect (although with weaker magnitude) gene expression and human phenotypes by APA modulation. Very recently, Yang et al. [44] carried out a similar pA-QTL analysis on 32 human cancer types using RNA-seq data from The Cancer Genome Atlas (TCGA) and their study too detected thousands of cis pA-QTLs.
Given the critical roles potentially played by APA in gene regulation and our limited understanding of how it is affected by genetic variants, our methodology and findings contribute to the initial elucidation of associations between PAS SNPs, gene expression and human phenotypes.
Methods
Identification of PAS SNPs
To identify PAS SNPs, we started with all 106,571 p(A) sites located in 3’UTRs from poly(A) DB (release 3.2) [27], and all 37,611,962 annotated SNPs from GTEX v7 [28]. We considered the 53,428 SNPs located within 40-nt upstream of an annotated 3’UTR p(A) site. Examining the sequences spanning 5-nt upstream and 5-nt downstream of these SNP, we identified 2,260 instances in which one of the alleles (reference or alternative) constitute a canonic PAS sequence (AATAAA) or its main variant (ATTAAA), while the other allele interrupts this signal (we call this allele the PAS-interrupting allele—PIA). We also detected 45 instances in which one allele constitutes the canonic PAS, while the other allele constitutes the main variant.
GTEx data
Aligned GTEx paired-end RNA-seq were obtained from dbGaP (release phs000424.v7.p2). We used SAMtools [45] and SRA tools to convert cram and SRA files, respectively, to BAM format. We used RNA-seq data from 12 tissues that have more than 130 RNA-seq samples with corresponding genotype data (S2 Table). SNP genotypes were obtained from GTEx v7 VCF file, which is based on whole-genome sequencing (we used VCF processed files from which low-quality sites and samples are filtered out by GTEx).
pA-QTL analysis of PAS SNPs
We associated each PAS SNP with its p(A) site (located within 40-nt downstream of it; if a PAS SNP had more than one annotated p(A) site within downstream 40-nt, the nearest p(A) site was taken). We split the 3’UTRs containing a PAS SNP into two segments: the common 3’UTR (cUTR; the segment upstream of the p(A) site) and the alternative 3’UTR (aUTR; the 3’UTR segment downstream of the p(A) site). 3’UTR coordinates were downloaded from the UCSC genome browser based on GENCODE hg19 release v31 [46]. Overlapping 3’UTRs of the same gene were merged using bedtools [47]. When the p(A) site of a PAS SNP was located at the edge of the 3’UTR, the entire 3’UTR was considered as the cUTR and the downstream 1,000 nt were considered as the aUTR.
We defined the pA site Usage Index (pAUI) to quantify the usage of each p(A) site associated with a PAS SNP in each RNA-seq sample: where cUTR counts is the fragment counts in the cUTR, and aUTR counts is the fragment counts in the aUTR. A pseudo count of 1 was added to avoid zeroes.
We used featureCounts [48] with a SAF file containing the coordinates of the 3’UTR segments to count the number of reads mapping to aUTRs and cUTRs in each sample. Sequenced fragments that overlapped a 3’UTR were considered in the following way: if at least one of the fragment’s reads intersected the aUTR region, the fragment was assigned to the aUTR; otherwise, the fragment was assigned to the cUTR segment. Only reads that were uniquely mapped (mapping quality of 255), aligned in proper pairs, and not marked as PCR duplicates, were counted.
We next used the pAUI levels for a pA-QTL analysis. We carried out this analysis for each tissue separately. To ensure sufficient statistical power while lowering the burden of multiple testing, for each tissue, we included in our analysis only 3’UTRs with a median RPKM ≥ 1 across samples and corresponded to genes that were included in GTEx v7 analysis (namely genes with > 0.1 TPM and ≥ 6 reads in 20% of the samples). We performed the pA-QTL analysis using FastQTL [29]. We added sequencing platform (Illumina HiSeq 2000 or HiSeq X), sex, the top three genotype principal components and PEER factors (obtained from the GTEx data portal) as covariates. For each tissue and each 3’UTR containing a PAS SNP, we examined all variants located within 10 kbp of the corresponding pA site whose minor allele frequencies ≥0.01 and with the minor allele observed in at least 10 samples. To account for testing multiple (correlated) variants within each pA site window, we followed the procedure applied by GTEx v7 [28]. Briefly, we ran FastQTL with the permutation mode to obtain empirical p-values extrapolated from a Beta distribution (FastQTL setting–permute 1000 10000). This mode reports the p-value of the most significant variant in a locus (pA site, in our analysis). These p-values were further corrected for multiple testing (due to testing pA sites over multiple 3’UTRs) using Storey’s q-value method [49]. pA sites with q-value < 0.05 were considered significantly associated with a pA-QTL.
For each significant pA locus, we applied CAVIAR [30] to define the 95% credible set, which contains the causal variant with high confidence. The inputs to CAVIAR for each set were variants’ t-statistics (calculated from the slope and slope standard error from FastQTL) and their pairwise LD (r) calculated using PLINK 1.9 (—r square). (PLINK’s binary file was created from GTEx VCF using plink—make-bed and specifying—keep allele-order). Since we suspect that in each tested pA locus the PAS SNP is the causal variant, CAVIAR was run with one causal variant mode (-c 1) and otherwise default parameters. We considered a PAS SNP as a likely causal pA-QTL (referred to as a PAS pA-QTL) if (a) it was included in the credible set of its pA locus and (b) its nominal p-value was below its pA-locus threshold calculated following GTEx’s procedure for obtaining a gene-level nominal p-value threshold based on its beta distribution model [28].
All tests were performed in R-3.5.1 and plots were made using ggplot2 R package.
Colocalization of pA-QTL and eQTL signals
For each tissue, we first intersected its GTEx v7 list of significant eQTLs (significant variant-gene pairs obtained from GTEx .signif_variant_gene_pairs.txt files) with its list of PAS pA-QTLs (requiring that these eQTLs are linked to the genes in which they are located). We then used eCAVIAR [32] to test these cases for colocalization of the eQTL and pA-QTL signals. (The strength of colocalization between two genomic signals is measured by eCAVAIR’s colocalization posterior probability (CLPP)). For each pair of a PAS pA-QTL and its overlapping eQTL, eCAVIAR analysis included variants located within +/- 10 kbps with respect to the corresponding pA site (filtered as described above for the pA-QTL analysis). eCAVIAR input per p(A) locus were the variants’ t-statistics obtained by the pA-QTL and eQTL tests and pairwise LD scores calculated as described above for CAVIAR analysis. eCAVIAR was run using the 1 causal variant mode (-c 1), and otherwise, default parameters. We followed eCAVIAR authors’ practice and considered CLPP > 0.01 as an indication for colocalization.
Colocalization of pA-QTL and GWAS signals
We used summary statistics from 124 GWAS studies (S4 Table) to find overlaps between PAS pA-QTLs and significant GWAS SNPs (p-value for a given trait < 1e-5, we also distinguish in S3 Table cases for which the lead GWAS SNP in the locus had p-value < 5e-8). Similar to the analysis of colocalization between pA-QTLs and eQTLs, here too we used eCAVIAR to seek further support for the pA-QTLs and GWAS SNPs tagging the same causal variants. For each such overlap, the input to eCAVIAR was the pA-QTL’s t-statistics and GWAS’ z-scores of variants included in both datasets and that are located within +/- 10 kbp of the respective pA site, and LDs for GTEx and GWAS calculated from GTEx VCF file and 1000 genome VCF files respectively.
As eCAVIAR requires the direction of effect and direction of LD, we aligned the GWAS z-scores such that the reference allele (non-effect allele) is the same as the GTEx reference allele. Furthermore, to align the pA-QTL eQTL LDs, we used BCF tools [50] to match the reference alleles of the genome 1000 VCFs to those of GTEx (After using BCF tools to split multiallelic sites to biallelic). We then used PLINK specifying -keep allele-order when creating PLINK binary files (this step is required as by default, PLINK 1.9 uses the major allele as the reference). eCAVIAR was run using the 1 causal variant mode (-c 1), and otherwise, default parameters.
Genome-wide pA-QTL analysis of 3’UTR SNPs
In this analysis, we used the distal p(A) site usage index provided by the APA-Atlas [34], calculated per transcript in each GTEx sample (transcripts with less than 30 reads in their last exon were removed by APA-atlas). We included in this analysis all variants located within 3’UTRs or their 2.5 kbp flanking regions (GENCODE hg19 release v31) whose minor allele frequencies ≥0.01 and with the minor allele observed in at least 10 samples. To identify pA-QTLs, each 3’UTR interval was analyzed by FastQTL, using the same procedure and the same covariates described above for the analysis of PAS SNPs. Variants (a) located in 3’UTR loci with q-value<0.05 and (b) obtained nominal p-value below the interval-specific threshold were considered significant 3’UTR pA-QTLs. CAVIAR and eCAVIAR analyses were applied to the significant variants that interrupt GUGU element as described above for the PAS pA-QTLs.
MASH
We used MASH (multivariate adaptive shrinkage) [35] as implemented by mashr. Analyzing 3’UTRs with PAS SNPs, the inputs were the matrix of effect estimates and the matrix of corresponding standard errors (slope and slope standard error calculated by FastQTL) of the set of 26,140 variants that passed the variants’ filtering criteria (see above) in all tissue. We fitted the MASH model using both data-driven and canonical covariance as recommended by the authors. To obtain the data-driven covariance, we used mashr built-in functions (get_significant_results, cov_pca). In this analysis we considered a PAS SNP as a pA-QTL PAS if (a) its local false sign rate (lsfr) was below 0.05, and (b) it was included in CAVIAR’s credible set. CAVIAR was run as described above, with FastQTL t-statistics replaced by the ratio of the posterior mean and posterior standard error obtained from MASH.
Raw data for the figures and supplemental tables of this study are available at https://github.com/ElkonLab/pA_QTLs.
Supporting information
S1 Fig [tif]
CAVIAR fine-mapping analysis.
S2 Fig [clpp]
eCAVIAR colocalization analysis of pA-QTL and eQTL signals.
S3 Fig [tif]
eCAVIAR colocalization analysis of pA-QTL and GWAS signals.
S4 Fig [tif]
Relationship between pA-QTL, eQTL and SLE GWAS signals in the 3’UTR locus of .
S5 Fig [tif]
Comparison between effect magnitudes of PAS pA-QTLs, GUGU pA-QTL and other 3’UTR pA-QTLs.
S6 Fig [left]
MASH information sharing across conditions increases statistical power for the detection of pA-QTL effects.
S1 Table [xlsx]
3’UTR SNPs interrupting a canonical PAS (AATAAA) or its main variant (ATTAAA).
S2 Table [xlsx]
GTEx tissues analyzed in our study.
S3 Table [xlsx]
PAS pA-QTLs and their association with eQTLs and GWAS SNPs.
S4 Table [xlsx]
GWAS summary statistics datasets analyzed in our study.
S5 Table [xlsx]
SLE IRF5 eCaviar analysis results.
S6 Table [xlsx]
GUGU-interfering variants and their association with eQTLs.
Zdroje
1. Tian B, Manley JL: Alternative polyadenylation of mRNA precursors. Nat Rev Mol Cell Biol 2017, 18:18–30. doi: 10.1038/nrm.2016.116 27677860
2. Tian B, Hu J, Zhang H, Lutz CS: A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res 2005, 33:201–212. doi: 10.1093/nar/gki158 15647503
3. Cheng Y, Miura RM, Tian B: Prediction of mRNA polyadenylation sites by support vector machine. Bioinformatics 2006, 22:2320–2325. doi: 10.1093/bioinformatics/btl394 16870936
4. Derti A, Garrett-Engele P, Macisaac KD, Stevens RC, Sriram S, Chen R, Rohl CA, Johnson JM, Babak T: A quantitative atlas of polyadenylation in five mammals. Genome Res 2012, 22:1173–1183. doi: 10.1101/gr.132563.111 22454233
5. Hoque M, Ji Z, Zheng D, Luo W, Li W, You B, Park JY, Yehia G, Tian B: Analysis of alternative cleavage and polyadenylation by 3’ region extraction and deep sequencing. Nat Methods 2013, 10:133–139. doi: 10.1038/nmeth.2288 23241633
6. Gruber AJ, Zavolan M: Alternative cleavage and polyadenylation in health and disease. Nat Rev Genet 2019, 20:599–614. doi: 10.1038/s41576-019-0145-z 31267064
7. Elkon R, Ugalde AP, Agami R: Alternative cleavage and polyadenylation: extent, regulation and function. Nat Rev Genet 2013, 14:496–506. doi: 10.1038/nrg3482 23774734
8. Fabian MR, Sonenberg N, Filipowicz W: Regulation of mRNA translation and stability by microRNAs. Annu Rev Biochem 2010, 79:351–379. doi: 10.1146/annurev-biochem-060308-103103 20533884
9. Andreassi C, Riccio A: To localize or not to localize: mRNA fate is in 3’UTR ends. Trends Cell Biol 2009, 19:465–474. doi: 10.1016/j.tcb.2009.06.001 19716303
10. Ji Z, Lee JY, Pan Z, Jiang B, Tian B: Progressive lengthening of 3’ untranslated regions of mRNAs by alternative polyadenylation during mouse embryonic development. Proc Natl Acad Sci U S A 2009, 106:7028–7033. doi: 10.1073/pnas.0900028106 19372383
11. Ji Z, Tian B: Reprogramming of 3’ untranslated regions of mRNAs by alternative polyadenylation in generation of pluripotent stem cells from different cell types. PLoS One 2009, 4:e8419. doi: 10.1371/journal.pone.0008419 20037631
12. Liu D, Brockman JM, Dass B, Hutchins LN, Singh P, McCarrey JR, MacDonald CC, Graber JH: Systematic variation in mRNA 3’-processing signals during mouse spermatogenesis. Nucleic Acids Res 2007, 35:234–246. doi: 10.1093/nar/gkl919 17158511
13. Sartini BL, Wang H, Wang W, Millette CF, Kilpatrick DL: Pre-messenger RNA cleavage factor I (CFIm): potential role in alternative polyadenylation during spermatogenesis. Biol Reprod 2008, 78:472–482. doi: 10.1095/biolreprod.107.064774 18032416
14. Li W, Park JY, Zheng D, Hoque M, Yehia G, Tian B: Alternative cleavage and polyadenylation in spermatogenesis connects chromatin regulation with post-transcriptional control. BMC Biol 2016, 14:6. doi: 10.1186/s12915-016-0229-6 26801249
15. Zhang H, Lee JY, Tian B: Biased alternative polyadenylation in human tissues. Genome Biol 2005, 6:R100. doi: 10.1186/gb-2005-6-12-r100 16356263
16. Miura P, Shenker S, Andreu-Agullo C, Westholm JO, Lai EC: Widespread and extensive lengthening of 3’ UTRs in the mammalian brain. Genome Res 2013, 23:812–825. doi: 10.1101/gr.146886.112 23520388
17. Shulman ED, Elkon R: Cell-type-specific analysis of alternative polyadenylation using single-cell transcriptomics data. Nucleic Acids Res 2019.
18. Sandberg R, Neilson JR, Sarma A, Sharp PA, Burge CB: Proliferating cells express mRNAs with shortened 3’ untranslated regions and fewer microRNA target sites. Science 2008, 320:1643–1647. doi: 10.1126/science.1155390 18566288
19. Mayr C, Bartel DP: Widespread shortening of 3’UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 2009, 138:673–684. doi: 10.1016/j.cell.2009.06.016 19703394
20. Xia Z, Donehower LA, Cooper TA, Neilson JR, Wheeler DA, Wagner EJ, Li W: Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3’-UTR landscape across seven tumour types. Nat Commun 2014, 5:5274. doi: 10.1038/ncomms6274 25409906
21. Higgs DR, Goodbourn SE, Lamb J, Clegg JB, Weatherall DJ, Proudfoot NJ: Alpha-thalassaemia caused by a polyadenylation signal mutation. Nature 1983, 306:398–400. doi: 10.1038/306398a0 6646217
22. Orkin SH, Cheng TC, Antonarakis SE, Kazazian HH Jr.: Thalassemia due to a mutation in the cleavage-polyadenylation signal of the human beta-globin gene. EMBO J 1985, 4:453–456. 4018033
23. Bennett CL, Brunkow ME, Ramsdell F, O’Briant KC, Zhu Q, Fuleihan RL, Shigeoka AO, Ochs HD, Chance PF: A rare polyadenylation signal mutation of the FOXP3 gene (AAUAAA—>AAUGAA) leads to the IPEX syndrome. Immunogenetics 2001, 53:435–439. doi: 10.1007/s002510100358 11685453
24. Danckwardt S, Hentze MW, Kulozik AE: 3’ end mRNA processing: molecular mechanisms and implications for health and disease. EMBO J 2008, 27:482–498. doi: 10.1038/sj.emboj.7601932 18256699
25. Graham RR, Kyogoku C, Sigurdsson S, Vlasova IA, Davies LR, Baechler EC, Plenge RM, Koeuth T, Ortmann WA, Hom G, et al: Three functional variants of IFN regulatory factor 5 (IRF5) define risk and protective haplotypes for human lupus. Proc Natl Acad Sci U S A 2007, 104:6758–6763. doi: 10.1073/pnas.0701266104 17412832
26. Prasad MK, Bhalla K, Pan ZH, O’Connell JR, Weder AB, Chakravarti A, Tian B, Chang YP: A polymorphic 3’UTR element in ATP1B1 regulates alternative polyadenylation and is associated with blood pressure. PLoS One 2013, 8:e76290. doi: 10.1371/journal.pone.0076290 24098465
27. Wang R, Zheng D, Yehia G, Tian B: A compendium of conserved cleavage and polyadenylation events in mammalian genes. Genome Res 2018, 28:1427–1441. doi: 10.1101/gr.237826.118 30143597
28. Consortium GT, Laboratory DA, Coordinating Center -Analysis Working G, Statistical Methods groups-Analysis Working G, Enhancing Gg, Fund NIHC, Nih/Nci, Nih/Nhgri, Nih/Nimh, Nih/Nida, et al: Genetic effects on gene expression across human tissues. Nature 2017, 550:204–213. doi: 10.1038/nature24277 29022597
29. Ongen H, Buil A, Brown AA, Dermitzakis ET, Delaneau O: Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics 2016, 32:1479–1485. doi: 10.1093/bioinformatics/btv722 26708335
30. Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E: Identifying causal variants at loci with multiple signals of association. Genetics 2014, 198:497–508. doi: 10.1534/genetics.114.167908 25104515
31. Liu B, Gloudemans MJ, Rao AS, Ingelsson E, Montgomery SB: Abundant associations with gene expression complicate GWAS follow-up. Nat Genet 2019, 51:768–769. doi: 10.1038/s41588-019-0404-0 31043754
32. Hormozdiari F, van de Bunt M, Segre AV, Li X, Joo JWJ, Bilow M, Sul JH, Sankararaman S, Pasaniuc B, Eskin E: Colocalization of GWAS and eQTL Signals Detects Target Genes. Am J Hum Genet 2016, 99:1245–1260. doi: 10.1016/j.ajhg.2016.10.003 27866706
33. Kuramoto K, He C: The BECN1-BCL2 complex regulates insulin secretion and storage in mice. Autophagy 2018, 14:2026–2028. doi: 10.1080/15548627.2018.1502566 30081744
34. Hong W, Ruan H, Zhang Z, Ye Y, Liu Y, Li S, Jing Y, Zhang H, Diao L, Liang H, Han L: APAatlas: decoding alternative polyadenylation across human tissues. Nucleic Acids Res 2020, 48:D34–D39. doi: 10.1093/nar/gkz876 31586392
35. Urbut SM, Wang G, Carbonetto P, Stephens M: Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. Nat Genet 2019, 51:187–195. doi: 10.1038/s41588-018-0268-8 30478440
36. Spies N, Burge CB, Bartel DP: 3’ UTR-isoform choice has limited influence on the stability and translational efficiency of most mRNAs in mouse fibroblasts. Genome Res 2013, 23:2078–2090. doi: 10.1101/gr.156919.113 24072873
37. Gruber AR, Martin G, Muller P, Schmidt A, Gruber AJ, Gumienny R, Mittal N, Jayachandran R, Pieters J, Keller W, et al: Global 3’ UTR shortening has a limited effect on protein abundance in proliferating T cells. Nat Commun 2014, 5:5465. doi: 10.1038/ncomms6465 25413384
38. Guhaniyogi J, Brewer G: Regulation of mRNA stability in mammalian cells. Gene 2001, 265:11–23. doi: 10.1016/s0378-1119(01)00350-x 11255003
39. Nam JW, Rissland OS, Koppstein D, Abreu-Goodger C, Jan CH, Agarwal V, Yildirim MA, Rodriguez A, DP: Global analyses of the effect of different cellular contexts on microRNA targeting. Mol Cell 2014, 53:1031–1043. doi: 10.1016/j.molcel.2014.02.013 24631284
40. Hoffman Y, Bublik DR, Ugalde AP, Elkon R, Biniashvili T, Agami R, Oren M, Pilpel Y: 3’UTR Shortening Potentiates MicroRNA-Based Repression of Pro-differentiation Genes in Proliferating Human Cells. PLoS Genet 2016, 12:e1005879. doi: 10.1371/journal.pgen.1005879 26908102
41. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, Junkins H, McMahon A, Milano A, Morales J, et al: The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res 2017, 45:D896–D901. doi: 10.1093/nar/gkw1133 27899670
42. Elkon R, Agami R: Characterization of noncoding regulatory DNA in the human genome. Nat Biotechnol 2017, 35:732–746. doi: 10.1038/nbt.3863 28787426
43. Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh PR, Anttila V, Xu H, Zang C, Farh K, et al: Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet 2015, 47:1228–1235. doi: 10.1038/ng.3404 26414678
44. Yang Y, Zhang Q, Miao YR, Yang J, Yang W, Yu F, Wang D, Guo AY, Gong J: SNP2APA: a database for evaluating effects of genetic variants on alternative polyadenylation in human cancers. Nucleic Acids Res 2019.
45. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25:2078–2079. doi: 10.1093/bioinformatics/btp352 19505943
46. Frankish A, Diekhans M, Ferreira AM, Johnson R, Jungreis I, Loveland J, Mudge JM, Sisu C, Wright J, Armstrong J, et al: GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res 2019, 47:D766–D773. doi: 10.1093/nar/gky955 30357393
47. Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26:841–842. doi: 10.1093/bioinformatics/btq033 20110278
48. Liao Y, Smyth GK, Shi W: featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 2014, 30:923–930. doi: 10.1093/bioinformatics/btt656 24227677
49. Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 2003, 100:9440–9445. doi: 10.1073/pnas.1530509100 12883005
50. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 2011, 27:2987–2993. doi: 10.1093/bioinformatics/btr509 21903627
Článek vyšel v časopise
PLOS Genetics
2020 Číslo 8
- Antibiotika na nachlazení nezabírají! Jak můžeme zpomalit šíření rezistence?
- Jak si poradit s antibiotiky v magistraliter přípravě
- Ibuprofen jako alternativa antibiotik při léčbě infekcí močových cest
- FDA varuje před selfmonitoringem cukru pomocí chytrých hodinek. Jak je to v Česku?
- Měli bychom postcovidový syndrom léčit antidepresivy?
Nejčtenější v tomto čísle
- Genomic imprinting: An epigenetic regulatory system
- Uptake of exogenous serine is important to maintain sphingolipid homeostasis in Saccharomyces cerevisiae
- A human-specific VNTR in the TRIB3 promoter causes gene expression variation between individuals
- Immediate activation of chemosensory neuron gene expression by bacterial metabolites is selectively induced by distinct cyclic GMP-dependent pathways in Caenorhabditis elegans