Quantitative Genetics of CTCF Binding Reveal Local Sequence Effects and Different Modes of X-Chromosome Association

Download PDF České info

We have systematically measured the effect of normal genetic variation present in a human population on the binding of a specific chromatin protein (CTCF) to DNA by measuring its binding in 51 human cell lines. We observed a large number of changes in protein binding that we can confidently attribute to genetic effects. The corresponding genetic changes are often clustered around the binding motif for CTCF, but only a minority are actually within the motif. Unexpectedly, we also find that at most binding sites on the X chromosome, CTCF binding occurs equally on both the X chromosomes in females at the same level as on the single X chromosome in males. This finding suggests that in general, CTCF binding is not subject to global dosage compensation, the process which equalizes gene expression levels from the two female X chromosomes and the single male X.

Published in the journal: . PLoS Genet 10(11): e32767. doi:10.1371/journal.pgen.1004798
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1004798

Summary

Introduction

A major challenge in human genetics is to understand the mechanisms that link variation in genomic sequence to phenotypes of interest, including disease. Since 2005, a growing number of genome-wide association studies (GWAS) have associated both disease and normal phenotypes with over 9,800 single nucleotide polymorphisms (SNPs) [1]. Association studies can identify either causative variants or SNPs in linkage disequilibrium (LD) with the causative variant. Considerable effort has been invested in identifying potential causative variants, because this is essential to understanding the mechanistic route from the change in genomic sequence to final phenotype. The majority of the loci that have been found are not in strong linkage disequilibrium with a protein coding variant, suggesting that a change in a non-protein coding DNA sequence is often responsible for the phenotypic effect [2].

One route to finding intermediates between genotype and whole organism phenotype is to study the effect of genetic variants on gene regulation. New technologies such as microarrays and RNA sequencing (RNA-seq) have enabled quantification of transcript levels for every gene in a genome. Similarly, genome wide measurements of transcription factor occupancy and chromatin structure via chromatin immunoprecipitation followed by sequencing (ChIP-seq) [3] and DNase I hypersensitivity assays [4]–[7] have made it possible to quantify the state of upstream activities important for regulating transcription. Using DNase I hypersensitivity and binding assays for the CTCF transcription factor on two family trios with known genome sequences, we showed that allele-specific binding patterns consistent with strong genetic effects could be readily measured at heterozygous sites [8]. Other studies have shown allele specific binding of RNA polymerase and NF-κB binding measured across a small number of individuals [9], or of a wider range of transcription factors in a single cell line [10]. Similarly, differences between mouse strains in binding of PU-1 and CEBP/α at enhancer regions correlate with sequence differences and adjacent gene expression [11]. Intriguingly, some sites with prominent SNPs in the binding motifs of CTCF did not show a genetic effect in a study of its binding across an extended family [12]. Reciprocally, differences in transcriptor factor binding were seen between closely related species even where there was no sequence difference in the binding region [13].

In order to examine these phenomena further, and infer potential causative connections to disease GWAS results, we need to identify specific cases where a genetic variant affects a binding site. To do this we can use a genetic association study, as in GWAS, that searches for statistical association of genetic variants to quantitative measurements taken across samples. The variants with statistically significant association are known as quantitative trait loci (QTLs). When applied to transcript expression levels as the measurements on 60 or more samples, this approach has identified thousands of expression quantitative trait loci (eQTLs) [14]–[16]. A QTL study of human open chromatin [17] found 8,902 DNase I hypersensitivity sites that were correlated with genetic variants. However, there are currently no systematic association studies of how genetic variation in human populations affects the binding pattern of a specific transcription factor. Here we carry out such a study.

To identify transcription factor binding QTLs, we measured the binding of CTCF across a panel of cell lines. CTCF is a highly conserved multifunctional protein that serves as both a transcription factor as well an insulator binding protein, preventing interactions between enhancers and promoters and demarcating chromatin domains. Working with cohesin, CTCF can also mediate chromosomal looping interactions, and is involved in imprinting as well as X-inactivation (see [18], [19] for reviews). There have been extensive locus specific studies [20]–[26] and specific genome wide screens [27]–[30] demonstrating the different roles of CTCF in different circumstances. Studies by ourselves and others have shown the extent of genetic effects on CTCF binding in families [8], [12], although specific loci underlying these effects have not been identified.

We used ChIP-seq to measure CTCF binding in 51 lymphoblastoid cell lines (LCLs) from the HapMap CEU population, each of which had already been sequenced as part of the 1000 Genomes Project [2] and had been subjected to RNA-seq analysis [31]. Our data and analysis identified thousands of CTCF binding QTLs across the human genome. These data, together with the available full genome sequence of the cell lines, allowed us to explore parameters of genetic effects on protein-DNA binding. For example, we defined the relationship of the QTL location to the TF binding motif, estimated the relative impact of substitutions and insertions/deletions (indels), and measured whether allele-specific differences are indicative of population-wide variation.

Furthermore, our study revealed a previously uncharacterized mode of CTCF binding on the X chromosome. In human females (XX), one X chromosome is randomly inactivated and does not express most protein coding RNAs (reviewed in [32]). Thus for most X chromosome genes, both male and female cells have just one active locus, resulting in dosage compensation between the two sexes. The X-inactivation process requires expression of the non-coding RNA Xist from the inactive X. When we looked at CTCF binding on the X chromosome across our samples, we observed three distinct classes of CTCF binding sites. One major class was sensitive to X inactivation such that the active X showed stronger binding. Another class showed similar binding by CTCF on both X chromosomes, and the third, minor class of sites exhibited female specific binding.

Results

Analysis of CTCF binding in 51 genotyped individuals reveals thousands of binding QTLs

We performed ChIP-seq on extracted chromatin from genotyped LCLs as previously described [33] except that we sequenced the DNA fragments from both ends (Figure 1) (Materials and Methods). We quantified binding to binding regions similarly to previous work [33] but pooled all the samples and identified a composite set of binding regions with detectable CTCF binding at low threshold. We then counted the sequence fragments that overlap each binding region in each individual, and normalised the signal to correct for systematic biases as in Degner et al [17]. We discarded binding regions that showed very little inter individual variance or had only one or two individuals with significant binding scores. Overall, our normalized data showed good consistency across all 51 individuals, as well as variation in signal sufficient to motivate QTL analysis (Figure 1B).

To measure the variance due to growth differences between the cells, we grew two individual cell lines as four independent cultures started on four consecutive days. There was higher correlation between these biological replicates from the same individual than between samples from different individuals, although all data sets were modestly correlated as expected for CTCF ChIP-seq (Figure S1). We next examined the data to see whether there were any systematic biases between samples. A principal component analysis identified some systematic variance, with a particularly strong first component (24.1%, Figure S2) that on investigation was correlated to known experimental batches. We therefore removed the first principal component, significantly improving the recovery of QTLs (Figure S3, Methods). We used the resulting normalised adjusted binding intensity (NABI) for subsequent analyses.

To discover QTLs, we looked for correlations between the NABI measures and SNPs and small biallelic insertion or deletion (indel) variants within 50 kb of the relevant binding region, using a linear model (Table 1, Example in Figure 2A; see Methods). As expected, the majority of variants do not have a significant association with variation in CTCF binding, with the linear model P-value distribution following the expected distribution (>95% of tests, fraction of the overlap between the black line and red line, Figure 2B). When samples are permuted, the distribution of the test statistic falls on the expected line (see Figure S4). Using a non-parametric statistic we saw similar P values (Figure S5). Using a Bonferroni adjusted threshold of P<3.8E-9 (See details on association testing in Methods) we find 509 binding regions with significant QTLs. Using a more liberal False Discovery Rate (FDR) [34] approach to take advantage of the smaller number of effectively independent tests occurring in these limited cis-regions, we discovered 1,837 binding regions (3% of total binding regions) with at least one significant variant at the 1% FDR level; relaxing the threshold to 10% FDR we discover 6,747 binding regions (12% of the total) (Table 1).

**Fig. 2. Overall properties of CTCF QTLs.**

**Tab. 1. Summary statistics of the CTCF QTL scan.**

We chose to focus further analysis on the 1% FDR threshold as this provided ample QTLs from which to derive insights. We only considered one association per binding region, because the small number of samples meant that there was insufficient power for a conditional analysis for secondary associations in almost all cases. Within this set of associations, the genetic variant accounted for a substantial fraction of the variation in CTCF binding (median R square 0.38).

We summarised the collective set of variants which might be involved in each binding region association as being the cluster of SNPs within one order of magnitude of the P-value of the lead variant. 24,534 variants were identified in at least one cluster at the 1% FDR level, 13.4 variants on average per binding region (Table 1). As expected, these variants were mainly clustered around the target binding region, and when a CTCF binding motif could be identified (1341 of the 1837 cases) and a cluster QTL variant was present in the motif, the frequency was correlated with the information content of the motif (Figure 2C), as seen previously [12]. However, only a minority of significant binding regions had a QTL candidate within the motif (433/1341), and in only a small majority of cases there was a QTL within 1 kb (747/1341), of the binding region (Table 2).

**Tab. 2. CTCF QTLs with associated variants in different distance ranges.**

We explored further the cases where there was no proximal variant in the cluster. There was not a substantial difference in genotype quality around the associated binding regions in these cases compared to binding regions with proximal effects, suggesting that there is not a large missing data problem. When considering all 1000 Genomes Project variants including those with allele frequency below 5%, in 95.5% of these cases, there was a proximal variant within 1 kb of the binding region in linkage disequilibrium (LD) with the distal lead variant, where LD was defined as the absolute value of D′>0.5. In approximately half of these cases the P-value of the proximal association either fell just outside the one order of magnitude threshold to fall in the cluster, or was just under the FDR threshold (Figure S6). In the 99 such cases where such a proximal variant was within the CTCF binding motif, the position of the variant was correlated with the information content of the position in the motif (Figure S7). Therefore a substantial fraction of the apparently distal cases appear to be explained by proximal cases. However still only a minority can be explained by variants in the binding motif.

We also conducted the analysis excluding short indels to replicate the more commonplace association analysis using only SNPs. In an indel-free analysis we would have missed QTLs in 67 binding regions entirely (∼5% of significant binding regions), and for 56 additional binding regions the closest observed explanatory SNP would have been over 1 kb away from the motif inside the peak. For these SNPs, there is usually a short indel with similar direct P-value inside the binding region. We further explored whether another cause for distal QTL effects could be due to the distal variant affecting a second neighbouring binding region, which in turn influenced the primary binding region, but there was only one case where we could find any evidence for this model (Figure S8). We additionally investigated the cases where there exist binding interactions between the QTL binding region and the neighboring region. We observed corresponding changes in histone modifications depending on the direction of the interactions between two binding regions (Figure S9, S10).

The effect size distribution with respect to allele frequency shows increased effect sizes for lower frequency SNPs, with a clear absence of large effects of common alleles (Figure S11). There is no statistical difference in effect size distribution between SNP and indel variants (Figure S11).

The dual-end sequencing of the ChIP-seq fragments provides the resolution to discover specific binding modes that influence the spatial distribution of the recovered fragments. To analyse this, we characterised ChIP-seq binding regions by metrics that summarised the extent of the peak and the position of the summit on a per individual basis, and used these additional metrics as phenotypes in a quantitative trait analysis using the methods described above. We found 25 shifts in peak shape driven by a genetic locus at the 1% FDR. Ten cases were also associated with a change in peak height. An example is shown in Figure 3, with the two homozygous genotypes showing the creation of a new associated peak, and merging of a double peak, and from visual inspection the other cases also look as if they can be explained as two CTCF peaks in close proximity, one or both of which is under cis-genetic control.

**Fig. 3. Example of CTCF peak shape QTL.**

There are 61 CTCF QTL variants that overlap with disease and trait associated variants from other studies (GWAS Catalog [1]). In particular there is a disproportionate overlap with immune system related diseases (20 variants; Chi-sq P-value 1.7E-9). This is consistent with the lymphocyte origin of LCLs, and suggests a causal pathway for CTCF binding in the molecular aetiology of the disease phenotype in at least some cases. However many of these variants fall within the MHC locus, and a full causal analysis would need to take account of the complex LD structure there.

In summary, these results are consistent with previous studies [9], [10], [12], [13] that observed substantial variation in transcription factor binding within and between species, only a minority of which could be accounted for by genetic differences in the binding site. We also found that only 25.7% of our QTLs could be explained by a genetic variant in the motif. The majority of the remainder can be explained by changes within 1 kb of the motif, consistent with observations that transcription factor binding differences between mouse strains are more likely if there are genetic differences within 200 bp of the binding site [11]. However there remain some genetic associations for which we are not able to identify any proximal candidate, suggesting that longer range influences can make some contribution to CTCF binding.

Allele-specific bias analysis of CTCF binding provides independent confirmation of QTLs

This data set represents an excellent resource to directly examine allele-specific biases in TF binding at heterozygous sites in a larger set of individuals than previous studies [8]. Allele-specific binding refers to statistically significant biases in binding to the two alleles in a diploid cell, at sites where a heterozygous polymorphism allows the two alleles to be distinguished. Allele-specific binding thus is an independent way of assessing how genetic variants at binding sites might affect binding variation. Although the two alleles at heterozygous SNPs are normally referred to as the reference or alternate allele (referring to which base is found in the reference genome sequence and which is the alternate base), here we chose to categorize the two alleles as ancestral (shared with chimp) or derived (human specific). This has two advantages. First, any residual effect of biases in aligning sequence reads to the reference allele will be minimized. Second, measuring allele-specific binding in terms of the ancestral and derived allele provides information about how evolutionary changes might affect CTCF binding.

After processing the reads, we identified allele-specific statistically sites using a binomial null model of equal occupancy of both alleles at heterozygous sites, using a 5% FDR corrected threshold (see Methods). This process identified 589 SNPs that have replicated in at least two individuals showing significant allele-specific bias. We examined the allele counts of all heterozygous individuals at these 589 SNPs. For most sites (91.5%) the allele-specific biases were consistent between individuals, confirming the predominantly genetic basis of allele-specific binding (Figure 4A). At such sites, the same ancestral or derived allele was preferred for binding across 2 or more individuals.

**Fig. 4. Properties of allele specific CTCF sites.**

However, there were 50 (8.5%) sites which showed significant but opposite allele-specific biases between two or more individuals. Six of these 50 sites could potentially be explained by virtue of being close to loci known to be subject to allelic exclusion (the Immunoglobulin heavy chain), a process that affects one allele randomly (see Discussion). One site lies in the KCNQ1 imprinted locus, where the regulatory status depends on parent of origin rather than genotype. The 46 other sites at which the allele-specific binding bias switches between individuals (Table S1) could represent new random allelic exclusion loci or imprinted sites, or could arise because the site at which we see allele specificity is incompletely linked with the causal variant [35]. We tested whether there was a SNP which specifically explained the allele specific switching site; for 28 cases this was the case. We are not able to directly test whether any of these sites could be due to imprinting because parent-of-origin information is not available for the heterozygous alleles of these individuals.

Interestingly, a significant majority (68%, P<1E-16) of the SNPs showed increased binding to the ancestral allele (Figure 4A). Alignment bias towards the reference allele has been reported before [8] and because the ancestral allele is more likely to be the reference allele, the increased binding to the ancestral allele could be the result of the alignment bias. To rule out this possibility, we analyzed the cases where the ancestral allele is the alternate allele and found that the binding bias remained towards the ancestral allele (Figure S12). Additionally, we repeated the allele-specific analysis after using a variant-aware aligner (see Methods). The results were largely identical to what we observed as described above, indicating that the preference for the ancestral allele is not a trivial outcome of any alignment bias (Figure S13).

The allele-specific signal at binding regions (intra-individual measurements) mostly correlated linearly with the QTL effect size (inter-individual measurements) (Figure 4B). There were however exceptions to this, and these were mainly cases in which there was an allele-specific signal but not inter-individual QTL. We did not observe QTLs with strong effect size in binding regions that did not show strong allele-specificity (Figure S14).

Interactions between CTCF and the X chromosome suggest novel binding modes

While exploring the correlation of between CTCF sites, we observed an unexpected behaviour of CTCF signal on the X chromosome. Strikingly, for 87% of CTCF sites on X (excluding the pseudoautosomal regions) there was a strong gender effect (P-value <0.01, Mann Whitney on gender); in nearly every case females have a significantly higher signal on average than males. The higher peak amplitude observed in females indicates, in effect, that the vast majority (87%) of CTCF sites on the X chromosome are occupied on both chromosomes. This is in contrast to the transcription of protein-coding mRNA (3% not compensated, i.e. X-inactivation escape genes), ncRNA (9%) [35] or other transcription factor occupancy as measured by DNase I (4%) (data from Degner et al [17]). We created a simple metric of the relative levels of activity, being the difference between the average male and average female signal, in each case adjusted for library depth as for the QTL analysis (Figure 5A and B). Protein coding mRNA and the majority of DNase I sites are consistent with only one active chromosome, leading to dosage compensated mRNAs(reviewed in [32]). As expected, there is a larger set of female specific ncRNAs, in particular the three XIST transcripts (Figure 5A). Using the Mann Whitney test of gender bias per site, we classified sites first as having significant bias, and then split the significant bias to cases consistent with balanced haploid behaviour, which we call “both-active” sites, and a small number of female-specific sites where there is a strong CTCF signal for females but almost no signal in males (Figure 5B). The remaining CTCF sites, which show similar levels between males and females we describe as single-active. The both-active sites and the single-active sites are evenly distributed along the chromosome (Figure 5A), and the XIST site and two clusters of female specific sites are obviously distinct from the rest.

**Fig. 5. Properties of X chromosome CTCF sites.**

We first confirmed that the single-active and both-active sites represent different modalities of CTCF binding, using intra-individual allele-specific analysis and independent DNase I data. LCLs are Epstein Barr Virus (EBV) transformed lines from a mixed B-cell population, and can be clonal or polyclonal, so that some female-derived LCLs show consistent or clonal X inactivation, whereas others have a mix of both X chromosomes being inactivated. Because we cannot assume that our LCL lines were clonal and therefore have consistent X inactivation, we first assessed the 17 female cell lines for clonal X inactivation status using heterozygous SNPs in genes known to be silenced on the inactive X [36], and selected the 13 lines with consistently skewed expression of these genes indicating consistent X inactivation (Methods). In these 13 lines, heterozygous SNPs in CTCF binding regions showed strikingly different behaviour between the single-active and both-active sites. The single-active sites showed strong allele-specific CTCF binding behaviour (similar to mRNA) whereas the both-active sites showed balanced signal over the two alleles from the very same samples (Figure 5C). In addition, we projected the DNase I data from the independent Yoruban cell lines onto the CTCF classification. For the 451 DNase I sites overlapping the CTCF sites on the X chromosomes, there was a strong concordance of this independent assay, performed on independent cell lines, with the classification of CTCF sites (Figure 5D). Both these analyses strongly support the finding that there are two major distinct types of CTCF binding sites on the X chromosome, with the both-active sites being bound on both the active and inactive X chromosome and the single-active sites being bound on only one chromosome (most likely the active X chromosome).

We then explored differences between these two classes of CTCF sites, using the ENCODE data from the GM12878 LCL [37], derived from a female individual. The majority of histone modifications associated with active chromatin (H3K4me4, H3K27ac) showed strong enrichments in the single-active class of CTCF sites but not in the both-active class, even when we excluded promoters (Figure 6A). The repressive histone mark, H3K27me3, implicated in X chromosome inactivation, is similar between both classes of sites. Interestingly both classes showed nucleosome phasing (Figure 6B) albeit stronger at the both-active sites. There is not a striking change in Cohesin co-binding, as shown by overlap with Rad21 and SMC3 (Figure S20). The mammalian conservation of the two classes of CTCF sites is high and approximately similar (62% for single-active sites overall with GERP conserved elements, and 53% for both active sites), showing that both classes have been under selection across mammalian evolution. Overall there is strong evidence for a dramatic distinction of these two classes of sites in terms of local chromatin behaviour. When we considered histone marks from a smaller set of cell lines, but with a broader set of marks we do not observe the same set of gender-biased signals except for H3K27me3, consistent with it's role in X inactivation. (Figure S21).

**Fig. 6. Chromatin behaviour around X chromosome sites.**

We then turned to the 23 female-specific sites. These sites were concentrated in two loci overlapping non-coding RNAs (X56 and X130), largely identical to sites previously identified as being involved in a repeat-specific X chromosome behaviour [38]. Although there are far fewer sites to analyse than the other classes, the female specific sites are all enriched for binding to YY1, which is known to tether XIST to the inactive X nucleation centre [39]. Horakova et al [38] explored the RNA expression of these ncRNAs in female cells; we performed fluorescence in situ hybridization (FISH) for RNA in both male and female cells. Consistent with the published results [38], we detected RNA from the active X at these loci in female cells (Figure 7A). In male cells we also detected RNA expression (despite the female specific nature of the CTCF sites, Figure 7B), suggesting that these CTCF sites are likely to be involved in a female-specific inactivation process at these loci. Using the data from Kilpinen et al, we can show that these sites are active in female lymphoblastoid cell lines, but not male (Figure S23). It is notable how few of these sites there are on the X chromosome, compared to the far more numerous single-active and both-active categories.

**Fig. 7. Expression and genomic organization of non-coding RNA genes X56 and X130.**

We then turned to the 23 female-specific sites. These sites were concentrated in two loci overlapping non-coding RNAs (X56 and X130), largely identical to sites previously identified as being involved in a repeat-specific X chromosome behaviour [38]. Although there are far fewer sites to analyse than the other classes, the female specific sites are all enriched for binding to YY1, which is known to tether XIST to the inactive X nucleation centre [39]. Horakova et al [38] explored the RNA expression of these ncRNAs in female cells; we performed fluorescence in situ hybridization (FISH) for RNA in both male and female cells. Consistent with the published results [38], we detected RNA from the active X at these loci in female cells (Figure 7A). In male cells we also detected RNA expression (despite the female specific nature of the CTCF sites, Figure 7B), suggesting that these CTCF sites are likely to be involved in a female-specific inactivation process at these loci. It is notable how few of these sites there are on the X chromosome, compared to the far more numerous single-active and both-active categories.

Discussion

This study is the first systematic association based analysis of how normal genetic variation in humans affects the binding of a sequence-specific transcription factor, where the binding is measured as a quantitative trait. The properties of the binding quantitative trait loci (QTLs) that we identified are consistent with and extend previous smaller-scale studies of how genetic variation affects CTCF binding [8], [12], as well as similar analyses of chromatin QTLs underlying DNase I hypersensitive sites [17]. We find a large number of QTLs, with the majority being within or close to the binding region, and approximately a quarter inside the bound CTCF motif. By using 1000 Genomes Project cell lines, we can be reasonably confident that we have a full catalog of common variation of which some subset are the causal variants. Using this information we could show that for a large fraction of the associations where the initial analysis suggested a distal variant more than 1 kb away, there was a plausible causal candidate also within 1 kb of the binding motif. Overall this suggests that, at least for CTCF, the substantial majority (∼75%) of common genetic variants in the region with a reasonably strong effect on its binding lie within 1 kb of the binding motif, although only a minority are actually within the motif. This clarifies previous observations that genetic variants contributing to transcription factor binding were typically not in the motif itself [9], [13] but there was enrichment nearby [11].

We see hundreds of sites showing allele-specific binding. The idea that allele-specific events have similar effects inside one cell as genotypic effects do between individuals is commonplace [40]. Here we show that these two effects are well modeled by a linear relationship (at least for this assay), though with an interesting subset of allele-specific sites that show no QTL. In contrast there are few QTL loci that overlap binding regions without an allele-specific signal.

As expected, some of the allele specific sites switch specificity between the alleles in different samples, consistent with a nearby, incompletely linked causal allele, random allelic inactivation or parent-of-origin imprinting. Many of these sites can be explained by an incompletely linked nearby locus, highlighting that the causal variant is often not co-incident with the binding region.

Finally with more confident mapping of reads from paired read ChIP-seq data we are able to show that a consistent signal towards reference alleles is in fact predominantly due to a biological effect favouring ancestral alleles (at least for the CTCF transcription factor). This suggests that base pair changes segregating in the population tend to reduce binding of existing sites (rather than create new sites), at least for CTCF, and this is consistent with CTCF motif creation occurring by non-base pair changes, e.g. repeat deposition, as suggested in Schmidt et al [41].

We were initially surprised by the strikingly different behaviour of CTCF on the X chromosome compared to gene expression. Unlike transcribed genes, a large proportion of CTCF sites behave in a similar manner on both chromosomes. This is due to the same sites being bound on both the active and inactive X chromosome in females, as shown by the distribution of CTCF signal, the corresponding change in DNaseI signal in entirely separate cell lines and the lack of allele-specific signals in heterozygote sites in this class. This suggests that there is a subset of CTCF sites on the X chromosome that is bound on both copies despite the striking large scale compaction of the inactive X. This X chromosome-wide behaviour of CTCF is a very different phenomenon to the locus-specific interaction at the Xist/Tsix locus implicated in determining which X chromosome is inactivated [42], [43].

This observation has a number of implications. It is consistent with the multi-functional nature of CTCF, which has been commented on many times before in locus-specific [20]–[26] or specific genome-wide screens [27]–[30]. In this study we only examined behaviour in lymphoblastoid lines, and there might be cell type specific differences as well. Single-active sites show histone modifications and TF co-binding consistent with involvement with regulating expression on active chromatin. In contrast, the both-active sites show far less complex histone modification, consistent with structural functions that might apply to both chromosomes. Finally although we discovered this phenomenon on the X chromosome due to how these sites interact with X chromosome inactivation, it is consistent with the different binding behaviours of CTCF seen on the autosomes, with a diversity of different histone modification patterns at different CTCF sites [37].

The female-specific CTCF sites on the X chromosome are a very distinct subset; these are placed mainly over two non-coding RNAs expressed from the active X in females and males. The simplest explanation is that CTCF binding at these sites is involved in transcription repression on the inactive female X chromosome. This catalog of CTCF QTL sites is part of a growing set of molecular assays that are being examined in outbred individuals (for example, see [12], [17], [40], [44], [45]). It provides a specific hypothesis for the 63 disease related loci which overlap these QTLs, and for future overlaps with other molecular, cellular and disease related phenotypes. The gradual unraveling of the different variant effects on different molecular behaviour will provide a growing understanding of molecular and physiological processes in health and disease.

Materials and Methods

ChIP-seq

Cells were cross-linked with 1% formaldehyde for 7 min at room temperature. Formaldehyde was deactivated by adding glycine. Chromatin from harvested cells was sonicated with a Bioruptor to an average size of 500 bp DNA. Immunoprecipitation was performed using sonicated chromatin by adding anti-CTCF antibody (Millipore 07-729). ChIP DNA was used to generate a ChIP-seq library according to the standard Illumina protocol. The library was then sequenced using the Illumina HighSeq platform in 50 bp paired end reads. On average ∼85.54M reads were produced per sample. Sequence lanes were assessed for multiple quality metrics including total yield, read quality, mapping quality, GC content distribution and duplication rate. All sequencing reads were aligned to the human reference sequence (GRCh37) using BWA v0.5.9-r16 [46] using default parameter settings. Duplicate reads were marked by the “MarkDuplicates” function of the software Picard (v1.47 http://picard.sourceforge.net/) and removed. We applied a stringent filter by removing all the reads with MAQ quality score below 30, improperly paired (with 0x2 flag set in the BAM format), or with mate pairs more than 1 kb apart were removed. For allele specific analysis, we further performed local realignment using a variant-aware aligner glia (https://github.com/ekg/glia), which aligns reads against paths in a variant graph built by combining the reference sequence and known variants.

Data processing

Genotypes

Our 51 samples consist of 35 individuals present in the 1000 Genomes Phase 1 release (v3 20101123) [2], 11 individuals in the 1000 Genomes Pilot, 2 individuals in 1000 Genomes high coverage Trio (NA12891 and NA12892) and 3 individuals in the HapMap III [47]. The eleven 1000 Genomes Pilot samples have low coverage. We calculated the genotype likelihood for each of the Phase 1 sites using samtools [48] and then performed imputation using BEAGLE [49] and IMPUTE2 [50] with the 1000 Genomes Phase 1 data as a reference panel. Using Illumina Omni 2.5M SNP array genotypes (available from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120131_omni_genotypes_and_intensities/) as a validation set, we obtained good accuracy from this procedure with a mean non-reference discordance rate of 2.33% and average genotype dosage R² of 0.956. We also imputed the three HapMap III samples, using their genotype data on the Omni 2.5M array as the imputation panel and the 1000 Genome Phase 1 as the reference panel. We then integrated data from each source and obtained a consolidated genotype set for all 51 individuals. For association mapping, we filtered variants by requiring >5% minor allele frequency, P value for Hardy-Weinberg Equilibrium (HWE)>1E-4 and position within 50 kb of the binding region being mapped. Finally, 4,687,317 variants entered analysis, with 4,250,881 SNPs and 436,436 INDELs.

The 1000 Genome Phase 1 release gives a comprehensive ascertainment of the genetic variants. However, it is still possible that some variants private to this study cohort are yet to be found. To address this concern, we performed variant calling for the CTCF binding regions using ChIP-seq data. The calling was done by using samtools mpileup with parameters “-DV -C50 -q 30 -Q 30 -d 10000 –u -l $qtl_regions -b $bam_list -f $reference”, followed by BCF tools with parameter “-t $qtl_regions -mv”. We filtered on the quality of the calling by keeping only variants with QUAL score greater than 20. We also kept only the variants that are private to the new call set and are absent in the 1000 genomes phase1. In the end, we obtained 4,756 variants are within binding regions with 2,282 SNPs and 2,474 INDELs. It is a small quantity compared to the variant set of the 1000 Genome Phase1 release. This set is also enriched for INDELs (52%). When we conducted the same association scan using these variants, we discovered 55 QTL binding regions associated with 60 variants, out of which only 8 are new. We also repeated the same analysis but with a lower threshold for the mapping quality, the results are similar with very marginal increase in findings. Thus the effect of this additional variant set is minimum in our QTL scan.

Binding region calling

We performed binding region identification using a Parzen kernel density window algorithm described previously [18], [51]. This procedure was applied to both experimental and Input datasets after combining lanes and replicates into cell-line sample sets. Local maxima of these Parzen scores were used to define binding peak positions, and the interquartile range of the kernel density profile was used to determine the corresponding binding site of highest read density. The resulting set of candidate CTCF binding sites was then subjected to input correction, filtering for copy number artifacts, and determination of statistical significance.

First, in order to normalize for background represented by the Input control, each binding site was paired with the corresponding Input site with the highest read count within 200 bp. A binomial P-value was computed for each binding site under the null hypothesis that ChIP and Input reads were equally likely. The ratio of total ChIP to Input reads for each sample was used to normalize for differences in sequencing depth before calculating the binomial P-value, with the library having higher sequencing depth always scaled downward. Binding sites falling in previously defined genomic regions with aberrantly high signal due to copy number differences were discarded [5]. Input-dominated binding sites were also discarded, retaining only sites where the sequencing-depth-scaled ChIP read count exceeded Input.

The resulting set of filtered peak P-values was subjected to multiple hypothesis testing using the Benjamini-Hochberg method [52]. Next, binding regions for the cell lines at various significance levels were merged using bedtools v2.17.0 [53] in such a way as to preserve the set of calling cell lines (bedtools merge -nms -scores collapse -n). We employed several metrics in order to determine an appropriate significance cutoff, including the relationship between binding region count and P-value (Figure S15) and the number of calling cell lines for each binding region (Figure S16). binding regions with BH-adjusted P-value ≤1E-5 were initially retained as significant (n = 127,351), as that value appeared to be the inflection point in the binding region versus P-value curve and had the largest reduction in one-caller binding regions.

Finally, in order to assess the quality of binding regions called by only one cell line, we used bedtools (bedtools intersect –c) to identify binding regions containing the extended CTCF motif (Figure S17). binding regions called by only one cell line showed a significantly lower occurrence of the CTCF motif as compared to binding regions called by two or more cell lines. Therefore, we discarded binding regions with only one calling cell line and retained the 63,753 merged binding regions at adjusted P-value 1E-5with two or more callers.

Blacklisting regions

Out of 63,753 binding regions identified, we removed 2,898 binding regions falling in repeat sequences or in the Immunoglobulin heavy chain locus or major histocompatibility complex (MHC). In detail, 2,578 binding regions lie completely within repeat sequences marked by a merged set consisting of “Repeat Masker”, “Segmental Dup” or “Simple Repeat” data sets from the table browser of the UCSC Genome Browser, 35 binding regions lie within the Immunoglobulin heavy chain locus (chr14 : 106053226–106330470) and 285 fall in the MHC region (chr6 : 28477797-33448354).

Motif word identification

We searched for the instances of CTCF motif in the discovered binding regions using the CTCF canonical 19 bp position weight matrix downloaded from the JASPAR database (http://jaspar.binf.ku.dk/). We extracted DNA sequences at the identified binding regions from human genome reference GRCh37 to construct a sequence database. The search was then performed using the software FIMO [54] in the MEME tool suite [55] using parameter –threshold 1E-4. This process identified at least one motif instance in 45,867 of our 57,428 binding regions. When calculating the overlapping between QTL variants and motifs we considered all discovered motif instances within a binding region. For the QTL variants that do not overlap any motif variants, we used the motif instance within binding region with the best matching score as an anchor for calculating the distance.

Identifying motif of other factors

We also searched for other motifs at regions nearby (+−30 bp) the lead QTL variants as well as within the QTL binding regions. We extract sequences of the regions using software fastaFromBed of the BEDTools suite [53] using reference sequence GRCh37. We then used software MEME [55] to search for motifs. The discovered motifs were then compared against known motifs using software tomtom [55] in JASPAR and UniPROBE databases. The discovered motifs are then provided to MSigDB [56] for functional annotations.

The most enriched motif was a CTCF related motif. After removing a number of low complexity motifs and cryptic repeats by manual inspection, the next most enriched motif was a G-rich motif similar to multiple SP-1 sites. There was not a strong accessory motif in these regions.

CTCF binding quantitation

With the peak profile identified above, we quantified the signal for each binding region by counting the number of sequencing fragments -⁠ read pairs. We applied stringent criteria by only counting the properly aligned read pairs with quality score at least 30 and excluding all the duplicated reads(samtools view -f 0x42 -F0x604 –q 30). We used Bedtools (v2.16.2) [53] to count the intersection between fragments and identified binding regions. This produced an N by M matrix, where N is the number of samples and M is the number of binding regions. To evaluate the variation in the ChIP experiments, we compared the correlation between replicates grown on consecutive days and the correlation between all other samples. We found a mean pairwise correlation coefficient of 0.8314±0.0006 and 0.8202±0.0099 for the replicate sets for NA12891 and NA12892, respectively, while the mean pairwise correlation coefficient between samples was 0.1719±0.0177. This suggests a good signal noise ratio in the experiment.

For the subsequent genetic analysis, we are interested in the binding regions that have good signal and also vary between individuals. The mean and variance of binding intensities are correlated by the nature of the Poisson process for the sequencing. We found a group of 4,516 binding regions (7% of the total binding regions identified) with little signal or variation -⁠ defined as binding regions mapped with fewer than 6 fragments on average per sample and SD<5.14 (Figure S18). These binding regions were excluded from further analysis.

Normalization

Previous studies [17], [31] have shown that appropriate normalization can substantially enhance genetic association signals by removing confounding non-genetic sources of variation. Potential sources of confounding variation include experimental batch effects, GC bias in sequencing library construction and latent unknown technical or biological factors that have systematic effects across large numbers of binding regions. To address these issues, we normalized the raw binding intensity using the following five step approach to generate a normalised adjusted binding intensity (NABI).

Rescale by sequence depth.

where R_i,j is the raw intensity of the ith binding region of the jth lane, and S_j is the sum of intensity across all binding regions for the jth lane. R_i,j is scaled by a factor of the proportion of mean of S across all P lanes over S_j.
Remove variance introduced by GC composition. We adjusted for GC bias in sequencing library construction by forming percentile bins for GC composition of all binding regions and normalising the binding intensities within each bin.

where i,j,k are the indices for binding region, lane, and GC bin respectively.
Merge lanes of a same individual by taking the mean. A subset of our samples were sequenced on multiple lanes and in these cases we took the mean value across lanes as the measurement of the individual.

where X_i,j is the measure from the previous step, i,j,l are indices for the binding region, lane and samples, respectively. N is the total number of samples.
Centre-scale binding intensity for each binding region. We then scaled the binding intensity for each binding region by subtracting the mean and then dividing by the standard deviation. This transforms the measures of each binding region into zero mean and unit variance, which is needed for the quantile normalization to be less affected by the different variances of different binding regions

where i,l are indices for binding region and sample.
Quantile normalize each sample data to a normal distribution. The distribution of binding intensities for each individual is complex. Previous studies have shown that quantile normalization, initially developed for normalising the microarray signals of gene expression, can assist statistical analysis by converting the distributions of each sample to a reference distribution. The linear regression model used to identify QTL in our study assumes a Gaussian distribution of binding measures within each genotype class. We therefore mapped the measures across all binding regions of each sample to the corresponding normal quantiles. This produces a matrix that is essentially a perturbation permutation of the normal quantiles

where Φ is the cumulative normal density function and M is the total number of binding regions. I is an indicator function that returns 1 if the condition is met and 0 otherwise.
Remove confounding variation by principal component analysis (PCA). The measures of binding for each individual can be confounded by a number of hidden factors due to either biological or technical factors, or both. We performed PCA and saw that the first factor explained 24.1% of the variance in the data, substantially more than later components (Figure S2). Further investigation of this component showed that it was correlated with ChIP batch date, and it was therefore removed.

Association and estimation of the false discovery rate (FDR)

We applied linear regression for association testing. For each binding region, we tested the association between the binding intensities and the genotypes of the variants that are within ±50 kb of the binding region by least-squares linear regression. We applied an additive model, where genotypes are coded as the dosages of the alternative alleles, either 0,1, or 2. The P value was obtained from a t-test of the regression coefficient (beta) against the null hypothesis of beta = 0. We estimated FDR by a Q value method [34], which establishes P<7.1E-5 as an FDR of 1%. We further filtered the associated SNPs by requiring the P value to be within 1 order of magnitude to that of the P value of the lead SNP. We report these cluster variants as associated to the target binding region. We also reported results when a more stringent Bonferroni threshold was applied. The threshold was calculated at a significant level of α = 0.05 corrected for 13,293,727 tests, which gives 3.8E-9 for the actual threshold.

Allele-specific analysis

Read counts at each allele were counted for the 5.6M SNPs within 50 kb of a binding region. Heterozygous SNPs with significant allele-specific CTCF binding were identified. In detail, we calculated a binomial P value at all heterozygous SNPs with the null hypothesis that the two allele counts are equal. We then performed multiple testing adjustment at all heterozygous SNPs that have at least 2 reads at each allele and at least 2 reads difference between the two alleles using the Benjamini & Hochberg [52] method. Significant allele-specific binding was determined with an FDR 5%.

X chromosome analysis

We analysed the gender specific CTCF binding on the X chromosome in the 27 female and 24 male LCLs. To ensure that our normalisation would not introduce any bias we used the raw CTCF binding intensities. For each of the 1,968 binding regions on the X chromosome, after blacklisted regions were removed, we assessed gender specificity by a Mann-Whitney U (MWU) test between the male and female samples. Binding regions were then classified as single-active and both-active based on the significance of the MWU test on the binding intensities. To classify the female specific binding regions we also incorporate the fold change between the average male and average female binding intensity. Similar analysis as for CTCF was performed on mRNA and ncRNA data from the Geuvadis project [35] and on DNase I [17].

To differentiate clonal and ployclonal sample, we analyzed allelic RNA expression on previously identified X inactivated genes [36] in 17 female samples. Samples where only one allele is expressed are determined to be clonal and polyclonal samples have RNA expressed from both alleles. Figure S22 shows examples of a clonal and a polyclonal sample.

For the single-active and both-active sites, we analysed the overlap of each category of sites with the ENCODE transcription factor and histone modification datasets for the female CEU lymphoblastoid cell line GM12878 [37]. To avoid bias introduced by unequal distribution of promoter sequences between the classes, we removed all binding regions that overlap with promoters identified in GM12878. For each binding region we define a partial overlap as an overlap. Signal aggregation of each of the classes of sites for histone modification and TF ChIP-seq data, and micrococcal nuclease cleavage was calculated using the ACT toolkit (http://act.gersteinlab.org/, [57]) with the parameters ‘-nbins = 50 -mbins = 0’. Only binding regions that are in the top 50% of bound sites were used. ENCODE bedGraph files for both TF and histone modifications were obtained from ftp.ebi.ac.uk: pub/databases/ensembl/encode/integration_data_jan2011/byDataType/signal/jan2011/bedgraph/and converted into signal files that are used as input for ACT.py.

Sequential RNA and DNA fluorescence in situ hybridisation (FISH)

Female and male human dermal fibroblasts cells (Invitrogen) were grown directly on Nunc Lab-Tek chamber slides, rinsed briefly using 1×PBS (PAA) and immediately fixed in 3% formaldehyde (Sigma-Aldrich) for 10 minutes, permeabilized using 0.5% Triton X-100 (BHD), 10 mM Ribonucleosidase Vanadyl complex (Biolabs) in 1×PBS (PAA) for 10 minutes, and then dehydrated through a 70%, 90% and 100% ethanol series, all at room temperature. The probes for X56 and X130 were selected according to their genomic locations reported in Horakova et al [38]. The probe for X56 consisted of the BAC clone RP11-416J22 and the fosmid G248P8472H8, the probe for X130 consisted of the BAC clone RP11-158M12, while the probe for XIST consisted of the fosmid G248P8779H11. All the clones were selected from the UCSC Genome Browser (GRCh37/hg19 assembly). Plasmid DNA was purified using the PhasePrep BAC DNA kit (Sigma-Aldrich) following manufacturer's protocol, amplified using the whole genome amplification kit (WGA2, Sigma-Aldrich) following manufacturer's recommendations. Clones were labeled using the whole genome re-amplification kit (WGA3, Sigma-Aldrich) as described before [58]. Briefly, X56 probe was labeled with Cyanine 3-dUTP (Enzo), the X130 probe was labeled with ChromaTide Texas Red-12-dUTP (Invitrogen) and the XIST probe was labeled with Green-dUTP (Abbott). For RNA-FISH, approximately 100 ng of labeled DNA from each probe and 2–4 µg of human Cot-1 DNA (Invitrogen) were ethanol precipitated, then resuspended in hybridisation buffer containing 50% formamide, 2×SSC, 10% dextran sulphate, 0.5 M phosphate buffer, pH 7.4. The probe mix was denatured at 65°C for 10 minutes before being applied onto cells on the chamber slides. Hybridisation was carried out in a 37°C incubator overnight. The post-hybridisation washes consisted of two rounds of 50% formamide/2×SSC washes followed by two additional washes in 2×SSC. All washes were done at 40°C, for 5 minutes. After detection, slides were mounted with SlowFade Gold mounting solution containing 4′,6-diamidino-2-phenylindole (Invitrogen). Images were visualised on a Zeiss AxioImager D1 fluorescent microscope. Digital image capture and processing were carried out using the SmartCapture software (Digital Scientific UK).

For the subsequent DNA-FISH, the same slides that have passed through the RNA-FISH assay described above were subject to the following treatment before denaturation in 70% formamide/2×SSC for 1.5 minutes, including one wash in 2×SSC for 5 minutes, digestion with RNase A (100 µg/ml RNase A in 2×SSC) for 30 minutes at 37°C, further digestion with 0.01% pepsin in 10 mM HCl for 5 minutes at room temperature, dehydration through an ethanol series as above and ageing on a 65°C hot plate for an hour. The X chromosome paint probe was labeled with biotin-16-dUTP (Roche). The making and denaturation of the X chromosome paint probe mix, hybridisation incubation, post-hybridisation washes and digital imaging were the same as above described, except that the biotin-labeled probes were visualised using Cy3 conjugated avidin (Sigma Aldrich).

Accession numbers

The ChIP-seq data reported in this paper have been deposited in the European Nucleotide Archive, available with accession number ERP002168. The sample information and experimental design was deposited in ArrayExpress with accession number E-ERAD-141, linked to ERP002168.

Supporting Information

Zdroje

1. HindorffLA, SethupathyP, JunkinsHA, RamosEM, MehtaJP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106 : 9362–9367.

2. AbecasisGR, AltshulerD, AutonA, BrooksLD, DurbinRM, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467 : 1061–1073.

3. BarskiA, CuddapahS, CuiK, RohTY, SchonesDE, et al. (2007) High-resolution profiling of histone methylations in the human genome. Cell 129 : 823–837.

4. HesselberthJR, ChenX, ZhangZ, SaboPJ, SandstromR, et al. (2009) Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nat Methods 6 : 283–289.

5. BoyleAP, DavisS, ShulhaHP, MeltzerP, MarguliesEH, et al. (2008) High-resolution mapping and characterization of open chromatin across the genome. Cell 132 : 311–322.

6. JohnS, SaboPJ, ThurmanRE, SungMH, BiddieSC, et al. (2011) Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nat Genet 43 : 264–268.

7. SongL, ZhangZ, GrasfederLL, BoyleAP, GiresiPG, et al. (2011) Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res 21 : 1757–1767.

8. McDaniellR, LeeBK, SongL, LiuZ, BoyleAP, et al. (2010) Heritable individual-specific and allele-specific chromatin signatures in humans. Science 328 : 235–239.

9. KasowskiM, GrubertF, HeffelfingerC, HariharanM, AsabereA, et al. (2010) Variation in transcription factor binding among humans. Science 328 : 232–235.

10. ReddyTE, GertzJ, PauliF, KuceraKS, VarleyKE, et al. (2012) Effects of sequence variation on differential allelic transcription factor occupancy and gene expression. Genome Res 22 : 860–869.

11. HeinzS, RomanoskiCE, BennerC, AllisonKA, KaikkonenMU, et al. (2013) Effect of natural genetic variation on enhancer selection and function. Nature 503 : 487–492.

12. MauranoMT, WangH, KutyavinT, StamatoyannopoulosJA (2012) Widespread site-dependent buffering of human regulatory polymorphism. PLoS Genet 8: e1002599.

13. StefflovaK, ThybertD, WilsonMD, StreeterI, AleksicJ, et al. (2013) Cooperativity and rapid evolution of cobound transcription factors in closely related mammals. Cell 154 : 530–540.

14. StrangerBE, NicaAC, ForrestMS, DimasA, BirdCP, et al. (2007) Population genomics of human gene expression. Nat Genet 39 : 1217–1224.

15. SpielmanRS, BastoneLA, BurdickJT, MorleyM, EwensWJ, et al. (2007) Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet 39 : 226–231.

16. PickrellJK, MarioniJC, PaiAA, DegnerJF, EngelhardtBE, et al. (2010) Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464 : 768–772.

17. DegnerJF, PaiAA, Pique-RegiR, VeyrierasJB, GaffneyDJ, et al. (2012) DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482 : 390–394.

18. LeeBK, IyerVR (2012) Genome-wide studies of CCCTC-binding factor (CTCF) and cohesin provide insight into chromatin structure and regulation. J Biol Chem 287 : 30906–30913.

19. MerkenschlagerM, OdomDT (2013) CTCF and cohesin: linking gene regulatory elements with their targets. Cell 152 : 1285–1297.

20. YusufzaiTM, TagamiH, NakataniY, FelsenfeldG (2004) CTCF tethers an insulator to subnuclear sites, suggesting shared insulator mechanisms across species. Mol Cell 13 : 291–298.

21. SplinterE, HeathH, KoorenJ, PalstraRJ, KlousP, et al. (2006) CTCF mediates long-range chromatin looping and local histone modification in the beta-globin locus. Genes Dev 20 : 2349–2354.

22. SopherBL, LaddPD, PinedaVV, LibbyRT, SunkinSM, et al. (2011) CTCF regulates ataxin-7 expression through promotion of a convergently transcribed, antisense noncoding RNA. Neuron 70 : 1071–1084.

23. BellAC, FelsenfeldG (2000) Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature 405 : 482–485.

24. BellAC, WestAG, FelsenfeldG (1999) The protein CTCF is required for the enhancer blocking activity of vertebrate insulators. Cell 98 : 387–396.

25. van de NobelenS, Rosa-GarridoM, LeersJ, HeathH, SoochitW, et al. (2010) CTCF regulates the local epigenetic state of ribosomal DNA repeats. Epigenetics Chromatin 3 : 19.

26. StedmanW, KangH, LinS, KissilJL, BartolomeiMS, et al. (2008) Cohesins localize with CTCF at the KSHV latency control region and at cellular c-myc and H19/Igf2 insulators. EMBO J 27 : 654–666.

27. SchmidtD, SchwaliePC, Ross-InnesCS, HurtadoA, BrownGD, et al. (2010) A CTCF-independent role for cohesin in tissue-specific transcription. Genome Res 20 : 578–588.

28. SchmidtD, WilsonMD, BallesterB, SchwaliePC, BrownGD, et al. (2010) Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science 328 : 1036–1040.

29. PhillipsJE, CorcesVG (2009) CTCF: master weaver of the genome. Cell 137 : 1194–1211.

30. CuddapahS, JothiR, SchonesDE, RohTY, CuiK, et al. (2009) Global analysis of the insulator binding protein CTCF in chromatin barrier regions reveals demarcation of active and repressive domains. Genome Res 19 : 24–32.

31. MontgomerySB, SammethM, Gutierrez-ArcelusM, LachRP, IngleC, et al. (2010) Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464 : 773–777.

32. HeardE, DistecheCM (2006) Dosage compensation in mammals: fine-tuning the expression of the X chromosome. Genes Dev 20 : 1848–1867.

33. LeeBK, BhingeAA, BattenhouseA, McDaniellRM, LiuZ, et al. (2012) Cell-type specific and combinatorial usage of diverse transcription factors revealed by genome-wide binding studies in multiple human cells. Genome Res 22 : 9–24.

34. StoreyJD, TibshiraniR (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 100 : 9440–9445.

35. LappalainenT, SammethM, FriedlanderMR, t HoenPA, MonlongJ, et al. (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501 : 506–511.

36. CarrelL, WillardHF (2005) X-inactivation profile reveals extensive variability in X-linked gene expression in females. Nature 434 : 400–404.

37. ENCODE_Project_Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489 : 57–74.

38. HorakovaAH, MoseleySC, McLaughlinCR, TremblayDC, ChadwickBP (2012) The macrosatellite DXZ4 mediates CTCF-dependent long-range intrachromosomal interactions on the human inactive X chromosome. Hum Mol Genet 21 : 4367–4377.

39. JeonY, LeeJT (2011) YY1 tethers Xist RNA to the inactive X nucleation center. Cell 146 : 119–133.

40. McVickerG, van de GeijnB, DegnerJF, CainCE, BanovichNE, et al. (2013) Identification of genetic variants that affect histone modifications in human cells. Science 342 : 747–749.

41. SchmidtD, SchwaliePC, WilsonMD, BallesterB, GoncalvesA, et al. (2012) Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell 148 : 335–348.

42. ChaoW, HuynhKD, SpencerRJ, DavidowLS, LeeJT (2002) CTCF, a candidate trans-acting factor for X-inactivation choice. Science 295 : 345–347.

43. JeonY, SarmaK, LeeJT (2012) New and Xisting regulatory mechanisms of X chromosome inactivation. Curr Opin Genet Dev 22 : 62–71.

44. KilpinenH, WaszakSM, GschwindAR, RaghavSK, WitwickiRM, et al. (2013) Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription. Science 342 : 744–747.

45. KasowskiM, Kyriazopoulou-PanagiotopoulouS, GrubertF, ZauggJB, KundajeA, et al. (2013) Extensive variation in chromatin states across humans. Science 342 : 750–752.

46. LiH, DurbinR (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25 : 1754–1760.

47. StrangerBE, MontgomerySB, DimasAS, PartsL, StegleO, et al. (2012) Patterns of cis regulatory variation in diverse human populations. PLoS Genet 8: e1002639.

48. LiH, HandsakerB, WysokerA, FennellT, RuanJ, et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25 : 2078–2079.

49. BrowningBL, YuZ (2009) Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. Am J Hum Genet 85 : 847–861.

50. HowieBN, DonnellyP, MarchiniJ (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5: e1000529.

51. ShivaswamyS, BhingeA, ZhaoY, JonesS, HirstM, et al. (2008) Dynamic remodeling of individual nucleosomes across a eukaryotic genome in response to transcriptional perturbation. PLoS Biol 6: e65.

52. BenjaminiY, HochbergY (1995) Controlling the False Discovery Rate -⁠ a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B-Methodological 57 : 289–300.

53. QuinlanAR, HallIM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26 : 841–842.

54. GrantCE, BaileyTL, NobleWS (2011) FIMO: scanning for occurrences of a given motif. Bioinformatics 27 : 1017–1018.

55. BaileyTL, BodenM, BuskeFA, FrithM, GrantCE, et al. (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37: W202–208.

56. SubramanianA, TamayoP, MoothaVK, MukherjeeS, EbertBL, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102 : 15545–15550.

57. JeeJ, RozowskyJ, YipKY, LochovskyL, BjornsonR, et al. (2011) ACT: aggregation and correlation toolbox for analyses of genome tracks. Bioinformatics 27 : 1152–1154.

58. GribbleSM, WisemanFK, ClaytonS, PrigmoreE, LangleyE, et al. (2013) Massively parallel sequencing reveals the complex structure of an irradiated human chromosome on a mouse background in the Tc1 model of Down syndrome. PLoS One 8: e60482.