Quantitative Models of the Mechanisms That Control Genome-Wide Patterns of Transcription Factor Binding during Early Development
Transcription factors that drive complex patterns of gene expression during animal development bind to thousands of genomic regions, with quantitative differences in binding across bound regions mediating their activity. While we now have tools to characterize the DNA affinities of these proteins and to precisely measure their genome-wide distribution in vivo, our understanding of the forces that determine where, when, and to what extent they bind remains primitive. Here we use a thermodynamic model of transcription factor binding to evaluate the contribution of different biophysical forces to the binding of five regulators of early embryonic anterior-posterior patterning in Drosophila melanogaster. Predictions based on DNA sequence and in vitro protein-DNA affinities alone achieve a correlation of ∼0.4 with experimental measurements of in vivo binding. Incorporating cooperativity and competition among the five factors, and accounting for spatial patterning by modeling binding in every nucleus independently, had little effect on prediction accuracy. A major source of error was the prediction of binding events that do not occur in vivo, which we hypothesized reflected reduced accessibility of chromatin. To test this, we incorporated experimental measurements of genome-wide DNA accessibility into our model, effectively restricting predicted binding to regions of open chromatin. This dramatically improved our predictions to a correlation of 0.6–0.9 for various factors across known target genes. Finally, we used our model to quantify the roles of DNA sequence, accessibility, and binding competition and cooperativity. Our results show that, in regions of open chromatin, binding can be predicted almost exclusively by the sequence specificity of individual factors, with a minimal role for protein interactions. We suggest that a combination of experimentally determined chromatin accessibility data and simple computational models of transcription factor binding may be used to predict the binding landscape of any animal transcription factor with significant precision.
Published in the journal:
. PLoS Genet 7(2): e32767. doi:10.1371/journal.pgen.1001290
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pgen.1001290
Summary
Transcription factors that drive complex patterns of gene expression during animal development bind to thousands of genomic regions, with quantitative differences in binding across bound regions mediating their activity. While we now have tools to characterize the DNA affinities of these proteins and to precisely measure their genome-wide distribution in vivo, our understanding of the forces that determine where, when, and to what extent they bind remains primitive. Here we use a thermodynamic model of transcription factor binding to evaluate the contribution of different biophysical forces to the binding of five regulators of early embryonic anterior-posterior patterning in Drosophila melanogaster. Predictions based on DNA sequence and in vitro protein-DNA affinities alone achieve a correlation of ∼0.4 with experimental measurements of in vivo binding. Incorporating cooperativity and competition among the five factors, and accounting for spatial patterning by modeling binding in every nucleus independently, had little effect on prediction accuracy. A major source of error was the prediction of binding events that do not occur in vivo, which we hypothesized reflected reduced accessibility of chromatin. To test this, we incorporated experimental measurements of genome-wide DNA accessibility into our model, effectively restricting predicted binding to regions of open chromatin. This dramatically improved our predictions to a correlation of 0.6–0.9 for various factors across known target genes. Finally, we used our model to quantify the roles of DNA sequence, accessibility, and binding competition and cooperativity. Our results show that, in regions of open chromatin, binding can be predicted almost exclusively by the sequence specificity of individual factors, with a minimal role for protein interactions. We suggest that a combination of experimentally determined chromatin accessibility data and simple computational models of transcription factor binding may be used to predict the binding landscape of any animal transcription factor with significant precision.
Introduction
In vivo crosslinking studies show that animal transcription factors each bind to many thousands of DNA regions throughout the genome (e.g. [1]–[18]). While many of the strongest binding events are at functional cis regulatory modules and are evolutionary conserved, many thousands of other genomic regions that are bound at lower levels do not appear to be functional targets [2], [13], [14], [16], [18], [19]. In addition, factors with unrelated biochemical or functional properties bind to the same genomic regions with surprisingly high frequency [2], [16], [20], while the biological specificities of factors appear to be determined in part by quantitative differences in occupancy between proteins at these commonly bound sites [2], [16], [21], [22].
It is a fundamental challenge to determine the biochemical mechanisms that direct these complex, quantitative patterns of factor occupancy. Animal transcription factors bind to short (5–12 bp) sequences of DNA that occur with high frequency throughout the genome [23], yet most occurrences of these recognition sequences are not detectably bound in vivo [2], [5], [13], [24], [25]. This discrepancy between predicted and observed factor binding has been attributed to several mechanisms that could alter the simple behavior of a factor, including: (1) competitive inhibition of binding at those DNA regions that overlap sites occupied either by other sequence specific factors [26], [27] or by nucleosomes and chromatin associated proteins [28]–[33]. (2) Direct and indirect cooperative interactions between factors bound at physically proximal sites that increase their affinity at those sites [34]–[42]. The relative influence that each of these biochemical mechanisms on the overall pattern of factor binding in vivo, however, is currently a matter of debate.
The anterior-posterior (A-P) patterning system in early Drosophila melanogaster development offers an excellent system for addressing the question of transcription factor targeting in an animal regulatory network. Prior to gastrulation, the Drosophila embryo is a syncytium of ∼6,000 nuclei distributed around its periphery [43]. Extensive genetic and molecular analyses have established that a simple regulatory cascade involving the morphogens Bicoid (BCD) and Caudal (CAD), and the gap transcription factors Hunchback (HB), Giant (GT), Kruppel (KR) and Knirps (KNI) directs the expression of numerous target genes in complex three dimensional patterns across the embryo, which in turn establish the segmental body plan in the trunk of the developing fly [44]–[46].
To analyze this network at a systems level, we have previously established a complementary set of quantitative datasets describing key aspects of the network. We have measured the widespread, overlapping in vivo binding of these and other early regulatory factors genome wide [13], [16], [18], In addition, we have characterized DNA affinities of these proteins from in vitro selection (SELEX) experiments (Berkeley Drosophila Transcription Network Project, unpublished data), measured the relative concentration of each of these transcription factors in every embryonic nucleus [47], and determined the accessibility of DNA regions throughout the genome.
Here we incorporate this series of datasets into quantitative models of transcription factor binding to measure the impact of different biophysical forces on occupancy levels. Many previous studies have used computational approaches to predict the locations of genomic regions bound by transcription factors [48]–[68]. While these studies have discovered a number of important principles, they have not sought to quantify the contribution of various factor-targeting forces in the context of the highly overlapping, widespread binding seen in animal systems. For example, many have produced qualitative predictions of which genomic regions might be occupied, and thus do not provide information on the levels of expected factor occupancy, which have been shown to be critical for relating binding patterns to biological function [2], [16], [21], [22]. In addition, while some studies have focused on determining the biochemical mechanisms targeting factors to DNA [57], [58], [60], many others have used additional biological information such as microarray expression data or DNA recognition site conservation to predict only those sites that are functional [59], [63], [65], [67], [68], and thus do not, strictly speaking, attempt to predict binding per se. Finally, when predictions of quantitative occupancy levels have been made, these were either part of more complex models of transcriptional output patterns that did not compare the binding models to any experimental in vivo binding data [66], [69]–[71], or were made to coarse in vivo binding information from yeast [58], [62].
Therefore, we have set out to establish a computational framework that predicts the levels of factor occupancy in a way that allows the effect of various proposed biophysical mechanisms influencing factor binding to be quantified. Our results support a long standing model that suggest that animal transcription factors are expressed at sufficiently high concentrations in cells that they can bind to their recognition sites in accessible parts of the genome without the aid of direct cooperative interactions with other proteins [31], [72], [73]. In this view, the dominant force in cells that modifies the intrinsic DNA specificity of transcription factors is the inhibition of DNA binding by chromatin structure. Because of the high frequency of factor recognition sites in most short accessible regions of the genome, this model explains why animal factors show such widespread, overlapping binding.
Results
Quantitative Models of Transcription Factor Binding
We developed a probabilistic framework to infer the occupancies of one or more transcription factors across any DNA sequence given the concentration of these factors and their DNA binding preferences (Figure S1). Our model is based on the formalism of generalized hidden Markov models (gHMMs; [49], [53], [55], [62], [69], ), which allow efficient integration of the different forces that may influence transcription factor binding into the model.
This class of models offers several prominent advantages. gHMMs have very few parameters and are therefore easy to optimize. And, unlike most probabilistic graphical models, they offer exact inference of posterior probabilities in linear time, using the forward-backward dynamic programming algorithm [75]. Finally, gHMMs are related to thermodynamic equilibrium models, with the ensemble of all possible configurations of bound factors viewed as a Boltzmann distribution and each configuration assigned a weight (or probability) depending on its energetic state. The probability that a factor is bound at a given location is assumed to be the fraction of configurations (weighted by their probabilities) in which it is bound [50], [55], [56], [62], [69], [76]–[80].
The Markovian property of gHMMs prevents them from considering the full context in which binding occurs, and thus they offer only an approximation of the full thermodynamic model. We overcome this limitation with a sampling procedure (see below).
In Vitro and In Vivo DNA Binding Data
To model the DNA binding affinities of the five factors considered in this study (HB, BCD, KR, GT and CAD), we used in vitro specificities (expressed as position weight matrices; PWMs) measured using SELEX-Seq by the Berkeley Drosophila Transcription Network Project (bdtnp.lbl.gov).
For in vivo binding data, we used ChIP-seq measurements of formaldehyde crosslinked HB, BCD, KR, GT and CAD from blastoderm embryos of Drosophila melanogaster (Oregon R) [18]. A range of controls establishes that these data provide a quantitative measure of the relative levels of transcription factor directly bound to different genomic DNA regions [2], [13], [16], [18], [31], [81]. In particular, in vitro controls using purified transcription factors and naked DNA, where binding at individual sites is expected to be proportional to the affinity of the factor for that site, show that relative levels of crosslinking closely correlate with relative DNA affinity [81]. “Spike in” experiments using ∼200 kb BAC DNAs show that the ChIP-seq post immunoprecipitation processing steps accurately preserve the relative enrichment levels of different genomic regions, with the enrichments of DNA samples prior to and post processing showing a correlation of 0.997 and a linear fit of slope of 1.14 (Figure S2). In vivo UV crosslinking results show that similar data are obtained when protein/protein crosslinking is absent [2], [16]. And the genomic regions identified as bound by factors in our ChIP-seq experiments are not preferentially enriched in the crosslinked DNA initially used as input to the immunoprecipitations [18].
This latter observation is particularly important, as Auerbach et al have reported that some methods of DNA isolation can introduce a bias that adversely affects ChIP-seq data [82]. In particular, they found that the sonication of intact nuclei leads to the preferential enrichment of regions of open chromatin. Although the crosslinked DNA used in our ChIP-seq experiments is sonicated only after it has been purified away from non-crosslinked proteins by buoyant density centrifugation [13], and thus is unlikely to suffer the above DNA extraction bias, we nonetheless directly compared the input DNA samples from our ChIP-seq experiments to DNase I-seq genome accessibility data for embryos at the same stage of development (Thomas et al, unpublished data). In contrast to Auerbach et al [82], we found no correlation between ChIP-seq input samples and the DNase I-seq data (mean correlation coefficient of -0.0005).
Thus, while we do not believe ChIP-seq represents an absolutely precise measure of binding to each region of the genome (discussed further below), a wide range of evidence indicates that our ChIP-seq data do not appear to suffer large systematic biases that could interfere with our modeling effort and that these data provide a sufficiently quantitative estimate of binding on which to base attempts to predict it.
Comparing Model Predictions and In Vivo Binding Data
We developed a model-based algorithm to compare the binding probabilities predicted by our gHMM model to high-resolution in vivo binding measurements [83]. Using the length distribution of the DNA fragments recovered by ChIP, we simulated the shape of the peak corresponding to a single binding event, as measured by ChIP-seq. We then used that shape to convert the single-nucleotide resolution binding probabilities into an expected ChIP-seq profile (Figure S3, Methods).
To analyze our predictions, we compiled a list of 21 known target loci of the A-P patterning system, primarily known maternal, gap, and pair rule genes [84], expressed during early stage 5 [47]. Each gene was expanded by ∼10 Kb upstream and downstream of the transcription unit to capture its known regulatory sequences. In each of the analyses presented below, we trained the model parameters to optimize the fit between the predicted and observed ChIP-seq landscapes at a set of six loci spanning over 87 Kb (croc, cnc, slp, kni, hkb, D), and evaluated the trained model on a set of 15 loci spanning over 280 Kb (prd, h, eve, cad, oc, opa, ftz, gt, hb, Kr, odd, run, fkh, tll, os). To account for long genomic regions where no binding is observed in vivo, we enhanced the train and test sets by adding three and five control loci, spanning 100 Kb and 221 Kb, respectively (see Table 1).
Quantitative Comparisons of Model Predictions to In Vivo Binding Data
We began with the simplest model – a single transcription factor binding to DNA. This required optimizing only a single parameter, P(t), for each factor, corresponding to its effective concentration in nuclei. After training, the correlation coefficients between the model's predicted binding and the training data range from poor (CAD, 0.11) to reasonable (BCD, 0.58) with a total correlation of 0.37. Correlations for the test set were similar (suggesting that the training data were not over fit), and ranged from 0.15 (GT) to 0.66 (BCD), with a total correlation of 0.36 (Table 2 and Table S1). Figure 1A shows the model's predictions and the in vivo data for several test loci.
Minimal Effect of Factor Competition in Predicting In Vivo Binding
Encouraged by results with single factors, we examined the effect of competition between the five factors on our ability to predict in vivo binding. Overlapping binding sites are known to modulate direct competition between factors for some individual cases [26], [27]. Moreover, overlapping sites are often conserved at long evolutionary distances [85], [86], suggesting an important mechanistic role for inter-factor competition. We expanded the gHMM in our model to consider all five transcription factors simultaneously in a probabilistic framework, where the different concentration of each factor t is modeled by an additional probabilistic term P(t). In the single factor model binding of one factor to a site did not affect the binding of a different factor to the same site. In this new model, however, because the total occupancy at a site cannot exceed 1, factors effectively compete for binding to overlapping target sites. While other early regulators bind many of the same regulatory regions as these five factors [16], we have not included them in our model to allow for a focused examination of the factors that are most closely implicated in functioning together and because of a lack of complete DNA binding specificity data.
To our surprise, the five-factor competitive model gave slightly less accurate predictions than its single factor counterpart. On the test data, we observed decrease in the model predictions from a total correlation of 0.36 to 0.33 (see Figure 2 and Table 2 for full data).
Expanding the Model to Three Dimensions with Single Nucleus Resolution
One potential explanation for the lack of improvement for the above competition model could have been that, because we were treating the embryo as a homogenous entity, it allowed competition between factors that are expressed together at high levels in few nuclei (e.g. Figure 3A: GT and KR, BCD and CAD, or KR and HB).
We therefore expanded our algorithm to model the binding of all factors in each of the ∼6000 nuclei separately. We used single nucleus estimates of protein concentration, based on three dimensional fluorescence microscopy of D. melanogaster embryos at early stage 5 [47] to scale the optimized concentration parameters of the five transcription factors. Specifically at each run, we multiplied the concentrations P(t) of every regulator by its protein expression level. We then averaged the predicted binding landscape of all nuclei, to obtain whole-embryo genomic predictions, which were compared to (whole-embryo) in vivo binding measurements (Figure 3). The results were slightly improved relatively to the whole-embryo predictions (3–4% improvement on the training (0.34) and test (0.34) sets), presumably because some inappropriate competition events had been eliminated from the model.
Further analysis suggests why including binding site competition did not have a major affect on the predictive power of the model. The stronger affinity recognition sites for proteins that are co-expressed in the same cells, i.e. those having the potential to significantly affect net occupancy of a protein in vivo, overlap in only a minority of cases in the training or test genomic regions. Thus while binding site competition between the five factors may well play a key regulatory role at a subset of sites in a subset of cells as previously proposed [26], [27], it is unlikely to have a major effect on net occupancy averaged across all cells and many genomic regions.
DNA Accessibility Data Greatly Improve Binding Predictions
Even if direct competition between the five factors does not play a major role in shaping their binding landscape, other mechanisms, including interactions with nucleosomes and covalent modifications to nearby histones [29], [30], [37], [39], [87], [88], could allow transcription factors to affect binding to neighboring sites.
When comparing our computational binding predictions (Figure 1) to experimental maps of in vivo binding, we found that the majority of highly bound regions in vivo were indeed correctly predicted (133 of 240 peaks were correctly predicted, for all factors, with specific percentages varying from 30% (KR, 14/46 peaks) to 86% (BCD, 36/42)). Nonetheless, our model suffered from many false positive predictions (Figure 4), a common problem when attempting to predict in vivo binding from DNA sequence [48], [52]. It is widely accepted that a major source of such over-prediction is that many potential high-affinity binding sites are found in regions of the genome that are densely packed into chromatin states that limit their accessibility to regulatory proteins [28], [31], [33]–[35], [39], [87]–[90] and that information on the state of the chromatin can be used to improve prediction of transcription factor binding [32], [33], [41], [57], [63]–[65], [68], [70]. In some cases, the weak sequence preferences of nucleosomes have been used to predict nucleosomes positions and thereby improve binding site or gene expression predictions [57], [62], [70]. Alternatively, histone modification data has been used to identify genomic regions with putative regulatory function [63], [68]. These previous methods, however, did not make use of direct experimental measurements of DNA accessibility.
To measure the influence of chromatin state on the accuracy of our model's binding predictions, we turned to two complementary perspectives. First, we integrated the exact position of nucleosomes into our model at single nucleotide resolution to enable the competition between transcription factors and nucleosomes in binding DNA to be modeled [30], [33], [35], [39], [41], [57], [62], [70], [87], [88]. As there are no direct measurements of nucleosome positions from early Drosophila embryos, we modeled these computationally (see below). Secondly, because chromatin accessibility results from the combined effect of nucleosomes, other chromatin binding proteins and the higher-order 3D packaging of the DNA, we used direct genome-wide measurements of DNA accessibility obtained from DNase I digestion of chromatin in isolated blastoderm embryo nuclei [91]. We then quantified the effect of these two ways of assessing chromatin state on predicting the binding landscape in turn.
Predicting Nucleosome Positions Does Not Improve Model Predictions
In analyzing the role of nucleosome positioning, we were particularly interested in the possibility that the binding of one factor would reduce local nucleosome occupancy and therefore increase the occupancy of other factors at adjacent sites, so called indirect cooperative DNA binding [35], [37], [39], [87]. To investigate this, we extended the cellular resolution (3D) model by incorporating an additional state to represent the 141 bp-long sequence bound by a single nucleosome.
Due to uncertainty in the literature about the contribution of sequence specificity to in vivo nucleosome positioning [92], we decided to evaluate two different ways of incorporating positioned nucleosomes into our models. First, we used a sequence-specific model of nucleosome binding, that takes into account presumed preferences for certain DNA sequence features [93] as an additional state of our generalized hidden Markov models. This addition did not improve the predictions of our model, when comparing our predictions to in vivo measurements, obtaining correlations of 0.35 and 0.33, on the training and test sets, respectively (Table 2).
As an alternative, we used a sequence-independent model of nucleosome binding, where nucleosome are viewed as long “space-fillers” that, when present, prevent regulators from binding. This model obtained correlations of 0.37 and 0.36 on the training and test data, respectively, an improvement of 6-7% over the non-nucleosomal 3D model (see Figure 2B and Table 2). Yet, even this model fails to significantly outperform the naïve algorithm of modeling a single-regulator at a time (no competition and no 3D resolution).
Direct Measurements of Chromatin Accessibility
To more directly quantify the effect of chromatin accessibility on in vivo binding, we used genome-wide measurements of DNA accessibility obtained from DNase I digestion of isolated blastoderm embryo nuclei [91], followed by deep sequencing of the short DNA fragments that were released. When comparing these accessibility data to the predictions of our model, we found that genomic loci for which our model correctly predicts DNA binding based on sequence, namely true-positive loci, tend to have higher DNA accessibility (mean DNA accessibility of 0.36, with 25th and 75th percentiles at 0.10 and 0.60, respectively. See Methods), while false-positive loci (for which we predicted stronger binding than actually measured) typically displayed much lower DNA accessibility (mean DNA accessibility of 0.06, with 25th and 75th percentiles at 0.01 and 0.05, respectively).
Based on this observation, we next partitioned the genome based on DNA accessibility (with a threshold of 0.5), and compared the true positive rates (percent of predicted peaks that match measured ones) at high- and low-accessibility regions. Our model obtained high rates among the accessible regions (50–100% true positive rates for the different factors, with a total average of 77%), compared to much lower accuracy (true positive rates of 15–52%, with a total average of 43%) over the low accessibility regions. This differential accessibility is presented in Figure 4B and Figure S4B, where most of the false-positive predictions of the model (top-left corner) indeed show low accessibility.
To better quantify the role of accessibility in factor targeting, we leveraged the statistical framework of generalized hidden Markov models to incorporate DNA accessibility into the model as a non-uniform prior probability of regulatory binding along the genome – with regions of low accessibility having a greatly reduced probability of binding. The incorporation of differential DNA accessibility dramatically boosted the model's accuracy by a twofold for both the training set correlations (from 0.37 using the first, single-factor, model to 0.75) and the test set predictions (to a correlation of 0.70), with factor-specific correlation varying from 0.57 (HB) to 0.81 (BCD) (see Table 2, and Figure 1C and Figure 4C, 4D).
To further control for any possible experimental biases that might be present in the DNase I accessibility data, we also ran our model with DNase I data from a much later developmental stage, after cellular differentiation (embryonic stage 14). The highly bound genomic regions in our training and test sets are mostly comprised of loci that control early developmental patterning at stage 5 and which have been shown to have dramatically lower DNA accessibility at stage 14 (Thomas et al, unpublished data). Consistent with this, when we used DNase I data from the wrong developmental stage to estimate prior probabilities of transcription factor binding, the model's accuracy deteriorated (correlations of 0.36 and 0.28 on the train and test set, respectively), and was similar to the initial runs, where no DNase I data was used at all. Thus, the DNAse I data for stage 5 largely comprises a measurement of developmentally regulated chromatin accessibility, free from any major constitutive experimental bias.
BAC Spike-Ins Results Set the Highest Obtainable Correlation at 0.92
Our scoring system for the accuracy of our model uses the correlation coefficient between predicted and measured binding. We returned to the BAC spike-in data described earlier (Figure S2) to assess the experimental noise introduced into the ChIP-seq experiments by the amplification and sequencing of immunoprecipitated DNA as this would allow us to estimate the maximal correlation possible. We compared the binding landscape measured by the original ChIP data with versions of the same data into which noise was artificially introduced in levels following the BAC data (Methods). This resulted in an average correlation of 0.92 over 100 random perturbations, suggesting that even if our computational model were to perfectly predict the occupancies of proteins in vivo, the maximum obtainable correlation would be ∼0.92. Because additional experimental variation is likely to be introduced into the ChIP-seq data by differences in crosslinking efficiency and immunoprecipitation and because the DNase-seq data must also contain noise, we suspect that the true maximal correlation achievable is probably somewhat lower than this.
Thermodynamic Modeling Via Boltzmann Ensembles Captures Cooperative Binding
Although our predictions with DNA accessibility data were good, especially in light of the above estimates of experimental noise, we sought to further refine our model by considering more complex types of factor-factor interactions than the simple direct competition (via overlapping recognition sites) described above. For example, direct physical interactions between transcription factors bound at neighboring recognition sites has often been found to increase the occupancy of one or both proteins on DNA, both for homomeric and heteromeric interactions [36], [38], [42], and to sharpen the regulatory response to changes in transcription factor concentration [88], [94]–[97].
Generalized hidden Markov models, however, have limited ability to model the broader context of binding events, including interactions between neighboring sites. Therefore we added a second, sampling-based, phase to our computational model. In this phase, a large ensemble of binding configurations is sampled, each with a different set of protein-DNA interactions. The probability of each configuration is then estimated based on all pairs of nearby occupied sites (up to 95 bp apart) and the parameterized energetic gain of each pair. Finally, the overall binding probability at each position is quantified as a weighted sum of all sampled configurations (Figure 5).
By adopting a statistical mechanics perspective, the exponential space of protein-DNA binding configurations can be viewed as a canonical ensemble in a thermodynamic equilibrium. Here, the probability of each configuration is directly linked to its energetic state, including direct protein-DNA interactions, steric hindrance constraints and cooperative interactions with neighboring factors [50], [55], [56], [62], [69], [76]–[80].
We extended our model to capture cooperativity using a novel set of 15 parameters (one for each non-redundant pair of the five factors), modeling the energy gain for the nearby binding of every possible pair of the five transcriptional regulators in our model.
To predict binding in this new thermodynamic setting, we first used the generalized hidden Markov model in 3D to analyze the sequence and binding competition, and calculate an approximate map of binding. We then used the predicted binding probabilities to sample likely protein-DNA binding configurations, and re-weighted them to account for additional energetic gain, as modeled via protein-protein cooperative binding interactions [69], [98]. Finally, we averaged over these weighted configurations to predict a 3D map of binding. Using this combination of the gHMM followed by importance-weighted sampling, we were able to approximate the full thermodynamic landscape of binding, using a fast framework with few parameters. These cooperativity parameters, as well as their range of effect, were optimized based on the training set of genomic loci, using multiple random runs of a gradient-based trust-region optimization algorithm [99], [100]. The optimized set of cooperative binding parameters includes predictions of interactions between many homomeric and heteromeric pairs. However, these cooperativity parameters only improved the predictive power of the model by <5%, giving a correlation of 0.79 over the training set, with high accuracy binding predictions for all factors, ranging from 0.70 (CAD) to 0.9 (BCD). The model's accuracy over test set was 0.70, ranging from ∼0.6 (HB) to 0.83 (BCD), a marginal improvement of <1% over the Markovian approach (see Table 2 for full details). To further establish this result, we reran this procedure, this time allowing for both stabilizing and destabilizing interactions. Although we now identified additional interacting protein-protein pairs, the resulting correlations remained similar. Thus, our model suggests that cooperative interactions have a rather limited contribution in shaping the genomic landscape of in vivo binding.
Quantitative Estimations of Various Determinants of Transcriptional Regulation
We have presented here a series of increasingly complex models. The simplest model, using only the DNA sequence and in vitro binding data, obtained a correlation coefficient of 0.36 (on test data). After adding single-nucleus 3D modeling, as well as modeling binding competition with additional transcription factors and nucleosomes, the model's accuracy did not improve. We then incorporated experimental data of the DNA accessibility into the model, and boosted its accuracy to 0.70. Finally, we added thermodynamic parameters to model cooperative binding of neighboring factors, without a significant improvement in the model's accuracy (see Figure 2 and Table 2).
These estimators suggest that while sequence per se is responsible for approximately half of our predictive power, and DNA accessibility contributes the other half, the effect of modeling both binding competition and cooperative interactions is rather minor, and is estimated at the order of 1% of our predictive power.
To quantify the contribution of various transcriptional determinants on the binding landscape, we ran our model in all 16 possible configurations, using or not using: (1) sequence, (2) 3D competition, (3) DNA accessibility data, and (4) cooperativity (see Table S1). We then averaged over particular combinations to quantify the direct contribution of each of the four regulatory determinants. Once again, the relative contribution of each determinant was similar. For example, to quantify the importance of sequence per se, we compared the accuracy of the model in eight combinations (spanning all binary combinations 3D competition, accessibility and cooperativity) to the predictions obtained at the same configuration, only with randomly shuffled DNA sequences. The average difference in correlation over eight configuration pairs was 0.33 (±0.03), suggesting that DNA sequence per se does contribute half of our total predictive power. Similarly, the integration of DNA accessibility resulted in an increase of 0.33 (±0.03) in the model's accuracy (or about a half of its predictive power). The consistency of these estimates over different configurations suggests that the different regulatory determinants are largely independent of each other. Thus, the results suggest that binding within open chromatin is mostly controlled by DNA sequence, with a minimal role for direct cooperative or competitive interactions among factors.
Genome-Wide Prediction of Protein Binding
In the work presented so far, our analysis focused on regions of strong, functional binding, in the proximity of 21 known targets of the A-P patterning system and an additional eight control regions. To quantify the accuracy of the model on a larger genomic scale, we used the DNase I as a proxy, and identified the most accessible 2% of the genome (top 170 DNase I peaks, as called by an iterative peak-fitting algorithm [83], and expanded each peak by ∼10 Kb to each side in a similar manner to the initial set of 21 A-P loci). When applying the model to these open regions, it obtained an accuracy of 0.44, with various factors ranging from 0.38 (CAD) to 0.57 (BCD). Finally, we applied the model to the entirety of chromosome 2R (which was not included in our train set loci). Here, the model's accuracy was measured at 0.33 (or at 0.1 when no DNase I-based prior was used).
While these genomic scale predictions are not as accurate as the ones for the functional train and test set loci, they still offer a crude estimate to the genomic locations where binding occurs. The known target loci are inevitably biased towards genomic regions that are more accessible and more highly bound than the genome average (compare Figure 4 and Figure S4). In that sense, the DNase I peaks offer an unbiased proxy to the regions of the genome where chromatin accessibility plays a lesser role in shaping transcription factor binding in vivo. Moreover, part of the reason for the lower apparent performance of the model at the whole chromosome level could, thus, be technical, due to lower signal-to-noise ratios in the empirical data for the poorly bound, less accessible portions of the genome.
To further understand the decrease in the model's accuracy when shifting from an annotated set of known target loci to chromosome-wide applications, we compared the model's accuracy to the expression level of the underlying gene. We applied the model to predict the genome-wide binding across 4,251 genes for which we had gene expression data from the same developmental stage, taking 10 Kb regions around each gene [101]. We found a strong relation between the model's ability to accurately predict the binding landscape of transcription factors along the gene, and its expression level (Figure S5). These results suggest that while the model is applicable for genome-wide applications, it is mostly useful in predicting binding near actively transcribed genes and highly accessible genomic regions.
Discussion
We have used a thermodynamic computational framework to investigate the biochemical mechanisms that direct the widespread, quantitative patterns of binding by five developmental regulatory factors in the Drosophila blastoderm embryo. Our most striking finding is that a very simple thermodynamic model does a good job of quantitatively predicting the occupancy of transcription factors based on DNA sequence, in vitro DNA binding affinities and DNase I accessibility data.
We find no evidence that either competition between factors for binding sites or direct cooperative interactions between proteins show a significant effect on determining the overall pattern of binding in vivo. In the case of competitive binding, we showed that the reason why these do not dominantly affect net occupancy patterns embryo wide, is because high affinity sites for pairs of factors co-expressed in the same cells rarely overlap in the genome. Because our modeling focused on five early A-P segmentation regulators, it is possible that either competition or direct cooperative interactions with other transcription factors could play some role in shaping binding. The good predictive power of our model, however, sets an upper limit on the degree of this role. Also, it is striking that little evidence was found for positive heteromeric interactions among a set of five proteins that are known to cooperate on common sets of cis regulatory targets.
The relationship between chromatin accessibility and transcription factor binding has been previously described. But there is little agreement on the extent to which transcription factors bind regions of accessible chromatin, or if chromatin becomes accessible as a result of transcription factor binding. Our results are more consistent with the former, and fit well with a long-standing model for transcription factor targeting in which sequence specific transcription factors are expressed in cells at sufficient concentration to occupy their high and moderate affinity sites along accessible parts of the genome, without the aid of heteromeric cooperative interactions with other proteins [1], [2], [72].
Many other studies developed computational algorithms for predicting in vivo binding. Crudely, these studies fall into two categories. Qualitative models that aim at identifying statistically significant appearances of binding sites [49], [50], [53], [55]–[57], [59], [61], [63], [65], [67], [68], [78], [79], [102], and quantitative models that estimate the occupancy (or binding probability) at various sites [58], [60], [62], [66]. While the former group is generally more useful to identify putative cis regulatory modules and to unfold the transcriptional regulatory map, the latter approach is more suitable for modeling the continuous quantitative landscape of binding. An additional advantage of the quantitative approach is in its natural probabilistic settings, which allows for an easy integration of external data. For example, we used high-resolution measurements of DNA accessibility using DNase I hypersensitivity assays as an a-priori estimation of protein binding. Alternatively, previous work used various histone modification data as a proxy for DNA accessibility and to highlight regulatory regions [59], [63], [65], [68]. Nonetheless, our goal was to understand the biophysical forces that shape binding per se. For this, we turned to direct in vivo measurements of DNA accessibility and based our predictions on high-resolution quantitative measurements of in vivo binding data. And, unlike these earlier studies, examined the effects of different biophysical phenomena on our ability to predict binding, thereby revealing aspects of transcription factor biochemistry while providing a means to predict where, how, and to what extent transcription factor bind.
Finally, we wish to refer to a growing body of parallel work that models gene expression. Some of these studies use similar statistical models to ours, yet their modeling of transcription factor binding is intrinsic, limited to short regions of the genome with known regulatory activity, and has not examined the role of chromatin accessibility. Moreover, these models are optimized to recapitulate the patterns of expression for various genes, and do not focus on understanding the mechanisms of protein-DNA targeting [66], [69]–[71], [103].
One could argue that our use of DNA accessibility to eliminate regions where transcription factors are not bound is cheating. And, indeed, to the extent that our ultimate goal is to predict transcription factor binding from first principles it surely is. We have not established what determines the genome-wide landscape of DNA accessibility. But we have taken advantage of these direct in vivo measurements of DNA accessibility, and, in doing so, provide both a practical method for predicting binding, and a platform on which to better understand the forces that shape quantitative variation in the binding of individual factors to regions of open chromatin.
This success suggests a streamlined strategy for estimating of transcription factor binding for large numbers of factors involving in vivo measurement of DNA accessibility and in vitro determination of factor affinities for DNA. While this approach should not be viewed as a substitute for systematic experimental measurement of transcription factor binding in vivo, we believe our predictions are good enough to be useful when such experimental data are unavailable or impractical to obtain. Although our predictions are surprisingly good, they are by no means perfect. We are obviously interested in improving these predictions, primarily by incorporating better data on binding specificities and more realistic models of protein-DNA and protein-protein interactions.
Methods
Transcription Factor Chromatin Immunoprecipitation
We used ChIP-seq data for BCD, CAD, GT, HB and KR, from early stage 5 D. melanogaster embryos [18]. Sequenced reads were mapped to the genome (Apr. 2006 assembly, dm3, BDGP Release 5), extended according to their orientation to a length of 150 bp, and binned (down-sampled) to a 25 bp resolution. Finally, the genomic binding landscape of each factor was smoothed using a running window of 10 bins (or 250 bp), to account for sampling noise.
Generalized Hidden Markov Model
We implemented a generalized hidden Markov model to predict transcription factor binding based on the factor concentration and the underlying DNA sequence. Following a thermodynamic rationale, this statistical framework considers the space of all valid binding configurations as a Boltzmann distribution, where the probability of each configuration Pi, depends on its energetic state Eiwhere β equals 1/kBT, with kB being the Boltzmann constant and T is the temperature (25 C). This allows us convert binding energies to probabilities and vice versa.
To model the energetic state of each configuration, we view each unbound nucleotide as generated from a background mononucleotide distribution PB, (0.32 for A/T, 0.18 for G/C). Bound nucleotides are generated using any probabilistic model of transcription factor binding sites [48], [51]. Here we use position weight matrices (PWMs), derived from in vitro SELEX data in which hundreds of bound oligonucleotides are sequenced for 2–4 rounds of SELEX [104] (BDTNP, unpublished data). Using a Maximum-Likelihood estimator with a pseudo-count of 0.01 to prevent zero probabilities, the probability of a subsequence Si to be bound by transcription factor t equalswhere P(t) denotes the prior binding probability of transcription factor t (see below), lt denotes the binding site length of factor t, and corresponds to the probability of observing the nucleotide Si+j, at the j position of the PWM of factor t.
The probability of a full binding configuration at the DNA sequence S, with multiple factors T1, …,Tk bound at positions X1…Xk could be written aswith no overlapping binding sites. Moreover, to account for steric hindrance, we artificially extended each binding site model, adding flanking region of 3 bp, modeled by non-specific background distribution PB (0.32 for A/T, 0.18 for G/C), so the minimum distance between two non-overlapping binding sites is 7 (two flanks of 3 bp plus a 1 bp transition through the background state).
In the 3D models, we further scale the prior probability P(t) of each transcription factor t proportionally to its protein expression level (as measured at a single-cell resolution [47]). The prior probability P(t) of the nucleosomal binding state is assumed to be fixed through the embryo.
To account for all possible configurations, and infer the probability of each factor to bind each DNA position, given protein concentrations and DNA sequence, we apply the forward-backward dynamic programming inference algorithm [75]. Specifically, we calculate the probabilities that each factor t binds the DNA starting at each position i, Ut,i = P(t) * Pt(Si). We then calculate the forward potentials Ft,i, and the backward potentials Bt,i, by summing the probabilities of all configurations (paths) that end (or begin) at position i with a binding site of t. By multiplying the forward and backward potentials, we can then directly calculate the exact posterior probability of factor t bound at position i in a linear time.
In Vitro Protein-DNA Affinities
PWMs for the five transcription factors modeled in this study were obtained from the Berkeley Drosophila Transcription Network Project site (http://bdtnp.lbl.gov). PWM counts were then added a pseudo-count of 0.01 and normalized to probabilities. We have also tested other possible sources of PWMs [69], [105], with similar overall results. For example, one-hybrid PWMs yielded correlations of 0.72 for KR or 0.64 for CAD (compared to 0.71 and 0.68 using BDTNP's SELEX PWMs).
Model-Based Simulation of Chromatin Immunoprecipitation
The probabilities of transcription factor binding that were calculated by the generalized hidden Markov model were convolved to predicted ChIP landscape using a customized model-based estimation of a peak shape [83]. Given a distribution of DNA fragment lengths c(l), the estimated shape F of a peak is described as:where Δx denotes the relative distance from the binding locus (peak center). In general, the probability of sequencing a read Δx bp away from the binding location is proportional to the amount of DNA fragment of length ≥Δx (fragments begin Δx bp away from the binding location and overlap it). We approximate this fraction using a Gamma distribution, with parameters corresponding to mean and standard deviation of fragment length. Finally, we quantify the similarity of the predicted binding landscape for each factor to the in vivo binding measurements using a Pearson correlation coefficient.
Modeling Nucleosome Binding
We used two different probabilistic models to incorporate nucleosome binding into the generalized hidden Markov model. First, we used a 141 bp-long sequence-specific model, based on positional dinucleotide distributions, as described in Segal et al. [93]. Alternatively, we used a 141 bp-long sequence-independent model of nucleosome binding, based on a fixed distribution of nucleotides as in the background state PB of the Markov model (0.32 for A/T, 0.18 for G/C). Similarly to the TF states, the two nucleosomal states were assigned (in turn) a prior probability term P(t) to reflect nucleosomal concentration. P(t) was optimized together with other concentration-related parameters P(t) for all transcription factors.
Non-Uniform Binding Probabilities Using DNA Accessibility
DNase I hypersensitivity data were used to directly compute the in vivo prior probability of transcription factor binding along the genome. We used a logistic sigmoid function to process the genome-wide DNase I read densities DDx into prior probabilities PDxwhere α = 6.008 and β = 0.207. These parameters were optimized over the training data, separately from the concentration parameters in an iterative manner (Piecewise Optimization). We then computed these PDx values for every genomic position based on DNase I read densities, and normalized the prior probability of binding by each TF, P(t), by PDx, to calculate the transition probabilities into the bound states of every transcription factor. The transition probability into the nucleosomal state was not affected by this prior. In addition to the sigmoidal prior described here, we also tried a linear transformation from read densities DDx to PDx, resulting with slightly reduced accuracy (0.64 vs. 0.67).
BAC Spike-Ins
Eight BAC were used as spike-ins, including chr2R:8044567-8229187, chr2R:11688866-11882127, chr2R:19255181-19473745, chr3L:5493623-5675833, chr3L:11822927-11997557, chr3L:14593950-14773107, chr3R:12491763-12640405, and chr3R:23311086-23491584 (all given in release 5 coordinates). Three DNA samples were prepared, including: (1) the starting genomic DNA sample; (2) the genomic DNA with the addition of a set of BAC plasmid DNA premixed at different concentrations, and (3) a sample that contains the genomic DNA and the same set of BACs at 2x higher concentrations. These samples were sonicated to an average size of 500 bp, and the concentration of each BAC in the samples was quantified by Q-PCR using BAC specific primer-probe sets, and normalized to the genomic DNA. These samples were then prepared for sequencing following the same procedure we used for preparing sequencing libraries using ChIP samples. DNA fragments in the range from 200–500 bp, including the adapters, were selected for sequencing, and sequenced (4 lanes for genomic DNA, 2 lanes for each of 2 spike-in samples). Reads were mapped using bowtie [106], and the read coverage was normalized to reflect an equal read density on non-BAC background regions. The relative enrichment of each BAC post amplification and sequencing was then calculated, and compared to the Q-PCR enrichments.
Cooperative Binding Modeled Via Importance Sampling
We incorporated cooperative binding in a thermodynamic setting using sampling [69], [98]. This was done by first computing the posterior binding probabilities for every factor/position/nucleus using the generalized hidden Markov model, with sequence-independent nucleosomal state, in 3D resolution. We then sampled 10,000 binding configurations for every setting, calculated the occurrences of neighboring binding events (up to 95 bp apart), and re-weighted each sampled configuration bywhere Wi corresponds to the weight of configuration i, xj and xk are the binding positions of transcription factors j and k, and Cj,k corresponds to their protein-protein cooperativity parameter, which we optimized using the train set loci (see below). Finally, the weighted samples were averaged, and the probability of every binding position for every factor estimated.
Optimization of Model Parameters
All parameters were optimized by maximizing the correlation between the model binding predictions over the train loci to their in vivo binding occupancies. The protein concentration parameters were initially optimized using a genetic algorithm [107], with 25 generations across a population size of 15. The optimized concentrations were then further improved using a gradient-based trust-region algorithm [99], [100]. Both phases were implemented in MATLAB. Protein-protein cooperativity parameters were optimized using a gradient-based trust-region algorithm starting from >200 random starting points.
Data Availability
All data, including PWMs, 3D protein concentrations, DNase I hypersensitivity prior, binding probabilities and predicted binding landscapes for all factors (at whole-embryo and single-nucleus resolution) and protein-protein cooperativity parameters are available at http://bdtnp.lbl.gov/gHMM
Supporting Information
Zdroje
1. WalterJ
DeverCA
BigginMD
1994 Two homeo domain proteins bind with similar specificity to a wide range of DNA sites in Drosophila embryos. Genes Dev 8 1678 1692
2. CarrA
BigginMD
1999 A comparison of in vivo and in vitro DNA-binding specificities suggests a new model for homeoprotein DNA binding in Drosophila embryos. EMBO Journal 18 1598 1608
3. BoyerLA
LeeTI
ColeMF
JohnstoneSE
LevineSS
2005 Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122 947 956
4. BiedaM
XuX
SingerMA
GreenR
FarnhamPJ
2006 Unbiased location analysis of E2F1-binding sites suggests a widespread role for E2F1 in the human genome. Genome Res 16 595 605
5. YangA
ZhuZ
KapranovP
McKeonF
ChurchGM
2006 Relationships between p63 binding, DNA sequence, transcription activity, and biological function in human cells. Mol Cell 24 593 602
6. SandmannT
GirardotC
BrehmeM
TongprasitW
StolcV
2007 A core transcriptional network for early mesoderm development in Drosophila melanogaster. Genes Dev 21 436 449
7. ZeitlingerJ
ZinzenRP
StarkA
KellisM
ZhangH
2007 Whole-genome ChIP-chip analysis of Dorsal, Twist, and Snail suggests integration of diverse patterning processes in the Drosophila embryo. Genes Dev 21 385 390
8. JohnsonDS
MortazaviA
MyersRM
WoldB
2007 Genome-wide mapping of in vivo protein-DNA interactions. Science 316 1497 1502
9. RobertsonG
HirstM
BainbridgeM
BilenkyM
ZhaoY
2007 Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 4 651 657
10. ChenX
XuH
YuanP
FangF
HussM
2008 Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133 1106 1117
11. ConsortiumTEP
2007 Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447 799 816
12. GeorletteD
AhnS
MacAlpineDM
CheungE
LewisPW
2007 Genomic profiling and expression studies reveal both positive and negative activities for the Drosophila Myb MuvB/dREAM complex in proliferating cells. Genes Dev 21 2880 2896
13. LiX-Y
MacarthurS
BourgonR
NixD
PollardDA
2008 Transcription Factors Bind Thousands of Active and Inactive Regions in the Drosophila Blastoderm. PLoS Biol 6 e27 doi:10.1371/journal.pbio.0060027
14. FullwoodMJ
LiuMH
PanYF
LiuJ
XuH
2009 An oestrogen-receptor-alpha-bound human chromatin interactome. Nature 462 58 64
15. BojSF
ServitjaJM
MartinD
RiosM
TalianidisI
2009 Functional targets of the monogenic diabetes transcription factors HNF-1alpha and HNF-4alpha are highly conserved between mice and humans. Diabetes 58 1245 1253
16. MacarthurS
LiX-Y
LiJ
BrownJB
ChuHC
2009 Developmental roles of 21 Drosophila transcription factors are determined by quantitative differences in binding to an overlapping set of thousands of genomic regions. Genome Biol 10 R80
17. CaoY
YaoZ
SarkarD
LawrenceM
SanchezGJ
2010 Genome-wide MyoD binding in skeletal muscle cells: a potential for broad cellular reprogramming. Dev Cell 18 662 674
18. BradleyRK
LiX-Y
TrapnellC
DavidsonS
PachterL
2010 Binding site turnover produces pervasive quantitative changes in transcription factor binding between closely related Drosophila species. PLoS Biol 8 e1000343 doi:10.1371/journal.pbio.1000343
19. LiangZ
BigginMD
1998 Eve and ftz regulate a wide array of genes in blastoderm embryos: the selector homeoproteins directly or indirectly regulate most genes in Drosophila. Development 125 4471 4482
20. MoormanC
SunLV
WangJ
de WitE
TalhoutW
2006 Hotspots of transcription factor colocalization in the genome of Drosophila melanogaster. Proc Natl Acad Sci U S A 103 12027 12032
21. OuyangZ
ZhouQ
WongWH
2009 ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc Natl Acad Sci U S A 106 21521 21526
22. ZinzenRP
GirardotC
GagneurJ
BraunM
FurlongEEM
2009 Combinatorial binding predicts spatio-temporal cis-regulatory activity. Nature 461 65 70
23. WunderlichZ
MirnyLA
2009 Different gene regulation strategies revealed by analysis of binding motifs. Trends Genet 25 434 440
24. IyerVR
HorakCE
ScafeCS
BotsteinD
SnyderM
2001 Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409 533 538
25. LiuX
LeeCK
GranekJA
ClarkeND
LiebJD
2006 Whole-genome comparison of Leu3 binding in vitro and in vivo reveals the importance of nucleosome occupancy in target site selection. Genome Res 16 1517 1528
26. StanojevicD
SmallS
LevineM
1991 Regulation of a segmentation stripe by overlapping activators and repressors in the Drosophila embryo. Science (New York, NY) 254 1385 1387
27. MakeevVJ
LifanovAP
NazinaAG
PapatsenkoDA
2003 Distance preferences in the arrangement of binding motifs and hierarchical levels in organization of transcription regulatory information. Nucleic Acids Research 31 6016 6026
28. WolffeAP
1994 Nucleosome positioning and modification: chromatin structures that potentiate transcription. Trends Biochem Sci 19 240 244
29. CosmaMP
TanakaT
NasmythK
1999 Ordered recruitment of transcription and chromatin remodeling factors to a cell cycle- and developmentally regulated promoter. Cell 97 299 311
30. AgaliotiT
LomvardasS
ParekhB
YieJ
ManiatisT
2000 Ordered recruitment of chromatin modifying and general transcription factors to the IFN-beta promoter. Cell 103 667 678
31. CarrA
BigginMD
2000 Accessibility of transcriptionally inactive genes is specifically reduced at homeoprotein-DNA binding sites in Drosophila. Nucleic Acids Res 28 2839 2846
32. NarlikarGJ
FanH-Y
KingstonRE
2002 Cooperation between complexes that regulate chromatin structure and transcription. Cell 108 475 487
33. MorseRH
2007 Transcription factor access to promoter elements. J Cell Biochem 102 560 570
34. TaylorIC
WorkmanJL
SchuetzTJ
KingstonRE
1991 Facilitated binding of GAL4 and heat shock factor to nucleosomal templates: differential function of DNA-binding domains. Genes Dev 5 1285 1298
35. AdamsCC
WorkmanJL
1995 Binding of disparate transcriptional activators to nucleosomal DNA is inherently cooperative. Mol Cell Biol 15 1405 1421
36. JohnsonAD
1995 Molecular mechanisms of cell-type determination in budding yeast. Curr Opin Genet Dev 5 552 558
37. VasheeS
MelcherK
DingWV
JohnstonSA
KodadekT
1998 Evidence for two modes of cooperative DNA binding in vivo that do not involve direct protein-protein interactions. Curr Biol 8 452 458
38. BolouriH
DavidsonEH
2003 Transcriptional regulatory cascades in development: initial rates, not steady state, determine network kinetics. Proc Natl Acad Sci U S A 100 9371 9376
39. MillerJA
WidomJ
2003 Collaborative competition mechanism for gene activation in vivo. Mol Cell Biol 23 1623 1632
40. ZeitlingerJ
SimonI
HarbisonCT
HannettNM
VolkertTL
2003 Program-specific distribution of a transcription factor dependent on partner transcription factor and MAPK signaling. Cell 113 395 404
41. BuckMJ
LiebJD
2006 A chromatin-mediated mechanism for specification of conditional transcription factor targets. Nat Genet 38 1446 1451
42. MannRS
LelliKM
JoshiR
2009 Hox specificity unique roles for cofactors and collaborators. Curr Top Dev Biol 88 63 101
43. Campos-OrtegaJA
HartensteinV
1997 The Embryonic Development of Drosophila melanogaster. Berlin Springer-Verlag
44. Nusslein-VolhardC
WieschausE
1980 Mutations affecting segment number and polarity in Drosophila. Nature 287 795 801
45. St JohnstonD
Nusslein-VolhardC
1992 The origin of pattern and polarity in the Drosophila embryo. Cell 68 201 219
46. Rivera-PomarR
JäckleH
1996 From gradients to stripes in Drosophila embryogenesis: filling in the gaps. Trends Genet 12 478 483
47. FowlkesCC
HendriksCLL
KeränenSVE
WeberGH
RübelO
2008 A quantitative spatiotemporal atlas of gene expression in the Drosophila blastoderm. Cell 133 364 374
48. StormoGD
2000 DNA binding sites: representation and discovery. Bioinformatics 16 16 23
49. FrithMC
HansenU
WengZ
2001 Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 17 878 889
50. RajewskyN
VergassolaM
GaulU
SiggiaED
2002 Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 3 30
51. BarashY
ElidanG
FriedmanN
KaplanT
2003 Modeling dependencies in protein-DNA binding sites. Proceedings of the seventh annual international conference on Research in computational molecular biology Berlin, Germany ACM 28 37
52. BulykML
2003 Computational prediction of transcription-factor binding site locations. Genome Biol 5 201
53. SinhaS
van NimwegenE
SiggiaED
2003 A probabilistic method to detect regulatory modules. Bioinformatics 19 Suppl 1 i292 301
54. BarashY
ElidanG
KaplanT
FriedmanN
2005 CIS: compound importance sampling method for protein-DNA binding site p-value estimation. Bioinformatics 21 596 600
55. GranekJA
ClarkeND
2005 Explicit equilibrium modeling of transcription-factor binding and gene regulation. Genome Biol 6 R87
56. SinhaS
2006 On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics 22 e454 463
57. NarlikarL
GordanR
HarteminkAJ
2007 A nucleosome-guided map of transcription factor binding sites in yeast. PLoS Comput Biol 3 e215 doi:10.1371/journal.pcbi.0030215
58. RoiderHG
KanhereA
MankeT
VingronM
2007 Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics 23 134 141
59. WardLD
BussemakerHJ
2008 Predicting functional transcription factor binding through alignment-free and affinity-based analysis of orthologous promoter sequences. Bioinformatics 24 i165 171
60. HeX
ChenCC
HongF
FangF
SinhaS
2009 A biophysical model for analysis of transcription factor interaction and binding site arrangement from genome-wide binding data. PLoS ONE 4 e8155 doi:10.1371/journal.pone.0008155
61. NarlikarL
OvcharenkoI
2009 Identifying regulatory elements in eukaryotic genomes. Brief Funct Genomic Proteomic 8 215 230
62. WassonT
HarteminkAJ
2009 An ensemble model of competitive multi-factor binding of the genome. Genome Res 19 2101 2112
63. WhitingtonT
PerkinsAC
BaileyTL
2009 High-throughput chromatin information enables accurate tissue-specific prediction of transcription factor binding sites. Nucleic Acids Research 37 14 25
64. WonKJ
AgarwalS
ShenL
ShoemakerR
RenB
2009 An integrated approach to identifying cis-regulatory modules in the human genome. PLoS ONE 4 e5501 doi:10.1371/journal.pone.0005501
65. ErnstJ
PlastererHL
SimonI
Bar-JosephZ
2010 Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Res 20 526 536
66. HeX
SameeMA
BlattiC
SinhaS
2010 Thermodynamics-based models of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression. PLoS Comput Biol 6 e1000935 doi:10.1371/journal.pcbi.1000935
67. RamseySA
KnijnenburgTA
KennedyKA
ZakDE
GilchristM
2010 Genome-wide histone acetylation data improve prediction of mammalian transcription factor binding sites. Bioinformatics 26 2071 2075
68. WonK-J
RenB
WangW
2010 Genome-wide prediction of transcription factor binding sites using an integrated model. Genome Biol 11 R7
69. SegalE
Raveh-SadkaT
SchroederM
UnnerstallU
GaulU
2008 Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature 451 535 540
70. Raveh-SadkaT
LevoM
SegalE
2009 Incorporating nucleosomes into thermodynamic models of transcription regulation. Genome Res 19 1480 1496
71. KazemianM
BlattiC
RichardsA
McCutchanM
Wakabayashi-ItoN
2010 Quantitative analysis of the Drosophila segmentation regulatory network using pattern generating potentials. PLoS Biol 8 e1000456 doi:10.1371/journal.pbio.1000456
72. LinS
RiggsAD
1975 The general affinity of lac repressor for E. coli DNA: implications for gene regulation in procaryotes and eucaryotes. Cell 4 107 111
73. von HippelPH
RevzinA
GrossCA
WangAC
1974 Nonspecific DNA binding of genome regulating proteins as a biological control mechanism: 1. The lac operon: Equilibrium aspects. Proc Natl Acad Sci USA 71 4808 4812
74. KulpD
HausslerD
ReeseMG
EeckmanFH
1996 A generalized hidden Markov model for the recognition of human genes in DNA. Proc Int Conf Intell Syst Mol Biol 4 134 142
75. RabinerL
1989 A Tutorial on Hidden Markov Moldes and Selected Applications in Speech Recognition. P Ieee 77 257 286
76. AckersGK
JohnsonAD
SheaMA
1982 Quantitative model for gene regulation by lambda phage repressor. Proc Natl Acad Sci USA 79 1129 1133
77. BuchlerNE
GerlandU
HwaT
2003 On schemes of combinatorial transcription logic. Proc Natl Acad Sci USA 100 5136 5141
78. SchroederMD
PearceM
FakJ
FanH
UnnerstallU
2004 Transcriptional control in the segmentation gene network of Drosophila. PLoS Biol 2 e271 doi:10.1371/journal.pbio.0020271
79. BintuL
BuchlerNE
GarciaHG
GerlandU
HwaT
2005 Transcriptional regulation by the numbers: models. Current Opinion in Genetics & Development 15 116 124
80. GertzJ
CohenBA
2009 Environment-specific combinatorial cis-regulation in synthetic promoters. Mol Syst Biol 5 1 9
81. TothJ
BigginMD
2000 The specificity of protein-DNA crosslinking by formaldehyde: in vitro and in drosophila embryos. Nucleic Acids Res 28 e4
82. AuerbachRK
EuskirchenG
RozowskyJ
Lamarre-VincentN
MoqtaderiZ
2009 Mapping accessible chromatin regions using Sono-Seq. Proc Natl Acad Sci USA 106 14926 14931
83. CapaldiA
KaplanT
LiuY
HabibN
RegevA
2008 Structure and function of a transcriptional network activated by the MAPK Hog1. Nat Genet 40 1300 1306
84. BrodyT
1999 The Interactive Fly: gene networks, development and the Internet. Trends Genet 15 333 334
85. HareEE
PetersonBK
IyerVN
MeierR
EisenMB
2008 Sepsid even-skipped enhancers are functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genet 4 e1000106 doi:10.1371/journal.pgen.1000106
86. KimJ
HeX
SinhaS
2009 Evolution of regulatory sequences in 12 Drosophila species. PLoS Genet 5 e1000330 doi:10.1371/journal.pgen.1000330
87. PolachKJ
WidomJ
1996 A model for the cooperative binding of eukaryotic regulatory proteins to nucleosomal target sites. J Mol Biol 258 800 812
88. MirnyL
2009 Nucleosome-mediated cooperativity between transcription factors. Arxiv preprint arXiv 09012905
89. GrossDS
GarrardWT
1988 Nuclease hypersensitive sites in chromatin. Annu Rev Biochem 57 159 197
90. HesselberthJR
ChenX
ZhangZ
SaboPJ
SandstromR
2009 Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nat Meth 6 283 289
91. SaboPJ
KuehnMS
ThurmanR
JohnsonBE
JohnsonEM
2006 Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat Meth 3 511 518
92. ZhangY
MoqtaderiZ
RattnerBP
EuskirchenG
SnyderM
2009 Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo. Nat Struct Mol Biol 16 847 852
93. SegalE
Fondufe-MittendorfY
ChenL
ThåströmA
FieldY
2006 A genomic code for nucleosome positioning. Nature 442 772 778
94. SmallS
BlairA
LevineM
1992 Regulation of even-skipped stripe 2 in the Drosophila embryo. EMBO J 11 4047 4057
95. ArnostiDN
BaroloS
LevineM
SmallS
1996 The eve stripe 2 enhancer employs multiple modes of transcriptional synergy. Development 122 205 214
96. KulkarniMM
ArnostiDN
2005 cis-regulatory logic of short-range transcriptional repression in Drosophila melanogaster. Molecular and Cellular Biology 25 3411 3420
97. FakhouriWD
AyA
SayalR
DreschJ
DayringerE
2010 Deciphering a transcriptional regulatory code: modeling short-range repression in the Drosophila embryo. Molecular Systems Biology 6 1 14
98. HammersleyJM
HandscombDC
1964 Monte Carlo methods. London, New York Methuen; Wiley vii, 178
99. SteihaugT
1983 The Conjugate Gradient Method and Trust Regions in Large Scale Optimization. SIAM Journal on Numerical Analysis 20 626 637
100. ColemanTF
LiY
1996 An Interior Trust Region Approach for Nonlinear Minimization Subject to Bounds. SIAM J Optim 6 418 445
101. ArbeitmanM
FurlongE
ImamF
JohnsonE
NullB
2002 Gene expression during the life cycle of Drosophila melanogaster. Science 297 2270 2275
102. AgiusP
ArveyA
ChangW
NobleWS
LeslieC
2010 High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions. PLoS Comput Biol 6 e1000916 doi:10.1371/journal.pcbi.1000916
103. ZinzenRP
SengerK
LevineM
PapatsenkoD
2006 Computational models for neurogenic gene expression in the Drosophila embryo. Curr Biol 16 1358 1365
104. RouletE
BussoS
CamargoAA
SimpsonAJ
MermodN
2002 High-throughput SELEX SAGE method for quantitative modeling of transcription-factor binding sites. Nat Biotechnol 20 831 835
105. NoyesMB
MengX
WakabayashiA
SinhaS
BrodskyMH
2008 A systematic characterization of factors that regulate Drosophila segmentation via a bacterial one-hybrid system. Nucleic Acids Research 36 2547 2560
106. LangmeadB
TrapnellC
PopM
SalzbergSL
2009 Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 R25
107. GoldbergD
HollandJ
1988 Genetic Algorithms and Machine Learning. Machine learning 3 95 99
Štítky
Genetika Reprodukční medicínaČlánek vyšel v časopise
PLOS Genetics
2011 Číslo 2
Nejčtenější v tomto čísle
- Meta-Analysis of Genome-Wide Association Studies in Celiac Disease and Rheumatoid Arthritis Identifies Fourteen Non-HLA Shared Loci
- MiRNA Control of Vegetative Phase Change in Trees
- The Cardiac Transcription Network Modulated by Gata4, Mef2a, Nkx2.5, Srf, Histone Modifications, and MicroRNAs
- Genome-Wide Transcript Profiling of Endosperm without Paternal Contribution Identifies Parent-of-Origin–Dependent Regulation of