Understanding and predicting transcription factor specificity


Richard Mann, Barry Honig, and Andrea Califano




The long-term goal of this project is to clarify the in vivo specificities of the transcription factors that regulate gene expression in eukaryotes. The project has two main components, both leveraging our experience with the fruit fly, Drosophila melanogaster, as a model system. The first, "DNA-centric" component uses the evolutionary conservation of DNA sequences to identify them as candidate cis-regulatory modules (CRMs) that bind transcription factors. The development of novel computational prediction methods to identify modules will be followed by their in vivo validation. The second, "protein-centric" component uses a combination of biochemical, modeling, structural, and in vivo methods to understand the DNA-binding specificities of the Hox family of homeodomain proteins. These transcription factors exhibit a high degree of functional specificity in vivo, but low specificity in in vitro DNA-binding assays. Our results attribute much of this discrepancy to sequence-dependent changes in the precise shape of the DNA, especially its minor groove. Further work suggests that recognition of minor groove shape is a widespread mechanism used by many transcription factors.


The first, "DNA-centric" part of this project seeks to predict transcription-factor binding sites based on DNA sequence information and its evolutionary conservation. We developed a novel integrated set of algorithms for discovering clusters of short DNA sequence motifs that are conserved across multiple Drosophila species (Sosinsky 2007).

We refer to these sets of motifs as Local Permutation Clusters (LPCs). Unlike traditional CRM discovery methods, this approach does not require any a priori information about the transcription factors that may bind to these CRMs. A second distinctive aspect of this approach, which we call EDGI (for Enhancer Detection with only Genomic Information), is that it does not rely on simple pair-wise comparisons but instead identifies LPCs present in groups of related sequences. Consequently, the order and spacing of the individual motifs within an LPC can differ from species to species. In pilot tests, this method has been shown to be nearly as good at predicting bona fide CRMs as are methods that require knowing the transcription factors and the sequences they bind to within a CRM. We are currently testing the LPCs we have discovered for enhancer activity in Drosophila.
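The idea of a motif cluster whose members may appear in a different order in each species can be illustrated with a small Python sketch. This is a toy illustration only, not the published EDGI algorithm (Sosinsky 2007): the function names, the use of exact k-mers as "motifs", the fixed window length, and the three invented sequences with their labels are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def kmer_set(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def window_has_motifs(seq: str, motifs: FrozenSet[str], window: int) -> bool:
    """True if some window of seq contains every motif, in any order."""
    for start in range(max(1, len(seq) - window + 1)):
        chunk = seq[start:start + window]
        if all(m in chunk for m in motifs):
            return True
    return False


def find_clusters(seqs: Dict[str, str], reference: str, k: int = 6,
                  window: int = 20, min_motifs: int = 2
                  ) -> List[Tuple[int, List[str]]]:
    """Toy search for order-independent clusters of shared k-mers.

    A cluster is reported when a window of the reference sequence holds
    at least min_motifs k-mers that occur in every sequence, and each
    other sequence also has a window holding all of them (order and
    spacing inside the window are free to differ).
    """
    # k-mers present in every sequence are the candidate "motifs".
    shared = set.intersection(*(kmer_set(s, k) for s in seqs.values()))
    ref = seqs[reference]
    seen: Set[FrozenSet[str]] = set()
    clusters: List[Tuple[int, List[str]]] = []
    for start in range(max(1, len(ref) - window + 1)):
        chunk = ref[start:start + window]
        motifs = frozenset(kmer_set(chunk, k) & shared)
        if len(motifs) < min_motifs or motifs in seen:
            continue
        others = (s for name, s in seqs.items() if name != reference)
        if all(window_has_motifs(s, motifs, window) for s in others):
            seen.add(motifs)
            clusters.append((start, sorted(motifs)))
    return clusters


# Invented toy sequences (hypothetical labels, not real genomic data).
# The two shared 6-mers appear in reversed order in species_b.
toy_seqs = {
    "species_a": "AAAAGATACCTTTCGGAACCCC",
    "species_b": "GGGGTCGGAAACAGATACCTGTG",
    "species_c": "CTCTGATACCGAGAGTCGGAAATAT",
}

if __name__ == "__main__":
    print(find_clusters(toy_seqs, "species_a", k=6, window=20))
```

Because the comparison is made on the set of motifs inside a window rather than on an alignment, the reversed motif order in `species_b` does not prevent the cluster from being found. A pair-wise alignment of these toy sequences would not line the two motifs up.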

We have extended the algorithm, in a version we call coEDGI, to look for LPCs in sets of co-expressed genes. Using a set of 19 genes that are co-expressed during Drosophila eye development, we identified conserved LPCs and ranked the top 20 of these according to conservation. Among these 20, the analysis recovered at least one previously known enhancer, from the eyeless gene, which ranked #5 on the list.

The second, "protein-centric" component of the project addresses the molecular basis of transcription-factor binding specificity, focusing on the ubiquitous Hox proteins. These transcription factors, which control critical aspects of development in many species, contain a homeodomain, a DNA-binding domain found in a wide range of eukaryotes. We found that the sequence specificity of Hox protein binding is associated with the minor groove shape of the DNA.

The research relied on crystal structure determination, in vitro and in vivo assays on wild type and mutant proteins, and computational analysis (Joshi 2007). This joint experimental/computational work has been particularly fruitful, leading to a second paper, recently published in Nature (Rohs 2009), showing that recognition of minor groove shape is a widespread mechanism used by many transcription factors.

The homeodomain comprises three alpha helices. One, the so-called recognition helix, makes base pair-specific contacts in the major groove of the DNA. This recognition helix is nearly identical for all of the Hox homeodomains, so it does not explain the unique specificities that Hox proteins exhibit in vivo. Another part of the homeodomain, the "N-terminal arm", which lies at its N-terminus, is thought to sit in the minor groove of the DNA, and has been shown in in vivo experiments to be a critical component of Hox-protein specificity.

Previous structural studies, however, did not find well-defined contacts between the N-terminal arm and the minor groove of DNA that would help explain its specificity. We resolved this discrepancy by solving the x-ray structure of crystals grown from a more realistic complex between the protein and DNA. In these crystals, the DNA contained a bona fide binding site for the Hox protein Scr, as determined from in vivo studies, instead of a "consensus" DNA sequence. The crystals also included the cofactor Exd, which in vivo binds with Scr to a site derived from the forkhead gene, but not to the consensus sequence. The crystal structure showed contacts involving specific amino acids (Joshi 2007), and the bona fide and consensus sequences differ subtly in the shape of the minor groove of the DNA.

Using computational methods and data in the Protein Data Bank, we have analyzed the structures of other DNAs bound to proteins, especially other homeodomain proteins. We found that, as for the forkhead sequence, local differences in DNA structure can contribute, perhaps profoundly, to the specific recognition of binding sites by DNA-binding proteins (Rohs 2009).

Project Publications

Sosinsky A, Honig B, Mann RS, Califano A. Discovering transcriptional regulatory regions in Drosophila by a nonalignment method for phylogenetic footprinting. Proc Natl Acad Sci U S A. 2007;104(15):6305-10.

Joshi R, Passner JM, Rohs R, Jain R, Sosinsky A, Crickmore MA, Jacob V, Aggarwal AK, Honig B, Mann RS. Functional specificity of a Hox protein mediated by the recognition of minor groove structure. Cell. 2007;131(3):530-43.

Rohs R, West SM, Sosinsky A, Liu P, Mann RS, Honig B. The role of DNA shape in protein-DNA recognition. Nature. 2009;461(7268):1248-53.