Variable structure motifs for transcription factor binding sites
Reid, J. E. (John E.), Evans, Kenneth J., Dyer, Nigel, Wernisch, Lorenz and Ott, Sascha. (2010) Variable structure motifs for transcription factor binding sites. BMC Genomics, Vol.11 (Article 30). ISSN 1471-2164
Official URL: http://dx.doi.org/10.1186/1471-2164-11-30
Abstract
Background: Classically, models of DNA-transcription factor binding sites (TFBSs) have been based on relatively few known instances and have treated them as sites of fixed length using position weight matrices (PWMs). Various extensions to this model have been proposed, most of which take account of dependencies between the bases in the binding sites. However, some transcription factors are known to exhibit some flexibility and bind to DNA in more than one possible physical configuration. In some cases this variation is known to affect the function of binding sites. With the increasing volume of ChIP-seq data available it is now possible to investigate models that incorporate this flexibility. Previous work on variable length models has been constrained by: a focus on specific zinc finger proteins in yeast using restrictive models; a reliance on hand-crafted models for just one transcription factor at a time; and a lack of evaluation on realistically sized data sets.

Results: We re-analysed binding sites from the TRANSFAC database and found motivating examples where our new variable length model provides a better fit. We analysed several ChIP-seq data sets with a novel motif search algorithm and compared the results to one of the best standard PWM finders and a recently developed alternative method for finding motifs of variable structure. All the methods performed comparably in held-out cross validation tests. Known motifs of variable structure were recovered for p53, Stat5a and Stat5b. In addition our method recovered a novel generalised version of an existing PWM for Sp1 that allows for variable length binding. This motif improved classification performance.

Conclusions: We have presented a new gapped PWM model for variable length DNA binding sites that is not too restrictive nor over-parameterised. Our comparison with existing tools shows that on average it does not have better predictive accuracy than existing methods. However, it does provide more interpretable models of motifs of variable structure that are suitable for follow-up structural studies. To our knowledge, we are the first to apply variable length motif models to eukaryotic ChIP-seq data sets and consequently the first to show their value in this domain. The results include a novel motif for the ubiquitous transcription factor Sp1.
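
As a rough illustration of the contrast the abstract draws between fixed-length PWMs and a gapped PWM, the sketch below scores a candidate site against a toy PWM in which one column may be skipped. It is a minimal sketch under stated assumptions, not the authors' model or search algorithm: the function names, the uniform background of 0.25, the single optional column, and the toy probabilities are all hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"


def log_odds(pwm, background=0.25):
    """Turn per-column base probabilities into log2-odds scores.

    Assumes a uniform background; a real analysis would estimate it.
    """
    return [{b: math.log2(col[b] / background) for b in BASES} for col in pwm]


def score_fixed(site, scores):
    """Score a site whose length equals the number of PWM columns."""
    if len(site) != len(scores):
        raise ValueError("site length must match the number of PWM columns")
    return sum(col[base] for base, col in zip(site, scores))


def score_gapped(site, scores, optional_index):
    """Score a site against a PWM whose column `optional_index` may be skipped.

    A full-length site uses every column; a site one base shorter is
    scored with the optional column removed (hypothetical simplification).
    """
    if len(site) == len(scores):
        return score_fixed(site, scores)
    if len(site) == len(scores) - 1:
        reduced = scores[:optional_index] + scores[optional_index + 1:]
        return score_fixed(site, reduced)
    raise ValueError("site length is not compatible with this gapped PWM")


# Toy three-column PWM; the uninformative middle column is the optional one.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
toy_scores = log_odds(toy_pwm)

long_score = score_gapped("ACG", toy_scores, optional_index=1)
short_score = score_gapped("AG", toy_scores, optional_index=1)
print(long_score, short_score)
```

With this toy matrix both the three-base site `ACG` and the two-base site `AG` are accepted by the same motif, which a fixed-length PWM cannot do without a second matrix. A probabilistic model would normally also attach a probability to including or skipping the optional column; that term is omitted here for brevity.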
| Item Type: | Journal Article |
|---|---|
| Subjects: | Q Science > QH Natural history > QH426 Genetics |
| Divisions: | Faculty of Science > Molecular Organisation and Assembly in Cells (MOAC); Faculty of Science > Centre for Systems Biology |
| Library of Congress Subject Headings (LCSH): | DNA-binding proteins -- Research, Genetic transcription, Transcription factors, Genomics, Genetics -- Mathematical models |
| Journal or Publication Title: | BMC Genomics |
| Publisher: | BioMed Central Ltd. |
| ISSN: | 1471-2164 |
| Date: | 14 January 2010 |
| Volume: | Vol.11 |
| Number: | Article 30 |
| Identification Number: | 10.1186/1471-2164-11-30 |
| Status: | Peer Reviewed |
| Access rights to Published version: | Open Access |
| Funder: | Engineering and Physical Sciences Research Council (EPSRC), Research Councils UK (RCUK) |
| URI: | http://wrap.warwick.ac.uk/id/eprint/3006 |

