Molecular Biology, 45:667

Machine learning study of DNA binding by transcription factors from the LacI family

  • G. G. Fedonin
  • A. B. Rakhmaninova
  • Yu. D. Korostelev
  • O. N. Laikova
  • M. S. Gelfand


We studied 1372 LacI-family transcription factors and their 4484 DNA binding sites using machine learning algorithms and feature selection techniques. The Naive Bayes classifier and logistic regression were used to predict binding sites from transcription factor sequences and to classify factor-site pairs as binding or non-binding. Prediction accuracy was estimated by 10-fold cross-validation. The experiments showed that the best prediction of nucleotide densities at selected site positions is obtained using only a few key protein sequence positions. These positions are stably selected by forward feature selection based on the mutual information of factor-site position pairs.
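
As an illustration of the mutual information criterion mentioned above, the following minimal sketch scores each protein alignment position against one site position and ranks the protein positions by that score. It is not the authors' implementation: the toy `pairs` data, the function names, and the simple one-pass ranking (in place of full forward feature selection) are assumptions made for this example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two aligned columns."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    n = len(xs)
    px = Counter(xs)            # marginal counts of amino acids
    py = Counter(ys)            # marginal counts of nucleotides
    pxy = Counter(zip(xs, ys))  # joint counts of (amino acid, nucleotide)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def rank_protein_positions(pairs, site_pos):
    """Rank protein positions by mutual information with one site position."""
    site_col = [site[site_pos] for _, site in pairs]
    n_protein_pos = len(pairs[0][0])
    scores = []
    for i in range(n_protein_pos):
        prot_col = [prot[i] for prot, _ in pairs]
        scores.append((i, mutual_information(prot_col, site_col)))
    return sorted(scores, key=lambda t: t[1], reverse=True)


# Hypothetical toy data: (aligned protein fragment, aligned site fragment).
pairs = [
    ("RA", "G"),
    ("RS", "G"),
    ("KA", "T"),
    ("KS", "T"),
]

ranking = rank_protein_positions(pairs, site_pos=0)
```

In this toy set, protein position 0 fully determines the nucleotide and position 1 is independent of it, so position 0 ranks first. A forward selection procedure would go on adding positions one at a time while the prediction keeps improving.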


Keywords: transcription factors, Naive Bayes classifier, logistic regression, mutual information, prokaryotes, LacI family



Copyright information

© Pleiades Publishing, Ltd. 2011

Authors and Affiliations

  • G. G. Fedonin (1)
  • A. B. Rakhmaninova (2)
  • Yu. D. Korostelev (2)
  • O. N. Laikova (1)
  • M. S. Gelfand (1)

  1. Institute for Information Transmission Problems (Kharkevich Institute), Moscow, Russia
  2. Department of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia
