Prediction of Transcription Factor Families Using DNA Sequence Features

  • Ashish Anand
  • Gary B. Fogel
  • Ganesan Pugalenthi
  • P. N. Suganthan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5265)


Understanding the mechanisms of protein-DNA interaction is of critical importance in biology. Transcription factor (TF) binding to a specific DNA sequence depends on at least two factors: a protein-level DNA-binding domain and a nucleotide-level sequence that serves as the TF binding site. TFs have been classified into families based on these factors, and TFs within each family bind to specific nucleotide sequences in a very similar fashion. Identifying the TF family that might bind at a particular nucleotide sequence requires a machine learning approach. Here we considered two sets of features based on DNA sequences and their physicochemical properties, and applied a one-versus-all SVM (OVA-SVM) with class-wise optimized features to identify TF family-specific features in DNA sequences. This approach achieved a mean prediction accuracy of ~80%, an improvement of ~7% over previous approaches on the same data.
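The one-versus-all decision with class-wise features can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, purely for illustration, that the sequence features are dinucleotide frequencies, and it replaces the trained per-family SVMs with hand-written linear scorers. The family labels, weights, and the function names `kmer_frequencies` and `ova_predict` are all hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies of a DNA sequence as a dict.

    Assumption: k-mer frequencies stand in for the sequence-derived
    features; the abstract does not specify the actual feature set.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases (e.g. N)
            counts[window] += 1
            total += 1
    return {km: (c / total if total else 0.0) for km, c in counts.items()}


def ova_predict(features, class_models):
    """One-versus-all decision with class-wise feature subsets.

    Each class owns a (weights, bias) pair defined only over the features
    selected for that class. The class with the highest score wins.
    """
    scores = {}
    for label, (weights, bias) in class_models.items():
        scores[label] = bias + sum(w * features[name] for name, w in weights.items())
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical per-family scorers; in practice these would be learned
# (e.g. one binary SVM per family), not written by hand.
models = {
    "family_A": ({"GG": 1.0, "GA": 1.0}, -0.3),
    "family_B": ({"TA": 1.0, "AT": 1.0}, -0.3),
}

features = kmer_frequencies("GGAAGGAA", k=2)
label, scores = ova_predict(features, models)
print(label, scores)
```

Keeping a separate feature subset per class means each binary scorer only sees the features judged informative for its own family, which is the idea behind class-wise optimized features; the final label is simply the family whose scorer is most confident.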


Transcription factor family prediction, multi-class classification



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Ashish Anand (1)
  • Gary B. Fogel (2)
  • Ganesan Pugalenthi (1)
  • P. N. Suganthan (1)
  1. School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
  2. Natural Selection, San Diego
