Statistical Feature Selection from Chaos Game Representation for Promoter Recognition

  • Orawan Tinnungwattana
  • Chidchanok Lursinsap
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3992)


The accuracy of promoter recognition depends not only on an appropriate representation of the promoter sequence but also on the selection of its essential features. This paper addresses both issues. First, a promoter sequence is captured in the form of a Chaos Game Representation (CGR). Then, based on the concept of Mahalanobis distance, a new statistical feature extraction method is introduced to select a set of the most significant pixels from the CGR. Recognition is performed by a supervised neural network. The proposed technique achieved 100% accuracy when tested on E. coli promoter sequences using the leave-one-out method. Our approach also outperforms other techniques.
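To make the pipeline in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It builds a frequency CGR for a DNA sequence and ranks pixels by a per-pixel (univariate) Mahalanobis-style distance between the promoter and non-promoter classes. The corner assignment, the grid resolution `k`, the pooled-variance form of the distance, and the function names `cgr_counts`, `pixel_scores`, and `top_pixels` are all assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean, pvariance

# Corner assignment is an assumed convention; the abstract does not state one.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_counts(seq, k=2):
    """Return a flattened 2^k x 2^k grid of CGR point counts for seq."""
    size = 2 ** k
    grid = [0] * (size * size)
    x, y = 0.5, 0.5  # start at the centre of the unit square
    for i, base in enumerate(seq.upper()):
        cx, cy = CORNERS[base]
        # Chaos game step: move halfway towards the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        if i >= k - 1:  # skip points whose cell still depends on the start
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            grid[row * size + col] += 1
    return grid


def pixel_scores(pos_grids, neg_grids):
    """Score each pixel by |mean difference| / pooled standard deviation.

    This is the one-dimensional form of the Mahalanobis distance between
    the two class means, used here as an assumed stand-in for the paper's
    statistic.
    """
    scores = []
    for p in range(len(pos_grids[0])):
        a = [g[p] for g in pos_grids]
        b = [g[p] for g in neg_grids]
        pooled = (len(a) * pvariance(a) + len(b) * pvariance(b)) / (len(a) + len(b))
        diff = abs(mean(a) - mean(b))
        if pooled > 0:
            scores.append(diff / pooled ** 0.5)
        else:
            # Zero variance: perfectly separated if the means differ.
            scores.append(float("inf") if diff > 0 else 0.0)
    return scores


def top_pixels(scores, n):
    """Return the indices of the n highest-scoring pixels."""
    return sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)[:n]


# Toy sequences (hypothetical data, not E. coli promoters).
pos = [cgr_counts(s, k=1) for s in ("AAAA", "AAAT")]
neg = [cgr_counts(s, k=1) for s in ("GGGG", "GGGC")]
selected = top_pixels(pixel_scores(pos, neg), 2)
```

In a full system, the counts at the selected pixel indices would form the input vector of the supervised neural network. Treating pixels independently ignores correlations between them, which a multivariate Mahalanobis distance with a full covariance matrix would capture.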


Feature Selection, Hidden Markov Model, Transcription Start Site, Promoter Sequence, Mahalanobis Distance



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Orawan Tinnungwattana (1)
  • Chidchanok Lursinsap (1)
  1. Advanced Virtual and Intelligent Computing (AVIC) Center, Department of Mathematics, Chulalongkorn University, Bangkok, Thailand
