Multispecies Gene Entropy Estimation, a Data Mining Approach

  • Xiaoxu Han
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4065)


This paper presents a data mining approach to estimating multispecies gene entropy, using a self-organizing map (SOM) to mine a homologous gene set. The distribution function of each gene is approximated by its probability distribution in the feature space. The phylogenetic applications of multispecies gene entropy are investigated by inferring the species phylogeny of eight yeast species. Genes with the smallest Kullback-Leibler (K-L) distances to the minimum entropy gene are found to be more likely to be phylogenetically informative. The K-L distances of genes are strongly correlated with the spectral radii of their identity percentage matrices. After a fast Fourier transform (FFT), the images of the identity percentage matrices of genes with small K-L distances to the minimum entropy gene are more similar, in the frequency domain, to the image of the minimum entropy gene than are those of genes with large K-L distances. Finally, a K-L distance based gene concatenation approach under gene clustering is proposed to infer species phylogenies robustly and systematically.
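The quantities named in the abstract can be illustrated with a minimal sketch. It assumes that each gene has already been reduced to a discrete probability distribution over SOM units; the SOM training step is not shown. It also assumes that the K-L distance is the standard, asymmetric Kullback-Leibler divergence from a gene to the minimum entropy gene. The gene names, probability values, and function names are hypothetical and are not taken from the paper.

```python
import math


def shannon_entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q), in bits.

    eps guards against zero entries in q; this smoothing is an
    assumption of the sketch, not something stated in the abstract.
    """
    return sum(x * math.log2(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def spectral_radius(matrix, iterations=200):
    """Power-iteration estimate of the spectral radius of a symmetric
    matrix with non-negative entries, such as a pairwise identity
    percentage matrix."""
    n = len(matrix)
    v = [1.0] * n
    radius = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        radius = max(abs(x) for x in w)
        if radius == 0:
            return 0.0
        v = [x / radius for x in w]
    return radius


def rank_by_kl_to_min_entropy(gene_distributions):
    """Return gene names sorted by K-L distance to the minimum entropy gene."""
    reference = min(gene_distributions,
                    key=lambda g: shannon_entropy(gene_distributions[g]))
    q = gene_distributions[reference]
    return sorted(gene_distributions,
                  key=lambda g: kl_divergence(gene_distributions[g], q))


# Hypothetical distributions of three genes over four SOM units.
genes = {
    "geneA": [0.7, 0.1, 0.1, 0.1],
    "geneB": [0.4, 0.3, 0.2, 0.1],
    "geneC": [0.25, 0.25, 0.25, 0.25],
}
ranking = rank_by_kl_to_min_entropy(genes)
```

Under these assumptions, genes near the front of `ranking` would be the candidates for concatenation. The direction of the divergence and whether it is symmetrized are choices made for this sketch, so the full paper should be consulted for the definition it actually uses.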


Keywords: Sequence Space, Gene Entropy, Informative Gene, Data Mining Approach, Good Match Unit
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Xiaoxu Han (1)
  1. Department of Mathematics and Bioinformatics Program, Eastern Michigan University, Ypsilanti, USA
