Experimental Investigation of Three Machine Learning Algorithms for ITS Dataset

  • J. L. Yearwood
  • B. H. Kang
  • A. V. Kelarev
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5899)


The present article is devoted to an experimental investigation of the performance of three machine learning algorithms on the ITS dataset, assessing their ability to agree with classes previously published in the biological literature. The ITS dataset consists of nuclear ribosomal DNA sequences, for which rather sophisticated alignment scores must be used as a measure of distance. These scores do not form a Minkowski metric, and the sequences cannot be regarded as points in a finite-dimensional space. This is why it is necessary to develop novel machine learning approaches to the analysis of datasets of this sort. This paper introduces a k-committees classifier and compares it with the discrete k-means and Nearest Neighbour classifiers. It turns out that all three machine learning algorithms are efficient and can be used to automate future biologically significant classifications for datasets of this kind. A simplified version of a synthetic dataset, on which the k-committees classifier outperforms the k-means and Nearest Neighbour classifiers, is also presented.
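The key point of the abstract is that these classifiers operate on a pairwise distance function between sequences rather than on coordinates in a vector space. A minimal sketch of this idea, using a Nearest Neighbour rule: the `edit_distance` function below is a simple Levenshtein stand-in for the paper's more sophisticated alignment scores, and the sequences and clade labels are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming; a stand-in for
    the alignment scores used as a distance measure in the paper."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def nearest_neighbour(query, labelled):
    """Classify `query` by the label of the closest labelled sequence.
    `labelled` is a list of (sequence, class_label) pairs; no vector
    embedding of the sequences is ever required."""
    return min(labelled, key=lambda sc: edit_distance(query, sc[0]))[1]

# Hypothetical toy data: short DNA strings with invented clade labels.
training = [("ACGTACGT", "clade-A"), ("ACGTTCGT", "clade-A"),
            ("TTTTGGGG", "clade-B"), ("TTTCGGGG", "clade-B")]
print(nearest_neighbour("ACGTACGA", training))  # -> clade-A
```

The same precomputed-distance pattern underlies the discrete k-means and k-committees classifiers studied in the paper: centroids cannot be averaged in a vector space, so representatives must themselves be chosen from the dataset using only pairwise distances.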







Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • J. L. Yearwood (1)
  • B. H. Kang (2)
  • A. V. Kelarev (1)
  1. School of Information Technology and Mathematical Sciences, University of Ballarat, Ballarat, Victoria, Australia
  2. School of Computing and Information Systems, University of Tasmania, Tasmania, Australia
