Transforming Strings to Vector Spaces Using Prototype Selection

  • Barbara Spillmann
  • Michel Neuhaus
  • Horst Bunke
  • Elżbieta Pękalska
  • Robert P. W. Duin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4109)


A common way of expressing string similarity in structural pattern recognition is the edit distance. It allows one to apply the kNN rule in order to classify a set of strings. However, compared to the wide range of elaborate classifiers known from statistical pattern recognition, this is only a very basic method. In the present paper we propose a method for transforming strings into n-dimensional real vector spaces based on prototype selection. This allows us to subsequently classify the transformed strings with more sophisticated classifiers, such as support vector machines and other kernel-based methods. In a number of experiments, we show that the recognition rate can be significantly improved by means of this procedure.
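
As a rough illustration of the idea in the abstract, the sketch below maps a string to an n-dimensional real vector whose i-th coordinate is its edit distance to the i-th of n prototype strings. The abstract does not spell out the mapping, so this coordinate definition is an assumption, as are the unit edit costs, the function names `edit_distance` and `embed`, and the hand-picked prototypes; the prototype selection strategies studied in the paper are not reproduced here.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via row-by-row dynamic programming.

    Assumption: insertion, deletion and substitution all cost 1.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def embed(s: str, prototypes: Sequence[str]) -> List[float]:
    """Map s to a real vector of edit distances to each prototype.

    Assumption: one coordinate per prototype, so n = len(prototypes).
    """
    return [float(edit_distance(s, p)) for p in prototypes]


# Hypothetical prototypes, chosen by hand purely for illustration.
prototypes = ["kitten", "flaw"]
print(embed("sitting", prototypes))  # [3.0, 7.0]
```

Vectors produced this way could then be passed to any standard vector-space classifier, for example a support vector machine, in place of running the kNN rule directly on edit distances.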


Support Vector Machine; Recognition Rate; Edit Distance; Real Vector Space; Edit Operation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Barbara Spillmann (1)
  • Michel Neuhaus (1)
  • Horst Bunke (1)
  • Elżbieta Pękalska (2)
  • Robert P. W. Duin (2)
  1. Institute of Computer Science and Applied Mathematics, University of Bern, Bern, Switzerland
  2. Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
