Abstract

Owing to its robustness to outliers, many Pattern Recognition algorithms use the median as the representative of a set of points. A special case arises in Syntactical Pattern Recognition, where the points (prototypes) are represented by strings. When the edit distance is used, however, finding the median becomes an NP-hard problem. Consequently, either the search is restricted to the strings in the data (set-median) or a heuristic approach is applied. In this work we use the (conditional) stochastic edit distance instead of the plain edit distance. It is not yet known whether the problem is also NP-hard in this case, so an approximation algorithm is described. The algorithm extends the string structure to multistrings (strings of stochastic vectors in which each element represents the probability of each symbol), which allows the use of the Expectation Maximization technique. We carry out experiments on a corpus of chromosomes to evaluate the efficiency of the algorithm.
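
To make the terms in the abstract concrete, the following is a minimal sketch of the set-median under the plain, unit-cost edit distance, together with one possible encoding of an ordinary string as a multistring. It is not the algorithm of the paper: the stochastic edit distance and the Expectation Maximization step are not shown, and the names `levenshtein`, `set_median` and `to_multistring` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def set_median(strings: Sequence[str]) -> str:
    """Return the member of the set with the smallest summed distance to the set."""
    return min(strings, key=lambda s: sum(levenshtein(s, t) for t in strings))


def to_multistring(s: str, alphabet: str) -> List[Dict[str, float]]:
    """Assumed encoding: one probability vector per position.

    An ordinary string becomes a multistring whose vectors put all the
    probability mass on the observed symbol.
    """
    return [{c: (1.0 if c == ch else 0.0) for c in alphabet} for ch in s]


if __name__ == "__main__":
    data = ["abc", "abd", "xbc", "abcd"]
    print(set_median(data))
    print(to_multistring("ab", "ab"))
```

Restricting the search to the strings in the data keeps the cost at a quadratic number of distance computations, which is why the set-median is the usual fallback when the unrestricted median is intractable.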

Keywords

Median String, (Multi)string, Stochastic Edit Distance

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Cristian Olivares-Rodríguez (1)
  • Jose Oncina (1)
  1. Departamento de lenguajes y sistemas informáticos, Universidad de Alicante, Spain
