
Similarity Measures and Clustering of String Patterns

  • Chapter
Pattern Recognition and String Matching

Part of the book series: Combinatorial Optimization (COOP, volume 13)

Abstract

Clustering is a powerful tool for revealing the intrinsic organization of data. A clustering of structural patterns consists of an unsupervised association of data based on the similarity of their structures and primitives. This chapter addresses the problem of structural clustering and presents an overview of similarity measures used in this context. The distinction between string matching and structural resemblance is stressed. The hierarchical agglomerative clustering concept and a partitional approach are explored in a comparative study of several dissimilarity measures: minimum code length based measures; dissimilarity based on the concept of reduction in grammatical complexity; and error-correcting parsing.
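
As a rough illustration of hierarchical agglomerative clustering driven by a string dissimilarity, the following sketch may help. It is not taken from the chapter: it assumes a unit-cost Levenshtein distance normalized by the longer string's length as a stand-in for a string-matching measure, a single-link merging rule, and a hypothetical stopping threshold. The function names are illustrative only, and none of the chapter's own measures (minimum code length, grammatical complexity, error-correcting parsing) is implemented here.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def dissimilarity(a: str, b: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def single_link_clusters(strings, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [[s] for s in strings]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single link: cluster distance is the smallest pairwise distance.
            d = min(dissimilarity(x, y)
                    for x in clusters[i] for y in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


if __name__ == "__main__":
    data = ["abab", "ababab", "abababab", "cdcd", "cdcdcd"]
    print(single_link_clusters(data, threshold=0.5))
```

Under this assumed measure, "abab" and "abababab" are at dissimilarity 0.5 even though both repeat the same "ab" unit, which hints at why a pure string-matching distance can diverge from structural resemblance.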




Copyright information

© 2003 Kluwer Academic Publishers

About this chapter

Cite this chapter

Fred, A. (2003). Similarity Measures and Clustering of String Patterns. In: Chen, D., Cheng, X. (eds) Pattern Recognition and String Matching. Combinatorial Optimization, vol 13. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-0231-5_7


  • DOI: https://doi.org/10.1007/978-1-4613-0231-5_7

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-7952-2

  • Online ISBN: 978-1-4613-0231-5

  • eBook Packages: Springer Book Archive
