Abstract
Clustering is a powerful tool for revealing the intrinsic organization of data. A clustering of structural patterns consists of an unsupervised association of data based on the similarity of their structures and primitives. This chapter addresses the problem of structural clustering and presents an overview of similarity measures used in this context. The distinction between string matching and structural resemblance is stressed. The hierarchical agglomerative clustering concept and a partitional approach are explored in a comparative study of several dissimilarity measures: minimum code length based measures; dissimilarity based on the concept of reduction in grammatical complexity; and error-correcting parsing.
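To make the idea of hierarchical agglomerative clustering over a string dissimilarity concrete, the following is a minimal sketch, not the chapter's method. It assumes a unit-cost Levenshtein distance divided by the longer string length as a stand-in dissimilarity (a plain string-matching measure, not one of the minimum code length, grammatical complexity, or error-correcting parsing measures named above) and a single-link merging rule. The function names `levenshtein`, `normalized`, and `single_link`, and the toy patterns, are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def normalized(a: str, b: str) -> float:
    """Edit distance divided by the longer length (an illustrative choice)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def single_link(strings, k, dissimilarity=normalized):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[s] for s in strings]
    while len(clusters) > k:
        # Single link: cluster distance is the smallest pairwise dissimilarity.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(
                dissimilarity(x, y)
                for x in clusters[p[0]]
                for y in clusters[p[1]]
            ),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Toy patterns: repetitions of "ab" versus "c" followed by a run of "d".
patterns = ["abab", "ababab", "abababab", "cdd", "cddd", "cdddd"]
result = single_link(patterns, k=2)
print(result)
```

On these toy patterns the two families share no symbols, so the edit distance separates them easily. Strings generated by the same rule but with very different lengths (for example many more repetitions of "ab") would score as far apart under this measure, which is one way to see why the abstract distinguishes string matching from structural resemblance.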
Copyright information
© 2003 Kluwer Academic Publishers
About this chapter
Cite this chapter
Fred, A. (2003). Similarity Measures and Clustering of String Patterns. In: Chen, D., Cheng, X. (eds) Pattern Recognition and String Matching. Combinatorial Optimization, vol 13. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-0231-5_7
DOI: https://doi.org/10.1007/978-1-4613-0231-5_7
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7952-2
Online ISBN: 978-1-4613-0231-5
eBook Packages: Springer Book Archive