Similarity Measures and Clustering of String Patterns

Fred, Ana

doi:10.1007/978-1-4613-0231-5_7

Part of the book series: Combinatorial Optimization (COOP, volume 13)

Abstract

Clustering is a powerful tool for revealing the intrinsic organization of data. Clustering structural patterns means grouping data, without supervision, according to the similarity of their structures and primitives. This chapter addresses the problem of structural clustering and presents an overview of the similarity measures used in this context. The distinction between string matching and structural resemblance is stressed. Hierarchical agglomerative clustering and a partitional approach are explored in a comparative study of several dissimilarity measures: measures based on minimum code length, a dissimilarity based on the reduction in grammatical complexity, and error-correcting parsing.
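
As a rough illustration of the general idea of agglomerative clustering over a string dissimilarity, the following minimal sketch may help. It is not the chapter's method: it assumes a plain unit-cost Levenshtein distance, normalized by the longer string's length, as a stand-in for the string-matching family of measures, and it uses single-link merging with a hypothetical stopping `threshold`. The function names and the sample strings are illustrative assumptions; the minimum code length, grammatical complexity, and error-correcting parsing measures named in the abstract are not reproduced here.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def single_link_clusters(strings, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [[s] for s in strings]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single link: cluster distance is the smallest pairwise distance.
            d = min(normalized_distance(x, y)
                    for x in clusters[i] for y in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


if __name__ == "__main__":
    samples = ["abab", "ababab", "abababab", "cccd", "ccccd", "cccccd"]
    for group in single_link_clusters(samples, threshold=0.5):
        print(group)
```

The sample strings are chosen so that each group repeats one simple pattern at different lengths. A pure edit distance still penalizes the length difference between such strings, which hints at why the abstract separates string matching from structural resemblance.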

Author information

Authors and Affiliations

Telecommunications Institute, Instituto Superior Técnico, Technical University of Lisbon, Portugal
Ana Fred

Editor information

Editors and Affiliations

University of Wisconsin — Green Bay, Green Bay, WI, USA
Dechang Chen
The George Washington University, Washington DC, USA
Xiuzhen Cheng

About this chapter

Cite this chapter

Fred, A. (2003). Similarity Measures and Clustering of String Patterns. In: Chen, D., Cheng, X. (eds) Pattern Recognition and String Matching. Combinatorial Optimization, vol 13. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-0231-5_7

DOI: https://doi.org/10.1007/978-1-4613-0231-5_7
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7952-2
Online ISBN: 978-1-4613-0231-5
eBook Packages: Springer Book Archive
