Abstract
In information retrieval and classification, both the relevance of the results and the efficiency of the computation are strongly influenced by the distance measure used to compare data. Conventional distance measures, including Hamming distance (HD) and Levenshtein distance (LD), count only the number of mismatches (or modifications). Given a query, samples at the same distance have the same number of mismatches, but the mismatches may be distributed differently, either dispersed or blocked, so other measures must be cascaded to differentiate the samples further. Here we present a new type of distance, called transition-sensitive distances, which count as part of the distance, in addition to the number of mismatches, the cost of transitions between positionally adjacent match-mismatch pairs. This transition cost, which reflects the dispersion of mismatches, can be integrated into conventional distance measures. We introduce transition-sensitive variants of LD and HD, referred to as TLD and THD. We show that TLD and THD satisfy the metric properties, as LD and HD do, while functioning as stricter distance measures than LD and HD, respectively, in similarity search applications.
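The idea of adding a transition cost to a mismatch count can be illustrated with a minimal Python sketch for the Hamming case. The abstract does not give the exact definition of THD, so everything specific here is an assumption made for illustration: the function names, the additive form (mismatch count plus a weighted number of match/mismatch boundaries), and the hypothetical `transition_cost` weight of 0.5. The paper's actual cost function may differ, and this sketch makes no claim about metric properties.

```python
def hamming(a: str, b: str) -> int:
    """Plain Hamming distance: the number of mismatching positions."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def transition_count(a: str, b: str) -> int:
    """Number of adjacent position pairs where a match borders a mismatch."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    flags = [x != y for x, y in zip(a, b)]  # True marks a mismatch
    return sum(p != q for p, q in zip(flags, flags[1:]))


def transition_sensitive_hamming(a: str, b: str,
                                 transition_cost: float = 0.5) -> float:
    """Illustrative THD-like score (assumed form, not the paper's definition):
    mismatches plus a hypothetical per-transition weight."""
    return hamming(a, b) + transition_cost * transition_count(a, b)


if __name__ == "__main__":
    query = "abcdefgh"
    blocked = "abXXXfgh"    # three mismatches in one contiguous block
    dispersed = "aXcXeXgh"  # three mismatches spread out

    for sample in (blocked, dispersed):
        print(sample,
              hamming(query, sample),
              transition_count(query, sample),
              transition_sensitive_hamming(query, sample))
```

Both samples are at Hamming distance 3 from the query, but the dispersed sample has more match/mismatch boundaries than the blocked one, so under this assumed scoring it lands farther away. That is the kind of differentiation the abstract describes, without cascading a second measure.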
Copyright information
© 2014 Springer International Publishing Switzerland
Cite this paper
Yoshida, K. (2014). Transition-Sensitive Distances. In: Traina, A.J.M., Traina, C., Cordeiro, R.L.F. (eds) Similarity Search and Applications. SISAP 2014. Lecture Notes in Computer Science, vol 8821. Springer, Cham. https://doi.org/10.1007/978-3-319-11988-5_13
DOI: https://doi.org/10.1007/978-3-319-11988-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11987-8
Online ISBN: 978-3-319-11988-5
eBook Packages: Computer Science (R0)