Abstract
Many problems in information retrieval and related fields depend on a reliable measure of the distance or similarity between objects that, most frequently, are represented as vectors. This paper considers vectors of bits. Such data structures implement entities as diverse as bitmaps that indicate the occurrences of terms and bitstrings indicating the presence of edges in images. For such applications, a popular distance measure is the Hamming distance. The value of the Hamming distance for information retrieval applications is limited by the fact that it counts only exact matches, whereas in information retrieval, corresponding bits that are close by can still be considered to be almost identical. We define a “Generalized Hamming distance” that extends the Hamming concept to give partial credit for near misses, and suggest a dynamic programming algorithm that permits it to be computed efficiently. We envision many uses for such a measure. In this paper we define and prove some basic properties of the “Generalized Hamming distance”, and illustrate its use in the area of object recognition. We evaluate our implementation in a series of experiments, using autonomous robots to test the measure's effectiveness in relating similar bitstrings.
Article PDF
Similar content being viewed by others
References
Adriaans P and Zantinge D (1996) Data Mining. Addison Wesley Longman Ltd, Harlow, England.
Boykov Y and Huttenlocher D (1999) A new Bayesian framework for object recognition. In: proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society.
Bookstein A and Klein ST (1990) Information retrieval tools for literary analysis. In: Tjoa AM, Ed., Database and Expert Systems Applications. Springer-Verlag, Vienna, pp. 1–7.
Bookstein A and Klein ST (1991) Compression of correlated bit-vectors. Information Systems, 16:387–400.
Bookstein A, Klein ST and Raita T (1998) Clumping properties of content-bearing words. Journal of the American Society for Information Science, 49:102–114.
Bookstein A, Klein ST and Raita T (2001) Fuzzy hamming distance: A new dissimilarity measure. In: Amir, Amihood and Landau, G Eds., Proceedings: 12th Annual Symposium on Combinatorial Pattern Matching (CPM2001). Jerusalem, Israel, 1-4 July, 2001.
Chang P and Krumm J (1999) Object recognition with color cooccurrence histograms. In: proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society.
Cormen TH, Leiserson CE and Rivest RL (1990) Introduction to Algorithms. MIT Press, Cambridge, MA.
Crochemore M and Rytter W (1994) Text Algorithms. Oxford University Press, New York.
Doyle L (1961) Semantic road maps for literature searchers. Journal of the ACM, 8(4):553–578.
Hamming RW (1980) Coding and Information Theory. Prentice-Hall, Englewood Cliffs, NJ.
Hearst MA (1994) Multi-paragraph segmentation of expository text. In: proc. ACL Conf. Las Cruces.
Hearst MA and Plaunt C (1993) Subtopic structuring for full-length document access. In: proc. 16-th ACM-SIGIR Conf. Pittsburgh, pp. 59–68.
Jansen BJ, Spink A, Bateman J and Saracevic T (1998) Real life information retrieval: A study of user queries on the web. SIGIR Forum, 32(1):5–18.
Knuth DE (1973) The Art of Computer Programming, Vol I, Fundamental Algorithms, Addison-Wesley, Reading, MA.
Kulyukin V and Burke A (in press) Mining free text for structure. In: wang J, Ed., Data Mining: Opportunities and Challenges, Idea Group Publishing, Pensylvania.
Kulyukin V and Morley N (2002) Integrated object recognition in the three-tiered robot architecture, In: Proc. of the International Conference on Artificial Intelligence, Las Vegas.
Kulyukin V and Zoko A (2000) Hamming distance for object recognition by mobile robots. In: Proceedings of the Research Symposium of the DePaul University School of Computer Science, Telecommunications and Information Systems (CTIRS-2000). DePaul University.
Murphy R (2000) AI Robotics. MIT Press, New York.
Parker, JR (1993) Practical Computer Vision Using C. John Wiley and Sons, New York.
Roman S (1992) Coding and Information Theory. Springer-Verlag, New York.
Sankoff D and Kruskal JB (1983) Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
Swain MJ and Ballard DH (1991) Color Indexing. International Journal of Computer Vision, 7:11–32.
Tam AM and Leung CHC (2001) Structured natural-language descriptions for semantic content retrieval of visual materials. Journal of the American Society for Information Science and Technology, 52(11):930–937.
Witten IH, Moffat A and Bell TC (1999) Managing Gigbytes. Morgan Kaufmann Publishers, San Fransisco.
Young SS, Scott PD and Nasrabadi N (1994) Multi-layer hopfield neural network. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recogntion. IEEE Computer Society.
Author information
Authors and Affiliations
Rights and permissions
About this article
Cite this article
Bookstein, A., Kulyukin, V.A. & Raita, T. Generalized Hamming Distance. Information Retrieval 5, 353–375 (2002). https://doi.org/10.1023/A:1020499411651
Issue Date:
DOI: https://doi.org/10.1023/A:1020499411651