Skip to main content

A Compressed Index for Hamming Distances

  • Conference paper
Similarity Search and Applications (SISAP 2014)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8821))

Included in the following conference series:

  • 1009 Accesses

Abstract

Some instances of multimedia data can be represented as high dimensional binary vectors under the hamming distance. The standard index used to handle queries is Locality Sensitive Hashing (LSH), reducing approximate queries to a set of exact searches. When the queries are not selective and multiple families of hashing functions are employed, or when the collection is large, LSH indexes should be stored in secondary memory, slowing down the query time.

In this paper we present a compressed LSH index, queryable without decompression and with negligible impact in query speed. This compressed representation enables larger collections to be handled in main memory with the corresponding speedup with respect to fetching data from secondary memory.

We tested the index with a real world example, indexing songs to detect near duplicates. Songs are represented using an entropy based audio-fingerprint (AFP), of independent interest.

The combination of compressed LSH and the AFP enables the retrieval of lossy compressed audio with near perfect recall at bit-rates as low as 32 kbps, packing the representation of 30+ million music tracks of standard length (which is about the total number of unique tracks of music available worldwide) in half a gigabyte of space. A sequential search for matches would take about 15 minutes; while using our compressed index, of size roughly one gigabyte, searching for a song would take a fraction of a second.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Chandrasekhar, V., Sharifi, M., Ross, D.: Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications. In: Proceedings of ISMIR (2011)

    Google Scholar 

  2. LTD, S.: (2006), http://www.shazam.com/

    Google Scholar 

  3. SoundHound (2008), http://www.soundhound.com/

  4. McFee, B., Lanckriet, G.R.G.: Large-scale music similarity search with spatial trees. In: Klapuri, A., Leider, C. (eds.) ISMIR, pp. 55–60. University of Miami (2011)

    Google Scholar 

  5. Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: Klapuri, A., Leider, C. (eds.) ISMIR, University of Miami, pp. 591–596. University of Miami (2011)

    Google Scholar 

  6. Samet, H.: Foundations of Multidimensional and Metric Data Structures, 1st edn. The morgan Kaufman Series in Computer Graphics and Geometic Modeling. Morgan Kaufmann Publishers, University of Maryland at College Park (2006)

    Google Scholar 

  7. Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001)

    Article  Google Scholar 

  8. Lu, G.: Indexing and retrieval of audio: A survey. Multimedia Tools Appl. 15(3), 269–290 (2001)

    Article  MATH  Google Scholar 

  9. Stober, S., Nürnberger, A.: Adaptive music retrieval - a state of the art. Multimedia Tools and Applications (2012), ’Online First’ article

    Google Scholar 

  10. Yan, R., Huet, B., Sukthankar, R.: Large-scale multimedia retrieval and mining. IEEE Multimedia 18, 11–13 (2011)

    Article  Google Scholar 

  11. Camarena-Ibarrola, A., Chavez, E.: A robust entropy-based audio-fingerprint. In: Proceedings of the International Conference on Multimedia and Expo, ICME 2006, pp. 1729–1732 (2006)

    Google Scholar 

  12. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases, VLDB 1999, pp. 518–529. Morgan Kaufmann Publishers Inc., San Francisco (1999)

    Google Scholar 

  13. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications ACM 51, 117–122 (2008)

    Article  Google Scholar 

  14. González, R., Grabowski, S., Mäkinen, V., Navarro, G.: Practical implementation of rank and select queries. In: Poster Proc. Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA), pp. 27–38. CTI Press and Ellinika Grammata, Greece (2005)

    Google Scholar 

  15. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 233–242. ACM/SIAM, San Francisco (2002)

    Google Scholar 

  16. Claude, F., Navarro, G.: Practical rank/select queries over arbitrary sequences. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 176–187. Springer, Heidelberg (2008)

    Chapter  Google Scholar 

  17. Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proceedings of the Workshop on Algorithm Engineering and Experiments, ALENEX 2007. SIAM, New Orleans (2007)

    Google Scholar 

  18. Tellez, E.S.: Practical Proximity Searching in Large Metric Databases. PhD thesis, Universidad Michoacana, Morelia, Michoacán, México (July 2012)

    Google Scholar 

  19. Tellez, E.S., Chavez, E., Navarro, G.: Succinct nearest neighbor search. Information Systems 38(7), 1019–1030 (2013)

    Article  Google Scholar 

  20. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2003, pp. 841–850. Society for Industrial and Applied Mathematics, Philadelphia (2003)

    Google Scholar 

  21. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) (2007)

    Google Scholar 

  22. Golynski, A., Munro, J.I., Rao, S.S.: Rank/select operations on large alphabets: a tool for text indexing. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, SODA 2006, pp. 368–373. ACM, New York (2006)

    Chapter  Google Scholar 

  23. Munro, J.I., Raman, R., Raman, V., Rao, S.S.: Succinct representations of permutations. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003. LNCS, vol. 2719, pp. 345–356. Springer, Heidelberg (2003)

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Santoyo, F., Chávez, E., Téllez, E.S. (2014). A Compressed Index for Hamming Distances. In: Traina, A.J.M., Traina, C., Cordeiro, R.L.F. (eds) Similarity Search and Applications. SISAP 2014. Lecture Notes in Computer Science, vol 8821. Springer, Cham. https://doi.org/10.1007/978-3-319-11988-5_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-11988-5_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11987-8

  • Online ISBN: 978-3-319-11988-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics