Multimedia Tools and Applications

Volume 71, Issue 3, pp 1333–1362

MI-File: using inverted files for scalable approximate similarity search

  • Giuseppe Amato
  • Claudio Gennaro
  • Pasquale Savino
Abstract

We propose a new, efficient, and accurate technique for generic approximate similarity search based on inverted files. Each object in a dataset is represented by the ordering of a set of reference objects according to their distance from the object itself. To compare two objects in the dataset, we compare their corresponding orderings of the reference objects. We show that this representation lets us use inverted files to obtain, very efficiently, a very small set of good candidates for the query result. The candidate set is then reordered using the original similarity function to obtain the approximate similarity search result. The proposed technique performs several orders of magnitude better than exact similarity search while still maintaining high accuracy. To demonstrate the scalability of the approach, tests were run on datasets ranging in size from 200,000 to 100 million objects.
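
The pipeline described in the abstract (orderings of reference objects, an inverted file over those orderings, and a final reranking with the original function) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the truncation of each ordering to its `k` nearest references, and the footrule-style score used to compare orderings are all assumptions made for illustration; the abstract does not state which measure is used.

```python
import math
from collections import defaultdict


def ordering(obj, refs, dist, k):
    """Indices of the k reference objects closest to obj, nearest first."""
    ranked = sorted(range(len(refs)), key=lambda i: dist(obj, refs[i]))
    return ranked[:k]


def build_index(data, refs, dist, k):
    """Inverted file: reference index -> list of (object id, position).

    'position' is the rank of that reference in the object's ordering.
    """
    index = defaultdict(list)
    for obj_id, obj in enumerate(data):
        for pos, ref in enumerate(ordering(obj, refs, dist, k)):
            index[ref].append((obj_id, pos))
    return index


def search(query, data, refs, dist, index, k, n_candidates, n_results):
    """Approximate search: pick candidates by ordering, rerank by dist."""
    q_order = ordering(query, refs, dist, k)
    # Assumed score (footrule-style): every query reference missing from an
    # object's ordering costs k; a shared reference costs the rank difference.
    scores = defaultdict(lambda: k * len(q_order))
    for q_pos, ref in enumerate(q_order):
        for obj_id, pos in index.get(ref, ()):
            scores[obj_id] += abs(q_pos - pos) - k
    # Keep only the best-scoring objects as the candidate set.
    candidates = sorted(scores, key=scores.get)[:n_candidates]
    # Reorder the candidates with the original distance function.
    return sorted(candidates, key=lambda i: dist(query, data[i]))[:n_results]


if __name__ == "__main__":
    # Toy data: points on a 10 x 10 grid with Euclidean distance.
    data = [(x, y) for x in range(10) for y in range(10)]
    refs = [(0, 0), (9, 9), (0, 9), (9, 0), (4, 5), (2, 7), (7, 2), (5, 5)]
    index = build_index(data, refs, math.dist, k=4)
    hits = search((3, 3), data, refs, math.dist, index,
                  k=4, n_candidates=20, n_results=3)
    print([data[i] for i in hits])
```

Only the posting lists of the query's `k` nearest references are read, so the original distance function is evaluated on at most `n_candidates` objects rather than on the whole dataset. In this sketch, `n_candidates` is the knob that trades accuracy for speed.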

Keywords

  • Similarity searching
  • Access methods
  • Multimedia information retrieval

Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  • Giuseppe Amato (1)
  • Claudio Gennaro (1)
  • Pasquale Savino (1)

  1. ISTI-CNR, Pisa, Italy