Building a web-scale image similarity search system

Abstract

As the number of digital images grows rapidly and Content-based Image Retrieval (CBIR) gains in popularity, CBIR systems need to scale to Web-sized datasets. In this paper, we report on our experience in building an experimental similarity search system on a test collection of more than 50 million images. The first major challenge was obtaining a collection of images of this scale together with the corresponding descriptive features. We tackled the non-trivial process of image crawling and extraction of several MPEG-7 descriptors. The result of this effort is a test collection, the first of such scale, open to the research community for experiments and comparisons. The second challenge was to develop indexing and searching mechanisms able to scale to the target size and to answer similarity queries in real time. We achieved this goal by creating sophisticated centralized and distributed structures based purely on the metric space model of data. Combining them has resulted in an extremely flexible and scalable solution. In this paper, we study in detail the performance of this technology and its evolution as the data volume grows by three orders of magnitude. The experimental results are very encouraging and promising for future applications.
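
To make the phrase "metric space model" concrete, the following is a minimal sketch of how a range query can exploit the triangle inequality to skip distance computations. It is not the system described in the abstract: the class PivotIndex, its method range_query, and the use of Euclidean distance on small tuples are all hypothetical choices made for illustration, and a plain pivot table stands in for the centralized and distributed structures the authors built.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Metric = Callable[[Vector, Vector], float]


def euclidean(a: Vector, b: Vector) -> float:
    """A simple metric; any function satisfying the metric axioms would do."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class PivotIndex:
    """Hypothetical pivot table: stores the distance of every object to each pivot."""

    def __init__(self, objects: List[Vector], pivots: List[Vector], dist: Metric):
        self.objects = list(objects)
        self.pivots = list(pivots)
        self.dist = dist
        # Precompute d(object, pivot) once, at indexing time.
        self.table = [[dist(o, p) for p in self.pivots] for o in self.objects]

    def range_query(self, query: Vector, radius: float) -> Tuple[List[Tuple[float, Vector]], int]:
        """Return (matches sorted by distance, number of object distances computed)."""
        q_to_p = [self.dist(query, p) for p in self.pivots]
        results = []
        computed = 0
        for obj, row in zip(self.objects, self.table):
            # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o) for every pivot p,
            # so the largest such difference is a lower bound on d(q, o).
            lower = max(abs(qp - op) for qp, op in zip(q_to_p, row))
            if lower > radius:
                continue  # the object cannot be within the radius; skip the real distance
            computed += 1
            d = self.dist(query, obj)
            if d <= radius:
                results.append((d, obj))
        results.sort(key=lambda pair: pair[0])
        return results, computed


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
    index = PivotIndex(data, pivots=[(0.0, 0.0), (10.0, 10.0)], dist=euclidean)
    matches, computed = index.range_query((0.5, 0.0), radius=1.0)
    print(matches, computed)
```

The filter only relies on the metric axioms, never on coordinates, which is why the same idea carries over to any descriptor with a proper distance function. The pruning is exact: an object is discarded only when its lower bound already exceeds the query radius.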


Notes

  1. http://www.enterprisestrategygroup.com/
  2. EU IST FP6 project 045128: Search on Audio-visual content using Peer-to-peer IR
  3. http://www.flickr.com/
  4. http://www.exalead.com/
  5. http://www.picsearch.com/
  6. http://www.espgame.org/
  7. http://images.google.com/imagelabeler/
  8. http://www.flickr.com/map/
  9. http://www.tiltomo.com/
  10. http://media-vibrance.itn.liu.se/vinnova/cse.php
  11. http://labs.ideeinc.com/
  12. http://www.alipr.com/
  13. http://www.ist-chorus.org/
  14. http://www.flickr.com/services/api/
  15. http://www.diligentproject.org/
  16. http://cophir.isti.cnr.it/
  17. http://meta.cesnet.cz

Acknowledgements

This research was supported by the EU IST FP6 project 045128 (SAPIR) and national projects GACR 201/08/P507, GACR 201/09/0683, GACR 102/09/H042, and MSMT 1M0545. Hardware infrastructure was provided by MetaCenter (see Note 17) and by an IBM SUR Award.

Author information

Corresponding author

Correspondence to Jan Sedmidubsky.

About this article

Cite this article

Batko, M., Falchi, F., Lucchese, C. et al. Building a web-scale image similarity search system. Multimed Tools Appl 47, 599–629 (2010). https://doi.org/10.1007/s11042-009-0339-z

Keywords

  • Similarity search
  • Content-based image retrieval
  • Metric space
  • MPEG-7 descriptors
  • Peer-to-peer search network