Multimedia Tools and Applications, Volume 47, Issue 3, pp 599–629

Building a web-scale image similarity search system

  • Michal Batko
  • Fabrizio Falchi
  • Claudio Lucchese
  • David Novak
  • Raffaele Perego
  • Fausto Rabitti
  • Jan Sedmidubsky
  • Pavel Zezula


As the number of digital images grows rapidly and Content-based Image Retrieval (CBIR) gains popularity, CBIR systems need to scale to Web-size datasets. In this paper, we report on our experience in building an experimental similarity search system on a test collection of more than 50 million images. The first major challenge was obtaining a collection of images of this scale together with the corresponding descriptive features. We tackled the non-trivial process of crawling the images and extracting several MPEG-7 descriptors. The result of this effort is a test collection, the first of such scale, open to the research community for experiments and comparisons. The second challenge was to develop indexing and searching mechanisms that scale to the target size and answer similarity queries in real time. We achieved this goal by creating sophisticated centralized and distributed structures based purely on the metric space model of data. Combining them has resulted in an extremely flexible and scalable solution. In this paper, we study in detail the performance of this technology and how it evolves as the data volume grows by three orders of magnitude. The results of the experiments are very encouraging and promising for future applications.
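To make the phrase "metric space model" concrete, the following is a minimal sketch of how a distance function that satisfies the triangle inequality lets a search skip objects without computing their distance to the query. It is not the indexing structure described in the paper: the single-pivot filter, the Euclidean distance on 2-D points, and all names (`euclidean`, `range_query`, `pivot_dists`) are assumptions chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance, used here as an example of a metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(
    query: Vector,
    radius: float,
    data: List[Vector],
    pivot: Vector,
    pivot_dists: List[float],
    dist: Callable[[Vector, Vector], float] = euclidean,
) -> Tuple[List[Vector], int]:
    """Return the objects within `radius` of `query`, plus the number of
    query-to-object distances that actually had to be computed."""
    d_qp = dist(query, pivot)
    hits: List[Vector] = []
    computed = 0
    for obj, d_op in zip(data, pivot_dists):
        # Triangle inequality: |d(q,p) - d(o,p)| <= d(q,o).
        # If the lower bound already exceeds the radius, skip the object.
        if abs(d_qp - d_op) > radius:
            continue
        computed += 1
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, computed


# Hypothetical toy data; a real descriptor would have many more dimensions.
data = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
pivot = (0.0, 0.0)
pivot_dists = [euclidean(o, pivot) for o in data]  # precomputed at index time
hits, computed = range_query((1.0, 1.0), 1.5, data, pivot, pivot_dists)
```

In this toy run the two distant points are excluded using only the precomputed pivot distances, so the filter needs nothing beyond the metric postulates. The same property is what allows a metric-based approach to work with arbitrary distance functions rather than only with vector coordinates.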


  • Similarity search
  • Content-based image retrieval
  • Metric space
  • MPEG-7 descriptors
  • Peer-to-peer search network



This research was supported by the EU IST FP6 project 045128 (SAPIR) and national projects GACR 201/08/P507, GACR 201/09/0683, GACR 102/09/H042, and MSMT 1M0545. Hardware infrastructure was provided by MetaCenter and by an IBM SUR Award.



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Michal Batko (1)
  • Fabrizio Falchi (2)
  • Claudio Lucchese (2)
  • David Novak (1)
  • Raffaele Perego (2)
  • Fausto Rabitti (2)
  • Jan Sedmidubsky (1)
  • Pavel Zezula (1)

  1. Masaryk University, Brno, Czech Republic
  2. ISTI-CNR, Pisa, Italy
