As the number of digital images is growing fast and Content-based Image Retrieval (CBIR) is gaining in popularity, CBIR systems should leap towards Web-scale datasets. In this paper, we report on our experience in building an experimental similarity search system on a test collection of more than 50 million images. The first big challenge we faced was obtaining a collection of images of this scale with the corresponding descriptive features. We have tackled the non-trivial process of image crawling and extraction of several MPEG-7 descriptors. The result of this effort is a test collection, the first of such a scale, open to the research community for experiments and comparisons. The second challenge was to develop indexing and searching mechanisms able to scale to the target size and to answer similarity queries in real time. We have achieved this goal by creating sophisticated centralized and distributed structures based purely on the metric space model of data. Joining them together has resulted in an extremely flexible and scalable solution. In this paper, we study in detail the performance of this technology and its evolution as the data volume grows by three orders of magnitude. The experimental results are encouraging and promising for future applications.
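To make the phrase "based purely on the metric space model" concrete, the following is a minimal sketch of a similarity range query that relies only on a distance function and the triangle inequality. It is an illustration of generic pivot filtering, not the centralized or distributed structures built in the paper. The names `PivotTable`, `range_query`, and `manhattan` are hypothetical, and the Manhattan distance over small integer vectors is an assumed stand-in for the distances used on MPEG-7 descriptors.

```python
def manhattan(a, b):
    """L1 distance; an assumed stand-in for a descriptor distance (it is a metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


class PivotTable:
    """Hypothetical in-memory index: stores the distance of every object to a few pivots."""

    def __init__(self, objects, pivots, dist):
        self.objects = list(objects)
        self.pivots = list(pivots)
        self.dist = dist
        # Precompute d(o, p) for every object o and pivot p at build time.
        self.table = [[dist(o, p) for p in self.pivots] for o in self.objects]

    def range_query(self, query, radius):
        """Return (sorted hits, number of object distances actually computed)."""
        q_to_pivots = [self.dist(query, p) for p in self.pivots]
        hits, computed = [], 0
        for obj, row in zip(self.objects, self.table):
            # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o).
            # If the lower bound exceeds the radius, obj cannot qualify.
            if any(abs(qp - op) > radius for qp, op in zip(q_to_pivots, row)):
                continue
            computed += 1
            d = self.dist(query, obj)
            if d <= radius:
                hits.append((d, obj))
        hits.sort()
        return hits, computed


# Usage with synthetic integer vectors (illustrative data only).
data = [(i * 7 % 101, i * 13 % 97, i * 29 % 89) for i in range(200)]
index = PivotTable(data, pivots=data[:3], dist=manhattan)
hits, computed = index.range_query((50, 50, 50), radius=40)
print(len(hits), "hits using", computed, "of", len(data), "distance computations")
```

Because the filter uses only metric properties, the same code works for any data type with a valid distance function, which is the flexibility the metric space model is chosen for.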
EU IST FP6 project 045128: Search on Audio-visual content using Peer-to-peer IR
This research was supported by the EU IST FP6 project 045128 (SAPIR) and national projects GACR 201/08/P507, GACR 201/09/0683, GACR 102/09/H042, and MSMT 1M0545. Hardware infrastructure was provided by MetaCenter and by an IBM SUR Award.
About this article
Cite this article
Batko, M., Falchi, F., Lucchese, C. et al. Building a web-scale image similarity search system. Multimed Tools Appl 47, 599–629 (2010). https://doi.org/10.1007/s11042-009-0339-z
Keywords
- Similarity search
- Content-based image retrieval
- Metric space
- MPEG-7 descriptors
- Peer-to-peer search network