Abstract
With the increase of digital media databases, the need for methods that can allow the user to efficiently peruse them has risen dramatically. This paper studies how to explore image datasets more efficiently in online content-based image retrieval (CBIR). We present a new approach for exploratory CBIR that is dynamic, robust and gives a good coverage of the search space, while maintaining a high retrieval precision. Our method uses deep similarity-based learning to find a new representation of the image space. With this metric, it finds the central point of interest and clusters its local region to present the user with representative images within the vicinity of their target search. This clustering provides a more varied training set for the next iteration, allowing the location of relevant features faster. Additionally, relearning a representation of the user’s search interest in each round enables the system to find other non-local regions of interest in the search space, thus preventing the user from getting stuck in a context trap. We test our method in a simulated online setting, taking into consideration the accuracy, coverage and flexibility of adapting to changes in the user’s interest.
1 Introduction
Actively learning the user’s search target is an important aspect of content-based image retrieval (CBIR) techniques [11]. Traditionally, image retrieval techniques were dependent on meta-data or otherwise restrictive features, which limited the possible directions of search. In recent years, there has been an interest in developing methods that dynamically react to changes in the semantic target search of the user. A related issue is system responsiveness, both in terms of processing time and the amount of user feedback required for the system to converge to the user’s search target.
There are many scenarios that would benefit from a highly dynamic and fast image retrieval system, which can adjust to changes in what the user’s ideal target image is. One example is an artist or journalist browsing stock photos. They might be looking for an image with a particular mood, which cannot be easily captured with tags, and they will only know they have found the right image once they see it. This would require the system to be able to learn specific features from the data throughout a given search session even though they may not have been in the original training set. Another case could be a doctor searching through a medical image database for images similar to an image of their patient’s condition.
In each scenario, the system has to be responsive and answer to the needs of the user immediately, providing relevant search results after a few training examples [32]. Furthermore, the user might not be sure what they are looking for at the beginning of the search session, but refines their target as they see more images from a given dataset. Covering the various types of images present in a given dataset in a fair manner is important to avoid the context trap [17], where the user “gets stuck” in a single location in the search space. Sometimes the user may simply wish to browse the contents of the database, knowing their target only once they see it.
With these challenges in mind, we introduce a framework for exploratory CBIR that tackles all the above challenges. Our system flexibly relearns the relevant features from a simple binary feedback within a few search iterations, while balancing a good coverage of the database and maintaining high accuracy. Furthermore, the suggested framework allows the user to peruse images from unannotated databases. We also propose a metric to measure the coverage of the search space suitable for datasets of any size.
Our architecture utilizes pre-extracted features from deep neural networks, on top of which we learn a new distance representation of the images based on the user’s relevance feedback. For the distance measure, we use the Siamese architecture, which was originally used for face verification [5]. The method then uses these distances to find interesting regions in the data set, identifying an efficient ranking for each image to help the user to learn what is available. These regions are found by clustering the images in the new representation and then choosing central images from each cluster to represent all the images in a given region. When the user chooses an image from one of these regions, the search may proceed faster towards the images most relevant to the task at hand.
2 Related Work
The first CBIR experiments date back to 1992 [16]. Since then, a variety of feature descriptors or local representations have been used for image representation, such as color, edge, texture and GIST [21], as well as local feature representations, such as the bag-of-words models [31] in conjunction with local feature descriptors (e.g. SIFT [20]). However, using such low-level feature representations may not always be optimal for more complex image retrieval tasks due to the semantic gap between such features and high-level human perception. Hence, in recent years there has been an increased interest in developing similarity measures specifically for such low-level feature image representations [3] as well as enhancing the feature representation in distance metric learning [28].
Over the past decade deep neural networks have been successfully utilized in CBIR tasks, bringing strong feature representations to the field [22]. For example, learning deep hierarchies that formulate short binary codes for image data sets that allow fast retrieval was considered on two occasions: by using autoencoders to automatically formulate a structure for the data [19], or creating hash codes based on deep semantic ranking [33]. While both methods are fast, neither is flexible enough to learn the image target based on the small amount of user’s relevance feedback. Wan et al. [28] is the first study to apply deep learning to learn a similarity measure between images in a CBIR setting. The method relies on metric learning methods to learn the similarity between images. Their initial results show a potential for using deep similarity measures in image retrieval tasks.
Unfortunately, no consideration was given to the time requirements of the learning task, which is an important aspect of online interactive retrieval systems. A comprehensive case-study of deep learning in CBIR was conducted in 2014 [29]. Many state-of-the-art techniques were utilized in four different image retrieval tasks, showing the usability of convolutional networks in CBIR. These methodologies are similar to our baseline setups, utilizing similar feature representations and ranking principles.
Exploratory search for image retrieval is a field of research that has seen more activity in recent years [8, 10, 12, 13, 14, 18, 27]. There are a number of approaches comparable to our work. AIDE [6] is a framework for generic information retrieval settings in metric environments, which is also applicable to image retrieval. The method employs several exploration phases for even sampling of the search space, giving a good overview of the items present. The iconic images framework [1] shows clusters of images to the user. This setup finds images that represent existing categories in the search space by finding the most salient and complete examples of possible targets. In [2] a kernel-based method was suggested for image set exploration, which allowed users to utilize image representation, summarization and visualization. Although relevant to our work, their method focused solely on browsing the system. CLUE [4] is a framework for CBIR that utilizes graph-theoretic clustering to find similarities between images, providing a good comparison for images.
Previous work focused mainly on non-exploratory methodologies, which work best in simple look-up scenarios, or exploratory methodologies that use existing or rigid structures to measure the search space. The aim of our work is to create an exploratory CBIR system, which requires little preprocessing to start with, and is not constrained by predefined structures. Instead, it learns the user’s interests dynamically during the search. Due to this, we assess the quality of exploration by comparing jointly both the precision and the coverage of the retrieved images. This ensures that the user does not only find a local cluster of good images within the search space, but also obtains a more comprehensive view of the search space.
3 System Overview
Our system assists the user in finding relevant images from databases that have had minimal preprocessing done to them beforehand. The user may feed an example image into the system at the beginning of the search to speed up the process of finding relevant images. Next, at each search iteration the user is presented with k images and indicates which ones are relevant to their search. The remaining images in the set of k images that did not receive any feedback are treated as irrelevant. Based on this feedback, all the images in the dataset are re-ranked, providing a more refined representation of what is relevant for the user. The system aims to identify the subspace of the dataset that is relevant for the user in as few iterations as possible. We do this by exploring regions near the images already tagged as relevant.
3.1 Feature Extraction
In order to obtain a good base representation, we use features extracted with OverFeat [26] and relearn the last fully connected layers as the target representation. OverFeat is a publicly available convolutional neural network, trained with the ILSVRC13 data set [24], on which it achieved an error rate of \(14.2\%\). ILSVRC13 contains 1000 object classes from a total of 1.2 million images. OverFeat has been shown to be successful in various image recognition tasks from fine-grained classification to generic visual instance recognition tasks [23]. The selected features were the hidden nodes of layer 7, where the fully connected layers begin (layer 19 within the architecture), totalling 4096 features. The images were shrunk and then cropped from all sides to produce images of equal size of \(231 \times 231\) pixels.
3.2 Siamese Architecture
Our system employs the Siamese architecture [5], which is used to learn a similarity metric between images. This is done by maximizing the pair-wise distance of dissimilar images, and minimizing it for similar images. The resulting representation maps images into a Euclidean space where images from one class are grouped together, while being separated from images from other classes. We employ user relevance feedback to divide the presented images into the two classes, i.e. images with positive feedback (relevant class) and images with negative feedback (non-relevant class).
The Siamese architecture D consists of two networks, which take as input the images \(X_1\) and \(X_2\), respectively. The networks share their weights W, which are trained to learn a location in the new metric representation. The system uses a contrastive loss function:
\[ L(W, Y, X_1, X_2) = Y \, \frac{1}{2} D_W^2 + (1 - Y) \, \frac{1}{2} \left\{ \max (0,\, m - D_W) \right\}^2 \]
where \(Y = 1\) if the two images are from the same class, and \(Y = 0\) if they are from different classes, \(D_W\) is the network’s predicted distance between \(X_1\) and \(X_2\) given W, and m is a margin we wish to keep between different classes of images. This metric aims to maximize the intra-class similarity, in the case where \(X_1\) and \(X_2\) belong to the same class, and to minimize the inter-class similarity if they belong to different classes.
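As an illustration, the following is a minimal sketch of a contrastive loss for a single pair, assuming the standard form from [5] with the label convention used here (\(Y = 1\) for a same-class pair, \(Y = 0\) otherwise). The function names, the use of plain lists as embeddings and the default margin of 1.0 are assumptions made for the example, not details taken from the system itself.

```python
import math


def contrastive_loss(emb1, emb2, same_class, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    emb1, emb2 : outputs of the shared-weight network for X_1 and X_2
                 (plain lists of floats in this sketch).
    same_class : True for Y = 1 (same class), False for Y = 0.
    margin     : the margin m; 1.0 is an illustrative default.
    """
    d = math.dist(emb1, emb2)  # D_W, here a Euclidean distance
    if same_class:
        # Same-class pairs are pulled together: any distance is penalised.
        return 0.5 * d ** 2
    # Different-class pairs are pushed apart until they are at least
    # `margin` away from each other; beyond that they cost nothing.
    return 0.5 * max(0.0, margin - d) ** 2
```

For a batch of feedback pairs, such per-pair losses would typically be averaged before the shared weights W are updated.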
The Siamese architecture (Right side in Fig. 1) is able to find a new representation in the feature space that helps to distinguish between different aspects of the image, making it an ideal choice for our application. An important aspect of this architecture is that it generates a distance metric, which may be used to rank or generate relevance scores for all the images in a dataset.
3.3 Exploratory Search
As the Siamese neural network produces a metric representation for the images, our exploratory methodologies are able to separate data points spatially into regions of interest. The exploratory methodologies we present here affect two variables in the framework: first, what the primary point of interest in the search space is, and second, how to explore the region around this point. For our primary location, we evaluate a focal point that is far from irrelevant images and close to the center of the cluster of relevant images. From here we explore nearby groups of images to find images that cover as large a portion of images as possible.
Central target \(\gamma \) is the starting point of the exploration in our method. It is chosen to be the center of the cluster of images rated as relevant, which should be close to the highest estimated relevance according to the Siamese network. This point is found by minimizing the distance for each positively ranked image \(x_+ \in X\), while maximizing the distance to all the negatively ranked images \(x_- \in X\). Central target is thus selected by the following function:
where x is an image that may or may not have yet been ranked, \(\Vert x, x_+\Vert \) is the distance measure between x and \(x_+\), and C is a suitably selected constant. The purpose of the constant is to work as the exploratory term – moving more aggressively away from the edges of positive clusters allows the method to find the center of each relevant region faster.
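The selection of the central target can be sketched as follows. The scoring rule in this example is an assumption: each candidate is scored by its mean distance to the positively rated images minus C times its mean distance to the negatively rated images, and the candidate with the lowest score is returned. The exact way in which C enters the selection function may differ; the names `central_target`, `candidates`, `positives` and `negatives` are hypothetical.

```python
import math


def central_target(candidates, positives, negatives, c=1.0):
    """Return the index of the candidate chosen as the central target.

    Assumed score (lower is better):
        mean distance to positives - c * mean distance to negatives
    All arguments are lists of embedding vectors in the learned metric space.
    """
    def score(x):
        pos = sum(math.dist(x, p) for p in positives) / len(positives)
        # With no negative feedback yet, only the positive term is used.
        neg = (sum(math.dist(x, n) for n in negatives) / len(negatives)
               if negatives else 0.0)
        return pos - c * neg

    return min(range(len(candidates)), key=lambda i: score(candidates[i]))
```

In this sketch a larger `c` moves the chosen point further from the negatively rated images, at the cost of drifting away from the positives if it is set too high.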
After the central target has been located, we define the local region as the nearest n images. This neighbourhood is then clustered to locate images that describe the content best. Defining a range parameter such as this one allows our method to utilize the relearned metric to increase the relevance. The parameter may be chosen as a proportion of the total size of the dataset balanced with the processing capabilities of the whole system.
In our system we used DBScan [7] for clustering as it generates clusters based on the form of the data itself, creating as many clusters as there are concentrated regions. This step handles the exploratory phase of the search. As DBScan does not generate centroids, we use the image closest to the center of each cluster as its centroid. This image is presented to the user if there are enough exploratory slots left. Depending on the precision rate of the previous iteration, we explore more or fewer items close to these centroids. If the previous iteration resulted in only relevant images, no exploration is done but rather the primary central region is exploited until it is exhausted. If, on the other hand, the precision is low, we look for more images from the nearby clusters.
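The two steps described above, picking a representative image per cluster and deciding how many exploratory slots to use, can be sketched as below. The sketch assumes that cluster labels have already been computed by a density-based clustering step, with the label -1 marking noise points. The rule in `exploration_slots`, which scales the number of slots with one minus the previous precision, is an illustrative assumption; the text only specifies the two extremes (no exploration at full precision, more exploration at low precision).

```python
import math
from collections import defaultdict


def cluster_representatives(points, labels):
    """Map each cluster label to the index of the point nearest its mean.

    points : list of embedding vectors in the local region.
    labels : cluster label per point; -1 marks noise and is skipped.
    """
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:
            members[label].append(idx)

    representatives = {}
    for label, idxs in members.items():
        dim = len(points[idxs[0]])
        mean = [sum(points[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # The clustering yields no centroid, so the member closest to the
        # cluster mean stands in for it.
        representatives[label] = min(idxs, key=lambda i: math.dist(points[i], mean))
    return representatives


def exploration_slots(k, previous_precision):
    """Assumed rule: the share of exploratory slots is 1 - precision."""
    return round(k * (1.0 - previous_precision))
```

With k = 10 presented images, a previous precision of 1.0 gives zero exploratory slots, so the central region is exploited exclusively, while a precision of 0.0 devotes all slots to cluster representatives.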
Our algorithm moves around the image space due to the changes the neural net imposes on the representation. It reconfigures the center of interest at each iteration, assuming that interesting images were successfully separated from the rest.
4 Experiments
We conducted two sets of experiments with three different datasets to evaluate the applicability of the proposed systems in interactive CBIR. For this, we identified the following aspects of the system’s performance to be crucial. The system needs to work with relatively few training examples, i.e. at each search iteration, the user is presented with only a small number of images and often provides feedback only to a subset of these. The retrieval system needs to be able to “learn” what the user is looking for based on this limited feedback. The search target may be something very concrete, e.g. “red rose”, or very abstract, e.g. “happiness”, and the system needs to support all types of searches with varying degree of abstractness. Furthermore, the training time has to be very short for the system to be interactive from the user’s perspective.
We first test the overall precision and running time of our method. In the second part of our experiments, we test how well our method compares with other algorithms in a simulated setting, measuring both cumulative precision and the coverage of the search space as the user gives more feedback.
4.1 Experimental Setup
In our experiments we used three different datasets (Fig. 2), where the class labels represent the targets of the search. The first dataset consisted of 1096 images from the MIRFlickr dataset [15], which contains images of various mammals, birds, insects and vehicles. The original labels were combined into larger classes, e.g. images of hens, finches and parrots were labeled as ’birds’. The aim of this dataset is to show that our method is able to transfer learn an abstract, combinatorial concept that the original features were not meant for. Here the features for hens and parrots, for example, are distinct, and no semantic relation has been separately taught to the original feature representation; the new representation has to be able to learn it.
The next dataset was our own collection of 294 images of 6 different dog breeds, of which only four are included in OverFeat’s classification list. This dataset allows us to test whether the model is able to transfer learn the target in the presence of semantically related images, some of which are not included in the original scope of features used for training.
Lastly, we used a classical dataset to test how well the method scales to larger datasets: 100 classes from the ILSVRC2013 dataset [24], totalling 128894 images.
We tested our method against three other CBIR setups that work in comparable settings. Each of them is based on similarity measures and assists the user in exploring a given dataset. The first is Rocchio’s algorithm [25], a widely used ranking method for vector space settings. It maintains a query vector and shows the user the documents located around it; the relevance feedback moves this vector towards a region with more relevant documents. The second pairs Rocchio’s algorithm with a classical exploratory method from the multi-armed bandit literature, \(\epsilon \)-greedy exploration [30], in which a certain number of actions are randomized to avoid policy stagnation. More precisely, the estimated optimal action is taken with probability \(1 - \epsilon \), and a random action with probability \(\epsilon \). The initial \(\epsilon \) was set to 0.5 and annealed linearly to 0 over 10 simulation iterations in steps of 0.05. The third, AIDE [6], is a recent exploratory framework for information retrieval that attempts to provide a good coverage of the whole dataset. It partitions the search space into subspaces and attempts to find all the relevant regions by presenting the user with samples from each of them.
We conducted one initial test for our system with small training set sizes ranging from 10 up to 150 presented images. This is the average number of images in a typical CBIR search session [9], when the user is presented with 10 images in a single iteration. We measured the precision and running time of our method, with results presented in Tables 1 and 2.
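The annealed \(\epsilon \)-greedy baseline can be sketched as follows. The schedule follows the values stated above (0.5 reduced by 0.05 per iteration until it reaches 0). Applying the random choice independently to each presented slot is an assumption made for this sketch, as are the function names and the use of a seeded `random.Random` instance.

```python
import random


def epsilon_at(iteration, start=0.5, step=0.05):
    """Linearly annealed epsilon: `start` at iteration 0, 0 from iteration 10 on."""
    return max(0.0, start - step * iteration)


def epsilon_greedy_select(ranked_ids, k, epsilon, rng):
    """Choose k items from a ranking (best first).

    Each slot takes the best remaining item with probability 1 - epsilon,
    and a uniformly random remaining item with probability epsilon.
    """
    remaining = list(ranked_ids)
    chosen = []
    while remaining and len(chosen) < k:
        if rng.random() < epsilon:
            pick = remaining.pop(rng.randrange(len(remaining)))  # explore
        else:
            pick = remaining.pop(0)  # exploit the current ranking
        chosen.append(pick)
    return chosen


if __name__ == "__main__":
    rng = random.Random(0)
    for it in range(12):
        shown = epsilon_greedy_select(range(100), 10, epsilon_at(it), rng)
        print(it, round(epsilon_at(it), 2), shown)
```

From iteration 10 onwards the selection in this sketch is purely greedy, so the baseline behaves like plain ranking once the annealing has finished.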
Next, we conducted exploratory tests for all the methodologies outlined above. In this setup we simulated an image retrieval task, where the system presents 10 images to the simulated user over 20 iterations, yielding a total of 200 images. In the simulations, the user feedback is 1 for images with a relevant label and 0 for the remaining images in the presented set. The search starts with a random selection of 9 irrelevant images, plus one relevant image “chosen by the user” – this setting allows us to ensure that all the simulation experiments have a comparable starting point. For the ILSVRC2013 dataset we sampled evenly 20000 images for the testing for each simulation due to space and time constraints set by some of the baseline methodologies. This setting allowed us to test if our system scales well early in the search task, reaching good enough performance in precision and running speed. All the reported results are averaged over 5 training runs for each of the existing classes in the datasets.
To test the quality of exploration, we measured the precision of the retrieved images as well as the coverage of the search space. We measured coverage C as the average of distances between all retrieved items compared to the dataset size:
where \(x^s\) is the set of retrieved images, \(\mid x^s \mid \) its size, and \(\Vert x^s_i, x^s_j \Vert \) is the distance between the i-th and j-th members of the set, averaged over the number of retrieved images. The term maxDist(X) is the maximum distance between two points within the dataset, scaling the sum to be between 0 and 1. The greater the average sum of these distances, the further apart the data points are in the similarity space, and thus the larger the view over the dataset is.
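One way to compute such a coverage score is sketched below. The normalisation is an assumption: the sketch takes the mean distance over all unordered pairs of retrieved images and divides it by the maximum pairwise distance in the dataset, which keeps the result between 0 and 1. The exact averaging used in the measure above may differ, and the function names are hypothetical.

```python
import math
from itertools import combinations


def max_pairwise_distance(points):
    """maxDist(X): largest distance between any two points of the dataset."""
    return max(math.dist(a, b) for a, b in combinations(points, 2))


def coverage(retrieved, max_dist):
    """Mean pairwise distance among retrieved items, scaled by max_dist.

    Returns 0.0 when fewer than two items have been retrieved.
    """
    pairs = list(combinations(retrieved, 2))
    if not pairs or max_dist == 0:
        return 0.0
    total = sum(math.dist(a, b) for a, b in pairs)
    return total / (len(pairs) * max_dist)
```

The brute-force maximum in `max_pairwise_distance` is quadratic in the dataset size, so for larger collections it would have to be computed once in advance or estimated from a sample.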
4.2 Experimental Results
The initial precision results are shown in Table 1. As can be seen, our system is able to retrieve relevant images with high accuracy even within the first few iterations. As the training set is increased to 150 images, the precisions become comparable to modern ranking methodologies.
In Table 2 we show the average training time for each training set size. For each dataset, the average duration of a search iteration is below 4 s, which makes the system interactive from the usability perspective. The training time grows linearly as the number of training data points increases.
In the second set of experiments we look at the performance of the various exploratory methods (Fig. 3). We report the cumulative precision until a given point with the previous iterations acting as the context for the user throughout the search session.
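For reference, cumulative precision over a simulated session can be computed as in the sketch below. It assumes the measure is the fraction of relevant images among all images presented up to and including a given iteration, with the binary feedback of the simulated user (1 for relevant, 0 otherwise) as input; the function name is hypothetical.

```python
def cumulative_precision(feedback_per_iteration):
    """Running precision after each iteration of a search session.

    feedback_per_iteration : list of lists of 0/1 feedback values, one inner
                             list per iteration (e.g. 10 values per iteration).
    """
    shown = 0
    relevant = 0
    curve = []
    for feedback in feedback_per_iteration:
        shown += len(feedback)
        relevant += sum(feedback)
        curve.append(relevant / shown)
    return curve
```

In the simulation setup described earlier, the first inner list would contain a single 1 and nine 0s, so every curve computed this way starts at 0.1.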
Our centroid-based method gains a clear advantage after approximately 5 iterations as the system learns the target representation. Due to the small number of images present in the dog dataset, the curves for this dataset turn downwards for most methods as the search progresses, because all the relevant images have been exhausted early on in the search. Still, our method finds the relevant images sooner and finds a larger portion of them by the end of the search. For the MIRFlickr dataset, we can see how Rocchio’s algorithm exploits an early local cluster of good images but fails later on in the search as it is unable to break out of the initial context. Meanwhile, our method sacrifices a number of attempts early on and gradually achieves a larger number of correctly retrieved images.
The cumulative average coverage shows interesting trends with different methodologies. The baseline methods proceed steadily through the dataset adding relatively small gains throughout the search. Our method, on the other hand, keeps finding new regions for a long time until slowing down after approximately 7 iterations. The overall coverage with our method is significantly larger in the various settings, showing how exploring the changing metric spaces helps us to find new regions of interest faster.
Finally, with the ILSVRC2013 dataset the effect of sparse targets highlights the efficiency of our method. With 100 target classes present, the local space of the initial target quickly exhausts valid images with centroid exploration. It is likely that if the first image is on the edge of the valid cluster of images, the nearby images will quickly present neighbouring classes.
In Fig. 4, we see the coverage for each method and dataset. The more exploratory methods keep covering a larger section of the datasets faster, while the greedy Rocchio’s algorithm lags behind. With the MIRFlickr dataset, our method reaches the same coverage as AIDE after approximately 18 iterations. This suggests that the refined representation is able to find relevant locations beyond the immediate neighbourhood of the starting location.
5 Conclusions
We presented a deep exploratory search framework for online interactive CBIR settings, which reacts to the user’s feedback dynamically and covers a larger portion of the search space than conventional retrieval tools. The system allows users to conduct searches for concepts outside of the initially used features. We showed that this transfer learning is able to extend to abstract targets, robustly learning concepts that the starting features were not originally intended for.
The system is highly dependent on good initial image features. For cases where the dataset has not been annotated but presents natural images, a good object classification CNN is required. In the case of specialised image datasets, such as medical imaging, a separate neural network should be trained just for that purpose, after which the presented methodologies are able to transfer learn the various combinations required to identify the target. Furthermore, efficient sampling (or better hardware) is required to process more images. Fortunately, computational times with modern GPU-based neural networks scale well with larger datasets given adequate memory.
References
1. Berg, T.L., Berg, A.C.: Finding iconic images. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–8, June 2009
2. Camargo, J.E., Caicedo, J.C., Gonzalez, F.A.: A kernel-based framework for image collection exploration. J. Vis. Lang. Comput. 24(1), 53–67 (2013)
3. Chechik, G., Sharma, V., Shalit, U., Bengio, S.: Large scale online learning of image similarity through ranking. J. Mach. Learn. Res. 11, 1109–1135 (2010)
4. Chen, Y., Wang, J.Z., Krovetz, R.: Clue: cluster-based retrieval of images by unsupervised learning. IEEE Trans. Image Process. 14(8), 1187–1201 (2005)
5. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: Proceedings of CVPR (2005)
6. Dimitriadou, K., Papaemmanouil, O., Diao, Y.: Explore-by-example: an automatic query steering framework for interactive data exploration. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD 2014), pp. 517–528, New York, NY, USA. ACM (2014)
7. Ester, M., Kriegel, H.-P., Sander, J., Xiaowei, X.: A density-based algorithm for discovering clusters in large spatial databases with noise, pp. 226–231. AAAI Press (1996)
8. Głowacka, D., Shawe-Taylor, J.: Content-based image retrieval with multinomial relevance feedback. In: Proceedings of ACML, pp. 111–125 (2010)
9. Głowacka, D., Hore, S.: Balancing exploration-exploitation in image retrieval. In: Proceedings of UMAP (2014)
10. Glowacka, D., Teh, Y.W., Shawe-Taylor, J.: Image retrieval with a Bayesian model of relevance feedback. arXiv preprint (2016). arXiv:1603.09522
11. Heesch, D.: A survey of browsing models for content based image retrieval. Multimed. Tools Appl. 40(2), 261–284 (2008)
12. Hoque, E., Hoeber, O., Gong, M.: CIDER: concept-based image diversification, exploration, and retrieval. Inf. Process. Manage. 49(5), 1122–1138 (2013)
13. Hore, S., Glowacka, D., Kosunen, I., Athukorala, K., Jacucci, G.: Futureview: enhancing exploratory image search. In: IntRS@RecSys, pp. 37–40 (2015)
14. Hore, S., Tyrvainen, L., Pyykko, J., Glowacka, D.: A reinforcement learning approach to query-less image retrieval. In: Jacucci, G., Gamberini, L., Freeman, J., Spagnolli, A. (eds.) Symbiotic 2014. LNCS, vol. 8820, pp. 121–126. Springer, Cham (2014). doi:10.1007/978-3-319-13500-7_10
15. Huiskes, M.J., Lew, M.S.: The MIR flickr retrieval evaluation. In: Proceedings of MIR (2008)
16. Kato, T., Kurita, T., Otsu, N., Hirata, K.: A sketch retrieval method for full color image database-query by visual example. In: Pattern Recognition. Computer Vision and Applications, pp. 530–533 (1992)
17. Kelly, D., Xin, F.: Elicitation of term relevance feedback: an investigation of term source and context. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pp. 453–460, New York, NY, USA. ACM (2006)
18. Konyushkova, K., Glowacka, D.: Content-based image retrieval with hierarchical gaussian process bandits with self-organizing maps. In: ESANN (2013)
19. Krizhevsky, A., Hinton, G.E.: Using very deep autoencoders for content-based image retrieval. In: Proceedings of ESANN (2011)
20. Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV, pp. 1150–1157 (1999)
21. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42(3), 145–175 (2001)
22. Pyykko, J., Glowacka, D.: Interactive content-based image retrieval with deep neural networks. In: Gamberini, L., et al. (eds.) Symbiotic 2016. LNCS, vol. 9961. Springer, Heidelberg (2017)
23. Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. CoRR, abs/1403.6382 (2014)
24. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2014)
25. Salton, G.: The SMART Retrieval System-Experiments in Automatic Document Processing. Prentice-Hall Inc., Upper Saddle River (1971)
26. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: integrated recognition, localization and detection using convolutional networks. In: Proceedings of ICLR (2014)
27. Suditu, N., Fleuret, F.: Iterative relevance feedback with adaptive exploration/exploitation trade-off. In: Proceedings of CIKM (2012)
28. Wan, J., Wang, D., Hoi, S.C.H., Wu, P., Zhu, J., Zhang, Y., Li, J.: Deep learning for content-based image retrieval: a comprehensive study. In: Proceedings of MM (2014)
29. Wan, J., Wang, D., Hoi, S.C., Wu, P., Zhu, J., Zhang, Y., Li, J.: Deep learning for content-based image retrieval: a comprehensive study. In: Proceedings of the 22nd ACM International Conference on Multimedia (MM 2014), pp. 157–166, New York, NY, USA. ACM (2014)
30. Watkins, C.J.C.H.: Learning from delayed rewards. Ph.D. thesis, King’s College, Cambridge, UK, May 1989
31. Yang, J., Jiang, Y.-G., Hauptmann, A.G., Ngo, C.-W.: Evaluating bag-of-visual-words representations in scene classification. In: Multimedia, Information Retrieval, pp. 197–206 (2007)
32. Yee, K.-P., Swearingen, K., Li, K., Hearst, M.: Faceted metadata for image search and browsing. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2003), pp. 401–408, New York, NY, USA. ACM (2003)
33. Zhao, F., Huang, Y., Wang, L., Tan, T.: Deep semantic ranking based hashing for multi-label image retrieval. ArXiv e-prints, January 2015
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Pyykkö, J., Głowacka, D. (2017). Dynamic Exploratory Search in Content-Based Image Retrieval. In: Sharma, P., Bianchi, F. (eds) Image Analysis. SCIA 2017. Lecture Notes in Computer Science(), vol 10269. Springer, Cham. https://doi.org/10.1007/978-3-319-59126-1_45
DOI: https://doi.org/10.1007/978-3-319-59126-1_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59125-4
Online ISBN: 978-3-319-59126-1