Retrieving Images by Multiple Samples via Fusing Deep Features

  • Kecai Wu
  • Xueliang Liu
  • Jie Shao
  • Richang Hong
  • Tao Yang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9916)


Most existing image retrieval systems search for similar images given a single input, and querying with multiple images is not trivial. In this paper, we describe a novel image retrieval paradigm in which users provide two images as the query and retrieve images that contain the content of both inputs simultaneously. In our solution, a deep CNN feature is extracted from each query image, and the two features are fused into the query feature. Because the roles of the two query images differ and may change, we propose Feature Weighting by Clustering (FWC), a novel algorithm to weight the two query features: all CNN features in the dataset are clustered, and the weight of each query is obtained from its distance to the mutual nearest cluster. The effectiveness of our algorithm is evaluated on the PASCAL VOC2007 and Microsoft COCO datasets.


Keywords: Image retrieval · Feature fusion · Convolutional neural network

1 Introduction

The goal of a content-based image retrieval (CBIR) system is to search for similar images in a large visual dataset given a query. However, most existing image retrieval systems accept only a single image as the query and fail when users want to search for images containing several different concepts. In recent years, some work has addressed similar retrieval tasks [6, 15]. For example, Vaca-Castano et al. [15] presented a new search paradigm that utilizes multiple images as input to query the semantics behind them: given milk, beef, and a ranch as query images, the system returns images of cows. Ghodrati et al. [6] provided a swap retrieval framework that takes both visual and textual content as the query to search for images with similar content but with the object class swapped for a similar one: for example, given an image of a dog with a hat as the query, the system returns images of cats with hats.

In this paper, we propose a novel retrieval framework that accepts two different images as the query and finds images containing the concepts of both inputs. For example, given one dog image and one cat image as queries, the retrieval system searches for images that include a dog and a cat together. The workflow of our framework is shown in Fig. 1.
Fig. 1.

Given a dog and a cat as query images, our system retrieves images containing both a dog and a cat.

There is some existing work that accepts multiple inputs in image retrieval [1, 3, 4, 7]. However, in these works the multiple inputs are different features of the same image or different views of the same object, which is completely different from our retrieval problem. The common characteristic of these methods is that the multiple inputs belong to the same object, whereas the two query images in our task may depict totally different objects; hence these existing methods cannot support our retrieval framework.

Therefore, we propose a novel algorithm for this challenging task that weights and fuses the two query features. First, Convolutional Neural Networks (CNNs) [14] are employed to extract a compact representation, owing to their advantage of automatically learning compact features for recognition compared to hand-crafted features. Then all features of the dataset are clustered by K-means, and the weights of the two query features are generated from their distances to their mutual nearest cluster. Finally, we search for a joint latent semantic by fusing the two weighted features.

The remainder of this paper is structured as follows: in Sect. 2 we review related work in the literature on convolutional neural networks and image retrieval. In Sect. 3 we introduce our solution to the multiple-input retrieval task. Experimental and evaluation results are discussed in Sect. 4, and Sect. 5 concludes the paper.

2 Related Work

2.1 Convolutional Neural Networks

In the past few years, Convolutional Neural Networks have achieved great performance in many multimedia tasks. For example, Krizhevsky et al. [11] provided a deep Convolutional Neural Network framework to classify 1.2 million images into 1000 classes. Moreover, features based on convolutional networks have also led to strong performance on a range of vision tasks [8, 9, 14, 20, 21]. Razavian et al. [14] utilized features extracted from CNNs as a generic image representation to tackle a diverse range of recognition tasks, including object classification, scene recognition, and image retrieval, on a diverse set of public datasets. Remarkably, they reported consistently superior results compared to highly tuned state-of-the-art visual systems across all these classification tasks, and therefore strongly suggested that representation features obtained from deep convolutional networks should be the primary candidate in most visual recognition tasks.

2.2 Multiple Query Image Retrieval

Most image retrieval algorithms are designed for a single query and obtain good retrieval results [12, 13, 17, 18, 19]. However, retrieval with multiple query inputs can meet special user needs. In this setting, the key issue is to find one representative feature to stand in for the multiple query features.

For example, Zhang et al. [16] provided an algorithm that fuses different features of the same query image, achieving good retrieval performance. Fernando et al. [3] proposed a similar method that learns an object-specific mid-level representation, which can merge several query images captured from different viewpoints or under different viewing conditions. A novel mobile visual search algorithm that explores saliency from multiple relevant photos was proposed in [17], also with strong performance. However, the goal of all these algorithms is to obtain a single representation that replaces all queries when performing retrieval.

As discussed above, general multiple-query solutions cannot solve our retrieval task. Our solution is instead inspired by the following methods, even though they cannot directly solve our problem either. Gawande et al. [7] proposed a feature-level fusion algorithm using a Support Vector Machine (SVM) classifier that fuses features from two different modalities, fingerprint and iris, where the Mahalanobis distance is used to fuse the two features. Principal Component Analysis (PCA) is a commonly used method for satellite image fusion [1], since PCA keeps the principal information during fusion.

3 Our Methods

In this section, we introduce our solution to this special multiple-query retrieval task. Let A and B denote the query images, selected from different object classes. The CNN features of the two query images are extracted with Caffe [10] and denoted \( F_{A} \) and \( F_{B} \).

3.1 The Visual Feature Extraction

In our solution, the deep learning framework Caffe is utilized to extract CNN features of the queries and the dataset. The network of [11] trained on ImageNet has eight layers in total: five convolutional and three fully-connected layers. Since we are interested in a high-level visual feature rather than a classifier, we remove the eighth layer of the network. Using the pre-trained network, features are extracted by forward-propagating a mean-subtracted 256 × 256 RGB image through the five convolutional and the first two fully-connected layers. Finally, we obtain a global feature descriptor: a 4096-d vector.
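As a rough sketch of this extraction step (the function names are hypothetical, and the actual Caffe forward pass is abstracted behind a `forward_fn` callback rather than reproduced), the preprocessing and the shape of the resulting descriptor can be expressed as:

```python
import numpy as np

def preprocess(image, mean_image):
    """Mean-subtract a 256x256 RGB image before the forward pass.

    Both arrays have shape (256, 256, 3), matching the input layout
    described above.
    """
    return image.astype(np.float32) - mean_image.astype(np.float32)

def extract_feature(image, mean_image, forward_fn):
    """Run the truncated network (five conv layers plus fc6/fc7) and
    return the 4096-d activation as the global descriptor.

    `forward_fn` stands in for the pretrained network's forward pass.
    """
    x = preprocess(image, mean_image)
    feat = forward_fn(x)
    assert feat.shape == (4096,)  # the global 4096-d descriptor
    return feat
```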

3.2 The Feature Weighting by Clustering

To weight the two query features, we cluster all features in the dataset by K-means, and the weights of the two query features are obtained from their distances to their mutual nearest cluster. The weighted features of the two queries can then be used to search for their common latent semantic.

First, we assign the CNN features of the dataset to K clusters. Second, we calculate the Euclidean distance between the feature of image A and every cluster, and find the cluster \( K_{A} \) whose corresponding distance \( dist_{A} \) is the minimum over all distances (1); similarly, we find the cluster \( K_{B} \) whose corresponding distance \( dist_{B} \) is the minimum (2). Here \( dist_{Ai} \) denotes the Euclidean distance between the feature of image A and the \( {\text{ith}} \) cluster, and \( dist_{Bi} \) the distance between the feature of image B and the \( {\text{ith}} \) cluster. Finally, we determine whether \( K_{A} \) and \( K_{B} \) are the same cluster.
$$ dist_{A} = \min_{i = 1, \ldots, k} \left\{ dist_{Ai} \right\} \quad (1) $$
$$ dist_{B} = \min_{i = 1, \ldots, k} \left\{ dist_{Bi} \right\} \quad (2) $$
If \( K_{A} \) and \( K_{B} \) are the same cluster, we obtain two parameters \( \mu_{A} \) and \( \mu_{B} \) as weights for the CNN features of the two query images according to formulas (3) and (4).
$$ \mu_{A} = \frac{dist_{A}}{dist_{A} + dist_{B}} \quad (3) $$
$$ \mu_{B} = \frac{dist_{B}}{dist_{A} + dist_{B}} \quad (4) $$
If \( K_{A} \) and \( K_{B} \) are not the same cluster, we find a new cluster \( K_{w} \) among all clusters such that the sum of the distances between \( K_{w} \) and \( K_{A} \) and between \( K_{w} \) and \( K_{B} \) is minimal (5). Here \( dist_{Ai}^{'} \) denotes the distance between cluster \( K_{A} \) and the \( {\text{ith}} \) cluster, and \( dist_{Bi}^{'} \) the distance between cluster \( K_{B} \) and the \( {\text{ith}} \) cluster.
$$ K_{w} = \mathop{\arg\min}_{i = 1, \ldots, k} \left\{ dist_{Ai}^{'} + dist_{Bi}^{'} \right\} \quad (5) $$
Once we find the cluster \( K_{w} \), we calculate the distances between the two query features and \( K_{w} \), denoted \( dist_{A}^{'} \) and \( dist_{B}^{'} \) respectively. The parameters \( \mu_{A} \) and \( \mu_{B} \) are then calculated by formulas (6) and (7) and likewise adopted as weights.
$$ \mu_{A} = \frac{dist_{A}^{'}}{dist_{A}^{'} + dist_{B}^{'}} \quad (6) $$
$$ \mu_{B} = \frac{dist_{B}^{'}}{dist_{A}^{'} + dist_{B}^{'}} \quad (7) $$
Whether or not \( K_{A} \) and \( K_{B} \) are the same cluster, we thus obtain two parameters \( \mu_{A} \) and \( \mu_{B} \) as weights for the CNN features of the two query images. The weighted features of the two queries are generated by formulas (8) and (9).
$$ F_{A}^{'} = \mu_{A} \cdot F_{A} \quad (8) $$
$$ F_{B}^{'} = \mu_{B} \cdot F_{B} \quad (9) $$
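The whole weighting procedure can be sketched in a few lines of NumPy. This is an illustrative reimplementation rather than the authors' code, and it assumes the K-means centroids over the dataset features have already been computed:

```python
import numpy as np

def fwc_weights(f_a, f_b, centroids):
    """Feature Weighting by Clustering (FWC): weight two query
    features by their distances to a mutual nearest cluster.

    `centroids` is a (K, d) array of k-means cluster centers.
    Returns the weighted features mu_A * F_A and mu_B * F_B.
    """
    d_a = np.linalg.norm(centroids - f_a, axis=1)  # dist_Ai, Eq. (1)
    d_b = np.linalg.norm(centroids - f_b, axis=1)  # dist_Bi, Eq. (2)
    k_a, k_b = np.argmin(d_a), np.argmin(d_b)

    if k_a == k_b:
        # Both queries share a nearest cluster: use Eqs. (3)-(4).
        dist_a, dist_b = d_a[k_a], d_b[k_b]
    else:
        # Otherwise pick the cluster K_w minimizing the summed
        # centroid-to-centroid distances, Eq. (5).
        da_c = np.linalg.norm(centroids - centroids[k_a], axis=1)
        db_c = np.linalg.norm(centroids - centroids[k_b], axis=1)
        k_w = np.argmin(da_c + db_c)
        # Distances from each query feature to K_w, for Eqs. (6)-(7).
        dist_a = np.linalg.norm(f_a - centroids[k_w])
        dist_b = np.linalg.norm(f_b - centroids[k_w])

    mu_a = dist_a / (dist_a + dist_b)
    mu_b = dist_b / (dist_a + dist_b)
    return mu_a * f_a, mu_b * f_b  # Eqs. (8)-(9)
```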

Our retrieval system takes two images as the query, and we do not directly fuse the query features. A traditional retrieval system applies a distance metric between a single query feature and the dataset features to retrieve the output images; since our system has two query images, we propose a new metric for this special task. First, we denote by \( dist_{1} \) and \( dist_{2} \) the Euclidean distances of the features \( F_{A} \) and \( F_{B} \), respectively, to each item in the dataset. The sum \( dist_{1} + dist_{2} \) is then used as the query metric, and sorting by it yields the retrieval result of the proposed method.
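Assuming the dataset features are stacked into a matrix, this sum-of-distances ranking can be sketched as follows (an illustrative implementation; in the full pipeline the two query features would be the weighted ones from Eqs. (8)–(9)):

```python
import numpy as np

def rank_by_joint_distance(f_a, f_b, database):
    """Rank database items by dist1 + dist2, the summed Euclidean
    distances to the two query features.

    `database` is an (N, d) array of features; returns indices sorted
    from best match to worst.
    """
    dist1 = np.linalg.norm(database - f_a, axis=1)
    dist2 = np.linalg.norm(database - f_b, axis=1)
    return np.argsort(dist1 + dist2)
```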

4 Experiments

We evaluate the FWC algorithm on two public datasets, PASCAL VOC2007 and MS COCO, and compare the retrieval results of the PCA baseline and our method; the mean average precision (mAP) and the precision at N are shown in Table 1 and Fig. 3.
Table 1.

The mAP comparison of various indexing approaches on VOC2007 and COCO.

Methods and datasets               Top 5   Top 10   Top 15   Top 20   Top 25
PCA with raw feature (VOC)
PCA with weighted feature (VOC)
FWC with weighted SIFT (VOC)
FWC with weighted CNN (VOC)
PCA with raw feature (COCO)
PCA with weighted feature (COCO)
FWC with weighted SIFT (COCO)
FWC with weighted CNN (COCO)
4.1 Dataset and Implementation Details

For evaluation purposes, we use the concepts dog and cat in our experiments. We prepare two public datasets in which there are multiple objects per image. The first is PASCAL VOC2007, which contains about 10,000 images in 20 different classes. The second is MS COCO, an image recognition dataset containing 80 object categories. Since MS COCO is too large for our retrieval task, we randomly select a subset of 10,000 images for the experiments. However, the two datasets do not have enough images containing a dog and a cat simultaneously, so we manually collect a subset of such images and add them to the datasets. Finally, we collect 90 pairs of dog and cat images from different categories as query images.

We perform several experiments evaluating the FWC method on the 90 pairs of query images. For each pair, we take the top 25 or 50 retrieved images as the result. If a retrieved image is relevant to both queries, we label it 1; otherwise, 0. We then compute the mAP and precision at N over the 90 query pairs. The mAP is reported for the top 25 retrieved images, at cutoffs from 5 to 25 in intervals of 5; the precision at N is reported for the top 50, at cutoffs from 5 to 50 in intervals of 5.
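The paper does not spell out the exact formulas for these measures; assuming the standard definitions over such a binary relevance list, they could be computed as:

```python
import numpy as np

def precision_at_n(relevance, n):
    """Precision at N over a binary relevance list
    (1 = relevant to both queries, 0 = not)."""
    return float(np.sum(relevance[:n])) / n

def average_precision(relevance, n):
    """AP over the top-n results: mean of precision@k taken at each
    relevant rank k (mAP is this averaged over all query pairs)."""
    rel = np.asarray(relevance[:n], dtype=float)
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / (np.arange(n) + 1)
    return float(np.sum(precisions * rel) / rel.sum())
```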

K-means is a vector quantization algorithm popular for cluster analysis in data mining and computer vision. In our solution it partitions all the dataset features into K clusters, so we must choose a suitable K for each dataset. There are 20 different classes in PASCAL VOC2007 and 80 in MS COCO; for simplicity, we set K = 21 for PASCAL VOC2007 and K = 81 for MS COCO in our experiments.
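A minimal Lloyd's-algorithm sketch of this clustering step (illustrative only, not the authors' implementation; a library routine would work equally well) looks like:

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) over the dataset features,
    with K set to the number of object classes plus one
    (K = 21 for VOC2007, K = 81 for the COCO subset)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign every feature to its nearest centroid.
        d = np.linalg.norm(features[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```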

4.2 Compared Approaches

To demonstrate the effectiveness of our method of fusing the two query features, we compare our approach against the following closely related methods. As aforementioned, feature weighting and the CNN feature are the two crucial factors in our solution, but little work addresses multiple-input queries, so our baselines are constructed from different fusion schemes and visual features.

To demonstrate the benefit of our fusion method, we use PCA to fuse the two inputs as a baseline. First, the two query features \( F_{A} \) and \( F_{B} \) are 4096-d vectors, which we concatenate into a new 2 × 4096 array, usually viewed as 2 observations of 4096 dimensions each. By transposing, however, we can instead view it as 4096 observations of 2 dimensions each. We then apply the extended algorithm of [1] to fuse the two query features by PCA: the dimension of each observation is reduced from 2 to 1, yielding a 4096-d vector that serves as the fused feature containing the principal components of images A and B. Finally, this fused feature is used in an ordinary retrieval system that measures the similarity between the query feature and all dataset features. The details of this baseline are shown in Fig. 2.
Fig. 2.

The flow diagram of PCA fusion.
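A sketch of this PCA-fusion baseline, as one plausible reading of the description above (illustrative, not the code of [1]): the two features become d observations of 2 variables, and each observation is projected onto the first principal component.

```python
import numpy as np

def pca_fuse(f_a, f_b):
    """Fuse two d-dimensional query features with PCA: view the pair
    as d observations of 2 variables and project each observation
    onto the first principal component, giving one d-dim vector."""
    x = np.stack([f_a, f_b], axis=1)   # shape (d, 2): d observations
    x_centered = x - x.mean(axis=0)    # center the 2 variables
    # Principal axis of the 2x2 covariance matrix.
    cov = np.cov(x_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    pc = eigvecs[:, np.argmax(eigvals)]
    return x_centered @ pc             # fused d-dim feature
```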

To illustrate the advantage of feature weighting, both the raw CNN feature and the weighted CNN feature are used for our retrieval task. Moreover, having reviewed the state-of-the-art performance of CNNs in Sect. 2, we also extract Scale Invariant Feature Transform (SIFT) [5] features of the two queries as a contrast and apply FWC to them. This gives four retrieval solutions: PCA with raw features, PCA with weighted features, FWC with weighted SIFT features, and FWC with weighted CNN features, all evaluated on the two public datasets.

4.3 Results and Discussion

These solutions are evaluated by mAP and precision at N on the PASCAL VOC2007 and MS COCO datasets. Table 1 shows the mAP results, and Fig. 3 reports the precision at N, for the two datasets respectively. In the figure, PCA− denotes PCA with raw features and PCA+ denotes PCA with weighted features. PCA with weighted features clearly outperforms PCA with raw features, and the CNN features of the two queries achieve better retrieval performance than the SIFT features.
Fig. 3.

The precision at N in VOC2007 (a) and COCO (b).

From the results, we can see that our approach outperforms the other methods, showing that feature weighting and CNN features are both necessary factors in our solution. In Table 1 and Fig. 3, the FWC method achieves the best retrieval results on both datasets. The PCA method is competitive on MS COCO, but its performance on PASCAL VOC2007 is not acceptable: PCA does not take into account that the objects in the two queries may be totally different. By contrast, our feature-weighted query addresses this problem, since the weighted features of the two queries can match their latent semantic; hence the retrieval performance of the proposed method is more stable and effective across the two datasets.

5 Conclusion

In this paper, we propose a new retrieval framework in which two input images serve as the query and the system retrieves images that include the concepts of both. In our solution, we present a feature weighting method that searches for the mutual latent semantic of the two queries; the weighted features are obtained by measuring the distance between each query feature and its corresponding nearest cluster.

The proposed method is evaluated on the two datasets. The experimental results indicate that it achieves promising retrieval performance, which suggests that feature weighting can model the latent semantic of two different query features well.




This work was partially supported by the National High Technology Research and Development Program of China (Grant No. 2014AA015104), the Natural Science Foundation of China (NSFC) under Grants 61502139 and 61472116, the Natural Science Foundation of Anhui Province under Grant 1608085MF128, and the program of the Key Lab of Information Network Security, Ministry of Public Security, under Grant C14605.


References

  1. Chiang, J.L.: Knowledge-based principal component analysis for image fusion. Appl. Math. 8(1L), 223–230 (2014)
  2. Elkan, C.: Using the triangle inequality to accelerate k-means. In: ICML, vol. 3, pp. 147–153 (2003)
  3. Fernando, B., Tuytelaars, T.: Mining multiple queries for image retrieval: on-the-fly learning of an object-specific mid-level representation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2544–2551 (2013)
  4. Fu, Y., Cao, L., Guo, G., et al.: Multiple feature fusion by subspace learning. In: Proceedings of the 2008 International Conference on Content-Based Image and Video Retrieval, pp. 127–134. ACM (2008)
  5. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
  6. Ghodrati, A., Jia, X., Pedersoli, M., et al.: Swap retrieval: retrieving images of cats when the query shows a dog. In: Proceedings of the 5th ACM International Conference on Multimedia Retrieval, pp. 395–402. ACM (2015)
  7. Gawande, U., Zaveri, M., Kapur, A.: A novel algorithm for feature level fusion using SVM classifier for multibiometrics-based person identification. Appl. Comput. Intell. Soft Comput. 2013, 9 (2013)
  8. Girshick, R., Donahue, J., Darrell, T., et al.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
  9. Hariharan, B., Arbeláez, P., Girshick, R., et al.: Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 447–456 (2015)
  10. Jia, Y., Shelhamer, E., Donahue, J., et al.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014)
  11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
  12. Makadia, A.: Feature tracking for wide-baseline image retrieval. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 310–323. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15555-0_23
  13. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), vol. 2, pp. 2161–2168. IEEE (2006)
  14. Razavian, A.S., Azizpour, H., Sullivan, J., et al.: CNN features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813 (2014)
  15. Vaca-Castano, G., Shah, M.: Semantic image search from multiple query images. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 887–890. ACM (2015)
  16. Zhang, S., Yang, M., Cour, T., Yu, K., Metaxas, D.N.: Query specific fusion for image retrieval. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part II. LNCS, vol. 7573, pp. 660–673. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33709-3_47
  17. Yang, X., Qian, X., Xue, Y.: Scalable mobile image retrieval by exploring contextual saliency. IEEE Trans. Image Process. 24(6), 1709–1721 (2015)
  18. Liu, X., Wang, M., Yin, B.-C., Huet, B., Li, X.: Event-based media enrichment using an adaptive probabilistic hypergraph model. IEEE Trans. Cybern. 45(11), 2461–2471 (2015)
  19. Wang, M., Li, W., Liu, D., Ni, B., Shen, J., Yan, S.: Facilitating image search with a scalable and compact semantic mapping. IEEE Trans. Cybern. 45(8), 1561–1574 (2015)
  20. Wang, M., Li, G., Lu, Z., Gao, Y., Chua, T.-S.: When Amazon meets Google: product visualization by exploring multiple information sources. ACM Trans. Internet Technol. 12(4), Article 12, 1–17 (2013)
  21. Wang, M., Gao, Y., Ke, L., Rui, Y.: View-based discriminative probabilistic modeling for 3D object retrieval and recognition. IEEE Trans. Image Process. 22(4), 1395–1407 (2013)

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Kecai Wu (1)
  • Xueliang Liu (1)
  • Jie Shao (2)
  • Richang Hong (1)
  • Tao Yang (3)

  1. Hefei University of Technology, Hefei, China
  2. University of Electronic Science and Technology of China, Chengdu, China
  3. The Third Research Institute of Ministry of Public Security, Beijing, China
