Content based image retrieval using fusion of multilevel bag of visual words
Content based image retrieval (CBIR) is the art of finding visually and conceptually similar pictures to the given query picture. Usually, there is a semantic gap between low-level image features and high-level concepts perceived by viewers. Although features such as intensity and color enforce a good distinction between the images in terms of greater detail, they convey little semantic information. Therefore, employing higher-level features such as properties of regions and objects within the image could improve the retrieval performance. In this study, features are extracted at the pixel, region, object, and concept levels. The fusion step concatenates the four feature vectors and maps it to a lower-dimensional space using auto-encoders. The experiments confirm the efficiency of the proposed method over the individual feature groups and also the state of the art methods.
Keywords: CBIR · Feature fusion · Bag of visual words
Due to the growing interest in digital images and their applications, image search and retrieval have been gaining increasing attention in both research and commercial domains. Without image retrieval systems, it is very hard to search the large and ever-growing image databases on the Internet. The basic methods of image retrieval rely on labels and tags assigned to images by human experts. These approaches are costly and suffer from inconsistencies between the tags and labels composed by different experts. In addition, some images cannot be easily described by a small set of keywords. Content-based image retrieval (CBIR) methods try to overcome these problems by learning models that map visual image features to well-defined objects and concepts .
Generally, CBIR systems are composed of indexing and searching steps. In the indexing step, feature vectors representing the visual features of images are extracted and stored in a database. The search step processes a given query image by extracting the same features as in the indexing step and looking them up in the database. The search procedure employs various similarity measures to rank and retrieve the most similar images.
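The two steps can be sketched as follows; the feature extractor and the cosine similarity measure are illustrative choices, not the paper's exact pipeline:

```python
import numpy as np

def build_index(images, extract_features):
    """Indexing step: extract one feature vector per image and store them."""
    return np.stack([extract_features(img) for img in images])

def search(query, index, extract_features, k=3):
    """Search step: extract the same features from the query and rank the
    database images by cosine similarity (one of several possible measures)."""
    q = extract_features(query)
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)[:k]  # indices of the k most similar images
```

In practice `extract_features` is the multilevel pipeline described in Section 3, and the index is persisted to disk rather than held in memory.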
The most important challenge in CBIR is the semantic gap, which is the difference between the features extracted from an image and its latent high-level concepts. A high-level concept is the understanding that a human has of an image. In fact, two images may be very similar in terms of color and texture, but human perception of them may be considerably different. The semantic gap causes the retrieval system to fetch images that are similar in terms of visual features but communicate quite different concepts.
Image features are extracted by two kinds of methods: global and local. Global methods extract visual features from the whole image without considering the similarity and spatial relations of the pixels. Hence, the image is represented by a single feature vector that can be readily used in comparison and ranking procedures. Despite the simplicity and robustness of these methods, their performance degrades with changes of light, scale, and viewpoint. Local methods extract features from several regions of the image. These regions can be obtained by a simple partitioning of the image, a color segmentation algorithm, or neighbourhoods around key points. Finally, the image is represented by a set of feature vectors corresponding to the regions. Due to the different number of regions in images, the set of local feature vectors must be further processed to induce a single descriptor for comparison purposes .
With the large and diverse number of available feature types, some works try to extract and use multiple groups of features and combine their results at feature or decision levels. The decision fusion strategy independently trains basic learners for each group. Results of the basic learners are then combined using functions such as majority voting . The feature fusion, on the other hand, combines the extracted feature vectors to build a new feature vector. This strategy takes into account the interdependence and complementary role of the different feature groups. The fusion function ranges from a simple concatenation of the vectors to the more advanced belief networks .
According to a recent survey by Piras and Giacinto , most of the recent works that combine multiple sources of information are mainly based on decision-level fusion. The adopted sources are visual image features, accompanying textual information, and relevance feedback. The major challenge of feature fusion is the intrinsic differences between the feature groups. Availability, robustness, scale, and dynamics of pixel-based features are quite different from those of region-based features. For instance, a feature that is based on pixels (e.g. co-occurrence of pixel colors) is extracted from thousands of pixels and can be safely and robustly modeled by parametric probability distributions. However, a feature that is extracted from multiple regions (e.g. mean area of coherent color segments) is rarely based on more than 10 regions, which limits the application of modeling and inferential methods. Furthermore, the different scales and dynamics of the feature groups require complex space transformations .
In this study, feature extraction is performed at four levels, namely pixel, region, object, and concept. A feature fusion step concatenates the four feature vectors and maps them to a lower-dimensional space using an auto-encoder network. This final feature vector is the basis for similarity estimation and retrieval. As a decision-level approach, our earlier work  builds four basic retrieval systems over the four feature groups. Each system reports a ranked list of related images. The lists are then combined and a final ranking is generated. Despite its simplicity of setup and extension, that approach ignores feature dependencies and overlaps. In this research, we extend our work in a new dimension that implicitly captures inter-feature information to build a more accurate and robust system.
The present work systematically differs from deep image learning and image hashing, two currently active topics in the area of image retrieval. First, we emphasize that while the proposed approach is not a deep-learning method, it benefits from deep-learning-based image features through the adoption of a successful pre-trained deep neural network. This aspect of the work can be further explored by using features from more recent deep NNs such as ResNet . Second, image hashing techniques try to reduce the search time of retrieval algorithms. Usually, the search operation is linear in the size of the underlying image set, so the required time can be a bottleneck in real-time scenarios. Image hashing techniques limit the search to small subsets of the images. Results of recent works are quite promising in terms of time complexity, but their accuracy still has room for improvement .
The experimental results over three public domain datasets confirm that fusion of multiple types of visual words significantly improves the retrieval performance. Furthermore, the similar trend of improvements over the datasets suggests that the improvement should be mainly attributed to the fusion procedure.
The remainder of the paper is arranged as follows: In the next section, work related to the various feature levels is introduced and reviewed. The proposed method is explained in detail in the third section. Then, in the fourth section, the adopted datasets, the experiment and implementation details, and the results are presented. Section 5 concludes the paper with a short discussion and a few future directions.
2 Related work
In this section, the related work is studied in four groups corresponding to the four mentioned feature types. Furthermore, a few works that address the fusion of multiple feature groups are introduced. Finally, the major strategies of information fusion in learning scenarios are briefly reviewed.
The tasks performed at the pixel level include extracting global features from all of the image pixels and local features from small patches around key points. In , directional local extrema patterns are used for image retrieval. This feature is inspired by local binary patterns (LBP) and extracts edge information in multiple orientations. The histogram of the oriented edges constitutes the global feature vector of the image. In , histograms of HSV color and an LBP-inspired texture feature named local texton XOR patterns are used for retrieval.
In , the image is partitioned into \(4 \times 4\) blocks and then texture and color features are extracted from the blocks. The blocks are then clustered and the cluster centers are reported as the region features. Similarly, color and texture features in a small neighborhood around pixels are used in , where Gaussian mixture models are employed for grouping and segmentation of image pixels. In , features of salient regions are extracted and used for retrieval. A salient region is an important part of the image that draws the viewer's attention more than other parts of the image.
As an approach that exploits features of the objects in an image,  builds a statistical model of an object from a small set of samples provided by the users. The system finds images containing the object by matching against this model.  follows a similar approach by first segmenting an image into regions, each hopefully containing a single object. The set of descriptors is then normalized into a fixed-length vector and classified by an MLP network. The notion of object classifiers is also adopted in . A set of 200 objects, frequently appearing in web images, is selected for classification. A template of each object slides over the image, and the responses for all the objects are collected and fed to the classifiers. A notable point of this work is the use of two distinct classifiers for foreground and background objects.
The problem with tagging images using the correspondence between the visual features of the image and the keywords is that a large number of irrelevant keywords are generated. The purpose of the research in  is to prune the unrelated keywords using word similarities extracted from WordNet. The study in  also refines inappropriate image labels: the tags assigned to images are ranked, and tags below a predefined cutoff are ignored. In , the final similarity is a combination of multiple decisions based on specific feature groups extracted from the curvelet transform, the wavelet transform, and the dominant color descriptor. The color and texture features are extracted in HSV space. After obtaining the similarity value for each group, a linear combination learned by particle swarm optimization (PSO) combines the results.
Several recent works use multiple sources of information for the retrieval task. A recent review studies the available works in the common framework of decision fusion; the differences between the works are mainly the types of features used and their fusion functions. In , a hybrid feature named correlated primary visual texton histogram feature (CPV-THF) is introduced that combines color, texture, spatial, and structural features of the images using their cross-correlations. Finally, the \(L_1\) distance is used for comparison and retrieval. Srivastava et al. combine SURF and LBP features for image classification . Similarly, Mehmood et al. combine SURF and HOG features to improve retrieval performance .
In summary, existing works adopt several sources of information for the retrieval task. However, the incremental effect of the sources is mainly studied in terms of decision fusion. This approach is limited in the sense that it requires multiple passes over the database. A combined feature vector, on the other hand, induces a single description of the image whose contents convey information from multiple sources but requires only one search over the database. Another point that is not explored in the existing research is the use of labels coming from image classification. The text adopted in the studied works mainly comes from captions and comments accompanying the image in web pages. However, the object names and class labels assigned to images by intelligent analysis of the image content need more exploration.
3 The proposed approach
We propose a feature fusion approach that combines features at various granularity levels for efficient image retrieval. The general scheme of this method is shown in Fig. 1. The approach is composed of two main modules. The index module extracts several types of features from the images and induces a unique representation through the application of auto-encoder neural networks. The representations are stored in a database for later reference by the search module.
3.1 Feature extraction and normalization
As depicted in Fig. 2, the features of an image are extracted at 4 levels: pixels, regions, objects, and concepts. Each type of extracted feature is then independently normalized to [0, 1]. The features are concatenated to constitute a single vector. An auto-encoder neural network non-linearly maps this vector into a lower-dimensional space, resulting in the final feature vector of the image. Auto-encoders have been shown to be more efficient than classic methods , making them an ideal tool for feature extraction. The auto-encoder is trained using the training images. The number of new dimensions, i.e. the size of the auto-encoder's hidden layer, is empirically determined.
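A minimal sketch of the normalization and fusion steps, using a tiny tied-weight linear auto-encoder in place of the paper's (unspecified) network architecture:

```python
import numpy as np

def normalize01(X):
    """Independently scale each feature column to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def train_autoencoder(X, hidden, epochs=1000, lr=0.05, seed=0):
    """Tied-weight linear auto-encoder: h = x W, x_hat = h W^T.
    Minimizing the reconstruction error makes h a compact fused descriptor.
    (Sketch only; the paper's network and hyper-parameters may differ.)"""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    for _ in range(epochs):
        E = X @ W @ W.T - X                           # reconstruction error
        W -= lr * (X.T @ (E @ W) + E.T @ (X @ W)) / len(X)
    return W

def fuse(feature_groups, W):
    """Concatenate the per-level feature vectors and project with the encoder."""
    return np.concatenate(feature_groups) @ W
```

A real implementation would add non-linear activations and train on the full image collection; the hidden size plays the role of the empirically chosen dimensionality.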
3.1.1 Pixel level
At this level, the first moments of color, Gabor filter responses, and SIFT features are extracted as local descriptors. A code-book is built using the k-means clustering algorithm, in which the cluster centers are stored as the visual words. The image is then represented as a bag of these words.
Following the approach of , to extract color moments, the image is partitioned into \(16\times 16\) blocks, and the mean, variance, and skewness of the HSV components make up the nine-element feature vector of each block.
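The block-wise color moments can be computed as follows (a sketch; the HSV conversion and block partitioning are assumed to happen beforehand):

```python
import numpy as np

def color_moments(block_hsv):
    """First three moments (mean, variance, third central moment as skewness)
    of each HSV channel of one block -> nine-element descriptor."""
    feats = []
    for c in range(3):
        x = block_hsv[..., c].ravel().astype(float)
        mu = x.mean()
        var = ((x - mu) ** 2).mean()
        skew = ((x - mu) ** 3).mean()  # unnormalized third moment
        feats.extend([mu, var, skew])
    return np.array(feats)
```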
Gabor features  are extracted by applying a bank of filters with five scales and eight directions to the gray-scale images. Each response image is partitioned into \(16\times 16\) blocks and then the mean and variance of the corresponding blocks in the 40 images are extracted to build up an 80-element feature vector for each block.
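A sketch of one Gabor kernel and the block-wise mean/variance aggregation; the kernel parameters (sigma, wavelength, size) are illustrative choices, not the paper's:

```python
import numpy as np

def gabor_kernel(theta, sigma=3.0, lam=6.0, size=15):
    """Real part of a Gabor kernel at orientation theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * xr / lam)

def block_stats(response, blocks=16):
    """Mean and variance of one filter response over a blocks x blocks grid.
    Stacking these for all 40 filters yields the 80-element block vectors."""
    h, w = response.shape
    bh, bw = h // blocks, w // blocks
    r = response[:bh * blocks, :bw * blocks].reshape(blocks, bh, blocks, bw)
    return r.mean(axis=(1, 3)), r.var(axis=(1, 3))
```

The full bank is obtained by looping `theta` over eight orientations and scaling `sigma`/`lam` over five scales, convolving the gray-scale image with each kernel.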
To extract SIFT features, the key points of the image are obtained as the extrema of differences of Gaussians. For each key point, its \(16\times 16\) neighbourhood is partitioned into \(4 \times 4\) blocks. Histograms of the oriented gradients at the blocks constitute the 128-element SIFT descriptor.
The features extracted from all the training images are described by a bag of visual words model. In this model, a vocabulary is generated for each type of feature using the k-means clustering algorithm. The size of the vocabulary, N, is empirically determined, as in . Here we assign N/3 words to each feature type.
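The vocabulary construction and bag-of-visual-words encoding can be sketched as follows (plain k-means with a farthest-point initialization; the paper does not specify its exact clustering setup):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means; the final cluster centers are the visual vocabulary."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])  # farthest-point initialization
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def bovw_histogram(descriptors, vocabulary):
    """Represent an image as a normalized histogram of its nearest visual words."""
    d2 = ((descriptors[:, None] - vocabulary[None]) ** 2).sum(-1)
    words = np.argmin(d2, axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()
```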
3.1.2 Region level
Similar to the pixel level, a BoVW model is used to describe images at the region level. At this level, hue statistics and LBP descriptors are extracted from the segments or regions of the image. Therefore, the image must first be partitioned into its constituent regions. For this purpose, we adopt the JSEG method presented in .
Each region is divided into \(10 \times 10\) blocks and the 64-element hue descriptor is extracted from each block . The mean of the features over each region is used as its final feature vector. The histogram of LBP responses in a region is used as its LBP descriptor . The visual vocabulary at the region level contains V words, equally divided between the hue and LBP features.
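A minimal LBP descriptor for a rectangular patch might look like this; real JSEG regions are irregular, so this is only a sketch of the encoding itself:

```python
import numpy as np

def lbp_histogram(gray):
    """Basic 8-neighbour LBP: each interior pixel is encoded by thresholding
    its neighbours against it; the descriptor is the normalized 256-bin
    histogram of the codes."""
    g = gray.astype(float)
    c = g[1:-1, 1:-1]                       # interior pixels (the centers)
    code = np.zeros_like(c, dtype=int)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code += (nb >= c).astype(int) << bit
    hist = np.bincount(code.ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```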
3.1.3 Object level
According to a study of the distribution of images on the web, a large number of images contain a small group of well-known objects . At the object level, a set of 150 binary support vector machine (SVM) classifiers is trained to detect the 150 most frequent objects of that study. SVMs are well-known robust classifiers that induce a maximum decision margin between the compared classes . However, tuning their hyper-parameters requires a substantial amount of trial and error. Furthermore, it is hard to develop a probabilistic interpretation of their outputs. These drawbacks limit the application of SVMs, especially for small training samples. However, in the case of CBIR systems with sufficiently large datasets, SVMs outperform common classifiers such as regression models and decision trees. Neural networks could be considered as alternatives to SVM classifiers, but they are more memory-intensive and require a large amount of training time.
Following the approach of  to train the classifiers, a set of 100 images containing the desired object is selected as the positive sample and another set of 100 images without the object (e.g. images of the other object types) is selected as the negative sample.
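Training one such binary classifier could look like the following sketch, using scikit-learn's SVC (the paper does not name a library) on hypothetical fixed-length image descriptors:

```python
import numpy as np
from sklearn.svm import SVC  # assumed library; any SVM implementation works

rng = np.random.default_rng(0)
# Hypothetical descriptors: 100 positive images containing the object and
# 100 negative images without it, as described above.
pos = rng.normal(1.0, 0.3, size=(100, 8))
neg = rng.normal(-1.0, 0.3, size=(100, 8))
X = np.vstack([pos, neg])
y = np.array([1] * 100 + [0] * 100)

# One binary SVM per object; 150 such classifiers form the object level.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
```

At retrieval time, the 150 classifier outputs for an image form its object-level feature vector.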
3.1.4 Concept level
We use the word2vec version that is trained on Google News dataset . It maps the words to a 300-element vector space. The final feature vector for the concept level is the mean of the vectors corresponding to the detected objects.
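The concept-level descriptor reduces to averaging embedding vectors; `embeddings` below stands in for the pre-trained word2vec lookup (e.g. as loaded by a library such as gensim):

```python
import numpy as np

def concept_vector(detected_objects, embeddings, dim=300):
    """Concept-level descriptor: mean of the word2vec vectors of the detected
    object labels. Labels missing from the vocabulary are skipped."""
    vecs = [embeddings[w] for w in detected_objects if w in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```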
3.2 Feature fusion
4 Experimental results
4.1 Data and setup
In order to evaluate the efficiency of the proposed approach, we have conducted experiments on three sets of images. The first one, Wang , contains 1000 images in 10 different classes. Corel9C is the set of images used in ; it contains 900 images in nine classes. Corel5K  contains 5000 images in 50 classes. In all three datasets, each class contains 100 images, of which 10 are randomly selected as query images.
Precision measures the purity of the retrieved set. Recall, on the other hand, determines which fraction of the relevant images contained in the database is retrieved. The precision-recall \((P-R)\) curve denotes the balance between precision and recall at different thresholds. The higher the area under the curve, the higher the precision and recall. A perfect retrieval system is expected to retrieve the images of the same class as the query in the top ranks.
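For a single query, the two measures are computed as:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved images that are relevant.
    Recall: fraction of all relevant images in the database that were retrieved."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Sweeping the number of retrieved images traces out the precision-recall curve.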
The number of visual words, i.e. the vocabulary size, for each feature group at the pixel and region levels is experimentally selected to be 250. Thus, the pixel-level retrieval system includes 750 words (for three groups) and the region level contains 500 words (for two groups). For each group, a feature vector is assigned to its three nearest clusters. That is, each feature vector is represented by three visual words. This number is also selected based on experimental observations.
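The soft assignment to the three nearest visual words can be sketched as follows (equal weights are an assumption here; distance-based weighting is a common alternative):

```python
import numpy as np

def soft_assign(descriptor, vocabulary, m=3):
    """Assign a local descriptor to its m nearest visual words, so each
    descriptor contributes to m histogram bins instead of one."""
    d2 = ((vocabulary - descriptor) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:m]
    hist = np.zeros(len(vocabulary))
    hist[nearest] = 1.0 / m  # equal weight per selected word
    return hist
```

Summing these per-descriptor histograms over an image yields its soft BoVW representation.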
The average precision of the candidate similarity functions
4.2 Fusion results
An immediate observation concerns the performance of the individual levels versus the fused system. Figure 8 shows the precision/recall curves for the Wang and Corel5K datasets. While the results on Wang are much better than on Corel5K, the relative order of the systems is quite similar. Region-level retrieval has the lowest performance, which may be attributed to image segmentation inconsistencies. A segmentation algorithm may falsely partition a large segment or merge small and distinct segments, resulting in significant changes in the statistical features.
Incremental effect of multilevel features
4.3 Comparison with other systems
Average precision and recall for the fused system on the three datasets
5 Conclusion and future directions
In this paper, an image retrieval system is proposed that combines different types of image features. The pixel level extracts basic visual features of the image and carries little conceptual load. The region level tries to draw closer to the human visual system by segmenting the image and extracting features from its homogeneous regions. The object level tries to reduce the semantic gap by classifying and labeling objects within the image. The vector space modeling of the object names is further exploited to find similar concepts. The fusion system proposed in this paper incorporates the mentioned feature types into a single combined feature vector. The adopted auto-encoder network induces a lower-dimensional but informative feature vector.
The experimental results show the good performance of the proposed method on three datasets containing 900, 1000, and 5000 images. The results of the proposed method at the concept level indicate that the conceptual correlation of words corresponding to objects and scenes within the images can be used as a measure of their similarity. The proposed method achieves an average precision of about 88% and 89% on the Wang and Corel9C images and 68% on Corel5K, which is competitive with existing works.
In the future, we plan to extend this study in several directions. The first is to analyze feature correlations and their importance in constituting the final feature vector. Another is the direct inclusion of AlexNet's features into the fusion system. Finally, the application areas of the proposed system will be explored.
Compliance with ethical standards
Conflict of interest
The authors declare that there is no conflict of interest.
- 4.Bloch I (2013) Information fusion in signal and image processing: major probabilistic and non-probabilistic numerical approaches. Wiley, Hoboken
- 7.Moghimian A, Mansoorizadeh M, Dezfoulian MH (2018) Content based image retrieval using decision fusion of multilevel bag of words model. Tabriz J Electr Eng 50(4):1–13
- 8.Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-first AAAI conference on artificial intelligence
- 9.Liu H, Wang R, Shan S, Chen X (2016) Deep supervised hashing for fast image retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2064–2072
- 15.Hoiem D, Sukthankar R, Schneiderman H, Huston L (2004) Object-based image retrieval using the statistical structure of images. IEEE, pp 490–497
- 16.Li Y (2005) Object and concept recognition for content-based image retrieval. Citeseer
- 17.Li LJ, Su H, Fei-Fei L, Xing EP (2010) Object bank: a high-level image representation for scene classification & semantic feature sparsification. In: Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A (eds) Advances in neural information processing systems, pp 1378–1386
- 18.Jin Y, Khan L, Wang L, Awad M (2005) Image annotations by combining multiple evidence & wordnet. In: Proceedings of the 13th annual ACM international conference on multimedia. ACM, pp 706–715
- 19.Wang C, Jing F, Zhang L, Zhang HJ (2007) Content-based image annotation refinement. In: IEEE conference on computer vision and pattern recognition, CVPR'07. IEEE, pp 1–8
- 25.Long F, Zhang H, Feng DD (2003) Fundamentals of content-based image retrieval. In: Feng D, Siu WC, Zhang HJ (eds) Multimedia information retrieval and management. Springer, Berlin, pp 1–26
- 28.Van De Weijer J, Schmid C (2006) Coloring local feature extraction. In: European conference on computer vision. Springer, Berlin, pp 334–348
- 30.Zou Z, Shi Z, Guo Y, Ye J (2019) Object detection in 20 years: a survey. arXiv preprint arXiv:1905.05055
- 31.Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
- 32.Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, pp 1097–1105
- 33.Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems, pp 3111–3119
- 34.Goldberg Y, Levy O (2014) word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722