1 Introduction

With the extensive use of social networking sites, technological improvements in image acquisition devices, and the exponential growth of image repositories, researchers have become interested in finding effective and efficient mechanisms to search images in huge collections [1,2,3]. Commonly used image annotation methods are based on mapping image descriptors to a few keywords; owing to their limited discriminative capability, these methods cannot describe the diversity of contents within images. Content-based image retrieval (CBIR) has therefore gained research significance over the years [4, 5]. The focus of any CBIR technique is to compute low-level visual features from images and to model the association between them in terms of the co-occurrence of similar visual contents. The visual contents of an image are represented in terms of low-level visual features such as texture, color, and shape [6]. Texture-based features capture spatial variations in intensity values and the surface attributes of objects within an image; however, segmenting texture in a way that matches human perception is a challenging task [7]. Color-based features are invariant to scale and rotation but have a high computational cost. Shape-based features do not provide a mathematical foundation for image deformation, although they are generally consistent with intuition [4]. Therefore, image representation using only low-level features cannot describe the semantic relationships between images efficiently. Resemblance in the pictorial appearance of images belonging to different categories results in closeness of their low-level visual features, which decreases the performance of CBIR [8, 9].

In the visual bag-of-words (VBoW) framework [10], features are computed from each image and then clustered to formulate a codebook of visual words. These visual words are used to build a global histogram for each image, which results in a loss of spatial information; classification is then performed, and the similarity between the query image and the archive images is calculated in order to retrieve images. The robustness of VBoW-based image representation suffers because the spatial context among local features is ignored [11,12,13,14]. Techniques such as geometric coding [15] and co-occurrence of visual words [16] have been introduced to add the spatial context of visual words to the VBoW framework, but they incur high computational complexity for larger codebook (dictionary) sizes [11]. Spatial information is available in the sub-areas of an image. The technique of Lazebnik et al. [17] splits an image into square areas and formulates a spatial histogram from each square area of the grid. Keeping in view the effective performance of [17] in incorporating spatial information, the technique proposed in this article splits an image into four adapted triangular regions instead of square regions [17], extracts adapted local features, formulates weighted soft codebooks, and computes histograms over each triangular area of the image.
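For reference, the baseline VBoW pipeline described at the start of this paragraph can be summarized in a few lines of Python. This is a minimal sketch, not the implementation of [10]; the function names and the use of scikit-learn's k-means are assumptions, and the sketch illustrates why a single global histogram discards the spatial layout of the features:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(training_features, num_words):
    """Quantize local features pooled from the training images into visual words."""
    return KMeans(n_clusters=num_words, init='k-means++').fit(training_features).cluster_centers_

def global_vbow_histogram(image_features, codebook):
    """Classic VBoW: each local feature votes for its nearest visual word.
    Only the counts are kept, so the positions of the features are lost."""
    dists = np.linalg.norm(image_features[:, None, :] - codebook[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)
    return np.bincount(words, minlength=len(codebook)).astype(float)
```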

The content-based properties of an image describe the presence of salient objects in the image, while the compositional properties describe the image layout and include the photographic rules of composition [18]. The principle of the rule of thirds is to divide an image into nine square areas and place the salient objects at the intersection points of the grid; this captures content-based and compositional properties that improve CBIR performance [18]. Figure 1a shows an image captured according to the photographic rule of thirds, with salient objects placed at the intersection points of the square areas. Figure 1b shows the same image separated into four triangular areas, from which features are extracted over four dense scales (one per triangular area) to add spatial information to the VBoW framework and to reduce the semantic gap. In an image of a scene, different regions or objects are located in different sub-regions: water or grass is likely to be located at the bottom and clouds or sky at the top, while salient objects are positioned at the left or right side. Dividing an image into four adapted triangular areas captures this spatial association.

Fig. 1

Image a (http://mikemellizaphotography.blogspot.com/) shows the photographic rule of thirds, while image b shows its division into four adapted triangular areas

Figure 2a presents two images with very close visual similarity but different semantic meanings: the feature maps of the dog and the lady are visually analogous, but their semantic implications are entirely different. In Fig. 2b, the image on the left is the query image; the retrieval output may emphasize the modern building, so the best match could be a modern building like the one in the middle image, while in other cases the emphasis may be on the scenic beauty, so the rightmost image is the best match. The best CBIR system is the one that retrieves images accurately according to the user's requirements. Different regions such as ground, water, sky, clouds, and grass are located within different triangular areas, while salient objects such as people and horses are positioned at the left or right side. Keeping these facts in view, we compute dense local intensity order pattern (LIOP) features from the four adapted triangular areas, quantize the feature space to formulate a weighted soft codebook from each triangular area, and compute a spatial histogram from each adapted triangular area of the image. The proposed technique provides a way to extract the spatial properties of an image from each of the triangular areas and addresses the problem of overfitting on larger codebook sizes by formulating weighted soft codebooks. The key contributions of this article are as follows:

  1. The addition of spatial information to the VBoW framework.

  2. An adapted triangular region-based technique that extracts features over four adapted triangular regions of an image and resolves the problem of overfitting on larger codebook sizes by formulating a weighted soft codebook over each triangular region.

  3. Reduction of the semantic gap between high-level semantic concepts and the low-level features of the image.

Fig. 2

a Images with very close visual appearance but different semantic meanings. b The image on the left is the query image, while the images of the building and the lady are the images retrieved in response to the query due to their close visual appearance

2 Related work

The query by image content (QBIC) system [19] is the first CBIR system, introduced by IBM, and many systems were developed by IBM after QBIC. Common to all of them was the aim of enhancing image searching techniques and similarity matching in order to increase the performance of CBIR. In the existing literature [20, 21], several methods have been implemented to overcome the limitations of CBIR, i.e., to reduce the semantic gap between low-level features and high-level semantic concepts. To address these issues, research has focused on local features such as color, boundary contour, texture, and spatial layout, and different discriminative feature extraction techniques have been introduced to enhance the performance of CBIR.

An optimized technique for image retrieval is introduced by Zhong and Defée [22], which relies on quantized histograms of discrete cosine transform (DCT) blocks and uses different global parameters such as scalar quantization, histogram size, difference vectors, and integrated AC-pattern and DC-DirecVec histograms. The histograms are optimized through the quantization factor together with the number of DCT blocks, normalized under luminance, in order to improve CBIR performance. Yuan et al. [23] propose a local descriptor that integrates scale-invariant feature transform (SIFT) and local binary pattern (LBP) features to obtain a high-dimensional feature vector for each image. Two fusion models, i.e., patch-level and image-level, are employed for feature fusion. For a compact representation of the high-dimensional feature vector, k-means clustering is used to formulate a codebook. According to the semantic category of the query image, images are retrieved and ranked based on the similarity measure. Yu et al. [24] present a feature fusion technique that uses histogram of oriented gradients (HOG), SIFT, and LBP features in order to achieve effective results for CBIR. A high-dimensional feature descriptor is formed by fusing the separately computed SIFT, LBP, and HOG features. These fused features are then encoded into visual words by applying k-means clustering to form a dictionary, and each image is characterized as a distribution of these visual words. Raja and Bhanu [25] propose an improved CBIR technique based on local image features such as color, texture, shape, and wavelet-based histograms, and incorporate relevance feedback to achieve better accuracy. A visual similarity matching technique known as adaptive region matching (ARM) is proposed by Yang and Cai [26], which uses region-based image retrieval (RBIR). A semantically meaningful region (SMR) and a region importance index (RII) are built to reduce the adverse effects of interference regions and the loss of spatial information. Images are compared according to whether the given image has an SMR, and SMR-to-image matching is performed to improve the performance of CBIR.

Wang et al. [27] propose a spatial weighting bag-of-features (SWBoF) model of visual words by applying texture measures. The spatial information is extracted from diverse areas of the image, and the dissimilarity between groups of pixels is selected to compute the useful information. The spatial information is computed by applying local entropy, adjacent-block distance, and local variance. According to the experimental results of [27], the SWBoF model performs better than the traditional BoF approach. According to Liu et al. [11], the spatial information among local features carries significant information for content verification. A rotation- and scale-invariant edge orientation difference histogram (EODH) descriptor is proposed by Tian et al. [28]. The steerable filter and vector sum are applied to obtain the main orientation of pixels. The color-SIFT and EODH descriptors are integrated to improve the effectiveness of the feature space and to reduce the semantic gap, and the dictionary is constructed by applying a weighted average of color-SIFT and EODH. According to the experimental results of [28], the weighted average distribution enhances the performance of image retrieval.

Rashno et al. [29] propose an effective technique for CBIR that relies on the discrete wavelet transform (DWT) and color features. In this technique, the visual contents of each image are represented by a feature vector that comprises texture features obtained by applying the wavelet transform and color features obtained by converting each image from RGB to HSV space. In the wavelet transform, each image is decomposed into four sub-bands, and the low-frequency sub-band is used as the texture feature. For the color features, the dominant color descriptor (DCD) is used to quantize the image and obtain color statistics and histogram features. Ant colony optimization is used for selecting relevant and unique features from the entire feature set consisting of both color and texture features. Images are retrieved by applying the Euclidean distance to find the resemblance between the query image and the database images. Rahimi and Moghaddam [30] introduce a CBIR technique that uses intraclass and interclass features to improve the performance of CBIR. The distribution of the color tone is used as an intraclass feature, whereas singular value decomposition (SVD) and the complex wavelet transform are used as interclass features. A self-organizing map (SOM) is produced from these features by applying an artificial neural network (ANN) to increase the proficiency of CBIR. Yan et al. [31] propose a novel technique, known as the one-stage supervised deep hashing framework (SDHP), that uses a deep convolutional neural network to analyze image contents and learn high-quality binary codes. The technique assigns similar binary codes to similar images and vice versa. The learned codes are evenly distributed, and the quantization loss incurred when converting from Euclidean space to Hamming space is reduced. The discriminative power of the learned binary codes is further improved by extending SDHP to SDHP+, which significantly improves the search accuracy compared with state-of-the-art hashing algorithms. Yan et al. [32] present another novel framework for recognizing Uyghur language text in images with intricate backgrounds. The maximally stable extremal regions (MSERs) technique is used for detecting text regions, but one of its shortcomings is that it does not perform well on blurred and low-contrast images. For this reason, another technique, known as channel-enhanced MSERs, is introduced. It outperforms the traditional MSERs technique but still suffers from noise and overlapping regions; HOG features and a support vector machine (SVM) are employed to remove non-text regions, overlapping regions, and noise. One of the most important outcomes of this technique is its usefulness for detecting Uyghur text, and other languages can be identified by changing some empirical rules and parameters. Different efficient techniques have been introduced to analyze image and video contents in a variety of applications [33, 34].

3 Proposed methodology

The framework of the proposed technique is presented in Fig. 3. We obtain the spatial information by separating an image into four adapted triangular areas. This enables the extraction of visual features, i.e., LIOP features, weighted soft codebooks, and spatial histograms, from the top, bottom, left, and right areas of the image. Figure 3 presents the procedure for computing dense LIOP features, weighted soft codebooks, and spatial histograms over the four triangular areas of the image. Each step of the proposed technique is described as follows:

  1. The approximation coefficient of each image (represented by IMG) in the training and test sets, obtained after a level-2 decomposition using the discrete wavelet transform (DWT), is divided into four adapted triangular areas whose corner points are given by the following equations, where h and w denote the height and width of the approximation coefficient (a sketch of this step follows Eq. 4):

Fig. 3

Framework of the proposed technique based on the adapted triangular areas and weighted soft codebooks

$$ R_{\mathrm{ttp}1}=\mathrm{IMG}(1,1),\quad R_{\mathrm{ttp}2}=\mathrm{IMG}(1,w),\quad R_{\mathrm{ttp}3}=\mathrm{IMG}\left(\frac{h}{2},\frac{w}{2}\right) $$
(1)
$$ R_{\mathrm{btp}1}=\mathrm{IMG}(h,1),\quad R_{\mathrm{btp}2}=\mathrm{IMG}(h,w),\quad R_{\mathrm{btp}3}=\mathrm{IMG}\left(\frac{h}{2},\frac{w}{2}\right) $$
(2)
$$ R_{\mathrm{ltp}1}=\mathrm{IMG}(1,1),\quad R_{\mathrm{ltp}2}=\mathrm{IMG}(h,1),\quad R_{\mathrm{ltp}3}=\mathrm{IMG}\left(\frac{h}{2},\frac{w}{2}\right) $$
(3)
$$ R_{\mathrm{rtp}1}=\mathrm{IMG}(1,w),\quad R_{\mathrm{rtp}2}=\mathrm{IMG}(h,w),\quad R_{\mathrm{rtp}3}=\mathrm{IMG}\left(\frac{h}{2},\frac{w}{2}\right) $$
(4)
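The following Python sketch illustrates step 1 under stated assumptions: the level-2 approximation is obtained with PyWavelets using a Haar wavelet (the article does not name the wavelet), and each triangular area is bounded by two image corners and the centre point (h/2, w/2) as in Eqs. 1–4; the function names are hypothetical.

```python
import numpy as np
import pywt

def triangular_masks(h, w):
    """Boolean masks of the top, bottom, left, and right triangles defined by
    the image corners and the centre point (h/2, w/2), as in Eqs. 1-4."""
    rows, cols = np.mgrid[0:h, 0:w]
    main_diag = rows * (w - 1) - cols * (h - 1)                      # (1,1)-(h,w) diagonal
    anti_diag = rows * (w - 1) + cols * (h - 1) - (h - 1) * (w - 1)  # (1,w)-(h,1) diagonal
    top = (main_diag <= 0) & (anti_diag <= 0)
    bottom = (main_diag >= 0) & (anti_diag >= 0)
    left = (main_diag >= 0) & (anti_diag <= 0)
    right = (main_diag <= 0) & (anti_diag >= 0)
    return top, bottom, left, right

def adapted_triangular_areas(image):
    """Level-2 DWT approximation of IMG split into four adapted triangular areas."""
    approx = image.astype(float)
    for _ in range(2):                      # level-2 decomposition
        approx, _ = pywt.dwt2(approx, 'haar')
    h, w = approx.shape
    return [np.where(mask, approx, 0.0) for mask in triangular_masks(h, w)]
```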
  2. The LIOP features [35] are computed from each adapted triangular area of the image by applying the following equations:

$$ \mathrm{LIOP\ descriptor}=\left(\mathrm{des}_1,\mathrm{des}_2,\dots,\mathrm{des}_i\right) $$
(5)
$$ \mathrm{des}_l=\sum_{x\in \mathrm{bin}_l}w(x)\,\mathrm{LIOP}(x) $$
(6)

where

$$ \mathrm{LIOP}(x)=\varphi\left(\gamma\left(P(x)\right)\right) $$
(7)

where

$$ P(x)=\left(I(x_1),I(x_2),\dots,I(x_N)\right)\in P^N $$
(8)

where φ is a feature mapping function that maps the permutation π to an N!-dimensional feature vector \( V_{N!}^i \) whose elements are all 0 except for the ith element, which is 1. The feature mapping function φ is defined by the following equation:

$$ \varphi(\pi)=V_{N!}^{\mathrm{Ind}(\pi)},\quad \pi\in \Pi^N $$
(9)

where \( V_{N!}^{\mathrm{Ind}(\pi)}=\left(0,\dots,0,1_{\mathrm{Ind}(\pi)},0,\dots,0\right) \) and Ind(π) represents the index of π in the index table.

$$ \mathrm{LIOP}(x)=V_{N!}^{\mathrm{Ind}\left(\gamma\left(P(x)\right)\right)} $$
$$ \mathrm{LIOP}(x)=\left(0,\dots,0,1_{\mathrm{Ind}\left(\gamma\left(P(x)\right)\right)},0,\dots,0\right) $$
(10)
$$ \mathrm{and}\quad w(x)=\sum_{i,j}\operatorname{sgn}\left(\left|I(x_i)-I(x_j)\right|-T_{lp}\right)+1 $$
(11)

In the above equations, for a sample point x, \( I(x_n) \) represents the intensity of the nth neighboring sample, \( T_{lp} \) is a preset threshold, sgn is the sign function, w(x) is the weighting function of the LIOP descriptor, φ is the feature mapping function, and i and j index the neighboring samples \( x_i \) and \( x_j \) in the weighting sum.
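As a minimal sketch of Eqs. 5–11 (not the reference implementation of [35]), the per-point LIOP term and its accumulation into the descriptor can be written as follows. The neighbourhood sampling, the ordinal binning of the sample points, and the threshold value are left to the caller, and the +1 of the weighting is applied per neighbour pair, which is one possible reading of Eq. 11:

```python
import numpy as np
from itertools import permutations
from math import factorial

def liop_term(neighbor_intensities, threshold):
    """Weighted LIOP contribution of one sample point x (Eqs. 7-11)."""
    N = len(neighbor_intensities)
    perm = tuple(np.argsort(neighbor_intensities))          # gamma(P(x))
    index_table = {p: i for i, p in enumerate(permutations(range(N)))}
    one_hot = np.zeros(factorial(N))
    one_hot[index_table[perm]] = 1.0                         # LIOP(x), Eq. 10
    # w(x): each neighbour pair with |I(x_i) - I(x_j)| > T_lp contributes sgn(.) + 1 = 2
    w = sum(np.sign(abs(a - b) - threshold) + 1
            for i, a in enumerate(neighbor_intensities)
            for b in neighbor_intensities[i + 1:])
    return w * one_hot

def liop_descriptor(points_by_bin, threshold=5.0):
    """des_l = sum of weighted LIOP terms over the points of ordinal bin l (Eqs. 5-6);
    each bin is assumed non-empty."""
    return np.concatenate([
        np.sum([liop_term(nbrs, threshold) for nbrs in bin_points], axis=0)
        for bin_points in points_by_bin])
```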

  3. A codebook is a collection of visual words. In order to formulate four weighted soft codebooks, k-means++ clustering [36] is applied to the features extracted from each adapted triangular area, which produces four soft codebooks. To resolve the problem of overfitting on larger codebook sizes, each soft codebook is multiplied by a weight of 0.25 (since four weighted soft codebooks are formulated, 1/4 gives a weight w of 0.25), which produces four weighted soft codebooks. k-means++ is chosen because it automatically selects the initial seeds for clustering, which improves the clustering results. The four weighted soft codebooks are represented by the following equations (a sketch of this step follows their description):

$$ C_{wst}=0.25\times \left\{v_{t1},v_{t2},v_{t3},v_{t4},v_{t5},\dots,v_{tx}\right\} $$
(12)
$$ C_{wsb}=0.25\times \left\{v_{b1},v_{b2},v_{b3},v_{b4},v_{b5},\dots,v_{bx}\right\} $$
(13)
$$ C_{wsl}=0.25\times \left\{v_{l1},v_{l2},v_{l3},v_{l4},v_{l5},\dots,v_{lx}\right\} $$
(14)
$$ C_{wsr}=0.25\times \left\{v_{r1},v_{r2},v_{r3},v_{r4},v_{r5},\dots,v_{rx}\right\} $$
(15)

where \( C_{wst} \), \( C_{wsb} \), \( C_{wsl} \), and \( C_{wsr} \) represent the weighted soft codebooks formulated from the top, bottom, left, and right adapted triangular areas of the image, respectively. The visual words \( v_{t1} \) to \( v_{tx} \), \( v_{b1} \) to \( v_{bx} \), \( v_{l1} \) to \( v_{lx} \), and \( v_{r1} \) to \( v_{rx} \) belong to the weighted soft codebooks of the top, bottom, left, and right adapted triangular areas, respectively.
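A minimal sketch of step 3, assuming scikit-learn's k-means++ initialization as a stand-in for [36]; the weight of 0.25 follows Eqs. 12–15, and the function names are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def weighted_soft_codebook(area_features, num_words, weight=0.25, seed=0):
    """Cluster the dense LIOP features of one triangular area with k-means++
    and scale the resulting visual words by the region weight (Eqs. 12-15).
    weight = 1/4 because four codebooks (top, bottom, left, right) are built."""
    km = KMeans(n_clusters=num_words, init='k-means++', n_init=10,
                random_state=seed).fit(area_features)
    return weight * km.cluster_centers_

# One weighted soft codebook per adapted triangular area, e.g.
# C_wst = weighted_soft_codebook(top_area_features, 200)
```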

  4. Each quantized feature descriptor of an adapted triangular area of the image is mapped to its associated visual word by applying the following equation (sketched below):

$$ v(a_n)=\operatorname{argmin}_{v\in C_{atr}}\ \mathrm{Dist}(v,a_n) $$
(16)

where \( v(a_n) \) represents the visual word associated with the nth feature descriptor \( a_n \), the distance between visual word v and feature descriptor \( a_n \) is represented by \( \mathrm{Dist}(v, a_n) \), and \( C_{atr} \) represents the weighted soft codebook of the associated triangular area of the image.
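A sketch of the nearest-visual-word assignment of Eq. 16, using a brute-force Euclidean search (a k-d tree or similar index could be substituted for larger codebooks):

```python
import numpy as np

def assign_visual_words(descriptors, codebook):
    """v(a_n) = argmin over v in C_atr of Dist(v, a_n), Eq. 16."""
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)     # index of the associated visual word
```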

  5. A spatial histogram is formulated over the x visual words of the weighted soft codebook of each adapted triangular area, as shown in Fig. 4. The four adapted triangular areas of the image thus produce four spatial histograms.

  6. The spatial histograms of the four adapted triangular areas of the image are concatenated, which adds the spatial information to the VBoW framework. Let x be the total number of visual words of each weighted soft codebook \( C_{ws} \). If visual word \( v_i \) is mapped to the descriptor set \( d_{si} \), then the ith bin of each spatial histogram, \( h_i \), is the cardinality of \( d_{si} \), as follows (a sketch of steps 5 and 6 follows Eq. 17):

Fig. 4

Image a is chosen from the Corel-1K image benchmark and shows the division into four adapted triangular areas, and image b presents the method for computing spatial histograms over the four adapted triangular areas of the image [20, 51]

$$ h_i=\mathrm{Card}(d_{si})\qquad \mathrm{and}\qquad d_{si}=\left\{v_{dsi},\ dsi\in (1,\dots,x)\mid C_{ws}(v_{dsi})=v_i\right\} $$
(17)
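A sketch of steps 5 and 6, assuming the word indices of each area come from the assignment sketch above; \( h_i \) simply counts the descriptors mapped to visual word \( v_i \) (Eq. 17), and the four area histograms are concatenated in a fixed top-bottom-left-right order:

```python
import numpy as np

def spatial_histogram(word_indices, num_words):
    """h_i = Card(d_si): number of descriptors of one triangular area mapped to v_i."""
    return np.bincount(word_indices, minlength=num_words).astype(float)

def image_representation(word_indices_by_area, num_words):
    """Concatenate the four triangular spatial histograms (top, bottom, left, right)."""
    return np.concatenate([spatial_histogram(idx, num_words)
                           for idx in word_indices_by_area])
```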
  7. The resultant spatial histogram of each image is normalized by applying the Hellinger kernel function [37] of the support vector machine (SVM). The images of each reported image benchmark are divided into training and test sets, and the SVM classifier is trained using the normalized histograms of the training images. The best values of the regularization parameters (C and gamma) are determined by applying 10-fold cross validation on the training set of each reported image benchmark.

  8. The Euclidean distance [38] is used as the similarity measure. Images are retrieved by measuring the similarity between the classifier score of the query image and the scores of the archive images for each reported image benchmark (a sketch of steps 7 and 8 is given below).

4 Experimental results and discussions

To evaluate the performance of the proposed methodology, we selected five challenging image benchmarks, and the results are compared with those of recent CBIR techniques. The images are randomly divided into training and test sets. Visual features are extracted by applying the dense LIOP feature descriptor, and all processing is performed on grayscale images.

Training images are used to formulate the weighted soft codebooks, and test images are used to compute the retrieval precision. Due to the unsupervised nature of k-means++ clustering, each experiment is repeated 10 times and the average values are reported; for every iteration, the images are randomly divided into training and test sets. Lazebnik et al. [17] proposed a CBIR technique based on spatial pyramid matching, which divides an image into several rectangular grids and formulates histograms from each region of the grid. The proposed image representation based on adapted triangular areas is compared with the 2 × 2 RSH technique of CBIR, which divides an image into a 2 × 2 rectangular grid. Figure 5 presents the division of an image into four rectangular areas. The histograms are computed from each rectangular region to compare the spatial rectangular and adapted triangular histograms of visual words.

Fig. 5

Image a is chosen from the semantic class “Horses” of the Corel-1K image benchmark, and image b presents the division of the image into four rectangular areas for the computation of a histogram from each rectangular area [40]

4.1 Experimental parameters and performance evaluation metrics

The details about the parameters used for the experimental research are given below:

  a) Codebook size: The images of each reported image collection are divided into two sets, known as the training and test sets. The images of the training set are used to formulate the dictionary for each image benchmark. The performance of CBIR techniques based on the VBoW model is affected by varying the codebook size.

  b) Step size: Dense LIOP is used for feature extraction. For precise content-based image matching, we extract dense features from the four triangular regions of each image (at four different scales). The step size controls the sampling density; it is the vertical and horizontal displacement from each feature center to the next. The proposed technique is evaluated using pixel step sizes of 10, 15, and 25. For a step size of 10, every 10th pixel is selected to compute the LIOP descriptor.

  c) Features percentage per image for dictionary learning: According to [39], the percentage of features per image used for codebook (dictionary) learning from the training set is an important parameter that affects the performance of CBIR. We formulated the dictionary using different feature percentages (10, 25, 50, 75, and 100%) per image of the training set.

Precision, recall, average precision (AP), and mean average precision (MAP) are the standard metrics for evaluating the performance of a CBIR system; the proposed technique is evaluated using these metrics, and a small numerical sketch of them is given at the end of this subsection.

  d) Precision: The specificity of the image retrieval model is evaluated by the precision P, which is defined as follows:

$$ P=\frac{C_r}{R_t} $$
(18)
  e) Recall: The recall R evaluates the sensitivity of the image retrieval model and is defined as follows:

$$ R=\frac{C_r}{T_c} $$
(19)

where \( R_t \), \( C_r \), and \( T_c \) represent the total number of retrieved images, the number of correctly retrieved images, and the total number of images per class, respectively.

  f) Average precision (AP): The AP for a set of image queries is the average of the precision values over a particular class of the image benchmark, defined as follows:

$$ AP=\frac{\sum_{j=1}^MP(j)}{M} $$
(20)
  g) Mean average precision (MAP): For a set of image queries, the MAP is the mean of the average precision values of the individual queries, defined as follows:

$$ MAP=\frac{\sum_{j=1}^M AP(j)}{M} $$
(21)

where M is the total number of image queries.
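The metrics above reduce to simple ratios and means; the following sketch uses illustrative values only (hypothetical function names):

```python
import numpy as np

def precision(correct_retrieved, total_retrieved):
    return correct_retrieved / total_retrieved            # Eq. 18: P = C_r / R_t

def recall(correct_retrieved, total_per_class):
    return correct_retrieved / total_per_class            # Eq. 19: R = C_r / T_c

def average_precision(per_query_precisions):
    return float(np.mean(per_query_precisions))           # Eq. 20

def mean_average_precision(per_query_average_precisions):
    return float(np.mean(per_query_average_precisions))   # Eq. 21

# e.g. 18 of the top-20 retrieved images are correct and the class holds 100 images:
# precision(18, 20) -> 0.90, recall(18, 100) -> 0.18
```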

4.2 Analysis of the evaluation metrics on the image benchmark of the Corel-1K

The Corel-1K image benchmark is a subset of the WANG image benchmark [40]. The proposed technique is evaluated on the Corel-1K image benchmark, and recent CBIR techniques [28, 41,42,43,44] are used for performance comparison. The Corel-1K image benchmark comprises 1000 images organized into 10 semantic classes. A sample image from each class of the Corel-1K is presented in Fig. 6. The test and training sets of the Corel-1K contain 300 and 700 images, respectively. The MAP performance for the top-20 retrieved images with a step size of 10, using different weighted soft codebook sizes and feature percentages for weighted soft codebook learning, is shown in Table 1.

Fig. 6

Sample images associated with image benchmark of the Corel-1K [40]

Table 1 MAP performance on the image benchmark of the Corel-1K with a step size = 10

The MAP is calculated by taking the mean of the column-wise values of Table 1. The comparison of the MAP performance of the proposed technique and the 2 × 2 RSH technique for dense pixel strides of 10, 15, and 25 is presented in Fig. 7. According to the experimental results, the MAP obtained using the proposed technique with a pixel stride of 10 is 87.22%, while the MAP obtained with pixel strides of 15 and 25 is 84.27 and 78.13%, respectively (with a weighted soft codebook size of 200 visual words). This shows that increasing the pixel stride decreases the MAP performance and vice versa. In order to demonstrate the sustained performance of the proposed technique, the MAP for the top-20 retrievals is calculated and compared with recent CBIR techniques [28, 41,42,43,44]. Tables 2 and 3 present the class-wise comparisons of average precision and recall on the Corel-1K image benchmark.

Fig. 7

Effect on the MAP performance by varying weighted soft codebook sizes on the image benchmark of the Corel-1K

Table 2 MAP performance comparison of the proposed technique with recent CBIR techniques on the image benchmark of the Corel-1K
Table 3 Average-recall performance comparison of the proposed technique with recent CBIR techniques on the image benchmark of the Corel-1K

The experimental results and comparisons conducted on the Corel-1K image benchmark prove the robustness of the proposed technique. The mean precision and recall values obtained using the proposed technique are higher than those of the recent CBIR techniques [28, 41,42,43,44]. The image retrieval results for the semantic classes “Mountains” and “Elephants” in response to the query images are presented in Figs. 8 and 9, respectively; they show a reduction of the semantic gap in terms of the classifier decision value (score). Among the top-20 retrieved images, those whose score is close to the score of the query image are more similar to the query image and vice versa.

Fig. 8

Top-20 retrieved images associated with “Mountains” class of the image benchmark of the Corel-1K

Fig. 9

Top-20 retrieved images associated with “Elephants” class of the image benchmark of the Corel-1K

4.3 Analysis of the evaluation metrics on the image benchmark of the Corel-1.5K

There are 15 semantic classes in the Corel-1.5K image benchmark, and each semantic class contains 100 images. The Corel-1.5K image benchmark is also a subset of the WANG image benchmark [40] and is used for the performance comparison of the proposed technique with [45]. Figure 10 presents sample images from each semantic class of the Corel-1.5K image benchmark. The MAP performance as a function of the weighted soft codebook size is presented graphically in Fig. 11. The precision and recall values obtained using the proposed technique are compared with the recent CBIR technique of [45] in Table 4.

Fig. 10

Sample images associated with image benchmark of the Corel-1.5K [40]

Fig. 11

Effect on the MAP performance by varying weighted soft codebook sizes on the image benchmark of the Corel-1.5K

Table 4 MAP performance and recall comparison of the proposed technique with recent CBIR techniques on the image benchmark of the Corel-1.5K

According to the experimental results, a MAP of 85.56% is obtained using the proposed technique with a weighted soft codebook size of 400 visual words (and a pixel step size of 10), while the MAP obtained using the 2 × 2 RSH technique is 84.97%. The proposed image representation outperforms the 2 × 2 RSH-based CBIR technique as well as the recent CBIR technique of [45].

4.4 Analysis of the evaluation metrics on the image benchmark of the 15-Scene

The 15-Scene image benchmark [46] comprises 4485 images organized into 15 categories. Each category contains 200 to 400 images of outdoor and indoor scenes, as shown in Fig. 12. The resolution of each image in this collection is 250 × 300 pixels. The MAP performance of the proposed technique for different dictionary sizes is shown in Fig. 13 and compared with that of the 2 × 2 RSH-based CBIR technique.

Fig. 12

Sample images associated with image benchmark of the 15-Scene

Fig. 13

Effect on the MAP performance by varying dictionary size on the image benchmark of the 15-Scene

The experimental details shown in Fig. 13 and Table 5 indicate the efficiency of the proposed technique for the reported weighted soft codebook sizes as compared with the 2 × 2 RSH-based CBIR technique and recent CBIR techniques [26, 40, 41]. The proposed technique and the 2 × 2 RSH technique give their best MAP performance of 79.02 and 77.99%, respectively, with a weighted soft codebook of 800 visual words and a step size of 10.

Table 5 MAP performance and recall comparison of the proposed technique with recent CBIR techniques on the image benchmark of the 15-Scene

4.5 Analysis of the evaluation metrics on the image benchmark of the Ground-truth

The Ground-truth image benchmark contains a total of 1109 images organized into 22 semantic classes and is commonly used for the performance evaluation of recent CBIR techniques [47,48,49]. For a clear comparison, 5 semantic classes comprising a total of 228 images are chosen, as the performance of the recent CBIR techniques [47,48,49] is also evaluated on the same classes. Sample images of the chosen categories of the Ground-truth image benchmark are shown in Fig. 14, while the MAP performance comparison of the proposed technique with recent CBIR techniques [47,48,49] is presented in Fig. 15, which proves the robustness of the proposed technique with a weighted soft codebook of 60 visual words and a step size of 10.

Fig. 14

Sample images associated with image benchmark of the Ground-truth

Fig. 15

Analysis of MAP performance comparison of the proposed technique with recent CBIR techniques [47,48,49] on the image benchmark of the Ground-truth

4.6 Analysis of the evaluation metrics on the image benchmark of the Caltech-256

The Caltech-256 image benchmark was released in 2007 and is a successor of the Caltech-101 image benchmark. It contains 30,607 images categorized into 256 semantic classes. Every semantic class contains a minimum of 80 images, rotation artifacts are avoided, and the images show high diversity in representation; reported benchmark performance on Caltech-256 is roughly half of that on Caltech-101, reflecting its difficulty [50]. The best MAP performance of the proposed technique is 31.19%, achieved with a weighted soft codebook size of 1400 visual words. The performance of the proposed technique is compared with recent CBIR techniques [20, 44, 45], which also proves its robustness, as presented in Table 6. The performance analysis in terms of the precision-recall (PR) curve of the proposed technique is presented in Fig. 16 for the Corel-1K, Corel-1.5K, 15-Scene, Ground-truth, and Caltech-256 image benchmarks.

Table 6 MAP performance and recall comparison of the proposed technique with recent CBIR techniques on the image benchmark of the Caltech-256
Fig. 16

PR-curve of the proposed technique on the image benchmarks of the Corel-1K, Corel-1.5K, 15-Scene, Ground-truth, and Caltech-256

4.7 Requirement of the computational resources

The performance of the proposed technique is measured on a computer with the following hardware specifications: 8 GB of RAM, a GPU with 2 GB of memory, and an Intel(R) Core i7 processor with a 2.4 GHz clock frequency. The software resources required for the implementation of the proposed technique are the Microsoft Windows 7 64-bit operating system and MATLAB 2015a. The computational complexity (time) required for feature extraction in the proposed technique is presented in Table 7, while the computational complexity of the complete image retrieval framework is presented in Table 8. The computational complexity is reported on the Corel-1K image benchmark, in which each image has a resolution of 384 × 256 or 256 × 384 pixels.

Table 7 Computational complexity (time in seconds) of the proposed technique required for feature extraction only
Table 8 Computational complexity (time in seconds) of the proposed image retrieval framework and its comparisons with recent CBIR techniques

5 Conclusions

In this article, we have proposed a novel image representation based on adapted triangular areas and weighted soft codebooks. Dense LIOP features, weighted soft codebooks, and spatial histograms are extracted over the four triangular areas of the image. The proposed technique adds spatial context information to the inverted index of the VBoW model. The collection of dense LIOP features and spatial histograms over the four adapted triangular areas of an image is a possible solution for adding spatial information to the VBoW model and for reducing the semantic gap between the low-level features of the image and high-level semantic concepts. The problem of overfitting on larger codebook sizes is reduced by the weighted soft codebooks, which further improves CBIR performance. The Hellinger kernel of the SVM is selected for image classification. The proposed technique is evaluated on five challenging image benchmarks, and the results are compared with recent CBIR techniques and the 2 × 2 RSH-based CBIR technique; the proposed image representation outperforms both. In the future, we plan to replace the VBoW model with the vector of locally aggregated descriptors (VLAD) or the Fisher kernel framework to evaluate the proposed technique for large-scale image retrieval.