1 Introduction

Search engines are widely used in real-time applications. Digital libraries and text search engines generally work in a similar fashion: both maintain several indexes and use words to store and retrieve items in the search space [1, 25, 34]. Various searching and indexing approaches are used to implement search engines; popular examples include the Boolean retrieval model and the inverted index [11, 21, 25, 34]. However, the indexes become exponentially more complex as the search space grows [25]. Therefore, indexes are ranked based on their retrieval frequency [34]. Even so, evaluating queries whose similarity values are significantly low is still a challenging task [10, 14, 31]. To overcome this query mismatch issue, query expansion approaches have been implemented to improve the results [28], in which alternative or similar query words are used to prevent the mismatch [24]. Thereafter, linguistic approaches were introduced, such as latent semantic indexing [2], the term-document matrix [7], WordNet [20], singular value decomposition [8], and latent Dirichlet allocation [3, 24, 30]. However, the development of efficient search engine queries, especially in the case of biomedical image query mismatch, is still regarded as an ill-posed problem.

Figure 1 shows a diagrammatic representation of the biomedical search engine. It shows that an efficient training model is required to build an offline biomedical image search engine. During the online phase, users can submit query images and obtain the respective results.

Fig. 1 Diagrammatic representation of the biomedical search engine

The main contributions of this paper are as follows:

  1. A deep learning based vector-space model is proposed to improve query similarity matching and thereby enhance the performance of search engines, especially for mismatched queries.

  2. A softmax function is defined by converting the vector-space model into a classification problem.

  3. The deep learning model is trained to implement a search engine for biomedical images.

  4. Extensive experiments reveal that the proposed model outperforms the competitive models in terms of various performance metrics.

The remainder of this paper is organized as follows. The literature review is discussed in Section 2. The proposed model is mathematically defined in Section 3. Experimental results and discussions are presented in Section 4. Concluding remarks are presented in Section 5.

2 Literature review

Pinho et al. designed a novel biomedical search engine that uses an extensible model for biomedical images combined with an open-source picture archiving and communication model with profile-based capabilities [22]. Long designed a novel search engine model for supplemental health applications, together with a supplemental federated search engine; performance was evaluated on the federated search engine along with website usability testing results [19]. Faroo reviewed many biomedical search engines and found that the development of biomedical search engines is still an open area of research [6].

Hochberg et al. designed a biomedical search engine to diagnose diabetes. Different models, such as decision tree, logistic regression, linear regression, and random forest, were used to diagnose diabetic patients [12]. Ye et al. utilized COVID-19-related query logs to develop search engines. Such logs are significant for learning about the epidemic’s influence on users’ search behavior and for improving search engines to tackle comparable pandemic outbreaks in the future [32]. Young et al. implemented a search engine to diagnose HIV-infected patients. A negative binomial approach was designed to estimate HIV-infected patients by considering a subgroup of predictor keywords identified by lasso regression. The Google search data was integrated with existing HIV reports [33].

Fagroud et al. designed a novel Internet of Things (IoT) search engine. With the advancement of IoT networks and the growth in the number of IoT resources, searching IoT data, learning about IoT, and recognizing and listing the associated resources have become a necessity, which has been made possible by various kinds of IoT search engines [5]. Kopanos et al. implemented VarSome, a human genomic variant search engine [17]. Doulani et al. discussed the Scopus database and the Google Scholar search engine. A statistical population of 118 researchers who were active in social-scientific networks from 29 governmental universities was utilized. The t-test and the Pearson correlation coefficient were used for the search engine analysis [4].

Recently, many researchers have designed machine learning models for various kinds of classification applications [9, 13]. However, the majority of the existing models suffer from the over-fitting issue [15, 16, 29]. Therefore, in this paper, a novel fusion model using a DCNN and the vector-space model is proposed to achieve better results.

3 Proposed model

In this section, the vector-space model is discussed first. Thereafter, the deep convolutional neural network (DCNN) is presented. Finally, the DCNN based vector-space model is discussed.

3.1 Vector-space model

This paper focuses on a vector-space based query similarity matching approach for improving the performance of search engines, especially for mismatched biomedical image queries.

The vector-space model defines the search space and the queries as groups of vector indexes. Weights define the significance of biomedical image features in the query Q and the image space D [18] as:

$$ Q= (N_{Q1},N_{Q2},\ldots,N_{Qr}) $$
(1)
$$ D= (N_{a1},N_{a2},\ldots,N_{al}) $$
(2)

To define the weights of the biomedical image search space vector, lqaDq [26] is utilized. In lqaDq [26], weights are computed using two factors: lqak, i.e., the frequency of word k in Da, and Dqk, i.e., the occurrence of k in the collected search space. Dqk requires weight scaling.

W denotes the total size of the search space over which Dqk is counted. The inverse biomedical image search space frequency (aDqk) of k is defined as:

$$ aDq_{k}=\log\frac{W}{Dq_{k}} $$
(3)

This scaling raises aDq for words that occur rarely in the search space. However, it assigns a low aDq to frequent words [25]. A composite weight is defined by integrating lqak and aDqk. Therefore, in lqaDq weighting, the weight of word k in Da is represented as:

$$ M_{ak}= lq_{ak} \times aDq_{k} = lq_{ak} \times \log\frac{W}{Dq_{k}} $$
(4)
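The weighting in (3) and (4) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each item of the search space has already been reduced to a list of discrete words (how image features are turned into words is not stated here), that lqak is a raw count, and that the logarithm is natural. The function name `lqadq_weights` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def lqadq_weights(corpus):
    """Return one {word: M_ak} dict per item, following Eqs. (3) and (4)."""
    total = len(corpus)  # W: size of the search space
    # Dq_k: number of items in which word k occurs at least once
    occurrence = Counter(word for item in corpus for word in set(item))
    weights = []
    for item in corpus:
        frequency = Counter(item)  # lq_ak: raw count of word k in item a (assumption)
        weights.append({
            word: count * math.log(total / occurrence[word])  # natural log (assumption)
            for word, count in frequency.items()
        })
    return weights

# Hypothetical toy search space of three items described by words.
corpus = [
    ["lung", "nodule", "lung"],
    ["lung", "fracture"],
    ["brain", "tumor"],
]
weights = lqadq_weights(corpus)
```

In this toy corpus, "lung" appears in two of the three items and so receives a smaller aDq than "nodule", which appears in only one; a word present in every item would receive a weight of zero.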

The lqaDq weighting assigns significantly larger weights to words having higher frequency [23, 27]. Using lqaDq, the vector-space model computes the cosine similarity (cos 𝜃) between the biomedical image search space vector and the query vector [23, 27]. cos 𝜃 relates the vectors of Da and Q by dividing the dot product of the two vectors by the product of their Euclidean norms. cos 𝜃 can be evaluated as:

$$ \cos \theta = \frac{\overrightarrow{D_{a}} \cdot \overrightarrow{Q}}{|\overrightarrow{D_{a}}|\,|\overrightarrow{Q}|} $$
(5)

In (5), the dot product \(\overrightarrow {D_{a}} \cdot \overrightarrow {Q}\) can be evaluated as \({\sum }_{k=1}^{U} M_{Q,k}\times M_{ak}\). Here, \(M_{Q,k}\) is the weight of word k in the query Q, and U denotes the number of words. \(|\overrightarrow {D_{a}}|\,|\overrightarrow {Q}|\) is the product of the Euclidean norms and can be evaluated as \(\sqrt{{\sum }_{k=1}^{U} M^{2}_{Q,k}}\,\sqrt{{\sum }_{k=1}^{U} M^{2}_{ak}}\). Replacing this denominator with the square root of the number of terms in Da defines the similarity between Da and Q as:

$$ sim (D_{a},Q)= \frac{\sum\limits_{k=1}^{U} M_{Q,k} \times M_{ak}}{\sqrt{\#\ \text{of terms in}\ D_{a}}} $$
(6)
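As a rough illustration of (5) and (6), the following Python sketch scores a query against one item using sparse weight dictionaries. The function names are hypothetical, and two choices are assumptions rather than statements from the paper: the number of terms in Da is taken as the number of distinct weighted words of the item, and an all-zero vector is treated as matching nothing.

```python
import math

def dot(query, item):
    """Sum of M_Qk * M_ak over the words present in both sparse vectors."""
    return sum(weight * item[word] for word, weight in query.items() if word in item)

def cosine_similarity(query, item):
    """Eq. (5): dot product divided by the product of the Euclidean norms."""
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in item.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot(query, item) / (norm_q * norm_d)

def sim(query, item):
    """Eq. (6): dot product divided by sqrt(number of terms in D_a)."""
    if not item:
        return 0.0
    # assumption: "number of terms" = number of distinct weighted words
    return dot(query, item) / math.sqrt(len(item))

# Hypothetical weights for a query and one item of the search space.
query = {"lung": 1.0, "nodule": 1.0}
item = {"lung": 0.8, "nodule": 1.1, "fracture": 0.3}
scores = (cosine_similarity(query, item), sim(query, item))
```

Unlike (5), the score in (6) is not bounded by 1, so it is only meaningful for ranking items against the same query.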

Equation (6) captures the impact of normalization in a search engine. However, the vector-space model does not consider the relational details among the keywords, and the biomedical image search space is not evaluated.

3.2 Deep learning model

In this section, the deep convolutional neural network (DCNN) based vector-space model is defined. Our goal is to predict the combination of Q and D that provides more accurate results. The DCNN uses various convolution filters to extract local features (see Fig. 2).

Fig. 2 Deep learning based biomedical image search engine

Consider a single channel, which can be defined as:

$$ C = \left[ {{c_{1}},{c_{2}},{c_{3}},...,{c_{n}}} \right]. $$
(7)

where \( C \in {\mathbb {R}^{n \times k}} \), n is the size of the input biomedical image, and k is the dimension of every input element. In the convolution process, a filter \( {\textbf {m}} \in \mathbb {R}^{lk} \) is applied to l successive input elements to extract a potential feature as:

$$ {x_{i}} = f\left( {{\textbf{m}} \cdot {{\textbf{c}}_{i:i + l - 1}} + b} \right), $$
(8)

Here, \( {\textbf {c}}_{i:i + l - 1} \) is the concatenation of \( {\textbf {c}}_{i},\ldots ,{\textbf {c}}_{i + l - 1} \), \( b \in \mathbb {R} \) is a bias, and f is a non-linear activation function such as ReLU. The filter m then slides over the windows \( \left \{ {{{\textbf {c}}_{1:l}},{{\textbf {c}}_{2:l + 1}},...,{{\textbf {c}}_{n - l + 1:n}}} \right \} \), and the following feature map is obtained:

$$ {\textbf{x}} = \left[ {{x_{1}},{x_{2}},...,{x_{n - l + 1}}} \right]. $$
(9)

Thereafter, max-pooling is applied to x to obtain the maximum value \( \hat x = {\max \limits } \{ {\textbf {x}}\} \), which is the final feature extracted by m. In this way, the dominant feature of every filter is obtained. The CNN computes various feature sets by using numerous filters of different sizes. The pooled features of all filters are collected into a vector as:

$$ {\textbf{r}} = \left[ {{\hat x_{1}},{\hat x_{2}},...,{\hat x_{s}}} \right] $$
(10)

Here, s denotes the number of filters. A softmax (\( s_{f} \)) layer is then used to compute the estimated probability distribution as:

$$ y = s_{f} \left( {W \cdot {\textbf{r}} + b} \right). $$
(11)
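The forward pass described by (8) to (11) can be sketched in plain Python as follows. This is an illustrative toy, not the trained DCNN: the filter values, the layer weights, and all function names are hypothetical, ReLU is used for f, and every filter width is assumed to be no larger than n.

```python
import math

def relu(value):
    return max(0.0, value)

def feature_map(channel, kernel, bias, width):
    """Eqs. (8)-(9): slide one filter over windows of `width` successive rows."""
    features = []
    for i in range(len(channel) - width + 1):
        # c_{i:i+l-1}: concatenate `width` rows into one flat vector of length l*k
        window = [value for row in channel[i:i + width] for value in row]
        features.append(relu(sum(m * c for m, c in zip(kernel, window)) + bias))
    return features

def pooled_vector(channel, filters):
    """Eq. (10): max-pool each feature map, keeping one value per filter."""
    return [max(feature_map(channel, kernel, bias, width))
            for kernel, bias, width in filters]

def softmax(scores):
    shift = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def predict(pooled, weight_rows, biases):
    """Eq. (11): y = softmax(W . r + b), one weight row and bias per label."""
    scores = [sum(w * r for w, r in zip(row, pooled)) + b
              for row, b in zip(weight_rows, biases)]
    return softmax(scores)

# Hypothetical input with n = 3 elements of dimension k = 2.
channel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Two hypothetical filters given as (kernel, bias, width l).
filters = [([1.0, 0.0, 0.0, 1.0], 0.0, 2), ([-1.0, -1.0], 0.0, 1)]
pooled = pooled_vector(channel, filters)
probabilities = predict(pooled, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Because max-pooling keeps a single value per filter, the length of r depends only on the number of filters s and not on the input size n.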

Consider a training sample \( ({{\textbf{x}}^{i}},{y^{i}}) \) in which \( {y^{i}} \in \left \{ {1,2, {\cdots } ,c} \right \} \) is the matched image query label of \( {\textbf{x}}^{i} \) for the search engine, and the probability estimated by the DCNN is \( \tilde {y_{j}^{i}} \in [0,1] \) for every label \( j \in \left \{ {1,2, \cdots ,c} \right \} \). The estimated error can be computed as:

$$ L({{\textbf{x}}^{i}},{y^{i}}) = -\sum\limits_{j = 1}^{c} {if\{ {y^{i}} = j\} } \log (\tilde {y_{j}^{i}}). $$
(12)

where c is the number of labels. \( if\{ \cdot \} \) is an indicator function, i.e., \( if\{ {y^{i}} = j\} = 1 \) if \( {y^{i}} = j \) and \( if\{ {y^{i}} = j\} = 0 \) otherwise. Stochastic gradient descent with the Adam optimizer is employed to update the DCNN parameters.
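The loss in (12) for a single sample can be written as a short Python sketch. The function name is hypothetical, labels run from 1 to c as in the text, and the small clipping constant that guards against log(0) is an assumption that is not part of (12).

```python
import math

def cross_entropy(probabilities, label):
    """Eq. (12) for one sample: -sum_j if{y = j} * log(y~_j)."""
    loss = 0.0
    for j, probability in enumerate(probabilities, start=1):
        if label == j:  # the indicator is 1 only for the matched label
            # clipping at 1e-12 avoids log(0); this guard is an assumption
            loss -= math.log(max(probability, 1e-12))
    return loss

# Hypothetical softmax output for c = 3 labels; the matched label is 2.
loss = cross_entropy([0.25, 0.5, 0.25], label=2)
```

Only the probability assigned to the matched label contributes, so the loss is zero when that probability is 1 and grows as it approaches 0.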

4 Performance analysis

The proposed model is applied to the benchmark search engine dataset. The proposed technique is compared with state-of-the-art models, namely decision tree, logistic regression, support vector machine, artificial neural network, random forest, naive Bayes, k-nearest neighbour (k-NN), AdaBoost, SVM-random forest, CNN, and gradient boosting. The experiments are performed in MATLAB 2019a on a Core i7 3.80 GHz processor with 32 GB RAM and a 15 MB cache.

Figure 3 shows the training, validation, and testing analysis of the proposed model. It is found that the proposed model converges very quickly during the training process. At the 262nd epoch, the proposed model achieves its best training and validation results. Thus, the proposed model obtains significantly lower binary cross-entropy values, i.e., loss, during the model building process.

Fig. 3 Binary cross-entropy based loss analysis of the proposed model

To evaluate the performance of the proposed model, the median and the degree of uncertainty (i.e., median ± IQR × 1.5) are computed by repeating the experiments 50 times. We have used 65% of the dataset for training, 15% for validation, and 20% for testing. The training fraction is set to 65% because the obtained dataset is small. Other experiments were also conducted by changing the fraction of training data, but the best performance was found when the training fraction was 65%.
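The evaluation protocol above can be sketched in Python as follows. This is an assumption-laden illustration, not the MATLAB code used in the experiments: the function names are hypothetical, the split is a simple seeded shuffle, and the "inclusive" quartile method is chosen arbitrarily because the text does not name one.

```python
import random
import statistics

def split_indices(count, seed=0):
    """Shuffle sample indices and cut them into 65% / 15% / 20% parts."""
    indices = list(range(count))
    random.Random(seed).shuffle(indices)
    train_end = count * 65 // 100
    valid_end = count * 80 // 100
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]

def median_and_spread(scores):
    """Return (median, 1.5 * IQR) over the scores of repeated runs."""
    # the quartile method is an assumption; the text does not specify one
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return statistics.median(scores), 1.5 * (q3 - q1)

train, valid, test = split_indices(100)
# Hypothetical scores; in the paper the experiment is repeated 50 times.
center, spread = median_and_spread([1, 2, 3, 4, 5])
```

Reporting the median with an IQR-based spread, rather than the mean with a standard deviation, makes the summary less sensitive to occasional outlier runs.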

To draw comparisons among the proposed and the existing models, confusion matrix-based measures are used. These measures are accuracy, specificity, sensitivity, area under curve (AUC) and f-measure.
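For reference, the confusion-matrix measures listed above can be computed as in the following Python sketch. It assumes a binary setting with counts of true positives, false positives, true negatives, and false negatives; the function name and the example counts are hypothetical. AUC is omitted because it requires ranked prediction scores rather than the four counts alone.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and f-measure from binary counts."""
    def ratio(numerator, denominator):
        # assumption: an undefined ratio (zero denominator) is reported as 0.0
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": ratio(2 * precision * sensitivity, precision + sensitivity),
    }

# Hypothetical counts, not results from the paper.
metrics = confusion_metrics(tp=40, fp=5, tn=45, fn=10)
```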

Tables 1 and 2 present the training and testing analysis of the proposed model on the biomedical search engine dataset. Confusion matrix based metrics, namely accuracy, sensitivity, specificity, f-measure, and AUC, are used to compute the effectiveness of the proposed model over the existing models. From these tables, it is found that the proposed automated model provides significantly better results than the existing models. Because the proposed model achieves significantly better sensitivity and specificity values, it offers a fast and efficient search engine similarity algorithm.

Table 1 Training analysis
Table 2 Testing analysis

Table 3 compares the proposed model with the existing web image search engines. It is found that the proposed model outperforms the competitive web image search engines.

Table 3 Performance analysis of the web image search engines

5 Conclusion

From the extensive review, it has been found that the vector-space model does not consider the relational details among the biomedical contents and the image search space. Therefore, a fused DCNN and vector-space based biomedical image query similarity matching approach was proposed to improve the performance of biomedical search engines. The DCNN model was defined by converting the vector-space model into a classification problem. Finally, the biomedical image search engine was trained. Extensive experiments were conducted using the proposed and the competitive models for search engines. The proposed model has shown significant improvement over the existing biomedical search engines.