1 Introduction

Image captioning is an active research area because of its many applications, such as supporting blind people, image indexing and other NLP applications [35]. Accessing the web through image captions is a major part of the everyday life of blind people. At the same time, identifying images on the web is a very challenging task for them [4]. Retrieving web data and accomplishing everyday jobs such as banking and grocery shopping is difficult for a blind person [15]. The web is an important resource for blind people and gives them considerable autonomy [7]. Hence, web accessibility practice describes images with alternative text (alt text). The alt text provides a short caption that substitutes for the image and expresses its general meaning [31].

Image captions allow blind people to take part in social activities, obtain more information online and purchase products. Automatically generated captions enable blind people to get more information about images [18]. Image captioning is the process of automatically creating a caption for an image. In artificial intelligence, caption generation for images is receiving growing attention and is becoming more significant [3]. Recently, researchers have focused on improving web accessibility with different techniques. These methods fall into three classes: crowdsourcing, machine-based and hybrid technologies.

In crowdsourcing, the captions are produced for the images by human annotators [1, 20]. In machine-based techniques, the image is identified with learning methodologies that provide the corresponding captions [9, 23]. Hybrid techniques provide the image captions automatically, which lessens the time consumption as well as the expense of the process [8]. Numerous methodologies have been presented in previous image captioning works for blind people, including ZFNet, VGGNet, GoogLeNet, AlexNet, Nearest Neighbor (NN) [19] and LSTM [23]. Deep learning based techniques are broadly utilized in image captioning and have attained better results [2, 13, 22]. Deep learning (DL) is an advanced subtype of machine learning (ML) that is very useful for interpreting huge amounts of data and makes the process easier and faster. Recent studies have shown that identifying diseases using deep learning architectures can reduce the workload of radiologists compared to other diagnostic methods. In medical processing, deep learning can provide automatic screening and detection of COVID-19 [24] and is convenient and fast compared to ML techniques. Deep learning is a popular and convenient method for fast detection in various applications such as image classification, object detection, entertainment, healthcare [21, 25, 26, 27], fake news detection, crop yield prediction and robotics.

A number of researchers have been working hard to enhance the accessibility of online images through a wide range of techniques. Deep learning based captioning models produce machine-generated captions by using trained computer vision (CV) models that identify the objects and relationships in an image and produce suitable captions. In this work, a new DL framework named ECANN is presented that generates multiple image captions and uses a reverse search strategy to select the most appropriate caption for the input image. The proposed ECANN model improves the accessibility of image captions through a fully automated principle and explores its feasibility for images that are shared online.

1.1 Motivation and problem statement

Recently, computer vision-based assistive technologies have been designed to assist blind people. In particular, accessing the web through image captions is a major part of their everyday life. Image captioning techniques support them in easily identifying images through captions in applications such as grocery shopping and education. The topic has therefore gained significant attention and is regarded as one of the most prominent in computer vision. However, accurate generation of image captions remains difficult. Caption generation faces two major problems: creating complete natural language (NL) sentences similar to those written by humans, and making the generated caption and its semantics accurate, correct and consistent with the input image. With advances in image caption generation systems, blind people can perceive the visual world more as sighted people do. The problems arise from naturalness and compositionality. Conventional image captioning systems suffer from a lack of naturalness and of the compositional nature of NL, because the captions are generated sequentially, which makes the language structure semantically inappropriate. Another significant challenge is dataset bias, which makes trained models overfit on common or similar objects and struggle to generalize to suitable captions. To overcome these challenges, a new DL-based automatic caption generation strategy is introduced in this work, which bridges the semantic gap between vision and language to meet the need for accurate scene understanding. The presented approach automatically generates alternative captions for images on websites that are not captioned and makes web access easier for blind people.

1.2 Contributions

The main contributions of the ECANN based image caption generation are as follows:

  • A deep learning based neural network model named ECANN is developed to generate accurate image captions. The proposed ECANN model is designed around the difficulties faced by blind people in online shopping, and the error in image caption generation is minimized using the adaptive atom search (AAS) algorithm.

  • The proposed ECANN model works on the pre-defined captions generated and chooses the most appropriate caption with a reverse search, enabling quick browsing.

  • The ECANN framework satisfies the needs of blind people in alternative image caption generation with higher accuracy than other baseline classifiers.

  • The selection of non-captioned images from the dataset is performed using adaptive rain optimization (ARO), which is FCM (Fuzzy C Means) clustering optimized by the Rain Optimization (RO) algorithm. Next, the appearance and texture features are extracted using SDM and WPLBP.

  • Finally, ECANN allows the automatic generation of image captions and effectively helps visually impaired people in online grocery shopping.

The remainder of this paper is structured as follows: Section 1 presents the introduction and highlights the motivation and contributions. Section 2 discusses recent related works on image caption generation. Section 3 describes the proposed methodology. Section 4 provides a detailed description of the implementation and the results obtained. Section 5 concludes the presented work, followed by the references.

2 Related work

Recent works related to image caption generation are discussed as follows:

Heng Song et al. [29] presented a visual text merging (VTM) framework for image captioning. Initially, an attention model was developed for the visual data to obtain accurate image captions. In avtmNet (adaptive VTM network), the visual and text data were merged, and the image captions were produced from the merged text and visual data. The datasets used were COCO2014 and Flickr30K. Performance was evaluated on both datasets using the BLEU, CIDEr, ROUGE-L, SPICE and METEOR scores. The results for Flickr30K were BLEU (0.248), ROUGE-L (0.494), METEOR (0.208), CIDEr (0.598) and SPICE (0.157), while for COCO2014 they were BLEU (0.3317), ROUGE-L (0.567), METEOR (0.273), CIDEr (1.126) and SPICE (0.201), which illustrates the efficacy of the merging network. Deng et al. [5] developed an image caption framework called DenseNet+LSTM. In the encoding step, DenseNet extracts the features from the image. In parallel, a sentinel gate in the framework selects the feature data for caption creation. In the decoding step, LSTM was utilized to improve the quality of the text generated for the images. The framework was useful for providing better guidance to blind people. The datasets used were Flickr30K and COCO2014. The scores obtained on Flickr30K were BLEU (0.667) and METEOR (0.214), and on COCO2014 BLEU (0.739) and METEOR (0.270).

Yuchen Wei et al. [30] reviewed various deep learning procedures and their challenges in identifying retail products for blind people. Visually impaired (VI) people deal with a number of challenges in everyday life, such as shopping. The datasets used were GroZi-120, RPC, D2S and GroZi-3.2K. The DL framework was utilized to find the correct products through captions, which must be very accurate to help VI people; a lack of accuracy in captions creates difficulty in retail store shopping. The different DL frameworks provide image captions and recognize the products effectively. Min Yang et al. [34] developed a retrieval-dependent caption creation framework called Ensemble caption (EnsCaption), which combines caption creation and caption retrieval. The model fuses the personalised texts for the query image. The re-ranking procedure recovers the accurate caption for the image from the set of created captions. The adversarial network finds the variations between the created and the retrieved captions for the image. The technique provides more accuracy in caption generation. The datasets used were Flickr30K and MSCOCO. The results on Flickr30K were BLEU (76.2) and CIDEr (69.3), and on MSCOCO BLEU (81.7) and CIDEr (125.5).

Niange Yu et al. [36] introduced a CNN-based multi-label classifier for image captioning. Here, captions were provided for the images based on their topics by utilizing the presented classifier. In the developed framework, the input was the image and its topic, and the output was the image caption. The hierarchical structure was preserved with an embedding technique. Image captions were retrieved using a bi-directional caption-image retrieval procedure. The developed technique gives better quality in the generation of image captions. Fen Xiao et al. [32] developed an image captioning framework with dual LSTM for enhancing accessibility for blind people. In it, two separate LSTM networks were integrated with an adaptive semantic attention framework. The first LSTM was placed after the attention layer and was utilized to obtain the visual sentinel information. The second LSTM was utilized to create the text for the images as captions. Finally, the framework attains an accurate text sequence for the given input images. The datasets used were MSCOCO and Flickr30K. The scores obtained on Flickr30K were BLEU (68.6) and METEOR (21.5), whereas on MSCOCO they were BLEU (75.8) and METEOR (27.1).

Loganathan et al. [17] presented automatic caption generation for images with combined CNN and LSTM frameworks. The framework provides accurate captions for the images through the learning process, and the effective learning procedure results in better image captions. The combined procedures give accurate image captioning with reduced complexity. Automatic caption generation gives blind people easy access for identifying images without difficulty. The dataset used was Flickr8K. Singh et al. [28] introduced an encoder-decoder based framework for image captioning. Here, a VGG19-based CNN model was used as the encoder for the visual features of the image. The captions were then generated by a stacked LSTM, which integrated a bi-directional LSTM and a unidirectional LSTM. The Hindi Genome dataset was used for the validation of this framework. The evaluation metrics were RIBES (0.17) and BLEU (3.28). The framework did not capture the alphanumeric content of the images.

Iwamura et al. [10] presented a trainable end-to-end approach for generating image captions with three datasets, namely several copyright-free images, MSCOCO and MSR-VTT2016-Image. In this framework, four phases were performed: feature extraction, motion estimation, object detection and caption generation. A motion-CNN model was developed for automatic motion feature extraction. The CNN extracted the features from the input images and passed them on to the object detection phase. Then, an attention model was used in the caption generation component to compute attention features, and the LSTM received the correlated attention features for caption generation. The scores obtained on MSR-VTT2016-Image were BLEU (49.9) and METEOR (16.1), whereas on MSCOCO they were BLEU (75.9) and METEOR (26.7). Khurram et al. [12] developed the Dense-CaptionNet deep learning model for image captioning. This model is a region-based architecture for describing the image semantics and comprises three modules. The datasets used were MSCOCO, Visual Genome and IAPR TC-12. In the first module, the object relationships and the region descriptions were generated. The object attributes available in the scene were generated in the second module. The textual descriptions obtained from the first two modules were provided as input to the third module for caption generation. This approach provided a detailed description of the images. Table 1 summarizes the review of different existing models.

Table 1 Review on various existing methods

2.1 Research questions (RQs)

This section describes the set of research questions that pinpoint the major objectives of the proposed framework. The proposed RQs are stated as follows:

  • RQ 1: How does the proposed deep learning based image caption generation model create accurate image captions for VI people?

  • RQ 2: How is metaheuristic optimization used with the neural network for hyperparameter optimization?

  • RQ 3: What performance metrics are used for evaluating the performance?

  • RQ 4: What grocery datasets are available for online shopping, and what are the major future directions?

3 Proposed methodology

Recently, caption generation (CG) in computer vision has received much attention due to its extensive applications such as virtual assistants, image understanding, image retrieval or indexing, and helping blind people. Automatic caption generation (ACG) is a challenging task that helps blind people by providing a better understanding of what is happening around them. Humans naturally have the ability to recognize images at a quick glance. The lifestyle of blind people differs from that of sighted people because they rely on other senses (touch, hearing) to identify nearby objects. This work aims to design a new image CG approach with a deep learning model named ECANN (Extended Convolutional Atom Neural Network), which uses computers to imitate the human ability to understand the visual world. Figure 1 shows the schematic model of the proposed method.

Fig. 1 Schematic model of proposed method

Automatic image captioning (AIC) offers simple captions for images that are non-captioned. The proposed AIC with the ECANN model consists of the following stages: data collection, non-captioned image selection, extraction of appearance and texture features, and generation of automatic image captions. After data collection, the non-captioned images are selected from the database using the ARO algorithm. Appearance and texture features are then extracted using the spatial derivative and multi-scale (SDM) feature and the weighted patch based local binary pattern (WPLBP). The extracted features are utilized for accurate differentiation of the images. Finally, the caption is produced for the corresponding images with the help of the ECANN architecture. The ECANN based alternate image captioning process introduces an AI (Artificial Intelligence) based caption reuse system with a reverse image search to reuse pre-existing captions for the target image. The proposed ECANN model is used to generate alternate captions that should be semantically faithful to the original image. In this framework, error in the image captioning is reduced using the adaptive atom search (AAS) algorithm.

3.1 Data collection

Data collection is the process of gathering information from online public sources that enables us to evaluate the results. It can be viewed as a systematic framework for collecting data from different sources to acquire an accurate and complete outcome on the particular area of interest. Precise data collection is essential to maintain the integrity of the research and to guarantee quality assurance. In this work, two datasets, namely the Freiburg Groceries Dataset and the Grocery Store Dataset, are collected from two online sources and used for analysis and implementation. These online grocery datasets include multiple classes comprising different types of objects and products.

3.2 Non-captioned image selection

The raw image input from the dataset consists of both captioned and non-captioned images. The main objective of this work is to generate automatic captions for the non-captioned images. Here, the adaptive rain optimization (ARO) algorithm is first used to select only the non-captioned images. The ARO algorithm is the combination of the Fuzzy C Means (FCM) and Rain Optimization (RO) algorithms. The selection of non-captioned images using ARO addresses the problem of imbalanced dataset learning and can be applied in various research fields such as engineering and image processing. The basic idea behind ARO is clustering, which groups the images into two groups: captioned and non-captioned images. The ARO algorithm follows an unsupervised clustering strategy and allows each data point to be associated with more than one cluster.

The FCM algorithm [33] optimized with the RO algorithm is named ARO, which assigns lesser weights to the samples that are equivalent across the C clusters. Let the input data points be Z = {z1, z2, …, zn}, divided into clusters {C1, C2, …, Ck}, where n indicates the number of elements and k signifies the number of pre-defined clusters (written as C in the sums below). The objective function to be minimized can be expressed as,

$$ {J}_{\mathrm{min}}=\sum \limits_{i=1}^C\sum \limits_{j=1}^n{U}_{ij}^g{D}_{ij}^2 $$

where the fuzzification element is g, the membership degree is Uij, and Dij is the Euclidean distance of the jth instance to the ith cluster centroid. Equation (1) is minimized subject to the following conditions:

$$ 1\le j\le n:\sum \limits_{i=1}^C{U}_{ij}=1 $$
$$ 1\le i\le C:\sum \limits_{j=1}^n{U}_{ij}>0 $$

Here, the Lagrange function is used to minimize Jmin by setting its derivatives with respect to Uij and Ci to zero, given as,

$$ {J}_{\mathrm{min}}=\sum \limits_{i=1}^C\sum \limits_{j=1}^n{U}_{ij}^g{D}_{ij}^2+\sum \limits_{j=1}^n{\lambda}_j\left(\sum \limits_{i=1}^C{U}_{ij}-1\right) $$

where the Lagrange multipliers are represented as λ = [λ1, λ2, …λn]T and the cluster centroid is represented as,

$$ {C}_i=\frac{\sum_{j=1}^n{U}_{ij}^g{z}_j}{\sum_{j=1}^n{U}_{ij}^g},i\in \left[1,C\right] $$

The degree of membership is specified as,

$$ {U}_{ij}=\frac{D_{ij}^{\frac{2}{1-g}}}{\sum_{k=1}^C{D}_{kj}^{\frac{2}{1-g}}},i\in \left[1,C\right],j\in \left[1,n\right] $$

The fuzzification element value is g = 2. The FCM model is generalized using fuzzy sets, and the initial outcomes are produced with proper competency and simplicity in fuzzy clustering. In the FCM model, the Lagrange multipliers λ are optimized with the RO algorithm, which overcomes the issues of high computation time and sensitivity to noisy data and initialization. The RO algorithm simulates the behaviour of raindrops. Each solution is considered a raindrop, and the initial population is generated. The radius is the main quantity in RO. Two rain droplets with radii R1 and R2 that connect form a droplet of radius R, described as,

$$ R={\left({R}_1^m+{R}_2^m\right)}^{1/m} $$

where m indicates the number of variables of each droplet. Thus, by increasing the total number of iterations, the cluster groups (captioned and non-captioned images) are formed in minimal time with noise elimination.
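As an illustration of the clustering step, the following is a minimal sketch of the plain FCM iteration only, assuming g = 2, two clusters and Euclidean distances. It alternates the centroid and membership updates given above; the RO-based tuning that turns FCM into ARO is not reproduced here. The function name `fcm`, its parameters and the random initialization are hypothetical choices made for this example.

```python
import numpy as np

def fcm(Z, n_clusters=2, g=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy C-means on data Z of shape (n, d); illustrative only."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    # Random initial memberships U of shape (C, n); each column sums to 1.
    U = rng.random((n_clusters, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        Ug = U ** g
        # Centroid update: membership-weighted mean of the data points.
        centroids = (Ug @ Z) / Ug.sum(axis=1, keepdims=True)
        # Euclidean distance D_ij from every centroid i to every point j.
        D = np.linalg.norm(Z[None, :, :] - centroids[:, None, :], axis=2)
        D = np.fmax(D, 1e-12)  # guard against division by zero
        # Membership update: U_ij proportional to D_ij^(2 / (1 - g)).
        W = D ** (2.0 / (1.0 - g))
        U_new = W / W.sum(axis=0, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    # Objective J evaluated with the last memberships and distances.
    J = float(((U ** g) * D ** 2).sum())
    return U, centroids, J
```

A hard assignment into the two groups can then be read off with `U.argmax(axis=0)`; how image features map to "captioned" versus "non-captioned" is specific to the proposed method and is not modelled in this sketch.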

3.3 Extraction of appearance and texture features

Feature extraction is the process of capturing the visual content of an image for indexing and retrieval. Texture and appearance features play a vital role in feature extraction, which extracts relevant image information. Feature extraction begins with an initial set of data and constructs features that are non-redundant and informative and that facilitate subsequent learning, leading to better human understanding.

3.3.1 Appearance features

The appearance of an image is extracted using the SDM feature model. A differential feature is a vector that is linked with a point in an image and can be utilized to extract the appearance based features of the image. The differential feature is evaluated from the spatial derivatives of the image. For a given image I and a point w, the lower order derivatives can be utilized as a feature, which can be expressed as,

$$ H=\left\langle \frac{\partial I}{\partial x}(w),\frac{\partial I}{\partial y}(w),\frac{\partial^2I}{\partial {x}^2}(w),\frac{\partial^2I}{\partial x\partial y}(w),\frac{\partial^2I}{\partial {y}^2}(w)\right\rangle $$

Valuable statistical information can be captured from the image using the derivatives. The first-order derivatives indicate the intensity gradient (edgeness), whereas the second-order derivatives are used to describe bars. The derivative features are known as appearance features and are denoted by the vector H. These features approximate the local shape of the intensity surface through the Taylor Series (TS) expansion. The TS specifies that the derivatives at a pixel are required to estimate the intensity values in a neighbourhood around it.
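The differential feature vector H can be approximated numerically with finite differences. The sketch below assumes a greyscale image stored as a 2D array with rows as y and columns as x, and uses `numpy.gradient` (central differences in the interior); the function name `differential_feature` and the choice of discretization are assumptions for illustration, not the scheme used in the paper.

```python
import numpy as np

def differential_feature(image, row, col):
    """Return H = (Ix, Iy, Ixx, Ixy, Iyy) at pixel (row, col)."""
    I = np.asarray(image, dtype=float)
    # np.gradient returns derivatives along axis 0 (y) and axis 1 (x).
    Iy, Ix = np.gradient(I)
    # Second-order derivatives are obtained by differentiating again.
    Ixy, Ixx = np.gradient(Ix)   # d(Ix)/dy and d(Ix)/dx
    Iyy, _ = np.gradient(Iy)     # d(Iy)/dy
    return np.array([Ix[row, col], Iy[row, col],
                     Ixx[row, col], Ixy[row, col], Iyy[row, col]])
```

For a quadratic test surface such as I(x, y) = x² + 3xy + 2y², the central differences are exact away from the image border, which makes the function easy to check by hand.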

3.3.2 Texture features

Texture is a feature that can be used to divide images into regions. The texture feature offers information about the spatial arrangement of intensities or colours in an image. Texture defines the surface characteristics in terms of shape, size, arrangement, density and so on. Image textures can be rough or smooth, hard or soft, glossy or matte, and are characterized by the spatial distribution of intensity levels in a neighbourhood. In this work, the texture features are extracted using the weighted patch based local binary pattern (WPLBP).

The LBP approach extracts local information by comparing pixel differences in each small region of the image. The extracted local information is then encoded into a short string of bits, which represents the directional gradient (DG) information by means of bits ‘0’ or ‘1’. The texture feature extraction using LBP at every central pixel zcn depends on the neighbouring pixels zp (p = 0, 1, …P − 1) at radius R and can be expressed as:

$$ LBP=\sum \limits_{p=0}^{P-1}f\left({z}_p-{z}_{cn}\right){2}^p $$
$$ f(z)=\begin{cases}1, & z\ge 0\\ 0, & \text{otherwise}\end{cases} $$

where P signifies the number of neighbouring pixels (P = 8 or 16) and f(z) denotes the threshold function. WPLBP is an extended form of LBP that makes use of a pyramidal structure. WPLBP uses a kernel \( {s}_k^p\in {S}_k \) whose weighted products with the patches \( {z}_k^p\in {Z}_k \) are summed, given as,

$$ WPLBP=\sum \limits_{k=1}^Kf\left(\sum \limits_{p=0}^{P-1}{z}_{k-1}^p{s}_k^p\right){2}^{\left(k-1\right)} $$

where, K signifies the number of levels in the adopted pyramid model. The above Eq. (11) can be expressed in convolution form as,

$$ {Z}_{k+1}={S}_k\ast {Z}_k $$

where ∗ signifies the 2D convolution operation, Sk denotes the weighted kernel matrix and Z0 is the original image. The feature maps obtained from the original image are Z1, Z2, …, Zk, where Z1 and Zk represent the lowest and highest levels of abstraction. In WPLBP, a proper training set is required for learning the weight matrix.

$$ N\text{-}GD={\left(\sum \limits_{t=1}^T{\overline{Z_t}}^{i,j}\right)}^{1/2},\forall i,j\in \left(1,2,\dots, \sqrt{M}\right) $$

The training sets are generated using the N-dimensional GD (Gradient Descriptor). In WPLBP, the gradients are calculated by applying the Sobel operator to a number of patches \( \overline{Z_t},\left(t=1,2,\dots T\right) \) that are sampled randomly from the scaled image. Using the WPLBP approach, the image is resized to gather the patches as a training set, and the texture feature is finally extracted.
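To make the thresholding step concrete, the sketch below computes the basic LBP code for a single pixel with P = 8 neighbours at radius 1. It covers only the plain LBP sum with the threshold function f(z); the pyramidal, kernel-weighted WPLBP extension and its learned weights are not reproduced. The function name `lbp_code` and the neighbour ordering (clockwise from the top-left pixel) are assumptions of this example, since the ordering is not specified in the text.

```python
import numpy as np

def lbp_code(image, row, col):
    """Basic 8-neighbour LBP code of the pixel at (row, col)."""
    I = np.asarray(image, dtype=float)
    center = I[row, col]
    # Neighbour p = 0..7, clockwise from top-left (ordering is an assumption).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for p, (dr, dc) in enumerate(offsets):
        # f(z) = 1 if z >= 0 else 0, applied to z_p - z_cn, weighted by 2^p.
        if I[row + dr, col + dc] - center >= 0:
            code += 2 ** p
    return code
```

With this ordering, a patch whose neighbours are all at least as bright as the centre yields the code 255, and a patch whose neighbours are all darker yields 0.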

3.4 Generation of automatic image captions

Image captioning is the procedure of generating image captions automatically. A single image can hold a large amount of information, and massive image data is generated every day. After the extraction of valuable information from the image, the next step is the automatic generation of image captions. Here, ECANN generates the captions for the non-captioned images, using a reverse image search to reuse pre-existing captions for the target image. In this framework, error in the image captioning is reduced using the AAS algorithm. ECANN is a deep learning model that combines the CNN and LSTM architectures. ECANN can be used to annotate images automatically by reusing the pre-existing captions generated. Thus, the deep learning based caption generation process can significantly minimize human errors and uses AI models to generate the most suitable captions, which helps visually impaired people.

Figure 2 shows the hybrid combination of CNN and LSTM that forms the ECANN architecture. CNN and LSTM are among the most commonly used deep learning architectures. Here, the CNN model is used to generate the captions, and the LSTM model follows the reverse image search strategy to select the most probable caption from the pre-existing captions for the target image. The proposed ECANN model utilizes the capability of all the layers to learn the internal time-series representation of the data and uses the LSTM network, with the training set attributes, to capture short- as well as long-term dependencies. CNN, also called ConvNet, is a form of ANN model used for image analysis. CNN models take an image as input and assign learnable weights and biases to generate pre-trained image captions. The layers in the CNN model are as follows: Input Layer (IL), Convolutional Layer (CL), Pooling Layer (PL), Fully Connected Layer (FCL) and Output Layer (OL). The architecture of CNN is illustrated in Fig. 3.

Fig. 2 Schematic representation of ECANN model

Fig. 3 Architecture of CNN

The first layer of the CNN is the IL, which is responsible for passing the input to the ECANN model. The image input is 3-dimensional, and it can be either black and white or coloured. The IL differs from the other layers in that it does not contain weighted inputs; it holds artificial input neurons that are passive in nature. The next layer, the CL, is the main layer where the maximum number of computations occurs. The main function of the CL is to extract minute details from the image with the use of multiple hidden layers. There is more than one CL, and each includes filters for detecting objects and patterns. The PL is used to divide the features into patches and ignores useless image details. The PLs are placed between the CLs. The two forms of pooling are Max_pooling and Average_pooling. Max_pooling suppresses image noise and reduces the image dimensions. Average_pooling also reduces the dimensions, but its performance is lower than that of Max_pooling. The FCL, also called the feed forward (FF) layer, comes after the pooling layer. The input to the FCL is the output received from the last PL. The FCL is dense: each node is connected to every node in the previous layer. The last layer is the OL or Softmax layer, where the pre-trained captions are generated based on the highest probability produced by the FCL.
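The difference between the two pooling forms can be seen on a small feature map. The sketch below assumes non-overlapping square windows on a single-channel 2D feature map; the function name `pool2d`, the window size of 2 and the cropping of incomplete windows are illustrative assumptions rather than settings reported for ECANN.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping max or average pooling of a 2D feature map."""
    F = np.asarray(feature_map, dtype=float)
    h, w = F.shape[0] // size, F.shape[1] // size
    # Crop to a multiple of `size`, then split into (size x size) blocks.
    blocks = F[:h * size, :w * size].reshape(h, size, w, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))    # keep the strongest response
    if mode == "average":
        return blocks.mean(axis=(1, 3))   # keep the mean response
    raise ValueError("mode must be 'max' or 'average'")
```

Both modes halve each spatial dimension when `size=2`; max pooling keeps only the largest activation in each window, whereas average pooling blends all activations in the window.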

The final output of the CNN is the set of pre-trained captions generated by identifying the objects in the input image. The hybrid ECANN model therefore exploits the advantages of both deep learning techniques and enhances the prediction accuracy. LSTM models are well suited for processing, classifying and making accurate predictions. LSTM is a kind of RNN (Recurrent Neural Network) that can be used to overcome the problem of long-term dependency. The LSTM network is used to execute machine translation tasks and can be used as a language model to create the most appropriate captions based on the given input vector. Figure 4 shows the diagrammatic representation of the LSTM memory block.

Fig. 4 Memory Block of LSTM

LSTM networks are a broadly used type of RNN architecture. The LSTM architecture is better than traditional RNNs at capturing long-term dependencies. The difference between the LSTM model and a regular RNN is that each traditional node in the hidden layer is replaced by a memory cell (MC). The LSTM architecture includes memory cells, gate units and memory blocks. The main component of LSTM is the cell state, which is depicted as a horizontal straight line that runs through the network structure. The gate units in LSTM control the flow of information. The three gate units are the input, output and forget gates. Together, the forget and input gates determine whether information is discarded from or stored in the MC. The multiplicative input gate units are utilized to eliminate the harmful effects of irrelevant inputs. The input gate controls the input flow towards the MC, while the output of the hidden state (hst) is controlled by the output gate. The sigmoid (σ) activation, whose output lies in the range 0 to 1, is used for the gates.

In the LSTM memory block, the forget gate is controlled by a single-layered NN (Neural Network) with sigmoid activation and is expressed as:

$$ f{g}_t=\sigma \left(W{e}_fs{i}_t+W{e}_fh{s}_{t-1}+b{i}_f\right) $$

where fgt denotes the forget gate, We indicates the weight vectors, sit is the input series, hst − 1 is the hidden state of the previous block and bif denotes the bias.

The input gate ranges from 0 to 1, and new memory content is created by a simple NN with a tanh activation function that also takes the previous memory block into account. The expressions for the input gate and the input squashing function are:

$$ i{g}_t=\sigma \left(W{e}_is{i}_t+W{e}_ih{s}_{t-1}+b{i}_i\right) $$
$$ s{i}_t=\tanh \left(W{e}_ss{i}_t+W{e}_sh{s}_{t-1}+b{i}_s\right) $$

The MC activation at time t is given as,

$$ c{e}_t=i{g}_t\otimes s{i}_t+f{g}_t\otimes c{e}_{t-1} $$

The output gate is given in Eq. (18), and Eq. (19) gives the hidden vector:

$$ o{g}_t=\sigma \left(W{e}_os{i}_t+W{e}_oh{s}_{t-1}+b{i}_o\right) $$
$$ h{s}_t=o{g}_t\otimes \tanh \left(c{e}_t\right) $$

Where ⊗ denotes element-wise multiplication. The LSTM thus comprises a number of layers linked to each other as a chain of repeating modules. The generated captions are given as input to the LSTM layers, one caption per layer, and the LSTM model learns from these captions and optimizes itself. Each LSTM layer predicts the most suitable caption for the input image. The last Softmax layer (SL) of the LSTM is iteratively trained to reduce the loss in the network. The loss function in the SL is the cross-entropy (CE) loss, expressed in Eq. (20). This loss is optimized by means of the adaptive atom search (AAS) algorithm.

$$ CE\left(\hat{z}\right)=-\log \left(\hat{z}\right) $$
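As an illustration of the gate equations and the cross-entropy loss above, the following minimal Python sketch performs one memory-block update and evaluates the CE loss on a softmax output. It is not the original implementation: it assumes separate input and recurrent weight matrices per gate (the equations write both as We, whereas standard LSTM formulations use two matrices), it uses plain Python lists, and the names `affine`, `lstm_step` and `params` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(w_in, w_rec, b, x, h):
    """Return w_in @ x + w_rec @ h + b for plain Python lists."""
    return [
        sum(w * xi for w, xi in zip(row_in, x))
        + sum(u * hi for u, hi in zip(row_rec, h))
        + bias
        for row_in, row_rec, bias in zip(w_in, w_rec, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM memory-block update.

    params maps each gate name ("f", "i", "s", "o") to a tuple
    (w_in, w_rec, b); this layout is an assumption of the sketch.
    """
    fg = [sigmoid(v) for v in affine(*params["f"], x, h_prev)]    # forget gate
    ig = [sigmoid(v) for v in affine(*params["i"], x, h_prev)]    # input gate
    sq = [math.tanh(v) for v in affine(*params["s"], x, h_prev)]  # input squashing
    og = [sigmoid(v) for v in affine(*params["o"], x, h_prev)]    # output gate
    # Cell state: ce_t = ig_t * si_t + fg_t * ce_{t-1} (element-wise)
    c = [i * s + f * cp for i, s, f, cp in zip(ig, sq, fg, c_prev)]
    # Hidden state: hs_t = og_t * tanh(ce_t)
    h = [o * math.tanh(cv) for o, cv in zip(og, c)]
    return h, c


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """CE = -log(z_hat), with z_hat the predicted probability of the true class."""
    return -math.log(probs[target_index])
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the squashed input to 0, so the new cell state is simply half of the previous one, which gives a quick sanity check of the update rule.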

The AAS algorithm is a population-based meta-heuristic algorithm inspired by molecular dynamics. It mimics atomic motion governed by interaction and constraint forces, yielding a model that is effective for global optimization problems. The atomic motion is expressed as,

$$ {F}_i+{C}_i=a{c}_i\times m{a}_i $$

Where \( {F}_i \) is the resulting interaction force on atom i, \( {C}_i \) is the constraining force on atom i, \( a{c}_i \) is the acceleration and \( m{a}_i \) is the mass. In the AAS algorithm, each atom represents a feasible solution in the search space, and the quality of a solution is represented by the mass of its atom. All atoms in the structure attract or repel each other depending on their distance, which makes lighter atoms move towards heavier atoms. The mass \( m{a}_i(t) \) of atom i at the t-th iteration is evaluated from the fitness function as,

$$ m{a}_i(t)=\frac{S_i(t)}{\sum_{j=1}^N{S}_j(t)} $$
$$ {S}_i(t)={e}^{-\frac{{F}_i(t)-{F}_b(t)}{{F}_w(t)-{F}_b(t)}} $$

where N signifies the total number of atoms, \( {F}_i(t) \) is the fitness of atom i at the t-th iteration, and \( {F}_b(t) \) and \( {F}_w(t) \) are the fitness of the best and worst atoms, respectively. The standard AS (Atom Search) optimization algorithm initializes the solutions randomly, whereas the AAS procedure uses chaotic-map-based initialization for the evaluation of the fitness value. Chaos is a non-linear phenomenon used in the initialization of SI (Swarm Intelligence) algorithms to keep the algorithm from getting trapped in a local optimum. The circular chaotic map is used to initialize the atomic population. The velocity and position updates at iteration (t + 1) are expressed as:

$$ {V}_i^{di}\left(t+1\right)={\operatorname{rand}}_i^{di}{V}_i^{di}(t)+a{c}_i^{di}(t) $$
$$ {X}_i^{di}\left(t+1\right)={X}_i^{di}(t)+{V}_i^{di}\left(t+1\right) $$

where di represents the dimension, \( {V}_i^{di} \) denotes the velocity and \( {X}_i^{di} \) represents the updated position. The immune detection (ID) operator improves both the accuracy and the convergence of the AAS, as given in Eq. (26),

$$ {X}_i\left(t+1\right)=\begin{cases}{X}_{N_i}(t), & {F}_i(t)>N{F}_i(t)\\ {X}_i(t), & {F}_i(t)\le N{F}_i(t)\end{cases} $$
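The AAS building blocks described above (chaotic initialization, mass computation, velocity and position update, and the immune detection selection) can be sketched in Python as follows. This is a hedged illustration rather than the original code. Several points are assumptions: the circle map uses the commonly quoted parameters a = 0.5 and b = 0.2, fitness is treated as a loss to be minimized (so the best atom has the lowest fitness), \( {X}_{N_i} \) and \( N{F}_i \) are read as a candidate position and its fitness, and the acceleration is taken as a given input because its computation from the interaction and constraint forces is not detailed here. All function names are hypothetical.

```python
import math
import random


def circle_map_init(n_atoms, n_dims, lower, upper, a=0.5, b=0.2, z0=0.7):
    """Initialise atom positions with a circle chaotic map (assumed a, b, z0)."""
    positions, z = [], z0
    for _ in range(n_atoms):
        atom = []
        for _ in range(n_dims):
            # Circle map: z <- (z + b - a/(2*pi) * sin(2*pi*z)) mod 1
            z = (z + b - (a / (2.0 * math.pi)) * math.sin(2.0 * math.pi * z)) % 1.0
            atom.append(lower + z * (upper - lower))  # scale to the search range
        positions.append(atom)
    return positions


def atom_masses(fitness):
    """ma_i = S_i / sum_j S_j with S_i = exp(-(F_i - F_b) / (F_w - F_b))."""
    f_best, f_worst = min(fitness), max(fitness)  # minimisation is assumed
    span = f_worst - f_best
    if span == 0.0:  # all atoms equally fit: fall back to uniform masses
        return [1.0 / len(fitness)] * len(fitness)
    s = [math.exp(-(f - f_best) / span) for f in fitness]
    total = sum(s)
    return [v / total for v in s]


def move_atom(x, v, acc, rng):
    """V(t+1) = rand * V(t) + ac(t), then X(t+1) = X(t) + V(t+1)."""
    v_new = [rng.random() * vi + ai for vi, ai in zip(v, acc)]
    x_new = [xi + vi for xi, vi in zip(x, v_new)]
    return x_new, v_new


def immune_select(x_old, f_old, x_new, f_new):
    """Keep the candidate only when it has strictly better (lower) fitness."""
    return (x_new, f_new) if f_old > f_new else (x_old, f_old)


# Example: fitter (lower-loss) atoms receive larger masses.
masses = atom_masses([1.0, 2.0, 3.0])
population = circle_map_init(n_atoms=5, n_dims=3, lower=-1.0, upper=1.0)
```

In a full optimizer these pieces would be called once per iteration for every atom, with the fitness given by the network loss being minimized.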

In ECANN, AAS minimizes the loss function and improves the accuracy by optimizing and updating the network weights. The ECANN model can thus manage the training process for the automatic generation of image captions. The entire ECANN structure learns to create the most suitable image captions and supports the task of labelling or captioning images automatically. Table 2 lists the hyperparameters of the ECANN model.

Table 2 Hyperparameters of ECANN model

4 Implementation results

This section describes the implementation procedure and the analysis of the evaluation metrics. The proposed ECANN model is implemented in Python. The overall model is evaluated on several metrics to analyze the efficiency of the proposed deep learning based AIC generation technique, which aims to generate image captions precisely and effectively. The results of the proposed ECANN approach are better than those of other similar methods.

4.1 Description about dataset

The performance of the proposed ECANN method is tested on two publicly available datasets, namely the Freiburg Groceries and Grocery Store datasets. Figure 5 shows sample images of the Freiburg Groceries dataset. The Freiburg dataset includes a total of 4947 images in 25 image categories. The images were collected from apartments, offices and stores in Germany with phone cameras. In this work, all 4947 images are used, with 3462 images for training and 1485 images for testing, i.e. a training-to-testing ratio of 70:30.

Fig. 5 Sample images of Freiburg Groceries dataset

Figure 6 shows sample images of the Grocery Store dataset. This second dataset involves natural and iconic images. It is used in the classification of natural images and assists blind people. The natural images were recorded using a phone camera. The dataset contains 5125 images in 80 classes. Here, we consider 22 classes with 862 images. The iconic images are taken from a grocery store website and represent product information such as nutrient values, weight, and country of origin. In this work, the 862 images are divided into 603 training images and 259 testing images, i.e. a training-to-testing ratio of 70:30.
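The reported split sizes for both datasets can be reproduced with a short Python calculation. The sketch assumes that the training share is rounded down to a whole image, which is not stated explicitly but matches the reported counts; the helper name `split_counts` is hypothetical.

```python
def split_counts(n_images, train_parts=7, total_parts=10):
    """Return (train, test) counts for a 70:30 split, rounding the training share down."""
    n_train = n_images * train_parts // total_parts  # integer arithmetic avoids float error
    return n_train, n_images - n_train


print(split_counts(4947))  # Freiburg Groceries: (3462, 1485)
print(split_counts(862))   # Grocery Store subset of 22 classes: (603, 259)
```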

Fig. 6 Sample images of Grocery Store dataset

4.2 Evaluation metrics

The performance of the proposed method is evaluated using accuracy, recall, precision and F-score. The outcome shows that the proposed ECANN method performs better than the other existing methods. The performance metrics are explained as follows:

  • Precision (P): It is defined as the ratio of the number of positive samples correctly classified as positive to the total number of samples classified as positive. It indicates the percentage of positive predictions that are correct.

$$ P=\frac{TP}{TP+ FP}\times 100\% $$
  • Recall (R): It can be defined as the ratio of number of positive samples classified as positive to total number of positive samples. It is based on percentage of cases that are accurately categorised.

$$ R=\frac{TP}{TP+ FN}\times 100\% $$
  • F-score: This metric, also called the F1 score or F-measure, is the harmonic mean of recall and precision.

$$ F- score=\frac{2\ast P\ast R}{P+R} $$
  • Accuracy (A): It is defined as the proportion of correctly identified samples to the total number of samples. It indicates how close the predictions are to the true values.

$$ A=\frac{TP+ TN}{TP+ TN+ FN+ FP}\times 100\% $$

Where, TP (True Positive), FP (False Positive), TN (True Negative), FN (False Negative).
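As a worked example of the four metrics, the following Python sketch computes them from confusion-matrix counts, using the standard definition of recall, TP / (TP + FN). The counts in the example are made up for illustration and are not results from this work; the function names are hypothetical. The helper `f_score` can also be applied directly to a precision and recall pair expressed in percent.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, F-score and accuracy in percent."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fn + fp)
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score(precision, recall),
        "accuracy": accuracy,
    }


# Illustrative counts only (not taken from the experiments):
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
# precision = 90.0, recall = 81.82 (rounded), accuracy = 85.0
```

Applying `f_score` to the precision and recall values reported for the two datasets, (99.73, 98.94) and (99.35, 99.57), gives 99.33 and 99.46 after rounding to two decimals, which agrees with the reported F-scores.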

4.3 Performance analysis

The outcomes of the proposed ECANN are analysed using accuracy, recall, F-score and precision on the groceries datasets (Freiburg and Grocery Store). The proposed ECANN method is also compared with other DL models to illustrate its effectiveness.

Table 3 presents the results of the proposed ECANN method. The outcomes obtained on the two groceries datasets show the effectiveness of the proposed ECANN based automatic image caption generation. For both datasets, the classification accuracies are higher with the softmax classifier, and the fine-tuning process consistently offers enhanced results. The accuracy obtained on the Grocery Store Dataset is slightly higher (99.46%), while the Freiburg Groceries Dataset yields a comparable accuracy of 99.32%.

Table 3 Outcomes of ECANN based automatic image caption generation

Table 4 compares precision and recall on various datasets. Various DL based methods are compared with the proposed ECANN architecture in terms of precision and recall. The proposed ECANN method is evaluated on the two groceries datasets, whereas the existing DL models were assessed on other groceries datasets that identify objects or food items in certain constrained environments. The results indicate that the proposed ECANN method achieves higher recall and precision than the other DL models.

Table 4 Comparison on precision and recall [6]

Table 5 presents the accuracy comparison on the Grocery Store dataset. For evaluating the automatic generation of image captions, the accuracy metric gives the ratio of predictions that the proposed ECANN model got right, so a higher score indicates more accurate captions. Here, the proposed ECANN model achieved higher accuracy than the other DL models. Table 6 presents the accuracy comparison on the Freiburg dataset. The existing DenseNet-169, VGG16 and AlexNet models are implemented in this work for the Freiburg dataset. ECANN obtains the highest accuracy (99.32%), while the existing CaffeNet [11], DenseNet-169, VGG16 and AlexNet models acquired lower accuracies of 78.9%, 82.51%, 70.86% and 67.43%, respectively.

Table 5 Accuracy comparison on grocery store dataset [14]
Table 6 Accuracy comparison on Freiburg dataset

Table 7 compares the accuracy with and without feature extraction on the Grocery Store dataset. The ECANN approach used the softmax classifier, whereas the other DL approaches used SVM (Support Vector Machine) based classification. The accuracy with feature extraction is higher than the accuracy without the feature extraction phase.

Table 7 Accuracy comparison on with and without feature extraction (Grocery Store Dataset) [14]

Table 8 shows the accuracy comparison of the proposed and existing models with and without feature extraction on the Freiburg dataset. The same existing approaches are considered for this comparison. From Table 7, it is clear that the proposed ECANN model with feature extraction obtained better accuracy than the existing DL models. Tables 7 and 8 illustrate the accuracy obtained by the proposed ECANN model using the softmax classifier and by the existing models using the SVM classifier, with and without feature extraction.

Table 8 Accuracy comparison on with and without feature extraction (Freiburg Dataset)

Table 9 depicts the accuracy achieved by the proposed ECANN and the existing \( \mathrm{VGG16}_6 \), \( \mathrm{VGG16}_7 \), \( \mathrm{AlexNet}_6 \) and DenseNet-169 models using the softmax classifier for both datasets. Comparing the two datasets, the accuracy obtained by the proposed ECANN on the Grocery Store dataset is slightly higher than on the Freiburg dataset. The comparative analysis shows that the accuracy achieved by the proposed ECANN is higher than that of the existing DL models. Table 10 compares the proposed model with basic models on the two datasets using softmax classification.

Table 9 Accuracy comparison Proposed and existing models using softmax classifier with and without feature extraction (Grocery store and Freiburg dataset)
Table 10 Comparative assessment of proposed with other basic models

Figure 7 presents the evaluation results graphically. On the Freiburg dataset, the outcomes are accuracy (99.32%), recall (98.94%), precision (99.73%) and F-score (99.33%). On the Grocery Store dataset, the outcomes are accuracy (99.46%), precision (99.35%), F-score (99.46%) and recall (99.57%). The proposed ECANN model achieved comparable results on both datasets for automatic image caption generation.

Fig. 7 Graphical assessment of ECANN based image caption generation

Figure 8 compares the accuracy graphically. The proposed ECANN model evaluated on the Grocery Store Dataset acquired higher accuracy (99.46%) than other DL architectures such as DenseNet-169 (84.0%), VGG16 (73.8%) and AlexNet (69.3%). The proposed ECANN approach gained improved results due to the ARO based selection of non-captioned images, feature extraction and DL based automatic image caption generation.

Fig. 8 Accuracy comparison on various DL models for grocery store dataset

Figure 9 illustrates the accuracy comparison of the proposed ECANN and the existing CaffeNet, DenseNet-169, VGG16 and AlexNet models on the Freiburg dataset. Table 6 and Fig. 9 show that the proposed ECANN model achieved higher accuracy than the existing models. Thus, the proposed ECANN model is fit for the automatic generation of image captions for visually impaired people.

Fig. 9 Accuracy comparison on various DL models for Freiburg dataset

Figure 10 compares the accuracy with and without feature extraction on the Grocery Store dataset. The proposed ECANN approach using softmax classification gained higher accuracy (99.46%) with feature extraction, whereas without feature extraction it obtained lower accuracy (90.14%). The existing DL models such as \( \mathrm{VGG16}_6 \), \( \mathrm{VGG16}_7 \), \( \mathrm{AlexNet}_6 \) and DenseNet-169 using SVM classification achieved lower results both with and without feature extraction.

Fig. 10 Accuracy comparison on different models with and without feature extraction

Figure 11 depicts the comparative analysis of the proposed and existing models with and without feature extraction on the Freiburg dataset. As in Fig. 10, the proposed ECANN model obtained higher accuracy with feature extraction (99.32%) than without it (90.03%). Likewise, the existing models obtained lower accuracy than the proposed model both with and without feature extraction.

Fig. 11 Accuracy comparison of different models on with and without feature extraction

Figure 12a and b show the accuracy of the proposed ECANN and the existing DL models using the softmax classifier with and without feature extraction on the Grocery Store and Freiburg datasets. In all of the preceding analyses, the existing models used the SVM classifier for automatic image caption generation. Here, the accuracy is computed for the proposed ECANN and the existing approaches, namely \( \mathrm{VGG16}_6 \), \( \mathrm{VGG16}_7 \), \( \mathrm{AlexNet}_6 \) and DenseNet-169, using the softmax classifier. From the comparative analysis, it is clear that the deep learning models provide better accuracy with the softmax classifier than with the SVM classifier.

Fig. 12 a and b Accuracy comparison of various DL models using softmax classifier with and without feature extraction

Figure 13 shows the ROC curve for the ECANN approach. The ROC (Receiver Operating Characteristics) curve is used for quantitative analysis. It plots the TPR (True Positive Rate) against the FPR (False Positive Rate). The AUC (area under the ROC curve) is 0.994 and 0.993 for the Grocery Store and Freiburg datasets, respectively.
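To make the relation between the ROC curve and the AUC concrete, the following Python sketch builds the (FPR, TPR) points by sweeping a threshold over classifier scores and integrates them with the trapezoidal rule. It is an illustrative sketch only: it assumes binary labels (for example one class against the rest), since the way the multi-class problem is reduced to TPR and FPR is not specified here, and the labels and scores in the example are made up. The function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists for thresholds swept from high to low score.

    labels are 1 (positive) or 0 (negative); at least one of each is required.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for idx, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        last_of_tie = idx == len(pairs) - 1 or pairs[idx + 1][0] != score
        if last_of_tie:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k + 1] - fpr[k]) * (tpr[k + 1] + tpr[k]) / 2.0
        for k in range(len(fpr) - 1)
    )


# Illustrative data only: one positive is ranked below one negative.
fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_trapezoid(fpr, tpr))  # 0.75
```

An AUC close to 1, such as the values reported above, means that almost every positive sample receives a higher score than almost every negative sample.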

Fig. 13 ROC curve

Figure 14 compares precision and recall on various datasets. The proposed ECANN obtained precision (99.73%) and recall (98.94%) on the Freiburg Groceries dataset and precision (99.35%) and recall (99.57%) on the Grocery Store dataset. The existing methods such as VGG16, VGG16 + AT (BRISK), VGG16 + AT (SIFT), ResNet-18, ResNet-101 and DenseNet169 acquired lower values when tested on different groceries datasets.

Fig. 14 Precision and recall comparison on different datasets

Figure 15 shows the training and testing accuracy and loss of the proposed ECANN classification model. The accuracy and loss curves are used to evaluate the learning performance. Training and testing are carried out while varying the number of epochs from 1 to 100. Accuracy improves and loss decreases as the number of epochs increases. The ECANN model attains a training accuracy of 89%, 96% and 97% at epochs 20, 40 and 60, respectively. From Fig. 15a, it is noted that the proposed model reaches its maximum accuracy values, and they are the same for both cases (testing and training). During training, the best values are retained, and the training stage exits once the best values have been evaluated. Figure 15b depicts the losses for both the training and validation cases. In the loss curve, the training loss is 2.2, 0.2 and 0.3 at epochs 20, 40 and 60, respectively.

Fig. 15 Training and testing performance of the proposed ECANN model (a) Accuracy (b) Loss

Figure 16 shows the automatically generated image captions. The proposed ECANN builds a model that generates pre-trained image captions with the CNN and uses the LSTM network to process these captions, select the most relevant image caption and remove the non-relevant ones.

Fig. 16 Outcomes of image captions

4.4 Discussion

This section discusses the results of the proposed ECANN and its comparative analysis. The novelty of the proposed work is the automatic generation of image captions with the ECANN deep learning model, which addresses the inconveniences faced by VI people when shopping for grocery items. The proposed ECANN model combines two deep networks, namely the CNN and LSTM architectures, to build an AI (Artificial Intelligence) based caption reuse system that uses reverse image search to reuse pre-existing captions for the target image and selects the most accurate image captions. The proposed ECANN model is used to generate alternative captions for images that are semantically faithful to the original image. The ECANN network is trained using the AAS optimization algorithm. Optimization algorithms are responsible for reducing the loss and providing the most accurate results; they adjust neural network attributes such as the learning rate and weights in order to reduce the loss. In the proposed work, the AAS algorithm is used for optimizing the network attributes. It is a very competitive optimization algorithm and has been applied in various research fields such as flow scheduling problems, economic load dispatch problems, detection, and classification. Compared with other optimization algorithms, the AAS algorithm is better at avoiding local optima, which helps to avoid overlapping features during classification and to optimize the hyperparameters of the network to obtain the most accurate results. In addition, AAS optimization is integrated with ECANN to find the best acceptable solution for a given problem. In this framework, the AAS algorithm reduces the occurrence of errors in the image captioning system.

The efficiency of the proposed automatic image caption generation approach is computed in terms of precision, recall and accuracy. The precision, accuracy and recall attained by the proposed method are compared with different existing models in the above subsection. From the comparative analysis, the proposed ECANN shows higher efficiency than the existing models in the automatic generation of image captions for easy web access by blind people. Two publicly available datasets, namely the Freiburg and Grocery Store datasets, are utilized for the performance evaluation. The state-of-the-art deep learning models considered for the performance comparison are the VGG16, AlexNet, DenseNet-169 [14] and CaffeNet [11] models. These existing pre-trained classifiers have shown lower performance because of accumulated errors and inappropriate text context. These classifier models showed severe complexity during network training, since each caption is treated similarly without regard to the importance of different words. Moreover, during caption generation, scenes or semantic objects can be wrongly recognized. It is very challenging to describe image content automatically using well-formed English sentences, yet doing so greatly helps VI people in everyday life. Due to these drawbacks, the existing classifiers showed lower performance in image caption generation. Hence, in the proposed work, the recall, precision and accuracy of the ECANN are evaluated separately for both datasets and compared with various deep learning models. The ROC curve indicates that the image captions are accurate, with little error. The accuracy of the proposed model using the softmax classifier is higher with feature extraction than without it. Feature extraction thus plays a vital role in accurate image caption generation, as shown by the comparative analysis with and without feature extraction. From the overall comparative analysis, it is clearly shown that the proposed ECANN is appropriate for the automatic generation of image captions and makes web access easier for blind people.

5 Conclusion

This work presents a new deep learning based human-computer interaction approach for the automatic generation of image captions, which benefits blind people. The proposed ECANN model acts as an effective tool for image caption generation and uses a reverse search strategy to select the most suitable captions for the input image. The presented ECANN based automatic image caption generator assists blind people by increasing their spatial awareness and making the Internet more accessible. The ECANN approach collects image data from two online sources that include packaged items (juice products, vegetables, and fruits). Initially, the captioned and non-captioned images are clustered using the ARO model, which saves processing time. Next, appearance and texture features are extracted from the images, which capture more visual information about them. The ECANN model is a hybrid of CNN and LSTM networks in which the CNN generates the pre-trained captions, the LSTM executes the reverse search strategy to select the most suitable caption for the input image, and the loss in the network is minimized using the AAS optimization. Moreover, ECANN generates captions automatically, so images are better described to people who are blind or have low vision. The proposed ECANN model achieved 99.46% accuracy on the Grocery Store dataset and 99.32% accuracy on the Freiburg Groceries dataset, which better supports blind people in distinguishing food items. The major limitation of the proposed AIC generation is that no hardware platforms such as smart glasses or smart phones are integrated. In future, we will extend this work by analyzing the performance on bigger datasets like MSCOCO or Flickr30k in order to develop precise subtitles. Further, we will try to embed NLP and a smart phone application with speaker options, which may help VI people to understand surrounding events and make their lives more enjoyable. We also plan to use a new attention based deep learning model to generate image captions in multiple languages.