1 Introduction

Digital images are now widely available, and captions make them easier for humans to comprehend and analyze. Deep convolutional neural networks have been used in a variety of image classification studies [1, 2] with encouraging results. Whereas manual captioning is a laborious task, automatic image captioning employs algorithms to extract pertinent information about an image's content and produce a human-readable sentence. Most modern automatic image captioning networks are trained on English corpora, and a neural machine translation model then translates the generated English captions into Arabic. El Jundi (2020) [3] proposed an end-to-end Arabic image captioning system that eliminates the potential sources of error introduced by the peculiar sentence structure and intricate morphology of the Arabic language, highlighting the importance of this approach. Reviewing the current state of Arabic image captioning, Attia (2020) [4] concludes that very little research has been done on the task, primarily because of the absence of publicly available datasets. They also emphasize how few studies on Arabic image captioning used attention mechanisms to concentrate on the key elements of the image; caption generation benefits from such attention techniques and produces superior results. Sabri (2021) [5] used image models such as EfficientNet [6] and MobileNetV2 [7] and included an attention mechanism [8, 9] in the decoder to increase captioning accuracy and to mitigate some of the morphological complexity of the Arabic language. To the best of our knowledge, however, no prior work on Arabic image captioning has extracted image features through feature concatenation combined with pre-trained word embeddings.

The Arabic language differs from many other languages in that it uses connected characters rather than capital, small, and diacritical letters. Consequently, Arabic text requires preprocessing operations such as eliminating diacritics, prefixes, suffixes, and connective letters, in contrast to English text preprocessing, which mainly requires converting capital letters to lowercase [10]. Given the substantial cultural influence of the Arabic language, it was critical to review the most recent research on Arabic captioning and propose solutions that fill the research gap in this field.

Embedding techniques represent data in a continuous, dense vector space. In the context of image captioning, they represent images and their associated captions in a common vector space in which the similarity between image vectors and caption vectors can be computed easily. This allows the model to accurately identify the context of an image and generate a caption that describes it. Embeddings are important for image captioning because they allow the model to understand the relationships between different elements in an image, such as the objects and scenes depicted, and between those elements and the words in a caption. This helps the model generate more accurate and natural-sounding captions. Embeddings can also improve other parts of the image captioning pipeline, such as image recognition and text generation, and they play a key role in identifying the context of an image and generating a caption that accurately describes it.

The two primary models used in deep learning-based image captioning are the image model and the language model. The image model extracts the image's features by converting it into a fixed-size vector; how well the most informative features are highlighted directly affects the correctness of the generated captions, and feature concatenation is one way to help with this. The language model is then given the feature vector and generates the caption. Using pre-trained word embeddings together with the LSTM layer can further improve the result. These two models are combined using the encoder–decoder architecture, also known as compositional architecture.

2 Feature concatenation

Concatenating features creates a new feature set. Figure 1 illustrates how individual models, given the dataset images as input, produce feature maps that are then concatenated. We merge the features of one model with those of another model [11], and also combine all of them, to find the best model. One of the feature concatenations we used is

$$ \text{Final feature concatenation} = F^{\text{SWIN}} \cup F^{\text{ConvNexT}} \cup F^{\text{XCIT}} $$
Fig. 1 Our framework for the best model of Arabic image captioning

The layer we need from each classification model is the layer before the last (classifier) layer, so we remove the last layer from each model before merging their features.
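To make this step concrete, the following is a minimal sketch of the feature concatenation in PyTorch, assuming the timm library; the listed model names correspond to the releases named in the next subsections (XCIT-Large 24 p16 224, SWIN-Large patch4 window7 224, ConvNexT-XLarge) but may differ slightly between timm versions.

```python
# A minimal sketch of feature concatenation, assuming the timm library.
# The model names are assumptions and may vary between timm versions.
import timm
import torch

# num_classes=0 removes the classifier head, so each backbone returns
# its pre-classifier (pooled) feature vector.
backbones = [
    timm.create_model("xcit_large_24_p16_224", pretrained=True, num_classes=0),
    timm.create_model("swin_large_patch4_window7_224", pretrained=True, num_classes=0),
    timm.create_model("convnext_xlarge_in22k", pretrained=True, num_classes=0),
]
for m in backbones:
    m.eval()

def concatenated_features(images: torch.Tensor) -> torch.Tensor:
    """images: a batch of RGB images, shape (B, 3, 224, 224)."""
    with torch.no_grad():
        feats = [m(images) for m in backbones]  # one feature vector per backbone
    return torch.cat(feats, dim=1)              # F_SWIN ∪ F_ConvNexT ∪ F_XCIT

# Example: a dummy batch of two images.
x = torch.randn(2, 3, 224, 224)
print(concatenated_features(x).shape)           # (2, sum of the three feature dims)
```

With these three backbones, the concatenated vector has 768 + 1536 + 2048 = 4352 elements, which is consistent with the 4352-element vector fed to the dense layer in the experiments (Sect. 6).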

2.1 Image encoder models

The image model serves as the first phase of the caption generation model. It performs the function of an encoder, converting an input RGB image into a feature vector of newly learned features. The latest models that we used are introduced below.

2.2 Cross-covariance attention (XCIT)

The proposed cross-covariance attention modifies the transformer model by including transposed attention that operates over feature dimensions rather than token dimensions. Because XCIT [12] works with a fixed number of channels rather than the number of tokens, it handles high-resolution inputs flexibly and with lower processing complexity than self-attention. XCIT comes in different sizes, which yield different accuracies; we applied the XCIT-Large 24 p16 224 release.

2.3 Shifted window (SWIN)

It combines patch-merging layers with linear layers to reduce the number of tokens. The shifted window approach has much lower latency than the sliding window method because it establishes connections between neighboring non-overlapping windows in the previous layer [13]. The model divides the input image into non-overlapping patches using a patch-splitting module similar to that in ViT [14] and treats each patch as a token. SWIN also comes in different sizes; we used the large patch4 window7 224 release.

2.4 ConvNexT

It is a purely convolutional (ConvNet) model. ConvNexT [15] is simple, accurate, effective, and scalable. The authors gradually modernized the default ResNet toward the design of a vision transformer; the changes they make to ResNet cause its sliding windows to behave more like the patches of a vision transformer. In this manner, they converted the ConvNet into ConvNexT. ConvNexT comes in different sizes; we used the ConvNexT-XLarge release.

2.5 Language decoder models

Image captions are created using NLP and computer vision techniques. Several techniques have been used for sequence-to-sequence learning, including recurrent neural networks (RNNs), skip-gram models, and log-bilinear models. In this section, we present the most recent Arabic language models used as pre-trained embedding layers with a long short-term memory (LSTM) layer, compared against using the LSTM layer alone, to determine which is the most accurate.

2.6 LSTM

LSTM is an enhancement of the RNN. This type of neural network includes, in addition to the conventional RNN units, special units that use a memory cell to store input over extended periods and to choose what to keep and what to discard [16]. Vinyals et al. [17] used the LSTM as a decoder to generate the caption for the encoded image.

2.7 ARABERT

Antoun et al. (2020) released AraBERT [18], one of several models evaluated against mBERT. AraBERT demonstrated state-of-the-art performance on the majority of the assessed Arabic NLP tasks. The model was trained on news articles manually scraped from various large, publicly available Arabic corpora, amounting to about 77 GB of text. AraBERT is offered in several variants; the bert-base-arabert release was employed.

2.8 ARAELECTRA

Arabic text can be pre-trained using either the masked language model (MLM) objective or the replaced token detection (RTD) objective; Antoun et al. (2021) [19] argued that the RTD objective is more effective and produces superior pre-trained language representation models. AraELECTRA advances the state of the art in named-entity recognition, sentiment analysis, and Arabic question answering, and outperforms other models pre-trained on the same dataset with larger model sizes. The araelectra-base-generator release was employed.

2.9 MARBERTv2

MARBERTv2 is one of the three models released by Abdul-Mageed et al. (2021) [20]. They found that the QA performance of ARBERT and MARBERT was subpar, in glaring contrast to what they had seen on other tasks. Their hypothesis is that the two models were pre-trained with a sequence length of only 128, which prevented them from adequately capturing a query and its expected answer inside the same sequence window.

2.10 CamelBERT

CamelBERT [21] is a collection of BERT models pre-trained on Arabic texts of various sizes and variants. The authors published pre-trained language models for dialectal Arabic (DA), classical Arabic (CA), and modern standard Arabic (MSA), as well as one pre-trained on a combination of the three. They also offered models pre-trained on smaller subsets of the MSA variant (half, quarter, eighth, and sixteenth). According to their findings, the amount of pre-training data is less significant than the proximity between the pre-training data variant and the subtask data. We used the bert-base-arabic-camelbert-ca release.

On the other hand, transformer models now compete with the widely used convolutional neural network (CNN or ConvNet). In this paper we use concatenated features obtained by combining a ConvNet (ConvNexT) with transformers (SWIN or XCIT), or all of them together, to extract the best features from images in the encoder phase. In computer vision, transformers work on the same principles as those used in natural language processing: by examining the links between pairs of input tokens, the transformer internally learns new information. We applied four Arabic models as pre-trained word embeddings with the LSTM layer to improve the accuracy of the generated Arabic captions.
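As an illustration of how a pre-trained Arabic transformer can serve as a word embedding in front of the LSTM layer, the following hedged sketch uses the Hugging Face transformers library with an AraBERT checkpoint; the model identifier "aubmindlab/bert-base-arabert" and the sub-word averaging strategy are assumptions, and other mappings from sub-words to word vectors are possible.

```python
# A hedged sketch: seeding a Keras Embedding layer from a pre-trained
# Arabic transformer. Word vectors are approximated by averaging the
# transformer's sub-word embeddings; this is an illustration only.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from tensorflow.keras.layers import Embedding

MODEL_ID = "aubmindlab/bert-base-arabert"  # assumed Hugging Face identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
bert = AutoModel.from_pretrained(MODEL_ID)
subword_matrix = bert.get_input_embeddings().weight.detach().numpy()

def word_vector(word: str) -> np.ndarray:
    # Average the embeddings of the word's sub-word pieces.
    ids = tokenizer.encode(word, add_special_tokens=False)
    return subword_matrix[ids].mean(axis=0)

def build_embedding_layer(word_index: dict, trainable: bool = False) -> Embedding:
    """word_index: the Keras tokenizer's word -> id mapping (hypothetical input)."""
    dim = subword_matrix.shape[1]
    weights = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        weights[idx] = word_vector(word)
    return Embedding(len(word_index) + 1, dim, weights=[weights], trainable=trainable)
```

Note that the embedding dimension then follows the transformer (768 for a BERT-base model) rather than the dimension used when the embedding layer is trained from scratch.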

The main contributions of this study are as follows:

1. We use transformer models as feature extractors in the image encoder as a first enhancement, which shows better results than related work. We then compare using them individually with using concatenated features from the transformer models, in both cases with an embedding layer placed before the LSTM layer in each model.

2. Using feature concatenation as a feature extractor is the second enhancement: multiple feature representations of an image, produced by the transformer models, are combined into a single feature vector. This has a significant impact on the performance of an image captioning model, helping it to understand the image better and to generate more accurate and descriptive captions.

3. We apply an embedding layer before the LSTM layer in all the models we compared, to generate more accurate and fluent captions. Using an embedding layer before the LSTM layer in an image captioning model can have a significant impact on performance: the embedding layer represents words and image features in a continuous vector space in which similar words and features are close to each other, which allows the model to better understand the relationships between the image and the words in the caption. As the LSTM layer can take the relationships between words into account, the embedding layer helps provide a better representation of these relationships. Additionally, the embedding layer reduces the dimensionality of the LSTM input, which can improve the efficiency of the model and reduce over-fitting.

The rest of the paper is organized as follows. Section 3 reviews the literature, Sect. 4 presents the motivation, and Sect. 5 describes the proposed methodology, followed by experiments and results in Sect. 6. Finally, Sect. 7 concludes the paper and outlines future work.

3 Literature reviews

This section explores recent developments in image captioning in English, in other languages, and in Arabic.

3.1 English image captioning

Gao et al. [22] proposed a new image captioning model that uses a graph attention mechanism to capture the relationships between objects in an image. For embedding the inductive bias into a dictionary set, they presented a novel unsupervised learning technique called scene graph auto-encoder (SGAE), which may be used as a re-encoder for language generation; to further increase transferability, they also reduced the inductive bias in the language decoder of the image captioner. Jiang et al. [23] presented a novel approach to image captioning that leverages the textual context of images: the authors use a transformer-based language model to generate captions that are coherent with respect to a given textual context. The transformer architecture is used for image captioning in more recent publications. Using a meshed memory transformer architecture, Cornia et al. [24] created a multi-level representation of the links between image regions to improve captioning accuracy. On the COCO [25] benchmark dataset, this approach obtained state-of-the-art scores at the time, with BLEU-1 and BLEU-4 scores of 80.8 and 39.1, respectively.

Hu et al. [26] focused on broader concepts; they found that by pre-training the model on large datasets with plenty of tags and then refining it on image/caption pairs, they could identify distinctive objects that are absent from most datasets. The authors explain this by pointing out that the COCO dataset contains relatively few visual concepts, which decreases the benefit of developing a substantial visual vocabulary during pre-training. Their highest score was 45.4 out of 100 (compared to Cornia's second-best score of 29.2). For image captioning, Yu et al. [27] introduced a novel learning paradigm that uses dual attention on pyramid feature maps. They study visual feature representations for image captioning with a primary focus on language generation. Their P-attention model draws attention from receptive fields of many sizes inside an image, which is an effective way to enhance feature maps and the visual feature representation, while the proposed D-attention model can more effectively exploit the channel- and spatial-wise properties from two separate angles. Qi et al. [28] examined the problem of models that are biased toward producing "average" captions that only use common words or phrases. To address this issue, they proposed a discrete mode learning (DML) paradigm for image captioning. The goal is to explore the rich modes in the training caption corpus to build a codebook that incorporates a set of "mode embeddings," which enables image captioning models to output different captions based on different modes.

3.2 The other language for image captioning

Image captioning for several other languages has also been researched. For instance, Yılmaz et al. [29] used the Yandex translation API to translate the captions of the MS-COCO dataset into Turkish. For the image captioning task, they employed a neural network-based model that combines a CNN and an RNN in a single pipelined network, achieving 28.8 and 30.0 for BLEU-1 and BLEU-4, respectively. Zhang et al. [30] employed rehabilitation robots when working with children with autism spectrum disorder (ASD) to track their attention and to describe appealing objects and scenes, improving the children's spontaneous language and turn-taking skills. An image captioning algorithm was created to produce Chinese descriptions; this research highlights the promise of using image captioning in robot-assisted therapy, particularly for Chinese children with ASD. Mishra et al. [31] created an encoder–decoder architecture with attention, similar to that used by Xu et al. [32], for Hindi image captioning. They investigated a variety of attention techniques, including spatial, visual, Bahdanau-style, and Luong-style attention, and achieved their highest score with Bahdanau-style attention (67.0 BLEU-1 on a Hindi version of the COCO dataset). Lu et al. [33] used bidirectional LSTMs and an architecture similar to Xu et al.'s for Chinese image captioning.

3.3 Arabic image captioning

Vasu [34, 35] conducted two studies on Arabic captioning. The first employed the encoder–decoder design developed by Vinyals et al. [17], using an image model as the encoder and, as the decoder, a deep belief network (DBN) pre-trained with restricted Boltzmann machines. In the second work, Vasu replaced the DBN-based decoder with an LSTM-based one [36]. The two techniques received BLEU-1 scores of 34.8 and 55.6, respectively. They were tested on two separate proprietary datasets built by pairing images from news articles with captions from websites such as Al-Jazeera News.

Al-Muazini et al. [37] described a merge strategy for creating Arabic captions. The design consists of an LSTM-based language encoder and an image feature extractor; an additional LSTM model uses the output of both to produce the caption. They used an unpublished Arabic version of the Flickr8k dataset with 2111 captions plus 150 manually translated captions; the captions were translated using crowdsourcing and machine translation. This approach obtained a BLEU-1 score of 46 and a BLEU-4 score of 8. El Jundi et al. [3] also used an encoder–decoder design, with a pre-trained image model as the encoder and an LSTM-based model as the decoder. One of the most significant contributions of their paper is the publication of a new, manually validated dataset based on the Flickr8k dataset; it is now possible for future researchers to compare their findings with past studies, which was previously not possible, since every paper cited above except El Jundi et al. used a private, unpublished dataset. El Jundi et al. obtained scores of 33 and 6 on BLEU-1 and BLEU-4, respectively, on their Arabic Flickr8k dataset. A more recent study by Sabri et al. [5] improved the preprocessing of Arabic captions and achieved 44.3 on BLEU-1 and 15.7 on BLEU-4; additionally, they employed pre-trained models such as EfficientNet and MobileNetV2 to boost performance and accuracy.

In 2022, Emami et al. [38] created and assessed several Arabic image captioning models using well-established metrics. They pre-trained transformers on various Arabic corpora to initialize the models and then fine-tuned them on image-caption pairs using the OSCAR learning technique, which considerably simplifies the learning of image-text semantic alignments by using object tags detected in images as anchor points. They achieved scores of 0.391 and 0.092 for BLEU-1 and BLEU-4, respectively. Table 1 compares image captioning results in different languages, including Arabic, in terms of BLEU scores.

Table 1 A comparison of different languages in image captioning with BLEU scores

Lasheen [38] investigated the impact of beam search and preprocessing on the test BLEU-4 score; the best score was attained by the model using the FARASA segmenter, and the outcome matches that of the model that produced Arabic captions using root words. In addition to the BLEU-N scores, the generated captions were qualitatively reviewed by two teams of native Arabic speakers: the first team evaluated the captions directly, while the second team employed the "THUMB" framework. The captions generated with PyArabic preprocessing produced higher results on the tested sample, scoring 39.108 and 8.29 for BLEU-1 and BLEU-4, respectively, while the FARASA-based preprocessing demonstrated improved attention visualization on its test samples, with scores of 58.708 and 27.12 for BLEU-1 and BLEU-4, respectively.

4 Motivation

Although English image captioning has advanced significantly over the past decade, little research has been done on image captioning for other languages. One approach to this issue is to bootstrap off current English image captioning systems for image understanding and translate the results into the desired language. However, models for English cannot be applied directly to Arabic because the two languages have distinct grammar, vocabulary, and writing systems; they are also spoken in different cultural contexts and have different idiomatic expressions and colloquialisms. Additionally, Arabic has several written forms, including modern standard Arabic (MSA) and various spoken dialects, which further complicates the development of models for the language. Because errors accumulate across the two tasks of image captioning and translation, translated image captions are inadequate, and as a result Arabic image captioning is currently developing slowly. The Arabic language deserves attention: it is the native tongue of 22 countries, is spoken by more than 422 million people in the Arab world and by up to 500 million people worldwide, is ranked the sixth most spoken language, and is one of the six official languages of the United Nations. Arabic also has many characteristics that are challenging to handle, including right-to-left writing, many letters whose sounds do not exist in other languages, and a richer system of related word forms than English.

5 Proposed methodology

In this paper, we address the problem of Arabic image captioning. Language models for Arabic aim to accurately process and generate text in the language by understanding its grammar, vocabulary, and idiomatic expressions. We propose an end-to-end methodology that generates Arabic text directly. Image captioning, which combines computer vision and natural language processing, aims to produce descriptive captions for images; it is a two-step procedure that depends on correctly comprehending both language and images from a syntactic and semantic perspective. We aim to enhance the standard architectural paradigm for image captioning through two actions: first, extracting concatenated features from the top models by removing the classifier layer from each model and concatenating the extracted features to obtain a new representation of the best image features; second, improving the language decoder to produce captions that are as accurate as possible. Here, we compared several models against the LSTM baseline, using them as embedding layers, to determine which was the most accurate. Three vision models—XCIT-Large, SWIN-Large, and ConvNexT-XLarge—were examined in this study together with the four most recent language models used as embedding layers (ARABERT, ARAELECTRA, MARBERT-v2, and CamelBERT). As shown in Fig. 2, employing the MARBERT-v2 [20] model as an embedding layer after obtaining the image features from the concatenated (ConvNexT [15] + SWIN [13] + XCIT [12]) model gave the best BLEU-1 score we obtained.

Fig. 2 Our methodology for Arabic image captioning

We applied LSTM as a layer, as in Fig. 1, in the decoder phase of all the methods we used, whether single models or concatenated models; the only change in the figure is the addition of a block for pre-trained word embedding before the LSTM layer. As previously noted, the first component is the network encoder, which runs once when the generation process begins. It accepts an image and outputs a 256-element vector that serves as the image's description. The 224×224×3 input image is the standard input size for our pre-trained models and is processed by either a single model or a concatenated model without a top (i.e., no classification layer). The model extracts image features to create an image map, and a trainable feed-forward layer with 256 units and a ReLU activation function is applied to the image features. The network's decoder component then executes recursively for every word generated, and the sequence ends when the "<end>" token is produced or the maximum sequence length (28) is reached. A word embedding layer with an embedding dimension of 256 accepts the previously generated word as input (when generation starts, it accepts the "<start>" token). Each word is represented by a vector of length 256; the context vector and the word embedding layer's output are concatenated and passed to an LSTM layer with 512 recurrent units, followed by a dropout layer with a 0.5 drop rate. The dropout output is added to the image vector, and the result is passed through a fully connected layer with 512 units and a ReLU activation function. The final output layer is a feed-forward layer with a softmax activation function and 10,563 units, equal to the vocabulary size. The output of this layer is a probability distribution over the words in the vocabulary, and the recurrent decoder selects the word with the highest probability as the output for this iteration.
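The following is a minimal Keras sketch of the merge-style decoder just described, assuming the dimensions stated in the text (256-dimensional image projection and word embedding, a 512-unit LSTM, 0.5 dropout, and a 10,563-word vocabulary); the extra projection of the image branch to 512 units is an assumption made here so that the two branches can be added element-wise.

```python
# A minimal sketch of the merge-style decoder, under the stated dimensions.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 10563   # size of the Arabic vocabulary reported above
MAX_LEN = 28         # maximum caption length
FEATURE_DIM = 4352   # concatenated XCIT + SWIN + ConvNexT feature vector

# Image branch: concatenated features -> 256-d representation.
image_input = Input(shape=(FEATURE_DIM,), name="input1")
image_vec = Dense(256, activation="relu")(image_input)
image_vec = Dense(512, activation="relu")(image_vec)   # assumption: match the LSTM width

# Language branch: previously generated words -> embedding -> LSTM.
caption_input = Input(shape=(MAX_LEN,), name="input2")
embedded = Embedding(VOCAB_SIZE, 256, mask_zero=True)(caption_input)
hidden = LSTM(512)(embedded)
hidden = Dropout(0.5)(hidden)

# Merge both branches and predict the next word.
merged = add([hidden, image_vec])
merged = Dense(512, activation="relu")(merged)
output = Dense(VOCAB_SIZE, activation="softmax")(merged)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.summary()
```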

The methodology for Arabic image captioning involves several steps as in Fig. 2:

1. Data collection: A dataset of images and their associated Arabic captions needs to be collected and preprocessed; we used the Arabic Flickr8k dataset.

2. Preprocessing: The images need to be resized to a standard size and normalized so that they can be fed into the image captioning model, while the captions need to be tokenized into individual words or sub-words, lowercased, and stripped of punctuation so that they can be used as input to the model.

3. Data splitting: The dataset is split into training and testing sets to evaluate the performance of the model. We split the data into 7091 images for training and 1000 images for testing.

4. Encoding captions: The captions are encoded into numerical values using a tokenizer from the Keras library, which converts a list of captions into a list of numerical values, where each value corresponds to a unique word or token in the vocabulary (see the sketch after this list).

5. Feature extraction: Features are extracted from the images, by merging features from different models or taking them from each model we mentioned in the comparison, using the PyTorch library, and are then used as input to the image captioning model.

6. Language modeling: The pre-trained transformer models are used in the language phase as a pre-trained embedding layer.

7. Training: The image captioning model is trained using the extracted features and preprocessed captions.

8. Evaluation: The trained model is evaluated on a separate dataset using metrics such as BLEU, METEOR, and CIDEr.
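As a sketch of the caption-encoding step (step 4), the snippet below uses the Keras Tokenizer as described; the start/end markers and the example captions are illustrative assumptions.

```python
# A small sketch of caption encoding with the Keras Tokenizer.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = [
    "<start> رجل يركب دراجة في الشارع <end>",
    "<start> كلب يجري على العشب <end>",
]

tokenizer = Tokenizer(filters="")           # empty filters keep the <start>/<end> markers intact
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index

sequences = tokenizer.texts_to_sequences(captions)
padded = pad_sequences(sequences, maxlen=28, padding="post")
print(vocab_size, padded.shape)
```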

6 Experiments and results

We tested our suggested model and trained it using a Keras model on a laptop with one GPU (RTX 2060). The plot of the Arabic caption generation model using concatenated features from XCIT-Large, SWIN-Large, and ConvNexT-XLarge is shown in Fig. 3. Input1 is the image features, input2 is the text sequences (captions), and the dense layer processes the 4352-element concatenated feature vector to produce a 256-element representation of the image.

Fig. 3 Plot of the caption generation model for XCIT + SWIN + ConvNexT with the MARBERT-v2 model

The major steps and techniques used in the training and optimization of each model include:

  • Data preprocessing, so that the images and captions are in the same format and size; the captions were tokenized and encoded.

  • Fine-tuning: the pre-trained transformer models such as MARBERT-v2 were fine-tuned on the image-caption pairs using a contrastive loss function.

  • Dropout, a regularization technique that improves the generalization of neural networks and reduces over-fitting by randomly dropping a proportion of neurons during training.

  • GPU training, used to speed up the training process.

  • A learning-rate schedule, used to adapt the learning rate during training, typically starting with a higher value and gradually decreasing it as training progresses.

  • Early stopping, used to prevent over-fitting by monitoring the validation loss and stopping training when it stops improving (see the callback sketch after this list).

  • Hyper-parameter tuning: the number of layers, the number of neurons, the batch size, and the learning rate were tuned to optimize the performance of the model (Figs. 4, 5, 6, 7, 8, 9).
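The early stopping and learning-rate schedule mentioned above can be realized with standard Keras callbacks; the monitored quantity, patience, and reduction factor below are assumptions rather than the exact values we used.

```python
# A hedged sketch of the regularization and scheduling callbacks.
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Stop when the validation loss stops improving.
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
    # Gradually reduce the learning rate as training progresses.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2, min_lr=1e-6),
]
```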

Fig. 4
figure 4

Loss curve for learning rate 0.001

Fig. 5
figure 5

Accuracy curve for learning rate 0.001

Fig. 6
figure 6

Loss curve for learning rate 3e−5

Fig. 7
figure 7

Accuracy curve for learning rate 3e−5

Fig. 8
figure 8

Loss curve for our hyper-parameters

Fig. 9
figure 9

Accuracy curve for our hyper-parameters

The hyper-parameter settings for our model are as follows: word embedding dimensions: 512; hidden layer dimensions: 256; maximum epochs: 30; language model layers: 1-layer LSTM; LSTM dropout: 0.5; learning rate: 0.00001; optimizer: Adam; batch size: 16. We calculated the loss after each training iteration and minimized it so that the model's predictions became as close as possible to the target values. We used a categorical cross-entropy loss function for this goal, as it measures the difference between the predicted probability distribution over all possible words and the true word in the caption. We experimented with different learning rates and other hyper-parameters to see whether the performance of the model could be further improved, until we obtained the values above with the corresponding accuracy and loss curves. We started from the default Adam learning rate of 0.001 and observed over-fitting, with the loss curve shown in Fig. 4. We then reduced the learning rate and adjusted the other hyper-parameters (e.g., using dropout), obtaining the loss shown in Fig. 6, which affects the accuracy accordingly. A small learning rate makes the model learn more slowly, avoids over-fitting to the training data, and prevents the model from being overly influenced by noisy data points.
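Putting the stated hyper-parameters together, a hedged sketch of the compile and training calls looks as follows; `model` and `callbacks` come from the earlier sketches, and the training arrays are hypothetical placeholders for the preprocessed features and encoded captions.

```python
# Compiling and training with the hyper-parameters listed above
# (Adam, learning rate 1e-5, categorical cross-entropy, batch size 16, up to 30 epochs).
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    [X_image_features, X_caption_prefixes],  # hypothetical: concatenated features + encoded caption prefixes
    y_next_word,                             # hypothetical: one-hot targets over the 10,563-word vocabulary
    validation_split=0.1,
    epochs=30,
    batch_size=16,
    callbacks=callbacks,                     # early stopping and learning-rate schedule defined earlier
)
```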

6.1 Data set

Most researchers employ one of two widely used datasets when training image captioning models. The first, which provides data for a variety of computer vision applications such as object detection, image segmentation, and image captioning, is the Common Objects in Context (COCO) dataset. Its image captioning portion consists of 83 k training images and 41 k testing images, each with five different captions. The dataset also includes an Arabic translation generated by the Google machine translation API; however, this Arabic version is not human-checked, and Sabri [5] found, on a randomly selected subset of 150 captions, that 46% of the translations ranged from having some grammatical and semantic issues to being completely incomprehensible. The second well-known collection is the Flickr8k dataset [40], which consists of 8092 photographs with five captions each. The captions were collected from the Flickr website and used to train image captioning models; the images cover various objects and scenes such as people, animals, landscapes, and buildings. El Jundi et al. [3] developed an Arabic version of this dataset: the captions were translated using Google's machine translation API and then carefully vetted and graded by professional Arabic translators, and the best three captions (out of the original five) were retained. As a result, the dataset is substantially smaller than what is available for other languages, which harms the deep learning models trained on it, as they generally require huge amounts of data.

6.2 Efficiency of image captioning

Different evaluation metrics have been established to assess the efficiency of image captioning models. Automated evaluations are performed using computer algorithms, and the bulk of research uses one of these methods to evaluate model performance; the main purpose is to guarantee impartial evaluation practices. It has also been argued that BLEU measurements based on lower n-grams frequently result in erroneous judgments; the thorough study by Kilickaya et al. [41] offers a comprehensive understanding of how to evaluate the automatic metrics for image captioning. We also used METEOR, ROUGE, and CIDEr [42] to measure the performance of the models. To do so, we first assessed each single vision model for extracting features from images and then compared it with the concatenated features. To assess the stability and performance of the models, we calculated the mean and standard deviation of all evaluation metrics over different weight initializations: every model was run three times (once with the pre-trained weights from the vision models and twice more with Xavier and He initializations, which suit the ReLU activation function and can be an effective way to improve performance). The results for all evaluated vision models with each embedding layer are shown in Tables 2 and 3. Table 2, for the single vision encoders, shows that the ConvNexT model with the pre-trained word embedding CamelBERT achieves the highest BLEU-1 score, while the SWIN model with ARAELECTRA achieves the best BLEU-4 score, not far from ConvNexT with CamelBERT. Table 3, for the concatenated vision encoders, shows that the ConvNexT + XCIT + SWIN model with CamelBERT achieves the highest BLEU-1 score, while the best BLEU-4 score is obtained by ConvNexT + SWIN with the ARAELECTRA embedding layer; however, the BLEU-4 scores converge, with the ConvNexT + XCIT + SWIN model scoring 0.16646 with CamelBERT and 0.16518 with ARAELECTRA.
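A sketch of the scoring protocol follows, assuming NLTK's corpus-level BLEU implementation: BLEU-1 and BLEU-4 are computed per run and then summarized by mean and standard deviation over the three initializations (`all_runs` is a hypothetical container of tokenized references and hypotheses).

```python
# A sketch of BLEU scoring and mean/std aggregation over repeated runs.
import numpy as np
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_scores(references, hypotheses):
    """references: one list of reference token lists per image; hypotheses: one token list per image."""
    smooth = SmoothingFunction().method1
    b1 = corpus_bleu(references, hypotheses, weights=(1, 0, 0, 0), smoothing_function=smooth)
    b4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
    return b1, b4

def summarize_runs(all_runs):
    """all_runs: a hypothetical list of (references, hypotheses) pairs, one per training run."""
    scores = np.array([bleu_scores(refs, hyps) for refs, hyps in all_runs])
    mean, std = scores.mean(axis=0), scores.std(axis=0)
    print("BLEU-1: %.4f ± %.4f" % (mean[0], std[0]))
    print("BLEU-4: %.4f ± %.4f" % (mean[1], std[1]))
```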

Table 2 Comparisons of Flickr8k’s Arabic image captioning efficiency metrics with single vision models + embedding layer
Table 3 Comparisons of Flickr8k’s Arabic image captioning efficiency metrics with concatenated models + embedding layer

Figure 10 compares the mean and standard deviation of the BLEU scores of the tested single models. ConvNexT with CamelBERT achieves the highest score on most BLEU metrics, except for BLEU-4, where SWIN with ARAELECTRA is close to ConvNexT with CamelBERT. We also compare the mean and standard deviation of the METEOR, ROUGE, and CIDEr scores of all models, as shown in Table 2, where a higher score indicates better performance.

Fig. 10 Comparison between BLEU scores for 3 models with an embedding layer

Figure 11 compares the BLEU scores of the tested concatenated models. SWIN + XCIT + ConvNexT with CamelBERT achieves the highest BLEU-1 and most other BLEU scores, except for BLEU-4, where ConvNexT + SWIN with ARAELECTRA is not far from SWIN + XCIT + ConvNexT with CamelBERT. We also compare the mean and standard deviation of METEOR, ROUGE, and CIDEr for all models, as shown in Table 3, where a higher score indicates better performance.

Fig. 11 Comparison between BLEU scores for 4 concatenated models with an embedding layer

6.3 The size of each model

We measured the size of each trained image captioning model for the three vision models used individually with each embedding layer. As shown in Table 4, the image captioning model using XCIT with LSTM only (without a pre-trained embedding layer) has the smallest size, while the largest is ConvNexT with MARBERT-v2; the most efficient model, ConvNexT with CamelBERT, is of medium size. We also measured the size of each trained image captioning model for the four concatenated vision models with each embedding layer. As shown in Table 5, the model using SWIN + XCIT with LSTM only (without a pre-trained embedding layer) has the smallest size, while the largest is ConvNexT + SWIN + XCIT with MARBERT-v2; the most efficient model in terms of BLEU-1, ConvNexT + SWIN + XCIT with CamelBERT, is also of medium size.
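The model sizes reported in Tables 4 and 5 can be obtained, for example, from the parameter count and the saved weight file; the helper and file name below are hypothetical examples.

```python
# A small sketch of how a trained model's size can be measured.
import os
from tensorflow.keras.models import Model

def report_size(model: Model, path: str = "captioning_model.h5") -> None:
    """Save the model and report its parameter count and on-disk size (path is a hypothetical example)."""
    model.save(path)
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{model.count_params():,} parameters, {size_mb:.1f} MB on disk")
```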

Table 4 Sizes of the image captioning models
Table 5 Sizes of the image captioning models

6.4 Time evaluation

For each of the three tested single vision models with the four pre-trained embedding models and LSTM, we compared the training time needed to reach the best captioning epoch. As shown in Fig. 12, XCIT with ARABERT was the fastest, taking the least training time (1.5 h). The slowest was SWIN-Large with ARABERT, which finished training in 3.5 h, while ConvNexT with CamelBERT, the most efficient at producing captions, took around 2 h. We performed the same evaluation for the four concatenated vision models with the four pre-trained embedding models and LSTM, as shown in Fig. 13: the most efficient model, ConvNexT + SWIN + XCIT with CamelBERT, took around 1.68 h, not far from the fastest at 1.57 h, while the slowest, ConvNexT + SWIN with ARABERT, took 2.37 h.

Fig. 12 The tested single image captioning models' training times with pre-trained embedding models

Fig. 13 The tested concatenated image captioning models' training times with pre-trained embedding models

6.5 Vision evaluation

Sample images and their captions from the Arabic Flickr8k dataset are shown in Fig. 14, comparing the results obtained using a single vision model with those obtained using concatenated vision models, each with a different pre-trained word embedding layer and LSTM, as discussed in this paper. We picked a sample of the captions obtained after evaluating each model and compared them as follows.

Fig. 14 Samples of captions generated by our best model

7 Conclusion and future work

This work focused on Arabic image captioning. We improved the accuracy of the Arabic captioning model by using concatenated features from the transformer models and the ConvNexT model in the vision phase, together with a pre-trained word embedding in the language phase. We show that combining the features from the XCIT-Large, SWIN-Large, and ConvNexT-XLarge models with the pre-trained embedding model CamelBERT achieved the most accurate scores on the Flickr8k dataset. Arabic image captioning presents a unique set of challenges: its complicated morphology makes it more difficult to analyze than other languages, and well-annotated datasets are lacking. Since there is only one publicly available dataset for Arabic image captioning, data availability will remain the main limitation on future development, especially as some generated captions repeat the same word many times due to the lack of Arabic training captions, a problem that could be alleviated with data augmentation. Different preprocessing and deep learning methods could also be evaluated, although doing so would require more testing and resources. In the future, we also aim to extend Arabic image captioning to the medical field, which will help the medical sector.