Generative image captioning in Urdu using deep learning

Urdu is a morphologically rich language that lacks the resources available for English. While several studies on image captioning in English have been published, this is among the first studies on generative image captioning in Urdu. The study makes three key contributions: (i) it presents a new dataset for Urdu image captioning; (ii) it presents different attention-based architectures for image captioning in Urdu, which have not previously been applied to the Urdu image captioning task; and (iii) it performs quantitative and qualitative analysis of the results by studying the impact of different model architectures on Urdu image caption generation. Extensive experiments on the Urdu image caption generation task show encouraging results: a BLEU-1 score of 72.5, BLEU-2 of 56.9, BLEU-3 of 42.8, and BLEU-4 of 31.6. The data and code used in the study are available for future research via GitHub (https://github.com/saeedhas/Urdu_cap_gen).


Introduction
The image captioning task aims at describing the contents of an image in natural language (Mishra et al. 2021), which can be accomplished by combining computer vision techniques with natural language processing (NLP) methods. The general idea of an image captioning system is to encode the input image into a vector using computer vision techniques and then decode that vector into words using a decoder from NLP language models. An example of image captioning is illustrated in Fig. 1: images are the input of the image captioning system and captions are the output. Benchmark image captioning datasets for English include Flickr8K (Hodosh et al. 2013), NOCAPS (Agrawal et al. 2019) and MSCOCO (Lin et al. 2014). Since natural language generation is a key part of the captioning system, the BLEU score (Papineni et al. 2002) is commonly used as the evaluation metric. The applications of this task are wide and varied, including but not limited to: assisting visually impaired individuals to surf the web (Makav and Kılıç 2019; Fisch et al. 2020; Liu et al. 2020), enhancing image search with semantic information (Lindh et al. 2020), navigating video scenes (Zhou et al. 2020a), or even enabling AI-driven cars to better understand their environment (Kim et al. 2018; Xu et al. 2015; Zhou et al. 2020b).
Inspired by prior work (Bahdanau et al. 2015), Xu et al. (2015) proposed a model based on visual attention, trained in a deterministic manner using standard back-propagation techniques and additionally learning to soft attend on objects as well as non-objects (semantics) while generating the corresponding tokens in the output sequence. Their model produced state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO (Young et al. 2014). Later on, Aneja et al. (2018) achieved a similar score by using a purely convolutional architecture, replacing the LSTM with feed-forward masked convolutions that restrict the convolution operations to use only the past words' information. Vinyals et al. (2015) and Huang et al. (2019) proposed an "attention on attention" (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and current context. Applying AoA to both the encoder and the decoder of the image captioning model achieved new state-of-the-art (SOTA) results (Wang et al. 2022).

Research objectives and our contributions
Urdu is an Indo-Aryan language that has borrowed a large percentage of its vocabulary from other languages such as Arabic and Persian (Amjad et al. 2020). The Ethnologue, a well-known reference source that publishes statistics on living languages, ranked Urdu as the 11th most spoken language in the world in 2020. It is also widely acknowledged as a major South Asian language, with 490 million native speakers worldwide (Shaik and Venkatramaphanikumar 2021). It is the official language of five Indian states, including Bihar, Uttar Pradesh, and Jharkhand. It is the national language of Pakistan, which has a population of about 220 million people. According to the 2011 census of linguistic statistics conducted by the Indian government, India had 50,772,631 Urdu speakers. Urdu speakers can also be found in the United Kingdom, the United States, Canada, Australia, the Middle East, and Europe.
Urdu uses the Arabic script in cursive format (Nastaliq style) with a segmental writing system. Specifically, the Urdu language is based on an "abjad" system, in which the long vowels and consonants are necessarily written while the short vowels (diacritics) are optional. It is a bidirectional language: numerals are written from left to right, while characters are written from right to left. When characters are joined to make words, they take different shapes depending on the context. Specifically, a character can have a maximum of four shape variants, known as initial, medial, final and isolated. The characters that can take all four shapes are known as joiners, while the characters that can only have two shapes (final and isolated) are known as non-joiners (Kanwal et al. 2020).
Unlike in English, a white space character is not a reliable word boundary indicator in Urdu. That is, Urdu does not have consistent word boundary markings. For example, a writer may insert a space within a word (respectable) in order to make it visually correct, where the character . represents the ASCII space character. If the writer omits the space, it may lead to an incorrect visual form of the same word. Conversely, the writer may omit the space between two words (Urdu language) because the shape of the characters remains the same with or without the space. That is, Urdu words ending with non-joiner characters exhibit the correct shape even without a space. Consequently, a writer may omit the space between words ending with non-joiner characters. Most existing studies on generative image captioning are focused on English. To the best of our knowledge, no such published work exists in the realm of neural image caption generation for Urdu. Urdu is a low-resource and more morphologically complex language than English (Mahmood et al. 2020; Malik et al. 2021).
Urdu is often regarded as a low-resource language due to the lack or inadequacy of various critical resources, such as gold standard datasets and fundamental natural language processing (NLP) toolkits, including reliable tokenizers and stemmers (Shaik and Venkatramaphanikumar 2021). Our discussion, however, is focused on the limitations of Urdu in the image captioning task. Some key limitations are as follows.
• Lack of attention. The image captioning task has been extensively investigated for resource-rich languages such as English. To the best of our knowledge, no such published work exists in the realm of neural image caption generation for Urdu. Urdu is a low-resource and more morphologically complex language than English (Mahmood et al. 2020; Malik et al. 2021).
• Unavailability of resources. As mentioned earlier, this is the first study on generative image captioning in Urdu, and there is no existing corpus available to perform this task. Therefore, in this paper we introduce a new corpus for it.
Our contributions. The contributions of this work are as follows:
• We present a new dataset for Urdu image captioning, which can be accessed via GitHub (https://github.com/saeedhas/Urdu_cap_gen).
• We also discuss different types of attention-based architectures for image captioning in the Urdu language. These attention mechanisms are new for Urdu, as they have never been used for the Urdu image captioning task.
• Further, we provide quantitative and qualitative analysis of the results, studying the impact of differing model architectures on the image caption generation task in Urdu.
• Finally, we show that the best model achieves a BLEU-1 score of 72.5, BLEU-2 of 56.9, BLEU-3 of 42.8, and BLEU-4 of 31.6 on the Urdu image caption generation task.
The rest of the paper is organized as follows. Section 2 reviews the existing image captioning techniques. Section 3 discusses methodology and experimental setup. Section 4 presents the experimental results. Section 5 presents the conclusions and future work directions.

Literature review
Image captioning techniques can be organized into extractive and generative techniques. More details on extractive and generative captioning are provided in the following paragraphs.

Extractive captioning
The earliest approaches rely on hand-engineered features for visual elements and rule-based systems for language models. Some progress was reported using human-engineered templates and piecing together phrases containing detected objects. Hodosh et al. (2013) treated sentence-based image annotation as a ranking problem mapped to a given pool of captions. Several studies formulated this task as a retrieval problem and proposed solutions that embed images and text in the same space (Gong et al. 2014; Li et al. 2020; Zhou et al. 2020a). Socher et al. (2014) used deep learning to co-embed images and sentences, and Karpathy et al. (2014) embedded image sub-regions and sub-sentences jointly. Regional attributes have been used in many image captioning methods to alleviate the issues with predetermined caption templates. Farhadi et al. (2010) used detections to infer a triplet of image regions and returned suitable text by filling in a textual template. Li et al. (2011) used object detections and then pieced together a final description from phrases containing detected objects, modifiers and locations using web-scale n-grams. Yao et al. (2010) introduced a semantic representation based on the Web Ontology Language, produced by parsing images, which is then converted to human-readable text. Kulkarni et al. (2013) used detections beyond triplets but with template-based text generation. The advantage of template-based methods is that the resulting captions tend to be grammatically correct. However, they use hard-coded visual concepts and hence struggle to produce the required variety in the output. Kuznetsova et al. (2014) retrieved images similar to the query image, then extracted noun, verb and prepositional phrases from the captions of those images. Finally, they ran an object detector on the query image and composed captions by pairing the detected objects with relevant captions of the previously fetched images.

Generative captioning via deep learning
In contrast to the aforementioned dual-stage methods, the recent trend for image-to-text generation is to use deep learning based encoder-decoder architectures that connect a CNN to an RNN to learn the mapping from images to sentences without involving any rules or human-engineered features. In Mao et al. (2014), the RNN is conditioned on the image information only at the first time step. The first landmark paper that reported tangible results was by Vinyals et al. (2015), who combined deep CNNs for image classification with an LSTM for sequence modelling to create a single network that generates descriptions of images. Chen and Lawrence Zitnick (2015) learned a bi-directional mapping between images and their sentence-based descriptions, which additionally enables reconstruction of visual features when given a caption as input. Tanti et al. (2017, 2018) conjectured that, in a CNN-RNN setting for image caption generation, the image information can be fed to the neural network either by directly incorporating it in the RNN, i.e. conditioning the language model (LM) by 'injecting', or in a layer following the RNN, i.e. conditioning the LM by 'merging' image features, where the latter allows the RNN's hidden state vector to shrink in size by up to four times.
Their results suggest that the visual and linguistic modalities for caption generation need not be jointly encoded by the RNN since it yields large, memory-intensive models with few tangible advantages in performance; rather, the multimodal integration should be delayed to a subsequent stage. Bahdanau et al. (2015) proposed the soft attention mechanism for machine translation that produced revolutionary results by generating the target language tokens conditioning the LM on previous prediction by learning to shift and pay attention to parts of the source sentence representation. Inspired by prior work (Bahdanau et al. 2015), Xu et al. (2015) proposed a model based on visual attention, trained in a deterministic manner using standard back-propagation techniques and additionally learning to soft attend on objects as well as non-objects (semantics) while generating the corresponding tokens in the output sequence. Their model produced state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO (Young et al. 2014). Later on, Aneja et al. (2018) achieved a similar score by using a purely convolutional architecture, replacing LSTM, with feed-forward masked convolutions to restrict the convolution operations to use only the past words' information. Vinyals et al. (2015) and Huang et al. (2019) proposed an "attention on attention" (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and current context. Applying AoA to both the encoder and the decoder of the image captioning model achieved new state-of-the-art results (Table 1).

Methodology and experimental setup
We chose ResNet-101 (He et al. 2016) as the encoder and an LSTM as the decoder. We used two encoder-decoder architectures: (i) the Merge Model (Tanti et al. 2018) as a baseline and (ii) the Attention-driven Context-based Model (Xu et al. 2015) as our main model, as shown in Fig. 2.

Dataset
To prepare the image-mapped Urdu dataset, we use the Flickr8K dataset (Hodosh et al. 2013) for cross-reference; it is a standard dataset widely used by the research community for image caption generation in English. The Flickr8K dataset comprises 8000 images, each presented with 5 English captions on average. We selected a subset of the Flickr8K dataset with five English captions per image; these were manually translated into Urdu by a native speaker, followed by several rounds of quality control involving another native speaker of Urdu.

Model training
The data is randomized and split into 1440 images for the train set, 180 for the validation set and 180 for the test set. Each image has five captions, which results in a corresponding split of 7200 train, 900 validation and 900 test captions. For the encoder of our baseline model, we remove the last classification layer 'FC' to obtain the image feature vector from the second last fully connected layer. However, for our main model, which is based on attended annotation vectors, we make use of spatial context. We strip off the trailing layers after the convolutions, i.e. the pooling and fully connected (dense) layers, and obtain a 3D tensor as the image feature set by adaptive average pooling of the output of the last convolutional layer. This 3D feature set, a 2048-channel 14x14 tensor, is flattened to a 2D representation of 196 annotation vectors, each of size 2048, which are attended to by increasing the weight of the relevant ones.
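The split arithmetic and the flattening step can be checked with a small sketch. It is an illustration only: the function names, the placeholder image identifiers and the fixed seed are assumptions, and plain nested lists stand in for the tensors a deep learning framework would provide.

```python
import random

CAPTIONS_PER_IMAGE = 5

def split_images(image_ids, n_train=1440, n_val=180, n_test=180, seed=0):
    """Shuffle the image ids and cut them into train/validation/test parts."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # the seed is a hypothetical choice
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

def to_annotation_vectors(feature_map):
    """Turn a channels x height x width map into height*width vectors of length channels."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [[feature_map[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]

# Placeholder identifiers; the real dataset uses Flickr8K file names.
image_ids = [f"img_{i:04d}" for i in range(1800)]
train, val, test = split_images(image_ids)
caption_counts = {name: len(part) * CAPTIONS_PER_IMAGE
                  for name, part in (("train", train), ("val", val), ("test", test))}

# A dummy 2048 x 14 x 14 feature map filled with zeros.
feature_map = [[[0.0] * 14 for _ in range(14)] for _ in range(2048)]
annotations = to_annotation_vectors(feature_map)
print(caption_counts, len(annotations), len(annotations[0]))
```

Splitting by image rather than by caption keeps all five captions of an image in the same partition, which is what the reported 7200/900/900 caption counts imply.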
To initialize the language model (LSTM), the annotation vectors are first averaged to produce a single vector of size 2048, which is projected by two independent fully connected layers to the cell state size (512) and the hidden state size (512). Soft attention is a deterministic and differentiable function built from MLPs. This dense neural network is learnt as part of the training process to decide how much soft attention to apply to each annotation vector a_i, conditioned on the decoder's last hidden state h_{t-1}. The attention network therefore takes two inputs: the flattened image feature annotations and the latest hidden state of the LSTM. The image feature vectors are projected to a 512-dimensional feature space by a fully connected layer, while a separate fully connected layer does the same for h_{t-1}. The projected hidden state is combined with each of the projected annotation vectors using the add operation, which produces a ReLU-activated output of shape (196,512). The tensor is passed to a Softmax layer that converts it to a probabilistic attention vector of dimension (196,1). This vector is used to weight the (2048,196) shaped annotation vectors and finally give the context vector representation of the image features.
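The attention computation described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the weights are random stand-ins for learned parameters, the feature, hidden and attention sizes are shrunk from 2048/512/512 to toy values, and the final scoring layer that reduces each activated vector to a single score before the softmax is an assumption borrowed from common soft-attention implementations, since the text does not spell that step out.

```python
import math
import random

def linear(weights, bias, x):
    """Fully connected layer: weights is (out x in), x has length in."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def random_layer(rng, n_out, n_in):
    """Random (weights, bias) pair standing in for learned parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(annotations, hidden, enc_layer, dec_layer, score_layer):
    """Return the context vector and the attention weights over annotation vectors."""
    projected_hidden = linear(*dec_layer, hidden)
    scores = []
    for a in annotations:
        projected_a = linear(*enc_layer, a)
        # Add the two projections and apply ReLU.
        activated = [max(0.0, p + q) for p, q in zip(projected_a, projected_hidden)]
        # Assumed scoring layer: one scalar per image location.
        scores.append(linear(*score_layer, activated)[0])
    alpha = softmax(scores)
    feature_size = len(annotations[0])
    context = [sum(w * a[k] for w, a in zip(alpha, annotations))
               for k in range(feature_size)]
    return context, alpha

rng = random.Random(0)
# Toy sizes; the paper uses 196 locations, 2048 features, 512 hidden, 512 attention.
NUM_LOCATIONS, FEATURE_SIZE, HIDDEN_SIZE, ATTENTION_SIZE = 196, 8, 6, 4
annotations = [[rng.random() for _ in range(FEATURE_SIZE)] for _ in range(NUM_LOCATIONS)]

# Initial LSTM states from the mean annotation vector, via two independent layers.
mean_annotation = [sum(col) / NUM_LOCATIONS for col in zip(*annotations)]
init_h = linear(*random_layer(rng, HIDDEN_SIZE, FEATURE_SIZE), mean_annotation)
init_c = linear(*random_layer(rng, HIDDEN_SIZE, FEATURE_SIZE), mean_annotation)

context, alpha = soft_attention(
    annotations, init_h,
    random_layer(rng, ATTENTION_SIZE, FEATURE_SIZE),
    random_layer(rng, ATTENTION_SIZE, HIDDEN_SIZE),
    random_layer(rng, 1, ATTENTION_SIZE),
)
print(len(mean_annotation), len(alpha), len(context))
```

The mean annotation vector has the feature dimension (2048 in the paper), and the context vector is a convex combination of the annotation vectors because the attention weights are positive and sum to one.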
Batched training of RNNs requires fixed-length sequences, but our sentences are intrinsically of varied lengths. To make them uniform in size, we fixed the maximum caption length at 39. This does not correspond to the longest sentence in the dataset; it was chosen through a percentile analysis that discards outliers and covers 95% of the captions. Longer captions are clipped to comply with the maximum allowed length. To compensate for shorter lengths, <pad> tokens are appended so that every caption has the same length. We substituted words occurring fewer than 3 times with an <unk> token. This models the probability of unknown words that might appear in the validation and test set captions but are not present in the train set.
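The preprocessing steps above can be sketched as follows. The helper names are hypothetical, the nearest-rank percentile rule is an assumption about how the 95% cut-off was computed, and the captions are transliterated placeholder tokens rather than entries from the dataset.

```python
import math
from collections import Counter

PAD, UNK = "<pad>", "<unk>"

def percentile_length(captions, coverage=0.95):
    """Smallest caption length that covers at least `coverage` of the captions."""
    lengths = sorted(len(c) for c in captions)
    index = math.ceil(coverage * len(lengths)) - 1  # nearest-rank percentile
    return lengths[max(index, 0)]

def build_vocab(captions, min_freq=3):
    """Keep only the words that occur at least `min_freq` times."""
    counts = Counter(tok for c in captions for tok in c)
    return {tok for tok, n in counts.items() if n >= min_freq}

def encode(caption, vocab, max_len):
    """Clip to max_len, replace rare words with <unk>, then pad with <pad>."""
    tokens = [tok if tok in vocab else UNK for tok in caption[:max_len]]
    return tokens + [PAD] * (max_len - len(tokens))

# Transliterated placeholder captions (assumption, for illustration only).
train_captions = [
    ["aik", "kutta", "daur", "raha", "hai"],
    ["aik", "kutta", "khel", "raha", "hai"],
    ["aik", "kutta", "ghaas", "par", "daur", "raha", "hai"],
    ["aik", "larka", "hans", "raha", "hai"],
]
vocab = build_vocab(train_captions, min_freq=3)
max_len = percentile_length(train_captions, coverage=0.75)
clipped = encode(train_captions[2], vocab, max_len)
padded = encode(["aik", "larka"], vocab, max_len)
print(max_len, clipped, padded)
```

Building the vocabulary from the train captions only means that rare or unseen words in the validation and test captions also map to <unk>.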
We introduced a custom embedding layer of size 512 which learns a fixed length continuous domain representation during the training process. This is the final representation of words that is consumed by the LSTM decoder. The LSTM is used with a hidden state size of 512. To predict the next word, we use the updated hidden state which is upsampled by a fully-connected layer projecting the 512 vector to the vocabulary space. This is connected with Softmax for word prediction. Cross entropy loss (multi class) is used for back-propagation of gradients.
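As a small illustration of the prediction step, the sketch below applies a softmax over vocabulary logits and computes the multi-class cross entropy against the target word at each time step. The function names are hypothetical, and averaging the loss over time steps is an assumption; the text does not say how padded positions are handled.

```python
import math

def log_softmax(logits):
    """Log-probabilities over the vocabulary, computed stably."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the target word at each time step."""
    losses = [-log_softmax(logits)[target]
              for logits, target in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)

# Two decoding steps over a toy vocabulary of four words.
uniform_loss = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
confident_loss = cross_entropy([[9.0, 0.0, 0.0, 0.0], [0.0, 9.0, 0.0, 0.0]], [0, 1])
print(uniform_loss, confident_loss)
```

With uniform logits the loss equals the logarithm of the vocabulary size, and it approaches zero as the logit of the correct word dominates.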
For the baseline model, we use only the word embedding (512) of the last prediction S_{t-1} as input to the next time step. The hidden state h_t undergoes a cyclic update in the LSTM. For the attention-driven main model, the context vector is combined with the word embedding of the previous prediction S_{t-1} to constitute the input. The vectors are combined by concatenation and fed together to the LSTM decoder to generate the next word.
The Adam optimizer is used with a learning rate of 4e-4. The BLEU-4 metric is tracked on the validation set throughout the training process. An adaptive learning rate is used, with a decay of 20% if there is no improvement in BLEU for 8 consecutive epochs. Dropout of 0.5 was employed, with teacher forcing for a randomly chosen 50% of the training epochs. A maximum of 100 epochs was used, each with mini-batches of 32, together with early stopping based on the BLEU score after 20 epochs without improvement.
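The interplay of learning-rate decay and early stopping can be sketched as a small simulation. The function name and the exact rule (decay at every 8th epoch without improvement, stop at the 20th) are assumptions about how the description above was implemented, and the BLEU trace is synthetic.

```python
def run_schedule(val_bleu_per_epoch, lr=4e-4, decay=0.8,
                 patience_lr=8, patience_stop=20, max_epochs=100):
    """Replay a validation BLEU-4 trace and return (best score, final lr, history)."""
    best, since_best = float("-inf"), 0
    history = []
    for epoch, bleu in enumerate(val_bleu_per_epoch[:max_epochs], start=1):
        if bleu > best:
            best, since_best = bleu, 0
        else:
            since_best += 1
            if since_best == patience_stop:
                history.append((epoch, lr, "stop"))  # early stopping
                break
            if since_best % patience_lr == 0:
                lr *= decay  # a 20% decay keeps 80% of the rate
        history.append((epoch, lr, "continue"))
    return best, lr, history

# Synthetic trace: BLEU-4 improves for 11 epochs, then stays below its peak.
trace = [10.0 + i for i in range(11)] + [19.0] * 30
best, final_lr, history = run_schedule(trace)
print(best, final_lr, history[-1])
```

Under these assumed rules, a run whose validation BLEU-4 peaks at epoch 11 decays the learning rate twice and stops at epoch 31.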
Cross entropy loss, top 5 accuracy and BLEU scores were tracked. It is observed that the improvement in BLEU score does not always correspond to a reduction in loss so we stopped the training process early using BLEU-4. The resulting improvement in the language scoring metric BLEU-4 is evident as the stabilized img2seq model is tuned further to enhance the Encoder's adaptability. This is done by image encoder retraining. Initially, transfer learning was leveraged on the encoder by keeping its weights frozen and only the decoder was trained. The training phase lasted for 31 epochs with the BLEU-4 score peaking at about 21.56 on the 11th epoch. We fine-tuned the encoder, restarting the training with parameters of the 11th checkpoint using a reduced batch size and reduced learning rate. This is because the trainable model size is now larger, additionally incorporating the computation and backpropagation of the encoder's gradients. For ResNet, we only fine-tune convolutional blocks 2 through 4 while keeping the initial block intact, because the first convolutional block would have usually learned low level features that are fundamental to image processing, such as detecting lines, edges, curves, etc. Consequently we don't change foundations. This resulted in improving the BLEU-4 score to a new high of 23.05 after 4 epochs.
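Because BLEU-4 drives both checkpoint selection and early stopping, a compact sketch of the metric may help. It follows the standard corpus-level definition (clipped n-gram precision with a brevity penalty, Papineni et al. 2002) without smoothing; it is not the scoring script used in the study, and the toy sentences are English placeholders.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references_per_hyp, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, without smoothing."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hypotheses, references_per_hyp):
        hyp_len += len(hyp)
        # Closest reference length; ties go to the shorter reference.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)  # per n-gram maximum over references
            matches[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)

hypothesis = "the cat sat on the mat".split()
reference = "the cat sat on a mat".split()
score = corpus_bleu([hypothesis], [[reference]])
perfect = corpus_bleu([reference], [[reference]])
print(round(score, 2), perfect)
```

Replacing a single word lowers all four n-gram precisions at once, which illustrates why BLEU penalizes paraphrases that keep the meaning but change the wording.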

Experimental results and discussions
The image-to-natural-language connection jointly tunes the encoder on top of the trained decoder to bridge the contextual gap between the visual and linguistic components. This allows the loss feedback to flow to the image encoder, improving the compatibility of the visual component with the language model. Gains in all BLEU 1-4 scores are recorded in Table 2. Table 3 shows the results on Urdu alongside those of relevant papers and the state of the art for English.
We also tested a multilingual BERT model that covers Urdu and is available through Hugging Face. The model has 110M parameters and is about 0.7 GB in size. We configured the main model to integrate with the BERT encoder: the embedding layer was frozen and the LSTM cells were configured to a layer size of 768, matching the dimensionality of the word embeddings extracted from BERT. The BERT model uses the sentence context in its entirety to generate the embeddings and is very effective at encoding semantics. For Urdu, however, the best strategy was to learn the embeddings from scratch as part of the training process, rather than relying on pre-trained embeddings.
This study reports results using the BLEU score, both as a quantitative metric to evaluate the goodness of fit and as the quantity maximised during training. BLEU is based on the sequential conformance of n-grams, whereas natural language involves much more flexible constructs in which alternate words or their combinations may convey the same meaning. METEOR and CIDEr are also used in recent papers, but they lack the necessary resources for Urdu. In pursuit of better metrics for Urdu, we leveraged two additional candidates for sentence semantics. (i) The BERT-F1 score (Zhang et al. 2019) uses the BERT transformer model to extract word features from multiple layers and form pools of semantic representations from the words of the reference and hypothesis sentences. It then computes precision and recall to give an F1 score for the hypothesis. (ii) LASER, introduced by Facebook Research (Artetxe and Schwenk 2019), generates multilingual sentence embeddings for zero-shot cross-lingual transfer. For the languages used in its training, LASER maps a sentence into a joint space, producing language-independent vectors. To use these vectors as an evaluation measure, there are multiple options such as the L1 and L2 norms and cosine similarity. Since the first two are subject to certain biases across dimensions, we used the cosine similarity of each hypothesis against its 5 reference captions and computed macro and micro averages as measures covering the whole evaluation set.
We also used the LASER and BERT-F1 scores to govern model training via early stopping. They do not always correlate with the BLEU score, and the training process then stops at a different point, one that offers a lower BLEU score but maximizes LASER (see Table 4 and Fig. 3). Final results on the evaluation set are listed in Tables 5, 6, and 7 for reference, organized into good, average and bad predictions, respectively.
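The cosine-similarity scoring of a hypothesis against its reference captions can be sketched as below. The vectors are toy stand-ins for sentence embeddings, and the definitions of the two averages are assumptions: here the macro average is the mean of per-image means and the micro average is the mean over all hypothesis-reference pairs.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_scores(hyp_vectors, ref_vectors_per_hyp):
    """Return (macro, micro) averages of hypothesis-reference cosine similarity."""
    per_image, all_pairs = [], []
    for hyp, refs in zip(hyp_vectors, ref_vectors_per_hyp):
        sims = [cosine(hyp, r) for r in refs]
        per_image.append(sum(sims) / len(sims))
        all_pairs.extend(sims)
    macro = sum(per_image) / len(per_image)
    micro = sum(all_pairs) / len(all_pairs)
    return macro, micro

# Toy 2-dimensional "embeddings" with unequal reference counts,
# chosen so that the two averages differ.
hyps = [[1.0, 0.0], [1.0, 0.0]]
refs = [[[1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]]
macro, micro = semantic_scores(hyps, refs)
print(macro, micro)
```

Under these definitions the two averages coincide whenever every image has the same number of references, as is the case with five captions per image.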

Conclusions and future work
This is the first study on generative image captioning in Urdu. We present a new dataset for Urdu image captioning, together with annotation treatment and generalization guidelines that make visio-lingual deep learning models effective and applicable to modest-sized datasets. We highlight the limitations of standard evaluation metrics for Urdu and show that semantics-driven techniques such as BERT-F1 and LASER may be appropriate for evaluating this task in Urdu. Using a transformer as the decoder to strengthen the language model in captioning is left as future work.

Appendix A: Image caption dataset creation for Urdu
To prepare the image-mapped Urdu dataset, we use the Flickr8K dataset for cross-reference; it is a standard dataset widely used by the research community for image caption generation in English. The Flickr8K dataset comprises 8000 images, each presented with 5 English captions on average. Our dataset was created in three phases, corresponding to three high-level approaches that were tried in turn, each adopted after the previous one proved non-viable. These approaches are explained in the following subsections.

A.1 Automatic translation
To translate the English captions to Urdu, we subscribed to the Google Cloud hosted neural machine translation (NMT) model (v2). Once the translation was completed, a preliminary baseline model was trained as a trial. Although the evaluation scores were acceptable (i.e., BLEU=13), several generated captions were absurd and unrelated to the image. We also found that the translation API fell short in producing quality Urdu translations. These erroneous instances led us to consider human translation as the reliable option for preparing captions.

A.2 Human translation
The human translators were a few colleagues who are proficient in English and have Urdu as their native language. As translation progressed, a parallel task of analysing the Urdu annotations was initiated, and a plethora of issues were faced, such as: (i) Urdu words are not necessarily space separated and, unlike in English, a missing or extra space does not always produce an invalid or different word. This makes such typing anomalies hard to spot while causing high variability in the data. Inter- and intra-annotator disagreements were observed while translating the same English phrases at different instances (see Table 8). All of these observations established the source of high textual variability of captions potentially

A.3 Complement with human annotation
We combined phase 2 with manual re-annotation, validating each English caption for correctness and relevance to its corresponding image. If the caption passed verification, it was translated from English to Urdu; otherwise, the human annotators wrote 5 grammatical descriptions in Urdu themselves. The annotations were periodically analyzed using basic NLP techniques while keeping a check on the vocabulary size and the instances per word that would later be available to learn the Urdu language model. Keeping these issues in view, the following high-level aspects were identified to be fixed: (

A.4 Dataset creation and standardization principles
The overall exercise of preparing the Urdu dataset became exceedingly laborious and demanding. Considering the time constraints, a potential way out was to prepare a quality Urdu dataset with correctness as the focus, at the expense of data size. A set of principles was formulated to finalize the dataset: • Two options for reducing the annotation dataset size:

Preprocessing of Urdu punctuations:
We leveraged the Unicode character set, in which punctuation characters have a general category matching the pattern 'P*'. Urdu punctuation characters were effectively covered under this category.
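A minimal sketch of this filter, using Python's standard unicodedata module: every character whose Unicode general category starts with "P" is treated as punctuation. Replacing punctuation with a space and then collapsing whitespace is an assumption made for the example, as is the sample sentence.

```python
import unicodedata

def strip_punctuation(text):
    """Replace characters in Unicode categories P* with spaces, then tidy whitespace."""
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    return " ".join(cleaned.split())

# The Urdu full stop, comma and question mark all fall in category "Po".
urdu_marks = {mark: unicodedata.category(mark) for mark in "۔،؟"}
sample = strip_punctuation("کتا دوڑ رہا ہے۔")
english = strip_punctuation("Hello, world!")
print(urdu_marks, sample, english)
```

Since the check relies only on the character category, the same function covers Latin punctuation that may be mixed into the captions.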

Compound word normalization and split corrections:
The vocabulary was sorted by token size in descending and in ascending order, and the top 500 tokens were selected from each ordering. Each of these tokens was manually analysed to identify sub-word issues caused by a missing or extra space, and these were fixed by replacing every such instance in the corpus with the appropriate substitute.
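The candidate selection for this manual review can be sketched as follows. The function name is hypothetical, token size is assumed to mean length in characters, and the toy tokens are placeholders: unusually long tokens tend to hide a missing space, while unusually short ones tend to result from an extra space.

```python
from collections import Counter

def length_extremes(captions, top_k=500):
    """Return the top_k shortest and top_k longest vocabulary tokens for review."""
    vocab = Counter(tok for caption in captions for tok in caption)
    by_length = sorted(vocab, key=lambda tok: (len(tok), tok))
    shortest = by_length[:top_k]
    longest = by_length[::-1][:top_k]
    return shortest, longest

# Placeholder tokens; real input would be the tokenized Urdu captions.
toy_captions = [["ab", "a", "abcdef"], ["abc", "a"]]
shortest, longest = length_extremes(toy_captions, top_k=2)
print(shortest, longest)
```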

Typing mistakes identification and correction:
We used an Urdu-to-English word translation lookup on the corpus vocabulary to flag typing mistakes. This effectively helped with both named entities and typing issues. Typos were systematically searched for and replaced in the annotation corpus.

Standardization of colour expressions:
As listed earlier, there are multiple colour expressions in Urdu that correspond to the same colour in English, with the noticeable property that a subset demonstrates static usage per colour while others carry a plural or gender sense. All such instances were changed in favour of the static equivalents to reduce variety, e.g. Green ( ← )

Named entity normalization:
Named entities were identified using the earlier stated systematic analysis as well as a manual round of proofreading. The improper nouns were not touched to avoid loss of generality. However, the most common scenarios where normalization was applied pertained to dog breeds, e.g. ( Labrador, German Shepherd, Mastiff ) →

Relevance assurance, count and multi-word correction rounds
Corrective iterations were done for each of caption relevance, number standardization to Urdu words and normalization of multi-token representation of the same word.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.