Building a multimedia repository from an image composing perspective

A multimedia repository is helpful for educational activities since it offers illustrations that facilitate the learning process and text understanding. In this paper, we propose to build a multimedia repository from collected images using object extraction techniques. We then associate Arabic captions with all extracted objects. These extracted objects are used to compose new scenes that can efficiently illustrate the most important events in an Arabic story. We thereby extend the notion of image composing in our approach to the task of constructing new images based on a set of toolboxes and a set of extracted objects, which form our multimedia repository of common animal behaviors. Our preliminary results show that the scenes composed from single objects provided a fair understanding of the main events of the stories as well as a coherent visual layout of all single objects. The diversity and precision of the single-object images for the domain of animals have shown a great impact on composing new scenes, either manually or dynamically.


Introduction
Multimedia systems have a great impact on education, learning and training systems [1,2]. Different research works in education support the assertion that multimedia can broadly promote the performance of learners. A user can learn more efficiently from words associated with pictures, videos or audio than from words alone. To date, several multimedia systems have been proposed to provide illustrations for stories from an existing multimedia repository (MR). They query the MR in order to retrieve the most suitable images w.r.t. the search keywords. In the case of Arabic keywords, a machine translation step is required in order to be able to use existing available resources. Besides the difficulties related to translation quality, a semantically suitable match between image captions and the translated keywords is not always available. Many multimedia systems have been proposed to visually explain topics, news streams or stories by annotating texts with pictures [3,4], enriching textual content [5] or composing scenes [6]. There are some common limitations in the existing systems which have not been properly addressed: 1. Existing multimedia systems can retrieve pictures automatically from an image search engine and generate illustrations [7][8][9][10]. However, manual work is required to filter out inappropriate pictures, which reveals the excessive manual effort behind these systems, as indicated in [6]. We tackle this problem with an automatic filtering process that uses a deep learning captioning model to complete missing captions; for instance, the Google search engine (in its JSON response object) often lacks appropriate captions, or at least meaningful tags, for the returned images. If the caption of a returned image does not match any of the initial keywords, the image is filtered out.
2. Multimedia systems for illustrating Arabic text are very limited, which reflects the current technical difficulties in understanding Arabic text, since Arabic poses additional challenges. The first one to tackle is multimedia search using Arabic keywords, which we resolve by translating the keywords into English. Unlike Arabic, English has enough freely usable, high-quality multimedia resources associated with well-maintained captions, metadata and tags. So, we propose to use Machine Translation (MT) to translate Arabic text into English text to overcome the lack of image resources annotated in Arabic. 3. Many pictures are available on the web, but they lack the textual descriptions or captions needed for a robust image search, so we complete the missing information using a convolutional neural network (CNN) as a pre-trained model to automate the captioning process for all images, as well as for the extracted single-object images included in our approach. 4. Images generated by generative models such as Generative Adversarial Networks (GANs) [11] could be used; however, the generated images are not yet realistic, i.e., not mistakable for real images, as argued in [12]. Besides, these models fail to generate the correct number of instances, i.e., objects with clear boundaries and spatial relationships, when the input contains multiple objects, as mentioned in [13]. In some cases, the generated images have poor image quality in terms of color consistency, as argued in [14], which could be improved using a regularization technique. In our case, we instead use single-object images obtained through object extraction, with which we can resolve the problems regarding the number of object instances and scene layout issues such as the arrangement of objects in the image plane. 5. Most existing multimedia systems can illustrate text based on retrieved images.
However, none of them, to the best of our knowledge, considers extracting objects from retrieved images and using them to compose new pictures from scratch, which is the key requirement in our approach to promote story understanding for Arabic text.
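The caption-based filtering described in point 1 above can be sketched as follows (a minimal illustration; the function and field names are ours, not the system's actual implementation): an image is kept only if its retrieved or model-generated caption shares at least one token with the initial search keywords.

```python
def tokenize(text):
    """Lowercase a caption or keyword phrase and split it into word tokens."""
    return {tok.strip(".,!?").lower() for tok in text.split() if tok.strip(".,!?")}

def keep_image(caption, keywords):
    """Return True if the caption matches at least one initial search keyword.
    Images with no caption at all cannot be verified and are filtered out."""
    if not caption:
        return False
    return bool(tokenize(caption) & tokenize(" ".join(keywords)))

# Example: images whose captions mention none of the keywords are discarded.
images = [
    {"url": "a.jpg", "caption": "A tiger running through tall grass"},
    {"url": "b.jpg", "caption": "A city skyline at night"},
    {"url": "c.jpg", "caption": ""},  # missing caption
]
kept = [im for im in images if keep_image(im["caption"], ["tiger", "running"])]
```

Only the first image survives this filter; the second matches no keyword and the third has no caption to check.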
We believe that an MR is a good option for composing a realistic and coherent illustration for a story text, since obtaining a well-illustrative image for a new story is not obvious even when using a well-maintained image repository. In fact, retrieval assumes that relevant illustrations w.r.t. the story already exist in the MR and that the objective is merely to extract them. However, this hypothesis is not necessarily true for all stories. Therefore, for several stories, we could not find in the MR any relevant images describing them. The proposed approach focuses on the task of generating a coherent and realistic illustration for Arabic story text, and not only on the task of image retrieval. Therefore, we propose to map text to multimedia by generating new images based on single objects that are initially prepared in a local repository. The generation process depends on the keywords, their associated images, and their drawing properties. In this paper, we focus on building a Multimedia Repository (MR) that contains several single-object images with their associated captions.
The remainder of the paper is organized as follows. We first provide a literature review in Sect. 2 and then elaborate on our proposed method in Sect. 3. We present experimental results in Sect. 4 and evaluation in Sect. 5. Finally, we conclude this paper and discuss future directions in Sect. 6.

Literature review
The idea of creating a relationship between text and images dates back a long time. Much research in computer science, but also in other fields such as linguistics, has been carried out to create such a relationship in a formal way. We briefly review well-known annotated image collections, relevant approaches and multimedia systems that have been proposed to date using such multimedia collections. There is a wide range of image collections: ImageNet is an image dataset annotated with WordNet noun synsets [15]. ImageCLEF is an evaluation forum for the cross-language annotation and retrieval of images [16], which also provides diverse related tools and resources. Other noteworthy image collections are MSCOCO [17], Flickr8k [18] and its extended version Flickr30k [19], which represent benchmark standards for image tasks; a survey of more image datasets can be found in [20].
In recent years, many well-known general multimedia datasets and other domain-specific datasets have been built to support multimedia systems. In particular, image search engines such as Google image search have been employed as well, as in the work of Aramini et al. [7]. The authors proposed different methods to extract keywords from text, such as the extraction of information attached to the images retrieved from the web. Other proprietary image databases have also been built, such as in the story picturing engine proposed by Joshi et al. [21]. This system uses two large image databases, namely Terra Galleria [22] and The Art Museum Image Consortium [23], which contain many high-quality photos with annotations. It consists of a pipeline of three processes: story processing and image selection, estimation of similarity, and reinforcement-based ranking of the related images. Zhu et al. [24] proposed a Text-To-Picture system that generates pictures for unrestricted English text and ranks retrieved images using machine-learning techniques. This system was enhanced by Goldberg et al. [9] to use semantic role labeling with picturability rather than keyword extraction. Rada et al. [25] used PicNet as an illustrated dictionary for the automatic generation of pictorial representations of simple sentences; it uses WordNet as a lexical resource for the automatic translation of an input text into pictures. Agrawal et al. [5] proposed techniques for retrieving images from the web to enrich textbooks with images while following predefined restrictions. The authors used a corpus of high school textbooks used in India. The evaluation showed that the proposed techniques are able to obtain images that can help increase the understanding of the textbook material; however, deeper analysis to identify key concepts is needed since important categories of terms are ignored. An ontology-based approach is proposed by Ustalov et al. [26].
It has been designed to operate with an ontology to allow loose coupling of the system's components and to unify the representation and behavior of the interacting actors, while making the verification of the system's information resources possible.
Systems worth mentioning that created their own image datasets are [27,28]. The former, developed by Duy et al. [27], creates pictures to illustrate instructions for patients. It has a pipeline of five processing steps: pre-processing, medication annotation, post-processing, image construction, and image rendering. The latter, a medical record summary system, was developed by Ruan et al. [28]. The patient's medical data is visualized spatially and temporally based on the categorization of event categories and six physiological systems. An interesting approach to expressing emotions is proposed by Eva et al. [29]. Other approaches in the domain of news streaming have also been proposed [3,8]. The latter used BreakingNews, a novel dataset, while the former used Flickr images. Aletras et al. [30] proposed an approach to provide images that are useful for representing general topics. The approach retrieves a collection of images for topics using an online search engine. The selection of images is based on graph-based methods, notably on visual information and text. Huang et al. [31] proposed a system for transforming text into the visual form of fairy pictures. This system selects keywords from segmented stories, then retrieves relevant pictures from online repositories based on their tags, and finally composes the pictures to illustrate the main ideas of the input. Jiang et al. [32] proposed a novel assisted instant messaging program which shows illustrations associated with textual messages. The picture representation is constructed from a set of most representative retrieved images. Many other multimedia systems have also been proposed in works such as [33][34][35]. A recent multimedia system for Arabic stories is proposed in [36]. To build a multimedia repository, the system uses the Scribd [37] online book library to collect educational pictures, which are then stored locally in binary format and marked for text extraction.
The best picture is selected based on the maximum intersection between the conceptual graph of the best selected keywords and sentences from the input text and the conceptual graphs of the pictures. The system uses only cartoon images and disregards the essential educational benefits of other multimedia types. Comparatively, other research has worked on object classification and recognition in images; in particular, face recognition has been investigated recently in [38][39][40]. These works propose a new face registration algorithm and extend well-known face recognition approaches based on face classification, identification and verification techniques. We are aware of these related papers. In contrast to them, and due to the diversity and non-homogeneous object layout of the natural images in our MR, it is difficult to analyze our retrieved images using the low-level visual features used in these works. In fact, various approaches in the literature have tried to overcome these limitations using other techniques, such as the Bag-of-Visual-Words representation combined with other techniques [41], which showed better object recognition performance; however, as reported in [42], the lack of spatial information, the alignments and the overlapping of objects in natural images cannot be supported by these techniques. As a result, it is hard to use the techniques mentioned in [42] in our case, as these details are the key elements in defining single objects, their relationships and their spatial arrangements in a scene. However, it is worth mentioning here that approaches based on deep neural networks (DNNs) for object segmentation [43], object detection and classification are more practical for tackling the problem of extracting any object from images.
Recently, GANs have shown great potential in generating images and videos. Despite this success, GANs are hard to train; for example, the training process is usually unstable and sensitive to the choice of hyper-parameters, as analyzed in this blog [44]. Given these and other difficulties in modeling the details of natural images, the authors in [14] proposed Stacked Generative Adversarial Networks (StackGANs) and succeeded in generating images with photo-realistic details from text descriptions. However, when the input contains multiple objects, StackGANs fail to generate the correct number of instances with clear boundaries and spatial relationships. Moreover, for complex interactions between the main object and its surroundings, StackGANs fail to capture and express the interactions between the objects. The authors of the report [13] presented relevant examples where the pre-trained StackGAN models fail either to generate objects from the correct categories or to express relations between objects, which could be achieved by paying more attention to the relevant words in the given text. To address this issue, Xu et al. [45] proposed an Attentional Generative Adversarial Network (AttnGAN) that enables StackGANs to generate fine-grained high-quality images via word-level and sentence-level conditioning.
While diverse multimedia systems for English, among other resource-rich languages, have developed annotated image collections that are widely used for approaching the relationship between language and vision, and especially for object detection tasks, multimedia systems that can automatically provide a complete illustration for Arabic stories using single objects are missing. We aim at building an MR that contains several single objects with their associated captions in the domain of animals. This repository will establish a well-annotated image repository for composing new images.
Indeed, in this paper, we present an offline multimedia repository that takes advantage of, and in fact is built from, image resources on the Web, namely ImageNet and Google image search. The design of the MR is flexible and extensible, such that we can easily incorporate new kinds of multimedia, such as video, and other languages. The effectiveness of the MR is demonstrated, and a potential application for understanding Arabic stories by composing images is intended in this paper. This work is an endeavor towards quantifying progress on building an MR using web image retrieval.

The proposed system
In this section, we present the main components of the proposed system as depicted in Fig. 1. First, we detail the construction of the multimedia repository in which we collect a set of single objects. Second, we present the image composing steps.

Multimedia repository (Part 1)
Multimedia systems can help both instructors and students in pedagogical activities including learning, imagination, and communication. Based on existing online libraries such as Google Images, ImageNet, etc., we collect a set of images and index them in a temporary database. For each image, we apply an object extraction tool in order to obtain single objects with some of their drawing properties. Subsequently, we apply an image captioning process to the extracted single objects to automatically assign a caption to each of them.

Image retrieval
For the initial version of our MR, we select the domain of animals and experiment with several alternatives (we compare the use of the Google image search API and ImageNet). Note that in the current version we consider 80 keywords to retrieve images of animals and some of their related basic behaviors. Table 1 shows some sample keywords used in this work as well as their related actions and retrieved images. Note that we translated the input keywords from Arabic to English manually.
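As an illustration of the retrieval step, the sketch below builds an image-search request for one translated keyword, assuming the Google Custom Search JSON API is used; `API_KEY` and `CX` are placeholders, and the endpoint and parameters actually used in this work may differ.

```python
from urllib.parse import urlencode

# Placeholder credentials: a real key and custom search engine id are required.
API_KEY, CX = "YOUR_API_KEY", "YOUR_SEARCH_ENGINE_ID"

def image_search_url(keyword):
    """Build an image-search request URL for one translated keyword."""
    params = {
        "key": API_KEY,
        "cx": CX,
        "q": keyword,            # the manually translated English keyword
        "searchType": "image",   # restrict results to images
        "num": 10,               # number of results per request
    }
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)

url = image_search_url("tiger running")
```

The JSON response object can then be parsed for image links and any captions or tags, which feed the filtering step described in the introduction.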

Object extraction using Mask RCNN
Object extraction aims at identifying foreground regions based on methods like edge detection, segmentation, and background subtraction. With the advancement of research in various fields of science and technology over time, object extraction has found application in many different areas, e.g., medical disease detection [46,47], satellite object extraction, image retrieval, object recognition, etc.
We aim at extracting foreground images, which constitute the key elements in defining single objects. We found that Mask R-CNN [43], a deep neural network (DNN) for object segmentation, object detection and classification, outperforms other segmentation models. The model identifies the potential regions containing the foreground objects and extracts these objects from the images without the background. Hence, the images can be reused in composing new images.
Here are the steps we followed to process the segmentation of the retrieved images:
1. Collecting images from Google image search and ImageNet;
2. Executing the Mask R-CNN pre-trained model, which is available at [48], to localize objects in the image with bounding boxes and extract those localized objects;
3. During testing, extending the deep learning model to process unseen images in bulk, enabling the model to store results in the corresponding directory;
4. For each test image, extracting all segmented objects and storing them in a lossy compression format (JPG);
5. To effectively utilize the segmented objects, incorporating an alpha channel to retain the transparency information at the pixel level;
6. Using the transparency information to adjust the focus on the single-object part of the image by cropping along the alpha channel, for precise use in the later step of image generation.
Table 2 shows a small subset of the retrieved images for given keywords, including the single objects extracted from each image source. The last column results from the combination of all retained single objects after a manual evaluation. The first row of this table shows a good example of object extraction, since all single objects are complete and can be reused in composing new images. The second row, in contrast, demonstrates that most of the single objects are incomplete and therefore not used to compose new images; only one single object was added to the MR.
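Steps 4-6 above can be sketched as follows (a minimal illustration with NumPy; array shapes and function names are ours, not the actual implementation): the binary instance mask produced by a segmentation model such as Mask R-CNN becomes an alpha channel, and the object is cropped to the mask's bounding box.

```python
import numpy as np

def extract_single_object(image, mask):
    """image: (H, W, 3) uint8 RGB array; mask: (H, W) boolean array as
    produced by an instance-segmentation model. Returns an RGBA crop in
    which the background is fully transparent."""
    # Step 5: turn the binary mask into a pixel-level alpha channel.
    alpha = np.where(mask, 255, 0).astype(np.uint8)
    rgba = np.dstack([image, alpha])
    # Step 6: crop to the tight bounding box of the object pixels.
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    return rgba[y0:y1, x0:x1]

# Toy example: a 4x4 image with a 2x2 object in the centre.
img = np.zeros((4, 4, 3), dtype=np.uint8)
msk = np.zeros((4, 4), dtype=bool)
msk[1:3, 1:3] = True
obj = extract_single_object(img, msk)  # a (2, 2, 4) RGBA crop
```

The resulting RGBA crop can be pasted onto any background when composing a new scene, since the transparent pixels do not overwrite it.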
Despite the effectiveness of Mask R-CNN for extracting objects, it fails in some cases when many objects overlap. Thus, the number of retained single objects relative to the retrieved images is low, as shown in Table 4.
As a first contribution, we propose to build a local single objects repository. As a second contribution, we propose to use single objects to generate new images.

Captioning single objects
Once the prepared single-object image set is ready, it goes through image captioning: a CNN represents each image by embedding it into a fixed-length vector; then a Recurrent Neural Network (RNN), in particular an LSTM [49], decodes the fixed-length vector into the desired output sentence by iterating a recurrence relation [50]. We used a pre-trained model as a fine-tuned checkpoint, trained over 3 million iterations on the MSCOCO dataset [17]. All extracted single objects are stored locally as image files. All information related to them is stored in a relational database, together with the associated keywords and their mappings. Image captions generated using the CNN pre-trained model [51] are also stored. The obtained captions are available in English only and are then machine-translated into Arabic. Table 3 shows some generated captions for retrieved images as well as for their respective extracted objects. It is worth mentioning that in some cases the captioning accuracy for the single objects outperforms the captioning accuracy for the original retrieved images, as shown in the first row of Table 3. Thus, a minimal enhancement in the captioning step is observed, which in turn underlines the importance of such object extraction tools. However, for other images the accuracy did not change, as shown in the second row of Table 3. The third row of the same table shows that the accuracy can also decrease, since the caption generated after the object extraction step is not accurate. In summary, we cannot conclude from this preliminary work that extracting objects leads to better accuracy for automatic image captioning. However, inadequate image captions result from the weakness of the image captioning pre-trained model used, which showed moderate accuracy for English and lower accuracy for machine-translated Arabic, as analyzed in [52]. To resolve this, we plan to train the model using a well-annotated dataset in Arabic.
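The relational storage described above can be sketched as follows (the table and column names are illustrative assumptions, not the system's actual schema): each single object is stored with its source keyword and its English and machine-translated Arabic captions, so that objects can later be retrieved by keyword for composing.

```python
import sqlite3

# In-memory database for illustration; the real system uses a persistent store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE keywords (
    id INTEGER PRIMARY KEY,
    keyword_ar TEXT,          -- original Arabic keyword
    keyword_en TEXT           -- manual English translation
);
CREATE TABLE single_objects (
    id INTEGER PRIMARY KEY,
    keyword_id INTEGER REFERENCES keywords(id),
    image_path TEXT,          -- local path of the extracted object image
    caption_en TEXT,          -- caption generated by the pre-trained model
    caption_ar TEXT           -- machine-translated Arabic caption
);
""")
conn.execute("INSERT INTO keywords VALUES (1, 'نمر يركض', 'tiger running')")
conn.execute(
    "INSERT INTO single_objects VALUES (1, 1, 'objects/tiger_01.jpg', "
    "'a tiger running through a field', 'نمر يركض في حقل')")

# Retrieve all single objects stored for one translated keyword.
rows = conn.execute(
    "SELECT image_path FROM single_objects s JOIN keywords k "
    "ON s.keyword_id = k.id WHERE k.keyword_en = ?", ("tiger running",)
).fetchall()
```

The keyword-to-object mapping is the part the composing tool queries when the user enters a keyword.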

Image composing (part 2)
In our Image Story Generator tool, image composing is designed in two ways: manual image composing and automatic image composing. In the current version of the tool, we compose new images or pictures manually, thereby allowing flexible work with the tool. We briefly describe how a user composes new images using our tool. First, the user enters keywords in an input field at the top of the tool's main interface and hits enter. Note that we support only a single keyword in this current version. The single objects retrieved from the MR are displayed in a panel of the Graphical Toolbox on the left side of the main interface, as indicated in Fig. 2. The user or the teacher can drag and drop single objects onto the main interface, and position and resize them, thereby composing a new scene describing the input sentence. The teacher can successively search for other single-object images following the same steps, and can then show the final illustration to the students. The newly created picture can be stored locally. After the image has been stored, the system extracts the image sizes, image positions, etc., and saves this information as drawing properties. This information will later be used for generating new pictures/images dynamically.
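The drawing properties saved after composing a scene can be sketched as follows (a hypothetical serialization; the tool's actual representation may differ): each placed single object records its image path, position and size, so that the scene can later be re-generated dynamically.

```python
import json

# Hypothetical drawing properties for one composed scene: the keywords it
# illustrates and, per placed single object, its position and size.
scene = {
    "keywords": ["tiger", "running"],
    "objects": [
        {"image": "objects/tiger_01.png", "x": 40,  "y": 120, "w": 200, "h": 130},
        {"image": "objects/deer_03.png",  "x": 300, "y": 110, "w": 180, "h": 120},
    ],
}

# Round-trip through JSON, as would happen when storing and reloading a scene.
serialized = json.dumps(scene)
restored = json.loads(serialized)
```

Re-generating the scene then reduces to drawing each object at its recorded position and size, which is what makes dynamic composing from stored scenes possible.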
Note that the option to change the pictures at any stage is also provided to guarantee a flexible learning environment. Figure 2 shows a screenshot of our tool where two single objects are added in an attempt to describe the given input keywords.

Experimental settings
We conducted experiments on two image sources: Google image search and ImageNet. We gave both image sources the same keywords as input. Note that ImageNet only allows single words, and each single keyword is associated with many images. Table 4 summarizes the statistics of the two datasets and their combination into a unified MR.
In the experimental setup we start with the following settings.
• Input text dataset: we prepared 80 keywords, i.e., key-phrases extracted from short simple stories in the domain of animals; a subset is shown in Table 5.
• Database: a database has been set up to store the retrieved images, the keywords and their mappings. Image captions have been generated using a CNN pre-trained model [51]. The obtained captions are available in English only and are machine-translated into Arabic. These data are also stored in the database.
• Image search: we used Google image search [53] and ImageNet [15] to search for images related to the input text dataset. It is worth mentioning that we use these images in our prototype for illustrative purposes only. In the future, we plan to set up a dataset of appropriate educational images for better learning and understanding.
• Object extraction model: we use Mask R-CNN, a pre-trained model for TensorFlow published on GitHub [48].
• Deep learning model: we use im2txt, a pre-trained model for TensorFlow published on GitHub [51]. It is a model developed by Google that takes an image as input and creates a caption for it.
After the setup step, we performed the following main tasks to build the MR:
1. Extracting keywords and key-phrases using Arabic text segmentation, tokenization and part-of-speech tagging;
2. Translating the extracted keywords manually;
3. Retrieving images and creating a database to store the mapping of keywords to images, to serve as a preliminary image pool;
4. Extracting objects from all retrieved images using the Mask R-CNN pre-trained model;
5. Retaining correct single objects manually;
6. Captioning single objects using a Convolutional Neural Network (CNN) pre-trained model;
7. Saving single objects locally and all associated information in our relational database;
8. Integrating and loading single objects into the tool for manual image composing.

Evaluation
The objective of this work is to build an MR for Arabic text. We aim at enhancing our currently proposed system version, which is still under development and improvement. We currently process sentences with a simple structure, whose keywords are extracted, translated and then searched in two different image sources. We show the statistics of our MR and then conduct a comparative evaluation as well as a user evaluation. As Table 4 shows, the MR contains 770 single objects, which are saved physically in folders indexed by their keywords as well as in a relational database for further processing steps. Many of the extracted objects were not reusable due to sliced object parts resulting from the extraction task and were thus eliminated. Subsequently, in the evaluation section, we make a comparison between the outputs from the Google image search API and ImageNet. Table 5 lists all keywords and their counts in the MR. As can be seen, for each keyword we retained only those single-object images that are most representative and have a coherent visual layout.

Comparative evaluation
We show some of the image results in Table 6, where the images retrieved from the Google image search API are compared to the composed images created using our tool and MR for the corresponding sample sentences. We compare the output images from the Google image search API and from our newly built MR, both with respect to human evaluation. We evaluate our approach with regard to (1) using the combination of different web image sources and (2) using single-object images for composing new scenes or pictures to promote the understanding of Arabic text through images. First, it is worth mentioning that the image results from the Google image search API are more significant in number and relevance, while the image results from ImageNet are limited, but better clustered. Nevertheless, the combination of both image sources enriched our repository in many respects. Second, regarding image composing, we argue that using single objects to compose a new scene representing the textual input has shown an enhancement compared to the images retrieved from the Google image search API, since in some cases we could produce with the MR better illustrations than those retrieved by Google, as shown in Table 6. Moreover, we are also able to customize the actors of a story, to add actors, and to position and resize them, etc., in order to compose a coherent story through pictures. For instance, row#1 to row#9 show that adding more actors promotes the understanding of the input sentence, compared with using the image retrieved from Google, where the number of actors is not represented at all. Row#5 shows that different positions and orientations of actors can be illustrated all in one scene. All in all, we can represent spatial information by placing object images in different positions and at different sizes, which promotes understanding of the story. Row#8 shows an example of object overlapping where different body parts of cats overlap, but the boundaries remain clear. Besides, we can draw different actions of actors and their relationships to each other in a single scene. Thus, this tool saved us time and image resources.

Table 5: Statistics of our single objects and keywords (keyword id; keywords; number of single-object images):
1 Tiger, running 12; 2 Tiger, jumping 10; 3 Tiger, walking 11; 4 Tiger, eating 6; 5 Tiger, drinking 12; 6 Deer, running 19; 7 Deer, walking 13; 8 Deer, sleeping 5; 9 Deer, drinking 6; 10 Deer, eating 8; 11 Zebra, drinking 12; 12 Zebra, standing 12; 13 Zebra, running 10; 14 Zebra, walking 8; 15 Zebra, sleeping 10; 16 Bear, walking 1; 17 Bear, eating 12; 18 Bear, standing 14; 19 Bear, drinking 3; 20 Bear, swimming 16; 22 Camel, walking 15; 23 Camel, sitting 15; 24 Camel, drinking 8; 25 Camel, running 12; 26 Camel, crouching 4; 27 Rabbit, eating 14; 28 Rabbit, running 15; 29 Rabbit, sleeping 9; 30 Rabbit, standing 11; 31 Rabbit, graving 5; 32 Turtle, eating 4; 33 Turtle, walking 5; 34 Turtle, drinking 4; 35 Turtle, graving 4; 36 Turtle, diving 5; 37 Monkey, jumping 20; 38 Monkey, eating 16; 39 Monkey, sleeping 16; 40 Monkey, drinking 6; 41 Monkey, running 10; 42 Bull, drinking 12; 43 Bull, running 14; 44 Bull, eating 9; 45 Bull, walking 14; 46 Bull, struggling 7; 47 Crocodile, hunting 8; 48 Crocodile, swimming 8; 49 Crocodile, catching 6; Butterfly, rambling 9; 77 Butterfly, flying 12; 78 Butterfly, eating 8; 79 Butterfly, drinking 5; 80 Butterfly, sleeping 6.
In summary, and with regard to the limitations of existing systems mentioned in Sect. 1, our approach saved us (1) time and manual effort in filtering retrieved images and in completing missing captions. Besides, our approach included (2) enough high-quality multimedia resources associated with well-maintained captions in English and Arabic. (3) A robust image search is set up, so that an automated captioning process is performed for all images and for all extracted single-object images. (4) The use of single-object images obtained through object extraction resolved many issues, such as the problems regarding the number of instances and scene layout issues such as the arrangement of objects in the image plane. Many remarks and findings have been drawn; most of them relate to extracting single objects from retrieved images and using them in our Image Story Generator tool for composing new scenes, which is the key requirement in our approach to promote story understanding for Arabic text. It is worth mentioning that our Image Story Generator tool using our MR outperforms Google image retrieval in representing spatial information, plural words and different categories of objects without any excessive preprocessing, in contrast to the case of StackGANs, where 80% of the bird images, which have a small object-image ratio, had to be processed. In our case, we believe that the object-image ratio is a necessary detail of natural images that should be kept in order to support the understanding of realistic scenes and thereby of realistic stories.
To support the details of natural images, we aim at enriching our MR to cover a wide range of single objects and diverse actions. However, we still face some limitations and problems in composing new images, such as the quality of the extracted single objects, choosing the background, etc. Nevertheless, our work strongly argues that illustrating text through image retrieval alone will not produce a meaningful illustration that helps understanding with pictures unless we propose a better solution, namely through manually composing new images and, in future work, through automatically composing and generating images.
We list below major limitations in our current version that should be improved:

User evaluation
We evaluate our tool based on user satisfaction with the manually composed images and on whether they accurately represent the main events of the given stories/keywords. The relevance and suitability of each created image are assessed by two users. They were provided with 20 stories together with the created images/scenes. They were asked to judge whether the created images represented the main actors and events, and to provide a rating on a scale of 0 (completely unsuitable) to 5 (very suitable). The results are shown in Table 7; the overall user satisfaction is 4.5 for our tool compared to 3.8 for image retrieval. This clearly demonstrates that users were more satisfied with the composed images than with the retrieved images.
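The overall satisfaction scores are simple means of the per-story ratings on the 0-5 scale; a minimal sketch with hypothetical ratings (not the actual study data):

```python
def mean_rating(ratings):
    """Average of per-story ratings on the 0 (unsuitable) to 5 (very suitable) scale."""
    return sum(ratings) / len(ratings)

# Hypothetical ratings, chosen only to illustrate the computation.
tool_ratings = [5, 4, 5, 4]        # composed images
retrieval_ratings = [4, 4, 3, 4]   # retrieved images
tool_mean = mean_rating(tool_ratings)            # 4.5 for this toy data
retrieval_mean = mean_rating(retrieval_ratings)  # 3.75 for this toy data
```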

Conclusion
A single-objects multimedia repository is presented in this paper. The proposed MR addresses the limitations of existing multimedia systems by providing single-object illustrations, considering common concepts including actions and relations in the domain of animals. The current version covers a narrow range of single objects depicting common animal behaviors such as feeding and motion. There are still many single objects missing; however, we are continually collecting more multimedia content to enrich our multimedia repository. Through our experiments and evaluations, we found that illustration through image retrieval alone still suffers from many problems, and that our tool, in which we identify the key components for abstracting the information of multiple object instances to generate plausible scene compositions, can be used to overcome these problems. Future work in this direction may employ more image resources in different domains and introduce diverse multimedia content, such as 3D virtual reality scenes, videos, and audio. In particular, we will investigate in the next steps the potential of automatically generating new pictures, i.e., automating image composing. Motivated by the success of StackGANs, we would like to spend more time training a StackGAN model using an annotated dataset in Arabic, as we believe our approach has not reached its full capacity due to the time constraints of this project. We plan to explore more options, such as more flexible attention mechanisms and more expressive convolutional neural network models, to capture and express the interactions between objects and more details of the text input.