A storytelling framework based on multimedia knowledge graph using linked open data and deep neural networks

Automatic storytelling is a broad challenge in research contexts such as Natural Language Processing and Content-Based Image Analysis. Despite the considerable achievements of machine learning techniques in these fields, combining different approaches to close the gap between an automatically generated story and human writing remains hard. This work proposes a novel storytelling framework in the Cultural Heritage domain. The framework is built around a Multimedia Knowledge Graph (MKG), which is the crucial component of our work. We populate the MKG with a focused crawler that employs deep learning techniques to recognise multimedia objects in web resources. We also combine deep learning techniques and Linked Open Data (LOD) to retrieve information about images and depicted figures using Instance Segmentation. The system has a dynamic, user-friendly interface that guides the user through the storytelling process. Finally, we evaluate the system from both a qualitative and a quantitative point of view.


Introduction
According to the Oxford Learner's Dictionary, storytelling is "the activity of telling or writing stories" [7]. Since ancient times, storytelling has been both a form of entertainment and a method of passing on traditions. Recent works demonstrate that Artificial Intelligence (AI) can be used to implement an automatic storyteller [15]. In recent years, AI has proved an excellent solution for producing stories starting from images (Visual Storytelling) or using language models and keywords [15]. Furthermore, in AI storytelling a story requires a structure, characters and a narrative point of view. In an automatic system, those aspects are derived from object recognition or from keywords detected in the text. "Every painting tells a story": this quote from Alexander Lawrence Posey [21] is an excellent way to describe the purpose of this paper. Each story has a subject (the depicted figures), an outline (the depicted scene) and a narrative point of view (the author's point of view). Most of these are described in art books or Wiki sites. Most recent works create a story starting from an image (caption generation) or using a language model and keywords [1,23,44,45]. The basic idea of the proposed storyteller system is that a story is dynamically created using Linked Open Data (LOD) [3] and shown interactively to the user. We propose a novel architecture that uses deep learning approaches, Linked Open Data and an ontology-driven focused crawler to implement the different aspects of our framework. A basic component of our framework is the Knowledge Graph (KG) [10], which, in our vision, is a logical representation of a conceptual formalization (i.e. an ontology). Our KG contains concepts and their visual representations. To populate it, we used a focused crawler that extracts new concepts, starting from a set of unpopulated concepts, together with their visual representations from LOD sources such as Wikidata.
An image of a painting provided by a user is analyzed using a deep neural network and a knowledge graph to retrieve information through Linked Open Data. To create an interactive story, the user can choose different aspects of the painting and obtain a brief description or a list of names of recognized objects. The user can then select the description or any name to discover all aspects of the analyzed painting. Our system performs a semantic analysis using a Word Sense Disambiguation algorithm to retrieve the correct information. In summary, in this paper we propose an approach to generate a story using a Knowledge Graph and Linked Open Data, and we use deep learning techniques to extract features from images. The paper is organized as follows: in Section 2, we describe several works presented in the literature and related to our context of interest; in Section 3, we introduce the whole architecture of the implemented system together with our novel storytelling framework; in Section 4, we outline a use case of our framework on Cultural Heritage storytelling; in Section 5, we discuss the evaluation results and, eventually, in Section 6, we report conclusions and future work.

Related works
In this section, we present and discuss different approaches and systems, mainly based on Deep Learning and Natural Language Processing, for automatically generating stories. Story generation is an important research field in the broad area of Artificial Intelligence. From a general point of view, a crucial component of a storytelling framework is a language model (LM). An LM is a probability distribution over sequences of words and relies on statistical and probabilistic techniques to achieve its purpose. One of the most famous language models is GPT-2 (Generative Pre-trained Transformer 2), developed by OpenAI [29]. GPT-2 is an unsupervised transformer language model trained on a massive dataset (called WebText) retrieved from Reddit. GPT-2 has been a powerful tool for different applications, e.g. question answering, story creation and translation from one language to another. Nowadays, GPT-3, the evolution of GPT-2, is the most prominent language model. This model operates at the byte level and can be applied to any language. The model has a capacity of 175 billion parameters and achieves better results than its predecessor [11]. The authors in [43] proposed a storyteller that uses a knowledge graph and GPT-2. This system creates sentences using user keywords and relationships provided by language models. In particular, they merged two knowledge graphs, based on ConceptNet [22] and WordNet [26], into a mixed knowledge graph, and used GPT-2 as a text generator for creating stories. In [42], a system is presented that generates automatic stories from an input composed of several images, using a Convolutional Neural Network (CNN) to extract textual features from the images. The authors in [24] described an example of a pipeline to produce a textual description of events and interpretations depicted in a sequence of images.
The pipeline is composed of: object identification, based on computer vision technology, to recognize objects in an image; a single-image inference step, driven by commonsense reasoning and a knowledge base, used to define the interactions in the depicted scene; and a multi-image narration module, composed of narrative planning and natural language sentence generation, that describes the scene. The authors evaluated the three layers separately, emphasising the need for new quantitative and qualitative metrics for automatic storytelling. Similarly, [41] describes the lack of automatic metrics for story evaluation. The authors developed a complex adversarial reward learning algorithm for story generation, using an RNN-CNN architecture for the policy model. Features are extracted from images and then encoded as context vectors using a CNN architecture for the reward model and applied to story parts (substories). The learning objective is to push the Reward Boltzmann distribution towards the "real" data distribution (maximising the similarity) and away from the "fake" data generated by the policy model. The evaluation of such a system is hard due to the complexity of the test set. The authors in [16] introduced the first dataset for sequential vision-to-language. They captioned many images with three different kinds of descriptions: (1) Descriptions of Images-in-Isolation (DII); (2) Descriptions of Images-in-Sequence (DIS); and (3) Stories for Images-in-Sequence (SIS). This distinguishes the description of a single image from the description of a sequence. The dataset is composed of around 210,000 photos from Flickr with their descriptions. The system was trained using a sequence-to-sequence Recurrent Neural Network and used a simple beam search and some heuristics to produce more detailed descriptions.
The proposed approach has basic differences and novelties with respect to the analysed ones. Our storytelling process dynamically generates stories from user inputs and fetches data from different data sources based on the Linked Open Data (LOD) paradigm and a knowledge graph (KG). Artificial Intelligence is used to extract features from an image, and we do not use any language model to produce text; instead, short descriptions are retrieved from the queried data sources. In the proposed framework, we emphasise the interactive and user-friendly presentation of the generated story. Moreover, we present a complete evaluation of our system from both a quantitative and a qualitative point of view.

The proposed approach
In this section, we introduce our framework, examining all processes in detail. We logically split the framework into two submodules. The first implements the storytelling process; the second implements a crawler that populates a Knowledge Base, which the first then uses to compute word sense disambiguation in cases of polysemy. In Fig. 1, we summarise the architecture of our framework. It is mainly composed of five modules that respectively perform: (I) object detection, (II) feature extraction, (III) data retrieval, (IV) word sense disambiguation and (V) crawling. The object detection module performs object detection using a deep learning approach. The feature extraction module uses CNNs as feature extractors. The data retrieval module retrieves information about the extracted picture features by querying the KG and LOD. The word sense disambiguation module implements a WSD algorithm to improve the precision of our system. The crawling module consists of a focused crawler. In the following subsections, we describe in depth the crawler and Knowledge Base, the storytelling subprocess, and the WSD algorithm.
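The flow through the five modules can be sketched as a minimal pipeline. The function names and bodies below are illustrative stubs of our own, not the paper's implementation; each stub stands in for the corresponding module.

```python
# Hypothetical sketch of the five-module pipeline; every function body is an
# illustrative stub standing in for the real module.

def detect_objects(image):
    # (I) object detection: would run an instance-segmentation model
    return ["person"]

def extract_features(image):
    # (II) feature extraction: would run a CNN with its final layer removed
    return [0.1, 0.9, 0.3]

def retrieve_data(features, entities):
    # (III) data retrieval: would query the KG and Linked Open Data
    return {"title": "unknown", "entities": entities}

def disambiguate(word, context):
    # (IV) word sense disambiguation: would pick the best WordNet sense
    return (word, "sense#1")

def tell_story(image):
    entities = detect_objects(image)
    features = extract_features(image)
    record = retrieve_data(features, entities)
    record["senses"] = [disambiguate(e, entities) for e in entities]
    return record
```

Module (V), the crawler, runs offline to populate the Knowledge Base that `retrieve_data` would query.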

The knowledge graph
To design our Knowledge Graph, we used the model proposed in [33]. In brief, the KG is based on an ontology: in our graph, nodes are concepts and edges are semantic relationships. We implemented the KG by means of a multi-model NoSQL technology called ArangoDB [5]. Figure 2 shows the hierarchies used to represent the objects in our model. In particular, in our KG we represent concepts, semantic and lexical properties according to the lexical-semantic dictionary WordNet [9], and Multimedia Objects.
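A minimal in-memory sketch of this structure, with concept nodes, multimedia nodes and typed semantic edges, could look as follows. The node identifiers, relation names and feature vector are illustrative; the paper stores the real graph in ArangoDB.

```python
# Illustrative KG sketch: concept and multimedia nodes plus typed edges.
kg = {
    "nodes": {
        "painting": {"type": "concept", "lemma": "painting"},
        "artwork":  {"type": "concept", "lemma": "artwork"},
        "img:123":  {"type": "multimedia", "features": [0.2, 0.7, 0.1]},
    },
    "edges": [
        ("painting", "artwork", "hypernym"),  # WordNet-style semantic relation
        ("img:123", "painting", "depicts"),   # links a media object to a concept
    ],
}

def neighbours(graph, node, relation):
    """Return targets reachable from `node` via edges labelled `relation`."""
    return [dst for src, dst, rel in graph["edges"]
            if src == node and rel == relation]
```

For example, `neighbours(kg, "painting", "hypernym")` returns `["artwork"]`, mirroring a WordNet hypernym lookup over the graph.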
In this work, we used a crawler [4], proposing a new visual feature extraction approach to populate our KG. The role of the focused crawler is to recognize domain-related documents (i.e. webpages), browsing the web starting from a given seed and performing a topic detection task based on NLP techniques and Linked Open Data (DBpedia [2] and Wikidata [40]). The crawler retrieves entities (from DBpedia), images and topics from domain-related websites and then uses this information to analyze and extract features to populate the knowledge graph. In more detail, the textual topic detection task extracts the textual content from a given website and sends it to the DBpedia API to recognize entities. For each entity, the crawler finds a description using Wikidata. The text obtained by combining entities and descriptions is analyzed with a semantic-based metric [35] to select the best sense of each term. In the image classification task, feature extraction is applied to each image of a given webpage using convolutional neural networks (CNNs) [38], removing the last layer of the network and using max-pooling or avg-pooling for dimensionality reduction [34]; in Section 5 we report experimental results for choosing the best CNN architecture and configuration. The extracted features are used as a query on the Knowledge Graph, computing the cosine similarity [37] with each image stored in it. If the best similarity score is higher than a threshold set by experiments, there is a match with an image in the KG and the document is considered domain-related; otherwise, we combine this result with the textual result by averaging the domain-related scores. Figure 3 shows a sketch of our populated Knowledge Graph.
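The matching step above can be sketched as follows. The threshold value and the score-combination rule are illustrative assumptions (the paper sets the threshold experimentally and combines scores by averaging):

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def visual_match(query_feat, kg_feats, threshold=0.8):
    """Compare the query features with every image stored in the KG and
    report whether the best score clears the (illustrative) threshold."""
    best = max(cosine(query_feat, f) for f in kg_feats)
    return best >= threshold, best

def domain_score(visual_result, textual_score):
    """If the visual match is conclusive, trust it; otherwise average the
    best visual score with the textual domain-related score."""
    matched, best = visual_result
    return 1.0 if matched else (best + textual_score) / 2
```

A query vector identical to a stored one yields similarity 1.0 and an immediate match; an inconclusive visual score of 0.4 averaged with a textual score of 0.6 gives 0.5.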

The storytelling process
The storytelling process performs all the tasks needed to interactively build the story. The process is interactive because it requires user input to evolve, as shown in Section 4. The story is presented to the user through tag clouds [32].
This process is mainly divided into two parts. The first relies on three inputs:
• Descriptive Content: all properties, such as location, author and artistic period, are retrieved from Linked Open Data and organized in a tag cloud;
• Semantic Content: the system recognizes all entities contained in the image through the instance segmentation module and organizes them in a tag cloud;
• Image Title: the system retrieves a brief description of the image from Linked Open Data and creates a tag cloud from it.
We show the second part of the process in Fig. 4. Given a tag cloud generated in the first part of the process, the user chooses a word from which the story is generated. Iteratively, a new tag cloud is generated in each step, and the user can choose a new word to continue the storytelling.
To improve the precision of storytelling, we apply a Word Sense Disambiguation (WSD) algorithm to polysemous words. The WSD task is an essential step in the storytelling framework: it is the only way to prevent the user from being led outside the story by wrong information caused by a polysemous word. The WSD task finds all senses of each word and compares them with all senses of the other words of the considered sentence in the analyzed document. The proper sense of a given term is the sense that reaches the best score. In this work, we use the WSD algorithm proposed in [35], which adopts an innovative technique that differs from others in the literature [17,25,27,46]. This WSD algorithm is ontology-based; in particular, it uses WordNet 3.0 and computes sense disambiguation through a Dynamic Semantic Network (DSN), built by extracting all hypernyms of a pre-selected input term; for each hyponym extracted from this term, all other semantic relationships of WordNet are explored and, eventually, a similarity score is computed. We used the metric presented in [31], which computes the shortest weighted path between two synsets in WordNet, where each relation has a specific weight. The intuition is that not all relationships have the same strength (a synonymy relationship is different from a "part of" relationship). The metric can be written as

sim(ν) = Σ_{(w_i, w_j) ∈ ν} e^{−α·l(w_i, w_j)} · (e^{β·d(w_i, w_j)} − e^{−β·d(w_i, w_j)}) / (e^{β·d(w_i, w_j)} + e^{−β·d(w_i, w_j)})

where ν is the document, (w_i, w_j) are the pairs of words in the lexical chain, α ≥ 0 and β > 0 are two scaling parameters, l(w_i, w_j) is the length of the shortest weighted path between the two words, and d(w_i, w_j) is computed as the number of hops from the subsumer of w_i and w_j to the root of the hierarchy. The data used for the storytelling process are retrieved from Wikidata, a free and open knowledge base that can be read and edited by both humans and machines. Information such as author, dimensions, location, artistic movement or other definitions is retrieved from this knowledge source using its API in a Linked Open Data framework.
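For a single word pair, the metric of [31] combines a path-length term with a subsumer-depth term. A sketch of this Li-style similarity follows; the α and β values are illustrative defaults, not the paper's tuned parameters.

```python
import math

def li_similarity(path_len, subsumer_depth, alpha=0.2, beta=0.45):
    """Word-pair similarity combining the (weighted) shortest-path length
    between two synsets with the depth d of their common subsumer.
    alpha and beta are illustrative scaling parameters."""
    length_factor = math.exp(-alpha * path_len)       # decays with distance
    h = beta * subsumer_depth
    depth_factor = (math.exp(h) - math.exp(-h)) / (math.exp(h) + math.exp(-h))
    return length_factor * depth_factor               # depth_factor = tanh(h)
```

Identical words (path length 0) under a deep subsumer score close to 1, and the score decreases monotonically as the path between the two synsets grows.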
The data are organized into labels (the title of the page), descriptions (multi-language) and statements composed of properties. For example, for the Last Supper painting by Leonardo Da Vinci, some properties are creator: Leonardo Da Vinci, commissioned by: Ludovico Sforza, time period: High Renaissance, and so on. A crucial module of our framework is the instance segmentation task, which extracts the entities in the analyzed image. Its purpose is to recognize the entity type (e.g. person, animal) and then give a name and/or a description to that entity using Linked Open Data. To perform the instance segmentation task, we chose Mask R-CNN (Mask Region-Based Convolutional Neural Network [12]), which uses ResNet-101 as a backbone to extract features from images [13]. In this work, we used a Mask R-CNN pre-trained on the COCO dataset (a large-scale object detection, segmentation, and captioning dataset) [19]. We performed the image classification task to recognize the input image and retrieve information from the knowledge graph. In our work, to recognize the image, a feature extraction task is applied using a Convolutional Neural Network, removing the last layer and applying max-pooling or avg-pooling. The input image features are then compared with the features of the images of the knowledge graph stored in ArangoDB. We computed the similarity score using cosine similarity [37]. If the score is higher than a threshold (empirically determined), there is a match, and the system retrieves the entities and the image's name from the database.
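A lookup of labels, descriptions and statements can be issued through Wikidata's public MediaWiki API (`action=wbgetentities`). The sketch below only builds the request parameters; the entity ID is a placeholder, and which `props` the framework actually requests is our assumption.

```python
# Illustrative sketch of a Wikidata API lookup; only the parameters are
# built here (sending them would be a plain HTTP GET to the endpoint).
WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def entity_request(qid, languages=("en",)):
    """Build wbgetentities parameters for entity `qid`. The props list
    (labels, descriptions, claims) is an assumed selection."""
    return {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|descriptions|claims",
        "languages": "|".join(languages),
        "format": "json",
    }

params = entity_request("Q42")  # placeholder entity ID for illustration
```

The returned JSON would contain the label (page title), the multi-language descriptions and the statements, from which individual property values can be read.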

Use case
This section introduces a use case in the cultural heritage domain. Following the process described in the previous section, the user queries the system in the first step. In this example, the user chooses the painting in Fig. 5 as input. The system applies an instance segmentation task to recognize the objects drawn in the painting; in our case, it recognizes persons. At the same time, it provides a classification of the query image; in our example, the classification of the Last Supper is shown. Afterwards, the user can choose between Semantic Content, Descriptive Content, or more detail on the recognised input (i.e. the Last Supper). Selecting Descriptive Content, the system generates the tag cloud menu using information extracted from Linked Open Data. Selecting Semantic Content instead, the user can choose among location, material, author, dimensions or artistic movement; once the information is retrieved through the knowledge graph and LOD, the system displays it. Finally, selecting the name of the recognized object, the system shows a description extracted from LOD. Furthermore, the system generates a tag cloud, used in our application as a graphical interface to interact with the user. In other words, after recognizing the painting, the system queries Wikidata to retrieve other helpful information; in this case, the principal information concerns the names of the persons depicted in the painting, as shown in Fig. 6. In general, all these properties are retrieved from Wikidata by means of an identifier that maps the chosen word to a Wikidata property ID and a query that retrieves the value of that property. The tag colors depend on the class of the recognized objects (i.e. title, person, object, ...) and the font size is chosen randomly, considering the frequency of the object class, so as to best fit the tag cloud box (i.e. a circle).
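The tag-styling rule (colour keyed to the object class, random font size scaled by class frequency) can be sketched as follows. The colour map, size range and scaling factor are all illustrative choices, not the application's actual values.

```python
import random

# Illustrative colour map: one colour per recognized-object class.
CLASS_COLOURS = {"title": "darkred", "person": "steelblue", "object": "gray"}

def tag_style(obj_class, class_freq, rng=random):
    """Pick a tag style: colour from the class, font size drawn at random
    within a range that grows with the class frequency (assumed rule)."""
    base = 10 + 4 * class_freq          # more frequent classes get a larger range
    size = rng.randint(base, base + 6)  # random size to fit the circular box
    return {"color": CLASS_COLOURS.get(obj_class, "black"), "font_size": size}
```

For instance, a "person" tag whose class appears three times gets the person colour and a font size drawn from a higher range than a class seen only once.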

Results and evaluation
In this section, we evaluate our system and discuss the results. First, we evaluated several CNN architectures as feature extractors, and then we evaluated the storytelling framework. We evaluated the following CNNs, pre-trained on ImageNet, to choose the best one: VGG16 [38], ResNet50 [14], MobileNet V2 [36] and Inception V3 [39]. For each architecture, we tried both max and average global pooling as a dimensionality reduction strategy. To choose the best CNN architecture, we used the Precision-Recall curve and the mean Average Precision at k (mAP@k) [30]. Precision represents how many documents in the retrieved set are relevant, while recall represents how many of the relevant documents have been retrieved; they are respectively defined as

Precision = |relevant ∩ retrieved| / |retrieved|    (2)
Recall = |relevant ∩ retrieved| / |relevant|    (3)

The Precision-Recall curve is an interpolation of precision on eleven standard points of recall (from 0 to 1):

p_interp(r) = max_{r' ≥ r} p(r')    (4)

The Precision-Recall curve allows us to analyze the precision of the system at each level of recall. The mean Average Precision (mAP) is the mean, over the query set Q, of the Average Precision of each query:

AP(q) = Σ_k P(k) · Δr(k)    (5a)
mAP = (1/|Q|) Σ_{q ∈ Q} AP(q)    (5b)
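These measures can be computed as in the following sketch; the AP@k normalisation by min(|relevant|, k) is one common convention and is an assumption on our part.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a relevant set."""
    tp = len(set(retrieved) & set(relevant))
    return tp / len(retrieved), tp / len(relevant)

def average_precision_at_k(retrieved, relevant, k):
    """AP@k: mean of the precision values at the ranks of relevant hits,
    normalised by min(|relevant|, k) (assumed convention)."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(queries, k):
    """mAP@k over a list of (retrieved, relevant) pairs."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in queries) / len(queries)
```

With k = 1, as used below, AP@1 is simply 1 when the first retrieved image is relevant and 0 otherwise, so mAP@1 measures first-hit accuracy.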
The dataset used to evaluate the CNN architectures as feature extractors is PASCAL VOC2012 [8]. It is divided into 20 object classes, as shown in Table 1. Figure 7 shows the Precision-Recall curve: the best CNN is ResNet50 with average pooling, whose curve is almost always above the others. However, in our work we are interested in the accuracy of the first retrieved image, so we analyzed mAP@1, shown in Fig. 8; in this case, the best is VGG16 with average pooling.
Based on the tests reported above, we evaluated the storytelling framework using VGG16 as the CNN for feature extraction. We used an evaluation strategy combining several user-oriented metrics: Perceived Usefulness (PU), Perceived Ease of Use (PEU), Perceived Enjoyment (PE) and Trust in the Information System (TIS). To evaluate the storytelling process with these user-oriented metrics, we used a survey. This survey aims to obtain user opinions from a heterogeneous audience, composed of 200 people of different ages and education levels. We summarised the results in Figs. 9 and 10, showing metrics and votes, and in Fig. 11 in terms of average and standard deviation. The responses concerning Perceived Usefulness show that more than 70% of the people who filled out the questionnaire gave excellent feedback (6 or 7) and around 20% gave good feedback (4 or 5); there are only two votes lower than 4. The Perceived Ease of Use results show one awful vote (1) and two low votes, but over 60% of people gave very positive feedback (6 or 7). The results for Perceived Enjoyment show about 70% of very positive feedback (6 or 7) and about 4% of low votes. Overall, the evaluation of the storytelling process using user-oriented metrics showed very good results. The storytelling process is easy to use and enjoyable. The audience states trust in the information system and that the application could be helpful. Perceived Usefulness shows the best result (average = 6.05). Although the average result is high, the survey shows some awful feedback (TIS and PEU) and a high standard deviation (PEU metric). Therefore, some aspects of the system should be improved to make the application easier to use and to increase trust in the information system. We conducted the survey on a cohort equally distributed with respect to age and education level. The analysis of the results highlighted that almost everyone agrees that our application is practical and enjoyable. Nevertheless, analysing the results by educational grade showed that those with a higher grade found the application more straightforward to use than those with a lower one. However, the participants with a higher educational grade trusted the retrieved information less.

Conclusions and future work
In this paper, we proposed a storytelling application based on semantic technologies, big data, and artificial intelligence. This work aimed to propose and implement an interactive framework using Linked Open Data to retrieve information and create stories. Compared to the existing literature, it is the first storytelling project that uses Deep Neural Networks, Linked Open Data, and an ontology-driven focused crawler in a storytelling process. The evaluation results show that the framework reaches the goals described above and can be helpful and enjoyable according to the user tasks. The most important differences between this and other storytelling applications are the intensive use of ontologies such as WordNet and of Wikidata and DBpedia as sources of structured open data. There are several future research directions to investigate. We want to implement our framework for mobile platforms, to improve usability and to exploit additional information such as geolocation. Moreover, we want to replace input images with images taken from the smartphone camera. Our approach uses WordNet as the base ontology; we will integrate other ontologies from different sources, using specific matching and merging techniques, to improve our knowledge graph. Furthermore, we will carry out a computational complexity analysis and a statistical analysis to judge the significance of the results.
Funding Open access funding provided by Università degli Studi di Napoli Federico II within the CRUI-CARE Agreement.

Data Availability
The datasets generated during and/or analysed during the current study are available in the WikiData repository, https://www.wikidata.org.

Conflict of Interests
The authors certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers' bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.