1 Introduction

According to the Oxford Learner's Dictionary, storytelling is “the activity of telling or writing stories” [7]. Since ancient times, storytelling has served both as entertainment and as a means of passing on traditions. Recent works demonstrate that Artificial Intelligence (AI) is likely to be used to implement automatic storytellers [15]. In recent years, AI has proven an excellent solution for producing stories from images (Visual Storytelling) or from language models and keywords [15]. In AI storytelling, a story requires a structure, characters and a narrative point of view; in an automatic system, these aspects are derived from object recognition or from keywords detected in the text.

“Every painting tells a story”: this quote from Alexander Lawrence Posey [21] is an excellent way to describe the purpose of this paper. Each story has a subject (the depicted figures), an outline (the depicted scene) and a narrative point of view (the author's point of view), most of which are described in art books or Wiki sites. Most recent works create a story starting from an image (caption generation) or using a language model and keywords [1, 23, 44, 45]. The basic idea of the proposed storyteller system is that a story is dynamically created using Linked Open Data (LOD) [3] and shown interactively to the user. We propose a novel architecture that combines deep learning approaches, Linked Open Data and an ontology-driven focused crawler to implement the different aspects of our framework. A basic component of our framework is the Knowledge Graph (KG) [10], which, in our vision, is a logical representation of a conceptual formalization (i.e. an ontology). Our KG contains concepts and their visual representations. To populate it, we used a focused crawler that, starting from a set of unpopulated concepts, extracts new concepts and their visual representations from LOD sources such as Wikidata. An image of a painting provided by a user is analyzed using a deep neural network and the knowledge graph to retrieve information through Linked Open Data. To create an interactive story, the user can choose different aspects of the painting and obtain a brief description or a list of names of recognized objects; by selecting the description or any of these names, the user can discover all aspects of the analyzed painting. Our system performs a semantic analysis using a Word Sense Disambiguation algorithm to retrieve the correct information.

In summary, in this paper we propose an approach to generate a story using a Knowledge Graph and Linked Open Data, and we use deep learning techniques to extract features from images. The paper is organized as follows: in Section 2, we describe several works presented in the literature and related to our context of interest; in Section 3, we introduce the whole architecture of the implemented system together with our novel storytelling framework; in Section 4, we outline a use case of our framework in Cultural Heritage storytelling; in Section 5 we discuss the evaluation results and, eventually, in Section 6, we report conclusions and future works.

2 Related works

In this section, we present and discuss different approaches and systems, mainly based on Deep Learning and Natural Language Processing, to automatically generate stories. Story generation is an important research field in the broad area of Artificial Intelligence. From a general point of view, a crucial component of a storytelling framework is the language model (LM). An LM is a probability distribution over sequences of words, built with statistical and probabilistic techniques. One of the most famous language models is GPT-2 (Generative Pre-trained Transformer 2), developed by OpenAI [29]. GPT-2 is an unsupervised transformer language model trained on a massive dataset (called WebText) retrieved from Reddit. GPT-2 has been a mighty tool for different applications, e.g. question answering, story creation and translation from one language to another. Nowadays, GPT-3, the evolution of GPT-2, is the most prominent language model. This model operates at the byte level and can therefore handle text in any language. It has a capacity of 175 billion machine learning parameters and achieves better results than its predecessor [11].

The authors in [43] proposed a storyteller that uses a knowledge graph and GPT-2. This system creates sentences from user keywords and relationships through language models. In particular, the authors merged two knowledge graphs, based on ConceptNet [22] and WordNet [26], into a mixed knowledge graph, and used GPT-2 as the text generator for creating stories. In [42], a system is presented that automatically generates stories from an input composed of several images, using a Convolutional Neural Network (CNN) to extract textual features from the images. The authors in [24] described a pipeline to produce a textual description of the events and interpretations depicted in a sequence of images. The pipeline is composed of: an object identification step, based on computer vision technology, to recognize the objects in an image; a single-image inference step, driven by commonsense reasoning and knowledge bases, used to define the interactions in the depicted scene; and a multi-image narration module, composed of narrative planning and natural language sentence generation, that describes the scene. The authors evaluated the three layers separately, emphasizing the need for new quantitative and qualitative metrics for automatic storytelling. Similarly, [41] describes the lack of automatic metrics for story evaluation. The authors developed an adversarial reward learning algorithm for story generation, using an RNN-CNN architecture for the policy model. Features are extracted from images and encoded as context vectors by a CNN architecture for the reward model, applied to story parts (substories). The learning objective is to push the reward Boltzmann distribution towards the “real” data distribution (maximizing the similarity) while minimizing the likelihood of the “fake” data generated by the policy model. The evaluation of such a system is hard due to the complexity of the test set. The authors in [16] introduced the first dataset for sequential vision-to-language. They captioned many images with three different kinds of descriptions, in order to distinguish the description of a single image from the description of a sequence: (1) Descriptions of images-in-isolation (DII); (2) Descriptions of images-in-sequence (DIS); and (3) Stories for images-in-sequence (SIS).
The dataset is composed of around 210,000 photos from Flickr with their descriptions. The system was trained using a sequence-to-sequence Recurrent Neural Network, with a simple beam search and some heuristics to produce more detailed descriptions.

The proposed approach presents basic differences and novelties with respect to the analyzed ones. Our storytelling process dynamically generates stories from user inputs and fetches data from different data sources based on the Linked Open Data (LOD) paradigm and a knowledge graph (KG). Furthermore, Artificial Intelligence is used to extract features from an image, and we do not use any language model to produce text: short descriptions are retrieved from the queried data sources. In the proposed framework, we emphasize the interaction and the user-friendly representation of the generated story. Moreover, we present a complete evaluation of our system from both a quantitative and a qualitative point of view.

3 The proposed approach

In this section, we introduce our framework, examining all its processes in detail. We logically split the framework into two submodules: the first implements the storytelling process; the second implements a crawler that populates a Knowledge Base, which the first uses to compute word sense disambiguation in cases of polysemy. In Fig. 1, we summarize the architecture of our framework. It is mainly composed of five modules that respectively perform: (I) object detection, (II) feature extraction, (III) data retrieval, (IV) word sense disambiguation and (V) crawling. The object detection module detects objects using a deep learning approach. The feature extraction module uses CNNs as feature extractors. The data retrieval module retrieves information about the extracted features by querying the KG and LOD. The word sense disambiguation module implements a WSD algorithm to improve the precision of our system. The crawling module consists of a focused crawler. In the following subsections, we describe in depth the crawler and Knowledge Base, the storytelling subprocess, and the WSD algorithm.
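To fix the data flow in mind, here is a minimal skeleton of the four online modules (our own illustration; the function names, the StoryState container and the stub outputs are assumptions, not the authors' code; the crawler runs offline to populate the KG and is omitted):

```python
from dataclasses import dataclass, field

@dataclass
class StoryState:
    image_path: str
    objects: list = field(default_factory=list)   # (I) object detection
    features: list = field(default_factory=list)  # (II) feature extraction
    facts: dict = field(default_factory=dict)     # (III) data retrieval
    senses: dict = field(default_factory=dict)    # (IV) word sense disambiguation

def detect_objects(s: StoryState) -> None:
    s.objects = ["person"]                 # stub: deep-learning detector

def extract_features(s: StoryState) -> None:
    s.features = [0.1, 0.9]                # stub: pooled CNN features

def retrieve_data(s: StoryState) -> None:
    s.facts = {"creator": "Leonardo"}      # stub: KG and LOD queries

def disambiguate(s: StoryState) -> None:
    s.senses = {"person": "person.n.01"}   # stub: WSD over polysemous terms

def run(image_path: str) -> StoryState:
    state = StoryState(image_path)
    for step in (detect_objects, extract_features, retrieve_data, disambiguate):
        step(state)                        # each module enriches the shared state
    return state
```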

Fig. 1 Framework architecture

3.1 The knowledge graph

To design our Knowledge Graph, we used the model proposed in [33]. In brief, the KG is based on an ontology: in our graph, nodes are concepts and edges are semantic relationships. We implemented the KG by means of a multi-model NoSQL technology called ArangoDB [5]. Figure 2 shows the hierarchies used to represent the objects in our model. In particular, in our KG we represented concepts, semantic and lexical properties according to the lexical-semantic dictionary WordNet [9], and multimedia objects.
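To make this layout concrete, the following minimal sketch (our own illustration; collection names, document attributes and credentials are assumptions, not the paper's actual schema) shows how such a concept graph could be created with the python-arango driver:

```python
from arango import ArangoClient

# Connect to a local ArangoDB instance (host and credentials are assumed).
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("storyteller", username="root", password="secret")

# A named graph with one vertex collection (concepts) and one edge
# collection (semantic relationships), mirroring the nodes/edges split
# described above.
if not db.has_graph("kg"):
    graph = db.create_graph("kg")
    concepts = graph.create_vertex_collection("concepts")
    relations = graph.create_edge_definition(
        edge_collection="semantic_relations",
        from_vertex_collections=["concepts"],
        to_vertex_collections=["concepts"],
    )
else:
    graph = db.graph("kg")
    concepts = graph.vertex_collection("concepts")
    relations = graph.edge_collection("semantic_relations")

# A concept node carries its WordNet synset and a visual feature vector.
concepts.insert({"_key": "dog", "synset": "dog.n.01", "features": [0.12, 0.87]})
concepts.insert({"_key": "canine", "synset": "canine.n.02"})

# A typed semantic edge between the two concepts.
relations.insert({
    "_from": "concepts/dog",
    "_to": "concepts/canine",
    "relation": "hypernym",
})
```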

Fig. 2 Concept, multimedia, semantic properties

In this work, we used the crawler proposed in [4], adding a new visual feature extraction approach to populate our KG. The role of the focused crawler is to recognize domain-related documents (i.e. webpages) through a topic detection task, browsing the web from a given seed using NLP techniques and Linked Open Data (DBpedia [2] and Wikidata [40]). The crawler retrieves entities (from DBpedia), images and topics from domain-related websites and then uses this information to extract the features that populate the knowledge graph. In more detail, the textual topic detection task extracts the textual content of a given website and sends it to the DBpedia API to recognize entities. For each entity, the crawler retrieves a description from Wikidata. The text obtained by combining entities and descriptions is analyzed with a semantic-based metric [35] to select the best sense of each term. In the image classification task, feature extraction is applied to each image of a given webpage using convolutional neural networks (CNNs) [38], removing the last layer of the network and using max-pooling or avg-pooling for dimensionality reduction [34]; in Section 5 we report the experimental results used to select the best CNN architecture and configuration. The extracted features are used to query the Knowledge Graph, computing the cosine similarity [37] with each image stored in it. If the best similarity score is higher than a threshold set by experiments, there is a match with an image in the KG and the document is considered domain-related; otherwise, we combine this result with the textual result by averaging the domain-relatedness scores. Figure 3 shows a sketch of our populated Knowledge Graph.
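As an illustration of this matching step, the following sketch (ours; the paper does not publish code, and the threshold value here is an arbitrary placeholder) extracts a pooled feature vector with a headless VGG16 and compares it against stored vectors by cosine similarity:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

# Headless VGG16: include_top=False drops the classifier; pooling="avg"
# applies global average pooling, yielding one 512-dim vector per image.
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")

def features(path: str) -> np.ndarray:
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

THRESHOLD = 0.85  # placeholder; the paper sets this experimentally

query = features("page_image.jpg")
stored = {"last_supper": features("kg_last_supper.jpg")}  # normally read from the KG
best_key, best_score = max(
    ((k, cosine(query, v)) for k, v in stored.items()), key=lambda t: t[1]
)
is_match = best_score > THRESHOLD
```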

Fig. 3 The populated knowledge graph

3.2 The storytelling process

The storytelling process performs all the tasks needed to interactively build the story. The process is interactive because it requires user input to evolve, as shown in Section 4. The story is presented to the user through tag clouds [32].

This process is mainly divided into two parts. The first one provides three inputs:

  • Descriptive Content: all properties, such as location, author and artistic period, are retrieved from Linked Open Data and organized in a tag cloud;

  • Semantic Content: the system recognizes all entities contained in the image using the instance segmentation module and organizes them in a tag cloud;

  • Image Title: the system retrieves a brief description of the images from Linked Open Data and creates a tag cloud from it.

We show the second part of the process in Fig. 4. Given a tag cloud generated in the first part, the user chooses a word from which the story is generated. A new tag cloud is then generated at each step, and the user can choose a new word to continue the storytelling.

Fig. 4 Storytelling sub-process

To improve the precision of storytelling, we apply a Word Sense Disambiguation (WSD) algorithm to polysemous words. The WSD task is an essential step of the storytelling framework: it is the only way to prevent the user from drifting outside the story by getting wrong information caused by a polysemous word. The WSD task finds all senses of each word and compares them with all senses of the other words of the considered sentence in the analyzed document; the proper sense of a given term is the one that reaches the best score. In this work, we use the WSD algorithm proposed in [35], based on a technique that differs from others in the literature [17, 25, 27, 46]. This WSD algorithm is ontology-based: in particular, it uses WordNet 3.0 and computes the sense disambiguation through a Dynamic Semantic Network (DSN), built by extracting all hypernyms of a pre-selected input term; for each hyponym extracted from this term, all other WordNet semantic relationships are explored and, eventually, a similarity score is computed. We used the metric presented in [31], which computes the shortest weighted path between two synsets in WordNet, where each relation has a specific weight. The intuition is that not all relationships have the same strength (a synonymy relationship is different from a “part of” relationship). The metric is reported in (1):

$$ SRG(\nu)=\sum\limits_{\left( w_{i}, w_{j}\right)} e^{-\alpha \cdot l\left( w_{i}, w_{j}\right)} \frac{e^{\beta \cdot d\left( w_{i}, w_{j}\right)}-e^{-\beta \cdot d\left( w_{i}, w_{j}\right)}}{e^{\beta \cdot d\left( w_{i}, w_{j}\right)}+e^{-\beta \cdot d\left( w_{i}, w_{j}\right)}} $$
(1)

where ν is the document, (wi, wj) ranges over the pairs of words in the lexical chain, α ≥ 0 and β > 0 are two scaling parameters, l(wi, wj) is the weighted path length between wi and wj, and d(wi, wj) is the number of hops from the subsumer of wi and wj to the root of the hierarchy.
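A worked sketch of this score follows (our own approximation: we use NLTK's WordNet, take the unweighted shortest path for l and the depth of the lowest common hypernym for d, and pick α and β arbitrarily, whereas the paper weights each relation type):

```python
import math
from itertools import combinations
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

ALPHA, BETA = 0.2, 0.45  # arbitrary scaling parameters (alpha >= 0, beta > 0)

def pair_score(s1, s2) -> float:
    l = s1.shortest_path_distance(s2)      # path length l(wi, wj); None if no path
    lcs = s1.lowest_common_hypernyms(s2)
    d = lcs[0].max_depth() if lcs else 0   # subsumer depth d(wi, wj)
    if l is None:
        return 0.0
    # exp(-alpha*l) * tanh(beta*d): the second factor of (1) is exactly tanh
    return math.exp(-ALPHA * l) * math.tanh(BETA * d)

def srg(words):
    # Sum the pair scores over the lexical chain; for simplicity each word
    # is mapped to its first (most common) noun sense.
    synsets = [wn.synsets(w, pos=wn.NOUN)[0] for w in words]
    return sum(pair_score(a, b) for a, b in combinations(synsets, 2))

print(srg(["painting", "artist", "museum"]))
```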

The data used for the storytelling process are retrieved from Wikidata, a free and open knowledge base that can be read and edited by both humans and machines. Information such as the author, dimensions, location, artistic movement and other definitions is retrieved from this knowledge source through its API, in a Linked Open Data framework. The data are organized in labels (the title of the page), descriptions (multi-language) and statements composed of properties. For example, for The Last Supper painting by Leonardo da Vinci, some properties are creator: Leonardo da Vinci, commissioned by: Ludovico Sforza, time period: High Renaissance, and so on.

A crucial module of our framework is the instance segmentation task, which extracts the entities in the analyzed image. Its purpose is to recognize the entity type (e.g. person, animal) and then give a name and/or a description to that entity using Linked Open Data. To perform the instance segmentation task, we chose Mask R-CNN (Mask Region-Based Convolutional Neural Network) [12], which uses ResNet101 as a backbone to extract features from images [13]. In this work, we used a Mask R-CNN pre-trained on the COCO dataset, a large-scale object detection, segmentation and captioning dataset [19].

We performed the image classification task to recognize the input image and retrieve information from the knowledge graph. To recognize the image, a feature extraction task is applied using a Convolutional Neural Network, removing the last layer and applying a max-pooling or avg-pooling technique. The input image features are then compared to the features of the images of the knowledge graph stored in ArangoDB, computing the similarity score with the cosine similarity [37]. If the score is higher than a threshold (empirically calculated), there is a match, and the system retrieves the entities and the image's name from the database.
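As an indication of how the Wikidata properties mentioned above can be fetched programmatically, the sketch below (our illustration, not the authors' code) resolves a title to a Wikidata entity via the public wbsearchentities endpoint and then reads one claim; P170 is Wikidata's “creator” property:

```python
import requests

API = "https://www.wikidata.org/w/api.php"

def entity_id(title: str) -> str:
    # Resolve a free-text title to a Wikidata entity ID (takes the top hit).
    r = requests.get(API, params={
        "action": "wbsearchentities", "search": title,
        "language": "en", "format": "json",
    })
    return r.json()["search"][0]["id"]

def claim_values(qid: str, prop: str):
    # Read the values of one property (claim) of an entity; for item-valued
    # properties each value is a dict containing the target's Q-id.
    r = requests.get(API, params={
        "action": "wbgetentities", "ids": qid,
        "props": "claims", "format": "json",
    })
    claims = r.json()["entities"][qid]["claims"].get(prop, [])
    return [c["mainsnak"]["datavalue"]["value"] for c in claims]

qid = entity_id("The Last Supper")
print(qid, claim_values(qid, "P170"))  # P170 = creator
```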

4 Use case

This section introduces a use case in the cultural heritage domain. Following the process described in the previous section, the user queries the system in the first step. In this example, the user chooses the painting in Fig. 5 as input. The system applies an instance segmentation task to recognize the objects drawn in the painting; in our case, it recognizes persons. At the same time, it classifies the query image: in our example, the painting is classified as the Last Supper. Afterwards, the user can choose between Semantic Content, Descriptive Content, or more detail on the recognized input (i.e. the Last Supper). Selecting Descriptive Content, the system generates the tag cloud menu using information extracted from Linked Open Data. Selecting Semantic Content instead, the user can choose among location, material, author, dimensions or artistic movement; once the information has been retrieved using the knowledge graph and LOD, the system displays it. Finally, by selecting the name of a recognized object, the system shows a description extracted from LOD. Furthermore, the system generates a tag cloud, used in our application as a graphical interface to interact with the user. In other words, after recognizing the painting, the system queries Wikidata to retrieve other helpful information; in this case, the principal information concerns the names of the persons depicted in the painting, as shown in Fig. 6. In general, all these properties are retrieved from Wikidata by an identifier that maps the chosen word to a Wikidata property ID and by a query that retrieves the value of that property.
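A minimal sketch of this segmentation step using torchvision follows (an assumption on our part: torchvision's pre-trained model uses a ResNet-50 FPN backbone, whereas the paper's Mask R-CNN uses ResNet101):

```python
import torch
from PIL import Image
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

# Mask R-CNN pre-trained on COCO, as in the paper (backbone differs).
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

COCO_PERSON = 1  # "person" is label 1 in the COCO category list

img = to_tensor(Image.open("last_supper.jpg").convert("RGB"))
with torch.no_grad():
    out = model([img])[0]  # dict with boxes, labels, scores, masks

persons = [
    (box, score)
    for box, label, score in zip(out["boxes"], out["labels"], out["scores"])
    if label.item() == COCO_PERSON and score.item() > 0.7
]
print(f"recognized {len(persons)} person(s)")
```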

Fig. 5 Instance segmentation (the Last Supper)

Fig. 6 Word cloud generated from the Last Supper

The tag colors depend on the class of the recognized objects (i.e. title, person, object, ...), while the font size is chosen randomly, so as to best fit the tag cloud box (i.e. a circle), taking into account the frequency of the class objects.
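A sketch of such a class-colored tag cloud with the wordcloud Python package (our illustration; the class-to-color mapping and the frequencies are made up):

```python
from wordcloud import WordCloud

# Hypothetical tag -> class mapping and class -> color palette.
TAG_CLASS = {"Last Supper": "title", "Jesus": "person",
             "Judas": "person", "table": "object"}
CLASS_COLOR = {"title": "darkred", "person": "steelblue", "object": "darkgreen"}

def color_by_class(word, **kwargs):
    # Color each tag according to the class of the recognized object.
    return CLASS_COLOR.get(TAG_CLASS.get(word, "object"), "gray")

# Frequencies drive the relative font sizes; a circular boundary could be
# obtained by passing a round mask array via the mask parameter.
frequencies = {"Last Supper": 10, "Jesus": 7, "Judas": 5, "table": 3}

cloud = WordCloud(width=400, height=400, background_color="white",
                  color_func=color_by_class, random_state=42)
cloud.generate_from_frequencies(frequencies)
cloud.to_file("tag_cloud.png")
```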

5 Results and evaluation

In this section, we evaluate our system and discuss the results. Firstly, we evaluated several CNN architectures as feature extractors, and then we evaluated the storytelling framework. To choose the best one, we evaluated the following CNNs, pre-trained on ImageNet: VGG16 [38], ResNet50 [14], MobileNet V2 [36] and Inception V3 [39]. Furthermore, for each architecture we tried max and average global pooling as dimensionality reduction strategies. To choose the best CNN architecture, we use the Precision-Recall curve and the mean Average Precision at k (mAP@k) [30]. Precision represents how many documents in the retrieved set are relevant, while recall represents how many of the relevant documents have been retrieved; they are respectively defined in (2) and (3):

$$ Precision = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} $$
(2)
$$ Recall = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} $$
(3)

Precision and recall are two essential model evaluation metrics. While precision refers to the percentage of retrieved results that are relevant, recall refers to the percentage of the total relevant results that are correctly retrieved by the algorithm. The Precision-Recall curve is an interpolation of precision at eleven standard points of recall (from 0 to 1), according to (4):

$$ P_{interp}(r) = \max_{r_{i} \geq r} p(r_{i}) $$
(4)

The Precision-Recall curve allows analyzing the precision of the system at each level of recall. The mean Average Precision (mAP) is defined in (5a) and (5b):

$$ \begin{array}{@{}rcl@{}} AveP &=& \frac{{\sum}_{k=1}^{n} P(k)\cdot rel(k)}{\text{number of relevant documents}} \end{array} $$
(5a)
$$ \begin{array}{@{}rcl@{}} MAP &=& \frac{{\sum}_{q=1}^{Q}AveP(q)}{Q} \end{array} $$
(5b)
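As a concrete reading of (5a) and (5b), the sketch below (our own; not the authors' evaluation code) computes AveP over a ranked result list with binary relevance rel(k), normalized by the total number of relevant documents as in (5a), and averages it over queries; with k = 1 it reduces to the mAP@1 used later:

```python
def average_precision(ranked, relevant, k=None):
    # ranked: retrieved document ids, best first; relevant: set of relevant ids.
    ranked = ranked[:k] if k else ranked
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:           # rel(k) = 1
            hits += 1
            total += hits / i         # P(k) at this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries, k=None):
    # queries: list of (ranked results, set of relevant ids) pairs.
    return sum(average_precision(r, rel, k) for r, rel in queries) / len(queries)

# mAP@1 reduces to checking whether the top result is relevant:
queries = [(["img3", "img7"], {"img3"}), (["img5", "img2"], {"img2"})]
print(mean_average_precision(queries, k=1))  # 0.5
```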

The dataset used to evaluate the CNN architectures as feature extractors is PASCAL VOC2012 [8]. It is divided into 20 object classes, as shown in Table 1.

Table 1 PASCAL VOC statistics
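For reference, the compared extractor configurations can be instantiated in Keras as in the sketch below (ours; an assumption about the exact setup, since the paper does not publish code):

```python
from tensorflow.keras.applications import VGG16, ResNet50, MobileNetV2, InceptionV3

# Each architecture is tried headless with both global pooling strategies.
ARCHITECTURES = {"vgg16": VGG16, "resnet50": ResNet50,
                 "mobilenet_v2": MobileNetV2, "inception_v3": InceptionV3}

extractors = {
    f"{name}_{pooling}": ctor(weights="imagenet", include_top=False, pooling=pooling)
    for name, ctor in ARCHITECTURES.items()
    for pooling in ("avg", "max")
}
# e.g. extractors["resnet50_avg"].output_shape -> (None, 2048)
```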

Figure 7 shows the Precision-Recall curves. The best CNN is ResNet50 with average pooling, whose curve is almost always above the others; however, since in our work we are interested in the accuracy of the first retrieved image, we also analyze the mAP@1, shown in Fig. 8, where the best architecture is VGG16 with average pooling.

Fig. 7 Precision-recall curve

Fig. 8 Mean average precision at 1

Based on the tests reported above, we evaluated the storytelling framework using VGG16 as the CNN for feature extraction. We used an evaluation strategy combining several user-oriented metrics:

  1. PU: perceived usefulness [6]

  2. PEU: perceived ease of use [28]

  3. TIS: trust in the information system [18]

  4. PE: perceived enjoyment [20]

To evaluate the storytelling process using the user-oriented metrics listed above, we conducted a survey aimed at collecting opinions from a heterogeneous audience. Our audience is composed of 200 people of different ages and education levels. We summarize the results in Figs. 9 and 10, showing metrics and votes, and in Fig. 11 in terms of average and standard deviation. The results concerning Perceived Usefulness show that more than 70% of the respondents give excellent feedback (6 or 7) and around 20% give good feedback (4 or 5); there are only two votes below 4. The Perceived Ease of Use results show one awful feedback (1) and two low feedbacks, but over 60% of the respondents give very positive feedback (6 or 7). The results for Perceived Enjoyment show about 70% very positive feedback (6 or 7) and 4% bad feedback (3). The results for Trust in the Information System show over 65% very positive feedback, only two negative feedbacks (3), and one very negative feedback (2).

Fig. 9 Results

Fig. 10 Percentage results grouped by three ranges of rate

Fig. 11 Average and standard deviation

The evaluation of the storytelling process using user-oriented metrics showed very good results: the storytelling process is easy to use and enjoyable, and the audience states that it trusts the information system and that the application could be helpful. The perceived usefulness (PU) shows the best result (average = 6.05). Although the average results are high, the survey also shows some awful feedback (TIS and PEU) and a high standard deviation (PEU metric), so some aspects of the system should be improved to make the application easier to use and to increase trust in the information system. We conducted the survey on a cohort equally distributed with respect to age and education level. The analysis of the results highlighted that almost everyone agrees that our application is practical and enjoyable. Nevertheless, the analysis by educational grade showed that participants with a higher grade found the application more straightforward to use than those with a lower one; however, the participants with a higher educational grade trusted the retrieved information less.

6 Conclusions and future work

In this paper, we proposed a storytelling application based on semantic technologies, big data, and artificial intelligence. This work aimed to propose and implement an interactive framework that uses Linked Open Data to retrieve information and create stories. Compared to the existing literature, it is the first storytelling project that uses Deep Neural Networks, Linked Open Data, and an ontology-driven focused crawler in a storytelling process. The evaluation results show that the framework reaches the goals described above and can be helpful and enjoyable according to the user tasks. The most important difference between this and other storytelling applications is the intensive use of ontologies and sources of structured open data such as WordNet, Wikidata, and DBpedia. Several future research directions remain to be investigated. We want to implement our framework on mobile platforms to improve usability and to exploit further information, such as geolocalization; moreover, we want to replace input images with images taken from the smartphone camera. Our approach uses WordNet as the base ontology; we will integrate other ontologies from different sources using specific matching and merging techniques to improve our knowledge graph. Furthermore, we will carry out a computational complexity study and a statistical analysis to judge the significance of the results.