1 Introduction

Nowadays, we are overwhelmed by the large amount of available information we could benefit from. In particular, e-commerce sites and entertainment Web services usually offer thousands of different items among which users are invited to find the ones they need or desire the most. In this direction, recommender systems (RSs) have proven very effective in suggesting appropriate items to users according to their past choices and behaviors. A typical RS exploits item ratings available in a system, either implicitly or explicitly, to predict a list of unseen items which are of potential interest to the user (Ricci et al., 2011). Over the years, several strategies have been developed to build efficient recommendation algorithms. They are generally divided into three main categories: collaborative filtering (CF) techniques, content-based (CB) methods and hybrid approaches (Burke, 2002). CF techniques rely exclusively on user feedback on specific items (rating, click, watch, listen) without considering the item descriptions (either structured or unstructured), whereas CB methods exploit the data associated with an item to compute relevant recommendations for a user. One of the main issues to tackle in adopting a CB approach is therefore obtaining the right amount of meaningful data about items, which in turn is necessary to model content-aware item and user descriptions. In fact, by gathering data about items rated by users, one can infer attributes that can be used to model users’ profiles and preferences.

In recent years, the wave of deep learning techniques has also reached the field of RSs. A variety of new approaches based on different configurations of Neural Networks (NNs) have been proposed to compute personalized lists of items to be suggested to end users (Covington et al., 2016; Wang & Wang, 2014; Elkahky et al., 2015). Among them, autoencoders have been proposed as an interesting tool to mimic user behavior in producing ratings by exploiting and modeling user preferences on latent item attributes (Wu et al., 2016). Autoencoders are a particular configuration of artificial NNs which have turned out to be very effective, especially for dimensionality reduction and feature selection tasks (Vincent et al., 2008). However, autoencoders have also proved useful in a different configuration. Indeed, in their mirrored structure, neurons of the hidden layers can be interpreted as a projection of the input layer in a different space. In Bellini et al. (2019), the authors presented SEMAUTO (footnote 1), a Semantics-Aware autoencoder (SA-autoencoder from now on) which leverages the graph-based structure shared by NNs and knowledge graphs (KGs) to encode the semantics and the structure of a KG, thereby enhancing the representational power of the underlying NN. One of the main advantages of such a hybrid structure is that it gives an explicit semantics/label to the latent dimensions of the new space (Bellini et al., 2019). In this setting, SA-autoencoders may expand the hidden layer to accommodate all features in the KG relevant for a user.

Among the various KGs freely available on the Web, DBpedia (Auer et al., 2007) and Wikidata (Vrandecic & Krötzsch, 2014) play a key role due to their encyclopedic nature, which makes them ideal candidates to provide structured descriptions of items in a recommender system (Sacenti et al., 2022; Liu et al., 2023). Although there is a partial overlap among the information sources used to build DBpedia and Wikidata, the data they encode differ in various respects, such as the amount of data and the way it is organized (Ringler & Paulheim, 2017).

Fig. 1 Architecture of an autoencoder

In this paper we test the addition of a KG to an RS through extensive experiments that vary the following dimensions:

  • along the knowledge dimension, we vary the KG between DBpedia and Wikidata;

  • along the RS model dimension, we test several recommendation models:

    • three state-of-the-art approaches, two of which are tested both with and without the addition of knowledge (yielding five possible recommendation models);

    • an SA-autoencoder model, in five different configurations;

  • along the choice of domain, we conduct experiments in three different ones: books, music, and movies, by employing an appropriate dataset for each one of them.

For every RS built over a combination of the above characteristics, we examine its results in terms of accuracy and diversity, highlighting and discussing how the structure and coverage of the information encoded in a KG may affect the accuracy and novelty of the recommendation.

The structure of the paper is as follows: in Section 2, we recap how to build SA-autoencoders and their recommendation model. Then, in Section 3 we introduce the experimental setting and the different configurations adopted in our investigation, while the results are discussed in Section 4. We then compare our results with related works in Section 5. Conclusions and future work close the paper.

2 Semantics-aware autoencoders in recommendation scenarios

Autoencoders are unsupervised NNs that learn a function capable of reconstructing the network’s input at the output layer. They are built on top of two main components which are the encoder and the decoder. The former is usually responsible for compressing the input data into a lower dimensional representation, while the latter does the opposite job reconstructing the original input data starting from a lower dimensional representation (see Fig. 1). Like every NN configuration, autoencoders are structured in layers which contain neurons. Every neuron in layer i is connected through an edge to all the neurons of the following layer \(i+1\). In other words, in its standard configuration we have a fully connected network. In a recommendation scenario, we may use all the items in a catalog as representative of both the input and the output layer. We may then train the NN by using user ratings as inputs to obtain similar values produced by the output layer.

Autoencoders have been studied also in architectures that augment the input-output dimension. Figure 2 depicts an Overcomplete Autoencoder, a deep learning technique with a hidden layer wider than the input layer, offering greater representation capacity than traditional autoencoders (Ranzato et al., 2006; Rifai et al., 2011). This configuration excels both in handling noisy or missing data and in generating new data (Vincent et al., 2008). However, increased computational complexity and overfitting risks may arise. Mitigation strategies include using specialized algorithms and hardware, regularization, and dropout techniques (Ngiam et al., 2011; Lewicki & Sejnowski, 2000).

Denoising Autoencoders (DAE) also feature a wider middle layer and learn compressed, simplified representations of input data (Vincent et al., 2010; Alain & Bengio, 2014). DAEs can extract meaningful features from user interaction datasets or knowledge graphs, improving recommendation algorithm performance and recommendation quality (Liang et al., 2018).

Starting from the observation that both NNs and KGs are directed graphs, in Bellini et al. (2017) the authors propose to use the topology of the latter to model the former, in an original architecture inspired by overcomplete-autoencoders, whose aim is not dimensionality reduction. The main idea is to keep input- and output-layer nodes as representative of the items in the catalog and substitute anonymous nodes in the hidden layers by labeled resources from a knowledge graph thus inheriting their mutual semantic connections (see Fig. 2). Differently from the generic definition of autoencoder, we see here that the resulting NN is no longer fully connected since the input layer’s nodes are linked to neurons in the hidden layer if and only if a corresponding connection exists in the original knowledge graph. In such a semantics-aware autoencoder, we somehow project the items in a space whose dimensions represent all the entities (features) they are connected to.

Fig. 2 Architecture of a semantics-aware autoencoder

As with a generic autoencoder, a semantics-aware autoencoder is trained with user ratings and learns how to reconstruct them at the output layer. Unlike the original autoencoder, however, the network is not fully connected, so user ratings propagate only through those nodes that represent features connected in the KG to items rated by the user. In NN models, a generic neuron outputs a value that is a non-linear function of the weighted sum over its incoming edges. In a recommendation scenario, it then turns out that neurons connected to positively rated items tend to have higher output values than neurons connected to negatively rated items. As proposed in Bellini et al. (2019), a single autoencoder is trained for each user for 1000 epochs, using the well-known sigmoid activation function for the hidden layer. The number of neurons in the hidden layer depends on the number of entities (features) encoded in the KG for the user’s rated items. We use the autoencoder to describe a single user’s ratings in a feature space that is tightly coupled with the KG, so this method, even with few ratings, is able to weigh item features according to the user’s interest in them. Moreover, since each autoencoder is fed with the ratings of only one user, the overall approach can be easily parallelized by training each user’s autoencoder on a separate thread.

We recap the training process of the semantics-aware autoencoder in Fig. 3. At the start, the model is initialized with random weights (Fig. 3a). At each epoch it is fed with user ratings, and back-propagation adjusts the weights in order to reconstruct the input user ratings at the output layer (Fig. 3b). The model converges and encodes the relevance of each feature in the hidden layer (Fig. 3c).

Fig. 3 Training process for a semantics-aware autoencoder for an individual user. The color nuance of a hidden unit denotes the relevance of a feature it represents according to the rating given to items connected to that hidden neuron
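The training loop recapped above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments (which relies on TensorFlow): the function name `train_sa_autoencoder`, the normalisation of ratings to [0, 1], the sigmoid output layer, the mean squared error loss, the learning rate and the absence of bias terms are all assumptions made here for brevity. The point it illustrates is the masking: weights exist, and are updated, only on edges that correspond to a connection in the KG.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_sa_autoencoder(ratings, mask, epochs=1000, lr=0.5, seed=0):
    """Train a tiny masked autoencoder for ONE user (hypothetical helper).

    ratings: list of n values in [0, 1], one per item (assumed normalisation).
    mask:    n x m 0/1 matrix; mask[i][c] == 1 iff item i is linked to
             feature c in the knowledge graph.
    Returns (w_in, w_out, losses); w_in[i][c] is the weight item i -> feature c.
    """
    rng = random.Random(seed)
    n, m = len(mask), len(mask[0])
    # Only edges that exist in the KG get a random weight; the others stay 0.
    w_in = [[rng.uniform(-0.5, 0.5) * mask[i][c] for c in range(m)] for i in range(n)]
    w_out = [[rng.uniform(-0.5, 0.5) * mask[i][c] for i in range(n)] for c in range(m)]
    losses = []
    for _ in range(epochs):
        # Forward pass: sigmoid hidden layer, sigmoid output layer (assumption).
        h = [sigmoid(sum(ratings[i] * w_in[i][c] for i in range(n))) for c in range(m)]
        out = [sigmoid(sum(h[c] * w_out[c][i] for c in range(m))) for i in range(n)]
        losses.append(sum((out[i] - ratings[i]) ** 2 for i in range(n)) / n)
        # Backward pass for the mean squared error.
        d_out = [2 * (out[i] - ratings[i]) / n * out[i] * (1 - out[i]) for i in range(n)]
        d_h = [sum(d_out[i] * w_out[c][i] for i in range(n)) * h[c] * (1 - h[c])
               for c in range(m)]
        for c in range(m):
            for i in range(n):
                if mask[i][c]:  # gradients flow only along KG edges
                    w_out[c][i] -= lr * d_out[i] * h[c]
                    w_in[i][c] -= lr * d_h[c] * ratings[i]
    return w_in, w_out, losses


# Toy example: 3 items, 2 KG features; item 0 is not linked to feature 1.
toy_mask = [[1, 0], [1, 1], [0, 1]]
w_in, w_out, losses = train_sa_autoencoder([1.0, 0.8, 0.2], toy_mask, epochs=200)
```

Because masked-out weights are initialised to zero and never updated, the trained network keeps the same topology as the KG excerpt it was built from.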

2.1 User profiles

If we train one SA-autoencoder per user, the resulting model may be interpreted as an explicit representation of the user profile over item attributes. As a matter of fact, at the end of the training, each hidden node encodes a value that represents the relevance, for the user, of the feature associated with that node. Thus, sets of pairs \(\langle feature, value \rangle \) can be defined and used as a representation of the user profile in the semantic space. An SA-autoencoder is, therefore, a model that uses deep learning techniques to extract weighted features from a KG according to user ratings in order to build user profiles for recommendation tasks. In particular, given a user u, after the training phase the weight of a feature c is the summation of the weights \(w_{k}^u(c)\) associated with the edges k entering the hidden node representing a KG entity c (see Fig. 4).

Fig. 4 An excerpt of the network in Fig. 2 after the training

More formally, given user u and a feature c, the weight of c for u can be computed as:

$$\begin{aligned} \omega ^u(c) = \sum _{k=1}^{|In(c)|} w_k^u(c) \end{aligned}$$

where In(c) is the set of the edges entering the hidden node representing the feature c. As an example, if we consider the excerpt of the network in Fig. 4, for Barry Sonnenfeld we have:

$$\begin{aligned} \omega ^u(\mathtt {Barry\hspace{5.0pt}Sonnenfeld}) = w^u_{11} + w^u_{12} \end{aligned}$$

Having weights associated with each resource coming from the KG, we can now model a user profile composed of a vector of weighted features. Given \(S^u\) as the set of all the features belonging to the items rated by the user u and \(S = \bigcup _{u \in U} S^u\) as the set of all the features in the system, we have that for each user \(u \in U\) and for each feature \(c \in S\), the user profile P(u) is represented as:

$$\begin{aligned} P(u) = \{\langle c, \omega \rangle ~|~ \omega = \omega ^u(c) \text { if }c \in S^u, \omega = 0 \text { otherwise}\} \end{aligned}$$
(1)

Because we train an SA-autoencoder for each user, the weights associated with the edges that link items to values indicate how influential that type of \(\langle feature, value \rangle \) pair is for the user. Clearly, this value varies from user to user, since each user has a different SA-autoencoder.
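As a small illustration of the definitions of \(\omega ^u(c)\) and P(u), the following sketch builds a profile from trained edge weights. The function names and the toy weights are hypothetical and only serve the example; the profile is stored as a dictionary rather than as a set of pairs for convenience.

```python
def feature_weight(incoming_weights):
    """omega^u(c): sum of the weights on the edges entering feature node c."""
    return sum(incoming_weights)


def user_profile(edge_weights, all_features):
    """Build P(u) as a dict feature -> weight, following Eq. (1).

    edge_weights: dict mapping each feature in S^u to the list of trained
                  weights w_k^u(c) on the edges entering its hidden node.
    all_features: iterable with every feature in the system (the set S).
    """
    return {
        c: feature_weight(edge_weights[c]) if c in edge_weights else 0.0
        for c in all_features
    }


# Toy weights (made up for illustration): two edges enter the node of one feature.
toy_weights = {"Barry Sonnenfeld": [0.5, 0.25]}
profile = user_profile(toy_weights, ["Barry Sonnenfeld", "Buddy Film"])
```

Features outside \(S^u\) receive weight 0, which is what makes the resulting vector sparse.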

2.1.1 Recommendation

The vector computed with (1) is usually very sparse in the space representing the overall number of features. In other words, many values are set to 0 because the corresponding feature does not belong to any item the user rated in the past. Since this may negatively affect the final result in a recommendation scenario, we reduce the sparseness of the user profile by predicting a value to fill 0-valued features in the vector through a word2vec-like approach (Bellini et al., 2018).

Once we have a less sparse version of the user profile, we produce recommendations by performing a standard user-kNN (see (2)), computing how close users are to each other through cosine similarity (see (3)).

$$\begin{aligned} \hat{r}(u, i) = \frac{\sum _{j=1}^{k} sim(u, v_{j}) \cdot r(v_j, i)}{\sum _{j=1}^{k} sim(u, v_{j})} \end{aligned}$$
(2)
$$\begin{aligned} sim( u, v) = \frac{P(u) \cdot P(v)}{||P(u)|| \cdot ||P(v)||} \end{aligned}$$
(3)

In (2), given an item i not yet seen by user u, we predict a rating \(\hat{r}(u,i)\) for u on i by considering the ratings \(r(v_j,i)\) assigned to i by the k most similar users \(v_j\).
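Equations (2) and (3) can be sketched as follows. This is a minimal illustration, assuming profiles are stored as dictionaries from feature to weight and that the k neighbours are chosen among the users who actually rated item i (the text does not spell out this detail); the function names are hypothetical.

```python
import math


def cosine(p, q):
    """Eq. (3): cosine similarity between two profiles (dict feature -> weight)."""
    dot = sum(w * q.get(c, 0.0) for c, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def predict(u, item, profiles, ratings, k=2):
    """Eq. (2): similarity-weighted average of the neighbours' ratings on item.

    profiles: dict user -> profile; ratings: dict user -> {item: rating}.
    Returns None when no neighbour with non-zero total similarity exists.
    """
    neighbours = sorted(
        ((cosine(profiles[u], profiles[v]), v)
         for v in profiles
         if v != u and item in ratings.get(v, {})),
        reverse=True,
    )[:k]
    denom = sum(s for s, _ in neighbours)
    if denom == 0:
        return None
    return sum(s * ratings[v][item] for s, v in neighbours) / denom
```

Ranking the unseen items of u by their predicted rating then gives the top-N list.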

From a practitioner’s perspective, this approach can scale both vertically and horizontally. Since each autoencoder can be trained on a single core, the SA-autoencoder approach scales vertically with the number of cores on a single machine and horizontally by deploying multiple machines. Given that we train a single autoencoder for each user, the size of each network depends only on the number of interactions of that user; the resulting networks are relatively small and therefore fast to train. In industrial settings where clusters are readily available, this is a plus, since the approach is easy to scale.

3 Experiments

As a direct consequence of relying on a KG, it stands to reason that its structure, as well as the information it encodes, might affect how user profiles are computed. Thus, it is crucial to investigate how the KG’s structure impacts recommendation accuracy. One might expect that the better the data are engineered and curated, the more accurate a recommendation is. The question is how to evaluate and measure the recommendation.

Given the recommendation model previously described, we performed an experimental evaluation to verify the influence of the knowledge source on the final recommendation task. In this paper, we chose the two main encyclopedic knowledge graphs freely available on the Web, DBpedia and Wikidata, due to the richness of information they encode in different knowledge domains. DBpedia and Wikidata differ from each other not only in their structure but also in how distinct fields of knowledge are covered (Ringler & Paulheim, 2017).

We first describe the structure of the datasets used in the experiments, then we move on to the evaluation protocol for the recommendation, and finally we discuss the results.

For the sake of ensuring the reproducibility of our models we provide the link to the public repository from which to retrieve the code and datasets (https://split.to/sa-auto).

3.1 Dataset

We conducted our experiments on three different datasets as summarized in Table 1.

Table 1 Datasets characteristics and relative levels of sparsity

MovieLens 1M (footnote 2) stores information about user-item interactions made on a 5-star scale and relates to the movie domain. Last.fm (footnote 3) contains information about listenings to music, bands and artists; since Last.fm records the number of times each user has listened to a song, we infer users’ preferences by scaling this count to the range [1, ..., 10] (using min-max normalization). Finally, in LibraryThing (footnote 4), which is a social web application for book cataloging, ratings are given on a 10-star scale with reference to books. By relying on these datasets, we have three different knowledge domains which are covered both by DBpedia and by Wikidata.
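The min-max scaling of Last.fm play counts can be illustrated as below. This is a sketch under assumptions: the text does not state whether the minimum and maximum are taken per user or over the whole dataset, nor how constant counts are handled, so the function name, the choice of working on one dictionary of counts, and the midpoint fallback are all hypothetical.

```python
def scale_play_counts(counts, low=1.0, high=10.0):
    """Min-max normalise raw play counts (dict item -> count) into [low, high]."""
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # All counts are equal: map everything to the midpoint (assumption).
        return {item: (low + high) / 2 for item in counts}
    return {
        item: low + (high - low) * (c - lo) / (hi - lo)
        for item, c in counts.items()
    }


# Toy play counts, made up for illustration.
scaled = scale_play_counts({"song_a": 1, "song_b": 51, "song_c": 101})
```

With these toy counts, the least played song maps to 1 and the most played one to 10.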

In order to map items to resources in DBpedia we adopted a freely available mapping (footnote 5) originally presented in Ostuni et al. (2013) and then refined in Anelli et al. (2017). Thus, we retain 3301 mapped items for MovieLens 1M, 10180 for Last.fm and 11695 for LibraryThing. Starting from the DBpedia resource URIs, we obtained the Wikidata entities through owl:sameAs links.

3.2 Knowledge graphs: DBpedia vs. Wikidata

In addition to a general evaluation of the impact of using DBpedia vs. Wikidata in a recommendation setting, we also evaluated how different kinds of information from a KG impact the recommendation itself. If we look at the semantic knowledge encoded within the two knowledge graphs, we may identify factual knowledge, where we have facts stated on a specific resource, for example (footnote 6):

dbr:Men_in_Black_(Film) dbo:director dbr:Barry_Sonnenfeld .

and ontological and categorical knowledge which encode the semantics of an entity through classes and categories, such as:

dbr:Men_in_Black_(Film) dct:subject dbc:Buddy_Film .

dbr:Men_in_Black_(Film) rdf:type dbo:Film .

In DBpedia, categorical information is reached through the following predicates: dct:subject and skos:broader.

The former allows us to explore categorical resources related to an item, while the latter lets us discover a wider category in a hierarchical perspective.

As for Wikidata, we considered categorical information encoded through the predicates:

Since the DBpedia predicate skos:broader is not directly mapped in Wikidata, we used https://www.wikidata.org/wiki/Property:P279, labeled as “subclass of”, to identify hierarchical categories as well.

Regarding factual information, we used the approach proposed in Ragone et al. (2017) to automatically identify those DBpedia predicates (listed in Table 2) which turn out to be the most meaningful for a recommendation task; the corresponding Wikidata properties were properly collected through SPARQL queries.

Table 2 DBpedia factual predicates selected to compute recommendations

3.3 Data settings

Here we show the different configurations we adopted to inject data from DBpedia and Wikidata into our semantics-aware autoencoder. As stated in Section 2, the input and output layers are always composed of resources representing the items of our recommendation setting (i.e., movies for MovieLens 1M, books for LibraryThing, songs, bands, etc. for Last.fm). The configurations we propose differ mainly in the information encoded in the hidden layers. In Fig. 5 we show only the case for DBpedia, as the configurations for Wikidata are analogous.

We list below the symbol of each configuration along with its explanation:

  • S (for Subject): the first configuration (Fig. 5a) encodes only categories through the dct:subject property. Hence, all the nodes in the hidden layer have a one-to-one mapping with a corresponding category in DBpedia. We refer to this configuration as Subject since categories are structured in a hierarchical way through skos:broader;

  • B (for Broader): we considered in the hidden layer only those categories which are one hop away, through the skos:broader property, from the resources directly connected to an item via dct:subject (Fig. 5b). In this configuration, connections between the items and the hidden layer are represented by the property chain dct:subject/skos:broader;

  • S-B (for Subject-Broader): in this configuration we considered three hidden layers (footnote 7), thus mimicking the actual topology of the knowledge graph in the structure of the NN (see Fig. 5c);

  • M (for Mixed): we put in the same hidden layer both the categories connected via dct:subject and those connected via dct:subject/skos:broader. We consider this configuration as a flattening of S-B;

  • F (for Factual): here the hidden layer is composed of all the resources which are one hop away from the input item through the properties in Table 2.

Fig. 5 The different configurations used in our experiments
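To make the single-hidden-layer configurations concrete, the sketch below selects the hidden-layer features of one item from a list of triples. It is a simplified illustration, not the authors' pipeline: the function name and its signature are hypothetical, the multi-layer S-B configuration is left out, and the parent category used in the toy data is a placeholder rather than a real DBpedia category.

```python
def hidden_features(item, triples, config, factual_predicates=()):
    """Return the set of hidden-layer features linked to `item`.

    triples: list of (subject, predicate, object) strings.
    config:  "S" (dct:subject), "B" (dct:subject/skos:broader),
             "M" (union of S and B) or "F" (factual predicates).
    """
    subjects = {o for s, p, o in triples if s == item and p == "dct:subject"}
    broader = {o for s, p, o in triples if s in subjects and p == "skos:broader"}
    if config == "S":
        return subjects
    if config == "B":
        return broader
    if config == "M":
        return subjects | broader
    if config == "F":
        return {o for s, p, o in triples if s == item and p in factual_predicates}
    raise ValueError(f"unsupported configuration: {config}")


# Toy triples; dbc:Example_Parent_Category is a made-up placeholder.
toy_triples = [
    ("dbr:Men_in_Black_(Film)", "dct:subject", "dbc:Buddy_Film"),
    ("dbc:Buddy_Film", "skos:broader", "dbc:Example_Parent_Category"),
    ("dbr:Men_in_Black_(Film)", "dbo:director", "dbr:Barry_Sonnenfeld"),
]
mixed = hidden_features("dbr:Men_in_Black_(Film)", toy_triples, "M")
```

Applying the function to every item rated by a user and taking the union of the results gives the hidden layer of that user's autoencoder for the chosen configuration.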

3.4 Evaluation

Before measuring how different KGs impact recommendation quality, we first assess the strength of the proposed method by comparing it with some state-of-the-art baselines. Then we quantify how its performance varies depending on the subset of the KG we use.

For the evaluation of our approach we adopted the “All Unrated Items” protocol described in Steck (2013): for each user u, a top-N recommendation list is provided by computing a score for every item i not rated by u, whether i appears in the user test set or not. Training and test sets are generated by splitting each dataset with an 80/20 hold-out, which ensures that each user has 80% of their ratings in the training set and the remaining 20% in the test set. To carry out the experimental phase, we used a server equipped with an i7-7700K processor, 128GB of RAM and a GeForce GTX 960 GPU. Five full rounds (training and evaluation) were run for each model and configuration, and we report the best result obtained across the individual rounds.
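A per-user 80/20 hold-out split of this kind could be sketched as follows. The function name, the random shuffling and the fixed seed are assumptions for illustration; the text does not say how the 20% of test ratings is selected.

```python
import random


def hold_out_split(user_ratings, train_fraction=0.8, seed=42):
    """Split each user's ratings so that about 80% go to training, the rest to test.

    user_ratings: dict user -> {item: rating}.
    Returns (train, test) with the same structure.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for user, ratings in user_ratings.items():
        items = sorted(ratings)  # deterministic order before shuffling
        rng.shuffle(items)
        cut = int(round(train_fraction * len(items)))
        train[user] = {i: ratings[i] for i in items[:cut]}
        test[user] = {i: ratings[i] for i in items[cut:]}
    return train, test
```

Splitting user by user, rather than over the whole rating matrix, is what guarantees that every user keeps the same proportion of ratings in both sets.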

The produced recommendation lists are finally compared with the test set by computing performance metrics: Precision, Recall and F1-score. These metrics have been chosen to evaluate the accuracy of our model in a top-10 recommendation scenario (Cremonesi et al., 2010), using threshold values of 4 for MovieLens 1M and 8 for both LibraryThing and Last.fm.
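For one user, the three accuracy metrics can be computed as in the sketch below. It assumes that an item is relevant when its test rating is greater than or equal to the threshold and that precision is divided by the list length n; both conventions, like the function name, are assumptions not fixed by the text.

```python
def precision_recall_f1(recommended, test_ratings, threshold, n=10):
    """Accuracy metrics for one user's top-n recommendation list.

    recommended:  ranked list of item ids.
    test_ratings: dict item -> rating from the user's test set.
    An item counts as relevant if its test rating is >= threshold (assumption).
    """
    top_n = recommended[:n]
    relevant = {i for i, r in test_ratings.items() if r >= threshold}
    hits = sum(1 for i in top_n if i in relevant)
    precision = hits / n
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

Averaging these per-user values over all users gives the figures typically reported for a top-10 scenario.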

Accuracy metrics are a valuable way to evaluate the performance of a recommender system. Nonetheless, it has been argued (Smyth & McClave, 2001) that diversity should be considered when evaluating how good a recommendation engine is. Gini index (Shani & Gunawardana, 2011, Formula 8.20) is an ideal candidate to measure the distribution/diversity of items across recommendation lists:

$$\begin{aligned} Gini = \frac{1}{n-1} \cdot \sum _{j=1}^{n} (2j-n-1)\cdot p(i_{j}) \end{aligned}$$

where n is the number of items, \(p(i_{j})\) is the proportion of user choices for item \(i_{j}\) and \(i_1,...i_n\) is the list of items ordered according to increasing \(p(i_{j})\). A Gini index value equal to 0 means that all items are chosen equally often, while it is 1 if a single item is always chosen.
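The Gini formula above translates directly into code. In this sketch, the function name is hypothetical and \(p(i_j)\) is estimated as the share of all recommendations that go to item \(i_j\), which is an assumption about how the proportions are obtained.

```python
def gini_index(recommendation_counts):
    """Gini index of how often each catalogue item is recommended.

    recommendation_counts: one non-negative count per catalogue item
    (include zeros for items that are never recommended).
    """
    n = len(recommendation_counts)
    total = sum(recommendation_counts)
    if n < 2 or total == 0:
        raise ValueError("need at least two items and one recommendation")
    # p(i_j), sorted in increasing order as required by the formula.
    p = sorted(c / total for c in recommendation_counts)
    return sum((2 * j - n - 1) * p_j for j, p_j in enumerate(p, start=1)) / (n - 1)
```

A uniform distribution such as `gini_index([5, 5, 5, 5])` gives 0, while a fully concentrated one such as `gini_index([0, 0, 0, 7])` gives 1, matching the two extreme cases described above.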

Among the several state-of-the-art techniques used in recommender scenarios, we tested the most widely adopted: BPRSLIM (Ning & Karypis, 2011; Rendle et al., 2009), WRMF (Pan et al., 2008; Hu et al., 2008) and a single-layer autoencoder for rating prediction. Although they were proposed several years ago, BPRSLIM and WRMF have recently been shown to perform excellently compared to deep-learning based approaches (Dacrema et al., 2019). BPRSLIM is a Sparse Linear Method which leverages Bayesian Personalized Ranking as an objective function. WRMF is a Weighted Regularized Matrix Factorization method which exploits users’ implicit feedback to provide recommendations. In their basic version, both strategies rely exclusively on the User-Item matrix in a pure CF approach. They can be hybridized by exploiting side information (SI) (Ning & Karypis, 2011), i.e., additional data associated with items. We used the implementations of BPRSLIM and WRMF available in MyMediaLite (footnote 8) (Gantner et al., 2011), while both our SA-autoencoder and the classic autoencoder are implemented with TensorFlow (footnote 9).

4 Discussion of the results

For all the recommendation models described, and for both KGs, we performed experiments on three datasets: MovieLens 1M, Last.fm and LibraryThing, related to the movie, music, and book domains, respectively.

In Table 3, we present the best outcomes we obtained across the three datasets with the methods discussed earlier, with a specific emphasis on the level of sparsity. For the SA-autoencoder approach, we conducted tests with various numbers of neighbors (k) and included only the best results in the table. We have highlighted the overall best-performing approach in bold and underlined the most effective configuration of the SA-autoencoder. We begin our discussion with the results in terms of accuracy before turning to diversity. The table shows that the semantics-aware autoencoder surpasses all baseline methods on the Last.fm dataset. Conversely, on the LibraryThing and MovieLens 1M datasets, its performance closely mirrors that of the fully-connected autoencoder. To explain this outcome, we posit that the performance of the semantics-aware autoencoder depends on the number of features retrieved from the KG. A greater number of features corresponds to a larger number of neurons in the hidden layers. According to Hornik’s Universal Approximation Theorem, a neural network with more neurons possesses a superior capacity to approximate any given function. Consequently, as indicated in Table 4, we have computed the ratio of features associated with items, denoted as \(\frac{average \#features}{average \#items}\). Our findings reveal that, considering each pair of KGs (DBpedia-Wikidata) for a given setting (S, B, F, ...), the KG with the higher ratio yields the better performance. This measure also serves as a gauge of connectivity in SA-autoencoders. Specifically, lower values of \(\frac{average \#features}{average \#items}\) indicate sparse connectivity with fewer connections, while higher values signify denser connectivity (Pujara et al., 2017).

Table 3 Experimental results over DBpedia and Wikidata KGs
Table 4 Summary of hidden units over DBpedia and Wikidata KGs using both factual and semantics information

Nevertheless, when examining the sparsity levels presented in the final column of Table 1, it becomes evident that, despite the LibraryThing and Last.fm datasets having nearly identical sparsity values, they exhibit substantial disparities in their average features per item ratios. Specifically, the SA-Autoencoder demonstrates superior performance on the Last.fm dataset, characterized by a higher ratio, while its performance is notably inferior on the LibraryThing dataset, where the ratio is lower. Consequently, it is reasonable to infer that this method exhibits a relatively lower sensitivity to dataset sparsity in comparison to the quality of data curation within the Knowledge Graph (KG). Consequently, the comparison between the outcomes of the SA-Autoencoder and the Autoencoder can serve as a valuable means of evaluating data quality within KGs, particularly in the context of recommendation tasks.

On the other hand, we can observe that factual information (F) brings more accurate results for a recommendation task than ontological/categorical knowledge when the datasets are very sparse (as for MovieLens 1M and LibraryThing). This is quite an interesting result. A possible explanation is that categorical information introduces fewer connections among item descriptions than factual information. Hence, factual statements prove more useful in making denser connections among items (and then users), which are eventually exploited by the latent collaborative part of the semantics-aware autoencoder.

If we focus only on the absolute numbers, one may argue that the SA-Autoencoder is not competitive, as it is slightly beaten in terms of accuracy by state-of-the-art algorithms such as BPRSLIM and WRMF (although it is the second best performing approach). Nevertheless, we point out that, unlike the other approaches based on matrix factorization (or any deep learning technique), with the SA-Autoencoder we compute a meaningful and explicit user profile which contains user preferences on single features. This may prove extremely useful when we want to automatically generate a content-based explanation for the ranking computed in the recommendation list, as also shown in Bellini et al. (2018). Thus, although we rely on a deep learning approach, we can go beyond the pure black-box method and provide a human-understandable explanation for a recommendation list.

As for diversity, we can observe that when using Wikidata, values for Gini index are in general lower than the DBpedia case. This means that using Wikidata we are able to better diversify the items in a catalog. This result somehow reinforces the one obtained in Nguyen et al. (2015). Lack of diversity in recommendation strongly depends on the so-called “popularity bias”: popular items tend to be recommended more than those in the long tail. This observation leads us to a possible interpretation on diversity results we obtain in our experiments if we consider the popularity of entities in a KG as the number of connections they have. In DBpedia we have that popular resources (e.g. movies) are more connected to other nodes than unpopular ones. This is not the case with Wikidata where there is a less biased distribution of connections among resources in the graph. Hence, when the adopted KG suffers from a popularity bias in terms of connections among resources, this is inherited by the recommendation dataset, thus affecting the final recommended list of results.

5 Related works

Very few works exist about qualitative studies of Linked Open Data (LOD) knowledge graphs. Ringler and Paulheim (2017) present an analysis of the main KGs, such as DBpedia, Wikidata and YAGO, underlining their differences in coverage and identifying overlapping and complementary parts. They assert that KGs are not easily interchangeable and that each of them has its strengths and weaknesses for a domain-related task. Thus, using a specific KG that is suitable for the task at hand leads to better performance of the overall system. The authors of this work carried out a category-specific analysis, asserting that even if DBpedia and YAGO come from the same source (Wikipedia) and have a quite similar number of instances, there are notable differences in coverage. YAGO has five times the number of events of DBpedia, while DBpedia has four times as many settlements (i.e., cities and towns) as YAGO; Wikidata, in turn, contains twice as many persons as DBpedia and YAGO. They conclude their investigation by providing a coverage summary for some popular classes. A comparative survey of some popular KGs is presented in Färber et al. (2018), in which the authors propose a method to find the most suitable KG for a given task setting. To achieve this result, the authors identify a set of characteristics they found relevant to describe a KG and then compare different KGs accordingly. Furthermore, Färber et al. (2018) provide a more detailed analysis of the quantitative information stored in KGs by using several statistics such as the number of triples and classes, the distribution of classes and corresponding instances, and domain and class coverage. Finally, to select the KG that best fits the task requirements, a novel method that takes into account the KG’s quantitative assessment is proposed.

In the last few years, thanks to increasing computational resources, new techniques based on deep learning have been successfully adopted in the recommendation scenario (Cheng et al., 2016). These techniques led to the development of different neural network models, each yielding interesting results (Singhal et al., 2017). In particular, some of these models turned out to be more effective than others for a specific recommendation task; e.g., autoencoders have proved their strength in collaborative filtering, outperforming state-of-the-art approaches (Sedhain et al., 2015), while Recurrent Neural Networks seem to be more suitable for session-based recommendation (Hidasi et al., 2016). Among the several autoencoder extensions, denoising autoencoders have been used effectively to address the recommendation problem by improving the learning of user profiles (Wu et al., 2016) or by obtaining a smaller, non-linear representation of the User-Item rating matrix (Strub et al., 2016). Furthermore, Dong et al. (2017) show how to build a hybrid RS by integrating side information in CF deep learning techniques, alleviating the sparsity problem and improving overall system performance (Dacrema et al., 2019).

LOD are increasingly adopted in recommender systems because they provide more complex structured data that leverage the relationships among entities in the graph. Moreover, they encode, to some extent, the semantics behind the data (Di Noia et al., 2012; de Gemmis et al., 2015). Recent works leveraged the data encoded in KGs to represent items, achieving interesting results in recommendation scenarios (Oramas et al., 2017; Di Noia et al., 2016, 2012). Other approaches, such as the one proposed by Ostuni et al. (2014), use graph kernels to compute item similarities by matching their local neighborhood graphs. LOD have also been exploited to measure semantic distances between resources in order to provide top-N recommendations (Piao & Breslin, 2016). Bellini et al. (2017) presented a novel method that combines deep learning techniques and KGs, modeling a semantics-aware neural network that explicitly computes user profiles for recommendation tasks. In particular, they focus on cold-start scenarios, using DBpedia as a source of information for both user and item modeling. In Bellini et al. (2018), the authors used the aforementioned method to perform experiments in recommendation scenarios, using the DBpedia KG on three different datasets, and compared their approach with state-of-the-art algorithms. Furthermore, they investigated how effectively their method, by leveraging a KG, can provide explanations in recommendation scenarios (Bellini et al., 2018). Another approach that uses KGs to represent the relationship between users and items is investigated in Ristoski et al. (2019), in which the authors leveraged language models to extract features from node sequences in RDF graphs. Recently, there has also been much interest in the topic of KG completion, whose aim is to infer hidden relationships between entities in a KG. Taking advantage of the progress made in this area, the authors of KTUP (Cao et al., 2019) proposed to apply KG completion methods to users' preferences towards items in a catalogue, in order to jointly train a model that exploits both the interactions between users and items and the characteristics of the items in a KG.

6 Conclusions and future work

In this paper, we showed how it is possible to combine the predictive power of NNs (in the form of autoencoders) with the representational power of KGs such as DBpedia and Wikidata. We found that the choice of the KG affects the results of a KG-aware approach to recommendation, and that some combinations of KG, recommendation model, and task domain yield better recommendations than others. In particular, we evaluated an SA-autoencoder on different configurations of DBpedia and Wikidata, and we compared its results, in terms of accuracy and diversity, with those of other recommendation models used as baselines, in three different recommendation domains. We showed that selecting the right information from the right KG may heavily affect recommendations in terms of both accuracy and diversity of results.

As a prominent example, Wikidata seems to be a better choice than DBpedia if we look for recommendation lists where the popularity bias is mitigated: our results show that Wikidata allows the SA-autoencoder to tackle the popularity bias and to better diversify the items drawn from a catalog, by also recommending less popular items.

As for future work, we are developing a new version of the SA-autoencoder – one that better exploits collaborative information by building a global model encoding preferences from all users at the same time, as done by state-of-the-art CF approaches – with the aim of improving the experiments presented here.