1 Introduction

The amount of data currently available on the Internet is too large to be processed by humans. As stated in Borchers et al (1998), in several situations users are not able to carry out decision-making tasks efficiently; this problem is called information overload.

Recommender systems (RSs) represent the class of algorithms proposed to solve this problem; they build user profiles to match human needs with the information available online, thus supporting the decision-making process. We use RSs every day in several aspects of our life, for example, to get suggestions about movies to watch, music to listen to, and books and news to read.

The earliest approaches to implementing these systems relied on either user-item interactions or content-based features. The first group of approaches, defined as collaborative filtering RSs (Schafer et al (2007)), is based on the intuition that similar users tend to appreciate similar items. Although quite precise in recommending items, this approach is affected by the cold-start problem (Lika et al (2014)). In parallel, content-based RSs (De Gemmis et al (2015); Musto et al (2022)) have emerged since the early 1990s. These RSs exploit item descriptions and structured properties (such as the genre of a movie, or the author of a book) to generate representations of the items and build user profiles that include their preferences (for example, horror movies for a user who expressed this preference); then, a matching module finds the most suitable item for the user by comparing the item representations with the user profile, and finally the user gets the recommendation.

Both approaches evolved over the years: deep learning has been applied to collaborative filtering RSs, and deep learning matrix factorization models now represent the state of the art in this area (Xue et al (2017)); on the other hand, the rich semantics provided by plain text and the advances in natural language processing improved the performance of content-based approaches (Lops et al (2009, 2011); Musto et al (2014)). Nowadays, one of the most investigated research lines concerns hybrid RSs, which combine several paradigms (e.g., collaborative filtering and content-based approaches) in order to take advantage of both worlds. The first attempts in this direction, discussed in Sun et al (2019), concern the integration of content-based features (defined as side information) in collaborative filtering methods. More recently, this research line has evolved into Knowledge-Aware Recommender Systems (KARSs) (Tarus et al (2018)).

KARSs are RSs that use several knowledge sources to represent users and items and provide effective recommendations by exploiting several kinds of knowledge. These methods have their roots in the area of knowledge-based recommender systems (Felfernig et al (2015)), where the knowledge used to drive the recommendation process was typically encoded in the form of constraints. Today, the landscape in the area is much more heterogeneous, since KARSs can rely on diverse knowledge sources: structured, such as Knowledge Graphs (KGs) (He et al (2020); Vashishth et al (2019)); unstructured, such as plain text and user reviews (Polignano et al (2021); Asani et al (2021)); and multimodal (Bonnin and Jannach (2014); Andjelkovic et al (2019); Van den Oord et al (2013); Deldjoo et al (2016)). The main intuition behind KARSs is that enriching item representations with the provided knowledge improves the performance of RSs, since the resulting representations are more precise and meaningful. Several works, such as Musto et al (2017), Sun et al (2018), Anelli et al (2019) and Petruzzelli et al (2024), confirm this intuition, generally showing that the performance of RSs improves when knowledge-derived features are encoded in the representation learning process.

In this scenario, a promising research line regards the exploitation of Knowledge Graph Embedding (KGE) techniques. These techniques, discussed in Cai et al (2018), aim at representing nodes and relations that exist in a KG into a vector space in which the structured properties of the original KG are still preserved; during the last decade, several approaches have been proposed to effectively learn graph embeddings, starting from the simplest ones, the geometric models, such as TransE proposed in Bordes et al (2013), to the most sophisticated ones, based on neural networks, such as CompGCN proposed in Vashishth et al (2019). Such techniques obtained very good performance in different scenarios and tasks where data can be modeled as a graph, including biology and social networks (Goyal and Ferrara (2018)), or RSs (Palumbo et al (2018); Sun et al (2018); Song et al (2019)).

As we said, KGEs showed very good performance in RSs, but our conjecture is that purely data-driven and non-symbolic representations (see Footnote 1) can be further improved. As stated in Sarker et al (2021), the current wave of neuro-symbolic AI systems aims at overcoming the historical dichotomy between pure symbolic AI and machine learning, by proposing approaches that take advantage of both worlds, such as the easy explainability of symbolic AI systems and the generalization power of machine learning.

Despite these considerations, the adoption and implementation of neuro-symbolic AI systems in RSs is currently overlooked. This is not surprising, since most RS research has investigated data-driven approaches without considering symbolic approaches at all. However, some symbolic representations, such as those based on first-order logic (FOL) rules (Smullyan (1995)), could be exploited for RSs as well, to improve the embedding learning process by injecting the explicit knowledge they provide, and to obtain more precise user and item representations. For example, let us consider the following FOL rule: “If you like a movie and the movie has a sequel, then you will probably enjoy the sequel” (see Footnote 2); this kind of statement, expressed with the formalism of FOL, might provide some semantics that can improve the embeddings, but these mechanisms are overlooked in the literature related to RSs. Guo et al (2016) and Guo et al (2018) proposed two models to tackle this issue: these models can jointly learn embeddings based on both explicit knowledge encoded in a KG and background knowledge expressed in terms of FOL rules involving the entities of the KG itself; this is made possible by a loss function that considers both knowledge sources.

In this paper, we present a knowledge-aware recommendation model leveraging neuro-symbolic graph embeddings that exploit first-order logic rules. We start from a KG encoding both user preferences and structured item properties; next, we propose a framework composed of:

  1. a Rule Miner, which aims at mining FOL rules from the KG, to obtain background knowledge;

  2. a Graph Embedding Learner, which represents the nodes in the KG (in particular, users and items) in a vector space; the key point of this graph embedding learner is that it encodes not only the explicit information provided by the KG, but also the background knowledge provided by FOL rules;

  3. a Neural Recommender System, which uses the embeddings learned by the graph embedding learner to feed a deep architecture that generates recommendation lists for users.

The novelty of the work we propose lies in the injection of FOL rules into the KGE learning process, which brings in information previously not considered, such as constraints and background knowledge. According to the classification of neuro-symbolic AI systems presented in Kautz (2022), this recommendation model can be labeled as a neuro-symbolic approach: first, it starts with a symbolic input (a KG); then, neural reasoning is exploited to jointly embed the KG and the symbolic elements (the FOL rules).

In the experimental section, we evaluate the effectiveness of our strategy on three state-of-the-art datasets by analyzing both accuracy and non-accuracy metrics, such as the novelty and the diversity of the recommendations. The results show that the combination of KG embeddings and FOL rules leads to an improvement in prediction accuracy and in the novelty of the recommendations. In addition, our approach outperforms several baselines, thus supporting the validity of our intuitions. To sum up, the main contributions of this paper are the following:

  • We introduce a recommendation model inspired by neuro-symbolic AI systems that learns a representation of users and items based on the combination of graph embeddings and FOL rules;

  • We design different strategies to select FOL rules to be injected in the embeddings;

  • We evaluate our strategy on three datasets, and we ensure the reproducibility of the experimental protocol.

The rest of the paper is organized as follows. In Sect. 2, we present related works in the area. Next, the different components of our model are introduced in Sect. 3. The experimental methodology is presented in Sect. 4, while the results are shown and discussed in Sect. 5. Finally, conclusions and future works are sketched in Sect. 7.

2 Related work

In this section, we discuss related works. In the following, we first provide some basics of neuro-symbolic AI systems, then we identify relevant literature in the area of RS based on graph embeddings and KARS. For all these works, the distinctive aspects of our approach are highlighted.

2.1 Neuro-symbolic AI systems

As discussed in Sarker et al (2021), neuro-symbolic artificial intelligence, abbreviated as NeSy AI or NSAI, is a research area in the field of artificial intelligence (AI) that aims to unify and integrate neural processing and symbolic reasoning. The term neural refers to neural networks, also known as connectionist systems, while symbolic refers to explicit symbol manipulation and covers tasks involving graphs (such as the knowledge graphs used in this work), natural language question answering, and logic.

The increasing interest in this kind of system is mainly due to the fact that they try to combine the advantages that symbolic and neural systems offer. Indeed, as discussed in Sarker et al (2021), neural systems offer the ability to learn patterns from raw data, while symbolic systems offer high explainability, whose lack is one of the most important issues in neural systems (indeed, the explainability of neural networks is a widely investigated field). The two approaches differ in data representation too: symbolic systems use explicit representations, which are easily understandable by humans. As an example, we can consider the following FOL rule: human(x) ⇒ mammal(x). Humans can easily understand the meaning of this rule (if x is a human, then x is a mammal), and symbolic systems can easily handle this kind of information. On the other hand, in neural systems the information is represented in the form of embeddings, that is, vectors of real numbers representing entities in a high-dimensional, continuous, differentiable vector space, which are not at all interpretable by humans; in order to obtain this representation, a neural network exploits weighted connections between neurons and activation functions that change the values of the embeddings. In order to emphasize the difference in the way neural and symbolic systems represent information, embeddings are said to be sub-symbolic representations (and, accordingly, neural systems are said to be sub-symbolic systems). Finally, while neural systems learn patterns in raw data, symbolic systems reason over manipulated data. According to Valiant (2003), future research in the fields of AI and NeSy AI should investigate how to integrate learning and reasoning mechanisms; in this way, AI systems will be able both to learn from experience and to reason about what they have learned, which is a key point also highlighted in Garcez et al (2019).

According to Kautz (2022), neuro-symbolic AI systems can be classified into five categories:

  • Symbolic-Neural-Symbolic: AI systems in which the input and the output are symbolic information, while the processing is neural.

  • Symbolic [Neural]: AI systems having a symbolic problem solver which uses a neural component as a subroutine.

  • Neural ∪ compile (Symbolic): AI systems in which the symbolic information (e.g., a FOL rule) is compiled away during the neural training; in this way, the rule \({\text{A}} \to {\text{B}}\) becomes an input–output training pair (A, B).

  • Neural → Symbolic: AI systems having a neural module followed, in cascade, by a symbolic one.

  • Neural [Symbolic]: AI systems embedding true symbolic reasoning inside a neural engine.

In this work, we present a Neuro-Symbolic Knowledge Graph Embedding Learner which injects first-order logic rules into the knowledge graph embedding learning process, and the resulting embeddings are used to train a neural recommender system; according to this classification, and given that we used first-order logic as a unified framework to inject FOL rules during the neural graph embedding learning process, we believe that our model falls in the neural ∪ compile (Symbolic) category. Several other categorizations of NeSy AI systems can be made; for further readings, we suggest referring to Kautz (2022).

2.2 Graph embeddings for recommender systems

Although the exploitation of graphs and KGs for the recommendation task was already proposed in the early 2000s by Huang et al (2002), the adoption of graph embedding techniques in this field is more recent. These techniques aim at representing nodes and relations of a graph into a vector space in which the structured properties of the original graph are still preserved; in addition, this latent representation can be easily treated by computers due to the continuous and differentiable nature of the vector space in which embeddings are represented.

In Xie et al (2016), the authors propose an RS for points of interest (POIs) exploiting graph embeddings, in which a bipartite graph expressing user-POI interactions is used to provide recommendations; another solution was proposed in Palumbo et al (2017), but here the graph is tripartite, since it encodes not only user interactions but also item properties; in this way, the authors are able to compute both user-item and item-item similarities, aiming at better recommendation performance. In our work, we also start from a tripartite KG encoding user preferences and item properties. Musto et al (2019) compare node2vec, proposed in Grover and Leskovec (2016), and Laplacian Eigenmaps (Belkin and Niyogi (2002)), showing how exploiting information from KGs can significantly improve recommendation models; these findings are also confirmed by many other pieces of evidence, including Grad-Gyenge et al (2017), Zhang et al (2020), and Forouzandeh et al (2020).

Our work relies on translation models. In particular, our work exploits KALE, proposed by Guo et al (2016), as graph embedding technique. KALE extends the original TransE by including knowledge derived from FOL rules; both graph-derived and rules-derived information is expressed in terms of first-order logic, which acts as a unified framework between them. In this way, it is possible to jointly learn graph embeddings including external knowledge provided by FOL rules. Moreover, since we combine pure symbolic information (FOL rules) with classical deep learning approaches (e.g., TransE), this model can be labeled as neuro-symbolic graph embedding technique.

Generally speaking, this is one of the first attempts investigating the introduction of FOL rules in the graph embedding learning process for the task of recommendation. Other attempts in the direction of neuro-symbolic recommender systems are represented by Carraro et al (2024), where the authors use a logic tensor network to train latent factor models, and by Zhang et al (2022), where logical layers are introduced in a neural network to learn user-attribute rules that are exploited to make the recommendation process more explainable and transparent.

2.3 Knowledge-aware recommender systems

Knowledge-aware recommender systems (KARSs) are a class of RSs that exploit the information encoded in knowledge bases (such as KGs or plain text) to improve users’ and items’ representations and provide better recommendations. The first works in this area relied on linked open data (LOD) and were proposed by Musto et al (2012), Ristoski et al (2014), Piao and Breslin (2016) and Musto et al (2018). More recently, the spread of deep learning has influenced KARSs as well. In Anelli et al (2019), the authors proposed the knowledge-aware hybrid factorization machines (KaHFM), a model that extends the classical matrix factorization technique by exploiting the knowledge encoded in KGs. Given the effectiveness of this model, we used it as a baseline for our experiments.

Of course, all these works are strictly connected to the literature investigating recommender systems based on side information (Liu et al (2019)). However, while RSs with side information originally focused on the injection of descriptive features into collaborative approaches, such as the previously mentioned KaHFM, more recent attempts aim to design and develop more comprehensive architectures that encode heterogeneous kinds of knowledge (collaborative, textual, structured and unstructured) into recommendation algorithms.

Current trends lie in the application of GNNs and GCNs in this field, which iteratively propagate information between nodes up to a certain number of hops (this is called message passing), and then aggregate it to update the node representations. LightGCN (He et al (2020)) relies only on collaborative information (so it cannot be considered a KARS), while other models exploiting GCNs and GNNs use external knowledge as well. For example, KGCN (knowledge graph convolutional network) (Wang et al (2019a)) is a model that learns a representation based on both user-item interactions and item properties encoded in a KG, then propagates this information across the neighbors of each node and aggregates it by using suitable graph convolutional filters to obtain a single representation used for the recommendation. Similarly, KGAT (knowledge graph attention network) (Wang et al (2019a)) uses attention mechanisms (Vaswani et al (2017)) to aggregate neighbors’ information and weight their contributions, in order to identify the most informative ones.

Other KARSs worth mentioning are those based on the combination of several information sources, such as those presented in Polignano et al (2021) and in Spillo et al (2023b). In the first case, the framework relies on graph embeddings learned with TransE (Bordes et al (2013)) or TransH (Wang et al (2014)) and word embeddings learned with BERT (Devlin et al (2019)) or USE (Wang et al (2018)); graph and word embeddings are then combined by concatenation, and finally they are used to feed a neural recommender system. The experimental results show that the combination improves the general performance of the recommender system with respect to the exploitation of the single (graph or word) embeddings. Next, in Spillo et al (2023b), the authors exploit graph embeddings learned with GCNs (in particular, CompGCN (Vashishth et al (2019))) and text embeddings learned with SBERT (Reimers and Gurevych (2019)), which is an extension of BERT adapted to handle and represent sentences instead of words. Then, these two representations are combined with self-attention and cross-attention, resulting in a more precise final representation that obtained very competitive recommendation performance. Finally, there are KARSs based on multimedia information as well: McAuley et al (2015) and Albanese et al (2013) recommend clothes and paintings, respectively, based on image features; similarly, Van den Oord et al (2013) suggest music based on spectrograms, while Deldjoo et al (2016) recommend movies based on visual features.

In our experimental settings, we have considered several baselines, including GCNs, GNNs, KARSs, and approaches based on matrix factorization, both with and without deep learning. The peculiarity of this work lies in the adoption of FOL rules to add more external knowledge during the KGE process, with an application in the recommendation field; to the best of our knowledge, the combination of symbolic and non-symbolic reasoning (logic rules and deep learning) in recommendation tasks has received limited attention.

3 Methodology

In this section, we introduce the modules that compose our model for knowledge-aware recommendations based on neuro-symbolic graph embeddings exploiting FOL rules. First, we describe our KG and what kind of entities and relations it encodes; next, we discuss the mining of FOL rules from the KG; then, we describe how the KGE jointly learns the embeddings by encoding both graph-based and rule-based knowledge; finally, we describe how these embeddings are used to feed the neural recommender systems. The general workflow of this model is presented in Fig. 1.

Fig. 1 Workflow carried out by our model

3.1 Basics of the model

The main idea behind our work is to model users, items, and items’ properties as entities. Let \(U = \{u_1, \dots, u_n\}\) be the set of users, \(I = \{i_1, \dots, i_m\}\) the set of items, and \(P = \{p_1, \dots, p_k\}\) the set of items’ properties; starting from these three sets, we can define a set of entities \(E = U \cup I \cup P\). Now, let \(R\) be a set of relations. We can define a knowledge graph KG as follows:

\(KG = \{(e_h, r, e_t) : e_h, e_t \in E,\ r \in R\}\). This formalism means that KG is a set of triples in which the head and the tail are entities (users, items, or item properties), and the relation is one of the available relations.

As for the set of relations \(R\), it is the set of relations that exist in the knowledge graph KG; since entities can be users, items, or items’ properties, we have two main kinds of relations: those related to collaborative information (user-item connections) and those related to descriptive properties of the items (item-property connections). Formally, \(R_c = \{like, dislike\}\) contains the relations connecting users and items (positive or negative, respectively), while \(R_p\) contains the properties of the items, which are strictly dependent on the domain. As an example, for the movie domain, we might have \(R_p = \{director, starring, genre, subject, \dots\}\). Given \(R_c\) and \(R_p\), the set of relations is \(R = R_c \cup R_p\). Figure 2 shows an example of a tripartite KG in which user preferences and item properties are encoded in a unified representation.
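The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entity names and the specific triples are invented for the example, and only the relation names `like`, `dislike`, `director`, and `genre` come from the text.

```python
# Toy tripartite knowledge graph; all entity names are illustrative assumptions.
users = {"u1", "u2"}
items = {"StarWarsII", "StarWarsIII"}
properties = {"GeorgeLucas", "SciFi"}

entities = users | items | properties            # E = U ∪ I ∪ P

collaborative_relations = {"like", "dislike"}    # R_c
property_relations = {"director", "genre"}       # R_p (domain dependent)
relations = collaborative_relations | property_relations  # R = R_c ∪ R_p

# KG is a set of (head, relation, tail) triples.
kg = {
    ("u1", "like", "StarWarsII"),
    ("u2", "dislike", "StarWarsIII"),
    ("StarWarsII", "director", "GeorgeLucas"),
    ("StarWarsIII", "director", "GeorgeLucas"),
    ("StarWarsII", "genre", "SciFi"),
}


def is_valid_triple(triple):
    """Check that head and tail belong to E and the relation belongs to R."""
    head, relation, tail = triple
    return head in entities and tail in entities and relation in relations


# Every triple of the toy graph respects the definition of KG.
assert all(is_valid_triple(t) for t in kg)
```

Storing the graph as a plain set of triples mirrors the set-based definition directly; a real pipeline would more likely use integer identifiers and an indexed structure.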

Fig. 2 A tripartite knowledge graph. Different kinds of entities (users, items, properties) are highlighted with different colors

First-Order Logic. In this paper, we will make extensive use of terminology borrowed from first-order languages and first-order logic (Smullyan (1995)). The FOL rules exploited in our approach are formally defined as Horn clauses (see Footnote 3), since they are composed of several atoms connected by logical connectives (e.g., \(\vee, \wedge, \Rightarrow\)) and only one atom appears in the head of the clause (the other atoms are in the body of the clause). Each atom is composed of variables (e.g., x, y) and predicates (equivalent to the relations previously introduced). The arity of a predicate is the number of variables it takes as arguments: for example, a predicate taking one argument is said to be unary, while a predicate taking two arguments is said to be binary. Accordingly, the rules exploited in our model are first-order formulas in the form \(\forall x, y: (x, r_s, y) \Rightarrow (x, r_t, y)\), meaning that if \(x, y \in E\) are linked by relation \(r_s \in R\), then they are linked by relation \(r_t\) too. As an example, a simple FOL rule could be the following: \(\forall x, y, z: likes(x, z) \wedge prequel(z, y) \Rightarrow likes(x, y)\). This FOL rule expresses the fact that ‘If a user likes a movie and that movie has a prequel, the user will like the prequel too’. Finally, a rule is said to be grounded if every variable has been replaced by a suitable entity \(e \in E\). A grounded rule based on the previous logical rule can be: likes(user, StarWarsII) ∧ prequel(StarWarsII, StarWarsIII) ⇒ likes(user, StarWarsIII). In this case, the variables x, z, and y have been replaced with user, StarWarsII, and StarWarsIII, respectively.
Of course, a given FOL rule is very likely to have several grounded rules (or groundings); for example, other groundings for the FOL rule presented above might be likes(user, LOTR_TheTwoTowers) ∧ prequel(LOTR_TheTwoTowers, LOTR_TheReturnOfTheKing) ⇒ likes(user, LOTR_TheReturnOfTheKing), or likes(user, SpiderMan) ∧ prequel(SpiderMan, SpiderMan2) ⇒ likes(user, SpiderMan2). The concept of grounded rule (and groundings) is crucial to describe how our model jointly learns embeddings from the KG and the FOL rules.
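To make the notion of grounding concrete, the following minimal sketch enumerates the groundings of the example rule over a toy set of triples. The function name `ground_prequel_rule` and the toy triples are assumptions introduced for illustration; the sketch does not reproduce the grounding procedure used by our framework.

```python
def ground_prequel_rule(kg):
    """Enumerate groundings of likes(x, z) AND prequel(z, y) => likes(x, y).

    Returns the sorted list of (x, z, y) bindings whose two body atoms
    both appear in kg as (head, relation, tail) triples.
    """
    likes = [(h, t) for (h, r, t) in kg if r == "likes"]
    prequel = [(h, t) for (h, r, t) in kg if r == "prequel"]
    groundings = []
    for x, z in likes:
        for z2, y in prequel:
            if z == z2:  # the shared variable z must bind to the same entity
                groundings.append((x, z, y))
    return sorted(groundings)


# Toy triples mirroring the examples in the text.
toy_kg = {
    ("user", "likes", "StarWarsII"),
    ("user", "likes", "SpiderMan"),
    ("StarWarsII", "prequel", "StarWarsIII"),
    ("SpiderMan", "prequel", "SpiderMan2"),
}

for x, z, y in ground_prequel_rule(toy_kg):
    print(f"likes({x}, {z}) AND prequel({z}, {y}) => likes({x}, {y})")
```

Each returned binding corresponds to one grounded rule; the head atom likes(x, y) need not already be in the graph, which is what makes the rule informative.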

3.2 Mining first-order logical rules

Given a knowledge graph KG (see Fig. 2), the first task performed by our model is to mine FOL rules that exist in KG; as already discussed, the mined rules are expressed as Horn clauses, so they are clauses in the form ∀x, y, z: predicate1(x, y) ∧ predicate2(y, z) ⇒ predicate3(x, z) in which there is at most one atom in the head of the rule.

As shown in Chen et al (2020) and in Shi and Weninger (2018), methodologies for mining logical rules were initially proposed for the task of knowledge graph completion, which aims at discovering new and missing triples; then, they have been adopted for other tasks. As an example, Karimi and Kamandi (2019) use logical rules for ontology alignment and fact prediction. In our work, we use them to mine the background knowledge encoded in the KG, in the form of patterns that can be useful to improve the embedding learning process. More specifically, the focus of our work is not to extract association rules (Kotsiantis and Kanellopoulos (2006)) that connect two entities (such as, users who liked Star Wars II also liked Star Wars III); rather, we are mainly interested in identifying background knowledge that characterizes several entities and patterns, also including users’ nodes.

In this work, we mined FOL rules with a strategy based on AMIE+ and AMIE 3, proposed in Galárraga et al. (2013, 2015) and Lajus et al (2020), which are specifically designed to mine Horn clauses on large knowledge bases. Before describing how the rule miner works, it is worth pointing out some metrics that are crucial for the mining phase.

  • Length: The length of the rule, expressed as the number of atoms the rule is composed of.

  • Standard Confidence (or Confidence): It expresses how often the head of a rule is true, given its body, in the dataset.

  • Head Coverage: It expresses the proportion of the facts of the head relation in the data that are covered (i.e., correctly predicted) by the rule.

  • PCA-Confidence: It is the confidence computed under the Partial Completeness Assumption (PCA). A pair satisfying the body counts as a counterexample only if the head relation is known for the same subject with some other object; pairs whose subject has no fact at all for the head relation are ignored rather than treated as counterexamples.

  • Positive example: It expresses the absolute number for which the FOL rule holds in the dataset (in other terms, the occurrences of the rule).
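The metrics listed above can be illustrated on a toy graph. The sketch below follows the definitions given in the AMIE papers as we understand them (support as the number of positive examples; head coverage, standard confidence, and PCA-confidence as ratios of that support); it is not the miner's actual implementation. The relation `sequel_of`, the entity names, and the helper names are hypothetical, and the rule being scored is sequel_of(x, z) ∧ director(z, y) ⇒ director(x, y).

```python
def chain_body_pairs(kg, r1, r2):
    """Pairs (x, y) such that r1(x, z) and r2(z, y) hold for some z."""
    first = [(h, t) for (h, r, t) in kg if r == r1]
    second = [(h, t) for (h, r, t) in kg if r == r2]
    return {(x, y) for (x, z) in first for (z2, y) in second if z == z2}


def rule_metrics(kg, r1, r2, head_rel):
    """Score the chain rule r1(x, z) AND r2(z, y) => head_rel(x, y)."""
    head_facts = {(h, t) for (h, r, t) in kg if r == head_rel}
    body = chain_body_pairs(kg, r1, r2)
    support = len(body & head_facts)          # positive examples
    # PCA: keep only body pairs whose subject has at least one head fact.
    subjects_with_head = {h for (h, _) in head_facts}
    pca_body = {(x, y) for (x, y) in body if x in subjects_with_head}
    return {
        "support": support,
        "head_coverage": support / len(head_facts) if head_facts else 0.0,
        "std_confidence": support / len(body) if body else 0.0,
        "pca_confidence": support / len(pca_body) if pca_body else 0.0,
    }


toy_kg = {
    ("M2", "sequel_of", "M1"), ("M1", "director", "D1"), ("M2", "director", "D1"),
    ("N2", "sequel_of", "N1"), ("N1", "director", "D2"), ("N2", "director", "D3"),
    ("P2", "sequel_of", "P1"), ("P1", "director", "D4"),  # P2 has no known director
}

metrics = rule_metrics(toy_kg, "sequel_of", "director", "director")
print(metrics)
```

In this toy graph, the pair (P2, D4) lowers the standard confidence but not the PCA-confidence, because nothing is known about the director of P2; the pair (N2, D2) lowers both, because N2 is known to have a different director.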

The pseudo-code reported in Algorithm 1, taken from Galárraga et al. (2015), describes the strategy used to mine FOL rules from the KG; it takes as input a knowledge base KG and some parameters: the maximum length l of the rules to be mined; the minimum head coverage minHC of the rules; the minimum confidence minConf of the rules to be mined. The rule miner uses breadth-first search; in particular, it starts with a queue composed of all possible rules with an empty body (i.e., rules of size 1, with one atom in the head and an empty body, such as ∅ ⇒ likes(x, y); line 1). Next, each rule is dequeued and its confidence and PCA-confidence are computed: if the rule is closed (which means that each of its variables appears in at least two atoms), its confidence is greater than or equal to minConf, and its PCA-confidence is greater than or equal to the PCA-confidence of all the previously mined rules with the same head (line 5), then the rule is added to the set of the rules to be returned (line 6); if the rule is shorter than l and its confidence can still be improved (line 8), then the rule is refined by adding all the possible atoms to the body, and each new rule is added to the queue of the rules to be considered for the mining (lines 9–11). To speed up rule mining, following Galárraga et al. (2015), several strategies have been applied, including pruning strategies, parallelization, and lazy computation of confidence scores; in this way, the system scales to large knowledge bases and, at the same time, is still able to compute the exact confidence and support values of each rule without approximations.

Algorithm 1: Algorithm used for mining FOL rules, taken from Galárraga et al. (2015). Input: knowledge graph KG; maximum length l; minimum head coverage minHC; minimum confidence minConf.

The output of this step is a set of logical rules (line 16) with related scores expressing to what degree the rules hold: we used these scores to split the rules into several subsets and to identify the most promising ones. An example of the output can be found in Table 1, which reports some real FOL rules mined from a movie recommendation dataset. Along with the FOL rules, the table also shows some of the already mentioned metrics (in particular, the head coverage, the standard confidence, and the number of positive examples).

Table 1 Some examples of rules mined by our framework. In the table, we also report the metrics we used for the definition of the heuristics

Next, based on these metrics, we defined some selection heuristics to split the rules into subsets and inject them into the graph embedding learning process. Since the literature does not provide established criteria to identify relevant rules, we designed our subsets by considering different heuristics. In particular, the heuristics we defined are based both on the metrics that are typically used to assess the validity of FOL rules (i.e., confidence, coverage, positive examples matching the rule) and on some background knowledge about the recommendation task (i.e., rules having a like predicate in the head).

Based on these initial selection heuristics, eight different subsets are defined and evaluated in this work. In particular:

  1. All rules (ALL), including all the FOL rules that have been mined by the rule miner;

  2. Low Standard Confidence (LSC), including rules with a standard confidence value greater than 0.2;

  3. Medium Standard Confidence (MSC), including rules with a standard confidence value greater than 0.5;

  4. Medium–High Standard Confidence (MHSC), including rules with a standard confidence value greater than 0.75;

  5. High Standard Confidence (HSC), including rules with a standard confidence value greater than 0.9;

  6. Low Head Coverage (LHC), including rules with a head coverage value greater than 0.2;

  7. High Positive Examples (HPE), including rules with a high number of positive examples. Since different datasets may have rules with very different values for this metric, we identified one specific threshold value for each dataset: (i) for Last.FM, we set 100 as threshold value; (ii) for DBbook, we set 250 as threshold value; (iii) for ML1M, we set 1000 as threshold value;

  8. Like in the head of the rule (LH), including rules that have, in their head, an atom including the relation likes. We believe this is a reasonable choice since we are addressing the recommendation task, for which the like relation is crucial.

Even though these are general heuristics, we can justify our choices. In particular, rule-based configuration (1) is useful to assess whether all the FOL rules mined from the graph help to improve the accuracy of the model; rule-based configurations (2–5) are useful to assess whether it is possible to choose a good-for-all standard confidence value to improve the embeddings; moreover, FOL rules with a high confidence value should, in theory, contribute to the graph completion task, resulting in more paths connecting two nodes, and this might lead to more meaningful and precise embeddings. Similarly, configurations (6–7) consider different metrics for the same purpose. Configuration (8) includes all the rules with the like relation in the head, and it is the set we are most interested in, since the recommendation task focuses on user preferences. In Sect. 4, we assess the effectiveness of our framework across the subsets of rules selected by the different heuristics.
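The selection heuristics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of the framework: the `Rule` record, its field names, the `select_subsets` helper, the toy rules, and the use of a strict comparison for the HPE threshold are all hypothetical; only the threshold values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical record for a mined FOL rule and its quality metrics."""
    body: tuple               # body atoms, e.g. ("likes(x,y)", "prequel(y,z)")
    head: str                 # head atom, e.g. "likes(x,z)"
    std_confidence: float
    head_coverage: float
    positive_examples: int


# Per-dataset thresholds for the HPE subset, as listed in the text.
HPE_THRESHOLD = {"Last.FM": 100, "DBbook": 250, "ML1M": 1000}


def select_subsets(rules, dataset, like_relation="likes"):
    """Split mined rules into the eight subsets (assumes strict '>' comparisons)."""
    hpe = HPE_THRESHOLD[dataset]
    return {
        "ALL": list(rules),
        "LSC": [r for r in rules if r.std_confidence > 0.2],
        "MSC": [r for r in rules if r.std_confidence > 0.5],
        "MHSC": [r for r in rules if r.std_confidence > 0.75],
        "HSC": [r for r in rules if r.std_confidence > 0.9],
        "LHC": [r for r in rules if r.head_coverage > 0.2],
        "HPE": [r for r in rules if r.positive_examples > hpe],
        "LH": [r for r in rules if r.head.startswith(like_relation + "(")],
    }


# Two toy rules with made-up metric values, for illustration only.
rules = [
    Rule(("likes(x,y)", "prequel(y,z)"), "likes(x,z)", 0.95, 0.30, 1200),
    Rule(("sequel(x,y)",), "prequel(y,x)", 0.40, 0.10, 50),
]
subsets = select_subsets(rules, "ML1M")
```

Note that the confidence-based subsets are nested by construction (HSC ⊆ MHSC ⊆ MSC ⊆ LSC ⊆ ALL), so comparing them amounts to trading rule quantity against rule reliability.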

3.3 Learning graph embeddings

Once the rules are extracted, the joint learning based on triples in the KG and FOL rules is carried out. In this work, we used a model based on KALE (Guo et al (2016)). The neuro-symbolic nature of this model lies in the fact that the embeddings for each entity in the KG are learned by exploiting: (i) explicit knowledge, expressed in the form of triples (ei, rk, ej) that are encoded in the KG; (ii) background knowledge, expressed as FOL rules and learned as explained in Sect. 3.2.

The hallmark of our approach lies in the fact that both explicit and background knowledge are represented in a unified framework that learns a comprehensive representation based on both information sources. Generally speaking, our model is based on KALE, a graph embedding technique that in turn inherits the principles of TransE and extends it by introducing FOL rules in the learning process. Given the good performance of TransE in recommendation tasks, as shown by several pieces of evidence including Palumbo et al (2018) and Polignano et al (2021), the choice of exploiting a method that relies on this algorithm can be considered reasonable.

Our KGE model implements joint learning based on triples from the KG and logical rules. Based on previous work by Rocktäschel et al. (2014, 2015), joint training is possible since triples from a KG can be seen as atoms in first-order logic (e.g., likes(Alice, Kill Bill) or starring(Uma Thurman, Kill Bill)). Indeed, each relation that connects two entities in a graph can be seen as a binary predicate (that is, a predicate which takes two arguments), and the two entities can be seen as arguments of the predicate.

In this way, the triple of the KG (Alice, like, KillBill) can be written in terms of FOL as follows: like(Alice, KillBill), where like is a binary predicate applied to the arguments Alice and KillBill. In Table 2, we report how the KG in Fig. 2 can be written as a set of FOL facts.

Table 2 Mapping of triples in a KG in FOL formalism

Given that rules are also expressed in a logical form, it is possible to exploit first-order logic as the common framework that unifies the representations and makes such joint learning possible. To start the learning process, a set F+ consisting of positive training elements is built. In particular, F+ contains: (i) all the atomic formulas based on the KG (i.e., triples in the form (ei, rk, ej)); (ii) grounded rules based on the logical rules previously extracted.

Given a logical rule in the form ∀x, y: (x, rs, y) ⇒ (x, rt, y), a grounded rule is obtained by replacing variables x and y with real entities ei, ej ∈ E. By referring again to the toy example in Fig. 2, if the rule ∀x, y, z: likes(x, y) ∧ prequel(y, z) ⇒ likes(x, z) is learned, the following grounded rule is generated: likes(user, StarWars) ∧ prequel(StarWars, StarWarsII) ⇒ likes(user, StarWarsII). Of course, replacement shall be bound only to the triples that exist in the graph (i.e., combinations of entities that do not exist in the graph do not constitute a valid grounding). Once the set F+ is obtained, a negative training set F− shall be provided as well. To build F−, for each (ei, rk, ej) ∈ F+, a corresponding negative triple is obtained by replacing either ei or ej with a random entity e ∈ E.
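The grounding and negative sampling steps can be illustrated with the following sketch. It is a simplified example under stated assumptions: the toy triples, the helper names `ground_chain_rule` and `corrupt`, and the choice to resample when the corrupted triple equals the original are hypothetical, and only chain rules with two body atoms are handled. A grounding is considered valid here when both body triples exist in the graph.

```python
import random

# Toy KG; entity and relation names are illustrative.
triples = {
    ("user", "likes", "StarWars"),
    ("StarWars", "prequel", "StarWarsII"),
    ("StarWarsII", "starring", "SamuelLJackson"),
}
entities = sorted({e for h, _, t in triples for e in (h, t)})


def ground_chain_rule(triples, r_body1, r_body2, r_head):
    """Ground  r_body1(x, y) AND r_body2(y, z) => r_head(x, z)  on existing body triples."""
    by_relation = {}
    for h, r, t in triples:
        by_relation.setdefault(r, []).append((h, t))
    groundings = []
    for x, y in by_relation.get(r_body1, []):
        for y2, z in by_relation.get(r_body2, []):
            if y == y2:  # the shared variable y must bind to the same entity
                body = ((x, r_body1, y), (y, r_body2, z))
                groundings.append((body, (x, r_head, z)))
    return groundings


def corrupt(triple, entities, rng):
    """Replace either the head or the tail with a random entity (resampling no-ops)."""
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if candidate != triple:
            return candidate


groundings = ground_chain_rule(triples, "likes", "prequel", "likes")
negative = corrupt(("user", "likes", "StarWars"), entities, random.Random(0))
```

On this toy graph, the only valid grounding is the one given in the text, with head likes(user, StarWarsII).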

Finally, the joint learning takes place by minimizing a margin-based ranking loss (see Formula 1), enforcing positive training examples to have larger truth values than negative ones:

$$ \mathop {\min }\limits_{{\left\{ e \right\},\left\{ r \right\}}} \sum\limits_{{f^{ + } \in {\text{F}}^{ + } }} {\sum\limits_{{f^{ - } \in {\text{F}}^{ - } }} {[\gamma - {\text{I}}(f^{ + } ) + {\text{I}}(f^{ - } )]_{ + } } } $$
(1)

such that \(\left\| {\text{e}} \right\|_{2} \le 1,\;\forall {\text{e}} \in {\text{E}}\) and \(\left\| {\text{r}} \right\|_{2} \le 1,\;\forall {\text{r}} \in {\text{R}}\), with γ a margin separating positive and negative examples, and [x]+ ≜ max{0, x}. Even though we are only interested in the embeddings of the entities (nodes), it is worth pointing out that Formula 1 is used to learn both entity and relation embeddings.
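As a small numerical illustration of Formula 1, the sketch below evaluates the hinge term on a few pairs of soft truth values, which are taken as given here. It assumes that each positive example is paired with its own negative one, and it uses the margin of 0.1 reported in the experimental settings. The helper names are hypothetical, and rescaling a vector onto the unit ball is only one common way to enforce the norm constraint, not necessarily the one used by the model.

```python
import math


def hinge(x):
    """[x]_+ = max{0, x}."""
    return max(0.0, x)


def margin_ranking_loss(truth_pairs, gamma=0.1):
    """Sum of [gamma - I(f+) + I(f-)]_+ over (positive, negative) truth-value pairs."""
    return sum(hinge(gamma - pos + neg) for pos, neg in truth_pairs)


def project_to_unit_ball(vector):
    """Rescale a vector so that its L2 norm does not exceed 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return list(vector) if norm <= 1.0 else [v / norm for v in vector]


# (I(f+), I(f-)) pairs with made-up truth values.
pairs = [(0.9, 0.2), (0.5, 0.45), (0.3, 0.6)]
loss = margin_ranking_loss(pairs)
```

Only the pairs whose positive truth value does not exceed the negative one by at least γ contribute to the loss: here the first pair contributes nothing, while the second and third contribute 0.05 and 0.4.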

To complete the training, a function I: F → [0, 1] is also needed, which assigns to each training example (i.e., atomic and complex formulas) a soft truth value, indicating how likely a triple is to hold or to what degree a ground rule is satisfied.

Given a triple f+ ∈ F+, the computation of I(f+) is based on TransE (Bordes et al (2013)), since each triple is modeled such that ei + rk ≈ ej. Accordingly, each triple is scored through the distance \(\left\| {{\text{e}}_{i} \, + \,{\text{r}}_{k} \, - \,{\text{e}}_{j} } \right\|_{1}\), as in Eq. 2:

$$ {\text{I}}({\text{f}}^{ + } ) = {\text{I}}({\text{e}}_{i} ,{\text{r}}_{k} ,{\text{e}}_{j} ) = {1} - \frac{1}{{3\sqrt {\text{d}} }}\left\| {{\text{e}}_{i} + {\text{r}}_{k} - {\text{e}}_{j} } \right\|_{1} $$
(2)

where d is the dimension of the embedding space. Of course, the same holds for negative triples f− ∈ F−.
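Eq. 2 can be checked numerically with the sketch below; the function name `triple_truth` and the toy vectors are illustrative assumptions. The normalization by 3√d keeps the value in [0, 1] whenever the three vectors have L2 norm at most 1, since in that case the L1 norm of ei + rk − ej is bounded by √d times its L2 norm, which is at most 3.

```python
import math


def triple_truth(e_i, r_k, e_j):
    """Soft truth value of a triple: 1 - ||e_i + r_k - e_j||_1 / (3 * sqrt(d))."""
    d = len(e_i)
    l1_distance = sum(abs(a + b - c) for a, b, c in zip(e_i, r_k, e_j))
    return 1.0 - l1_distance / (3.0 * math.sqrt(d))


# A triple whose translation holds exactly gets truth value 1.
perfect = triple_truth([0.5, 0.0, 0.0, 0.0], [0.25, 0.0, 0.0, 0.0], [0.75, 0.0, 0.0, 0.0])

# A poorly fitting triple: the L1 distance is 2 and d = 4, so I = 1 - 2/6.
poor = triple_truth([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])
```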

On the other hand, rules are modeled with t-norm fuzzy logics (Hájek 2013): the truth value of a complex formula is given by the composition of the truth values of its constituent triples through specific t-norm based logical connectives. In particular, truth values are computed recursively; supposing f1 and f2 are atomic or complex formulae, truth values of complex formulae are computed as follows:

$$ {\text{I}}({\text{f}}_{{1}} \wedge {\text{f}}_{{2}} ) = {\text{I}}({\text{f}}_{{1}} )\cdot{\text{I}}({\text{f}}_{{2}} ) $$
$$ {\text{I}}({\text{f}}_{{1}} \vee {\text{f}}_{{2}} ) = {\text{I}}({\text{f}}_{{1}} ) + {\text{I}}({\text{f}}_{{2}} ) - {\text{I}}({\text{f}}_{{1}} )\cdot{\text{I}}({\text{f}}_{{2}} ) $$
$$ {\text{I}}(\neg {\text{f}}_{{1}} ) = {1} - {\text{I}}({\text{f}}_{{1}} ) $$
$$ {\text{I}}({\text{f}}_{{1}} \Rightarrow {\text{f}}_{{2}} ) = {\text{I}}({\text{f}}_{{1}} )\cdot{\text{I}}({\text{f}}_{{2}} ) - {\text{I}}({\text{f}}_{{1}} ) + {1} $$

where the truth values of the atomic formulae I(f1), I(f2), …, I(fn) are computed according to Eq. 2. Since we are most interested in FOL rules, it is worth providing an example for the last formula, which covers the logical implication.

Let us consider the following FOL rule, composed of two atoms:

$$ {\text{f}} \triangleq ({\text{e}}_{i} ,{\text{r}}_{k} ,{\text{e}}_{j} ) \Rightarrow ({\text{e}}_{i} ,{\text{r}}_{l} ,{\text{e}}_{j} ) $$

The truth value of f is given by

$$ {\text{I}}({\text{f}}) = {\text{I}}({\text{e}}_{i} ,{\text{r}}_{k} ,{\text{e}}_{j} )\cdot{\text{I}}({\text{e}}_{i} ,{\text{r}}_{l} ,{\text{e}}_{j} ) - {\text{I}}({\text{e}}_{i} ,{\text{r}}_{k} ,{\text{e}}_{j} ) + {1}\;{\text{where}} $$
$$ {\text{I}}({\text{e}}_{i} ,{\text{r}}_{k} ,{\text{e}}_{j} ) = {1} - \frac{1}{{3\sqrt {\text{d}} }}\left\| {{\text{e}}_{i} + {\text{r}}_{k} - {\text{e}}_{j} } \right\|_{1} {\text{and}} $$
$$ {\text{I}}({\text{e}}_{i} ,{\text{r}}_{l} ,{\text{e}}_{j} ) = {1} - \frac{1}{3\sqrt d }\left\| {{\text{e}}_{i} + {\text{r}}_{l} - {\text{e}}_{j} } \right\|_{1} $$

which refer to Eq. 2.

Let us now consider a rule composed of three atoms:

$$ {\text{f}} \triangleq ({\text{e}}_{i} ,{\text{r}}_{k} ,{\text{e}}_{j} ) \wedge ({\text{e}}_{j} ,{\text{r}}_{l} ,{\text{e}}_{m} ) \Rightarrow ({\text{e}}_{i} ,{\text{r}}_{x} ,{\text{e}}_{m} ) $$

Its truth value will be computed as follows:

$$ {\text{I}}({\text{f}}) = {\text{I}}({\text{e}}_{i} ,{\text{r}}_{k} ,{\text{e}}_{j} )\cdot{\text{I}}({\text{e}}_{j} ,{\text{r}}_{l} ,{\text{e}}_{m} )\cdot{\text{I}}({\text{e}}_{i} ,{\text{r}}_{x} ,{\text{e}}_{m} ) - {\text{I}}({\text{e}}_{i} ,{\text{r}}_{k} ,{\text{e}}_{j} )\cdot{\text{I}}({\text{e}}_{j} ,{\text{r}}_{l} ,{\text{e}}_{m} ) + {1} $$

where according to Eq. 2,

$$ \begin{gathered} {\text{I}}({\text{e}}_{i} ,{\text{r}}_{k} ,{\text{e}}_{j} ) = {1} - \frac{1}{{3\sqrt {\text{d}} }}\left\| {{\text{e}}_{i} + {\text{r}}_{k} - {\text{e}}_{j} } \right\|_{1} , \hfill \\ {\text{I}}({\text{e}}_{j} ,{\text{r}}_{l} ,{\text{e}}_{m} ) = {1} - \frac{1}{{3\sqrt {\text{d}} }}\left\| {{\text{e}}_{j} + {\text{r}}_{l} - {\text{e}}_{m} } \right\|_{1} ,\;{\text{and}} \hfill \\ {\text{I}}({\text{e}}_{i} ,{\text{r}}_{x} ,{\text{e}}_{m} ) = {1} - \frac{1}{{3\sqrt {\text{d}} }}\left\| {{\text{e}}_{i} + {\text{r}}_{x} - {\text{e}}_{m} } \right\|_{1} . \hfill \\ \end{gathered} $$

Of course, the same process can be generalized to formulae with an arbitrary number of atoms.
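The recursive composition of truth values can be sketched as follows. The four connectives mirror the formulas given above, while the helper `rule_truth`, which folds an arbitrary number of body atoms with the conjunction before applying the implication, and the numeric truth values are illustrative assumptions.

```python
def t_and(a, b):
    """I(f1 AND f2) = I(f1) * I(f2)."""
    return a * b


def t_or(a, b):
    """I(f1 OR f2) = I(f1) + I(f2) - I(f1) * I(f2)."""
    return a + b - a * b


def t_not(a):
    """I(NOT f1) = 1 - I(f1)."""
    return 1.0 - a


def t_implies(a, b):
    """I(f1 => f2) = I(f1) * I(f2) - I(f1) + 1."""
    return a * b - a + 1.0


def rule_truth(body_truths, head_truth):
    """Truth value of (b1 AND ... AND bn) => head, given the atoms' truth values."""
    body = 1.0
    for truth in body_truths:
        body = t_and(body, truth)
    return t_implies(body, head_truth)


# Three-atom rule with body truth values 0.9 and 0.8 and head truth value 0.5:
# 0.9 * 0.8 * 0.5 - 0.9 * 0.8 + 1 = 0.64
three_atom_rule = rule_truth([0.9, 0.8], 0.5)
```

A rule is fully satisfied (truth value 1) when its body is false or its head is true, and its truth value decreases as the body becomes more likely while the head becomes less likely.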

In this way, it is possible to score both triples deriving from the KG and triples derived from the FOL rules; thanks to the score, it is then possible to start the iterative training process, during which neuro-symbolic embeddings will be learned, encoding explicit knowledge deriving from the KG, and background knowledge deriving from the FOL rules, thus leading to a different representation of users and items.

3.4 Recommendation framework

Once the neuro-symbolic embeddings are learned, the neural recommender system can be trained by feeding it with such embeddings. Given a user u ∈ U and an item i ∈ I, the aim of this RS architecture is to predict to what extent the user u will be interested in the item i by providing a recommendation score; these scores are used to generate a recommendation list, which is computed by sorting the items in descending order of recommendation score.

Figure 3 shows the architecture of our deep neural network: the embeddings representing the user u and the item i (previously learned by our graph embedding learner) are each fed into three dense layers with decreasing dimensions and ReLU as activation function. Then, the two resulting embeddings are concatenated to obtain a unique representation, and this new embedding is fed into three further dense layers with decreasing dimensions: the first two use ReLU as activation function, while the last one has size 1 and uses a sigmoid as activation function, to produce a recommendation score ∈ [0, 1], which estimates the probability that the item i is relevant for user u; these scores are finally used to generate the recommendation list. The number of dense layers we used in the architecture is based on design choices that are common in the literature (e.g., Spillo et al (2023b); Gu et al (2018); Liu et al (2020)) and is the result of a tuning of the architecture. Our design choice relies on the intuition that stacking more layers of gradually smaller dimensions allows the model to better capture nonlinearities between the input and the output. In turn, this allows the model to better approximate a function (in our case, the recommendation function). This is in line with recent advances in the area of automatic dimensioning (De Filippo et al (2022)).

Fig. 3 Recommendation framework architecture

Formally, let \({\vec{\text{u}}}\) and \({\vec{\text{i}}}\) be the neuro-symbolic embeddings related to user u and item i, respectively; then, they are fed into three dense layers with decreasing dimensions (with sizes 512, 256, and 128) and ReLU as activation function:

$$ {\text{u}}_{dense} = {\text{ReLU}}({\text{W}}_{u\_dense} * {\vec{\text{u}}} + {\text{b}}_{u\_dense} ) $$
$$ {\text{i}}_{dense} = {\text{ReLU}}({\text{W}}_{i\_dense} * {\vec{\text{i}}} + {\text{b}}_{i\_dense} ) $$

where udense and idense are the reduced embeddings related to u and i, respectively, Wu_dense and Wi_dense are the learnable parameters of the layers, and bu_dense and bi_dense are the biases. For the sake of brevity, we report here just one application of this layer.

Then, the two embeddings are concatenated, and the resulting embedding is fed into two layers with reducing dimensions (64 and 32) and ReLU activation function, and one layer with size 1 and sigmoid activation function. The output is the recommendation score between user u and item i.

$$ {\text{concat}}_{dense} = {\text{ReLU}}({\text{W}}_{concat\_dense} * ({\text{u}}_{dense} \oplus {\text{i}}_{dense} ) + {\text{b}}_{concat\_dense} ) $$
$$ {\text{score}} = {\text{sigmoid}}({\text{W}}_{score} * {\text{concat}}_{dense} + {\text{b}}_{score} ) $$

This neural architecture is trained by exploiting the user ratings in the form (u, i) available in our datasets; moreover, since we are training an architecture to understand which are positive and negative elements for some users, this model has been trained by using the binary cross-entropy as loss function, and the accuracy as evaluation metric to be optimized. More details about the training, the size of the layers, the optimizer and the other hyperparameters will be provided in the next section. After training, the model is able to predict to what extent user u would like unseen items i ∈ I, from which a recommendation list consisting of the top-k items for each user is generated.
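The forward pass of the architecture described above can be sketched in plain Python as follows. This is a didactic sketch under stated assumptions, not the released Keras implementation: the layer sizes follow the text (512, 256, 128 for each branch, then 64, 32, and 1), while the helper names, the uniform weight initialization, the zero biases, and the toy input size k = 8 are hypothetical.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dense(x, weights, biases, activation):
    """One dense layer: activation(W * x + b)."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def init_stack(in_dim, sizes, rng):
    """Randomly initialized stack of dense layers (illustrative initialization)."""
    stack = []
    for out_dim in sizes:
        bound = 1.0 / math.sqrt(in_dim)
        weights = [[rng.uniform(-bound, bound) for _ in range(in_dim)]
                   for _ in range(out_dim)]
        stack.append((weights, [0.0] * out_dim))
        in_dim = out_dim
    return stack


def init_params(k, rng):
    return {
        "user": init_stack(k, (512, 256, 128), rng),
        "item": init_stack(k, (512, 256, 128), rng),
        "joint": init_stack(256, (64, 32), rng),  # input: 128 + 128 after concatenation
        "out": init_stack(32, (1,), rng)[0],
    }


def recommendation_score(u, i, params):
    for weights, biases in params["user"]:
        u = dense(u, weights, biases, relu)
    for weights, biases in params["item"]:
        i = dense(i, weights, biases, relu)
    h = u + i  # list concatenation of the two reduced embeddings
    for weights, biases in params["joint"]:
        h = dense(h, weights, biases, relu)
    weights, biases = params["out"]
    return dense(h, weights, biases, sigmoid)[0]


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


rng = random.Random(0)
params = init_params(k=8, rng=rng)  # toy embedding size; the paper uses 256, 512, 768
score = recommendation_score([0.1] * 8, [0.2] * 8, params)
```

With untrained weights the score carries no meaning; the sketch only shows how the two branches are reduced, concatenated, and mapped to a value in (0, 1) that the binary cross-entropy compares against the binary rating.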

4 Experimental evaluation

In the experimental section, we carried out experiments aiming at assessing the effectiveness of our methodology in the task of item recommendation. In particular, experiments were designed to answer the following research questions:

  • RQ1: Does the introduction of logical rules change the elements that are included in the recommendation lists?

  • RQ2: How do the different selection heuristics to pick FOL rules affect the accuracy, novelty, and diversity of the recommendations?

  • RQ3: How does our approach based on neuro-symbolic graph embeddings perform w.r.t. competitive baselines?

4.1 Experimental design

4.1.1 Datasets

Experiments were carried out in movie, book, and music recommendation scenarios. We used state-of-the-art datasets: MovieLens 1M (ML1M)Footnote 4 for the movie domain, DBbook for the book domain, and Last.FMFootnote 5 for the music domain; these datasets can also be found in our public repositoryFootnote 6. In Table 3, some statistics about the datasets are reported; in general, ML1M is the dataset with the highest number of interactions, followed by DBbook, while Last.FM is the smallest one. We can also notice that DBbook and Last.FM are the sparsest datasets, while ML1M is denser. As for the side information encoded in the KGs, we used knowledge encoded in DBpedia; this knowledge has been obtained by exploiting a mapping available onlineFootnote 7: through a SPARQL query, we obtained side information about the items (e.g., the name of a book, or the genre of a movie or an album). The mapping information has been exploited to obtain a KG, whose statistics are reported in Table 4.

Table 3 Statistics of the datasets
Table 4 Statistics of the KGs of the datasets

4.1.2 Protocol

The three datasets have been split into training and test sets with an 80–20% ratio. While ratings from DBbook already came in binary form (positive and negative represented by 1s and 0s, respectively), for Last.FM and ML1M ratings were expressed as integers ranging from 1 to 5; in order to handle them, we considered as positive the ratings equal to 4 and 5, and as negative those from 1 to 3. As for the prediction accuracy evaluation, we generated top-5 and top-10 recommendation lists by following the TestRatings strategy proposed in Bellogin et al (2011). This strategy is often used to evaluate the effectiveness of recommendation algorithms, as shown in Polignano et al (2021) and Spillo et al (2023b). The lists of top-k items have been obtained by scoring (through the deep architecture described in Sect. 3) all the items i in the test set for each user u; these items have then been sorted in descending order of score, and the ranking has been cut at the first k entries. Given that the findings we obtained with top-5 and top-10 recommendations were similar, for the sake of brevity we just report here the results for top-5 recommendations. Results for top-10 recommendations are available in our repositoryFootnote 8.
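The binarization of the ratings and the construction of a top-k list can be sketched as follows. This is a minimal illustration, assuming that the predicted scores for the test items of one user are available as a dictionary; the helper names `binarize` and `top_k` and the toy scores are hypothetical.

```python
def binarize(rating, positive_from=4):
    """Map a 1-5 rating to a binary label: 4 and 5 are positive, 1 to 3 negative."""
    return 1 if rating >= positive_from else 0


def top_k(test_scores, k=5):
    """Rank one user's test items by predicted score and keep the first k."""
    ranked = sorted(test_scores.items(), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:k]]


labels = [binarize(r) for r in (1, 2, 3, 4, 5)]

# Made-up scores for the test items of a single user.
recommended = top_k({"i1": 0.2, "i2": 0.9, "i3": 0.5, "i4": 0.7}, k=2)
```

Since only the items in the test set of each user are scored, a user with fewer than k test items receives a shorter list.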

4.1.3 Source code and parameters

Source Code. The source code of this model, including the datasets used, rule mining, the neuro-symbolic KGE model, and the deep recommender system, is publicly available on GitHubFootnote 9. The source code of the rule miner is based on AMIEFootnote 10, while the implementation of our KGE model injecting FOL rules is based on KALEFootnote 11. As for the source code of the deep recommendation architecture, it extends the implementation made available by the authors of Polignano et al (2021) released on GitHubFootnote 12.

First-Order Logic Rules. FOL rule mining has been configured to mine rules with at most four atoms (in other words, the maximum length of the rules), with a minimum confidence equal to 0.001 and a minimum coverage equal to 0.1. This choice comes after several runs of the strategy, based on a trade-off analysis between algorithmic complexity and knowledge provided by the FOL rules. Experiments with different numbers of atoms are left as future work.

Embeddings. Embeddings were learned at three different sizes: k = 256, k = 512, and k = 768 dimensions; the learning process has been carried out in mini-batches of size 100 for all three datasets, with a separating margin (between positive and negative examples) equal to 0.1, entity and relation learning rates equal to 0.05, and 1000 epochs.

Recommender System. The training of the neural recommender system has been carried out for 25 epochs, with a batch size equal to 512; we used Adam as optimizer, a learning rate equal to 0.001, and default values set by Keras for beta parameters (which control the moving average; default values are 0.9 for beta 1, and 0.999 for beta 2). As loss function, we used the binary cross-entropy. Other optimizers (De Filippo et al (2021)) will be evaluated as future work.

4.1.4 Experimental configurations

In this work, we carried out experiments aimed at comparing ten different configurations of our recommendation model based on neuro-symbolic graph embeddings combining FOL rules. In particular, we considered two basic configurations, relying only on information from the graph, and eight configurations based on the injection of FOL rules in the KGE process.

For the first basic configuration, we learned graph embeddings encoding only information derived from user-item interactions: for example, referring to Fig. 2, we only encode the triples (Alice, like, KillBill), (Alice, like, StarWarsI), (Bob, like, StarWarsII), (Chloe, like, StarWarsII). For the second basic configuration, we considered not only user-item interactions but also all the structured properties deriving from the KG: for example, again referring to Fig. 2, we encode the triples related to the genre of the movies or the actors starring in them, such as (KillBill, genre, Thriller), (PulpFiction, starring, SamuelLJackson), (StarWarsII, starring, SamuelLJackson). Next, we evaluated eight configurations based on the injection of FOL rules in the KGE learning. The configurations are presented in Sect. 3.2. In Table 5, we report some statistics about the number of rules and groundings found for each dataset; let us recall that, given a FOL rule, a grounding for that rule is given by the same rule in which all variables have been replaced with real entities in the KG. It is necessary to observe that, for Last.FM, the set of like rules is empty, and for this reason, it will not appear in the analysis and discussion of the results.

Table 5 Statistics about the used subsets of FOL rules

4.2 Baselines and evaluation metrics

Evaluation Metrics. The evaluation has been performed by using the ElliotFootnote 13 framework (Anelli et al (2021)), to guarantee reproducibility of the experimental protocol; as for the metrics, we considered accuracy metrics, such as Precision, Recall, F1 score, Mean Average Precision (MAP), and nDCG. Moreover, we also evaluated non-accuracy metrics, such as diversity and novelty: in particular, we considered the Gini Index for diversity, and expected free discovery (EFD) and expected popularity complement (EPC) for novelty, defined as in Vargas and Castells (2011). In all these cases, metrics were calculated based on the implementations available in Elliot. In particular, it is worth emphasizing that the value reported as “Gini” in the tables is calculated as 1 − Gini (so, the higher the better).
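For reference, the sketch below shows textbook definitions of some of the accuracy metrics for a single user with binary relevance. It is an illustration only: the results in this paper are computed with the implementations available in Elliot, whose exact conventions (e.g., for lists shorter than k) may differ from these hypothetical helpers.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def f1_at_k(ranked, relevant, k):
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the list divided by the DCG of an ideal list."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


# Toy top-5 list and set of relevant test items for one user.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "z"}
precision = precision_at_k(ranked, relevant, 5)
recall = recall_at_k(ranked, relevant, 5)
f1 = f1_at_k(ranked, relevant, 5)
ndcg = ndcg_at_k(ranked, relevant, 5)
```

In this toy case two of the five recommended items are relevant, so precision is 2/5 and recall is 2/3; nDCG additionally rewards placing the relevant items near the top of the list.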

Baselines. As regards the baselines, we compared our results to ten different baselines available in the Elliot frameworkFootnote 14: three matrix factorization techniques, namely SLIM (Ning and Karypis (2011)), BPRMF (Rendle et al (2012)) and PureSVD, three methods based on deep learning models, namely MultiVAE (Liang et al (2018)), CFGAN (Chae et al (2018)) and NGCF (Wang et al (2019b)), and four algorithms implementing knowledge-aware techniques, namely Item and User-KNN encoding side information (Gantner et al (2011)), the previously mentioned KaHFM proposed in Anelli et al (2019), LightGCN (He et al (2020)) and the basic vector space model (VSM). All the algorithms are run with their optimal parameters, selected by the Elliot framework. In our repositoryFootnote 15, we have added a folder containing all the scripts to find the optimal hyperparameters and to run the baselines with those values. To assess statistical significance, we used the t test implementation available in the Elliot framework. In particular, we compared the recommendation lists provided by our model with those generated by the baselines. These tests are computed for each considered metric.

5 Results and discussion

In this section, we discuss the results of the experimental section: we first provide a qualitative analysis of the representations of the items, and then we answer the research questions.

5.1 RQ1: impact of FOL rules on recommendation lists

In this section, we first present some results assessing the impact of FOL rules on the lists of the recommendations. In other terms, we want to investigate whether and how the introduction of FOL rules changes the items recommended to the users.

First, we present some qualitative results obtained by considering some real recommendation lists generated by our recommendation model based on neuro-symbolic graph embeddings with FOL rules. Results refer to the movie recommendation domain and are reported in Table 6; we considered three users and compared the recommendation lists generated for the two basic configurations (user-item and user-item-properties, shortened as UI and UIP), and one configuration based on FOL rules (the one using the rules with the relation like in the head). It is worth observing that encoding different information can have a huge impact on the recommendation list: for example, for User 6039, only one movie appears in all three recommendation lists (“North by Northwest”), while only two movies are shared by the UIP and the UIP + LH recommendation lists (the already mentioned “North by Northwest” and “Citizen Kane”). This confirms what we discussed in the previous section, that is, different encoded knowledge, although deriving from the same source (the KG), leads to different representations. Other interesting observations can be made by analyzing the recommendation lists for User 5743, where the impact of FOL rules is even clearer. The two basic configurations share four movies out of five, so they are almost the same list, except for one element and the ordering of the others. The UIP + LH list, instead, shares only two movies with the basic configurations (“Wings” and “Roman Holiday”), while the other three movies are completely new. This shows how injecting FOL rules in the graph embedding process can affect not only the representation of items but also the recommendation task.
However, encoding different information does not always lead to such differences, as the recommendation lists for User 12 show; in this case, the three recommendation lists are very similar to each other, although some differences in the ordering, as well as some movies not occurring in the other lists, can be observed. Next, together with the qualitative results presented above, we also carried out a more systematic analysis of the variations in the recommendation lists based on the Kendall-Tau (KT) coefficient and nDCG. In both cases, we compared the top-5 recommendation lists (i.e., for all the users) obtained with the different subsets of FOL rules, based on the implementation of KT available in Sci-Kit LearnFootnote 16. In both cases, lower values indicate a larger change in the lists, since the KT coefficient measures the overlap in terms of elements in the recommendation lists, while the nDCG score measures how similar a recommendation list is w.r.t. a ground truth. In our case, the ground truth is represented by the recommendation lists obtained by running the simple UIP configuration, that is to say, the one based on the user-item-properties graph without any FOL rule. The results are presented in Figs. 4, 5 and 6.

Table 6 Top-k real recommendations provided by different configurations of our model in the movie field, for Users 6039, 5743, and 12; for this example, we set k = 5
Fig. 4 Values of Kendall-Tau’s test and nDCG metric for the Last.FM dataset

Fig. 5 Values of Kendall-Tau’s test and nDCG metric for the DBbook dataset

Fig. 6 Values of Kendall-Tau’s test and nDCG metric for the ML1M dataset

As for Last.FM, the results show that KT scores range between 0.31 and 0.37. This means that all the configurations based on FOL rules change some of the items in the recommendation lists w.r.t. the basic UIP configuration. However, the nDCG scores show that the change is small: they range from 0.966 to 0.974, and nDCG ranges from 0 to 1. Considering the statistics presented in Table 5, this is not surprising, since the number of rules (and groundings) mined from the data is very low. Accordingly, it makes sense that this leads to a minimal change in the recommendation lists.

Next, by considering DBbook data, we note a more significant change, since nDCG scores range from 0.56 to 0.82. In particular, six out of eight configurations show lower nDCG scores, which means that many items change due to the introduction of FOL rules. Similar outcomes also emerge from ML1M data, since all the configurations obtain an nDCG score around 0.82. While the score does not differ across the configurations, we note that a lower dimension of the embeddings (k = 256) leads to a lower nDCG score (so, more change in the items). In the experimental evaluation, we assess whether the change in the lists leads to an increase in terms of both accuracy and non-accuracy metrics.

5.2 RQ2: FOL heuristics comparisons

To answer RQ2, we compared the performance of the eight configurations based on neuro-symbolic knowledge graph embeddings to the two basic configurations based only on user-item and user-item-properties information (shortened, again, as UI and UIP), respectively. Results for the music domain are reported in Table 7, those for the book domain in Table 8, and those for the movie domain in Table 9. In these tables, we report the performance obtained by our model considering not only the different heuristics used to select the FOL rules to be injected during the graph embedding learning process, but also the dimension k of the embeddings we learned, as mentioned in Sect. 3.

Table 7 Results of our model on the Last.FM dataset
Table 8 Results of our model on the DBbook dataset
Table 9 Results of our model on the ML1M dataset

In particular, as for the music domain (see Table 7), we can observe that in most cases the configurations based on the injection of FOL rules outperform the two basic configurations, especially when considering the configurations with k = 256 and k = 768. For the k = 256 setting, the best configuration is clearly UIP + LSC, that is, the configuration obtained by injecting all FOL rules with a Standard Confidence > 0.2, whose precision, recall, F1 score, Gini index, and EFD score exceed those of the baselines and of the other configurations; this means that this configuration is not only accurate, achieving the best performance in several accuracy metrics, but also provides more diverse and novel recommendation lists. The configuration UIP + HSC, which is obtained by injecting FOL rules with a Standard Confidence > 0.9, obtained the best results in the other accuracy metrics (MAP, MAR, and nDCG) and achieved the best performance in terms of EPC as well. The setting k = 512 shows a more varied scenario: the basic configuration based only on user-item interactions (UI) obtained the best results in terms of precision, recall, and F1 score, thus leading the accuracy metrics; the best MAP and the best nDCG have been obtained by the configuration UIP + ALL (the configuration injecting all the available FOL rules), while the configurations UIP + MHSC and UIP + HSC provided the most diverse recommendation lists; moreover, the latter configuration obtained the best F1 score among the configurations with FOL rules. As for the setting k = 768, the best configuration is UIP + LHC (rules with head coverage > 0.2), which is not only accurate (best MAP, MAR, and nDCG), but also provides the recommendation lists with the best novelty. UIP + MHSC obtained the best results in terms of precision and F1, while UIP + ALL achieved the best F1 score. Finally, the basic configuration UIP provided the most diverse recommendation lists.

Let us now move to the book domain (Table 8). As for the setting k = 256, the overall best configuration is UIP + HSC (rules with standard confidence > 0.75), which got the best performance in terms of MAP, MAR, and nDCG for the accuracy metrics; moreover, it turned out to be the configuration with the most diverse recommendation lists, and obtained good performance in terms of novelty as well. The best precision has been reached by the configuration UIP + LSC, while the best recall and F1 score have been achieved by UIP + LH (rules having the relation like in their head), which also got the highest EPC score. Generally speaking, and in line with the setting k = 256 of the previous domain, the basic configurations have been outperformed by the configurations including FOL rules. As for k = 512, similarly to the previous case, the basic configurations obtained the best performance in almost all metrics: in particular, UIP turned out to be the best configuration in terms of MAP (equal score to UIP + HPE, rules with a high number of positive examples), MAR, recall, F1, nDCG, EFD, and EPC; the highest MAR has been obtained by UIP + LHC, while the best diversity of the recommendation lists has been achieved by the configurations UIP + HPE and UIP + LH. Finally, for the setting k = 768, we can easily notice that the configurations with rules outperformed the two baselines, and UIP + LSC is the configuration that achieved the best results. In particular, it got the best MAR, nDCG, novelty, and diversity, while UIP + HSC got the best results in terms of accuracy (precision, recall, and F1).

Finally, let us discuss the results obtained in the movie domain (Table 9). For k = 256, the configurations using FOL rules outperformed the two baselines. In particular, the overall best configuration is UIP + LH, which achieved the best accuracy in terms of precision, MAR, and nDCG, while UIP + HSC got the highest MAP and UIP + MHSC got the highest recall and F1. The highest novelty was obtained by UIP + MHSC for EFD and by UIP + LSC for EPC, while the basic configuration UI turned out to provide the most diverse recommendation lists. UIP + MHSC achieved the highest diversity among the configurations based on rules. In the setting k = 512, unlike in the previous domains, the configurations with FOL rules outperform the two basic configurations in almost all metrics. In fact, UIP + LHC and UIP + MHSC dominate the accuracy metrics, with the highest precision, recall, F1, and nDCG for UIP + LHC, and the highest MAP and MAR for UIP + MHSC. The configuration UIP + LSC is the one that provided the most novel recommendation lists, and the highest diversity was obtained again by UI; UIP + MSC provided the highest novelty among the configurations using rules. The setting k = 768 follows the others: the highest accuracy was obtained by UIP + LSC and UIP + MSC (MAP, MAR, and nDCG for UIP + LSC; precision, recall, and F1 score for UIP + MSC); the configuration UIP + HSC provided the most novel recommendation lists; it also achieved the best diversity among the configurations using FOL rules, but the highest diversity was obtained again by UIP. These results lead us to think that, in the case of the ML1M dataset, which is denser than the previous two and has many more interactions (about 1 M), the properties might negatively influence the diversity, while the FOL rules tend to increase it, without bringing it back to the best value obtained by UI.

Take-Home Messages. Generally speaking, exploiting FOL rules in the knowledge graph embedding learning process positively influences the performance of the recommender system in almost all cases. As for Last.fm and DBbook, the injection of FOL rules improves all the metrics for k = 256 and k = 768, and six out of nine metrics (all except Precision, Recall, and F1) for k = 512. As for MovieLens, the results are even better, since the only exception is the Gini index, whose score decreases when FOL rules are injected. Overall, the best-performing configuration is one that also encodes rules, with the only exception of Precision, Recall, and F1 on Last.fm. In all the other comparisons, the overall best results exploit the FOL rules mined by our framework.

As regards the performance of the different heuristics, a very heterogeneous behavior emerges. First, it is important to point out that injecting all the available rules is not the optimal solution. Indeed, the best rule-based configuration is never the one labeled as UIP + ALL. Similarly, using only the rules that have a like in the head is not the best choice. Indeed, while the UIP + LH configuration got the best partial results in some comparisons (i.e., Recall and F1 on DBbook with k = 256, or some metrics on ML1M with k = 256), other subsets of rules tend to obtain better results. Accordingly, it seems that the injection of FOL rules can also be useful to identify patterns in the data that do not necessarily identify or clarify the preferences of the users.

In particular, the configurations that use confidence scores to select the rules are the most promising ones. As for Last.fm, the UIP + LSC configuration (based on rules with a low confidence threshold) got the best results on Precision, Recall, F1, and Gini index. Good performance was also obtained by UIP + LHC, which got the best results on the remaining five metrics. In this case, the coverage is used as the criterion to select the rules.

Next, considering the DBbook data, the best results are obtained with the UIP + HSC configuration. In this case, fewer rules are injected, since the system only exploits rules having a confidence higher than 0.9. The adoption of this subset allowed us to beat all the other configurations on all the metrics, with the exception of Precision and F1. Accordingly, this confirms the positive impact of the rules in better ranking the items and in making the recommendations more novel and diverse. A similar behavior emerged on the ML1M data, since the exploitation of rules with high confidence led to the best results in terms of novelty and diversity, while the adoption of the coverage as a heuristic improved the accuracy metrics.

Overall, we can state that the most promising subsets are those that use confidence (LSC and HSC) and coverage (LHC) as heuristics to select the rules. As regards the threshold to be adopted to cut the selection of the rules, some relationship between the structure of the graph and the number of rules to be injected seems to exist. Indeed, with smaller KGs (such as DBbook), which have a lower number of entities and a lower number of properties per item (see Table 3), it is better to inject fewer rules. Conversely, when more data are available, more rules can be exploited (hence the LSC configuration). For example, this holds for ML1M, where a huge number of groundings are injected in the model (see Table 5). This gives a general guideline that can both (i) guide the design of further implementations based on our subsets and (ii) provide a deeper understanding of the results of our model.
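The selection heuristics discussed above can be illustrated with a minimal sketch. It only assumes that each mined rule carries a head relation, a standard confidence, and a head coverage; the class Rule, the function select_rules, and the toy rules are hypothetical names and values introduced for illustration, and are not taken from our implementation or from the mined rule sets. The thresholds are those quoted in the text (0.2 for LSC and LHC, 0.9 for HSC).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for a mined FOL rule (illustration only)."""
    head_relation: str     # relation appearing in the rule head, e.g. "like"
    std_confidence: float  # standard confidence of the rule
    head_coverage: float   # head coverage of the rule


# One predicate per subset label; thresholds as quoted in the text.
SELECTORS = {
    "ALL": lambda r: True,
    "LSC": lambda r: r.std_confidence > 0.2,
    "HSC": lambda r: r.std_confidence > 0.9,
    "LHC": lambda r: r.head_coverage > 0.2,
    "LH": lambda r: r.head_relation == "like",
}


def select_rules(rules, subset):
    """Return the rules that belong to the given subset label."""
    keep = SELECTORS[subset]
    return [r for r in rules if keep(r)]


# Toy rules with made-up scores, only to exercise the selectors.
rules = [
    Rule("like", 0.95, 0.30),
    Rule("genre", 0.50, 0.10),
    Rule("like", 0.15, 0.25),
    Rule("director", 0.25, 0.05),
]

subset_sizes = {name: len(select_rules(rules, name)) for name in SELECTORS}
print(subset_sizes)
# {'ALL': 4, 'LSC': 3, 'HSC': 1, 'LHC': 2, 'LH': 2}
```

As the toy output suggests, a low confidence threshold keeps most of the rules, while a high one keeps only a few; this is the lever behind the guideline of injecting fewer rules for smaller KGs.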

To conclude, our analysis showed that the heuristics we designed are able to automatically identify a subset of the rules that is useful to improve the accuracy of the recommendations. Of course, further qualitative and quantitative analyses (i.e., based on the kind of properties mentioned in the rules) will be carried out to go into the details of the single rules injected in the model, in order to better assess the kind of information that is needed or useful to improve the quality of the recommendation process. These analyses are left as future work.

5.3 RQ3: comparisons to baselines

To answer RQ3, we carried out experiments aimed at comparing the results obtained by our recommender system based on neuro-symbolic graph embeddings with some competitive state-of-the-art recommendation models. The results of the comparison can be found in Table 10 for the music domain (Last.fm), in Table 11 for the book domain (DBbook), and in Table 12 for the movie domain (ML1M). In particular, we compared the performance of the best configuration in each domain with all the baselines considered. Generally speaking, it is easy to observe that our model outperformed all baselines for almost all the considered metrics, with the only exception represented by EASER. Moreover, most of these improvements are statistically significant (p < 0.05).
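To make the notion of statistical significance on paired results more concrete, the following is a minimal sketch of an exact paired sign-flip (permutation) test on per-user scores. This section does not state which test was used, so the choice of test is an assumption made only for illustration; the function name paired_sign_flip_p_value and the per-user scores are hypothetical and are not values from Tables 10 to 12.

```python
from itertools import product


def paired_sign_flip_p_value(scores_a, scores_b):
    """Two-sided exact paired permutation test on the summed differences.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so all 2^n sign assignments are enumerated (small n only).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        n_total += 1
        # Small tolerance guards against floating-point rounding.
        if stat >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total


# Toy per-user scores (made up): the model is better for every user.
model = [0.5, 0.6, 0.7, 0.4, 0.8, 0.6, 0.9, 0.5]
baseline = [0.4, 0.5, 0.5, 0.3, 0.6, 0.5, 0.7, 0.4]

p_value = paired_sign_flip_p_value(model, baseline)
print(p_value)  # 0.0078125, i.e. 2 / 2**8
```

With eight users and a consistent improvement, only the two sign assignments that keep or flip all signs are as extreme as the observed one, so the difference would be significant at p < 0.05 under this assumed test.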

Table 10 Comparisons to baselines in the music field (Last.fm dataset)
Table 11 Comparisons to baselines in the book field (DBbook dataset)
Table 12 Comparisons to baselines in the movie field (ML1M dataset)

In the music domain, the configuration of our model that got the best overall performance is the one injecting FOL rules with a standard confidence score greater than 0.2 (the UIP + LSC configuration reported in Table 7). From Table 10, we can observe that our model obtains satisfying results, since they are comparable with those obtained by the best baseline (i.e., EASER) in terms of precision, recall, and F1 measure. Moreover, we got the best results in terms of novelty (measured through EFD). As for ranking measures, such as nDCG, we obtain lower results w.r.t. EASER, but we are still in line with the next best-performing techniques, such as BPRMF, ItemKNN, and PureSVD. In our opinion, to better understand the nature of these results, it is necessary to consult the statistics presented in Tables 4 and 5. Indeed, both tables show that the number of items mapped to the knowledge graph and the number of FOL rules are generally low, especially when compared to ML1M. Accordingly, it is likely that the sparsity of the graph and the low quality of the information mined from the graph contributed to these results.

As for the book domain, Table 11 shows that our approach does not outperform all the baselines we considered. Unfortunately, this holds for all the metrics, since there is at least one baseline that beats our approach on each of them. As we already noted for Last.fm, EASER got the best results. However, in this case the gap is statistically significant. This behavior can have multiple causes that are independent of the choice of exploiting FOL rules in the model (indeed, as shown in Table 8, the use of FOL rules outperforms the basic configurations, whose performance would have been even worse). In our opinion, it is very likely due to the characteristics of the dataset: as shown in Table 4, only 1931 out of 6698 items in the dataset (about 29%) are mapped to the knowledge graph. Accordingly, most of the users and most of the items do not exploit the information coming from either descriptive properties or logical rules. It is therefore not surprising that a knowledge-aware method such as the one we propose does not beat advanced data-driven approaches such as BPRMF, which do not exploit exogenous knowledge and only rely on the available items. However, further investigations will be carried out to better understand the behavior of the model on this dataset.

Finally, let us discuss the results obtained in the movie domain (ML1M), which are reported in Table 12. These results better show the effectiveness of our approach, since the best configuration (the one injecting FOL rules with head coverage scores greater than 0.2, labeled as UIP + LHC, as reported in Table 9) obtained the best results for all the accuracy metrics. Our model also obtained the best results in terms of novelty, with wide gaps with respect to the baselines. All these differences are statistically significant with p < 0.01 (with the only exception of MAR). Also in this case, EASER emerged as the best baseline, with CFGAN obtaining only the best diversity. However, these results are generally not surprising. ML1M is a denser and larger dataset than the previous two, where a larger number of FOL rules are mined and exploited by our model. Accordingly, this naturally improves the performance w.r.t. other methods; moreover, for the same reason, the impact of external knowledge is less pronounced, and this improves methods based on matrix factorization and, more generally, collaborative filtering. To conclude, we can state that the comparison to the state of the art showed that our approach is particularly suitable and effective when the graph is large enough, and when a large number of rules can be mined and extracted from the triples. In these scenarios, we are even able to beat current state-of-the-art approaches.

Overall, we can say that our model outperformed almost all baselines in almost all accuracy metrics, and with statistically significant improvements with few exceptions; good and significant improvements have been observed also for non-accuracy metrics, such as diversity (Gini index) and novelty (EFD, EPC).

6 Discussion and limitations

Despite these positive aspects of our model, we are aware that there are some limitations, which we will tackle in future work:

  • Choice of the embedding learning model: Although TransE got good performance in the task of graph embedding for recommendation, there are more recent and sophisticated methods to embed graph-derived information, including GNNs and GCNs; exploiting again first-order logic as a unified framework for both triples and FOL rules could be a good strategy to adopt more effective graph embedding models while continuing to inject FOL rules, so as to obtain again neuro-symbolic graph embeddings.

  • Identification of the FOL rules: In this work, we mined FOL rules from the KG itself. Another option could be adding rules provided by domain experts as background knowledge, to discover new knowledge that is not encoded in the original KG.

  • A weighting mechanism for FOL rules: In this work, we treated all FOL rules with the same importance, even when they had different confidence values. An extension of this work might be a weighting mechanism for the rules, aiming to lower the impact of rules with low confidence scores while amplifying the impact of rules with higher confidence scores.

  • An automated FOL rule selection strategy: In this work, we considered eight different subsets of FOL rules, related to eight different heuristics. Although the standard confidence seems to be the best metric to select the most prominent heuristic, an automated mechanism to select the best threshold value is currently lacking, and will be considered as future work.

  • A user study: In this work, we only performed in-vitro experiments; in-vivo experiments, evaluating how this model affects the recommendation lists for real users, are necessary as well.

  • More sophisticated methodologies to provide recommendations: In this work, we used simple concatenation to combine the embeddings related to users and items, but more sophisticated methods might be exploited, such as attention mechanisms. In addition, even though this is beyond the scope of this paper, combining information coming from other sources (e.g., plain text) could further improve the performance of the recommender system.
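As an illustration of the weighting mechanism mentioned among the limitations above, the following minimal sketch scales each rule's loss term by a weight derived from its confidence. It is not part of the model presented in this paper: the functions confidence_weights and weighted_rule_loss, the power parameter, and the toy values are hypothetical, and the per-rule loss terms are assumed to be already computed by the embedding model.

```python
def confidence_weights(confidences, power=1.0):
    """Map rule confidences to weights that average to 1.

    power = 0 gives uniform weights (all rules equally important);
    larger values of power emphasize high-confidence rules more strongly.
    """
    raw = [c ** power for c in confidences]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def weighted_rule_loss(rule_losses, confidences, power=1.0):
    """Average of per-rule loss terms, each scaled by its confidence weight."""
    weights = confidence_weights(confidences, power)
    total = sum(w * loss for w, loss in zip(weights, rule_losses))
    return total / len(rule_losses)


# Toy example (made-up numbers): the violated rule has high confidence,
# so its loss term counts more than under uniform weighting.
losses = [1.0, 0.0]
confidences = [0.9, 0.3]

uniform = weighted_rule_loss(losses, confidences, power=0.0)
weighted = weighted_rule_loss(losses, confidences, power=1.0)
print(uniform, weighted)  # approximately 0.5 and 0.75
```

Normalizing the weights so that they average to 1 keeps the overall scale of the rule loss comparable to the unweighted case, so that the balance with the loss on the triples is not altered by the weighting alone.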

7 Conclusions and future work

In this paper, we presented a model aiming at (i) mining FOL rules from a KG; (ii) learning graph embeddings by injecting the knowledge provided by the FOL rules; (iii) providing users with effective recommendations. We tested this methodology on three datasets related to different domains (music, books, and movies) and carried out extensive experiments to assess the effectiveness of the model.

The experiments we carried out confirmed our initial intuition: first, the introduction of FOL rules has a decisive impact on both item representation and the generation of recommendation lists; then, our model improved the performance with respect to diverse competitive baselines; last but not least, we have been able to derive some guidelines to choose the most promising subset of FOL rules to be injected, with a particular focus on the standard confidence score.

This work represents a first step in the adoption of neuro-symbolic AI in the field of graph embedding for recommendation, and we believe that much more research can be done in this field. As future work, we will address the limitations sketched above, also evaluating the model in terms of the trade-off between sustainability and effectiveness of the approach (Spillo et al (2023a)).