1 Introduction

A recurrent issue in AI is the development of rational agents capable of tailoring their actions and recommendations to the preferences of human users. The spectrum of applications that depend on this ability is extremely wide, ranging from adaptive interfaces and configuration software, to recommender systems and group decision-making (Brafman and Domshlak 2009). In a nutshell, the crucial ingredients for addressing this issue are representation, reasoning and learning. In complex domains, we need a representation that offers a compact encoding of preference relations defined over large outcome spaces. We also need to be able to use this representation effectively in order to answer a broad range of queries. And, since the performance of a decision maker depends on its ability to reflect users’ preferences, we need to be able to predict and extract such preferences in an automatic way.

Among the different preference models that have been devised in the literature, conditional preference networks (CP-nets) have attracted a lot of attention by providing a compact and intuitive representation of qualitative preferences (Boutilier et al. 2004b). By analogy with Bayesian networks, CP-nets are graphical models in which nodes describe variables of interest and edges capture preferential dependencies between variables. Each node is labeled with a table expressing the preference over alternative values of the node given different values of the parent nodes under a ceteris paribus (“all else being equal”) assumption. For example, in a CP-net for movie recommendation, the rule:
\(\mathsf{Date} = \mathit{1950s} : \mathit{comedy} \succ \mathit{drama}\)
might state that, for a film released in the fifties, I prefer a comedy to a drama, provided that all other properties are the same. The semantics of a CP-net is a preference ordering on outcomes derived from this reading of the entries in the conditional preference tables.

Despite their popularity, CP-nets are intrinsically limited to “attribute-value” domains. Many applications, however, are richly structured, involving objects of multiple types that are related to each other through a network of different types of relations. Such applications pose new challenges for devising relational preference models endowed with expressive representations, efficient inference engines, and fast learning algorithms.

In this paper, we introduce the framework of relational conditional preference networks (RCP-nets) that extends ceteris paribus preferences to relational domains. Briefly, an RCP-net is a template over a relational schema which specifies a ground CP-net for each particular set of objects. Based on the ceteris paribus paradigm, the representations provided by RCP-nets are transparent, in that a human expert can easily capture their meaning. For example, in an RCP-net for movie recommendation, the entry:
\(\mathsf{mode}(\mathsf{Movie.Audience.User.Age}) = \mathit{teen} : \mathit{action} \succ \mathit{drama}\)
might capture the stereotype that, all other things being equal, action movies are preferable to dramas if the majority of people in the audience are teenagers.

In essence, the interest of RCP-nets lies in their ability to compare and order relational outcomes. Semantically, a relational outcome consists of a set of objects interconnected by the functional dependencies of the database schema. Two outcomes are “comparable” if they are defined over the same set of objects, but differ in the values assigned to the attributes of objects. For example, in configuration software (Junker 2006), the overall goal of the decision maker is to assemble from an available catalog of generic objects a customized product that meets the user’s preferences. Customized products, such as computers, cars, insurance products and travel packages, can be described as relational outcomes. In recommender systems (Jannach et al. 2010), a fundamental task is to predict how desirable a user would find a new, unrated item. Modeling the interaction of users and items as relational outcomes can help the system in making a better personalized recommendation by incorporating relational information about the user, such as her community in social networks (Golbeck 2006), or about the item, such as the actors, directors and critics of a movie (Melville et al. 2002; Newton and Greiner 2004).

From a computational viewpoint, a key feature of our framework is that the class of acyclic RCP-nets supports efficient inference for two well-studied inference tasks in preference handling: outcome optimization and outcome ranking. In outcome optimization, the decision maker is given a partial outcome in which some object attributes are left unspecified; the task is to find a maximally preferred completion of this outcome. For an acyclic RCP-net, (unconstrained) outcome optimization can be solved in polynomial time using a greedy algorithm that finds a maximally preferred completion of a partial outcome according to some topological order induced by the preference network. In outcome ranking, the decision maker is given a set of outcomes defined over the same collection of objects, but which differ in the values assigned to the object attributes; the task is to rank these outcomes in some non-increasing order of preference. Again, such a task can be solved in polynomial time for acyclic RCP-nets, by compiling the network into a utility function that assigns a score to each outcome under consideration.

The learnability of RCP-nets is analyzed within the online learning setting (Cesa-Bianchi and Lugosi 2006), a well-studied theoretical model for devising algorithms capable of making and updating recommendations in real-time. In this setting, the decision maker observes instances of a reasoning task in a sequential manner. On round t, after observing the tth instance, the decision maker attempts to predict the solution associated with this instance. The prediction is formed by a hypothesis chosen from a predefined class \(\mathcal{N}\) of RCP-nets. Once the prediction is made, the true solution of the instance is revealed, and the decision maker can use this information to choose another hypothesis from the class \(\mathcal{N}\) before proceeding to the next round. As a common thread in online learning, we make no assumption regarding the sequence of instance-solution pairs. This setting is thus general enough to capture agnostic situations in which the “true” preference model is not necessarily an element of the predefined class \(\mathcal{N}\).

To measure the performance of the decision maker, we consider two standard metrics. The first, called regret, measures the difference in cumulative loss between the decision maker and the optimal hypothesis in \(\mathcal{N}\). The second metric is computational complexity, i.e. the amount of computer resources required to choose hypotheses and to predict solutions. Based on these metrics, we show that the class of tree-structured RCP-nets (with bipartite orderings) is efficiently learnable from both optimization tasks and ranking tasks, using linear loss functions. Our online learning algorithm is an extension of the Hedge algorithm (Freund and Schapire 1997) that exploits the Matrix-Tree Theorem (Tutte 1984) for generating directed spanning trees at random.

The paper is organized as follows. After introducing the necessary background in graph theory (Sect. 2), we examine the syntax and semantics of RCP-nets in Sect. 3. The theoretical results concerning reasoning with acyclic RCP-nets and learning with tree-structured RCP-nets are presented in Sects. 4 and 5, respectively. In Sect. 6 we illustrate the learning potential of our framework with experiments on a large dataset. In Sect. 7, we compare our framework with related work and, in Sect. 8, we conclude by mentioning some perspectives of further research.

2 Preliminaries

Before delving into the representation of relational preference networks, we review the basic concepts from graph theory used in this paper. A digraph is a pair \(\mathcal{G}=(\mathcal{X},\mathcal{E})\), where \(\mathcal{X}\) is a nonempty, finite set, and \(\mathcal{E}\) is a binary relation on \(\mathcal{X}\). The elements of \(\mathcal{X}\) are the nodes of \(\mathcal{G}\), and the elements of \(\mathcal{E}\) are the (directed) edges of \(\mathcal{G}\). The size of \(\mathcal{G}\), denoted \(\vert \mathcal{G}\vert \), is given by the number of its edges. Undirected graphs are represented here as digraphs for which the binary relation on nodes is symmetric. Notably, the underlying graph of a digraph \(\mathcal{G} = (\mathcal{X},\mathcal{E})\) is the pair formed by \(\mathcal{X}\) and the symmetric closure of \(\mathcal{E}\).

For a digraph \(\mathcal{G}\) and a pair of nodes X, Y, a walk of length k in \(\mathcal{G}\) from X to Y is a sequence of nodes \((X_{1},\ldots,X_{k+1})\) such that \(X=X_{1}\), \(Y=X_{k+1}\), and \((X_{i},X_{i+1})\) is an edge in \(\mathcal{G}\) for 1≤i≤k. The walk is a path if all nodes are distinct, and a cycle if \(X_{1}=X_{k+1}\) and all intermediate nodes are distinct. For any pair of nodes X, Y, if there is a path of length k from X to Y, then Y is an ancestor of X and X is a descendant of Y. In the specific case where k=1, Y is called a parent of X, and X is called a child of Y. A root (or source) is a node with no parents and, dually, a leaf (or sink) is a node with no children. The in-degree (resp. out-degree) of a digraph \(\mathcal{G}\) is the maximum number of parents (resp. children) per node in \(\mathcal{G}\).

The deletion of an edge (X,Y) from a digraph \(\mathcal{G}\) is the digraph obtained by simply removing (X,Y) from \(\mathcal{G}\). The contraction of (X,Y) in \(\mathcal{G}\) is the digraph obtained by merging X and Y into a new node Z and redefining any edge (X,X′) (resp. (Y′,Y)) as (Z,X′) (resp. (Y′,Z)).

A digraph is acyclic if it contains no cycles. An acyclic digraph \(\mathcal{G}=(\mathcal{X},\mathcal{E})\) is complete bipartite if \(\mathcal{X}\) can be partitioned into two sets \(\mathcal{X}_{1}\) and \(\mathcal{X}_{2}\) such that \(\mathcal{E} = \mathcal{X}_{1} \times\mathcal{X}_{2}\). In the particular case where \(\vert \mathcal{X}_{1}\vert = 1\), \(\mathcal{G}\) is a star. A forest is an acyclic digraph of in-degree one, and a tree is a forest with exactly one root node. A spanning tree (resp. spanning forest) of a digraph \(\mathcal{G} = (\mathcal{X},\mathcal{E})\) is a tree (resp. forest) for which the node set is \(\mathcal{X}\) and the edge set is contained in \(\mathcal{E}\). Let \(\mathcal{K}_{n}\) denote the complete digraph of order n. Then, by Cayley’s formula, the number of spanning trees of \(\mathcal{K}_{n}\) rooted at a fixed node is \(n^{n-2}\), and hence, the number of spanning forests of \(\mathcal{K}_{n}\) is \((n+1)^{n-1}\).

For a digraph \(\mathcal{G}\) with node set \(\{X_{1},\ldots,X_{n}\}\), a linear extension of \(\mathcal{G}\) is a permutation π over {1,…,n} such that, if there is an edge from \(X_{i}\) to \(X_{j}\), then π(i)<π(j). It is well-known that if \(\mathcal{G}\) is acyclic, then a linear extension of \(\mathcal{G}\) can be constructed in \(\mathcal{O}(\vert \mathcal{G}\vert )\) time using a topological sort algorithm.
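As an illustration, a linear extension can be computed with Kahn's topological sort; the sketch below (our own Python rendering, with hypothetical node names) returns the positions π as a dictionary and runs in time linear in the size of the digraph:

```python
from collections import deque

def linear_extension(nodes, edges):
    """Kahn's algorithm: return positions pi such that every edge (x, y)
    satisfies pi[x] < pi[y]; raises if the digraph is cyclic."""
    parents = {v: 0 for v in nodes}      # in-degree counters
    children = {v: [] for v in nodes}
    for x, y in edges:
        children[x].append(y)
        parents[y] += 1
    queue = deque(v for v in nodes if parents[v] == 0)   # start from roots
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            parents[c] -= 1
            if parents[c] == 0:
                queue.append(c)
    if len(order) < len(nodes):
        raise ValueError("digraph is cyclic: no linear extension exists")
    return {v: i for i, v in enumerate(order)}

pi = linear_extension(["a", "b", "c", "d"],
                      [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
print(pi)  # -> {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```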

Finally, a weighted digraph is a triple \(\mathcal{G} = (\mathcal{X},\mathcal{E},w)\), where \((\mathcal{X},\mathcal{E})\) is a digraph and w is a map from \(\mathcal{X} \times\mathcal{X}\) to the set of nonnegative reals, such that w(X,Y)>0 if and only if \((X,Y) \in\mathcal{E}\). For any subgraph \(\mathcal{G}'\) of \(\mathcal{G}\), the weight of \(\mathcal{G}'\) is given by the product of weights of its edges. The Laplacian of \(\mathcal{G}\) is the real matrix \(\boldsymbol {\varLambda} (\mathcal{G})\) over \(\mathcal{X} \times\mathcal{X}\) for which each entry λ(X,Y) is given by:
\(\lambda(X,Y) = \begin{cases} \sum_{Z \neq X} w(X,Z) & \text{if } X = Y,\\ -w(X,Y) & \text{otherwise.} \end{cases}\)
Let \(\boldsymbol{\varLambda}_{X}(\mathcal{G})\) be the matrix obtained by deleting the row of X and the column of X from \(\boldsymbol{\varLambda }(\mathcal{G})\). By the Matrix-Tree Theorem (Tutte 1984, Theorem 6.27), the determinant of \(\boldsymbol{\varLambda }_{X}(\mathcal{G})\) is equal to the sum of weights of all spanning trees of \(\mathcal{G}\) rooted at X. Based on this property, various algorithms have been proposed in the literature for generating in polynomial time random spanning trees of digraphs (see e.g. Kulkarni 1990; Colbourn et al. 1996).
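The theorem can be checked numerically on a small instance; the sketch below is our own code, assuming unit edge weights and the out-degree form of the Laplacian that matches the edge convention above. It recovers Cayley's count \(4^{4-2} = 16\) for the spanning trees of \(\mathcal{K}_{4}\) rooted at a fixed node:

```python
import numpy as np

def laplacian(nodes, weights):
    """Laplacian of a weighted digraph: the diagonal entry of X accumulates
    the total weight leaving X, and the off-diagonal entry (X, Y) is -w(X, Y)."""
    idx = {v: i for i, v in enumerate(nodes)}
    L = np.zeros((len(nodes), len(nodes)))
    for (x, y), w in weights.items():
        L[idx[x], idx[x]] += w
        L[idx[x], idx[y]] -= w
    return L, idx

def count_rooted_spanning_trees(nodes, weights, root):
    """Matrix-Tree Theorem: the determinant of the Laplacian minor at `root`
    equals the total weight of all spanning trees rooted at `root`."""
    L, idx = laplacian(nodes, weights)
    r = idx[root]
    minor = np.delete(np.delete(L, r, axis=0), r, axis=1)
    return np.linalg.det(minor)

# Sanity check against Cayley's formula on K_4 with unit weights.
nodes = [0, 1, 2, 3]
weights = {(x, y): 1.0 for x in nodes for y in nodes if x != y}
print(round(count_rooted_spanning_trees(nodes, weights, 0)))  # -> 16
```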

3 RCP-nets

On the surface, our representation for relational preferences is similar to a probabilistic relational model (Getoor et al. 2002): the representation is structured in a graphical way by exploiting conditional independencies. However, the nature of connections between nodes in the graph is different: whereas conditional probabilities are quantitative and specify a probability measure over the outcome space, conditional preferences are qualitative and specify a strict partial order between outcomes.

3.1 Language

The basic building block of our framework is a relational schema that specifies a database structure. In order to clarify dependencies among the attributes of interconnected objects, the schema is represented as a digraph \(\mathcal{S}\).

The nodes of \(\mathcal{S}\) are separated into class names and attribute names. Intuitively, a class name denotes a type of objects, and an attribute name captures an elementary property that can be attached to a class name. Each attribute name A is associated with two predefined components: a finite domain \(D_{A}\) and a finite set \(\Gamma_{A}\) of aggregate functions, or aggregators. Each \(\gamma_{A} \in \Gamma_{A}\) maps any vector of values from \(D_{A}\) to a single value of \(D_{A}\). Common aggregators include the mode (most frequently occurring value), the median, maximum or minimum (if values are ordinals), and the mean (if values are cardinals). The domain size of \(\mathcal{S}\) is given by the maximum of the sizes of its domains.
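A minimal illustration of aggregators, with hypothetical attribute names; note that only order-free aggregators such as the mode apply to nominal domains:

```python
from statistics import mode, median

# Hypothetical aggregator sets for two attribute names: Genre is nominal
# (only the mode applies), while Rating is ordinal (mode, median, min, max).
aggregators = {
    "Genre":  {"mode": mode},
    "Rating": {"mode": mode, "median": median, "min": min, "max": max},
}

genres = ["comedy", "drama", "comedy", "action", "comedy"]
print(aggregators["Genre"]["mode"](genres))      # -> comedy

ratings = [2, 5, 3, 4, 4]
print(aggregators["Rating"]["median"](ratings))  # -> 4
```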

The edges of \(\mathcal{S}\) capture functional constraints between nodes: they are separated into attributes and references. An attribute is an edge of the form (X,A), also denoted X.A, where X is a class name and A an attribute name. A reference is an edge of the form X.Y where X and Y are class names.

There is a natural correspondence between our representation and that of relational databases. Each class name X is associated with a table and each of its adjacent nodes is associated with a column in the table. For an attribute X.A, the entries in the corresponding column are values in D A , and for a reference X.Y, the entries are foreign keys, each identifying an object in Y. We note in passing that this representation does not prevent us from having complex relationships between entities: using a standard reification technique, each k-ary relationship can be captured by introducing a new class name associated with k references, in which each object corresponds to a row in the relationship. These notions are illustrated in the following example.

Example 1

Suppose we would like to design a movie recommender system that periodically suggests a list of movies to each subscriber. The relational schema, described in Fig. 1, is composed of movies, actors, directors, critics, and users. Each box specifies a class name with its adjacent attribute names (in roman style) and reference names (in italic style); the dotted lines indicate the types of objects referenced. For instance, the class name Cast specifies the rank of actors playing in a movie; this class is associated with the attribute \(\mathsf{Cast.Rank}\) and the references \(\mathsf{Cast.Actor}\) and \(\mathsf{Cast.Movie}\).

Fig. 1
figure 1

A relational schema for the movie domain

A chain is a path in the underlying graph of \(\mathcal{S}\) of the form X.R where X is a class name and R is a (possibly empty) sequence of class names. A slot is an expression of the form X.R.A where X.R is a chain and A is an attribute name connected to the last class name of X.R. Intuitively, X.R.A denotes a binary relation between objects of type X and values of type A. Note that because any slot is a path in the symmetric closure of \(\mathcal{S}\), the relation captured by X.R.A is not necessarily functional. For example, the slot \(\mathsf{User.Audience.Movie.Genre}\) refers to all genres of movies watched by a user. A term is an aggregated slot, that is, an expression of the form γ A (X.R.A) where X.R.A is a slot and γ A is an aggregator in Γ A . The set of all attributes occurring in \(\mathcal{S}\) is denoted \(\mathcal{A(S)}\) and the set of all possible terms that can be generated from \(\mathcal{S}\) is denoted \(\mathcal{T(S)}\).

In this study, preference relations are modeled as strict partial orders. Formally, given an arbitrary set \(\mathcal{X}\), a preference ordering ≻ on \(\mathcal{X}\) is a binary relation on \(\mathcal{X}\) that is irreflexive, antisymmetric and transitive.

Definition 1

For a set \(\mathcal{A} \subseteq\mathcal{A}(S)\) of attributes and a set \(\mathcal{T} \subseteq\mathcal{T(S)}\) of terms, a relational conditional preference network (RCP-net) is a pair N=(par,cpt) such that:

  • par associates to each attribute \(X.A \in\mathcal{A}\) a parent set par(X.A), which is a collection \(\{\gamma_{A_{1}}(X.R_{1}.A_{1}), \ldots, \gamma_{A_{p}}(X.R_{p}.A_{p})\}\) of terms in \(\mathcal{T}\) rooted at X.

  • cpt associates to each attribute \(X.A \in\mathcal{A}\) a conditional preference table \(cpt_{X.A}\), which maps each vector u in \(D_{A_{1}} \times\cdots\times D_{A_{p}}\) into a preference ordering \(cpt_{X.A}(u)\) over \(D_{A}\).

By \(\mathcal{N}[\mathcal{A},\mathcal{T}]\) we denote the class of all RCP-nets defined over the set \(\mathcal{A}\) of attributes taking parents in the set \(\mathcal{T}\) of terms. Each attribute in \(\mathcal{A}\) is said to be controllable. For example, in a movie recommender system, it is legitimate to consider that movie attributes, including the genre, the release date and the duration of a film, are controllable. On the other hand, user attributes such as the age, the gender and the occupation of a person, are typically uncontrollable.

Given an RCP-net \(N \in\mathcal{N}[\mathcal{A},\mathcal{T}]\), the dependency graph of N is the digraph \(\mathcal{G}(N)\) with node set \(\mathcal{A}\) and such that there is an edge from X.A to X′.A′ if and only if X′.A′ is the suffix of some slot in par(X.A). Based on this notion, an RCP-net is acyclic if its dependency graph is acyclic, and tree-structured if its dependency graph is a forest. Finally, an RCP-net is bipartite-ordered (resp. star-ordered) if each of the entries of its conditional preference tables is a complete bipartite digraph (resp. a star).

The parent size of N is the maximum number p of parents per attribute in N. It is important to keep in mind that the parent size of an RCP-net does not necessarily coincide with the in-degree of its dependency graph. In particular, a tree-structured RCP-net N can have parent sets composed of multiple terms, provided that at most one of their suffix attributes belongs to \(\mathcal{A}\).

Example 2

Consider a restricted view of our movie recommendation domain, described in Fig. 2. The RCP-net N is defined over the set \(\mathcal{A}\) of controllable attributes including \(\mathsf{Critics.Rating}\), \(\mathsf{Movie.Duration}\) and \(\mathsf{Movie.Genre}\). We assume here that all attributes are associated with the mode aggregator. The dependency graph of N is depicted in the left part of the figure, while the parent sets and preference tables of N are presented in the right part. For instance, the first entry of the table associated to \(\mathsf{Movie.Duration}\) states that a long movie is preferred over a short one if the aggregated reviews for this film are positive. Based on the above terminology, we can observe that N is both tree-structured and bipartite-ordered. In particular, the parents of \(\mathsf{Movie.Genre}\) are defined over uncontrollable attributes.

Fig. 2
figure 2

A tree-structured RCP-net for the movie domain

3.2 Semantics

Given a schema \(\mathcal{S}\), a skeleton for \(\mathcal{S}\) is a map κ that assigns to each class name X a finite set 〚X〛_κ of objects, and to each reference X.Y a function 〚X.Y〛_κ from 〚X〛_κ into 〚Y〛_κ. We assume that each object is associated to a unique class, i.e. 〚X〛_κ ∩ 〚Y〛_κ = ∅ whenever X ≠ Y. Based on the standard semantics of inverse and composition operations, a skeleton assigns a binary relation to any chain in the schema. A ground chain is an expression of the form o.R where X.R is a chain and o is an object in 〚X〛_κ. The notions of ground attribute and ground term are defined similarly. For a ground chain o.R, we denote by 〚o.R〛_κ the set of objects o′ such that (o,o′) ∈ 〚X.R〛_κ.

Given a skeleton κ and a collection of attributes \(\mathcal{A}\), we denote by \(\mathcal{A}_{\kappa}\) the set of ground attributes o.A such that \(X.A \in \mathcal{A}\) and o ∈ 〚X〛_κ. For an RCP-net N, the ground dependency graph of N with respect to κ is the digraph \(\mathcal{G}_{\kappa}(N)\) with node set \(\mathcal{A}_{\kappa }\) and such that there is an edge from o.A to o′.A′ if and only if there is an edge from X.A to X′.A′ in \(\mathcal{G}(N)\), where X is the class of o and X′ is the class of o′ in κ. Clearly, if \(\mathcal{G}(N)\) is acyclic then \(\mathcal{G}_{\kappa}(N)\) is acyclic.

A relational outcome or interpretation is a map I that extends a skeleton κ by assigning to each attribute X.A a function 〚X.A〛_I from 〚X〛_κ into \(D_{A}\). By \(\mathcal{I}_{\kappa}\), we denote the space formed by all interpretations extending κ. Given a ground attribute o.A, the value assigned by I to o.A is denoted 〚o.A〛_I. More generally, given a ground slot o.R.A where 〚o.R〛_κ = \(\{o_{1},\ldots,o_{n}\}\), we denote by 〚o.R.A〛_I the vector v in \(D_{A}^{n}\) such that \(v_{i}\) = 〚\(o_{i}\).A〛_I for 1≤i≤n.

With these notions in hand, we are now ready to examine the ceteris paribus semantics of relational preference networks. Consider an RCP-net N defined over a set of attributes \(\mathcal{A}\), and a skeleton κ. A pair (I,J) of interpretations extending κ is called a flip on a ground attribute \(o.A \in\mathcal{A}_{\kappa}\) if they are everywhere identical on \(\mathcal{A}_{\kappa}\), except on o.A. Given an attribute X.A with parent set \(\{\gamma_{A_{1}}(X.R_{1}.A_{1}),\ldots,\gamma_{A_{p}}(X.R_{p}.A_{p})\}\) and a flip (I,J) on o.A, we say that I dominates J in N if the value 〚o.A〛_I is preferred to the value 〚o.A〛_J in the entry of the table \(cpt_{X.A}\) associated to the vector:
〚par(o.A)〛_I = \((\gamma_{A_{1}}(\)〚o.R_1.A_1〛_I\(), \ldots, \gamma_{A_{p}}(\)〚o.R_p.A_p〛_I\())\).
By extension, given any pair (I,J) of interpretations in \(\mathcal{I}_{\kappa}\), we say that I dominates J in N, and write \(I \succ_{N} J\), if there is a sequence \((I_{1},\ldots,I_{n})\) of flips such that \(I_{1}=I\), \(I_{n}=J\), and \(I_{i}\) dominates \(I_{i+1}\) in N for 1≤i<n.

Example 3

Consider two relational outcomes \(I_{1}\) and \(I_{2}\) for the schema given in Fig. 2, specified in Table 1. We remark that \((I_{1},I_{2})\) is a flip on the movie genre. Thus, using the conditional preference table of \(\mathsf{Movie.Genre}\), it follows that \(I_{1}\) dominates \(I_{2}\).

Table 1 Two relational outcomes for the movie domain

Finally, we say that an RCP-net N is coherent if, for any skeleton κ, the binary relation ≻ N over \(\mathcal{I}_{\kappa}\) is a preference ordering.

Theorem 1

Any acyclic RCP-net is coherent.

Proof

Let N be an acyclic RCP-net defined over a set of attributes \(\mathcal{A}\), let κ be a skeleton, and suppose that \(I \succ_{N} I\) holds for some \(I \in\mathcal{I}_{\kappa}\). By definition of the dominance relation, there is a sequence \((I_{1},\ldots,I_{n})\) of flips such that \(I_{1}=I=I_{n}\) and \(I_{i}\) dominates \(I_{i+1}\) for 1≤i<n. Let S be the set of all ground attributes o.A in \(\mathcal{A}_{\kappa}\) such that 〚o.A〛_{I_i} ≠ 〚o.A〛_{I_{i+1}} for some flip \((I_{i},I_{i+1})\). Consider any linear extension of S according to \(\mathcal{G}_{\kappa}(N)\), and take the first element o.A in the ordering. Because \(\mathcal{G}_{\kappa}(N)\) is acyclic, there is no parent of o.A in S, and hence the ground condition 〚par(o.A)〛_{I_i} remains unchanged during the whole sequence of flips. Every flip on o.A is therefore resolved within the same entry of its preference table which, being a strict partial order, only allows the value of o.A to move to strictly less preferred values along the sequence. Since \(I_{1}=I_{n}\), the value of o.A must nonetheless return to its initial value, which contradicts the assumption that \(I \succ_{N} I\). □

For the problems under consideration in the remaining sections, we will assume a fixed and known schema \(\mathcal{S}\) of domain size d, in which any aggregator can be implemented by a procedure that runs in \(\mathcal{O}(n\log d)\) time, where n is the maximum number of objects per class name assigned by a skeleton. Because the size of preference tables grows exponentially with the number of parents, we will also assume that the parent size p of any RCP-net is constant.

4 Preference reasoning

For a class of RCP-nets, a reasoning task consists in a set of instances and a set of solutions. Of particular interest here is the class \(\mathcal{N}_{\mathtt {acy}}[\mathcal{A},\mathcal{T}]\) of acyclic RCP-nets. The size of \(\mathcal{A}\) is denoted a, and the maximum of the sizes of terms in \(\mathcal{T}\) is denoted k. The next lemma states that the parent values of an attribute can be retrieved in quasi-linear time using standard join and projection operations.

Lemma 1

Let N be an RCP-net in \(\mathcal{N}_{\mathtt{acy}}[\mathcal{A},\mathcal{T}]\). Then, for any interpretation I and any ground attribute o.A, the vector of parent values 〚par(o.A)〛_I can be constructed in \(\mathcal{O}(t)\) time, where \(t=kn\log_{2}(n)+n\log_{2}(d)\).

Proof

Consider any reference X.Y in the schema \(\mathcal{S}\). The inverse 〚Y.X〛_I of the relation 〚X.Y〛_I can be computed in \(\mathcal{O}(n)\) time. In addition, the tuples in 〚X.Y〛_I (or its inverse) can be ordered lexicographically, which requires \(\mathcal{O}(n \log_{2} n)\) steps. Based on this ordering, the composition 〚X.Y〛_I ∘ 〚Y.Z〛_I can be performed in \(\mathcal{O}(n)\) time, in the following way: project 〚X.Y〛_I onto the objects shared by 〚Y.Z〛_I and prune any tuple in 〚Y.Z〛_I that has no match in that projection. Thus, the ith value of the tuple 〚par(o.A)〛_I can be found in \(\mathcal{O}(k n \log_{2} n + n \log_{2} d)\) time using at most k join operations and one aggregate operation. Since the number of parents per attribute is constant, the result follows. □
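The composition step of the proof can be sketched as follows; hash-indexing the second relation on its shared component plays the same role as the sort-and-prune procedure of the text (relation contents are hypothetical):

```python
from collections import defaultdict

def compose(R, S):
    """Compose two binary relations given as sets of pairs:
    returns {(x, z) : (x, y) in R and (y, z) in S}."""
    index = defaultdict(list)
    for y, z in S:          # index S on its first component
        index[y].append(z)
    return {(x, z) for (x, y) in R for z in index[y]}

# A user's watched movies composed with the movies' genres.
watched = {("ann", "m1"), ("ann", "m2")}
genre = {("m1", "comedy"), ("m2", "drama")}
print(sorted(compose(watched, genre)))  # -> [('ann', 'comedy'), ('ann', 'drama')]
```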

4.1 Preference optimization

A partial outcome is an interpretation I that assigns the value “∗” (unknown) to some ground attributes. A completion of I with respect to a set of attributes \(\mathcal{A}\) is a map J that extends I by replacing the unknown value of each ground attribute in \(\mathcal{A}_{\kappa}\) with a value of the appropriate type. A completion J is optimal for an RCP-net N if there is no distinct completion J′ of I such that \(J' \succ_{N} J\).

Based on these considerations, an outcome optimization task for \(\mathcal{N}[\mathcal{A},\mathcal{T}]\) is a reasoning task in which instances are partial outcomes and solutions are outcome completions with respect to \(\mathcal{A}\). Given an RCP-net N and an instance I, the task is to find a completion J of I that is optimal for N.

For acyclic RCP-nets, such a completion can be found in polynomial time using the forward sweep algorithm (Boutilier et al. 2004b), adapted to relational domains. Given an RCP-net N defined over an attribute set \(\mathcal{A}\) and a partial outcome I, we first construct a topological ordering of \(\mathcal{A}_{\kappa}\) according to the ground dependency graph \(\mathcal{G}_{\kappa}(N)\), where κ is the skeleton of I. Then, starting from J=I, we instantiate each \(o.A \in\mathcal{A}_{\kappa}\) in turn to a maximally preferred value in the preference ordering \(cpt_{X.A}(\)〚par(o.A)〛_J\()\).
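The sweep can be sketched as follows; the encoding (dictionaries keyed by ground attributes, table entries listed from most to least preferred, parent values assumed already aggregated) is ours and purely illustrative:

```python
def forward_sweep(ground_attrs, parents, cpt, partial):
    """Forward sweep for an acyclic network: visit the ground attributes in
    topological order and set each unspecified one to a maximally preferred
    value given its parents' current values.

    ground_attrs : ground attributes, already topologically ordered
    parents      : dict mapping each ground attribute to its ground parents
    cpt          : dict mapping (ground attribute, parent-value vector) to a
                   list of domain values, most preferred first
    partial      : dict of known values; "*" marks unknown ones
    """
    J = dict(partial)
    for oa in ground_attrs:
        if J.get(oa, "*") == "*":
            u = tuple(J[p] for p in parents[oa])
            J[oa] = cpt[(oa, u)][0]   # pick a maximally preferred value
    return J

# Hypothetical tables in the spirit of Fig. 2: the movie's duration
# preference depends on the critics' (aggregated) rating.
order = ["r.Rating", "m.Duration"]
parents = {"r.Rating": (), "m.Duration": ("r.Rating",)}
cpt = {("r.Rating", ()): ["positive", "negative"],
       ("m.Duration", ("positive",)): ["long", "short"],
       ("m.Duration", ("negative",)): ["short", "long"]}
J = forward_sweep(order, parents, cpt, {"r.Rating": "*", "m.Duration": "*"})
print(J)  # -> {'r.Rating': 'positive', 'm.Duration': 'long'}
```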

Example 4

Suppose that Ann is a young woman. Based on the tree-structured RCP-net in Fig. 2, what would be her favorite movies? Starting from the partial outcome I in which only Ann’s attributes are known, we can derive two optimal completions \(J_{1}\) and \(J_{2}\) of I. For both completions, the value of \(\mathsf{Critics.Rating}\) is set to high, and the value of \(\mathsf{Movie.Duration}\) is set to long. The value of \(\mathsf{Movie.Genre}\) is action for \(J_{1}\) and romance for \(J_{2}\).

Theorem 2

Let N be an RCP-net in \(\mathcal{N}_{\mathtt{acy}}[\mathcal{A},\mathcal{T}]\). Then, for any partial outcome I, finding an optimal completion of I for N can be done in \(\mathcal{O}(ant)\) time.

Proof

We first establish the correctness of the forward sweep algorithm. Let I be a partial outcome, and J be the completion of I returned by the algorithm. Suppose that \(J' \succ_{N} J\) for some distinct completion J′ of I. In this case, there is a sequence of flips from J′ to J in \(\succ_{N}\). Let S be the set of ground attributes for which the value is switched in the sequence. Because \(\mathcal{G}_{\kappa}(N)\) is acyclic, there is at least one o.A ∈ S for which the values of all parents are fixed by I. Since J and J′ are both extensions of I, we have 〚par(o.A)〛_I = 〚par(o.A)〛_J = 〚par(o.A)〛_{J′}. However, because 〚o.A〛_J is optimal, the value of o.A cannot be switched in the sequence, which contradicts the assumption that \(J' \succ_{N} J\).

Now, consider the computational cost of the algorithm. A linear extension of \(\mathcal{G}(N)\) can be found in \(\mathcal{O}(a)\) time. Based on this ordering, a linear extension of \(\mathcal{G}_{\kappa}(N)\) can be found in \(\mathcal{O}(an)\) time, by replacing the attribute at position i with its n ground instances, placed at consecutive positions \(i_{1},\ldots,i_{n}\) between the blocks of positions assigned to the attributes at positions i−1 and i+1. Since by Lemma 1 the parent assignment of each attribute can be found in \(\mathcal{O}(t)\) time, the algorithm returns a completion of I in \(\mathcal{O}(ant)\) time. □

4.2 Preference ranking

For a skeleton κ, an outcome set is any finite collection \(S=\{I_{1},\ldots,I_{m}\}\) of interpretations in \(\mathcal{I}_{\kappa}\). Note that all interpretations in an outcome set are defined over the same set of objects, but differ in the values assigned to object attributes. A ranking of S is a permutation π over [m]={1,…,m}. We say that π is consistent with an RCP-net N if, for any pair \(I_{i},I_{j}\) in S, \(I_{i} \succ_{N} I_{j}\) implies that \(I_{i}\) occurs before \(I_{j}\) in π, i.e. π(i)<π(j). An outcome ranking task is a reasoning task for which any instance is an outcome set of size m and any solution is a permutation over [m]. Given an RCP-net N and an instance S, the problem is to find a ranking π of S that is consistent with N.

This problem can be solved in polynomial time for acyclic RCP-nets using a compilation technique inspired by Boutilier et al. (2001) and Brafman and Domshlak (2008). For a skeleton κ, a utility function is a map \(\phi :\mathcal{I}_{\kappa} \rightarrow\mathbb{R}\). Any utility function ϕ induces a preference ordering \(\succ_{\phi}\) over \(\mathcal{I}_{\kappa}\) such that \(I \succ_{\phi} J\) if and only if ϕ(I)>ϕ(J). The function ϕ is consistent with an RCP-net N if \(\succ_{N}\) implies \(\succ_{\phi}\), that is, every linear extension of \(\succ_{\phi}\) is a linear extension of \(\succ_{N}\). Based on these notions, the overall idea of the compilation technique is to map any acyclic RCP-net N and any skeleton κ into a utility function ϕ over \(\mathcal{I}_{\kappa}\) that is consistent with N. The function ϕ can then be exploited for solving multiple ranking tasks defined over κ.

We now turn to the technical details. The basic ingredients of ϕ are two weight mappings f and g, where f is used to quantify local preferences in the tables of N, while g is used to quantify global dependencies between attributes. Consider an attribute X.A with parent set \(\{X.R_{1}.A_{1},\ldots,X.R_{p}.A_{p}\}\). Then, for each vector u of parent values in \(D_{A_{1}} \times \cdots\times D_{A_{p}}\) and each value v in \(D_{A}\), f(u,v) is the number of descendants of v in the preference ordering \(cpt_{X.A}(u)\). Note that \(f(u,v)\leq\vert D_{A}\vert -1\), and f(u,v)=0 if v is a leaf.
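The mapping f for a single table entry can be sketched as follows, with a hypothetical entry in the spirit of Fig. 2:

```python
def f_weights(domain, better):
    """Local weight map f(u, .) for one table entry: each value is mapped to
    its number of descendants (strictly less preferred values) under the
    entry's ordering, so f(u, v) <= |D_A| - 1 and f(u, v) = 0 for a leaf."""
    return {v: sum(1 for w in domain if better(v, w)) for v in domain}

# Hypothetical entry "positive reviews : long > short".
entry = f_weights(["long", "short"],
                  lambda v, w: v == "long" and w == "short")
print(entry)  # -> {'long': 1, 'short': 0}
```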

The weight mapping g is constructed in a recursive way using a topological ordering of \(\mathcal{G}(N)\). If X.A is the first attribute in the ordering, then we set g(X.A)=1. Assuming by induction hypothesis that g is fixed for the first k−1 elements in the ordering, if the kth element X.A has no parents, then g(X.A)=1. Otherwise,

\(g(X.A) = \min_{1 \le i \le p} g(X_{i}.A_{i}) / (2n\,\vert ch(X_{i}.A_{i})\vert )\)
(1)

where \(\{X_{1}.A_{1},\ldots,X_{p}.A_{p}\}\) is the set of all parents of X.A in \(\mathcal{G}(N)\), and \(ch(X_{i}.A_{i})\) is the set of all children of \(X_{i}.A_{i}\) in \(\mathcal{G}(N)\). The aim of g is thus to distribute the weight of an attribute evenly between its ground children.

With these ingredients in hand, the utility function ϕ is specified as a sum of sub-utilities \(\phi_{X.A}\), each defined for a controllable attribute X.A in N. Namely, if X.A is an attribute with parent set \(par(X.A)=\{X.R_{1}.A_{1},\ldots,X.R_{p}.A_{p}\}\), then \(\phi_{X.A}\) associates to each vector u in \(D_{A_{1}} \times \cdots\times D_{A_{p}}\) and to each value v in \(D_{A}\) the utility \(\phi_{X.A}(u,v)=g(X.A)\,f(u,v)\). Based on these sub-utility functions, the value of any interpretation I is simply given by:

\[ \phi(I) \;=\; \sum_{o.A} \phi_{X.A}\bigl( 〚\mathit{par}(o.A)〛_{I},\, 〚o.A〛_{I} \bigr) \qquad (2) \]

where the sum ranges over all instances o.A of controllable attributes X.A.

Example 5

Consider again the RCP-net N in Fig. 2, and suppose that three long movies o 1, o 2 and o 3 are presented to John, a middle-aged male user. Specifically, o 1 and o 2 are romance movies, while o 3 is an action movie. The corresponding reviews, denoted r 1, r 2 and r 3, are positive for o 1 and o 3, but quite negative for o 2. We denote by I 1 (resp. I 2 and I 3) the interpretation associated to the entities John and {r 1,o 1} (resp. {r 2,o 2} and {r 3,o 3}). Now, according to our utility function ϕ, we have ϕ(I 1)=1+1/2+0=3/2, ϕ(I 2)=1/2 and ϕ(I 3)=5/2. Therefore, the ranking (3,1,2) is consistent with N.

Theorem 3

Let N be an RCP-net in \(\mathcal{N}_{\mathtt{acy}}[\mathcal{A},\mathcal{T}]\) and κ be a skeleton. Then, compiling N into a consistent utility function ϕ on \(\mathcal{I}_{\kappa}\) can be done in \(\mathcal{O}(a d^{p+1})\) time. Based on this utility function ϕ, finding a consistent ranking of any outcome set \(S \subseteq\mathcal{I}_{\kappa}\) of size m can be done in \(\mathcal{O}(a n t\, m \log_{2} m)\) time.

Proof

Let N κ be the CP-net that associates to each instance o.A of an attribute X.A in N a parent set par(o.A) and a preference table cpt o.A . Specifically, if X.A is an attribute with parent set {X.R 1.A 1,…,X.R p .A p }, then:

  • par(o.A) is the set of all attributes o ij .A i such that 1≤i≤p and 1≤j≤n i , where o i1 ,…,o in_i denote the objects related to o through R i ,

  • cpt o.A is the map that assigns to each vector w=(v 1,…,v p ) in the space \(D(A_{1})^{n_{1}} \times\cdots \times D(A_{p})^{n_{p}}\) the preference ordering cpt o.A (w)=cpt X.A (v), where v is the aggregated vector \((\gamma_{A_{1}}(\boldsymbol {v}_{1}),\ldots,\gamma_{A_{p}}(\boldsymbol {v}_{p}))\).

We can observe that the dependency graph of N κ is \(\mathcal{G}_{\kappa}(N)\). By a direct application of Brafman and Domshlak (2008, Theorem 3), the utility function ϕ defined above is consistent with the CP-net N κ . Since \(\succ_{N_{\kappa}} = \succ_{N}\), it follows that ϕ is consistent with the RCP-net N.

Now, consider the computational cost required for constructing ϕ. For each of the values of each preference ordering in N, the weight f(u,v) can be computed in \(\mathcal{O}(d)\) time. Since there are at most a attributes and d p table entries per attribute, the weight map f can be constructed in \(\mathcal{O}(a d^{p+1})\) time. The weight map g can be obtained in \(\mathcal{O}(a)\) time by simply constructing a topological ordering of \(\mathcal{G}(N)\) and memorizing the number of ground children per attribute during the traversal.

Finally, the utility ϕ(I) of any interpretation I in the outcome set S can be computed in \(\mathcal{O}(a n t)\) time, as specified by equality (2). Thus, a consistent ranking of S can be found in \(\mathcal{O}(a n t\, m \log_{2} m )\) time by labeling each I∈S with ϕ and breaking ties arbitrarily. □
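The ranking step of the proof is an ordinary sort once each outcome is labeled with its utility; a minimal sketch, assuming phi is an already-compiled utility function:

```python
def rank_outcomes(S, phi):
    # label each interpretation with phi and sort in decreasing utility;
    # Python's stable sort breaks ties arbitrarily but deterministically
    return sorted(S, key=phi, reverse=True)
```

With the utilities of Example 5 (3/2, 1/2 and 5/2 for I 1, I 2 and I 3), this returns the outcomes in the order I 3, I 1, I 2, matching the ranking (3,1,2).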

5 Preference learning

By extending our previous considerations, a prediction task for a class of RCP-nets \(\mathcal{N}\) is a triple \((\mathcal{X},\mathcal{Y},\ell)\), where \(\mathcal{X}\) is a space of instances, \(\mathcal{Y}\) is a space of solutions, and \(\ell: \mathcal{N} \times\mathcal{X} \times\mathcal{Y} \rightarrow[0,\lambda]\) is a λ-bounded loss function.

In the online learning model, the decision maker is a learning algorithm that observes instances of a prediction task in a sequence of rounds. At trial t, the algorithm receives an instance \(x_{t} \in\mathcal{X}\) and is required to predict a corresponding solution \(N_{t}(x_{t}) \in \mathcal{Y}\) using its current hypothesis \(N_{t} \in\mathcal{N}\). Once the algorithm has predicted, the true solution \(y_{t} \in\mathcal{Y}\) is revealed and the algorithm incurs the loss \(\ell(N_{t};x_{t},y_{t})\) that measures the discrepancy between the predicted solution N t (x t ) and the correct response y t . The ultimate goal of the decision maker is to minimize the cumulative loss it suffers along its run. To achieve this goal, the algorithm is allowed to choose a new hypothesis in \(\mathcal{N}\) at the end of each trial, possibly using a randomized strategy.

As mentioned in the introduction, we make no assumptions regarding the sequence of examples. In this general setting, the performance of the decision maker is measured relative to the performance of the best hypothesis in \(\mathcal{N}\). Namely, the regret of an online learning algorithm L with respect to a sequence {(x t ,y t )} of T examples is given by the difference between the expected cumulative loss of the algorithm and the cumulative loss of the best hypothesis chosen with the benefit of hindsight:

\[ R_{L}(T) \;=\; \mathbb{E}\Biggl[\,\sum_{t=1}^{T} \ell(N_{t};x_{t},y_{t})\Biggr] \;-\; \min_{N \in \mathcal{N}} \sum_{t=1}^{T} \ell(N;x_{t},y_{t}) \]

A class of hypotheses \(\mathcal{N}\) is online learnable with respect to a prediction task \((\mathcal{X},\mathcal{Y},\ell)\) if there exists an online learning algorithm L such that, for any sequence of T examples, the regret of L is sublinear as a function of T. This condition implies that “on average” the algorithm performs as well as the best fixed hypothesis in hindsight. If, in addition, the computational complexity of L is polynomial in the dimension parameters associated with \(\mathcal{N}\), \(\mathcal{X}\), and \(\mathcal{Y}\), then \(\mathcal{N}\) is efficiently learnable.

In this section, any RCP-net is viewed as a set of components, whose data structure will be clarified shortly. Let \(\mathcal{C(N)}\) be the set of all distinct components generated from a class of hypotheses \(\mathcal{N}\). A loss function ℓ is linear for \(\mathcal{N}\) if \(\ell(N;x,y)=\sum_{C \in N} \ell(C;x,y)\) for any \(N \in \mathcal{N}\), where ℓ(C;x,y) denotes the loss incurred by the component C on the example (x,y).

Our learning algorithm is a variant of the Hedge algorithm (Freund and Schapire 1997) adapted to structured models. Following Koolen et al. (2010), we call this algorithm Expanded Hedge. As indicated in Fig. 3, the algorithm maintains a parameter vector θ t over \(\mathcal{C(N)}\). On round t, the algorithm predicts with a hypothesis N t chosen at random according to the exponential distribution \(\mathbb{P}_{\theta_{t}}\) induced by θ t . Then, θ t is updated using the rule \(\theta_{t+1}(C)=\theta_{t}(C)-\eta_{t} \ell(C;x_{t},y_{t})\), where η t is an adaptive learning rate.

Fig. 3
figure 3

The expanded Hedge algorithm
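As an illustration of one round of Expanded Hedge, the sketch below enumerates the hypothesis class explicitly and draws N t by brute force; this is only viable for toy classes, and the structured sampling procedures of Sects. 5.1–5.3 exist precisely to avoid this enumeration. All names are illustrative.

```python
import math, random

def hedge_round(theta, hypotheses, comp_loss, x, y, eta):
    # theta: component -> weight; hypotheses: list of frozensets of components
    # 1. draw N_t with probability proportional to exp(sum_{C in N} theta(C))
    scores = [math.exp(sum(theta[c] for c in N)) for N in hypotheses]
    r = random.uniform(0.0, sum(scores))
    acc = 0.0
    for N, s in zip(hypotheses, scores):
        acc += s
        if r <= acc:
            chosen = N
            break
    # 2. additive update in the exponent: theta_{t+1}(C) = theta_t(C) - eta * loss(C; x, y)
    for c in theta:
        theta[c] -= eta * comp_loss(c, x, y)
    return chosen
```

Because the update is additive in θ, the induced weight exp(θ(C)) is multiplied by exp(−η ℓ), which is the usual Hedge multiplicative update at the component level.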

The following result can be derived by a simple adaptation of Hedge’s amortized analysis (Cesa-Bianchi and Lugosi 2006, Theorem 2.2).

Lemma 2

Let \(\mathcal{N}\) be a finite class of RCP-nets, and \((\mathcal{X},\mathcal{Y},\ell)\) be a prediction task where ℓ is a λ-bounded linear loss function for \(\mathcal{N}\). Then, for any sequence {(x t ,y t )} of T examples, the regret of the Expanded Hedge algorithm with adaptive learning rate \(\eta_{t} = (2/\lambda) \sqrt{\ln \vert \mathcal{N}\vert /t}\) satisfies:

The rest of this section is devoted to the learnability issue of the class \(\mathcal{N}_{\mathtt{tree}}\) of tree-structured and bipartite-ordered RCP-nets. In order to obtain an efficient online learning algorithm for this class, we need a polynomial time procedure for generating hypotheses at random according to an exponential distribution. To this end, we first present a general procedure for generating random subsets; this procedure is the backbone for constructing RCP-nets. Next, we investigate the problem of learning bipartite orderings, and then, we turn to the problem of learning tree structures. We conclude by applying the results to optimization and ranking tasks.

5.1 Generating random subsets

Let \(\mathcal{B}\) be an arbitrary non-empty set of subsets of [b]={1,…,b}, and let w be a map that associates to each subset \(B \in\mathcal{B}\) a positive real w(B) that captures the weight of B. In this section, we present a simple algorithm due to Kulkarni (1990) that generates random subsets of [b] which belong to \(\mathcal{B}\) and such that the probability of generating a given \(B \in\mathcal{B}\) is proportional to w(B).

Before presenting the algorithm, we introduce the following notations. Let U and V be subsets of [b], not necessarily in \(\mathcal{B}\), such that U⊆V. By [U,V], we denote the interval \(\{B \in\mathcal{B}: U \subseteq B \subseteq V\}\). The weight of [U,V], denoted w[U,V], is defined as the sum of weights of sets in [U,V], i.e.

\[ w[U,V] \;=\; \sum_{B \in [U,V]} w(B) \qquad (3) \]

The algorithm, called Random Subset and described in Fig. 4, starts from the largest interval [U,V] with U=∅ and V=[b], and iteratively shrinks [U,V] until it contains a unique set in \(\mathcal{B}\). Namely, for each element i∈[b], the algorithm inserts i in U with probability p and removes i from V with probability 1−p, where p is the ratio of w[U∪{i},V] to w[U,V]. By a reformulation of Theorem 2.1 of Kulkarni (1990), we get the following result.

Fig. 4
figure 4

The random subset algorithm
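A direct transcription of the procedure in Fig. 4, assuming the family \(\mathcal{B}\) is small enough to be listed explicitly so that the interval weights w[U,V] can be computed by enumeration:

```python
import random

def random_subset(family, w, b):
    # family: list of frozensets over [b]; w: frozenset -> positive weight
    def interval_weight(U, V):
        # w[U, V]: total weight of the sets B in the family with U <= B <= V
        return sum(w[B] for B in family if U <= B <= V)

    U, V = frozenset(), frozenset(range(1, b + 1))
    for i in range(1, b + 1):
        # keep i with probability w[U + {i}, V] / w[U, V], otherwise discard it
        if random.random() < interval_weight(U | {i}, V) / interval_weight(U, V):
            U = U | {i}
        else:
            V = V - {i}
    return U  # [U, V] now contains a single set of the family, namely U
```

Note the invariant: whichever branch is taken, the remaining interval keeps positive weight, so the final set always belongs to the family.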

Lemma 3

For any nonempty set \(\mathcal{B}\) of subsets of [b] and any positive weight function w over \(\mathcal{B}\), the Random Subset algorithm produces a random variable U taking values in \(\mathcal{B}\) with distribution proportional to w.

5.2 Parameter learning

After this digression on generating random subsets, let us investigate the problem of learning the conditional preference tables of RCP-nets. More precisely, we examine the class \(\mathcal{N}_{\mathtt {tree}}[\mathit{par}]\) where the parent set par is fixed in advance and known. In this setting, any component of the preference network is specified as a triplet of the form C=(X.A,u,v) where X.A is an attribute with parent set {X 1.R 1.A 1,…,X p .R p .A p }, u is a vector over \(D_{A_{1}} \times\cdots\times D_{A_{p}}\) and v is a value in D A . Based on this notation, cpt X.A (u) is encoded by the set of components (X.A,u,v), where v is a maximally preferred value in D A .

Given a schema of domain size d involving a attributes, the number of components is bounded by ad p+1, and hence, remains polynomial in a and d whenever the parent size p is taken as constant. By contrast, the cardinality of \(\mathcal{N}_{\mathtt{tree}}[\mathit {par}]\) is bounded by \((2^{d} - 1)^{ad^{p}}\), where (2d−1) is the number of complete bipartite digraphs over d values. Because this cardinality grows exponentially with a and d, a key computational issue is to generate hypotheses at random according to a given distribution over \(\mathcal{N}_{\mathtt{tree}}[\mathit{par}]\). Fortunately, this issue can be circumvented by exploiting the bipartite structure of preference orderings.

Lemma 4

For the class \(\mathcal{N}_{\mathtt{tree}}[\mathit{par}]\), let θ be a parameter vector over the component set \(\mathcal{C}({\mathcal{N}_{\mathtt {tree}}[\mathit{par}]})\). Then, generating a hypothesis \(N \in\mathcal{N}_{\mathtt {tree}}[\mathit{par}]\) at random according to θ can be done in \(\mathcal{O}(a d^{p+2})\) time.

Proof

Consider the map w that assigns to each \(N \in\mathcal{N}_{\mathtt{tree} }[\mathit{par}]\) the weight

\[ w(N) \;=\; \prod_{C \in N} e^{\theta(C)} \qquad (4) \]

By construction, w is proportional to ℙ θ .

Now, consider any attribute X.A with parents {X 1.R 1.A 1,…,X p .R p .A p }, and let u be a vector in \(D_{A_{1}} \times\cdots \times D_{A_{p}}\). Without loss of generality, assume that D A ={1,…,d}, and let {C 1,…,C d } be the set of components of the form C i =(X.A,u,i). Recall that any bipartite preference ordering B over D A is encoded by the set of components (X.A,u,i), where i is a maximally preferred value in D A . Based on equality (4), the weight of B is given by w(B)=∏ i∈B w i , where w i =w(C i ). Let [U,V] be the set of all nonempty subsets B of D A such that U⊆B⊆V. Based on the definition of w[U,V] given in (3) and using the specification of w in (4), we get that:

\[ w[U,V] \;=\; \prod_{i \in V} (1 + w_{i}) - 1 \quad \text{if } U = \emptyset, \qquad w[U,V] \;=\; \prod_{i \in U} w_{i} \prod_{i \in V \setminus U} (1 + w_{i}) \quad \text{otherwise} \qquad (5) \]

The first equality in (5) is obtained by taking the sum of weights of all subsets of V and removing the weight of the empty subset. The second equality follows by taking the sum of weights of all subsets of V which include U. Thus, using the Random Subset algorithm, we can generate in \(\mathcal{O}(d^{2})\) time a random bipartite ordering for the entry cpt X.A (u). Since there are at most a attributes and d p tuples of values per attribute, the result follows. □
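For bipartite orderings, the interval weights thus admit a closed form and no enumeration is needed; the sketch below (with w i standing for the component weights \(e^{\theta(C_{i})}\)) samples a nonempty set of maximally preferred values in \(\mathcal{O}(d^{2})\) time:

```python
import math, random

def sample_preferred_values(w):
    # w[i] = weight of component C_i; draws a nonempty subset S of {0,...,d-1}
    # with probability proportional to prod_{i in S} w[i], in O(d^2) time
    d = len(w)
    U, V = set(), set(range(d))

    def interval(U, V):
        # closed-form interval weight: avoids enumerating the 2^d - 1 subsets
        if not U:
            return math.prod(1 + w[i] for i in V) - 1
        return math.prod(w[i] for i in U) * math.prod(1 + w[i] for i in V - U)

    for i in range(d):
        if random.random() < interval(U | {i}, V) / interval(U, V):
            U.add(i)
        else:
            V.remove(i)
    return U
```

Each of the d steps evaluates products over at most d factors, hence the quadratic bound; the returned set is never empty because the last undecided element is kept with probability 1 when all others have been discarded.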

5.3 Structure learning

Let us turn to the more general class \(\mathcal{N}_{\mathtt {tree}}[\mathcal{A},\mathcal{T}]\) of tree-structured RCP-nets with attribute set \(\mathcal{A}\) and term set \(\mathcal{T}\). Since the parent structure is unknown, any component is now defined as a quadruplet C=(X.A,par(X.A),u,v), where X.A is an attribute, par(X.A) is a parent set, u is a vector over the domain of par(X.A) and v is a value in D A . As specified above, any entry cpt X.A (u) is encoded by the set of components (X.A,par(X.A),u,v), where v is a maximally preferred value in D A .

For a schema of domain size d, the number of components is bounded by \(a t^{p} d^{p+1}\), where \(\vert \mathcal{A}\vert = a\) and \(\vert \mathcal{T}\vert = t\). By contrast, the size of \(\mathcal{N}_{\mathtt{tree}}[\mathcal{A},\mathcal{T}]\) is bounded by \({(a+1)^{a - 1}} (t^{p})^{a} (2^{d} - 1)^{ad^{p}}\), where \((a+1)^{a-1}\) is the number of spanning forests of the complete digraph of order a. Again, this combinatorial barrier can be handled by exploiting the Matrix-Tree Theorem.

Lemma 5

For the class \(\mathcal{N}_{\mathtt{tree}}[\mathcal{A},\mathcal{T}]\), let θ be a parameter vector over the component set \(\mathcal{C}({\mathcal{N}_{\mathtt{tree}}[\mathcal{A},\mathcal{T}]})\). Then, generating a hypothesis \(N \in\mathcal{N}_{\mathtt {tree}}[\mathcal{A},\mathcal{T}]\) at random according to θ can be done in \(\mathcal{O}(a^{5} + a^{2} t^{p} d^{p} + a d^{p+2})\) time.

Proof

By analogy with the proof of Lemma 4, let w be the map that assigns to each \(N \in\mathcal{N}_{\mathtt{tree}}[\mathcal{A},\mathcal{T}]\) the weight w(N) specified by equality (4). Again, w is by construction proportional to ℙ θ .

Let X.A be an attribute with parent set par(X.A), and u be a vector in the domain of par(X.A)={X 1.R 1.A 1,…,X p .R p .A p }. We also assume without loss of generality that D A ={1,…,d}. Let {C 1,…,C d } be the set of components of the form C i =(X.A,par(X.A),u,i), each associated with its weight w i =w(C i ). By w[X.A,par(X.A),u], we denote the sum of weights of all bipartite orderings over D A . From equality (5), we get that

\[ w[X.A,\mathit{par}(X.A),u] \;=\; \prod_{i=1}^{d} (1 + w_{i}) - 1 \qquad (6) \]

By extension, let w[X.A,par(X.A)] be the sum of weights of all possible conditional preference tables defined over the attribute X.A with parent set par(X.A). Applying the distributive property,

\[ w[X.A,\mathit{par}(X.A)] \;=\; \prod_{u \in D_{A_{1}} \times \cdots \times D_{A_{p}}} w[X.A,\mathit{par}(X.A),u] \qquad (7) \]

Now, let \(\mathcal{G}\) be the weighted digraph with node set \(\mathcal{A} \cup\{\top\}\), where ⊤ is a “dummy” node denoting the absence of parent. The edge set is formed by all pairs of distinct attributes in \(\mathcal{A}\), together with all pairs of the form (X.A,⊤) where \(X.A \in \mathcal{A}\). If e is an edge of the form (X.A,X′.A′), then w(e) is given by the sum of weights w[X.A,par(X.A)] of all possible parent sets par(X.A) including one term in \(\mathcal{T}\) with suffix X′.A′ and at most p−1 terms in \(\mathcal{T}\) with uncontrollable suffix. Analogously, if e is of the form (X.A,⊤), then w(e) is the sum of weights w[X.A,par(X.A)] of all parent sets par(X.A) formed by at most p terms in \(\mathcal{T}\) with uncontrollable suffix.

Based on these specifications, the weighted digraph \(\mathcal{G}\) can be constructed in \(\mathcal{O}(a^{2} t^{p} d^{p})\) time. Namely, there are \(2\binom{a}{2} + a\) edges in the graph, and the weight of each edge is the sum of at most t p weights of the form w[X.A,par(X.A)], each being computed in \(\mathcal{O}(d^{p})\) time using equalities (6) and (7).

Finally, let \(\mathrm{span}(\mathcal{G})\) denote the set of all spanning trees of \(\mathcal{G}\) rooted at ⊤. For any spanning tree S in \(\mathrm{span}(\mathcal{G})\), let w(S) be the product of weights of its edges. By construction, w(S) is equal to the sum of weights w(N) of all RCP-nets \(N \in\mathcal{N}_{\mathtt{tree}}[\mathcal{A},\mathcal{T}]\) with parent structure S. Therefore,

With this property in hand, let U,V be two subsets of edges of \(\mathcal{G}\) such that UV, and let w[U,V] be the sum of weights of all spanning trees \(S \in\mathrm{span}(\mathcal{G})\) for which USV. By \(\mathcal{G}[U,V]\) we denote the weighted digraph obtained from \(\mathcal{G}\) by deleting the edges which are not in V, and contracting the edges which are in U. By Lemma 5.4 in Kulkarni (1990),

Thus, based on the Random Subset algorithm, we can generate at random a spanning tree S of \(\mathcal{G}\) rooted at ⊤ in \(\mathcal{O}(a^{5})\) time. This, together with the fact that any RCP-net over S can be generated in \(\mathcal{O}(a d^{p+2})\) time, yields the result. □

5.4 Applications

We now have all ingredients in hand to examine the learnability of tree-structured RCP-nets.

For optimization tasks, each example supplied to the decision maker consists in a partial outcome x and a completion y of x. Here, ℓ opt (C;x,y) indicates whether C has made a prediction mistake on (x,y). Specifically, a component C of the form (X.A,par(X.A),u,v) is charged one mistake on (x,y) if there is at least one object o of type X for which the value of o.A is incorrectly predicted, i.e. 〚par(o.A)〛 y =u, 〚o.A〛 x =∗, and 〚o.A〛 y ≠v. Based on this notion, ℓ opt (N;x,y) is equal to the number of components C in N which have made a mistake on (x,y). Thus, ℓ opt is an upper bound on the number of mistakes made by the decision maker. Together with the fact that ℓ opt is bounded by λ=a, the combination of Lemmas 2 and 5 yields the following result.

Theorem 4

The class \(\mathcal{N}_{\mathtt{tree}}[\mathcal{A},\mathcal{T}]\) is efficiently online learnable from outcome optimization tasks with regret bound

For ranking tasks, each example consists in an outcome set x={I 1,…,I m } and a permutation y over [m]={1,…,m}. Here, the loss ℓ rank (C;x,y) is recursively defined as follows. For m=2, any component C of the form (X.A,par(X.A),u,v) is charged one mistake on ({I 1,I 2},(2,1)) if there is at least one object o of type X such that , and . Now, for m≥2, ℓ rank (C;x,y) is the number of pairs in x for which C has made a prediction mistake:

Based on this definition, ℓ rank (N;x,y) is an upper bound on the Kendall tau distance (Kendall 1938), which counts the number of pairs in x for which the permutations N(x) and y have opposite orderings (called discordant pairs). Since ℓ rank is bounded by \(\lambda= a \binom {m}{2}\), we get the following result.

Theorem 5

The class \(\mathcal{N}_{\mathtt{tree}}[\mathcal{A},\mathcal{T}]\) is efficiently online learnable from outcome ranking tasks with regret bound

We note in passing that both regret bounds are independent of the potentially large number n of objects in the database.

6 Experiments

The experiments reported in this section are not meant to give a rigorous empirical evaluation of our theoretical framework. Instead, they are intended as an illustration of the typical behavior of Expanded Hedge for learning tree-structured RCP-nets, comparing the type of prediction task (optimization vs. ranking), the number p of parents per attribute, and the type of preference ordering over domain values (bipartite graph vs. star).

Our experiments are based on a recent benchmark in movie recommendation (Cantador et al. 2011) for which the schema is described in Fig. 1. As an integration of the MovieLens, IMDb and \(\mathsf{Rotten\, Tomatoes}\) systems, the database includes 10198 movies, 7742 actors, 4053 directors, and 6040 anonymous users, where each user rated at least 20 movies.

In our experiments, outcomes are defined according to 4 objects: a user, a movie picked in her watchlist, the actress or actor with leading role in the movie, and the director of the movie. Each outcome involves 3 uncontrollable attributes, specified by the user’s age, gender and occupation, and 25 controllable attributes including the starring actor’s fame and gender, the director’s fame, and the film’s country, genres, release date, revenue and critics rating. Notably, because a movie can have multiple genres, the set-valued attribute Movie.Genre was split into 18 binary attributes, each referring to a particular category in IMDb. Movie ratings were formatted according to a five-star scale. Other numeric attributes were discretized using the standard “equal frequency discretization” technique with 5 intervals.

Each experiment was conducted by selecting a group of 50 users generated at random from 4 known occupations. The experiment lasts for 1000 training rounds and, after each series of 5 rounds, we measured the algorithm’s accuracy on 100 test examples generated from our user group. The final results are obtained by averaging the algorithm’s accuracy on 10 experiments.

Learning to optimize

In movie recommendation, a natural optimization task is to predict the type of movie users would like to see based solely on their profile. In this setting, each example at round t was generated by selecting a user and a highest-rated movie in her watchlist, together with the movie’s starring actor and director. We denote the resulting interpretation by y t ; the partial outcome x t was then obtained from y t by removing the value of every controllable attribute.

In order to evaluate the performance of the decision maker, we must ensure that its predicted completion N t (x t ) covers at least one movie that has been seen by the user. To this end, the forward sweep algorithm was slightly modified as follows: for each attribute o.A iteratively processed according to the dependency graph of N t , generate a linear extension of the preference ordering corresponding to , and instantiate o.A to the first value in the linear extension that matches at least one movie in the user’s watchlist. A mistake is made if the resulting completion N t (x t ) does not cover any highest-rated movie. In this case, the completion y t is returned.

The algorithm’s accuracy was measured by counting the number of correct predictions. The left part of Fig. 5 shows plots for learning tree-directed (and star-ordered) RCP-nets with parent size p=1,2 and 3. Such results indicate that uncontrollable attributes are crucial for prediction. Notably, by observing the cumulative loss of each component at the end of the 1000 rounds, the improvement from p=2 to p=3 is essentially due to the presence of rules involving both the user’s gender and the user’s age. The right part of Fig. 5 compares the algorithm’s performance with star-ordered RCP-nets and bipartite-ordered RCP-nets. Here, the convergence rate is slightly faster with stars, indicating that a single maximally preferred value per entry in the tables of the network is sufficient for making accurate predictions.

Fig. 5
figure 5

Accuracy results on optimization tasks, comparing the parent sizes (left part with star orderings) and the type of preference orderings (right part with p=3)

Learning to rank

Concerning ranking tasks, each instance x t was generated by first selecting a user, and then by forming an outcome set of m=20 movies, each selected at random from the user’s watchlist. Here, a prediction mistake is made if the ranking \(\hat{y}_{t} = N_{t}(x_{t})\) is inconsistent with the user’s ratings; if so, the response y t is the ranking closest to \(\hat{y}_{t}\) that is consistent with the user’s ratings. Because our rank loss defined above is an upper bound on Kendall’s tau distance, the accuracy was measured here using the corresponding Kendall’s tau coefficient, given by \(1 - 4D(\hat{y}_{t},y_{t})/m(m-1)\), where \(D(\hat{y}_{t},y_{t})\) is the Kendall’s tau distance between \(\hat{y}_{t}\) and y t .
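The accuracy measure follows directly from its definition; a short sketch, with rankings represented as lists of items from most to least preferred:

```python
from itertools import combinations

def kendall_tau_distance(y1, y2):
    # number of discordant pairs: pairs ordered oppositely by the two rankings
    pos1 = {v: i for i, v in enumerate(y1)}
    pos2 = {v: i for i, v in enumerate(y2)}
    return sum(1 for a, b in combinations(y1, 2)
               if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0)

def kendall_tau_coefficient(y1, y2):
    # 1 - 4 D / (m (m - 1)): ranges from 1 (identical) to -1 (reversed)
    m = len(y1)
    return 1 - 4 * kendall_tau_distance(y1, y2) / (m * (m - 1))
```

For m=20 as in the experiments, the coefficient decreases by 4/380 ≈ 0.01 per discordant pair.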

The left part of Fig. 6 shows plots for learning bipartite ordered RCP-nets with parent size p=1,2 and 3. Again, it is apparent that the presence of uncontrollable attributes plays an important role on the algorithm’s performance. However, a sharp contrast between optimization tasks and ranking tasks is observed when comparing star-ordered RCP-nets and bipartite-ordered RCP-nets. As plotted in the right part of Fig. 6, the degraded performance of the algorithm using star-shaped orderings indicates that such preference relations are too restrictive; the full expressiveness of bipartite orderings is required for ranking tasks.

Fig. 6
figure 6

Accuracy results on ranking tasks, comparing the parent sizes (left part with bipartite orderings) and the type of preference orderings (right part with p=3)

Running times

Both series of experiments were conducted on a 3.00 GHz Intel Xeon 5570 bi-processor with 8 GB RAM running Windows 7. All procedures were written in C++. For generating spanning trees we used the cycle popping algorithm due to Propp and Wilson (1998). Although this Markov chain-based procedure does not offer the theoretical guarantees of determinant-based procedures, it is much simpler to implement. Based on this algorithm and a B-tree data structure for storing components (about \(10^{6}\) for p=3), the running time needed for generating tree-structured RCP-nets is less than 10 ms. Optimizing outcomes and ranking outcome sets with these RCP-nets takes less than 1 ms.

7 Related work

In recent years, the topic of preferences has received a great deal of attention in AI (see e.g. Brafman and Domshlak 2009; Fürnkranz and Hüllermeier 2011). A key issue in this topic is to devise preference models endowed with efficient forms of reasoning and learning. This section focuses on preference networks which adopt a graph-based representation for identifying preferential dependencies among variables. By analogy with probabilistic networks (Koller and Friedman 2009), preference networks can be divided into two main categories depending on whether the underlying structure is an undirected graph (more precisely a hypergraph whose primal graph is undirected) or a digraph.

7.1 Undirected preference networks

Utility functions have long been recognized as a natural paradigm for modeling preferences. For an outcome space \(\mathcal{I}\), a utility function is a map \(\phi: \mathcal{I} \rightarrow\mathbb{R}\) that associates a “degree of desire” to each outcome I in \(\mathcal{I}\). The function induces a total preorder ⪰ ϕ such that I ⪰ ϕ J if and only if ϕ(I)≥ϕ(J). By far, the most common approach for decomposing utility functions is the additive independence principle (Keeney and Raiffa 1976): a utility function ϕ is a weighted sum of sub-utility functions, or features, considered as pairwise independent. Given a multi-attribute representation of \(\mathcal{I} = D_{1} \times\cdots\times D_{n}\), the simplest class of utility functions satisfying this principle is the family of linear functions, each defined as a weighted sum over the attributes {X 1,…,X n }.

Since the strong independence assumption required by linear functions is too restrictive in many application domains, Fishburn (1967) introduced the Generalized Additive Independence (GAI) principle which allows for a similar additive decomposition of a utility function, but where overlapping subsets of attributes are the features involved rather than simple attributes. Bacchus and Grove (1995) and Gonzales and Perny (2004) proposed a graphical representation of this principle: a GAI-net is a pair \((\mathcal{H},\phi)\) such that \(\mathcal{H} = (\mathcal{X},\mathcal{E})\) is a hypergraph with node set \(\mathcal{X} = \{X_{1},\ldots,X_{n}\}\), and ϕ is a map that associates to each hyperedge \(E \in\mathcal{E}\) a feature ϕ E :D E →ℝ, where D E is the product of domains of the variables in E. Also known as weighted CSPs (Dechter 2003), GAI-nets have been the subject of extensive research in the AI literature (Braziunas and Boutilier 2005; Boutilier et al. 2006; Gonzales et al. 2011). Since a GAI-net is merely a linear combination of features, various algorithms can be applied for learning GAI-nets from ranking instances in multi-attribute domains (Herbrich et al. 2000; Crammer and Singer 2001; Freund et al. 2003; Harrington 2003; Domshlak and Joachims 2005).

In this line of work, the closest framework to ours is due to Brafman (2008) who recently proposed a relational extension of GAI-nets. Intuitively, a relational GAI-net is a template over a database schema \(\mathcal{S}\) that specifies a ground GAI-net for each database of objects. More formally, a template feature is a pair (F,ϕ) where \(F = \{\gamma_{A_{1}}(X.R_{1}.A_{1}),\ldots,\gamma_{A_{p}}(X.R_{p}.A_{p})\}\) is a set of terms in \(\mathcal{T}(\mathcal{S})\) and ϕ is a mapping from \(D_{A_{1}} \times\cdots\times D_{A_{p}}\) into ℝ. A relational GAI-net over \(\mathcal{S}\) is a set of template features. For a skeleton κ and an interpretation \(I \in\mathcal{I}_{\kappa}\), the sub-utility assigned to I by a template feature (F,ϕ) is given by:

Based on the generalized additive independence condition, the utility assigned by a relational GAI-net N={(F 1,ϕ 1),…,(F m ,ϕ m )} to any interpretation \(I~\in~\mathcal{I}_{\kappa}\) is given by \(\phi_{N}(I) = \sum_{i=1}^{m} \phi_{i}(I)\). Thus, relational GAI-nets can be used to compare relational outcomes and to answer relational preference queries. However, the crucial difference between RCP-nets and relational GAI-nets lies in the fact that preference optimization can be solved in polynomial time for acyclic RCP-nets, while it is NP-hard for relational GAI-nets (Brafman 2008, Theorem 1). To this point, finding subclasses of relational GAI-nets which are tractable for both reasoning and learning is a challenging problem.

7.2 Directed preference networks

From a cognitive viewpoint, the interest of directed graphical models lies in their ability to represent conditional preferences in an intuitive manner (Boutilier et al. 2004b; Engel and Wellman 2008). Notably, CP-nets are directed graphical models in which the semantics of conditional preferences is defined according to the ceteris paribus principle. Various extensions of CP-nets have been proposed in multi-attribute domains (Boutilier et al. 2001; Brafman et al. 2006; Goldsmith et al. 2008; Binshtok et al. 2009). The learnability issue of CP-nets has been considered in different learning protocols including, notably, the distribution specific PAC learning model (Dimopoulos et al. 2009) and the exact learning model with equivalence and membership queries (Koriche and Zanuttini 2010). In a related setting, Yaman et al. (2011) developed efficient online algorithms for learning ensembles of lexicographic CP-nets whose dependency graph is a chain.

In a nutshell, our framework extends the CP-net formalism to handle reasoning and learning problems in relational domains. Because the number of objects in a database is typically large, an important aspect of our framework is to operate at the template level whenever possible. Notably, the optimization and ranking techniques presented in Theorems 2 and 3 do not require an explicit construction of the ground dependency graph \(\mathcal{G}_{\kappa}(N)\) of the RCP-net N. For this reason, the complexity of outcome optimization and outcome ranking is low-polynomial in the number n of objects.

Directed preference networks have similarities with the preference functions described in Cohen et al. (1999). Conceptually, a preference function is a map \(F: \mathcal{I} \times\mathcal{I} \rightarrow[0,1]\) such that F(I,J)=1−F(J,I). Following the additive independence principle, preference functions are defined as weighted sums of atomic functions. This line of work has recently been extended to the relational setting by Ceci et al. (2010), who use first-order logical rules for representing atomic functions. Yet, one of the key issues with such approaches is that the weighted digraph over \(\mathcal{I}\) induced by a preference function is not necessarily acyclic. For this reason, the problem of finding a ranking that is maximally consistent with a preference function is, in general, NP-hard.

8 Conclusions

In this study, we have presented a unifying framework for learning and reasoning with conditional preferences in relational domains. Our main theoretical results state that acyclic RCP-nets support tractable inference for both preference optimization and preference ranking, and tree-structured RCP-nets can be robustly learned from optimization and ranking tasks using linear loss functions. Our theoretical findings have been complemented by experiments on a large-scale recommendation domain.

Clearly, there are many directions in which one might attempt extensions of this study. Two of them are particularly important for addressing practical situations in relational domains.

Constraints

In our framework, the decision maker’s background knowledge is a relational schema with class names, attribute names, and functional dependencies among them. However, in many application domains, such as control systems (Brafman 2008) and configuration problems (Junker 2006), the background knowledge also involves other sorts of constraints. For example, the choice of a particular motherboard in a computer can restrict the choice of graphics cards, and the sum of prices of the computer components cannot exceed the user’s budget. In the presence of such constraints, our result for outcome optimization (Theorem 2) no longer holds. To this point, the anytime algorithm suggested by Boutilier et al. (2004a) for CP-nets can provide a first step for investigating constrained optimization problems with RCP-nets.

Cyclic preferences

All theoretical results presented in this paper are limited to classes of acyclic RCP-nets. As argued by Goldsmith et al. (2008), the acyclicity assumption for conditional preferences can be too strong in some situations. Notably, a natural form of cyclic preference in relational domains arises from social networks: evidence suggests that people tend to rely more on recommendations from their friends than on recommendations from anonymous individuals (Sinha and Swearingen 2001; Golbeck 2006). However, from a semantic viewpoint, cyclic RCP-nets are not always guaranteed to be coherent (Theorem 1 does not hold for general RCP-nets). A challenging problem is thus to identify and learn coherent forms of cyclic RCP-nets capable of representing mutual influences between entities in relational domains.