1 Introduction

The fields of data mining and machine learning have contributed numerous effective and highly optimized algorithms for analyzing data. However, this focus on efficiency and scalability has come at the cost of generality. Indeed, while the algorithms are highly effective, their application range is often very restricted, and the algorithms are typically hard to change and adapt even to small variations on the problem definition. This observation has led to an interest in declarative methods for data mining and machine learning in which the focus lies on the use of expressive models that can capture a wide range of different problem settings and that can then be solved using off-the-shelf constraint solving technology; see Guns et al. (2013a), De Raedt (2012), Arimura et al. (2012), De Raedt (2015).

Motivated by this quest for more general and generic data analysis approaches, the present paper introduces the problem of relational data factorization (ReDF). ReDF is inspired by matrix factorization, one of the most popular techniques in machine learning and data mining for which many variants have been studied, such as non-negative, singular value and Boolean matrix factorization. In matrix factorization, one is given an \(n \times m\) matrix \(\mathbf {A}\), and the problem is to rewrite it as the product of some other matrices, e.g., the product of an \(n \times k\) matrix \(\mathbf {B}\) and a \(k\times m\) matrix \(\mathbf {C}\) such that \({\mathbf {A}_{i,j} = \sum _{l} \mathbf {B}_{i,l} \cdot \mathbf {C}_{l,j}}\). In relational data factorization, one is given a relation (i.e., a set of tuples over the same attributes) and asked to rewrite it in terms of other relations. Consider, for instance, a relation sells(Company, Part, Project), stating that companies sell particular parts to particular projects. While it is well-known that ternary relations, in general, cannot be rewritten as the join of three binary relations (Heath 1971; Jones et al. 1996), we might be interested in an approximation of the ternary relation. That is, we might approximate sells(Company, Part, Project) by the query offers(Company, Part), needs(Project, Part), deliversto(Company, Project) (we follow logic programming notation, where the same variable name denotes a natural join). The question is then how to determine the extensions for the relations offers, needs, and deliversto. The solution found will generally be imperfect, so in ReDF we want to find the best approximation w.r.t. a scoring function, and we allow the user to specify hard constraints. In the example these might specify, e.g., that only tuples in the target relation sells may be derivable from the query.

In this paper, we develop a modeling and solving approach for ReDF using answer set programming (ASP) (Brewka et al. 2011). This is realized by showing for a number of ReDF problems how they can be tackled with ASP. This leads to the identification of constraints and scoring functions, which we then abstract to an even higher-level declarative language. We show that the resulting ReDF framework is general and generic and is in line with the declarative modeling approach to machine learning and data mining as (1) it allows one to easily specify and solve a wide range of well-known data analysis problems (such as tiling, Boolean matrix factorization, discriminative pattern mining, matrix block diagonalization, etcetera), (2) it is effective for prototyping such tasks (as we show in our experiments), even though it cannot yet compete with optimized special purpose algorithms in terms of efficiency, and (3) the constraints and optimization criteria are specified in a declarative and flexible manner. Translating problem definitions in the ReDF framework to ASP models is straightforward, and small changes in the problem definitions generally result in small changes in the model.

Relational data factorization is a form of relational learning: it is a relational analogue of matrix factorization and is therefore relevant to inductive logic programming (Muggleton and De Raedt 1994; De Raedt 2008), and it can also be seen as a form of large-scale abduction (Denecker and Kakas 2002). Moreover, the solution techniques that we adopt are based on answer set programming, which has also been used in some recent work on inductive logic programming (Paramonov et al. 2015; Järvisalo 2011). The implementation techniques we employ may also be used in more traditional inductive logic programming settings.

This paper is structured as follows. Section 2 introduces the formal ReDF framework. Section 3 introduces ASP. Section 4 shows how a wide range of data mining problems can be expressed as ReDF problems. Section 5 introduces some novel problems that the framework can express. Section 6 discusses the encoding of the problems into ASP, while Sect. 7 reports on the experimental evaluation. In Sect. 8 we discuss related work, and we formulate some conclusions and directions for future work in Sect. 9.

2 Relational data factorization

Before we formalize the ReDF problem and approach in its full generality, we illustrate Relational Data Factorization on the sells(Company, Part, Project) example from the Introduction.

2.1 An example

Assume we are given (1) a set of tuples for the database relation sells(Company, Part, Project), (2) a definite shape clause defining the predicate \(\textit{approx} \textit{(Company,} \textit{Part, Project)}\), e.g.,

$$\begin{aligned} \textit{approx} \textit{(Com, Pa, Proj)} \leftarrow \textit{offers(Com, Pa), needs(Proj, Pa), deliversto(Com, Proj),} \end{aligned}$$

which should approximate the database relation sells(Company, Part, Project) in terms of the (unknown) relations offers(Company, Part), needs(Project, Part) and deliversto(Company, Project), and (3) an error function \(\textit{error} (\textit{approx}, \textit{sells})\) that measures how different the database predicate and its approximation are, e.g., the number of tuples that are in one relation but not in the other. Then, the goal is to find sets of tuples for the unknown relations that minimize the error.

In practice, it is usually impossible to find a perfect solution (with \(\textit{error} =0\)) to relational data factorization problems, in this example because of Heath’s theorem (Heath 1971) (as discussed in the Introduction). Therefore, it is often useful to impose further restrictions on the sets to be considered. One such constraint could specify that there is no overcoverage, i.e., that all tuples in \(\textit{approx}\) must be in sells.

2.2 Problem statement

Using a logic programming formalism, we generalize the above example into the following ReDF problem statement.

Given

  • a dataset D: a set of ground facts for target predicate db;

  • a factorization shape Q: \(\textit{approx} ({\bar{T}}) \leftarrow q_1({\bar{T}}_1), \ldots , q_k({\bar{T}}_k)\), where the \(q_i\) are factors and the \({\bar{T}}_i\) denote tuples of variables;

  • a set of constraints C;

  • an error function measuring the difference between two predicates (i.e., between the corresponding sets of ground facts);

Find the set of ground facts F for the factors \(q_i\) that minimizes \(\textit{error} (\textit{approx},\textit{db})\) and for which \(Q \cup F \cup D\) satisfies all constraints in C.

The factorization shape is a single non-recursive rule defining \(\textit{approx}\), the approximation of the target predicate \(\textit{db} \), where the predicates in the body are the factors. If a variable occurs in a body atom \(q_i({\bar{T}}_i)\) but not in the head \(\textit{approx} ({\bar{T}})\), then it is called latent. The task is to find a set F of ground facts defining the factors \(q_i\); each such set F uniquely determines a set of facts for \(\textit{approx}\). Notice that if a predicate \(q_i\) is already known and defined, then the task simplifies.

As in matrix factorization, it is quite likely that a perfect solution, with \(\textit{error} =0\), cannot be obtained. Consider the following example: \(\textit{db} (X,Y) \leftarrow p(X), q(Y)\) and dataset \(D = \{ \textit{db} (a,c), \textit{db} (b,d)\}\). Then it is impossible to perfectly reconstruct the target D. If \(F = \{p(a), p(b), q(c), q(d)\}\), the resulting program overgeneralizes as it entails facts not in D: \(\textit{db} (a,d) \in \textit{approx} \) and \(\textit{db} (a,d) \not \in D\); if, on the other hand, there are facts in D that are not entailed in \(\textit{approx} \), one undergeneralizes (e.g., when \(F = \emptyset \)).

The scoring function in relational factorization measures the error between the predicates \(\textit{approx}\) and db. Instead of minimizing error, however, in some cases it is more convenient to maximize similarity. Since these two perspectives can be trivially transformed from one to the other, we will use both without loss of generality.

2.3 Approach

To make this setup operational, we represent ReDF problems at two different levels. First, at the high level, we characterize typical constraints of interest that are employed across different models. Further, all problems are formulated using the template shown in Listing 1. Second, at the low level, the high-level constraints and encodings are formulated in ASP. The high-level constraints can in principle be automatically transformed into low-level ones.

[Listing 1]
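
Since Listing 1 is not reproduced here, the following sketch gives the kind of template that is meant, reconstructed from the problem statement in Sect. 2.2; the exact keywords of the high-level language are our assumption, not the paper's listing.

Parameters:  problem-specific constants (e.g., K, alpha)
Shape:       approx(T) <- q1(T1), ..., qk(Tk)
Constraints: c1, ..., cm
Minimize:    error(approx, db)      (or Maximize: similarity(approx, db))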

We next illustrate this on the sells example. The high-level description from which we start is given in Listing 2.

[Listing 2]
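
Listing 2 itself is likewise not shown; a plausible high-level model for the sells example, reconstructed from Sects. 2.1 and 2.2, reads:

Shape:       approx(Com,Pa,Proj) <- offers(Com,Pa), needs(Proj,Pa), deliversto(Com,Proj)
Constraints: no-overcoverage       (every approx tuple must also be a sells tuple)
Minimize:    error(approx, sells)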

Next, this high-level formulation can be encoded in and solved using the ASP program given in Listing 3 (here, an ASP program can be thought as a conjunction of logical rules, where implication is denoted by “:-”).

[Listing 3]

We introduce ASP in more detail below, but this model is easy to understand if one is familiar with the basics of logic programming. The ASP model basically defines the necessary predicates in ASP using a set of clauses. In addition, the rule in Line 4 encodes the constraint that whenever a tuple holds for sells(Com, Pa, Proj) there should be 0 or 1 corresponding tuples for the predicate offers(Com, Pa). Furthermore, the minimize statement specifies that we are looking for a model (a set of ground facts or tuples) that minimizes the error. The encoding in Listing  3 together with a set of facts for sells can be given to an ASP solver such as clasp (Gebser et al. 2011b).
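
Since Listing 3 is not reproduced here, the following clingo-style sketch illustrates an encoding along the lines just described. It is our reconstruction rather than the paper's exact listing (the line numbers mentioned above refer to the original listing), and it uses the syntax of recent clingo releases, whereas the experiments in this paper were run with clingo 3, whose aggregate syntax differs slightly.

% candidate tuples for the factors: 0 or 1 choices per projection of sells
{ offers(Com,Pa) }       :- sells(Com,Pa,Proj).
{ needs(Proj,Pa) }       :- sells(Com,Pa,Proj).
{ deliversto(Com,Proj) } :- sells(Com,Pa,Proj).

% factorization shape
approx(Com,Pa,Proj) :- offers(Com,Pa), needs(Proj,Pa), deliversto(Com,Proj).

% hard constraint: no overcoverage (approx may not derive tuples outside sells)
:- approx(Com,Pa,Proj), not sells(Com,Pa,Proj).

% error: sells tuples that are not reconstructed; minimize their number
error(Com,Pa,Proj) :- sells(Com,Pa,Proj), not approx(Com,Pa,Proj).
#minimize { 1,Com,Pa,Proj : error(Com,Pa,Proj) }.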

Observe that the relational data factorization approach we propose perfectly fits within the declarative modeling paradigm for machine learning and data mining (De Raedt 2012). Indeed, the next sections will show that it naturally supports a wide range of popular and well-known factorization problems. Modeling different problems corresponds to specifying different constraints, shapes and optimization functions. By doing so, one obtains a deep understanding of the relationships among the many variations of factorization, and one can easily design, prototype and experiment with new variations of factorization problems. Furthermore, the models of factorization are in principle solver-independent and do not depend on a particular ASP solver implementation.

Notice that it would also be possible to use other constraint satisfaction and optimization approaches (such as, e.g., Integer Linear Programming), but given that we work within a relational framework, ASP is a natural choice. It is also declarative and has the right expressiveness for the class of problems that we will study, many of which are NP-complete such as BMF; see Sect. 4.2.

Finally, let us mention that there are many factorization approaches in linear algebra, databases, and even logic. We provide a detailed discussion of their relationship to ReDF in Sect. 8.

3 Preliminaries: ASP essentials

We use the answer set programming (ASP) paradigm for solving relational data factorization problems. In contrast to the programming language Prolog, which answers queries using a proof-theoretic approach, ASP follows a model generation approach. It has been shown to be effective for a wide range of constraint satisfaction problems (Gebser et al. 2012).

The remainder of this section introduces the essentials of ASP in a rather informal way. ASP is a rich (and technical) research area; we do not focus on technical issues here, as these would complicate the presentation, but rather refer the interested reader to Gebser et al. (2012), Eiter et al. (2009), Leone et al. (2002), Lifschitz (2008) for more details. For the actual implementation, we will use the clasp system (Gebser et al. 2012; Brewka et al. 2011).

Definition 1

(Disjunctive datalog program) A disjunctive datalog program is a finite set of rules of the form:

$$\begin{aligned} a_1 \vee a_2 \vee \ldots \vee a_n \leftarrow b_1, \ldots , b_k, \textit{ not }c_1,\ldots ,\textit{ not }c_h \end{aligned}$$

where \(a_1, \ldots , a_n, b_1, \ldots , b_k,c_1, \ldots c_h\) are atoms of a function-free first order language L. Each atom is an expression of the form \(p(t_1,\ldots ,t_n)\), where p is a predicate name and \(t_i\) is either a constant or a variable. We refer to the head of rule r as \(H(r) = \{a_1,\ldots ,a_n\}\) and to the body as \(B(r) = B^{+}(r) \cup B^{-}(r)\), where \(B^{+}(r) = \{ b_1, \ldots , b_k \}\) is the positive part of the body and \(B^{-}(r) = \{ c_1, \ldots , c_h \}\) the negative.

If a disjunctive datalog program P has variables, then its semantics is defined by that of its grounded version, written as ground(P), in which all variables are substituted with constants from the Herbrand universe \(H_P\) (the constants occurring in the program).

An interpretation I w.r.t. a program P is a set of ground atoms of P. Let P be a positive disjunctive datalog program (i.e., without negation); an interpretation I is called closed under P if for every \(r \in \textit{ground}(P)\) it holds that \(H(r) \cap I \ne \emptyset \) whenever \(B(r) \subseteq I\).

Definition 2

(Answer set of a positive program (Eiter et al. 2009)) An answer set of a positive program P is a minimal (under set inclusion) interpretation among all interpretations that are closed under P.

Definition 3

(Gelfond–Lifschitz reduct) The reduct of a ground program P w.r.t. an interpretation I, written as \(P^I\), is the positive ground program obtained by:

  • removing all rules \(r \in P\) for which \(B^{-}(r) \cap I \ne \emptyset \);

  • removing the literals “\(\textit{not }a\)” from all remaining rules.

Intuitively, the reduct of a program is the program in which all rules whose bodies contradict I are removed and, in the remaining rules, all negative literals are dropped. The interpretation I is a guess as to what is true and what is false.

Definition 4

(An answer set of a disjunctive program) An answer set of a disjunctive program P is an interpretation I such that I is an answer set of positive ground program \(\textit{ground}(P)^I\).

Example 1

Consider the following disjunctive datalog program P.

$$\begin{aligned} a \vee c \leftarrow b. \quad \quad b \leftarrow a, \textit{ not }c. \quad \quad a. \end{aligned}$$

If we take the interpretation \(I=\{a,b\}\) of P as candidate answer set, then the reduct \(P^I\) is

$$\begin{aligned} a \vee c \leftarrow b. \quad \quad b \leftarrow a. \quad \quad a. \end{aligned}$$

and it is easily seen that I is a minimal interpretation closed under \(P^I\), and therefore an answer set.

We also use a special form of disjunctive rules called choice rules (Gebser et al. 2012):

$$\begin{aligned} v_1~\{ a_1, a_2, \ldots a_n \}~v_2 \leftarrow b_1, \ldots , b_k, \textit{ not }c_1,\ldots ,\textit{ not }c_h \end{aligned}$$

where \(v_1\) and \(v_2\) are integer constants. The semantics are as follows: if the body is satisfied, then the number of true atoms in \(\{ a_1, a_2 \ldots a_n \}\) is between \(v_1\) and \(v_2\).

An aggregate atom is an atom of the form \(l \# \{ a_1, \ldots ,a_n \} u\), where l and u are integer constants and each \(a_i\) is a literal. The atom is true in an answer set A iff the number of literals \(a_i\) that are true in A is between l and u.

Another construct is maximization (Gebser et al. 2012; Leone et al. 2002) (minimization is defined analogously), stated as \(\#maximize\{ a_1=k_1, \ldots , a_n=k_n \}\), where \(a_1, \ldots , a_n\) are classical literals and \(k_1, \ldots , k_n\) are integer constants (possibly negative). The semantics are as follows: a model I is selected if the weighted sum \(\sum _i [a_i] \cdot k_i\) is maximal for I, where \([\cdot ]\) are Iverson brackets, i.e., [a] equals 1 if a is true in I and 0 otherwise.
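
As a small, self-contained illustration of a choice rule, an aggregate atom, and an optimization statement, consider the following toy program. It is our own example, written in the syntax of current clingo releases, which differs slightly from the older notation used in the definitions above.

item(a;b;c).
value(a,3). value(b,2). value(c,1).

% choice rule: pick between 0 and 2 items
0 { pick(X) : item(X) } 2.

% aggregate atom in a rule body: small holds if at most one item is picked
small :- #count { X : pick(X) } <= 1.

% optimization: maximize the total value of the picked items
#maximize { V,X : pick(X), value(X,V) }.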

4 Application to data mining problems

In this section we show that the ReDF framework generalizes a wide range of data mining tasks and provides a truly declarative modeling approach for relational data factorization. We introduce a range of constraints and optimization criteria that can be used in practice. The data mining tasks studied include tiling (Geerts et al. 2004), Boolean Matrix Factorization (BMF) (Miettinen et al. 2008), discriminative pattern mining (Knobbe and Ho 2006), and block-diagonal matrix forms (Aykanat et al. 2002).

4.1 Tiling

Data mining has contributed numerous techniques for finding patterns in (Boolean) matrices. One fundamental approach is that of tiling (Geerts et al. 2004). A tile is a rectangular area in a Boolean matrix, represented by a set of rows and a set of columns, such that all cells at the intersections of these rows and columns equal 1.

One is typically not interested in just any tile, but in maximal tiles, i.e., tiles that cannot be extended. For instance, Fig. 1 shows a binary dataset and two tiles. The first tile consists of the first and second column together with the first and second row. All entries for these rows and columns are 1s. Furthermore, it cannot be expanded, as adding the third column or row would also include 0 values. The second tile consists of all three columns and the third row. Together these two tiles “cover” the whole dataset, that is, all cells with value 1 in the matrix belong to one of the tiles. The area of a set of tiles \(\mathcal {T}\) in a dataset D, denoted \(\textit{area}(\mathcal {T}, D)\), is the number of cells in the (union of the) tiles of \(\mathcal {T}\) occurring in D (7 in our example).

Fig. 1: Example of Boolean tiles and their coverage

Definition 5

(Maximum k-Tiling) Given a binary dataset D and a positive integer k, find a tiling \(\mathcal {T}\) consisting of at most k tiles and maximizing \(\textit{area}(\mathcal {T}, D)\).

We now formalize tiling as a relational data factorization problem and then solve it using ASP. Rather than restricting ourselves to Boolean values as in the traditional formulation, we consider the relational case. The standard way of dealing with categorical attribute-value data is to expand it into a sparse Boolean matrix (with one Boolean column per attribute-value pair). In contrast, our formulation employs the attribute-value format directly.

Given a relation \(\textit{db}(\textit{Value},\textit{Attr},\textit{Transct})\), denoting that transaction \(\textit{Transct}\) has \(\textit{Value}\) for \(\textit{Attr}\), the task is to find a set of tiles that can be applied to the transactions to summarize the dataset db. Here, a tile is a set of attribute-value pairs.

Fig. 2: Relational tiling: two relational tiles (right) in a toy dataset (left) concerning cars

In Fig. 2, for example, we can see the initial dataset, in which State is an attribute and Fair and Good are values for this attribute. Moreover, the blue and green areas indicate two relational tiles occurring in particular sets of transactions.

The two example tiles can be expressed as

$$\begin{aligned}&\textit{tile} (i_1,\textit{fair}, \textit{state}).~\textit{tile} (i_1,\textit{old}, \textit{age}). ~\textit{in} (i_1,t_1). ~\textit{in} (i_1,t_3).\\&\textit{tile} (i_2,\textit{gas}, \textit{fuel}).~\textit{tile} (i_2,\textit{sport}, \textit{type}). ~\textit{in} (i_2,t_1). ~\textit{in} (i_2,t_2). \end{aligned}$$

where the first argument of each tile is the index of the tile, the second is the value of the attribute, and the third argument is the name of the attribute. When tile I is applied to a transaction T (i.e., it occurs in the transaction), this is denoted by \(\textit{in} (I,T)\). We call a set of tiles a tiling.

We would like to factorize the initial dataset, represented as a set of \(\textit{db} (\textit{fair}, \textit{state}, t_1), \textit{db} (\textit{old},\textit{age},t_1), \ldots \), using the following shape query:

$$\begin{aligned} \textit{approx} (\textit{Attr},\textit{Value},\textit{Transct}) \leftarrow \textit{tile} (\textit{Indx},\textit{Value}, \textit{Attr}), \textit{in} (\textit{Indx},\textit{Transct}). \end{aligned}$$
(1)

To reason about the coverage of the shape, i.e., which transactions and attributes are covered in the dataset (indicated by color in Fig. 2), we use the following definition:

$$\begin{aligned} \textit{covered} ( \textit{Transct},\textit{Attr}) \leftarrow \textit{approx} (\textit{Attr},\textit{Value},\textit{Transct}). \end{aligned}$$

For instance, \(\textit{covered} (t_1,\textit{age})\) holds because \(\textit{tile} (i_1,\textit{old},\textit{age})\) and \(\textit{in} (i_1,t_1)\) hold.

To specify the maximum k-tiling problem, we need the following constraints.

  • one-value-attribute: for every attribute of a tile there is at most one value:

    $$\begin{aligned} \leftarrow \textit{tile} (\textit{Indx},\textit{Val}_1, \textit{Attr}),\textit{tile} (\textit{Indx},\textit{Val}_2, \textit{Attr}),\textit{Val}_1\ne \textit{Val}_2. \end{aligned}$$
    (2)
  • no-tile-intersection: tiles do not overlap in the same transaction

    $$\begin{aligned} \leftarrow \textit{in} (I_1,T), \textit{in} (I_2,T), \textit{tile} (I_1,V,A), \textit{tile} (I_2,V,A), I_1 \ne I_2. \end{aligned}$$
    (3)
  • no-overcoverage: tiles cannot “overcover” the transaction, that is, they are only allowed to cover tuples that are in the dataset;

    $$\begin{aligned} \leftarrow \textit{tile} (\textit{Indx},\textit{Value}, \textit{Attr}),\textit{in} (\textit{Indx},\textit{Transct}), \textit{not } \textit{db} (\textit{Value}, \textit{Attr}, \textit{Transct}). \end{aligned}$$
    (4)
  • number-of-patterns(K): there are at most k tiles (numbered from 1 to k):

    $$\begin{aligned} \textit{Indx} = 1 \vee \textit{Indx} = 2 \vee \ldots \vee \textit{Indx} = k \leftarrow \textit{tile} (\textit{Indx},\textit{Value}, \textit{Attr}). \end{aligned}$$

Furthermore, the maximum k-tiling problem searches for the k tiles that maximize the area. This leads to an instance of the \(\textit{similarity} \) score defined by

$$\begin{aligned} \texttt {coverage}: \quad \#\{ (T,A) : \textit{covered} (T,A) \}. \end{aligned}$$
(5)

The statement above uses standard mathematical notation for the optimization function and reads as follows: count (#) the cardinality of the set (\(\{ \cdot \}\)) of tuples (T, A) such that ( : ) \(\textit{covered} (T,A)\) holds. When we translate this statement into ASP, we have to use ASP's special syntax (#maximize) to capture this mathematical formulation.

We specify the high-level model for maximum k-tiling in Listing 4.

[Listing 4]
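
Listing 4 is not reproduced here; using the shape, constraints and score introduced above, our reconstruction of the high-level model reads:

Shape:       approx(Attr,Value,Transct) <- tile(Indx,Value,Attr), in(Indx,Transct)
Constraints: one-value-attribute, no-tile-intersection, no-overcoverage, number-of-patterns(K)
Maximize:    coverage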

To illustrate the advantages of our declarative and modular approach, let us consider a small variation of the tiling task, in which tiles may overlap.

Fig. 3: Example of a 0/1 database with a tiling consisting of two overlapping tiles (darkest shaded area corresponds to the intersection of the two tiles), due to Geerts et al. (2004)

Overlapping tiling Figure 3, taken from Geerts et al. (2004), illustrates a Boolean dataset with two overlapping tiles. We investigate two new variations of maximum k-tiling: overlapping and noisy tiling. The first addresses the global pattern mining task in which the overall coverage is optimized while overlaps between tiles are allowed. The second addresses the setting in which a tile may have a number of mismatches when covering a transaction. It is straightforward to change these assumptions in our ReDF framework (and in the corresponding ASP implementation). For the first task, it only involves replacing the constraint no-tile-intersection by the following constraint.

  • overlapping-tiles(N): two tiles in one transaction can intersect only on at most N attributes:

    $$\begin{aligned} \leftarrow \textit{in} (I_1,T), \textit{in} (I_2,T), \textit{tile} (I_1,V,A_1), \textit{tile} (I_2,V,A_2), I_1 \ne I_2, \# \{ A_1 = A_2\} > N. \end{aligned}$$

To model the variation that tolerates some noise in the data, we can replace constraint no-overcoverage by

  • noisy-overcoverage(N): every tile I can overcover at most N attributes in every transaction T where it occurs:

    $$\begin{aligned} \leftarrow \textit{tile} (I,V,A),\textit{in} (I,T), \textit{not}~ \textit{db} (V, A, T), \# \{ A \} > \textit{N}. \end{aligned}$$

Both variations show that a slight change in the desired properties of a solution leads to a small change in the model and a correspondingly small change in the implementation.

4.2 The Discrete Basis Problem (DBP) and Boolean matrix factorization (BMF)

BMF has been extensively studied by Miettinen (2012), resulting in the well-known ASSO algorithm. Let us now show how it can be expressed as a ReDF problem. As a starting point we take the same shape (Eq. 1) as in the tiling example in Sect. 4.1. However, we need to change the constraints to reflect the different properties of the desired solutions: tiles may now overlap, since one is not interested in the tiles per se, but in a good coverage of the dataset. That is why we remove the no-tile-intersection and no-overcoverage constraints, and introduce a notion of ‘overcoverage’ instead, by means of the following definition:

$$\begin{aligned} \textit{overcovered} (T,A) \leftarrow \textit{approx} (V,A,T), \textit{not}~ \textit{db} (V,A,T). \end{aligned}$$

In the Discrete Basis Problem, the scoring function maximizes the number of covered elements, while minimizing the overcovered ones. The latter term can be simply defined as:

  • \(\texttt {overcoverage}\): \(\#\{ (T,A) : \textit{overcovered} (T,A) \}. \)

We specify the high-level DBP model in Listing 5.

[Listing 5]
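
Listing 5 is not shown; a sketch of the DBP model, reconstructed from the discussion above (the exact form in the paper may differ), is:

Shape:       approx(Attr,Value,Transct) <- tile(Indx,Value,Attr), in(Indx,Transct)
Constraints: one-value-attribute, number-of-patterns(K)
Maximize:    coverage - alpha * overcoverage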

This formulation mimics the Discrete Basis Problem (Miettinen et al. 2008). That is, K plays the role of the basis size and \(\alpha \) mimics the bias towards rewarding covering and penalizing overcovering (the flags --bonus-covered and --penalty-overcovered in ASSO).

It is well-known that tiling and Boolean matrix factorization (BMF) are closely related (Miettinen 2012). Hence, let us also briefly show how BMF can be realized in our framework. It corresponds to an instance of DBP where only binary values (true and false) are possible and the no-overcoverage constraint applies. Hence, it is required that the factorization undercovers the initial dataset, i.e., if there is a 0 in a position in the original dataset, then there cannot be a 1 in the approximation. Therefore, the optimization criterion of DBP is further simplified and we obtain the following BMF model, without overcovering, in Listing 6.

[Listing 6]
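
Again, the listing itself is not reproduced; under our reading, the BMF model differs from the DBP model only in re-adding the no-overcoverage constraint and dropping the overcoverage term:

Shape:       approx(Attr,Value,Transct) <- tile(Indx,Value,Attr), in(Indx,Transct)
Constraints: one-value-attribute, no-overcoverage, number-of-patterns(K)
Maximize:    coverage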

4.3 Discriminative k-pattern set mining

A common supervised data mining task is that of discriminative pattern set mining (Knobbe and Ho 2006). Let \(\textit{db}(\textit{Value},\textit{Attr},\textit{Transct})\) be a categorical dataset, \(\textit{positive}(T)\) (\(\textit{negative}(T)\)) be the set of positive (negative) transactions, and k the number of tiles. Then, the task is to find extensions of the relations tile(\(\textit{Indx}\),\(\textit{Value}\),\(\textit{Attr}\)) and in(\(\textit{Indx}\),\(\textit{Transct}\)) such that positive and negative transactions are discriminated. A standard interpretation is to find tiles that cover as many positive transactions and as few negative ones as possible (Liu et al. 1998). The only required change in the model concerns the scoring function (and assigning some weight to the errors):

$$\begin{aligned} \#\{T: \textit{covered} (T),\textit{positive}(T) \} - \alpha \#\{ T: \textit{covered} (T),\textit{negative}(T) \}, \end{aligned}$$
(6)

where \(\alpha \) is a constant that represents the weights for the errors made. It is typically a domain specific parameter (the cost of covering a negative example by a rule, i.e., the false positive cost or a weight of a negative example). Let us denote the coverage of the positive transactions as \(\texttt {coverage}^+\) (left set term in Eq. 6) and the coverage of negative as \(\texttt {coverage}^-\) (right set term in Eq. 6).

Given that we have no no-overcoverage constraint and negative transactions can be covered, the optimization criterion is given by

$$\begin{aligned}&\textit{similarity} (T) = \texttt {coverage} ^+ - \alpha \times \texttt {coverage} ^- . \end{aligned}$$

This corresponds to the high-level model in Listing 7.

[Listing 7]
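
Listing 7 is not reproduced; a sketch of the discriminative model, with a constraint set that is our assumption rather than the paper's exact choice, is:

Shape:       approx(Attr,Value,Transct) <- tile(Indx,Value,Attr), in(Indx,Transct)
Constraints: one-value-attribute, number-of-patterns(K)
Maximize:    coverage+ - alpha * coverage-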
Fig. 4: Re-arranging a matrix in block-diagonal form (Animals dataset): (a) regular, (b) with penalties, (c) with noisy blocks and penalties

4.4 Block-diagonal matrix form

Aykanat et al. (2002) introduced the problem of permuting the rows and columns of a sparse matrix into block-diagonal form, together with an algorithm for solving it. They relate this problem to other combinatorial and classical linear algebra problems. The underlying block-diagonal structure of a matrix can be used to parallelize certain matrix computations. An illustration of (several variants of) block-diagonalization of the Animals dataset is depicted in Fig. 4.

We reduce it to a form of tiling. The shape query is the same as in tiling, but the constraints are different: if a tile \(I_1\) uses an attribute A, then no other tile \(I_2\) may use the same attribute, i.e., each attribute can belong to at most one tile:

  • item-blocking: \(\leftarrow \textit{tile} (I_1,A), \textit{tile} (I_2,A), I_1 \ne I_2. \)

Similarly, only one tile can occur in each transaction T:

  • transaction-blocking: \(\leftarrow \textit{in} (I_1,T), \textit{in} (I_2,T), I_1 \ne I_2. \)

We also modify the optimization criterion to take into account elements not covered by a tile but blocked by it. Every tile that selects attributes and transactions prevents other tiles from using these attributes and transactions, by means of the item-blocking and transaction-blocking constraints. We therefore penalize excessive usage of attributes and transactions by a single tile. We do this to improve the block form of the matrix: in this task we are not just interested in a tiling with maximal coverage, but in a tiling that maximizes the number of elements on the diagonal and minimizes the number of elements everywhere else. To enforce this we introduce two functions:

  • \(\texttt {item-penalty}\): \(\#\{ (T,A) : \textit{approx} (T,A'), ~\textit{not } \textit{covered} (T,A) \}\)

  • \(\texttt {transt-penalty}\): \(\#\{ (T,A) : \textit{approx} (T',A), ~\textit{not } \textit{covered} (T,A) \}\)

Then, the whole problem is formulated in Listing 8.

[Listing 8]
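
Listing 8 is not shown; combining the blocking constraints with the tiling constraints and the penalties above, our reconstruction of the model reads (the exact constraint set in the paper may differ):

Shape:       approx(Attr,Value,Transct) <- tile(Indx,Value,Attr), in(Indx,Transct)
Constraints: one-value-attribute, no-overcoverage, item-blocking, transaction-blocking, number-of-patterns(K)
Maximize:    coverage - alpha * item-penalty - beta * transt-penalty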

If we omit \(\texttt {item-penalty}\) and \(\texttt {transt-penalty}\), we obtain the standard optimization function for tiling. In the experimental section we evaluate the effect of the presence of this penalty.

5 Beyond classic problems

So far we have focused on matrix-like representations of the data, in which the dataset is represented by instances of \(\textit{db} (V,A,T)\), stating that transaction T has value V for attribute A. This representation is independent of the number of attributes and values; it allows one to easily specify constraints over all attributes and to access the data using the predicate db only. We will now show that it is also possible to use other, purely relational representations, such as the sells example from the Introduction.

Section 2 already provided the sells example for decomposing a ternary relation into three binary ones. In the shape for the sells example in Listing 3 there is no latent variable: there are only attributes from the original dataset. Since there is no latent variable, there is no “pattern” to be found over which to optimize, which allowed us to use a simple error function involving only one type of atom.

However, latent variables can also be useful in a purely relational setting. Let us illustrate this on an example inspired by the ArXiv community analysis example of Gopalan and Blei (2013). Assume we are given a relation publishedIn with attributes Author, University, and Venue, specifying that an author belonging to a particular university publishes in a particular venue. Furthermore, assume we want to factorize this relation into the relation \(\textit{approx}\) (A,U,V) by introducing a latent attribute Topic, denoted as T. The latent topic variable clusters authors, universities and venues together in such a way that their join results in publications.

We obtain the following high-level model in Listing 9, where \(\alpha \) is the constant that indicates the relative cost of overcovering an element and the integer constant k is the number of values that the latent variable (T) can take:

$$\begin{aligned}&\textit{approx} (A,U,V) \leftarrow \textit{interestedIn}(A, \textit{T}), \textit{specializedIn}(U, \textit{T}), \textit{inField}(V, \textit{T}). \end{aligned}$$
[Listing 9]
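
Listing 9 is not reproduced; a sketch of the high-level model, assuming the constraint and score names used earlier, is:

Shape:       approx(A,U,V) <- interestedIn(A,T), specializedIn(U,T), inField(V,T)
Constraints: number-of-patterns(k)   (the latent variable T takes at most k values)
Maximize:    coverage - alpha * overcoverage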

The corresponding model without latent variables would be different only in the decomposition shape, i.e., it would look like

$$\begin{aligned} \textit{approx} (A,U,V) \leftarrow \textit{worksAt(A,U), publishesAt(A,V), knownAt(U,V)}. \end{aligned}$$

Discriminative relational learning In the spirit of discriminative pattern mining, described in Sect. 4.3, we can also do discriminative learning in the purely relational setting. To do so, we assume that the relation has an extra argument Co-Author and we would like to discriminate the dataset by a particular Co-Author \(c^+\), i.e.,

$$\begin{aligned} \begin{aligned}&\texttt {coverage} ^+(A,U,V) \leftarrow \textit{approx} (A,U,V), \textit{publishedIn}(A,U,V,c^+).\\&\texttt {coverage} ^-(A,U,V) \leftarrow \textit{approx} (A,U,V), \textit{publishedIn}(A,U,V,C), C \ne c^+. \end{aligned} \end{aligned}$$
(7)

Then, the optimization criterion remains the same as in Sect. 4.3. Intuitively, if we have only information about an author of a paper (together with his or her university affiliation and a venue), we use this to ‘predict’ his or her co-author using the patterns we obtain in this discriminative setting.

6 Implementation

This section describes how ReDF models can be implemented in ASP. We do this for the basic problem of tiling, as well as for the purely relational data factorization presented before. Implementations of the other variations are included in Appendix C. Our primary implementation is written for the clasp system (Gebser et al. 2012; Brewka et al. 2011) and will be made available online upon acceptance of this manuscript.

6.1 General computation methods: greedy and sampling approaches

In all described problems, the goal is to find k patterns or tiles, where a pattern is interpreted as a set of facts corresponding to a particular value of the latent variable. We will follow an iterative approach to finding these patterns, in which the discovery of the next pattern or tile will be encoded in ASP. We will consider both a greedy and a sampling algorithm for realizing this. The sampling approach is intended for better scalability and will be evaluated in Sect. 7.1.

Greedy model The greedy approach is described in Algorithm 1. Essentially, when the next best pattern has been computed (where pattern is a set of facts associated with the pattern identifier, e.g., in tiling a pattern is a set of transactions and attributes), it is added to the current set of patterns. The specific part for each tile is represented by executeProgram and is encoded separately in ASP. Note that this greedy, iterative approach to finding k patterns is very common in pattern mining. Theoretical bounds on the solution quality of the greedy approach have been studied in the context of the maximum k-set coverage problem (Hochbaum and Pathria 1998; Feige 1996); more details can be found in Appendix F.

[Algorithm 1]

Column sampling execution model To improve scalability, we employ a sampling approach. Interestingly, our approach is different from most existing sampling techniques in data mining: instead of sampling rows or patterns, we sample columns. Algorithm 2 presents the column sampling approach we propose. The key difference with the greedy approach is that instead of determining the next best pattern on the overall dataset in each iteration, this approach samples N subsets of the data and determines the next best pattern for each of these subsets. The best among these is then fixed, and the process is repeated. We empirically evaluate the effects of sampling on the quality of the computed patterns and on the runtime in the experimental section. Quality bounds for this type of greedy search have also been analyzed previously (Hochbaum and Pathria 1998); for more details we refer to Appendix F.

[Algorithm 2]
[Listing 10]

6.2 Data mining problems expressed in the framework

The maximum k-tiling problem can be encoded in answer set programming as indicated in Listing 10. The code implements the greedy model, i.e., Algorithm 1, for the maximum k-tiling problem with a fixed number of tiles (Geerts et al. 2004). It assumes we have already found an optimal tiling for \(n-1\) tiles, and indicates how to find the n-th tile that covers the largest area. The n-th tile is called \(\textit{currentI}\) in the listing. Further, we have information about the names of the attributes and the possible values for each attribute through the predicates \(\textit{col} (\textit{Attr})\) and \(\textit{valid} (\textit{Attr}, \textit{Value})\). That is, \(\textit{col} (A)\) is a unary predicate that encodes the possible column indices, and \(\textit{valid} (A,V)\) is a binary predicate that encodes which values V can occur in column A.

Let us explain the code in Listing 10. The constraint in Line 2 generates at most one value for each attribute. The constraints in Lines 4 and 6 compute the transactions where the current tile cannot occur, i.e., intersect(T) is the set of all transactions where the current tile overlaps with another tile and the current tile cannot cover these transactions. Similarly, overcovered(currentI,T) is the set of transactions that cannot be covered because there is an element in the current tile, with fixed index currentI, that is not present in transaction T. The constraint in Line 8 states that if the tile does not violate the overcovering and intersection constraints in a transaction, it occurs in the transaction. Line 10 defines the coverage and the optimization constraint in Line 11 enforces the selection of the best model.
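
Since Listing 10 is not reproduced here, the following clingo-style sketch illustrates an encoding with this structure. It is our reconstruction (the line numbers in the preceding paragraph refer to the paper's original listing, not to this sketch) and uses the syntax of recent clingo releases rather than the clingo 3 syntax used in the experiments. The constant currentI denotes the index of the tile currently being sought.

% guess at most one value per attribute for the current tile (one-value-attribute)
0 { tile(currentI,V,A) : valid(A,V) } 1 :- col(A).

% transactions in which the current tile would overlap an earlier tile
intersect(T) :- in(I,T), I != currentI, tile(I,V,A), tile(currentI,V,A).

% transactions that the current tile would overcover
overcovered(currentI,T) :- tile(currentI,V,A), db(_,_,T), not db(V,A,T).

% the current tile occurs in every transaction where neither constraint is violated
in(currentI,T) :- db(_,_,T), not intersect(T), not overcovered(currentI,T).

% coverage of cells and the optimization statement
covered(T,A) :- tile(I,V,A), in(I,T).
#maximize { 1,T,A : covered(T,A) }.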

Theorem 1

(Correctness of the greedy ASP tiling encoding) The ASP program \(\mathcal {P}\) defined by the Listing 10 computes the k-th largest tile w.r.t. the scoring function coverage (5) as extensions of the predicates \(\textit{tile} (k,\cdot ,\cdot )\) and \(\textit{in} (k,\cdot )\) in its answer set \(\mathcal {A}\), provided that the dataset is represented extensionally through the predicates db, \(\textit{valid}\), and \(\textit{col}\) and the \(k-1\) already found tiles are represented extensionally through the predicates \(\textit{tile} (I,\cdot ,\cdot )\) and \(\textit{in} (I,\cdot )\) for \(I \in [1,k-1]\).

For the proof, see Appendix B. The clasp encodings for the other models are sketched in Appendix C.

6.3 Purely relational data factorization

In Sect. 5 we presented a factorization of the publishedIn relation into three binary relations. It constitutes a proof-of-concept prototype model in ASP and could be improved by, e.g., incorporating heuristics.

The general structure of the ASP encoding is similar to the sells example in Listing 3: we indicate here only a possible optimization for the relation generators. We use the left-to-right order of the atoms in the schema (replicated below) while generating candidates for the factorization.

$$\begin{aligned} \textit{approx} (A,U,V) \leftarrow \textit{interestedIn}(A, T), \textit{specializedIn}(U, T), \textit{inField}(V, T). \end{aligned}$$

Implementation differences When we generalize the factorization encoding from two relations to three, we observe a slight implementation difference. Factorization with a two-relation shape can be naturally implemented using the core ASP generate-and-test paradigm: once we have guessed an extension for a certain value of the latent variable, we propagate it to the second relation and test against the constraints. This strategy is often deployed in specialized algorithms (Geerts et al. 2004; Miettinen et al. 2008). For a shape with more relations, we guess an extension of one relation and then constrain the possible values we generate for the second relation (e.g., see Line 2 in Listing 11). In general, we search for one value of the latent variable at a time using a greedy strategy (as in tiling). Theoretically, we could search for all values of the latent variable simultaneously by replacing the fixed latent parameter by a variable and searching over it as well, but the work of Guns et al. (2013b) provides evidence that this approach does not scale well unless special propagators are introduced into the solver. Such a technique would also allow extending the method to shapes with more than three relations.

7 Experiments

The main goal of this section is to evaluate whether ReDF problems can be solved using a generic solver. In particular, we focus on solving the problem formulations as we specified them in ASP. We investigate whether the problems can be solved, and for a number of tasks compare the results and runtimes to those obtained by specialized algorithms. Since we here use generic problem formulations and generic solvers that have neither been designed nor optimized for the tasks under consideration, we cannot expect the approach to be as efficient as specialized algorithms. However, what is more important is that we demonstrate that all tasks formalized and prototyped using the ReDF framework can be solved using a unified approach.

Experimental setup and datasets The ASP engine we use is 64-bit clingo (clasp with the gringo grounder) version 3.0.5 with the parameter --heuristic=Vmtf (see Appendix A for details on the parameters), and all experiments are executed on a 64-bit Ubuntu machine with an Intel Core i5-3570 CPU @ 3.40GHz \(\times \) 4 and 8GB memory, except for maximum k-tiling on the Chess and Mushroom datasets, where an Intel Xeon CPU with 128GB of memory was used due to high memory requirements (all experiments are single-threaded). For most experiments we use the datasets summarized in Table 1, all but one of which originate from the UCI Machine Learning repository (Bache and Lichman 2013). The Animals (with Attributes) dataset was taken from Osherson et al. (1991). For the purely relational factorization task, the data and experiment results are described separately in the corresponding subsection.

In Sect. 7.1 we show how ReDF formulations of existing data mining tasks (from Sect. 4) can be solved using the implementation presented in Sect. 6; in Sect. 7.2 we show the results of the purely relational data factorization task. The ASP solver parameters used in the experiments, as well as a breakdown of the individual solving steps and their runtimes (determined in a meta-experiment), are presented in Appendix A.

Table 1 Dataset properties

7.1 Solving existing tasks

Maximum k-tiling in categorical data We first consider the maximum k-tiling problem from Sect. 4.1 and present timing and coverage results in Table 2 obtained on all datasets from Table 1.

In all cases the problem specification given in Listing 10 was used to greedily mine \(k=25\) tiles. Since the problem becomes more constrained as the number of tiles increases, runtime decreases for each additional tile mined. We therefore report total runtime and coverage for different values of k, i.e., for different total numbers of tiles. Only \(k=10\) tiles were mined on Chess and Mushroom due to long runtimes.

Effect of sampling As we can see from Table 2a, runtimes are quite long on datasets like Mushroom. To address this issue, we use the sampling procedure of Algorithm 2 with the following parameters: \(\alpha = 0.4\) and \(N = 20\), i.e., 40% of all attributes were selected uniformly at random for each sample and 20 samples were used. Intuitively, the larger the sample size and the more samples, the better we approximate the exact result.

With the given parameters, we attain an order of magnitude improvement in runtime: instead of 19 hours with the regular algorithm, using sampling it takes only one hour to compute 10 tiles as indicated in Fig. 5a. The effect of using sampling on coverage can be seen in Fig. 5b: the first tiles that are mined have lower coverage than when sampling is not used, but after a while the difference in coverage with LTM-k remains more or less constant and even slightly decreases. LTM-k is the original, specialized tiling algorithm, to which we compare next.

Comparison to a specialized algorithm We now compare the performance of the ASP-based implementation of the LTM-k greedy strategy to that of a specialized implementation. Figure 5a, b present both runtime and coverage comparisons obtained on Mushroom, both for our approach (with and without sampling) and the specialized miner.

Table 2 Maximum k-tiling

Without sampling, we can see that our approach gives the same results in terms of the coverage as the LTM-k algorithm. This is as expected though, since both LTM-k and our approach guarantee to find an optimal solution in each iteration. The slight difference between the two coverage curves in Fig. 5b is caused by the fact that multiple tiles can have the same (maximum) area, and some choice between those has to be made. Although these choices are typically made deterministically, the different implementations make decisions based on different criteria, resulting in slightly different tilings.

Unfortunately, the ASP solver is not as efficient as the specialized miner as can be seen in Fig. 5a, and the generality of the approach comes at the cost of longer runtimes. However, as already discussed, using a sampling approach can substantially decrease the runtime. Experiments on other datasets showed similar behavior to that depicted here.

Fig. 5: Tiling comparison (runtime, coverage) with LTM-k (Mushroom dataset): (a) runtime, (b) coverage

Overlapping tiling To evaluate the overlapping tiling task from Sect. 4.1, we apply the model in Listing 12 (ASP encoding in Appendix C) to the five smaller datasets from Table 1. We experiment with two levels of overlap, i.e., parameter N is set to either 1 or 2: tiles can intersect on at most one or two attribute(s). As the results in Table 3 show, allowing limited overlap can lead to a small increase in coverage, but runtimes also increase due to the costly aggregate operation in Line 1 of Listing 12.

However, what is important to emphasize here is that only a small change in the problem formalization is sufficient to allow for overlap in the tilings, while the solver can still solve the problem without any further changes. And although the runtimes are longer when more overlap is allowed, the difference with the basic, non-overlapping setting is moderate.

Table 3 Maximum k-Tiling with overlap

Boolean matrix factorization (BMF) We perform Boolean matrix factorization (Sect. 4.2) by applying the formalization of Listing 13 and compare the results to those obtained by ASSO (Miettinen 2012) with the no-overcoverage flag (-P1000). The factorization rank k is incremented by one in each iteration, and meanwhile coverage gain and runtime are measured. The results for Animals are presented in Fig. 6 and show that coverage is almost identical to that obtained by ASSO. Again, this is unsurprising, as our implementation follows the same solving strategy. However, runtimes are several times higher, which is due to the usage of a general solver that is not optimized for this type of task. Results obtained on other datasets are very similar and are therefore not presented here.

Fig. 6: Boolean matrix factorization on the Animals dataset. Runtime and coverage are depicted for different factorization ranks

Fig. 7: Discriminative pattern set mining summary: runtime (left) and coverage (right). (a) Runtime (in s) to mine the k-th discriminative pattern on the Chess dataset (\(\alpha = 1\), i.e., positive and negative tuples are weighted equally). (b) Discriminative mining coverage on the Chess and Tic-tac-toe datasets (\(\alpha = 1\))

Fig. 8: Discriminative pattern set mining (Tic-tac-toe dataset): precision (left) and recall (right) for different \(\alpha \), i.e., for varying weights of covering negative transactions

Discriminative pattern set mining Here we demonstrate how the discriminative k-pattern mining model from Sect. 4.3 can be solved. For this we use Chess and Tic-tac-toe from Table 1, each of which has a binary class label indicating whether a game was won or not and can therefore be naturally used for this task.

We apply the encoding from Listing 14 to both datasets, set \(\alpha = 1\) to weigh positive and negative tuples equally, and summarize the results in Fig. 7b. The results show that five patterns suffice to cover all positive examples of Tic-tac-toe, hence mining more than five patterns would be useless. 92 of the 718 covered tuples are negative (\(12.8\%\)), while \(34.7\%\) of the tuples in the complete dataset are negative. For Tic-tac-toe, the time needed to solve this task is very limited: about half a second.

Figure 7a shows the runtime needed to iteratively find subsequent patterns in the Chess dataset. Interestingly, it seems that the problem becomes substantially easier (computationally) once the first few patterns have been found: the runtime per pattern drops heavily. This confirms that the search space shrinks when the problem becomes more constrained, i.e., the number of answer sets decreases with the addition of more constraints.

We next show the influence of the \(\alpha \) parameter, i.e., the relative weight of covering positive and negative tuples in the optimization criterion. By increasing \(\alpha \), the ‘penalty’ for covering a negative tuple is increased and the algorithm can be forced to select more conservative rules. We investigate the effect of this parameter by measuring and comparing precision and recall of the obtained pattern sets for \(\alpha = 1\) and \(\alpha = 5\). Figure 8 shows that precision goes to 1 when \(\alpha \) is increased, while recall decreases; the latter can be compensated for by mining a larger number of patterns.

This task differs from the previous one in its optimization criterion: positive coverage penalized by negative coverage allows for fast inference and discovery of the optimal solution, which results in shorter runtimes than for tiling.

Matrix block-diagonal form We apply three versions of the encoding to the Animals dataset (Osherson et al. 1991). The results presented in Fig. 4 demonstrate that the Animals dataset can be re-arranged into block-diagonal form using our proposed framework. The runtimes in all experiments are on the order of seconds. The parameters used in the experiments were \(\alpha =\frac{3}{20}\) and \(\beta =\frac{1}{20}\). Figure 4c demonstrates the model from Sect. 4.4, with the same \(\alpha \) and \(\beta \), and \(N=1\). The low-level encoding of this model is given in Listing 15.

Fig. 9: Clustering in topics by ReDF. Red nodes represent co-authors, blue their university cities, white nodes venues, and green the topics that bind them together. If there is an edge between a topic and a node, then there is a corresponding element in the relation (i.e., interestedIn, specializedIn or inField)

7.2 Purely relational data factorization

In Sect. 5 we described how to model the factorization of publishedIn(Author, University, Venue) into three binary relations with a latent variable Topic. We now evaluate whether the standard ASP solver can solve this task. Unfortunately, we cannot expect a generic solver to handle enormous datasets such as the one from ArXiv as described by Gopalan and Blei (2013). Instead, we demonstrate a proof-of-concept of solving the model in Listing 16 on a moderate dataset.

We constructed a dataset for a well-known colleague from the data mining community: Bart Goethals (Antwerp University). We collected his publication list from Microsoft Academic Search and extracted for each paper the publication venue, and all co-authors together with their corresponding affiliations (i.e., the last known affiliation for each author in this list of papers). Each unique combination of venue, co-author, and affiliation resulted in a tuple in the publishedIn relation. The complete dataset contains 57 tuples over 19 universities, 38 authors, and 15 venues.

Intuitively, if a set of authors from a set of universities publish in a set of venues, then there must be an underlying research topic that unites them. Hence, by factorizing the relation into three separate relations, we cluster each of the entity types into a (fixed) number of topics, as indicated by the value of the latent Topic variable.

The results for factorization using \(K = 12\) topics and \(\alpha =\frac{1}{2}\) are presented in Fig. 9, including co-authors (red), universities (blue), publication venues (purple), and topics (green). To determine the number of topics K, we tracked the optimization criterion while increasing K and stopped when this no longer improved.

Since the task is of an exploratory character, we can only qualitatively evaluate the results. We observe that all data mining venues are located together in the center, connected to the same topics. SEBD, an Italian database conference, stands apart, and there is also a separate block for database and computing venues DaWaK and SAC. Manual inspection of the results indicates the topics (or clusters) to be coherent and meaningful: they represent different affiliations and groups of co-authors that Bart Goethals has collaborated with. For example, topic 5 contains the SDM conference, the University of Helsinki, and three co-authors specialized in Data Mining. Hence, this topic could be described as “Data mining collaboration with the University of Helsinki”, which makes perfect sense as Bart Goethals was previously a researcher in Helsinki.

Not all authors are represented in the factorization. How much of the publishedIn dataset is covered depends on the number of topics K (which was chosen as described before). The higher the cardinality of the pattern set, the larger the total coverage. The \(\textit{covered}\) elements positively contribute to coverage, whereas the \(\textit{overcovered}\) elements contribute negatively. This implies that each pattern is chosen such that the number of \(\textit{covered}\) and \(\textit{overcovered}\) elements are balanced and the optimization criterion is maximized. In general, covering all authors with few patterns would lead to significant overcovering of the original dataset, while introducing too many patterns would create clusters with only one author (which is clearly undesirable, since these clusters would not be meaningful).

The decompositions, as the one depicted in Fig. 9, could serve as a basis for new analyses. For example, we might visualize the intersection of common (latent) topics shared by two researchers. We outline possible examples in Appendix E.

Relational factorization without a latent variable In Sect. 5, we also described a factorization that does not use any latent variables (analogous to the sells example from the Introduction). We evaluate this model using the encoding in Listing 11 on the same dataset as used in the previous experiment, i.e., the relation publishedIn(Author, Uni, Venue) for Bart Goethals.

In general, factorizations do not perfectly match the original relation (i.e., \(\textit{error} \ne 0\)), but in this particular case the system found a lossless solution. It is easy to see that this will not always be possible though. For instance, let us assume we keep multiple affiliations per author in the dataset. For example, apart from a fact \(\textit{p(bonchi,barcelona,pakdd)}\), there may be another fact \(\textit{p(bonchi, pisa, pakdd)}\) in publishedIn. Although the same factorization would be found by the solver, the found solution would be imperfect as the latter fact is not represented in the factorized relation.

Solving this task was computationally easy, since there is no latent variable to iterate over: the runtime was only 0.01 s. Table 4 presents a summary.

Table 4 Experimental summary for pure relational factorizations from Sect. 5

Relational discriminative learning Here we investigate discriminative learning in the purely relational setting, using the discriminative optimization criteria described in Sect. 5. For this experiment we collected DBLP data for two well-known researchers in the field of data mining, Jiawei Han and Philip S. Yu, where the publications of each researcher constitute one class. Since DBLP data does not contain author affiliations, we replace this attribute with the publication date, converted to a categorical variable \(M \in \{ \textit{old}, \textit{recent}, \textit{new} \}\) using the following rule: if the date is later than 2010, it is represented as “new”; if it is between 2005 and 2010, it is “recent”; otherwise it is “old”. The complete dataset contains around 6000 ground atoms of the form \(\textit{paper(Author, Age, Venue, Class)}\), where the last argument indicates whether the paper was (co-)authored by Han or by Yu. The goal is to predict this class from the author, venue, and age using discriminative rules as defined in Eq. 7. In this experiment the shape is

$$\begin{aligned} \textit{approx} (A,M,V) \leftarrow \textit{author}(A,T), \textit{paperAge}(M,T), \textit{venue}(V,T), \end{aligned}$$

where T is a latent variable. As in the previous discriminative experiment, Fig. 10 presents an overview of how precision and recall depend on the number of patterns and on \(\alpha \), the penalty for covering the incorrect class. Runtimes are similar to those in Fig. 7a: ASP finds an optimal solution within half an hour and then spends a substantial amount of time proving that it is indeed the best solution. We therefore used a time limit of one hour per pattern to speed up the computation. This limit was only reached during the computation of the first pattern, for both values of \(\alpha \).
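
The discretization of publication dates described above is straightforward to express in ASP. The following is a minimal sketch; the helper predicates date/2 and ageOf/2 are hypothetical names and not part of the original encoding:

% Map each paper's publication year to a categorical age value.
ageOf(P, new)    :- date(P, Y), Y > 2010.
ageOf(P, recent) :- date(P, Y), Y >= 2005, Y <= 2010.
ageOf(P, old)    :- date(P, Y), Y < 2005.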

In Fig. 10 we see that, depending on the number of patterns and the penalty for covering the wrong class (\(\alpha \)), classifiers with different precision and recall trade-offs can be obtained.Footnote 6

Fig. 10 Discriminative learning in the purely relational setting; precision (left) and recall (right) for different \(\alpha \), i.e., for two weights of covering negative atoms

7.3 Runtime discussion

In this section we have presented a number of experiments that solve ReDF problems using generic solving technology, i.e., answer set programming. As we can see in Figs. 5a and 6, specialized algorithms are substantially faster than ASP. On datasets of moderate size, however, generic solvers achieve reasonable runtimes, as indicated by the results in Tables 2a, 3a, and 7b, and Figs. 6 and 7a. For the purely relational data factorization task from Sect. 5, a summary of the experiments is given in Table 4; in these experiments, computation times ranged from several seconds to a few minutes.

8 Related work

Our work is related to (1) previous work on generalizing problem definitions and solutions in factorization, (2) existing forms of relational decomposition, (3) approaches in inductive and abductive logic programming, and (4) the use of declarative languages and solvers for data mining.

8.1 General models for pattern mining

Our work can be related to a number of approaches that have generalized some of the tasks addressed in Sect. 5. Lu et al. (2008) used BMF as a basis for defining several data mining tasks and modeled them using integer linear programming. While Lu et al. (2008) also used a general-purpose solver, their approach is restricted to Boolean matrix products involving only two Boolean matrices. In a similar manner, Li (2005) defined a general model for clustering binary data, using matrix factorization to model several well-known clustering methods. That framework supports only a single factorization shape, relies on a lower-level modeling language, and requires complete partitions as well as specialized algorithms for different problems. In our approach, the shape of the factorization is separated from the constraints and the optimization criterion.

Biskup et al. (2004) and Fan et al. (2012) investigated inverse querying and the problem of exactly solving relational equations \(e_1(D) = e_2(D)\) under several assumptions, which could be used to compute exact solutions to a restricted form of ReDF. However, this approach does not seem to allow for approximations or the use of loss functions.

8.2 Decomposition of databases, tensors, and real-valued matrices

ReDF is related to several forms of relational decomposition, a term that has been heavily overloaded in the literature. Hence, we present an overview and contrast existing paradigms with our own work. Moreover, ReDF is also related to decomposition methods for real-valued matrices.

Relational decomposition in database theory Ever since the seminal paper by Codd (1970), the decomposition of relations has been an important theme in database research (Koehler 2007; Date 2006). Key properties of this form of relational decomposition are (Elmasri and Navathe 2010): (1) a relational schema together with its constraints, e.g., functional dependencies, is assumed to be given; (2) decomposition is never based on the data (extension), but only on the schema (intension); (3) decomposition is always lossless, i.e., factorization is exact for any possible extension and never an approximation. An interesting exception is Relational Decomposition via Partial Relations (Berzal et al. 2002), where one looks for partially satisfied dependencies in the data and then uses these partial dependencies to derive a normal form. It does take the data into account, but only to mine additional schema constraints for the decomposition.

Relational decomposition in tensor calculus Kim and Candan (2011) extend classical tensor factorization, CP decomposition, to deal with datasets composed of several relations, i.e., CP is generalized to multi-relational datasets. This requires adding relational algebra operations to CP. The key differences are: the data consists of several tables, with a schema to join them at the end; the shape is always the same, a tensor being decomposed into a sum of terms with identical structure; the optimization function is fixed; and no user constraints are supported.

Decomposition of real-valued matrices Let us start with SVD (Singular Value Decomposition) (Golub and Van Loan 1996), the best-known method in this area, which gives an optimal rank-k decomposition of a real-valued matrix A as a product of three matrices \(U \varSigma V^T\), where U and V are orthogonal real-valued matrices and \(\varSigma \) is a non-negative diagonal matrix containing the singular values of A. One of the key problems with SVD in the context of relational and Boolean factorization is that U and V may contain negative values, which makes interpretation in the relational setting problematic. To overcome this issue, NNMF (Non-Negative Matrix Factorization) was introduced (Paatero and Tapper 1994). Still, there are two key issues with the use of NNMF and SVD for relational and Boolean data.

First of all, for a Boolean matrix A there is no clear relation between its real rank, denoted \(\text {rank}_\mathcal {R}(A)\), and its Boolean rank, denoted \(\text {rank}_\mathcal {B}(A)\); we write \(\text {rank}_{\mathcal {R}_{\ge 0}}(A)\) for the non-negative rank. The following inequalities are known to hold (Miettinen 2009):

$$\begin{aligned}&\text {rank}_\mathcal {R}(A) \le \text {rank}_{\mathcal {R}_{\ge 0}}(A) \nonumber \\&\text {rank}_\mathcal {B}(A) \le \text {rank}_{\mathcal {R}_{\ge 0}}(A) \end{aligned}$$
(8)

Furthermore, there are examples where \(\text {rank}_\mathcal {R}(A) = n\) while \(\text {rank}_\mathcal {B}(A) = \log (n)\), with A an \(n \times n\) matrix (Miettinen 2009); the complement of the identity matrix is a well-known such case. This implies that Boolean factorization can be preferable to real-valued matrix factorization: in some cases we can obtain a smaller decomposition with discrete rather than real-valued methods. Also, for approximate ranks (where the approximation error is bounded by \(\epsilon \)), the inequalities of Eq. 8 no longer hold, i.e., there is no clear connection between the NNMF and Boolean ranks in the approximate case.

Secondly, existing real-valued matrix factorization methods support neither multiple relations in the decomposition shape nor extra constraints on the decomposition, both of which are at the core of the ReDF method. Furthermore, the constraints used in our method are hard constraints over discrete values. The problem of handling multiple relations has been addressed by Collective Matrix Factorization (Singh and Gordon 2008), which supports multiple relations and optimization criteria. However, at its core this method relies on stochastic optimization over the reals, which leads to the problems discussed at the beginning of this section.

Finally, as ReDF is defined over discrete values in the presence of hard constraints, all the problems described above (rank inequalities, optimization over the reals, uninterpretable values, etc.) also apply to the comparison of ReDF with real-valued matrix factorizations.

8.3 Relational learning

ReDF is also related to some well-known techniques in inductive logic programming and statistical relational learning, and even to abductive reasoning.

Several frameworks for abduction have been introduced over the years (Denecker and Kakas 2002; Flach and Kakas 2000). In abduction, the goal is to find a (minimal) hypothesis, in the form of a set of ground facts, that explains an observation. Abductive reasoning uses a rich background theory as well as integrity constraints; it also uses a set of clauses defining the predicate in the observation. The differences with ReDF are that ReDF uses a much simpler shape definition and no real background theory. On the other hand, abductive reasoning proceeds in a purely logical manner and typically neither takes into account multiple facts in the observation nor uses complex optimization functions. There are also similarities between ReDF and fuzzy abduction (Vojtás 1999; Miyata et al. 1995), but the core assumptions differ: in our setting all rules and constraints are deterministic, as is the evidence that needs to be derived. Moreover, ReDF has the shape constraint, which only allows explanations in the form of a factorization.

Meta-interpretive learning (Muggleton et al. 2015) uses templates together with a form of abductive reasoning to find a set of rules and facts in a typical inductive logic programming setting. While it can use much richer templates and background theories, it uses neither constraints nor optimization functions as ReDF does.

Kok and Domingos (2007) introduced a probabilistic framework based on Markov logic together with the EM principle to realize statistical predicate invention. This captures what the authors call multiple relational clustering and addresses essentially the same task as the infinite relational model of Kemp et al. (2006). Statistical predicate invention shares several ideas with our approach: it employs a kind of query or schema to denote the kind of factorization one wants and also imposes some hard constraints on the possible solutions. On the other hand, its optimization criterion is built-in and based on the maximum likelihood principle; the framework seems restricted to a kind of block modeling approach, essentially clustering the different rows and columns into blocks; and the approach is inherently probabilistic.

8.4 Declarative data mining

The idea of using generic solvers and languages for data mining is not new and has been investigated by, for instance, Guns et al. (2011, 2013a) and Métivier et al. (2012), who used various constraint programming languages for modeling and solving itemset mining problems. The use of ASP for frequent itemset and graph mining was investigated in Järvisalo (2011) and Paramonov et al. (2015). Furthermore, the use of integer linear programming is quite popular in data mining and machine learning; see, e.g., Chang et al. (2008). While the choice of a particular framework for modeling and solving may lead to both different models and different performance, it should be possible to use alternative frameworks, such as constraint programming or integer programming, for modeling and solving ReDF problems.

Aftrati et al. (2012) extended the typical structure of the mining problem using three-level graphs that represent a chain of relations in the multi-relational setting: authors writing papers, and papers being about certain topics. The goal is to find subgraphs that satisfy particular constraints and optimization criteria; e.g., an author is an authority if the number of topics he has written papers on is maximal. They formulate various interesting discovery tasks and solve them using integer programming.

9 Conclusions

The key contribution of this paper is the introduction of the framework of relational data factorization, which was shown to be relevant for modeling, prototyping, and experimentation purposes.

On the modeling side, we have formulated several well-known data mining tasks in terms of ReDF, which allowed us to identify commonalities and differences between these data mining tasks. One advantage of the framework is that small changes in the problem definition typically lead to small changes in the model. Furthermore, ReDF allowed us to model new types of relational data mining problems.

We have not only modeled problems, but also demonstrated that these models can easily be translated into concrete, executable ASP encodings. The experiments have shown the feasibility of the approach for prototyping, especially when combined with the sampling technique. The runtimes are typically not competitive with those of the highly optimized, much more specific implementations commonly used in data mining. Still, the models could be run on datasets of modest size (e.g., Mushroom and Chess contain approximately \(185\,000\) and \(115\,000\) non-empty elements, respectively).

Directions for future research include the use of alternative solvers (such as constraint or integer programming), the study of heuristics and local search, and the extension of the range of tasks to which ReDF can be applied. For example, a general ReDF framework is needed to factorize evidence for probabilistic lifted inference, where the shape of the factorization crucially affects the overall performance of the algorithm (Van den Broeck and Darwiche 2013).