Introduction

For decades, logic programming (LP) has been studied mainly in the form of symbolic logic [14], which is useful for declarative problem solving and symbolic reasoning. Logic programming has recently gained more attention as a basis for explainable learning models [8, 27], yet it still has limitations in terms of computation. In particular, symbolic computation is not efficient when it has to be combined with numerical learning models such as artificial neural networks (ANNs). Recently, several studies have embedded logic programs into numerical spaces so that powerful computing resources, from multi-threaded CPUs to GPUs, can be exploited. The linear algebraic approach is a robust way to manipulate logic programs in numerical spaces. Because linear algebra is at the heart of many applications of scientific computation, this approach is promising for developing scalable techniques to process huge relational knowledge bases (KBs) [20, 29]. In addition, it enables the use of efficient parallel algorithms of numerical linear algebra for computing LP.

In [7], Cohen described a probabilistic deductive database system in which reasoning is performed by a differentiable process, enabling novel gradient-based learning algorithms. In [23], Sato presented the use of first-order logic in vector spaces for Tarskian semantics and demonstrated how tensorization realizes efficient computation of Datalog. In [24], Sato proposed a linear algebraic approach to Datalog evaluation, in which the least Herbrand model of a DB is computed via adjacency matrices, and provided theoretical proofs for translating a program into a system of linear matrix equations. This approach achieves \(O(N^3)\) time complexity, where N is the number of variables in a clause. Continuing in this direction, Sato, Inoue, and Sakama developed linear algebraic abduction for abductive inference in Datalog [25]. They conducted empirical experiments on linear and recursive cases and showed that the approach can successfully abduce base relations.

In [13], Hitzler et al. proved that first-order normal logic programs can be approximated by feedforward connectionist networks, based on the well-known theorem of Funahashi [9] that every feedforward neural network with at least three layers can uniformly approximate any continuous function. Hitzler et al. used neural networks to compute the immediate consequence operator \(T_P\) and further extended the approach to first-order logic. However, the main open question is how to find an appropriate network structure (how many layers, how many neurons per layer) for a given logic program. In this regard, Serafini and Garcez showed how real logic can be implemented in deep ANNs [26] and proposed logic tensor networks (LTN). The framework is built upon a learning task in which both knowledge and data are mapped onto real-valued vectors, following an inference-as-learning approach.

Using a linear algebraic method, Sakama, Inoue, and Sato defined relations between LP and multi-dimensional arrays (tensors) and proposed algorithms for computing LP models [21, 22]. The representation is obtained by a series of conversions from logical rules to vectors, and the computation is then carried out by matrix multiplication. Later, elimination techniques were applied to reduce the matrix size [16] and achieved impressive performance. In [3], a similar idea using 3D tensors was employed to compute solutions of abductive Horn propositional tasks. In addition, Aspis built upon previous work on the matrix characterization of Horn propositional logic programs to explore how inference from logic programs can be performed by linear algebraic algorithms [2]. He also proposed a new algorithm for non-monotonic deduction based on linear algebraic reducts and differentiable deduction. These works show that linear algebraic methods are promising for logic inference at large scale. However, such methods have not yet been shown to be truly efficient, since, to the best of our knowledge, adequate experiments have not yet been conducted.

In this paper, we continue Sakama et al.’s idea of representing logic programs by tensors [16, 21, 22]. Although the method is well-defined, some problems that limit the performance of the approach remain unsolved. First, the matrix obtained after conversion is sparse, but its sparsity has never been analyzed. Second, the experiments were limited to small logic programs, which is not sufficient to demonstrate the robustness of the matrix representation. In this research, we further raise the bar of computing performance by using sparse representations of logic programs to reach the fixed point of the immediate consequence operator (\(T_P\)-operator). We are thus able to run experiments on large logic programs to demonstrate the performance of computing least models of definite programs. Note that computation of the fixed point of the \(T_P\)-operator appears frequently in logic programming, not only for obtaining the least model of a definite program but also in any model construction, e.g., computing the minimal models of the reduct of a normal or disjunctive logic program with negation. In this regard, we also conduct experiments on the computation of stable models of normal programs with a small number of negations.

Accordingly, the rest of this paper is organized as follows: Sect. 2 reviews and summarizes definitions and computation algorithms for definite and normal programs, Sect. 3 discusses the sparsity problem in tensorized logic programs and proposes a sparse representation method for LP, Sect. 4 investigates the space and time complexity of the methods, Sect. 5 presents experimental results with definite and normal programs, and Sect. 6 gives conclusions and future work.

Preliminaries

Definite Programs

We consider a language \({\mathscr {L}}\) that contains a finite set of propositional variables. A definite (logic) program is a finite set of rules of the form:

$$\begin{aligned} h\leftarrow \; b_1\wedge \cdots \wedge b_m\quad (m\ge 0), \end{aligned}$$
(1)

where h and \(b_i\) are propositional variables (atoms) in \({\mathscr {L}}\).

Given a logic program P, the set of all propositional variables appearing in P is called the Herbrand base of P (written \(B_P\)). For each rule r of the form (1), define \(head(r)=h\) and \(body(r)=\{b_1,\ldots , b_m\}\). A rule r is called a fact if \(body(r) = \emptyset\). A definite program P is called a singly-defined (SD) program if no two rules have the same head, that is, \(head(r_1)\ne head(r_2)\) for any two rules \(r_1\) and \(r_2\) (\(r_1\ne r_2\)) in P.

When a definite program P contains more than one rule (of the form (1)) having the same head:

$$\begin{aligned} \begin{aligned}&h \leftarrow {\mathscr {B}}_1 \\&\dots \\&h \leftarrow {\mathscr {B}}_n, \\ \end{aligned} \end{aligned}$$

where \({\mathscr {B}}_i\; (1 \le i \le n)\) is a (possibly empty) conjunction of atoms, we can replace them with a set of new rules:

$$\begin{aligned}&h\leftarrow \; b_1\vee \cdots \vee b_n\quad (n\ge 0), \end{aligned}$$
(2)
$$\begin{aligned}&b_i \leftarrow {\mathscr {B}}_i \; (i = 1, \ldots , n), \end{aligned}$$
(3)

where \(b_i \; (i = 1, \ldots , n)\) are newly introduced atoms (\(b_i \notin B_P\)) such that \(b_i \ne b_j\) whenever \(i \ne j\). Then the set of rules (3) is an SD program. Each rule of the form (2) is called an OR-rule. Every definite program P is transformed into a program \({P'} = Q \cup D\) such that Q is an SD program and D is a set of OR-rules. The resulting program \({P'}\) is called a standardized program. A definite program P coincides with its standardized form \({P'}\) iff P is an SD program. Since the OR-rule (2) is a shorthand for the n rules \(h \leftarrow b_1,\ldots , h \leftarrow b_n\) and introduces new atoms, the Herbrand base of \({P'}\) (written \(B_{{P'}}\)) is usually larger than \(B_P\). In this paper, a program means a standardized program unless stated otherwise.

A set \(I\subseteq B_P\) is an interpretation of P. An interpretation I is a model of a standardized program P if \(\{b_1,\ldots ,b_m\}\subseteq I\) implies \(h\in I\) for every rule (1) in P, and \(\{b_1,\ldots ,b_m\}\cap I\ne \emptyset\) implies \(h\in I\) for every rule (2) in P. A model I is the least model of P if \(I\subseteq J\) for any model J of P. A mapping \(T_P:\, 2^{B_P} \rightarrow 2^{B_P}\) (called a \(T_P\)-operator) is defined as: \(T_P(I) = \{\, h\,\mid \, h\leftarrow b_1\wedge \cdots \wedge b_m\in P\;\text{ and }\; \{ b_1,\ldots , b_m \}\subseteq I\,\}\; \cup \; \{\, h\,\mid \, h\leftarrow b_1\vee \cdots \vee b_n\in P\;\text{ and }\; \{ b_1,\ldots , b_n \}\cap I\ne \emptyset \,\}.\)

The powers of \(T_P\) are defined as: \(T_P^{k+1}(I)=T_P(T_P^k(I))\) \((k\ge 0)\) and \(T_P^0(I)=I\). Given \(I\subseteq B_P\), there is a fixed-point \(T_P^{n+1}(I)=T_P^n(I)\) \((n\ge 0)\). For a definite program P, the fixed-point \(T_P^n(\emptyset )\) coincides with the least model of P [28].
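For illustration, consider the definite program \(P = \{ p \leftarrow ,\; q \leftarrow p,\; r \leftarrow q \}\). Starting from \(I = \emptyset\), we obtain \(T_P^1(\emptyset ) = \{p\}\), \(T_P^2(\emptyset ) = \{p, q\}\), and \(T_P^3(\emptyset ) = T_P^4(\emptyset ) = \{p, q, r\}\), so the iteration reaches its fixed point after three steps and yields the least model \(\{p, q, r\}\).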

Definition 1

(Matrix representation of standardized programs [21])

Let P be a standardized program and \(B_P=\{ p_1\), \(\ldots\), \(p_n \}\). Then, P is represented by a matrix \(M_P\in \mathbb {R}^{n\times n}\) such that for each element \(a_{ij}\) \((1\le i,j\le n)\) in \(M_P\),

1. \(a_{ij_k}=\frac{1}{m}\;\; (1\le k\le m;\, 1\le i,j_k\le n)\) if \(p_i\leftarrow p_{j_1}\wedge \cdots \wedge p_{j_m}\) is in P;

2. \(a_{ij_k}=1\;\; (1\le k\le l;\, 1\le i,j_k\le n)\) if \(p_i\leftarrow p_{j_1}\vee \cdots \vee p_{j_l}\) is in P;

3. \(a_{ii}=1\) if \(p_i\leftarrow\) is in P;

4. \(a_{ij}=0\), otherwise.

\(M_P\) is called a program matrix. We write \(\mathsf{row}_i(M_P)=p_i\) and \(\mathsf{col}_j(M_P)=p_j\) \((1\le i, j\le n)\).

To better understand Definition 1, let us consider a concrete example.

Example 1

Consider the definite program \(P = \left\{ p \leftarrow q \wedge r,\; p \leftarrow s \wedge t,\; r \leftarrow s,\; q \leftarrow t,\; s \leftarrow ,\; t \leftarrow \right\}\).

P is not an SD program because the two rules \(p \leftarrow q \wedge r\) and \(p \leftarrow s \wedge t\) have the same head, so P is transformed into the standardized program \(P'\) by introducing new atoms u and v as follows: \(P' = \{ u \leftarrow q \wedge r,\; v \leftarrow s \wedge t,\; p \leftarrow u \vee v,\; r \leftarrow s,\; q \leftarrow t,\; s \leftarrow ,\; t \leftarrow \}\). Then, applying Definition 1, we obtain the program matrix \(M_{P'}\).
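Assuming the atom ordering \(B_{P'} = \{p, q, r, s, t, u, v\}\) for both rows and columns (the ordering itself is our choice), Definition 1 yields

$$M_{P'} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & \frac{1}{2} & \frac{1}{2} & 0 & 0 \end{pmatrix}$$

which agrees with the COO arrays listed later in Example 3.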

Sakama et al. further define the representation of interpretations using interpretation vectors (Definition 2). This vector stores the truth values of all propositions in P. The starting point of the iteration is the initial vector (Definition 3).

Definition 2

(Interpretation vector [21])

Let P be a program and \(B_P=\{ p_1,\ldots ,p_n \}\). Then an interpretation \(I\subseteq B_P\) is represented by a vector \(v=(a_1,\ldots ,a_n)^\mathsf{T}\), where each element \(a_i\) \((1\le i\le n)\) represents the truth value of the proposition \(p_i\) such that \(a_i=1\) if \(p_i\in I\); otherwise, \(a_i=0\). We write \(\mathsf{row}_i(v)=p_i\).

Definition 3

(Initial vector) Let P be a program and \(B_P=\{ p_1,\ldots ,p_n \}\). Then, the initial vector of P is an interpretation vector \(v_0=(a_1,\ldots ,a_n)^\mathsf{T}\) such that \(a_i=1\) \((1\le i\le n)\) if \(\mathsf{row}_i(v_0)=p_i\) and a fact \(p_i\leftarrow\) is in P; otherwise, \(a_i=0\).

To compute the least model in vector space, Sakama et al. proposed an algorithm whose result coincides with computing the least model by the \(T_P\)-operator. This algorithm is presented in Algorithm 1.

Definition 4

(\(\theta\)-thresholding) Given a value x, define \(\theta (x) = x'\), where \(x' = 1\) if \(x \ge 1\); otherwise, \(x' = 0\).

Similarly, the \(\theta\)-thresholding is extended in an element-wise way to vectors and matrices.
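For example, applied to a vector, \(\theta ((0.5,\, 1,\, 1.5)^\mathsf{T}) = (0,\, 1,\, 1)^\mathsf{T}\).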

[Algorithm 1: computing the least model of a standardized program in vector space]
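As a rough sketch of what Algorithm 1 computes, the following C++ fragment uses Eigen's sparse matrix type (the library used in our experiments) to iterate \(v_{k+1} = \theta (M_P v_k)\) until a fixed point is reached; the function name and structure are ours and this is illustrative only, not the exact experimental code.

```cpp
#include <Eigen/Dense>
#include <Eigen/Sparse>

// Sketch of Algorithm 1: starting from the initial vector v (Definition 3),
// repeatedly apply the program matrix M_P (Definition 1) followed by
// theta-thresholding until the interpretation vector no longer changes.
Eigen::VectorXd least_model(const Eigen::SparseMatrix<double>& M_P,
                            Eigen::VectorXd v) {
  while (true) {
    Eigen::VectorXd u = M_P * v;                 // one sparse matrix-vector product
    for (Eigen::Index i = 0; i < u.size(); ++i)  // theta-thresholding
      u(i) = (u(i) >= 1.0 - 1e-9) ? 1.0 : 0.0;   // small tolerance for rounding
    if ((u.array() == v.array()).all())          // fixed point reached
      return u;                                  // encodes the least model
    v = u;
  }
}
```

For a definite program the iteration is monotone and bounded, so the loop terminates after at most \(|B_{P'}|\) steps.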

Normal Programs

Normal programs can be transformed to definite programs as introduced in [1]. Therefore, we transform normal programs to definite programs before encoding them in matrices.

Definition 5

(Normal program) A normal program is a finite set of normal rules:

$$\begin{aligned} h \leftarrow b_1 \wedge b_2 \wedge \cdots \wedge b_l \wedge \lnot b_{l+1} \wedge \cdots \wedge \lnot b_m \ (m \ge l \ge 0), \end{aligned}$$
(4)

where h and \(b_i\ (1\le i\le m)\) are propositional variables (atoms) in \({\mathscr {L}}\).

A normal program P is transformed into a definite program by rewriting each rule of the above form as:

$$\begin{aligned} h \leftarrow b_1 \wedge b_2 \wedge \cdots \wedge b_l \wedge {\overline{b}}_{l+1} \wedge \cdots \wedge {\overline{b}}_m \ (m \ge l \ge 0), \end{aligned}$$
(5)

where \({\overline{b}}_i\) is a new proposition associated with \(b_i\).

In the following, let P be a normal program and \(I \subseteq B_P\) an interpretation. The positive form \(P^{+}\) of P is obtained by applying the above transformation. Since the definite program \(P^{+}\) can be transformed into its standardized program, Algorithm 1 can be applied to compute its least model. It is proved in [1] that, for a normal program P, I is a stable model of P iff \(I^{+}\) is the least model of \(P^{+} \cup {\bar{I}}\), where \({\bar{I}} = \{{\bar{p}}\ |\ p \in B_P {\setminus }I\}\) and \(I^+ = I \cup {\bar{I}}\). Note that \(I^+\) is an interpretation of the definite program \(P^+\), and it can be obtained by applying Algorithm 1 to the transformed program \(P^+\).

Definition 6

(Matrix representation of normal programs [16])

Let P be a normal program with \(B_P=\{ p_1, \ldots , p_n \}\) and its positive form \(P^{+}\) with \(B_{P^{+}}=\{ p_1,\ldots ,p_n, {\overline{q}}_{n + 1},\ldots , {\overline{q}}_m\}\).

Then, \(P^{+}\) is represented by a matrix \(M_P\in \mathbb {R}^{m\times m}\) such that for each element \(a_{ij}\) \((1\le i,j\le m)\):

1. \(a_{ii} = 1\) for \(n + 1 \le i \le m\);

2. \(a_{ij} = 0\) for \(n + 1 \le i \le m\) and \(1 \le j \le m\) such that \(i \ne j\);

3. otherwise, \(a_{ij}\) (\(1 \le i \le n\); \(1 \le j \le m\)) is encoded as in Definition 1.

\(M_P\) is called a program matrix. We write \(\mathsf{row}_i(M_P)=p_i\) and \(\mathsf{col}_j(M_P)=p_j\) \((1\le i, j\le n)\).

Example 2

Consider a program \(P = \{ p \leftarrow q \wedge s,\; q \leftarrow p \wedge t,\; s \leftarrow \lnot t,\; t \leftarrow ,\; u \leftarrow v \}\).

First, transform P to \(P^{+}\) such that \(P^{+} = \{ p \leftarrow q \wedge s,\; q \leftarrow p \wedge t,\; s \leftarrow {\overline{t}},\; t \leftarrow ,\; u \leftarrow v \}\). Then applying Definition 6, we obtain:
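Assuming the atom ordering \(B_{P^{+}} = \{p, q, s, t, u, v, \overline{t}\}\) for rows and columns (again our own choice of ordering), Definition 6 yields

$$M_{P^{+}} = \begin{pmatrix} 0 & \frac{1}{2} & \frac{1}{2} & 0 & 0 & 0 & 0 \\ \frac{1}{2} & 0 & 0 & \frac{1}{2} & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$

where the last row and column correspond to the new atom \(\overline{t}\).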

Instead of the initial vector used for definite programs, the notion of an initial matrix is introduced to encode multiple interpretations containing positive and negative facts in a program.

Definition 7

(Initial matrix [16])

Let P be a normal program and \(B_P=\{ p_1,\ldots ,p_n \}\) and its positive form \(P^{+}\) with \(B_{P^{+}}=\{ p_1,\ldots ,p_n,\) \({\overline{q}}_{n + 1},\ldots , {\overline{q}}_m\}\). The initial matrix \(M_0 \in {\mathbb {R}}^{m\times h} (1 \le h \le 2^{m -n})\) is defined as follows:

1. each row of \(M_0\) corresponds to an element of \(B_{P^{+}}\) in such a way that \(row_i(M_0) = p_i\) for \(1 \le i \le n\) and \(row_i(M_0) = {\overline{q}}_i\) for \(n + 1 \le i \le m\);

2. \(a_{ij} = 1\) (\(1 \le i \le n\), \(1 \le j \le h\)) iff a fact \(p_i \leftarrow\) is in P; otherwise \(a_{ij} = 0\);

3. \(a_{ij} = 0\) (\(n + 1 \le i \le m\), \(1 \le j \le h\)) iff a fact \(p_k \leftarrow\) (with \(1 \le k \le n\)) is in P and \({\overline{q}}_i = \overline{p_k}\); otherwise, \(a_{ij}\) takes the value 0 or 1 in such a way that every one of the \(2^{m - n}\) combinations (except for the deterministic entries \(a_{ij} = 0\)) is enumerated.

Each column of \(M_0\) is a potential stable model in the first stage. We update \(M_0\) by matrix multiplication with the program matrix obtained by Definition 6: \(M_{k + 1} = \theta (M_P M_k)\). The resulting matrices are called interpretation matrices, each of which encodes multiple interpretations of the corresponding program. The algorithm for computing stable models is presented in Algorithm 2.

[Algorithm 2: computing stable models of a normal program in vector space]
[Algorithm 3: finding stable models from the fixed-point interpretation matrix]

This method requires extra steps for transforming the program and for finding its stable models, which are presented in Algorithm 3. Algorithm 3 loops over each interpretation vector (column) of the fixed point of M, which is obtained by repeated matrix multiplication and thresholding. The main idea behind this algorithm is to verify the consistency of each interpretation \(I^+ (= I \cup {\bar{I}})\): it must not assign 1 to both the positive and the negative form of an atom. This is done by the condition in line 8 of Algorithm 3, which tests whether the sum of the values corresponding to the positive and negative forms of each atom is 1.
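A minimal sketch of this consistency test is given below. It assumes that a fixed-point column v has one entry per element of \(B_{P^{+}}\) and that a bookkeeping array neg_index maps each atom's row to the row of its negated copy; both the function name and neg_index are our own illustrative constructs, not part of Algorithm 3 as published.

```cpp
#include <Eigen/Dense>
#include <vector>

// Consistency test used on each column of the fixed-point matrix:
// for every atom that also occurs negated in the program, exactly one of
// the atom and its negated copy may be assigned 1 (their sum must be 1).
// neg_index[i] is the row of the negated copy of atom i, or -1 if the
// atom never occurs negated (hypothetical bookkeeping structure).
bool is_consistent(const Eigen::VectorXd& v, const std::vector<int>& neg_index) {
  for (int i = 0; i < static_cast<int>(neg_index.size()); ++i) {
    if (neg_index[i] < 0) continue;           // atom has no negated copy
    double s = v(i) + v(neg_index[i]);        // positive value + negative value
    if (s != 1.0) return false;               // both 1 or both 0: reject column
  }
  return true;                                // column is a candidate stable model
}
```

Since thresholding produces exact 0/1 entries, the equality test on the sum is safe here.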

In addition, the initial matrix size grows exponentially with the number of negations \(m - n\). Therefore, this representation requires a lot of memory, and the algorithm performs considerably slower than the method for definite programs if many negations appear in the program. Nevertheless, we will later show that this method is still advantageous when the number of negations is small.

Sparse Representation of Logic Programs

Representing logic programs in vector spaces minimizes symbolic computation and makes better use of modern computing resources. However, this method has to cope with the curse of dimensionality when the matrix representing a logic program becomes very large. Previous works on this topic only consider dense matrices in their implementations, whose performance is unimpressive even on small datasets [16]. To solve this problem, this paper analyzes the sparsity of logic programs in vector spaces and proposes an improvement using sparse representation for logic programs. Additionally, we analyze and compare different sparse representations to conclude which format is most efficient for logic programs in terms of memory cost.

Sparsity of Logic Programs in Vector Spaces

A sparse matrix is a matrix in which most of the elements are zero. The level of sparseness is measured by the sparsity, which equals the number of zero-valued elements divided by the total number of elements [6]. Because there are a large number of zero elements in sparse matrices, we can save computation by ignoring these zero values [12]. Based on the conversion method of the linear algebraic approach, we can calculate the sparsity of a program P. This is done by counting the number of non-zero elements contributed by each rule in P and subtracting from 1 the ratio of the number of non-zero elements to the matrix size.

By definition, the sparsity of a program P is computed by the following equation:

$$\begin{aligned} \mathrm{sparsity}(P) = 1 - \frac{\sum _{r \in P}{|body(r)|}}{n^2}, \end{aligned}$$
(6)

where n is the number of elements in \(B_P\) and |body(r)| is the length of body of rule r.
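For instance, for the standardized program \(P'\) of Example 1 we have \(n = 7\) and \(\sum _{r \in P'}{|body(r)|} = 2 + 2 + 2 + 1 + 1 + 0 + 0 = 8\), so \(\mathrm{sparsity}(P') = 1 - 8/49 \approx 0.84\); counting also the two diagonal entries contributed by the facts, the matrix of Example 1 has 10 non-zero entries out of 49, i.e., a sparsity of about 0.80.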

Accordingly, the representation matrix becomes highly sparse as the matrix size grows, while the rule body length remains insignificant. In fact, a rule r in a logic program rarely has a body length close to n; therefore, \(|body(r)| \ll n\). In short, the matrix representation of a logic program under the linear algebraic approach is sparse in most cases.

Converting Logic Programs to Sparse Matrices

Sparse matrix computation is very important due to the large number of zero elements in real-world matrix data; therefore, compaction techniques are used to reduce the amount of storage, memory accesses, and computation [6]. Among several sparse storage formats, we select three formats, coordinate (COO), compressed sparse row (CSR), and block compressed sparse row (BSR) [5], which are the most general, efficient, robust, and widely adopted by many programming libraries.

Because the matrix representation of a logic program P is sparse, applying Algorithm 1 and Algorithm 2 to sparse representations is remarkably faster than to dense matrices. Moreover, sparse representation also saves memory space, thereby enabling us to deal with large-scale KBs.

The Coordinate Format

The COO format is the simplest sparse matrix format: it represents each non-zero element by a tuple of a row index, a column index, and the value of the element. That is, the COO format uses two arrays of coordinates and one array of values, whose lengths equal the number of non-zero elements. The first array stores the row index of each value, the second array stores the column index of each value, and the third array stores the values of the original matrix. The ith non-zero element of a matrix is thus represented by the 3-tuple extracted from these three arrays at index i.
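As a sketch, the COO layout can be kept in three parallel arrays (Eigen's Triplet type serves the same purpose); the struct and method names below are illustrative, not taken from our implementation.

```cpp
#include <vector>

// Coordinate (COO) storage: one entry per non-zero element.
// The ith non-zero element is (row_index[i], col_index[i], value[i]).
struct CooMatrix {
  int rows = 0, cols = 0;
  std::vector<int> row_index;   // row of each non-zero element
  std::vector<int> col_index;   // column of each non-zero element
  std::vector<double> value;    // value of each non-zero element

  void add(int r, int c, double v) {
    row_index.push_back(r);
    col_index.push_back(c);
    value.push_back(v);
  }
};
```

For the program of Example 1, ten calls to add with the triples listed in Example 3 reproduce exactly those three arrays.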

Example 3 illustrates the sparse representation in the COO format for the program P in Example 1. We should note that in Example 3, zero-based indexing is used and we follow row-major order.

Example 3

The COO representation for P in Example 1 becomes:

Row index: 0 0 1 2 3 4 5 5 6 6
Col index: 5 6 4 3 3 4 1 2 3 4
Value: 1.0 1.0 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.5

This format is the simplest and most flexible for general-purpose usage. The storage requirement for this format is \(O(3 \times \eta _z)\), where \(\eta _z\) is the number of non-zero elements. Because of its generality, we use the COO format as the baseline for evaluating the other sparse representations.

The Compressed Sparse Row Format

The CSR format is an improvement of the COO format. Noticeably, in the row index array of the COO format, a value can be repeated consecutively because several non-zero elements may appear in the same row. We can reduce the size of the row index array with the CSR format. In this format, while the column index and value arrays remain the same, we compress the row index array by storing, for each row, only the position where its non-zero elements start. That means we do not need to store two consecutive 0s and two consecutive 5s as in Example 3. Instead, we store the starting index of each row and finally append one extra index pointing past the last element (which equals the number of non-zero elements). Concretely, the first element of the row index array is the starting index 0, and the last element is an extra entry marking the end of the array, equal to the number of non-zero elements. Two consecutive values of the row index array delimit the non-zero elements of a row: for the ith row, \(row\_start_i = row\_index[i]\) and \(row\_end_i = row\_index[i + 1]\).

Example 4

The CSR representation for P in Example 1 becomes:

Row index: 0 2 3 4 5 6 8 10
Col index: 5 6 4 3 3 4 1 2 3 4
Value: 1.0 1.0 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.5

Example 4 illustrates this method. For the first row (\(i = 0\)), we have \(row\_start_0 = 0\) and \(row\_end_0 = 2\), so we extract the two entries at indices 0 and 1 as the non-zero elements of the first row. These start and end positions are used to extract the column indices and values of the non-zero elements. Similarly, for the second row (\(i = 1\)), we have \(row\_start_1 = 2\) and \(row\_end_1 = 3\), so there is only one non-zero element, at index 2. Continuing this interpretation until we reach the final row (\(i = 6\)), we have \(row\_start_6 = 8\) and \(row\_end_6 = 10\), so we extract the last two non-zero elements at indices 8 and 9.
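As a sketch, a matrix-vector product over the CSR arrays touches only the non-zero elements, which is why the fixed-point iteration benefits from this format; the function below is illustrative and not the exact code of our implementation.

```cpp
#include <vector>

// y = M * x for an n x n matrix M stored in CSR form.
// row_index has n + 1 entries; col_index and value have one entry per
// non-zero element, ordered row by row (row-major order).
std::vector<double> csr_matvec(int n,
                               const std::vector<int>& row_index,
                               const std::vector<int>& col_index,
                               const std::vector<double>& value,
                               const std::vector<double>& x) {
  std::vector<double> y(n, 0.0);
  for (int i = 0; i < n; ++i) {
    // non-zero elements of row i occupy positions [row_index[i], row_index[i+1])
    for (int k = row_index[i]; k < row_index[i + 1]; ++k)
      y[i] += value[k] * x[col_index[k]];
  }
  return y;
}
```

The total work is proportional to the number of non-zero elements rather than to \(n^2\).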

For a sparse matrix of size \(m \times n\), the CSR format saves memory compared to the dense format only when \(\eta _z < (m (n - 1) - 1) / 2\), where \(\eta _z\) is the number of non-zero elements. Compared to the COO format, the CSR format uses fewer entries in the row index array only when \(m + 1 < \eta _z\), because the actual size of the row index array is \(m + 1\). Therefore, the space complexity of the CSR format is \(O(2 \times \eta _z + m + 1)\).

There is another format, compressed sparse column (CSC), which is similar to CSR. The only difference is that CSC enumerates the non-zero elements in column-major order and compresses the column index array. Hence, the space complexity of CSC is \(O(2 \times \eta _z + n + 1)\). In the case of logic programs the matrices are square, so these two formats behave identically.

The Block Compressed Sparse Row Format

Another sparse representation, BSR, stores two-dimensional square blocks of primitive data types instead of single values. If the dimension of the square block is \(d_b\), the matrix is divided into multiple blocks of size \(d_b \times d_b\). In case the dimension of the matrix is not a multiple of \(d_b\), we pad the matrix with zero columns or rows. For example, the program matrix in Example 1 has dimension \(7 \times 7\); with \(d_b = 2\), we pad it to dimension \(8 \times 8\) and divide the padded matrix into 16 blocks of size \(d_b \times d_b\). The BSR format stores only the non-zero blocks and indexes each block in the same way as CSR. Considering the BSR format for the logic program P in Example 1, we can identify 8 non-zero blocks in the matrix. The illustration of these steps and the BSR representation of P are presented in Example 5.

Example 5

The block decomposition and the BSR representation for P in Example 1 are as follows:

Row index: 0 2 3 6 8
Col index: 2 3 1 0 1 2 1 2
Block: \(B_{13}\) \(B_{14}\) \(B_{22}\) \(B_{31}\) \(B_{32}\) \(B_{33}\) \(B_{42}\) \(B_{43}\)
Block value: (0 1 1 0) (1 0 0 0) (0 1 0 1) (0 0 0 1/2) (0 0 1/2 0) (1 0 0 0) (0 1/2 0 0) (1/2 0 0 0)

Note that within each block we store all entries in a fixed order, row-major order in this example. If we follow column-major order instead, the block value vector may differ; for example, the block \(B_{22}\) in column-major order is 0 0 1 1.

Noticeably, this format is not efficient in this example because it stores many blocks with only 1 or 2 non-zero elements. In fact, this format only shows its advantages when the matrix is highly concentrated in a few blocks. In other words, if the matrix has \(\eta _z\) non-zero elements and \(\eta _b\) non-zero blocks of size \(d_b \times d_b\), then BSR performs best when \(\eta _z \approx \eta _b \times {d_b}^2\).

Assume we have a sparse matrix of size \(m \times n\) with \(\eta _z\) non-zero elements and \(\eta _b\) non-zero blocks of size \(d_b \times d_b\). Note that in the BSR format we only need to store the indices of the non-zero blocks and all values inside those blocks. We can therefore view it as a CSR matrix in which each non-zero block (in the BSR format) plays the role of a single non-zero element (in the CSR format) and the matrix size is \(\lceil m / d_b \rceil \times \lceil n / d_b \rceil\), where \(\lceil \cdot \rceil\) is the ceiling function. Accordingly, the space complexity of the BSR format is \(O(\eta _b \times d_b^2 + \eta _b + \lceil m / d_b \rceil + 1)\).
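As a rough count for the program of Example 1 (ignoring the different widths of index and value types), the COO arrays of Example 3 hold \(3 \times 10 = 30\) numbers, the CSR arrays of Example 4 hold \(10 + 10 + 8 = 28\) numbers, while the BSR representation of Example 5 holds \(8 \times 4 = 32\) block values plus 8 block column indices and 5 row index entries, i.e., 45 numbers in total, which illustrates why BSR is the wrong choice for this matrix.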

Which Format is the Best for Logic Programs?

As we can see in Example 4, the row index array now has only 8 entries rather than 10 in Example 3: we avoid storing repeated indices by storing only the positions where each row starts and ends. Accordingly, CSR can be considered more economical than COO, but it comes with the cost that the non-zero elements must follow row-major order, whereas no strict order is required by the COO format. Fortunately, in the case of linear algebraic methods for fixed-point computation, we do not need to update the program matrix frequently, so the CSR format is a better choice than the COO format. In fact, we can save up to \(25\%\) of the size of the row index array using the CSR format, as will be illustrated in the experiments. The BSR format has an advantage over the CSR format when the program matrix is concentrated in a few non-zero blocks; unfortunately, this is not very common for program matrices. The experiments section will reveal which kinds of logic programs benefit from this sparse format. Accordingly, we propose the CSR format as the ideal sparse representation for linear algebraic computation methods.

Complexity Analysis

In this section, we analyze the time and space complexity of the linear algebraic methods for computing fixed points as defined in Algorithm 1 and Algorithm 2.

Linear Algebraic Method for Definite Programs

Assume that a definite program P has a matrix representation \(M_P \in {\mathbb {R}}^{n \times n}\) and the matrix has \(\eta _z\) non-zero elements.

Proposition 1

The space complexity of the linear algebraic method for definite programs is

1. \(O(n^2 + n)\) for the dense format,

2. \(O(\eta _z + n)\) for the sparse format.

Proof

Obviously, we have to store the program matrix and the interpretation vector. As defined, the program matrix has size \(n \times n\) and the interpretation vector has size \(n \times 1\). Note that only the program matrix can be stored in sparse format, while the interpretation vector must be stored in dense format. \(\square\)

Proposition 2

The time complexity of the linear algebraic method for definite programs is

1. \(O(n^3)\) for the dense format,

2. \(O(\eta _z \times n)\) for the sparse format.

Proof

As with the \(T_P\)-operator, the main loop of Algorithm 1 repeats n times in the worst case. In addition, the cost of each iteration is dominated by the multiplication of an \(n \times n\) matrix with an \(n \times 1\) vector, which takes \(O(n^2)\) in the dense format and \(O(\eta _z)\) in the sparse format. \(\square\)

Theoretically, if the program matrix is sparse, methods using the sparse format outperform methods using the dense format in both time and space complexity.

Linear Algebraic Method for Normal Programs

Let us consider a normal program P with k negations. Assume that P has a matrix representation \(M_P \in {\mathbb {R}}^{n \times n}\) and that the matrix has \(\eta _z\) non-zero elements.

Proposition 3

The space complexity of the linear algebraic method for normal programs is

1. \(O(n^2 + n \times 2^k)\) for the dense format,

2. \(O(\eta _z + n \times 2^k)\) for the sparse format.

Proof

As in the method for definite programs, the size of the program matrix is the same. The cost of storing the interpretation matrix depends exponentially on the number of negations, because we have to consider all combinations according to Algorithm 2. Therefore, a limitation of the method is that it can only handle programs with a limited number of negations. \(\square\)

Proposition 4

The time complexity of the linear algebraic method for normal programs is

1. \(O(n^3 \times 2^k + n^2 \times (2^k - 1))\) for the dense format,

2. \(O(\eta _z \times n \times 2^k + n^2 \times (2^k - 1))\) for the sparse format.

Proof

Similar to the previous proof, the main loop of Algorithm 2 repeats n times in the worst case. Each iteration involves the multiplication of an \(n \times n\) matrix with an \(n \times 2^k\) matrix. Hence, the complexity of Algorithm 2 is \(O(n^3 \times 2^k)\) with the dense format and \(O(\eta _z \times n \times 2^k)\) with the sparse format. Then, we apply Algorithm 3 to find the stable models. This algorithm loops over all \(2^k\) combinations to verify the models when \(k > 0\); if \(k = 0\), the loop is not executed. Each verification takes two nested loops over n. Therefore, the complexity of this step is \(O(n^2 \times (2^k - 1))\). \(\square\)

Obviously, if k is small, we obtain the same complexity as the method for definite programs. If k is considerably large, both the space and the time complexity become infeasible, which is the limitation of the method. Although both formats are exponential in time and space complexity, the sparse representation still brings a large improvement in general cases.

Experimental Results

In this section, we report the results of two experiments: finding the least models of definite programs and computing stable models of normal programs. To evaluate the performance of the linear algebraic methods, we compared the implementations of Algorithm 1 and Algorithm 2 with (i) the \(T_P\)-operator and (ii) Clasp (Clingo v5.4.1 running with the flag --mode=clasp). Our implementations use (iii) dense matrices and (iv) sparse matrices. Except for Clasp, all implementations (i), (iii), and (iv) are written in C++ with x64 CPU as the target device. In (i), we implement the operator using a hashset instead of a list for better set operation performance; to avoid ambiguity with the original definition of the \(T_P\)-operator, we refer to (i) as the Hashset method in this section. (ii) is the solver of Clingo, a powerful Answer Set Programming (ASP) system developed at the University of Potsdam [10]. For the matrix representations and operators in (iii) and (iv), we use the Eigen 3 library [11] with the default backend. The computer running the experiments has the following configuration: CPU: Intel Core i7-4770 (4 cores, 8 threads) @3.4 GHz; RAM: 16 GB DDR3 @1333 MHz; GPU: NVIDIA GTX 1080; operating system: Ubuntu 18.04 LTS 64 bit.

Focusing on the performance of the sparse representation, we first evaluate our method on randomized logic programs. We use the same LP generation method as in [16]: the size of a logic program is defined by the size \(n = |B_P|\) of the Herbrand base \(B_P\) and the number of rules \(m = |P|\) in P. The number of facts (rules with body length 0) is limited to n/3. The other rules are generated uniformly based on the length of their bodies (maximum length 8) according to Table 1.

Table 1 Proportion of rules in P based on the number of propositional variables in their bodies

According to Algorithms 1 and 2, we have to transform logic programs into standardized programs before encoding them as matrices. Hence, in the experiments, we also track the size of the Herbrand base of the standardized program, which equals the actual (square) matrix size, and denote it by \(n'\).

We further generate denser matrices in order to analyze the efficacy of the sparse method. While keeping the same proportion of facts and of rules with body lengths 1 and 2, we generate the remaining \(70\sim 80\%\) of rules such that their body length is around \(5\%\) of the number of propositions. This leads to a lower sparsity level of the generated matrices, approximately 0.95.

Based on the same generation method as for definite programs, we generate normal programs by randomly turning some body literals into negative literals, limiting the number of negations, denoted by k, to \(4 \le k \le 8\). The important difference from [16] is that we experiment on much larger n and m, because our C++ implementation is dramatically more efficient than Nguyen et al.’s implementation in Maple. The largest logic programs in this experiment reach thousands of propositions and hundreds of thousands of rules. Further, we compare our method with one of the best ASP solvers, Clasp [10], running in the same environment. All methods are run 30 times on each LP to obtain mean execution times.

In addition, we conduct a further experiment on a non-random problem for definite programs, the transitive closure problem. The graphs we use are selected from the Koblenz network collection [15]. Each dataset contains binary tuples (edges), and we compute their transitive closure using the following two rules (a small grounding illustration follows the list):

  • \(path(X,Y) \leftarrow edge(X,Y)\)

  • \(path(X,Y) \leftarrow edge(X,Z) \wedge path(Z,Y)\)
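To make the grounding step concrete, consider a tiny graph with the facts \(edge(a,b) \leftarrow\) and \(edge(b,c) \leftarrow\) (our own illustration, not one of the benchmark datasets). Grounding the two rules above over the constants \(\{a, b, c\}\) produces propositional rules such as \(path(a,b) \leftarrow edge(a,b)\), \(path(b,c) \leftarrow edge(b,c)\), and \(path(a,c) \leftarrow edge(a,b) \wedge path(b,c)\), among other instances whose bodies are never satisfied. The resulting definite program is then standardized and encoded as a sparse program matrix exactly as described in Sect. 3.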

Definite Programs

The final results on definite programs are illustrated in Table 2 and Fig. 1.

Table 2 Details of experimental results on definite programs of Hashset method, Clasp and linear algebraic methods (with dense and sparse representation)
Fig. 1

Comparison of execution time between Hashset method, Clasp and linear algebraic methods (with dense and sparse representation) on definite programs

We can see from the results that the dense matrix method is the slowest and is unable to run on very large programs, which is why its data are not displayed when the number of rules is 120,000 or more. We should mention that the number of rules m is used as the horizontal axis in Fig. 1, as in the experiments in [16]. The values of n and m are chosen so that the actual matrix size \(n'\) increases linearly at two different scales: smaller (\(n < 10,000\)) and larger (\(n > 10,000\)). The same parameters are used for the other experiments on randomly generated programs. Overall, the sparse matrix method is very efficient: it is 10 to 15 times faster than Clasp.

The benchmark results on denser matrices are presented in Table 3 and Fig. 2. As can be seen, denser matrices require more computation for the sparse matrix method, while they do not affect the other competitors to the same degree. Despite this, the sparse matrix method still holds first place in this benchmark. In terms of the sparseness level of logic programs, we hardly find a program whose sparsity is less than 0.97. This observation strongly encourages the use of sparse representations for logic programs.

Table 3 Details of experimental results on definite programs (with lower sparsity level) of Hashset method, Clasp and linear algebraic methods (with dense and sparse representation)
Fig. 2

Comparison of execution time between Hashset method, Clasp and linear algebraic methods (with dense and sparse representation) on definite programs with lower sparsity level

In the next experiment, we show the comparison for computing the transitive closure. We assume that a dataset contains edges (tuples of nodes) and first ground the two rules defining path. The obtained results are shown in Table 4 and Fig. 3. In this non-randomized problem, the matrix representations are very sparse; therefore, there is no doubt that the sparse matrix method outperforms the dense matrix method. Accordingly, we only highlight the efficiency of the sparse representation and omit the dense matrix approach. Remarkably, the sparse matrix method surpasses Clasp once again in this experiment, by a large margin.

As can be seen from the results, the dense matrix method is the slowest, even slower than the Hashset method, in terms of computation time, because it wastes computation on a huge number of zero elements. This is explained by the high sparsity levels of logic programs reported in Tables 2, 3, and 4. Moreover, large dense matrices consume a huge amount of memory, so the method is unable to run at large matrix sizes. Overall, the sparse matrix method is effective for computing the fixed points of definite programs. The performance could be improved further by using GPU-accelerated code and exploiting parallel computing power. The results indicate that using sparse representations of logic programs opens the door to dealing with large-scale logic programs.

Table 4 Details of experimental results on the transitive closure problem of Hashset method, Clasp and sparse representation approach
Fig. 3

Comparison of execution time between Hashset method, Clasp and linear algebraic methods (with dense and sparse representation) on definite programs with Transitive closure problem using Koblenz network datasets

Normal Programs

The goal of this experiment is to highlight the improvement brought by the sparse representation when computing the stable models of normal logic programs. To generate normal programs for this benchmark, we use the same method as for definite programs, then randomly select some rules and turn some atoms in their bodies into negative literals. Since the number of columns in the initial matrix (Definition 7) grows exponentially with the number of negations, we limit the number of negations in this benchmark to 8, as specified in the experiment setup. The experimental results show that the sparse method can be applied to normal logic programs with a small number of negations. The performance gained from this improvement is promising for developing more efficient algorithms.

First, we perform benchmarks on normal programs with a sparsity level of 0.99. Table 5 and Fig. 4 show the execution times in detail. As can be seen, the sparse matrix method is still faster than Clasp, but by a smaller margin than for definite programs. It should be mentioned that the initial matrix size is remarkably larger due to the occurrence of negations: we have to initialize all possible combinations of the atoms that appear together with their negated form in the program. With a larger number of negations, the space complexity of the linear algebraic method is clearly exponential. Accordingly, the sparse matrix method performs better than Clasp when there is a small number of negations.

In the next experiment, we compare the different methods on denser matrices. Table 6 and Fig. 5 present the data for this benchmark. Once again, with a limited number of negations, the sparse matrix method holds the winning position.

Table 5 Details of experimental results on normal programs of Hashset method, Clasp and linear algebraic methods (with dense and sparse representation)
Fig. 4

Comparison of execution time between Hashset method, Clasp and linear algebraic methods (with dense and sparse representation) on normal programs

Table 6 Details of experimental results on normal programs (with lower sparsity level) of Hashset method, Clasp and linear algebraic methods (with dense and sparse representation)

Noticeably, the execution time on normal programs is generally greater than that on definite programs. This is expected because the initial matrices are larger and extra computation is needed for transforming the program and finding the least models, as described in Algorithm 2. The weakness of the linear algebraic method is that we have to deal with all combinations of truth assignments to compute the stable models; accordingly, the number of columns of the initial matrix increases exponentially with the number of negations. Thus, in the benchmarks on randomized programs, we limit the number of negations so that the matrix fits in memory. This limitation becomes more pronounced in real problems with many negations; it is a major problem that we are investigating in further research.

Fig. 5

Comparison of execution time between Hashset method, Clasp and linear algebraic methods (with dense and sparse representation) on normal programs with lower sparsity level

Sparse Representations Comparison

In this experiment, we focus on the space requirements of different sparse representations for logic programs. The benchmark is done on the same datasets as in the previous results. To highlight the efficiency of the sparse formats, we compare the memory space in bytes needed to store the program matrices using the three formats described in Sect. 3: COO, CSR, and BSR. BSR is analyzed with two different block sizes \(d_b\): \(2 \times 2\) and \(4 \times 4\). The figures for the COO format are used as the baseline for comparing the other sparse formats (Fig. 6).

Table 7 Comparison of different sparse representations in terms of the memory size on definite programs in Table 2
Fig. 6

Comparison of memory size between different sparse representations on definite programs

Table 8 Comparison of different sparse representations in terms of the memory size on definite programs for the transitive closure problem in Table 4
Fig. 7

Comparison of memory size between different sparse representations on definite programs for the transitive closure problem

The experimental results for definite programs, for definite programs on the transitive closure problem, and for normal programs are shown in Tables 7, 8, and 9, respectively. As can be seen from the data, the CSR format uses 20 to 30% less storage than the baseline COO format. This is a remarkable saving, obtained simply by storing fewer numbers in the row index array, as explained in Sect. 3. On the other hand, the BSR format increases memory usage by a large margin. This is because the program matrices are not concentrated, so many blocks containing zeros must be stored (Figs. 7, 8).

Table 9 Comparison of different sparse representations in terms of the memory size on normal programs in Table 5
Fig. 8

Comparison of memory size between different sparse representations on normal programs

Accordingly, in general cases, the CSR format is the best option in terms of space efficiency. The BSR format is efficient only when the matrix is highly concentrated, in the sense that the non-zero elements fall into as few blocks as possible. In this experiment, we also conduct a comparison on special logic programs. For example, consider a program P whose matrix representation contains the following rules:

[Program P and its matrix representation]

We can easily see that in this case the program matrix contains only \(2 \times 2\) blocks, which is ideal for the BSR \(2 \times 2\) format: the block value array contains no zero elements, and the index arrays for non-zero blocks are much smaller than the index arrays for non-zero elements. The data for this experiment are shown in Table 10. In this ideal case, the BSR format saves up to \(50\%\) compared to the baseline COO format and is much more efficient than the CSR format (Fig. 9).

Table 10 Comparison of different sparse representations in terms of the memory size on special programs as defined above
Fig. 9

Comparison of memory size between different sparse representations on special programs

Scalability of Sparse Matrix on GPU

In this experiment, we compare the execution time of the sparse matrix implementation on CPU and GPU using definite programs. We use the same method for generating definite programs as presented above, but additionally increase the body length of the generated rules to obtain large-scale programs. The GPU implementation uses cuSPARSE.

As we can see in Fig. 10, the GPU implementation is approximately 3 to 4 times faster than the CPU implementation. The speedup is limited because sparse matrix computation usually does not reach maximum throughput on GPU, and it is therefore less scalable than dense computation; nevertheless, it remains faster than its dense counterpart. We should note that we generate very large matrices which cannot fit in GPU memory if stored in dense format. Accordingly, although sparse matrix computation is more difficult to scale up, the sparse matrix is the ideal solution for large-scale logic programs in terms of both time and space (Tables 11, 12).

Fig. 10

Comparison of execution time between sparse matrix implementations on CPU and GPU

Table 11 Details of experimental results of sparse matrix implementations on CPU and GPU (higher sparsity level)
Table 12 Details of experimental results of sparse matrix implementations on CPU and GPU (lower sparsity level)

Conclusion

In this paper, we analyzed the sparsity of the matrix representation of LP and proposed an improved implementation of logic programming in vector spaces using sparse matrix representation. The experimental results on computing the least models of definite programs demonstrate a very significant enhancement in computation performance, even when compared with Clasp. This improvement remarkably reduces the computational burden of previous linear algebraic approaches to representing LP. The \(T_P\)-operator plays an important role in model construction for definite and normal logic programs; thus, improving the efficiency of fixed-point computation is the key to developing algorithms that deal with large-scale datasets. Although the current method requires a huge amount of memory to store all possible combinations of negated atoms, we observed considerable improvement when the number of negations is small. Moreover, matrix computation can be further accelerated on GPUs; we have tested our implementation in this way and obtained the expected results.

In addition to the improvement from sparse representation, we conducted experiments on different general-purpose sparse matrix formats and demonstrated the merits and demerits of each. Accordingly, we propose to use CSR in the linear algebraic methods for logic programs, for both efficiency and generality. If a flexible way to access and modify non-zero elements individually is needed, we recommend the COO format. On the other hand, for special types of logic programs such as those demonstrated in Sect. 5, the BSR format or other specialized methods can be considered.

Sato’s linear algebraic method is based on a different idea for representing logic programs, where each predicate is represented by one matrix and an approximation method is used to compute the extension of a target predicate of a recursive program [24]. We should note that this approximation method is limited to a matrix size of 10,000, while our exact method comfortably handles 320,000. Further comparison is a topic for future research, yet we expect that Sato’s method could also be enhanced by sparse representation.

The encouraging results open up room for improvement and optimization. Potential future work is to apply a sampling method to reduce the number of guesses in the initial matrix for normal programs: one algorithm would prepare an initial matrix of manageable size and, if all guesses fail, perform a local search, replacing column vectors with new assignments and repeating until a stable model is found. Using a gradient-based search algorithm in continuous vector spaces could be another promising approach [4], and it could also benefit from sparse representation. In addition, the sparse method can be combined with the partial evaluation introduced in [17]. Further research on implementing disjunctive LP and abductive LP should be considered to reveal the applicability of tensor-based approaches to LP. In our recent work, we have extended the use of the program matrix transpose to realize abduction in vector spaces [19]. Additionally, more complex types of programs should be considered for representation in vector spaces, for instance, 3-valued logic programs and answer set programs with aggregates and constraints.