1 Introduction

Boolean networks (BNs) are a simple yet effective model of gene regulatory networks where nodes are genes and their state transition is controlled by Boolean functions (Kauffman, 1969). They have been studied mathematically (Cheng and Qi, 2010; Kobayashi and Hiraishi, 2014), logically in AI (Inoue et al., 2014; Tourret et al., 2017; Chevalier et al., 2019; Gao et al., 2022) and from the viewpoint of deep learning (Zhang et al., 2019). Their learning is reduced to learning Boolean functions from a set of input–output pairs and can be carried out for example by the REVEAL algorithm (Liang et al., 1998) or by the BestFit extension algorithm (Lähdesmäki et al., 2003).

In this paper, we propose a new approach to learning Boolean functions. We introduce a simple ReLU neural network (NN) called Mat_DNF that learns Boolean functions and outputs Boolean formulas in disjunctive normal form (DNFs). We represent a DNF by a pair \(( {\textbf{C}} , {\textbf{D}} )\) of binary matrices, where \({\textbf{C}}\) represents conjunctions and \({\textbf{D}}\) a disjunction. Mat_DNF learns a matricized DNF \(( {\textbf{C}} , {\textbf{D}} )\) as network parameters from the learning data by minimizing a non-negative cost function \(\text {J}( {\textbf{C}} , {\textbf{D}} )\) to zero. As a result, every network parameter in Mat_DNF has a clear meaning, (potentially) denoting a literal or a conjunction (disjunct)Footnote 1 in the learned DNF.

Although there exist several ways to represent Boolean functions such as decision trees (Oliveira and Sangiovanni-Vincentelli, 1993), polynomial threshold functions (Hansen and Podolskii, 2015), Boolean circuits (Malach and Shalev-Shwartz, 2019) and support vector machines (Mixon and Peterson, 2015), we choose DNFs for two reasons: one is explainability and the other is to relate the learning process to logical inference. Explainability is guaranteed as our network parameters directly represent a matricized DNF. Moreover, since the learned output is a DNF, exploring the logical relationship between the learning data and the learned DNF becomes possible, and we find that the learned DNF is what is called an interpolant in logic (Craig, 1957), interpolating between the positive and negative input data. This uncovers a new connection between neural learning and symbolic inference.

Boolean function learning can be either discrete or continuous. One group of methods, such as SAT encoding with integer programming (Kamath et al., 1992) and stochastic local search (Ruckert and Kramer, 2003), works in discrete spaces. The other group uses NNs in continuous spaces, such as simulating Boolean circuits (Malach and Shalev-Shwartz, 2019), Neural Logic Networks (Payani and Fekri, 2019) and Net-DNF (Katzir et al., 2021). Our approach sits between the two. Unlike the former, Mat_DNF is differentiable.Footnote 2 Unlike the latter, it explicitly operates on matricized DNFs, i.e. discrete expressions, rather than embedding them implicitly in the neural network architecture.

In the context of BN learning, Mat_DNF offers a robust yet explainable end-to-end approach as an alternative to previous ones (Liang et al., 1998; Lähdesmäki et al., 2003; Inoue et al., 2014; Tourret et al., 2017; Gao et al., 2022). Compared to the REVEAL algorithm (Liang et al., 1998) and the BestFit extension algorithm (Lähdesmäki et al., 2003), Mat_DNF imposes no limit on the number of function variables. So if there are 18 genes (Irons, 2009), DNFs in 18 variables are considered. The LF1T algorithm (Inoue et al., 2014) symbolically learns a BN represented as a ground normal logic program from state transitions; generalization is done by resolution. The NN-LFIT algorithm (Tourret et al., 2017) adopts a two-stage approach that learns features by a feed-forward NN and extracts DNFs from the learned parameters. D-LFIT (Gao et al., 2022) takes a further elaborated approach of combining two neural networks to reduce the search space. By comparison, Mat_DNF is a much simpler single-layer NN whose learned parameters directly represent a DNF, so there is no need for post-processing.

To improve the accuracy of the DNF learned from insufficient data, we introduce two operations. The first one is “noise-expansion”. It appends a noise vector to the input learning vector.Footnote 3 The second one is “over-iteration” which keeps learning even after learning error becomes zero. Since adding a noise vector causes extra steps of parameter update while moving around local minima of the cost function \(\text {J}\), the net effect of the first one is attributable to the second one. The fact that these two operations can considerably improve accuracy means that the choice of a root of the cost function \(\text {J}( {\textbf{C}} , {\textbf{D}} ) = 0\), or more generally the choice of a local minimum significantly affects accuracy and generalizability.

Finally we confirm the effectiveness of our approach through three learning experiments with literature-curated BNs (Fauré et al., 2006; Irons, 2009; Krumsiek et al., 2011). We applied Mat_DNF to learning data generated from these BNs to see if Mat_DNF can recover the original DNFs in BNs. For the first two synchronous BNs (Fauré et al., 2006; Irons, 2009), the recovery rate is high. By detailed analysis of the learning results, it is suggested that this high recovery rate is due to the effect of over-iteration caused by implicit noise-expansion. However, the third asynchronous BN (Krumsiek et al., 2011; Ribeiro et al., 2021) presents a much more difficult case and only six DNFs are completely recovered out of 11 original DNFs, though this result is comparable to that of rfBFE (Gao et al., 2018), one of the state-of-the-art BN learning algorithms.

Thus our contributions are threefold: first, a proposal of a new approach to the end-to-end learning of Boolean functions by an explainable single-layer NN, Mat_DNF, together with its application to BN learning; second, the establishment of the equivalence between neural learning of DNFs by Mat_DNF and symbolic inference of DNFs as interpolants between the positive and negative data; and third, the introduction of two new operations, noise-expansion and over-iteration, that can improve accuracy by shifting the choice of a local minimum.

In what follows, after a preliminary section, we introduce Mat_DNF in Sect. 3. We then prove the relationship between the learning by Mat_DNF and the inference of interpolants in logic in Sect. 4. Section 5 examines the behavior of Mat_DNF w.r.t. insufficient learning data and introduces noise-expansion and over-iteration that improve accuracy. Section 6 reports three BN learning experiments and Sect. 7 discusses related work. Section 8 is the conclusion.

2 Preliminaries

Throughout this paper, bold italic capital letters such as \({\textbf{A}}\) stand for matrices and bold italic lower case letters such as \({\textbf{a}}\) for vectors. We equate a one-dimensional matrix with a vector. The i-th element of \({\textbf{a}}\) is designated by \({\textbf{a}} (i)\) and the ij-th element of \({\textbf{A}}\) by \({\textbf{A}} (i,j)\). Given two \(m \times n\) matrices \({\textbf{A}}\) and \({\textbf{B}}\), \([ {\textbf{A}} ; {\textbf{B}} ]\) represents the \(2m \times n\) matrix of \({\textbf{A}}\) stacked onto \({\textbf{B}}\). \(\Vert {\textbf{a}} \Vert _1 = \sum _i \mid {\textbf{a}} (i) \mid\) denotes the 1-norm of \({\textbf{a}}\) and \(\Vert {\textbf{A}} \Vert _F\) the Frobenius norm of \({\textbf{A}}\). Let \({\textbf{a}}\) and \({\textbf{b}}\) be n dimensional vectors. Then \(( {\textbf{a}} \bullet {\textbf{b}} )\) stands for their inner product (dot product) and \({\textbf{a}} \odot {\textbf{b}}\) for their Hadamard product, i.e., \(( {\textbf{a}} \odot {\textbf{b}} )(i) = {\textbf{a}} (i) {\textbf{b}} (i)\) for \(i (1\le i \le n)\). For a scalar \(\theta\), \(( {\textbf{a}} )_{\ge \theta }\) denotes a binary vector such that \(( {\textbf{a}} )_{\ge \theta }(i) = 1\) if \({\textbf{a}} (i) \ge \theta\) and \(( {\textbf{a}} )_{\ge \theta }(i) = 0\) otherwise for \(i (1\le i \le n)\). Similarly \(1 - {\textbf{a}}\) denotes the complement of \({\textbf{a}}\), i.e. \((1 - {\textbf{a}} )(i) = 1 - {\textbf{a}} (i)\) for \(i (1\le i \le n)\). These notations naturally extend to matrices like \(( {\textbf{A}} )_{\ge \theta }\) and \(1 - {\textbf{A}}\). \(\text{ min}_1(x) = \text{ min }(x,1)\) is a function returning the lesser of x and 1, and \(\text{ min}_1( {\textbf{A}} )\) is the component-wise application of \(\text{ min}_1(x)\) to \({\textbf{A}}\). We implicitly assume that all dimensions of vectors and matrices in various expressions are compatible. Let \(d_1 \vee \cdots \vee d_h\) be a DNF in n variables. If every disjunct \(d_i\) is a conjunction of n distinct literals, the DNF is said to be full. For a set S, \(\mid S \mid\) stands for the number of elements in S.

3 Learning DNFs in vector spaces

3.1 Evaluating matricized DNFs

Let \(\varphi = (x_1 \wedge x_2) \vee (x_1 \wedge \lnot x_3)\) be a DNF in three variables. \(\varphi\) has two disjuncts \((x_1 \wedge x_2)\) and \((x_1 \wedge \lnot x_3)\). We represent \(\varphi\) by a pair \(( {\textbf{C}} , {\textbf{D}} )\) of binary matrices:

$$\begin{aligned} {\textbf{C}} = \begin{array}{c} \begin{array}{cccccc} x_1 &{} x_2 &{} x_3 &{} \lnot x_1 &{} \lnot x_2 &{} \lnot x_3 \end{array} \\ \left[ \begin{array}{cccccc} 1 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 \\ 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 \end{array} \right] \end{array} \qquad {\textbf{D}} = \left[ 1 \;\; 1 \right] \end{aligned}$$

As can be seen, each row of \({\textbf{C}}\) represents a disjunct (conjunction of literals) of \(\varphi\). For example, the first row of \({\textbf{C}}\) represents the first disjunct \((x_1 \wedge x_2)\) by setting \({\textbf{C}} (1,1) = {\textbf{C}} (1,2) = 1\). \({\textbf{D}}\), on the other hand, represents the choice of conjunctions as disjuncts; in the current case, both conjunctions in \({\textbf{C}}\) are chosen as disjuncts of \(\varphi\), as designated by \({\textbf{D}} = [1 \; 1]\). If \({\textbf{D}} = [1 \; 0]\), \(\varphi\) would contain only the first disjunct \((x_1 \wedge x_2)\) in \({\textbf{C}}\). Generally, a DNF \(\varphi\) in n variables with at most h disjuncts is represented by an \(h \times 2n\) binary matrix \({\textbf{C}}\) and a \(1 \times h\) binary matrix \({\textbf{D}}\). By default, we consider a DNF \(\varphi\) and its matrix representation \(( {\textbf{C}} , {\textbf{D}} )\) interchangeable and call \(( {\textbf{C}} , {\textbf{D}} )\) the matricized DNF \(\varphi\).

Now we describe how \(\varphi\) is evaluated as a Boolean function \(\varphi ( {\textbf{x}} )\) over its domain \({\textbf{I}_0} = \{1,0\}^n\) of bit sequences. Each \({\textbf{x}} \in {\textbf{I}_0}\) is equated with a binary column vector called “interpretation vector” representing an interpretation (assignment) such that a variable \(x_j\) (\(1 \le j \le n\)) is mapped to \({\textbf{x}} (j) \in \{1,0\}\). Henceforth for convenience we treat \({\textbf{I}_0}\) as an \(n \times 2^n\) binary matrix packed with such \(2^n\) possible interpretation vectors and specifically call it the domain matrix for n variables.

Let \({\textbf{x}}\) be an interpretation vector in \({\textbf{I}_0}\). A matricized DNF \(\varphi = ( {\textbf{C}} (h \times 2n), {\textbf{D}} (1 \times h))\) is evaluated by \({\textbf{x}}\) as follows. First compute a column vector \({\textbf{N}} = {\textbf{C}} [(1- {\textbf{x}} ); {\textbf{x}} ]\). \({\textbf{N}} (j)\) (\(1 \le j \le h\)) denotes the number of literals contained in the j-th conjunction of \({\textbf{C}}\) and falsified by \({\textbf{x}}\), and hence \(\text{ min}_1( {\textbf{N}} )(j) = 0\) holds if-and-only-if the j-th conjunction is true in \({\textbf{x}}\). Next compute a column vector \({\textbf{M}} = 1 - \text{ min}_1( {\textbf{N}} )\), the bit inversion of \(\text{ min}_1( {\textbf{N}} )\); \({\textbf{M}} (j)\) gives the truth value \(\in \{0,1\}\) of the j-th conjunction in \({\textbf{C}}\). Finally compute a scalar \({\textbf{V}} = {\textbf{D}} {\textbf{M}}\). It denotes the number of disjuncts in \(\varphi\) satisfied by \({\textbf{x}}\). Hence \(( {\textbf{V}} )_{\ge 1} \in \{0,1\}\) gives the truth value of \(\varphi\) evaluated by \({\textbf{x}}\). Write \({\textbf{x}} \models \varphi\) when \(\varphi\) is true in \({\textbf{x}}\), i.e. \({\textbf{x}}\) satisfies \(\varphi\). In fact we have \({\textbf{x}} \models \varphi \;\text {if-and-only-if}\; ( {\textbf{V}} )_{\ge 1} = 1\).

Write \({\textbf{C}} = [ {\textbf{C}^P} \, {\textbf{C}} ^N]\) where \({\textbf{C}^P} (h \times n)\) (resp. \({\textbf{C}^N} (h \times n)\)) is a submatrix representing positive (resp. negative) occurrences of variables in \(\varphi\). Then the whole evaluation process is described by one line (1):

$$\begin{aligned} \varphi ( {\textbf{x}} )= & {} ( {\textbf{D}} ( {\textbf{1}_{h}} - \text{ min}_1( {\textbf{C}} [( {\textbf{1}_{n}} - {\textbf{x}} ); {\textbf{x}} ] )))_{\ge 1} \end{aligned}$$
(1)
$$\begin{aligned}= & {} ( {\textbf{D}} ( {\textbf{1}_{h}} - \text{ min}_1( ( {\textbf{C}^N} - {\textbf{C}^P} ) {\textbf{x}} + {\textbf{C}^P} {\textbf{1}_{n}} )))_{\ge 1} \nonumber \\= & {} ( {\textbf{D}} ( \textrm{ReLU}( ( {\textbf{C}^P} - {\textbf{C}^N} ) {\textbf{x}} + {\textbf{1}_h} - {\textbf{C}^P} {\textbf{1}_{n}} ) ))_{\ge 1} \nonumber \\ & \text { because ReLU } (x) = \textrm{max}(x,0) = 1- \text{ min}_1(1-x) \end{aligned}$$
(2)

where \(\varphi ( {\textbf{x}} )\) denotes the truth value \(\in \{0,1\}\) of \(\varphi\) as a Boolean function evaluated by \({\textbf{x}}\). This notation naturally extends to a set of interpretation vectors, as in \(\varphi ( {\textbf{I}_0} )\). \({\textbf{1}_{h}}\) and \({\textbf{1}_{n}}\) are all-one vectors of length h and n respectively. We rewrite (1) into (2). What the latter tells us is that our evaluation process is exactly a forward pass of a single-layer ReLU network consisting of a linear output layer and a hidden layer with weight matrix \({\textbf{C}} ^P- {\textbf{C}} ^N\) and bias vector \({\textbf{1}_h} - {\textbf{C}^P} {\textbf{1}_{n}}\). We name this ReLU network Mat_DNF. It is a simple NN specialized for DNFs, derived from the evaluation process of a DNF where the disjunction \(x \vee y\) is replaced by \(\text{ min}_1(x+y)\) as in Łukasiewicz’s many-valued logic.
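To make the evaluation concrete, here is a minimal NumPy sketch of (1); the function name eval_dnf and the worked example are ours, not part of the original formulation.

```python
import numpy as np

def eval_dnf(C, D, x):
    """Evaluate a matricized DNF (C, D) on a binary interpretation vector x.

    C is an (h x 2n) binary matrix whose first n columns hold positive
    literals and whose last n columns hold negative literals; D is a
    length-h binary vector; x is a length-n binary vector (equation (1)).
    """
    N = C @ np.concatenate([1 - x, x])   # falsified literals per conjunction
    M = 1 - np.minimum(N, 1)             # truth value of each conjunction
    V = D @ M                            # number of satisfied disjuncts
    return int(V >= 1)                   # truth value of the DNF

# phi = (x1 & x2) | (x1 & ~x3) from Sect. 3.1
C = np.array([[1, 1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1]])
D = np.array([1, 1])
print(eval_dnf(C, D, np.array([1, 1, 0])))   # 1: both disjuncts are satisfied
print(eval_dnf(C, D, np.array([0, 1, 1])))   # 0: x1 = 0 falsifies both disjuncts
```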

3.2 Learning DNFs by Mat_DNF

By adding a backward pass to the equation (1), Mat_DNF can learn Boolean functions. Here we describe how Mat_DNF learns them. Let f be a target Boolean function in n variables and \({\textbf{I}_0} = [ {\textbf{x}_1} \cdots {\textbf{x}_{2^n}} ]\) the domain matrix for n variables. In learning, we are given a submatrix \({\textbf{I}_1} (n \times l) = [ {\textbf{x}_{i_1}} \cdots {\textbf{x}_{i_l}} ]\) \((l \le 2^n)\) of \({\textbf{I}_0}\). \({\textbf{I}_1}\) is mapped by f to a \(1 \times l\) row vector \({\textbf{I}_2} = f( {\textbf{I}_1} ) = [f( {\textbf{x}_{i_1}} ) \cdots f( {\textbf{x}_{i_l}} )]\). \(( {\textbf{I}_1} , {\textbf{I}_2} )\) \(= ( {\textbf{I}_1} ,f( {\textbf{I}_1} ))\) is called an input–output pair for f and \({\textbf{I}_1}\) its input domain. Learning a DNF \(\varphi\) here thus means a learner receives an input–output pair \(( {\textbf{I}_1} , {\textbf{I}_2} ) = ( {\textbf{I}_1} ,f( {\textbf{I}_1} ))\) for a target Boolean function f and returns a DNF \(\varphi\) such that \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\). Mat_DNF receives \(( {\textbf{I}_1} , {\textbf{I}_2} )\) and returns a matricized DNF \(\varphi\) such that \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\) when it stops with learning error = 0.

Let \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) be real matrices. They are relaxed versions of \({\textbf{C}}\) and \({\textbf{D}}\). Introduce \(\text{ max}_0(x) = \textrm{max}(x,0)\) (ReLU), \(\widetilde{\textbf{N}} = \widetilde{\textbf{C}}[(1- {\textbf{I}_1} );\! {\textbf{I}_1} ]\), \(\widetilde{\textbf{M}} = 1-\text{ min}_1( \widetilde{\textbf{N}})\), \(\widetilde{\textbf{V}} = \widetilde{\textbf{D}} \widetilde{\textbf{M}}\), \(\textrm{Y} = \Vert \widetilde{\textbf{C}} \odot (1 - \widetilde{\textbf{C}}) \Vert _F^2\) and \(\textrm{Z} = \Vert \widetilde{\textbf{D}} \odot (1 - \widetilde{\textbf{D}}) \Vert _F^2\). Then define a non-negative cost function \(\text {J}( \widetilde{\textbf{C}}, \widetilde{\textbf{D}})\) by

$$\begin{aligned} \text {J}= & {} ( {\textbf{I}_2} \bullet (1 - \text{ min}_1( \widetilde{\textbf{V}}))) + ((1 - {\textbf{I}_2} ) \bullet \text{ max}_0( \widetilde{\textbf{V}})) +\, (1/2)\textrm{Y} + (1/2)\textrm{Z}. \end{aligned}$$
(3)

The first term \(( {\textbf{I}_2} \bullet (1 - \text{ min}_1( \widetilde{\textbf{V}})))\) is a non-negative scalar and deals with the case of \(f( {\textbf{x}_{i_j}} ) = {\textbf{I}_2} (i_j) = 1\) (\(1\le j \le l\)). Likewise the second term \(((1 - {\textbf{I}_2} ) \bullet \text{ max}_0( \widetilde{\textbf{V}}))\) is non-negative and takes care of the case of \(f( {\textbf{x}_{i_j}} ) = {\textbf{I}_2} (i_j) = 0\). Y and Z are penalty terms to make \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) binary respectively.
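A direct transcription of the cost (3) might look as follows; this is a sketch, with Ct and Dt standing for the relaxed matrices \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) (the function and argument names are ours).

```python
import numpy as np

def cost_J(Ct, Dt, I1, I2):
    """Cost function (3) for relaxed parameters Ct (h x 2n) and Dt (h,).

    I1 is an (n x l) binary matrix of interpretation vectors and I2 the
    corresponding (l,) binary output vector.
    """
    N = Ct @ np.vstack([1 - I1, I1])      # continuous falsified-literal counts
    M = 1 - np.minimum(N, 1)
    V = Dt @ M                            # continuous disjunction values
    pos = I2 @ (1 - np.minimum(V, 1))     # penalize outputs that should be 1
    neg = (1 - I2) @ np.maximum(V, 0)     # penalize outputs that should be 0
    Y = np.sum((Ct * (1 - Ct)) ** 2)      # push Ct toward a binary matrix
    Z = np.sum((Dt * (1 - Dt)) ** 2)      # push Dt toward a binary matrix
    return pos + neg + 0.5 * Y + 0.5 * Z
```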

Proposition 1

J(\(\widetilde{\textbf{C}}\), \(\widetilde{\textbf{D}}\)) = 0 \(\,\text {if-and-only-if}\,\) \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) are binary matrices representing a DNF \(\varphi\) such that \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\).

Proof

We prove the only-if part; the converse is obvious. Suppose J = J(\(\widetilde{\textbf{C}}\),\(\widetilde{\textbf{D}}\)) = 0. Then every term in (3) is zero. Y = Z = 0 immediately implies that \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) are binary. Let \(\varphi\) be the DNF represented by them. The first term deals with the case of \({\textbf{I}_2} (i_j) = f( {\textbf{x}_{i_j}} ) = 1\) \((1 \le j \le l)\). It is a sum of non-negative summands of the form \((1 - \text{ min}_1( \widetilde{\textbf{V}}(i_j)))\). Hence J = 0 implies \(\text{ min}_1( \widetilde{\textbf{V}}(i_j)) = 1\), i.e. \(\varphi\) is true in \({\textbf{x}_{i_j}} \in {\textbf{I}_1}\) when \({\textbf{I}_2} (i_j) = 1\). The second term is dual to the first, dealing with the case of \({\textbf{I}_2} (i_j) = 0\). Similarly to the first term, we can prove that \(\varphi\) is false in \({\textbf{x}_{i_j}} \in {\textbf{I}_1}\) when \({\textbf{I}_2} (i_j) = 0\). Combining the two, we conclude that \(\varphi\) gives \({\textbf{I}_2}\) when evaluated by \({\textbf{I}_1}\), i.e., \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\). \(\square\)

Learning by Mat_DNF is carried out based on Proposition 1 by minimizing \(\text {J}\) until \(\text {J} = 0\) using gradient descent. \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) are iteratively updated by their Jacobians, \({\textbf{J}} _a^{\tilde{C}}\) for \(\widetilde{\textbf{C}}\) and \({\textbf{J}} _a^{\tilde{D}}\) for \(\widetilde{\textbf{D}}\), for example like \(\widetilde{\textbf{C}} = \widetilde{\textbf{C}} - \alpha {\textbf{J}} _a^{\tilde{C}}\) where \(\alpha >0\) is a learning rate. To compute the Jacobians, we introduce \(\widetilde{\textbf{W}} = -( \widetilde{\textbf{V}})_{\le 1} \odot {\textbf{I}_2} + ( \widetilde{\textbf{V}})_{\ge 0} \odot (1 - {\textbf{I}_2} )\). Then \({\textbf{J}} _a^{\tilde{C}}\) and \({\textbf{J}} _a^{\tilde{D}}\) are computed by (4).

$$\begin{aligned} {\textbf{J}} _a^{\tilde{C}}= & {} (( -( \widetilde{\textbf{N}})_{\le 1}) \odot ( \widetilde{\textbf{D}}^T \widetilde{\textbf{W}}) )[(1- {\textbf{I}_1} );\! {\textbf{I}_1} ]^T +\; (1 - 2 \widetilde{\textbf{C}}) \odot {\textbf{Y}} \nonumber \\ {\textbf{J}} _a^{\tilde{D}}= & {} \widetilde{\textbf{W}} \widetilde{\textbf{M}}^T + (1 - 2 \widetilde{\textbf{D}}) \odot {\textbf{Z}} \end{aligned}$$
(4)

These Jacobians are derived as follows. We first derive \({\textbf{J}} _a^{\tilde{C}}\). Let \(\widetilde{\textbf{C}}_{pq} = \widetilde{\textbf{C}}(p,q)\) be an arbitrary element of \(\widetilde{\textbf{C}}\). Put \(\Delta _Y = (1 - 2 \widetilde{\textbf{C}}) \odot {\textbf{Y}}\). We have

$$\begin{aligned} \partial \widetilde{\textbf{M}}/\partial \widetilde{\textbf{C}}_{pq}= & {} - \partial \text{ min}_1( \widetilde{\textbf{N}})/\partial \widetilde{\textbf{C}}_{pq} \\= & {} - ( \widetilde{\textbf{N}})_{\le 1} \odot ( {\textbf{I}} _{pq}(1-[ {\textbf{I}} _1;\! (1- {\textbf{I}} _1)]) ) \end{aligned}$$

where \({\textbf{I}} _{pq}\) is a zero matrix except for the pq-th element which is 1. We use \(( {\textbf{A}} \bullet {\textbf{B}} ) = \sum _{i,j} {\textbf{A}} (i,j) {\textbf{B}} (i,j)\) to denote the dot product of \({\textbf{A}}\) and \({\textbf{B}}\). Note \(( {\textbf{A}} \bullet ( {\textbf{B}} \odot {\textbf{C}} )) = (( {\textbf{B}} \odot {\textbf{A}} ) \bullet {\textbf{C}} )\) and \(( {\textbf{A}} \bullet ( {\textbf{B}} {\textbf{C}} )) = (( {\textbf{B}} ^T {\textbf{A}} ) \bullet {\textbf{C}} ) = (( {\textbf{A}} {\textbf{C}} ^T) \bullet {\textbf{B}} )\) hold. Then put \(\delta _Y = (\Delta _Y \bullet {\textbf{I}} _{pq})\) and compute the partial derivative of J w.r.t. \(\widetilde{\textbf{C}}_{pq}\) as follows:

$$\begin{aligned}{} & {} \partial \text {J}/\partial \widetilde{\textbf{C}}_{pq} \\{} & {} \quad = ( {\textbf{I}} _2 \bullet (- ( \widetilde{\textbf{V}})_{\le 1}\odot (\partial \widetilde{\textbf{V}} / \partial \widetilde{\textbf{C}}_{pq})) ) + ( (1- {\textbf{I}} _2) \bullet ( ( \widetilde{\textbf{V}})_{\ge 0} \odot (\partial \widetilde{\textbf{V}} / \partial \widetilde{\textbf{C}}_{pq})) ) + \delta _Y \\{} & {} \quad = ( ( -( \widetilde{\textbf{V}})_{\le 1}\odot {\textbf{I}} _2) \bullet (\partial \widetilde{\textbf{V}} / \partial \widetilde{\textbf{C}}_{pq}) ) + ( (( \widetilde{\textbf{V}})_{\ge 0}\odot (1 - {\textbf{I}} _2)) \bullet (\partial \widetilde{\textbf{V}} / \partial \widetilde{\textbf{C}}_{pq}) ) + \delta _Y \\{} & {} \quad = ( (- ( \widetilde{\textbf{V}})_{\le 1} \odot {\textbf{I}} _2 + ( \widetilde{\textbf{V}})_{\ge 0} \odot (1- {\textbf{I}} _2)) \bullet ( \widetilde{\textbf{D}}( \partial \widetilde{\textbf{M}}/ \partial \widetilde{\textbf{C}}_{pq} )) ) + \delta _Y \\{} & {} \quad = ( (-( \widetilde{\textbf{N}})_{\le 1} \odot ( \widetilde{\textbf{D}}^T ( - ( \widetilde{\textbf{V}})_{\le 1} \odot {\textbf{I}} _2 + ( \widetilde{\textbf{V}})_{\ge 0} \odot (1- {\textbf{I}} _2) ) ) ) (1-[ {\textbf{I}} _1;\! (1- {\textbf{I}} _1)])^T \bullet {\textbf{I}} _{pq} ) \\{} & {} \quad + (\Delta _Y \bullet {\textbf{I}} _{pq} ) \\{} & {} \quad = ( ((-( \widetilde{\textbf{N}})_{\le 1} \odot ( \widetilde{\textbf{D}}^T \widetilde{\textbf{W}})) (1-[ {\textbf{I}} _1;\! (1- {\textbf{I}} _1)])^T + \Delta _Y) \bullet {\textbf{I}} _{pq} ) \end{aligned}$$

Since pq are arbitrary, we have

$$\begin{aligned} {\textbf{J}} _a^{\tilde{C}}= & {} \partial \text {J}/ \partial \widetilde{\textbf{C}} \\= & {} ( -( \widetilde{\textbf{N}})_{\le 1} \odot ( \widetilde{\textbf{D}}^T \widetilde{\textbf{W}})) (1-[ {\textbf{I}} _1;\! (1- {\textbf{I}} _1)])^T + \Delta _Y \\{} & {} \text {where} \;\; \widetilde{\textbf{W}} = - ( \widetilde{\textbf{V}})_{\le 1}\odot {\textbf{I}} _2 + ( \widetilde{\textbf{V}})_{\ge 0}\odot (1- {\textbf{I}} _2). \end{aligned}$$

Next we derive \({\textbf{J}} _a^{\tilde{D}} = \partial \text {J}/\partial \widetilde{\textbf{D}}\) similarly. Put \(\Delta _Z = (1 - 2 \widetilde{\textbf{D}}) \odot {\textbf{Z}}\) and \(\delta _Z = (\Delta _Z \bullet {\textbf{I}} _{pq} )\). Then for arbitrary p,q, we see

$$\begin{aligned} \partial \text {J}/\partial \widetilde{\textbf{D}}_{pq}= & {} ( {\textbf{I}} _2 \bullet -\partial \text{ min}_1( \widetilde{\textbf{V}}) /\partial \widetilde{\textbf{D}}_{pq} ) + ( 1- {\textbf{I}} _2 \bullet \partial \text{ max}_0( \widetilde{\textbf{V}}) /\partial \widetilde{\textbf{D}}_{pq} ) + \delta _Z \\= & {} ( (-( \widetilde{\textbf{V}})_{\le 1}\odot {\textbf{I}} _2) + ( \widetilde{\textbf{V}})_{\ge 0}\odot (1 - {\textbf{I}} _2) \bullet \partial \widetilde{\textbf{V}}/\partial \widetilde{\textbf{D}}_{pq} ) + \delta _Z \\= & {} ( ((-( \widetilde{\textbf{V}})_{\le 1}\odot {\textbf{I}} _2) + ( \widetilde{\textbf{V}})_{\ge 0}\odot (1 - {\textbf{I}} _2)) \widetilde{\textbf{M}}^T \bullet {\textbf{I}} _{pq} ) + \delta _Z \\= & {} ( \widetilde{\textbf{W}} \widetilde{\textbf{M}}^T \bullet {\textbf{I}} _{pq} ) + (\Delta _Z \bullet {\textbf{I}} _{pq} ) \\= & {} ( ( \widetilde{\textbf{W}} \widetilde{\textbf{M}}^T + \Delta _Z) \bullet {\textbf{I}} _{pq} ). \end{aligned}$$

So we reach \({\textbf{J}} _a^{\tilde{D}} = \partial \text {J}/\partial \widetilde{\textbf{D}} = \widetilde{\textbf{W}} \widetilde{\textbf{M}}^T + \Delta _Z\). In actual learning, we use an adaptive gradient method Adam (Kingma and Ba, 2015) instead of gradient descent with a constant learning rate.
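The Jacobians (4) translate almost directly into NumPy. The sketch below reads the bold Y and Z in (4) as the element-wise matrices \(\widetilde{\textbf{C}} \odot (1-\widetilde{\textbf{C}})\) and \(\widetilde{\textbf{D}} \odot (1-\widetilde{\textbf{D}})\) arising from the penalty terms; this reading, and the transcription itself, are ours.

```python
import numpy as np

def jacobians(Ct, Dt, I1, I2):
    """Jacobians (4) of the cost J w.r.t. the relaxed parameters Ct and Dt."""
    X = np.vstack([1 - I1, I1])                      # [(1 - I1); I1]
    N = Ct @ X
    M = 1 - np.minimum(N, 1)
    V = Dt @ M
    W = -(V <= 1).astype(float) * I2 + (V >= 0).astype(float) * (1 - I2)
    JC = (-(N <= 1).astype(float) * np.outer(Dt, W)) @ X.T \
         + (1 - 2 * Ct) * Ct * (1 - Ct)              # penalty-term contribution
    JD = W @ M.T + (1 - 2 * Dt) * Dt * (1 - Dt)
    return JC, JD
```

Plain gradient descent or Adam can then be applied to the returned matrices.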

3.3 Learning algorithm

Given an input–output pair \(( {\textbf{I}_1} , {\textbf{I}_2} )\) such that \(f( {\textbf{I}_1} ) = {\textbf{I}_2}\) for the target Boolean function f, Mat_DNF returns a matricized DNF \(\varphi = ( {\textbf{C}} , {\textbf{D}} )\) giving \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\), basically by running Algorithm 1 until \(\text {J} = 0\).

[Algorithm 1: the Mat_DNF learning procedure, shown as a figure in the original]

We however take a practical approach of thresholding \(( \widetilde{\textbf{C}}, \widetilde{\textbf{D}})\) to binary \(( {\textbf{C}}\), \({\textbf{D}} )\) even before \(\text {J} = 0\) is reached, assuming J is small and \(\widetilde{\textbf{C}}, \widetilde{\textbf{D}}\) are close to binary matrices. In more detail, the inner q-loop in Algorithm 1 iteratively updates \(( \widetilde{\textbf{C}}, \widetilde{\textbf{D}})\) at most \(max\_itr\) times while thresholding them optimally to binary \(( {\textbf{C}} , {\textbf{D}} )\) (lines 6-8)Footnote 4 and computing learning_error using them. If \(\varphi = ( {\textbf{C}} , {\textbf{D}} )\) achieves \(\text {learning\_error} = 0\), it exits from the q-loop and the p-loop and returns \(\varphi\). If \(\text {learning\_error} > 0\) still holds after \(max\_itr\) iterations, it restarts the next q-loop with \(( \widetilde{\textbf{C}}, \widetilde{\textbf{D}})\) perturbed by (5), where \(\Delta _a\) and \(\Delta _b\) are matrices of the same size as \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) respectively, whose elements are sampled from the standard normal distribution \(\mathcal{N}(0,1)\). The perturbed \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) are used as initial parameters in the next loop (line 16). This perturbation is intended to escape from a local minimum.

$$\begin{aligned} \begin{array}{ll} \widetilde{\textbf{C}_0 } = \sqrt{2/(h\cdot 2n)} \Delta _a + 0.5,\;\; \widetilde{\textbf{C}} = 0.5\cdot ( \widetilde{\textbf{C}} + \widetilde{\textbf{C}_0}) \\ \widetilde{\textbf{D}_0 } = \sqrt{2/h} \Delta _b + 0.5,\;\; \widetilde{\textbf{D}} = 0.5\cdot ( \widetilde{\textbf{D}} + \widetilde{\textbf{D}_0}) \end{array} \end{aligned}$$
(5)

Restart is allowed at most \(max\_try\) times. Note that Mat_DNF possibly fails to achieve \(\text {learning\_error} = 0\) within given h, \(max\_itr\) and \(max\_try\),Footnote 5 but when Mat_DNF returns a matricized DNF \(\varphi = ( {\textbf{C}} , {\textbf{D}} )\) with learning_error = 0, it is guaranteed that J(\({\textbf{C}}\),\({\textbf{D}}\)) = 0 and \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\) hold.
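For reference, the loop structure of Algorithm 1 (inner update loop, thresholding, restart with the perturbation (5)) can be sketched as follows. It reuses the jacobians sketch above, thresholds at a fixed 0.5 rather than "optimally", uses plain gradient descent instead of Adam, and the initialization mirroring (5) is our own assumption.

```python
import numpy as np

def mat_dnf(I1, I2, h, lr=0.1, max_itr=1000, max_try=10, seed=0):
    """A sketch of Algorithm 1: learn a matricized DNF (C, D) with learning_error = 0."""
    rng = np.random.default_rng(seed)
    n, _ = I1.shape
    Ct = np.sqrt(2 / (h * 2 * n)) * rng.standard_normal((h, 2 * n)) + 0.5
    Dt = np.sqrt(2 / h) * rng.standard_normal(h) + 0.5
    for _ in range(max_try):                       # p-loop: restarts
        for _ in range(max_itr):                   # q-loop: parameter updates
            JC, JD = jacobians(Ct, Dt, I1, I2)
            Ct, Dt = Ct - lr * JC, Dt - lr * JD
            C = (Ct >= 0.5).astype(int)            # threshold to binary matrices
            D = (Dt >= 0.5).astype(int)
            pred = (D @ (1 - np.minimum(C @ np.vstack([1 - I1, I1]), 1)) >= 1).astype(int)
            if np.sum(np.abs(pred - I2)) == 0:     # learning_error = 0
                return C, D
        # restart: perturb the relaxed parameters as in (5)
        Ct = 0.5 * (Ct + np.sqrt(2 / (h * 2 * n)) * rng.standard_normal((h, 2 * n)) + 0.5)
        Dt = 0.5 * (Dt + np.sqrt(2 / h) * rng.standard_normal(h) + 0.5)
    return None                                    # failed within max_itr and max_try
```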

4 Learning as logical interpolation: a logical perspective

Here we characterize the learning of a DNF \(\varphi\) by Mat_DNF from a logical perspective. Write \(\models \phi _1 \Rightarrow \phi _2\) if \(\phi _1 \Rightarrow \phi _2\) is a tautology. If we also have \(\models \phi _2 \Rightarrow \phi _3\), \(\phi _2\) is called an interpolant between \(\phi _1\) and \(\phi _3\). Roughly, Craig’s interpolation theorem (Craig, 1957) in first order logic states the existence of such an interpolant. We prove that our learning of \(\varphi\) from an input–output pair \(( {\textbf{I}_1} , {\textbf{I}_2} )\) such that \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\) can logically be viewed as the inference of an interpolant \(\varphi\).Footnote 6

Suppose \(( {\textbf{I}_1} , {\textbf{I}_2} )\) is an input–output pair for some n-variable Boolean function f and \(f( {\textbf{I}_1} ) = {\textbf{I}_2}\) holds. We divide the input binary matrix \({\textbf{I}_1} (n \times l)\) into two submatrices \({\textbf{I}_1^{P}} (n \times l_P)\) and \({\textbf{I}_1^N} (n \times l_N)\) where \(l_P+l_N = l\). \({\textbf{I}_1^{P}}\) (resp. \({\textbf{I}_1^N}\)) represents the positive (resp. negative) data: if \({\textbf{x}} \in {\textbf{I}_1^P}\) (resp. \({\textbf{x}} \in {\textbf{I}_1^N}\)), then \(f( {\textbf{x}} ) = 1\) (resp. \(f( {\textbf{x}} ) = 0\)) holds.

We consider \({\textbf{I}_1^{P}}\) as a full DNF, written \(\text {DNF}( {\textbf{I}_1^{P}} )\), in the following way. Let \({\textbf{x}}\) be an interpretation vector in \({\textbf{I}_1}\). Introduce \(\text {conj}( {\textbf{x}} )\) denoting the conjunction \(l_1 \wedge \cdots \wedge l_n\) of literals such that \(l_j = x_j\) if \({\textbf{x}} (j) = 1\), else \(l_j = \lnot x_j\) \((1 \le j \le n)\). For example, if \({\textbf{x}} = [1\; 0\; 1]^{T}\), then \(\text {conj}( {\textbf{x}} ) = x_1 \wedge \lnot x_2 \wedge x_3\). Put \(\text {DNF}( {\textbf{I}_1^{P}} ) = \bigvee _{ {\textbf{x}} \in {\textbf{I}_1^{P}} } \text {conj}( {\textbf{x}} )\) and call it the positive DNF for \(( {\textbf{I}_1} , {\textbf{I}_2} )\). Likewise we define \(\text {DNF}( {\textbf{I}_1^{N}} ) = \bigvee _{ {\textbf{x}} \in {\textbf{I}_1^{N}} } \text {conj}( {\textbf{x}} )\) and call it the negative DNF for \(( {\textbf{I}_1} , {\textbf{I}_2} )\). For simplicity, we equate \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\text {DNF}( {\textbf{I}_1^{N}} )\) respectively with the positive data \({\textbf{I}_1^{P}}\) and the negative data \({\textbf{I}_1^{N}}\).
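As a small illustration (the helper names conj and dnf_of are ours), the positive DNF can be built symbolically from the columns of \({\textbf{I}_1^{P}}\):

```python
import numpy as np

def conj(x):
    """conj(x): the conjunction of n distinct literals fixed by the binary vector x."""
    return " & ".join(f"x{j+1}" if b else f"~x{j+1}" for j, b in enumerate(x))

def dnf_of(I):
    """DNF(I): the full DNF whose disjuncts are conj(x) for each column x of I."""
    return " | ".join(f"({conj(I[:, k])})" for k in range(I.shape[1]))

# two positive interpretation vectors [1 0 1]^T and [0 1 1]^T as columns
I1P = np.array([[1, 0],
                [0, 1],
                [1, 1]])
print(dnf_of(I1P))   # (x1 & ~x2 & x3) | (~x1 & x2 & x3)
```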

Proposition 2

Let \(( {\textbf{I}_1} , {\textbf{I}_2} )\) be an input–output pair for a Boolean function f such that \(f( {\textbf{I}_1} ) = {\textbf{I}_2}\). Also let \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\text {DNF}( {\textbf{I}_1^{N}} )\) respectively be the positive and negative DNF for \(( {\textbf{I}_1} , {\textbf{I}_2} )\). For a DNF \(\varphi\), \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\) if-and-only-if \(\varphi\) is an interpolant between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\lnot \text{ DNF }( {\textbf{I}_1^{N}} )\).

Proof

We first prove the only-if part. Suppose \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\). Let \({\textbf{i}}\) be an interpretation vector over n variables satisfying \(\text {DNF}( {\textbf{I}_1^{P}} )\). It satisfies some disjunct conj(\({\textbf{x}}\)) in \(\text {DNF}( {\textbf{I}_1^P} )\). Since conj(\({\textbf{x}}\)) is a conjunction of n distinct literals, the fact that \({\textbf{i}}\) satisfies conj(\({\textbf{x}}\)) implies \({\textbf{i}} = {\textbf{x}}\) as vectors. On the other hand, we have \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2} = f( {\textbf{I}_1} )\) by assumption and hence \(\varphi ( {\textbf{x}} ) = f( {\textbf{x}} )\) as \({\textbf{x}} \in {\textbf{I}_1^P} \subseteq {\textbf{I}_1}\). We also have \(f( {\textbf{x}} ) = 1\) as \({\textbf{x}} \in {\textbf{I}_1^P}\). Putting the two together, we conclude \(\varphi ( {\textbf{i}} ) = \varphi ( {\textbf{x}} ) = f( {\textbf{x}} ) = 1\). Since \({\textbf{i}}\) is an arbitrary interpretation satisfying \(\text {DNF}( {\textbf{I}_1^{P}} )\) and it satisfies \(\varphi\), \(\models \text {DNF}( {\textbf{I}_1^{P}} ) \Rightarrow \varphi\) is proved. \(\models \varphi \Rightarrow \lnot \text {DNF}( {\textbf{I}_1^{N}} )\) is proved similarly by proving \(\models \text {DNF}( {\textbf{I}_1^{N}} ) \Rightarrow \lnot \varphi\).

To prove the if-part, recall that an interpolant \(\varphi\) satisfies \(\models \text {DNF}( {\textbf{I}_1^{P}} ) \Rightarrow \varphi\) and \(\models \text {DNF}( {\textbf{I}_1^{N}} ) \Rightarrow \lnot \varphi\). So if \({\textbf{x}} \in {\textbf{I}_1^P}\) (resp. \({\textbf{x}} \in {\textbf{I}_1^N}\)), then \(\text {DNF}( {\textbf{I}_1^{P}} )( {\textbf{x}} ) = 1\) and hence \(\varphi ( {\textbf{x}} )=1\) holds (resp. then \(\text {DNF}( {\textbf{I}_1^{N}} )( {\textbf{x}} ) = 1\) and hence \(\varphi ( {\textbf{x}} ) = 0\) holds). In other words, if \({\textbf{x}} \in {\textbf{I}_1^P}\), \(\varphi ( {\textbf{x}} ) = 1 = f( {\textbf{x}} )\) and if \({\textbf{x}} \in {\textbf{I}_1^N}\), \(\varphi ( {\textbf{x}} ) = 0 = f( {\textbf{x}} )\). So we reach \(\varphi ( {\textbf{I}_1} ) = f( {\textbf{I}_1} ) = {\textbf{I}_2}\). \(\square\)

By Proposition 2, we can say that the \(\varphi\) returned by Mat_DNF with learning_error = 0 is an interpolant between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\lnot \textrm{DNF}( {\textbf{I}_1^{N}} )\). We can also say, by combining Propositions 1 and 2, that finding a root of \(\text {J}( {\textbf{C}} , {\textbf{D}} ) = 0\) defined by (3), learning a DNF \(\varphi\) satisfying \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\), and inferring an interpolant \(\varphi\) between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\lnot \textrm{DNF}( {\textbf{I}_1^{N}} )\) are one and the same thing: they are all equivalent.

The recognition of this equivalence has some interesting consequences. The first is that, from the viewpoint of classification, learning by Mat_DNF consists of learning the feature space of conjunctions \(\widetilde{\textbf{C}}\) and its linear separation by a hyperplane specified by a continuous disjunction \(\widetilde{\textbf{D}}\), as shown in equation (2). Hence it seems possible to modify Mat_DNF so that it can search for a “max-margin interpolant” corresponding to the max-margin hyperplane, which is expected to generalize well. Sharma et al. (2012) already proposed using SVMs to infer interpolants, where the SVM is applied to a predefined feature space. In our “max-margin interpolant” inference, if realized, the feature space itself will be learned by Mat_DNF.

The second one is the possibility of a neural end-to-end refutation prover. Let S be a set of ground clauses. Also let \(S = S_1 \cup S_2\) be any split of S such that \(\text {atom}(S_1) \cap \text {atom}(S_2) \ne \emptyset\) where \(\text {atom}(S_i)\) denotes the set of atoms in \(S_i\) (\(i=1,2\)). It can be proved that S is unsatisfiable if-and-only-if there is an interpolant \(\varphi\) between \(S_1\) and \(\lnot S_2\) (proof omitted as it is out of the scope of this paper (Vizel et al., 2015; McMillan et al., 2018)). We can apply Mat_DNF to infer this \(\varphi\) assuming that \(S_1\) is positive data (\(\varphi\) is true over \(S_1\)) and \(S_2\) is negative data (\(\varphi\) is false over \(S_2\)) respectively.

The third one concerns the generalizability of the DNF \(\varphi\) learned by Mat_DNF. It is observed that \(\varphi\) tends to overgeneralize the positive data \({\textbf{I}_1^{P}}\) in the input data. That is, \(\models \text {DNF}( {\textbf{I}_1^{P}} ) \rightarrow \varphi\) holds, but sometimes the degree of generalization by logical implication, measured by the distance between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\varphi\), is too high, which adversely affects the accuracy of \(\varphi\). Later in Sect. 5.5, we propose a way of controlling the distance between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\varphi\) and show that the accuracy of \(\varphi\) is actually improved.

5 Learning random DNFs

5.1 Performance measures and generalization

First we define some performance measures concerning Mat_DNF to clarify the meaning of generalization. Let f be a target Boolean function in n variables, \({\textbf{I}_0}\) the domain matrix for n variables and \(( {\textbf{I}_1} ,f( {\textbf{I}_1} ))\) (\({\textbf{I}_1} \subseteq {\textbf{I}_0}\)) an input–output pair for f supplied as learning data for Mat_DNF. We introduce the “domain ratio” \(dr = \frac{\mid {\textbf{I}_1} \mid }{\mid {\textbf{I}_0} \mid }\) (\(0 \le dr \le 1\)) where \({\mid {\textbf{I}} \mid }\) denotes the number of interpretation vectors in \({\textbf{I}}\). The domain ratio dr is the relative size of the learning data to the whole domain data. In what follows, purely for convenience, we use dr even when \(dr\cdot {\mid {\textbf{I}_0} \mid }\) is not an integer; in that case, \({\textbf{I}_1}\) contains \(\lfloor {dr\cdot \mid {\textbf{I}_0} \mid } \rfloor\) interpretation vectors of \({\textbf{I}_0}\).

Suppose we have obtained a DNF \(\varphi = ( {\textbf{C}} , {\textbf{D}} )\) with \(\text {learning\_error} = 0\) by running Mat_DNF on \(( {\textbf{I}_1} ,f( {\textbf{I}_1} ))\). Compute \(\varphi ( {\textbf{I}_0} ) = ( {\textbf{D}} (1-\text{ min}_1( {\textbf{C}} [(1- {\textbf{I}_0} ); {\textbf{I}_0} ])))_{\ge 1}\) (see (1)) and \(\text {exact\_error} = \Vert f( {\textbf{I}_0} ) - \varphi ( {\textbf{I}_0} ) \Vert _1\), which is the number of differing bits between \(f( {\textbf{I}_0} )\) and \(\varphi ( {\textbf{I}_0} )\). Introduce acc_DNF, the “exact accuracy” of \(\varphi\), by defining \(\text {acc\_DNF} = \displaystyle {1 - \text {exact\_error}/{2^n} }\). Since learning_error is zero, \(\varphi\) perfectly reproduces \(f( {\textbf{I}_1} )\) and hence it follows that \(\text {acc\_DNF} = dr + (1-dr)\cdot \text {acc\_pred}\), where acc_pred is the prediction accuracy of \(\varphi\) over the unseen domain data \({\textbf{I}_0} {\setminus } {\textbf{I}_1}\) not used for learning. Consequently we have \(\text {acc\_pred} = \displaystyle {(\text {acc\_DNF} - dr)/(1-dr) }\). Thus prediction accuracy and exact accuracy are mutually convertible. Finally we define generalization. Introduce \(\text {acc\_dr} = dr + 0.5 \cdot (1-dr) = 0.5\cdot (1+dr)\), the expected accuracy of a baseline learner that, given learning data with domain ratio dr, completely memorizes the learning data (contributing dr) and makes a random guess on unseen data (contributing \(0.5 \cdot (1-dr)\)). We say generalization occurs when \(\text {acc\_DNF} > \text {acc\_dr} = 0.5\cdot (1+dr)\), or equivalently \(\text {acc\_pred} > 0.5\) holds (because \(\text {acc\_DNF} - \text {acc\_dr} = (1-dr)\cdot (\text {acc\_pred} - 0.5)\)).
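These measures are easy to compute in NumPy; the following sketch uses our own function and argument names.

```python
import numpy as np

def accuracies(C, D, I0, f_I0, l):
    """Performance measures of Sect. 5.1 (a sketch).

    C, D: learned matricized DNF; I0: (n x 2^n) domain matrix; f_I0: target
    outputs over I0; l: number of interpretation vectors used for learning.
    """
    phi_I0 = (D @ (1 - np.minimum(C @ np.vstack([1 - I0, I0]), 1)) >= 1).astype(int)
    exact_error = np.sum(np.abs(f_I0 - phi_I0))
    acc_dnf = 1 - exact_error / I0.shape[1]
    dr = l / I0.shape[1]                            # domain ratio
    acc_pred = (acc_dnf - dr) / (1 - dr) if dr < 1 else 1.0
    acc_dr = 0.5 * (1 + dr)                         # memorization + random-guess baseline
    return acc_dnf, acc_pred, acc_dr                # generalization: acc_pred > 0.5
```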

5.2 Measuring accuracy for random DNFs

We conduct a learning experiment with small random DNFs to examine the learning behavior of Mat_DNF w.r.t. data scarcity controlled by the domain ratio dr and to see how generalization occursFootnote 7.Footnote 8 We first randomly generate a DNF \(\varphi _0\) in \(n = 5\) variables that consists of three disjuncts, each containing at most 5 literals, half of which are negative on average. We also generate a domain matrix \({\textbf{I}_0} (n \times 2^n)\) for \(n = 5\) variables. Next suppose a domain ratio dr is given. For this dr, we generate a binary matrix \({\textbf{I}_1} (n \times l)\) consisting of \(l = 2^n\cdot dr\) interpretation vectors randomly sampled without replacement from \({\textbf{I}_0}\). Then we run Mat_DNF on the learning data \(( {\textbf{I}_1} ,\varphi _0( {\textbf{I}_1} ))\)Footnote 9 and obtain a DNF \(\varphi _1\) that perfectly classifies the learning data, i.e. \(\varphi _1( {\textbf{I}_1} ) = \varphi _0( {\textbf{I}_1} )\), and compute the exact accuracy acc_DNF of \(\varphi _1\). We repeat this process 100 times and obtain the average acc_DNF of \(\varphi _1\) against dr.

By varying \(dr \in \{0.1,\ldots ,1.0\}\), we obtain a curve of exact accuracy w.r.t. dr, denoted acc_DNF in Fig. 1. There acc_dr denotes the expected accuracy of the baseline learner performing only memorization and random guessing. The other two curves, acc_DNF_noise and acc_DNF_over, are explained next. We observe that acc_DNF is always (if only slightly) above acc_dr for all dr's. So this experiment confirms that generalization in our sense actually occurs: the learned DNF does more than pure memorization and random guessing by detecting some logical pattern.

Fig. 1 “Exact accuracy” of the DNF learned with noise-expansion and over-iteration

5.3 Noise-expansion and over-iteration

The acc_DNF_noise and acc_DNF_over curves in Fig. 1 demonstrate that generalization occurs to a greater degree than with acc_DNF, i.e. \(\text {acc\_DNF\_noise} \approx \text {acc\_DNF\_over} > \text {acc\_DNF}\) holds at most dr's. They are obtained by two different operations: \(\text {acc\_DNF\_noise}\) by “noise-expansion” and \(\text {acc\_DNF\_over}\) by “over-iteration”.

The first operation, noise-expansion, means the expansion of an input vector in the learning data \({\textbf{I}_1}\) by a random bit vector. For example, a 5 bit input vector \({\textbf{x}} = [0\,1\,0\,1\,0]^T\) in \({\textbf{I}_1}\) is expanded into a 10 dimensional vector \({\textbf{x}} _\textrm{noise} = [ {\textbf{x}} ;\!\! {\textbf{n}} ] = [0\,1\,0\,1\,0\,1\,0\,0\,1\,1]^T\) by appending a random bit vector \({\textbf{n}} = [1\,0\,0\,1\,1]^T\) to \({\textbf{x}}\). In learning, each \({\textbf{x}}\) in \({\textbf{I}_1}\) is expanded into \({\textbf{x}} _{\textrm{noise}}\) and then used for learning. Although each input vector in \({\textbf{I}_1}\) gets longer (its length doubles) under noise-expansion, the number of input vectors remains the same. Noise-expansion simply gives Mat_DNF the additional task of identifying those variables in an input vector \({\textbf{x}} _{\textrm{noise}}\) that are relevant to the output, thereby causing additional update steps in Algorithm 1. So from the viewpoint of minimizing \(\text {J}\) to zero, the net effect of noise-expansion is to force Mat_DNF to keep searching for another root of \(\text {J}\) even when \(\text {J} = 0\) would already be reached in the original learning task. This point is made clear by comparison with “over-iteration”, explained below.
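Noise-expansion itself is a one-line preprocessing step; a sketch (the helper name noise_expand is ours):

```python
import numpy as np

def noise_expand(I1, seed=0):
    """Append a random bit vector to every input interpretation vector.

    I1 is an (n x l) matrix; the result is (2n x l): the number of training
    examples is unchanged while each input doubles in length.
    """
    rng = np.random.default_rng(seed)
    noise = rng.integers(0, 2, size=I1.shape)
    return np.vstack([I1, noise])

# Mat_DNF is then run on (noise_expand(I1), I2) instead of (I1, I2),
# with conjunctions over 2n variables instead of n.
```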

The second operation, over-iteration, forces Mat_DNF to skip the first root of \(\text {J} = 0\) it finds and to keep learning. Only after some prespecified number of extra steps (for example extra_update = 20 in the case of acc_DNF_over in Fig. 1) have been made is Mat_DNF allowed to return when a root of \(\text {J}\) is found again. Intuitively, this operation has the effect of avoiding a root near the initialization point, which often overfits the learning data, and of exploring a root in the relatively flat landscape of \(\text {J}\). In other words, over-iteration searches for a root of \(\text {J}\) closer to a global minimum such as the target DNF.
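Over-iteration only changes the stopping rule of the learning loop. A sketch of the modification, building on the jacobians sketch above (the counter logic and names are ours):

```python
import numpy as np

def mat_dnf_over(I1, I2, h, extra_update=20, lr=0.1, max_itr=2000, seed=0):
    """Skip the first zero-error DNF and return a later one (over-iteration)."""
    rng = np.random.default_rng(seed)
    n, _ = I1.shape
    Ct = np.sqrt(2 / (h * 2 * n)) * rng.standard_normal((h, 2 * n)) + 0.5
    Dt = np.sqrt(2 / h) * rng.standard_normal(h) + 0.5
    remaining = None                      # extra updates still to perform
    for _ in range(max_itr):
        JC, JD = jacobians(Ct, Dt, I1, I2)
        Ct, Dt = Ct - lr * JC, Dt - lr * JD
        C, D = (Ct >= 0.5).astype(int), (Dt >= 0.5).astype(int)
        pred = (D @ (1 - np.minimum(C @ np.vstack([1 - I1, I1]), 1)) >= 1).astype(int)
        if np.sum(np.abs(pred - I2)) == 0:
            if remaining is None:
                remaining = extra_update  # first root of J found: keep learning
            elif remaining == 0:
                return C, D               # a root found again after the extra updates
        if remaining is not None and remaining > 0:
            remaining -= 1
    return None
```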

Observe that, as the acc_DNF_noise and acc_DNF_over curves in Fig. 1 show, not only do both noise-expansion and over-iteration improve exact accuracy (or equivalently prediction accuracy), but they do so to a similar degree. Hence it seems reasonable to hypothesize that noise-expansion causes over-iteration and over-iteration causes the improvement of exact accuracy.

The result of this experiment also indicates the importance of an intentional choice of a local minimum (choosing a root in our case) which is independently suggested by “flooding” (Ishida et al., 2020) and “grokking” (Power et al., 2021). In flooding, learning is controlled by gradient descent and ascent to keep training error small but non-zero. In grokking, learning is continued even after learning accuracy is saturated, and then test accuracy suddenly rises to a high level. Our over-iteration has a similar effect of moving around local minima in a flat loss landscape, leading to better generalization.

5.4 The logical relations and over-iteration

When the learning target is a DNF \(\varphi _0\), we naturally ask the logical question of whether the consequence relation and the equivalence relation between \(\varphi _0\) and a learned DNF \(\varphi\) hold or not. We are also interested in the distance between themFootnote 10 because we expect \(\varphi\) to be logically related to \(\varphi _0\) when \(\varphi\) is close to \(\varphi _0\). So, for a 5-variable DNF \(\varphi _0\) generated as in the previous section, we estimate the probability p_conseq of \(\varphi\) being a logical consequence of \(\varphi _0\), i.e. \(\models \varphi _0 \Rightarrow \varphi\) in notation, and the probability p_equiv of \(\varphi\) being logically equivalent to \(\varphi _0\), i.e. \(\models \varphi _0 \Leftrightarrow \varphi\), together with the average distance between \(\varphi _0\) and \(\varphi\), by running Mat_DNF 100 timesFootnote 11, counting the number of runs in which these logical relations hold and computing the average distance. We obtain Table 1.

Table 1 Domain ratio dr, distance and the probability of logical consequence and equivalence

In Table 1, distance_itr is the same measure as distance, i.e. the distance between the target DNF \(\varphi _0\) and a learned DNF \(\varphi\), but obtained with over-iteration (extra_update = 60). The same applies to p_equiv_itr and p_equiv.

First, we can see in the table that more data gives a more exact solution. That is, the distance between the target DNF \(\varphi _0\) and a learned DNF \(\varphi\) monotonically decreases as dr gets closer to 1. Furthermore, the effect of over-iteration is clearly visible: it brings the learned DNF much closer to the target DNF, from 7.5 to 4.2 at \(dr = 0.5\) for example. In other words, over-iteration chooses a root of the cost function \(\text {J}\) near the target \(\varphi _0\).

Concerning the logical relations, observe that p_conseq and p_equiv in Table 1 more or less monotonically increase as dr increases. So again, more data gives a bigger chance of the logical relationship holding. Second, observe that p_conseq, the probability of \(\models \varphi _0 \Rightarrow \varphi\), is rather high through all dr's but is lowered considerably by over-iteration. Third, over-iteration has the opposite effect on p_equiv, the probability of \(\models \varphi _0 \Leftrightarrow \varphi\): it greatly improves the chance of \(\models \varphi _0 \Leftrightarrow \varphi\) once dr exceeds 0.5. For example, p_equiv jumps from 0.02 to 0.19 at \(dr = 0.7\) and from 0.20 to 0.55 at \(dr = 0.9\) (see the bold figures in Table 1). This positive effect of over-iteration on p_equiv becomes critical when applying Mat_DNF to Boolean network learning, because the primary purpose of our Boolean network learning is to recover the original DNFs in the target Boolean network, and over-iteration enhances the chance of discovering such DNFs.

5.5 Controlling logical generalization

Over-iteration wanders through the search space looking for a better local minimum. Here we introduce another, more proactive approach for the same purpose, based on Proposition 2 in Sect. 4. This approach has a sense of search direction, away from the negative data and toward the positive data, thus making it possible to control the degree of generalization of the learned DNF.

Let \(\varphi _0\) be a target DNF, \({\textbf{I}_0}\) the domain of \(\varphi _0\), and \(( {\textbf{I}_1} , {\textbf{I}_2} )\) an input–output pair for learning where \({\textbf{I}_1} \subseteq {\textbf{I}_0}\) and \({\textbf{I}_2} = \varphi _0( {\textbf{I}_1} )\). Also let \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\text {DNF}( {\textbf{I}_1^{N}} )\) respectively be the positive and negative DNF for \(( {\textbf{I}_1} , {\textbf{I}_2} )\) introduced in Sect. 4, associated with the positive data \({\textbf{I}_1^{P}}\) and the negative data \({\textbf{I}_1^{N}}\) in \({\textbf{I}_1}\).

Our idea is based on the empirical observation that, when learning random DNFs from insufficient data by Mat_DNF, although the target DNF \(\varphi _0\) and the learned DNF \(\varphi\) are both interpolants between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\lnot \text {DNF}( {\textbf{I}_1^{N}} )\) according to Proposition 2, their distances to \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\text {DNF}( {\textbf{I}_1^{N}} )\) often differ greatly. Since the learning data is randomly generated using the target DNF \(\varphi _0\), \(\varphi _0\) is usually located (almost) in the middle between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\text {DNF}( {\textbf{I}_1^{N}} )\) distance-wise. However, the learned \(\varphi\) is observed to be very close to the negative data \(\text {DNF}( {\textbf{I}_1^{N}} )\). In other words, due to the learning bias of Mat_DNF, \(\varphi\) tends to overgeneralize the positive data by yielding disjuncts outside the original positive data \(\text {DNF}( {\textbf{I}_1^{P}} )\).

To combat this overgeneralization of positive data by Mat_DNF, we add a special term \(\text {J}_{int}\) to the cost function \(\text {J}\) to suppress the generation of disjuncts in \(\varphi\). Concretely \(\text {J}_{int}\) is computed as follows.

$$\begin{aligned} {\textbf{I}_0^{P}}= & {} {\textbf{I}_{0}} \setminus {\textbf{I}_1^{P}} \\ \widetilde{\textbf{N}}^{P}= & {} \widetilde{\textbf{C}}[(1- {\textbf{I}_0^{P}} ); {\textbf{I}_0^{P}} ] \\ \widetilde{\textbf{M}}^{P}= & {} 1 - \text{ min}_1( \widetilde{\textbf{N}}^{P}) \\ \text {J}_{int}= & {} \sum \text{ max}_0( \widetilde{\textbf{D}} \widetilde{\textbf{M}}^{P}) \end{aligned}$$

Here \({\textbf{I}_0^{P}}\) is the set of interpretation vectors which, when considered as conjunctions, could be added to \(\text {DNF}( {\textbf{I}_1^{P}} )\) as disjuncts of the learned \(\varphi\). \(\widetilde{\textbf{M}}^{P}\) holds the truth values of the continuous conjunctions represented by \(\widetilde{\textbf{C}}\), and \(\widetilde{\textbf{D}} \widetilde{\textbf{M}}^{P}\) the truth values of the continuous DNF \(( \widetilde{\textbf{C}}, \widetilde{\textbf{D}})\), evaluated by the interpretation vectors in \({\textbf{I}_0^{P}}\). Minimizing \(\text {J}_{int}\) drives the positive elements of \(\widetilde{\textbf{D}} \widetilde{\textbf{M}}^{P}\), picked out by \(\text{ max}_0(\cdot )\), toward zero; since \(\widetilde{\textbf{M}}^{P}\) is non-negative, this pushes the positive elements of \(\widetilde{\textbf{D}}\) toward zero, leading to a small number of disjuncts in the thresholded disjunction \({\textbf{D}}\), i.e. a small number of disjuncts in \(\varphi\).
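A sketch of computing \(\text {J}_{int}\) (the column-wise membership test used to form \({\textbf{I}_0^{P}}\) is our own construction):

```python
import numpy as np

def cost_J_int(Ct, Dt, I0, I1P):
    """Penalty term J_int of Sect. 5.5 (a sketch).

    I0P keeps the interpretation vectors of the domain I0 that are not in the
    positive learning data I1P; continuous disjuncts made true on I0P are penalized.
    """
    in_I1P = (I0[:, :, None] == I1P[:, None, :]).all(axis=0).any(axis=1)
    I0P = I0[:, ~in_I1P]                              # I0 \ I1P, column-wise
    NP = Ct @ np.vstack([1 - I0P, I0P])
    MP = 1 - np.minimum(NP, 1)                        # continuous conjunction values
    return np.sum(np.maximum(Dt @ MP, 0))             # sum of max_0(Dt MP)

# The total cost becomes cost_J(Ct, Dt, I1, I2) + beta * cost_J_int(Ct, Dt, I0, I1P).
```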

We conduct a learning experiment with the 5-variable random DNFs, adding this penalty term to the cost function J in the form \(\beta \cdot \text {J}_{int}\) (\(\beta \ge 0\)) and varying \(\beta\) from 0 to 5.Footnote 12 We choose \(dr = 0.5\) and randomly generate a target DNF \(\varphi _0\) and the learning data \(( {\textbf{I}_1} ,\varphi _0( {\textbf{I}_1} ))\) as in Sect. 5.2. So half of the complete data necessary for identifying the target \(\varphi _0\) is supplied to the learner.

We run Mat_DNF on the learning data until learning error becomes zero and measure the exact accuracy of the learned DNF \(\varphi\) in each learning trial. Table 2 contains figures averaged over 100 trialsFootnote 13.

Table 2 The effect of \(\text {J}_{int}\) on the learned \(\varphi\)

Clearly, as \(\beta\) gets larger (while \(\models \text {DNF}( {\textbf{I}_1^{P}} ) \rightarrow \varphi\) continues to hold), the distance between the positive learning data \(\text {DNF}( {\textbf{I}_1^{P}} )\) and the learned DNF \(\varphi\) monotonically decreases, which verifies that the penalty term \(\text {J}_{int}\) can effectively control the degree of logical implication.

On the other hand, the distance between the target \(\varphi _0\) and the learned \(\varphi\) draws a convex curve w.r.t. \(\beta\), and \(\varphi\) achieves the maximum exact accuracy of 0.824 when dist(\(\varphi _0\),\(\varphi\)) reaches its minimum of 5.6. In other words, we can tune the distance between the target DNF and the learned DNF in vector spaces via the parameter \(\beta\) to obtain better generalization.

6 Learning Boolean networks

We apply Mat_DNF to learning Boolean networks (BNs), introduced by Kauffman (1969), which have been used to model gene regulatory networks in biology. A BN is a biological network whose nodes are genes with \(\{0,1\}\) states, and a state transition (activation of gene expression) of a gene occurs according to a Boolean formula associated with it. The learning task is to infer the Boolean formulas associated with the nodes from state transition data. Due to the general hardness results on learning Boolean formulas (Feldman, 2007), BN learning on a large scale is difficult. We select three BNs of moderate size from the literature for learning: one for the mammalian cell cycle from Fauré et al. (2006), one for the budding yeast cell cycle from Irons (2009) and one for myeloid differentiation from Krumsiek et al. (2011). Learning performance is evaluated in terms of the recovery rate of the original DNFs associated with a BN.

6.1 Learning a mammalian cell cycle BN

In the first learning experiment, we use a synchronous BN for the mammalian cell cycle having 10 nodes (genes) (Fauré et al., 2006), where state transitions occur simultaneously for all genes. A state of the BN is represented by a state vector \({\textbf{x}} \in \{0,1\}^{10}\); each gene_i is described by a Boolean variable \(x_i\) (\(1 \le i \le 10\)) and its state by \({\textbf{x}} (i) \in \{1,0\}\). A state transition of gene_i is controlled by a DNF \(\phi _i\) associated with it, i.e. the next state of gene_i is 1 if \(\phi _i( {\textbf{x}} ) = 1\) and 0 otherwise. We obtain from Fauré et al. (2006) the 10 DNFs associated with the 10 genes. For example, \(\phi _{6} = (\lnot x_1 \wedge \lnot x_4 \wedge \lnot x_5 \wedge \lnot x_{10}) \vee (\lnot x_1 \wedge \lnot x_4 \wedge x_6 \wedge \lnot x_{10}) \vee (\lnot x_1 \wedge \lnot x_5 \wedge x_6 \wedge \lnot x_{10})\) is associated with gene_6.

To see to what degree Mat_DNF can recover the original 10 DNFs, following (Inoue et al., 2014), we consider \(\phi _i\) (\(1 \le i \le 10\)) as a 10-variable Boolean function and prepare as learning data the complete input–output pair \(( {\textbf{I}_0^{(10)}} ,\phi _i( {\textbf{I}_0^{(10)}} ))\) for \(\phi _i\), where \({\textbf{I}_0^{(10)}}\) is the domain matrix for 10 variables containing 1024 interpretation vectors. Then we let Mat_DNF learn a DNF \(\varphi\) from \(( {\textbf{I}_0^{(10)}} ,\phi _i( {\textbf{I}_0^{(10)}} ))\)Footnote 14 and check whether \(\varphi\) is identical to the original \(\phi _i\). The result is encouraging: nine of the original 10 DNFs are successfully recovered (modulo renaming) and the remaining one is logically equivalent to the original DNF.
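As a concrete illustration of this data preparation, the complete input–output pair for gene_6 could be generated as follows (a sketch; phi_6 directly encodes the DNF shown above, and the variable names are ours):

```python
import numpy as np
from itertools import product

def phi_6(x):
    """phi_6 of Faure et al. (2006) as a Boolean function of the 10-gene state x."""
    x1, x4, x5, x6, x10 = x[0], x[3], x[4], x[5], x[9]
    return int((not x1 and not x4 and not x5 and not x10) or
               (not x1 and not x4 and x6 and not x10) or
               (not x1 and not x5 and x6 and not x10))

# domain matrix I0^(10): all 1024 states as columns, and the complete outputs for gene_6
I0_10 = np.array(list(product([0, 1], repeat=10))).T        # shape (10, 1024)
I2 = np.array([phi_6(I0_10[:, k]) for k in range(1024)])    # phi_6(I0^(10))
```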

To understand the origin of this high recovery rate, we pick the DNF \(\phi _{6}\) associated with gene_6 and examine the effect of noise-expansion on it. We consider \(\phi _{6}\) as a 5-variable Boolean function over the domain matrix \({\textbf{I}_0^{(5)}}\) and measure \(\text {acc\_DNF}\) w.r.t. dr. To measure \(\text {acc\_DNF\_noise}\), we append a 5 dimensional random bit vector to each interpretation vector in \({\textbf{I}_0^{(5)}}\). The learning result is shown in Fig. 2, where the figures are averages over 100 trials. We see that noise-expansion improves acc_DNF much more than in the case of Fig. 1. For example, it achieves acc_DNF = 0.817 at dr = 0.1, which means that, on average, given only 3 input–output pairs, Mat_DNF with noise-expansion learns a DNF that correctly predicts 26 of the 32 possible input–output pairs in \(( {\textbf{I}_0^{(5)}} ,\phi _{6}( {\textbf{I}_0^{(5)}} ))\). Such high accuracies plotted in Fig. 2 strongly suggest that noise-expansion helps Mat_DNF find a DNF with high generalizability, or even the original DNF. We can also point out that the big difference in the effect of noise-expansion between Fig. 1 and Fig. 2 might be attributed to the nature of the learning target \(\phi _{6}\), which is not randomly generated but comes from the biological literature.

Fig. 2 “Exact accuracy” of the DNF learned from \(\phi _6\) with noise-expansion

Now look again at the learning experiment with the mammalian cell cycle BN. Note that although \(\phi _6\) is a function of the 5 variables \(\{x_1,x_4,x_5,x_6,x_{10}\}\), it is treated as a function of the 10 variables \(\{x_1,\ldots ,x_{10}\}\) in the experiment. So the remaining 5 variables \(\{x_2,x_3,x_7,x_8,x_9\}\) behave as noise bits in learning, just like in noise-expansion. This implicit noise-expansion happens in the learning of all DNFs \(\{\phi _1,\ldots ,\phi _{10} \}\) because each of them contains at most 6 variables. Moreover, since they are not random DNFs, noise-expansion can be particularly effective, as shown in Fig. 2, and hence it is not unreasonable to assume that Mat_DNF is likely to be able to learn the original DNFs, which explains the high recovery rate of the original DNFs.

Table 3 Examples of DNFs learned from \(\phi _6\)

We conclude this section by looking at DNFs learned from insufficient data, to develop an insight into the syntactic aspects of learned DNFs and their logical relationship to the target DNF. Table 3 lists some DNFs learned from an input–output pair for \(\phi _6\) obtained by applying \(\phi _6\), as a 10-variable function, to \(2^{10} \cdot dr\) interpretation vectors sampled without replacement from the domain matrix \({\textbf{I}_0^{(10)}}\).Footnote 15

In Table 3, for \(dr \in \{1.0, 0.8, 0.5\}\), every dataset used for learning contains 32 different input–output pairs over the five relevant variables, i.e. complete information about \(\phi _6\). That is why all learned DNFs are logically equivalent to \(\phi _6\). At dr = 0.3, the learning data still contains all information on \(\phi _6\). Nonetheless, the learned DNF has extraneous variables not appearing in the original \(\phi _6(x_1,x_4,x_5,x_6,x_{10})\), which destroy the logical equivalence to \(\phi _6\), though it still remains a logical consequence of \(\phi _6\). When dr is further lowered to \(dr = 0.1\), the constraint imposed by the learning data is loosened further. So more conjunctions and extraneous variables are introduced into the learned DNF, and they stop the learned DNF from being either a logical consequence of or logically equivalent to \(\phi _6\).

6.2 Learning a budding yeast cell cycle BN

We conduct the second experiment with a synchronous BN for the budding yeast cell cycle taken from Irons (2009). Since it contains 18 genes (DNFs) and preparing gene expression data is very time-consuming, it is unrealistic to assume the whole domain matrix \({\textbf{I}_0^{(18)}}\), containing \(2^{18} = 262,144\) data points, as learning data for a Boolean formula \(\phi _i\) for gene_i in the BN (Irons, 2009) (\(1 \le i \le 18\)).

We instead randomly generate a set \({\textbf{I}_1^\textrm{rand}}\) of 1,000 state vectors and use \(( {\textbf{I}_1^{\textrm{rand}}} , \phi _i( {\textbf{I}_1^{\textrm{rand}}} ))\) (\(1 \le i \le 18\)) as learning data to learn a DNF for \(\phi _i\).Footnote 16
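For illustration, generating such learning data can be sketched as follows. This is a minimal sketch: make_learning_data is a hypothetical helper, and the formulas \(\phi _i\) from Irons (2009) are assumed to be given as callables rather than reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# I_1^rand: 1,000 randomly generated 18-bit state vectors
# (out of the 2^18 = 262,144 states of the whole domain matrix I_0^(18)).
I1_rand = rng.integers(0, 2, size=(1000, 18))

def make_learning_data(phi_i):
    """Learning data (I_1^rand, phi_i(I_1^rand)) for gene_i.

    phi_i is the Boolean formula for gene_i, given as a callable
    over (m, 18) 0/1 matrices.
    """
    return I1_rand, phi_i(I1_rand)
```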

In this experiment, 17 of the 18 original DNFs are successfully recovered in at most three trials, and the remaining DNF is logically equivalent to the original one. Considering the severe data scarcity, with only \(0.38\%\) (\(1000/2^{18}\)) of the whole data supplied as learning data, this success rate is somewhat surprising. It can, however, again be explained by the effect of implicit noise-expansion, as in the mammalian cell cycle case, because the set of variables relevant to a target gene is surely a proper subset of the 18 variables and the remaining irrelevant ones behave as noise.

6.3 Learning a myeloid differentiation BN

The last example is learning an asynchronous BN with 11 genes for the myeloid differentiation process (Krumsiek et al., 2011). In this “biologically more feasible” BN (Gao et al., 2018), state transition occurs asynchronously: a gene is nondeterministically chosen and the Boolean function (DNF) associated with that gene is applied to the current state to determine the next state of the BN.

Following Gao et al. (2018), we generate learning data for the asynchronous BN by simulating all possible asynchronous state transitions starting from an “early, unstable undifferentiated state, where only GATA-2, C/EBPa, and PU.1 are active” (Krumsiek et al., 2011). This simulation generates 160 distinct, hierarchically layered states containing four point attractors that correspond to four mature blood cells. For each gene, we generate state transition data of size 160 from these states and let Mat_DNF learn it with over-iteration (extra_update = 100). Since a learned DNF varies with initialization, we repeat this asynchronous BN data learning ten times and take the majority of the ten learned DNFs as the learned DNF for the target gene.
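A minimal sketch of how all reachable asynchronous transitions can be enumerated is given below. It is our own reconstruction of the simulation idea, not the data-generation code of Gao et al. (2018), and the helper name asynchronous_transitions is hypothetical.

```python
from collections import deque

def asynchronous_transitions(phis, init_state):
    """Enumerate all asynchronous state transitions reachable from init_state.

    phis is a list of 11 update functions, one per gene; phis[i] maps a state
    tuple to the next 0/1 value of gene i.  At each state, every gene may be
    the one nondeterministically chosen for update.
    """
    seen = {init_state}
    transitions = set()
    queue = deque([init_state])
    while queue:
        s = queue.popleft()
        for i, phi in enumerate(phis):
            t = list(s)
            t[i] = phi(s)            # update only gene i, keep the rest unchanged
            t = tuple(t)
            transitions.add((s, t))
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen, transitions
```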

Out of the 11 DNFs to be recovered, Mat_DNF correctly recovered the original DNFs for 6 genes (Table 4); they are all pure conjunctions. The DNFs for the remaining 5 genes are recovered partially, losing at most three variables from the original ones. We also performed further measurements, described next.

We now compare our results with those of rfBFE (Gao et al., 2018) in more detail. rfBFE is one of the state-of-the-art BN learning algorithms and is a refinement of the BestFit extension algorithm (Lähdesmäki et al., 2003)Footnote 17. Since the purpose of BN learning is to infer the Boolean formulas governing the state transition process, the recovery rate of target Boolean formulas is the most important criterion. From this viewpoint, it is to be noted that, when applied to complete data generated by the synchronous BN, both rfBFE and Mat_DNF recover all 11 original DNFs. However, there is a big difference in execution time. While rfBFE takes only 1.24 s to process 11 complete datasets (\(2^{11}\) data points) for 11 genes according to Gao et al. (2018), Mat_DNF takes 483.1 s, which suggests the need to improve the implementation of Mat_DNF, for example by parallelization.

We also observe differences in terms of the “score”, i.e. the number of genes whose domain (set of regulators) is correctly inferred when the learning data is not complete. We randomly sample m states together with their state transitions and measure scores for \(m = 80, 160\) by running Mat_DNF on the sampled transitions.Footnote 18 We repeat this trial five times and take the average. The results are score = 8.8 for m = 80 and score = 10.6 for m = 160, which are lower than those of rfBFE reported in Gao et al. (2018), namely score = 10.8 for m = 80 and score = 10.9 for m = 160. This may be due to the lack of a special mechanism in Mat_DNF for identifying regulators (the domain).
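Under our reading of this score, namely the count of genes whose regulator set is inferred exactly, it can be computed as follows; the helper below is purely illustrative.

```python
def score(true_regulators, inferred_regulators):
    """Number of genes whose domain (set of regulators) is inferred exactly.

    Both arguments are lists of sets; entry i is the set of variables
    occurring in the (original or learned) Boolean formula for gene i.
    """
    return sum(1 for t, p in zip(true_regulators, inferred_regulators) if t == p)
```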

Table 4 Recovered Boolean formulas for the asynchronous myeloid differentiation BN

For the asynchronous learning data described above, Mat_DNF and rfBFE return the Boolean formulas listed in Table 4.Footnote 19 Table 4 shows that Mat_DNF and rfBFE return exactly the same Boolean formulas except for the gene PU.1, and both successfully recover six original Boolean formulas. Concerning PU.1, however, while Mat_DNF successfully recovers one of the two original disjuncts, rfBFE recovers no original disjunct, or recovers only one of the four original conjuncts if the original formula is read as a CNF. So, as far as the target asynchronous BN (Krumsiek et al., 2011) is concerned, Mat_DNF seems qualitatively competitive with rfBFE, though its learning is considerably slower.

7 Related work

From a logical point of view, Mat_DNF infers a matricized DNF as an interpolant by numerical optimization, and, as far as we know, there is no previous work of the same kind. As Sect. 4 reveals, any interpolant represented by a matricized DNF \(\varphi = ( {\textbf{C}} , {\textbf{D}} )\) between the positive and negative data is translated to a single layer ReLU network described by (2) with network parameters \(( {\textbf{C}} , {\textbf{D}} )\), and vice versa. This mutual translation is expected to foster cross-fertilization between NNs and logic. For example, a logical characterization of interpolants with good generalizability can contribute to designing NNs with high generalizability.

On the optimization side, our approach is categorized as continuous, unconstrained global optimization applied to DNFs instead of CNFs (Gu et al., 1996). What differs from the traditional approaches surveyed in Gu et al. (1996) is Mat_DNF’s cost function, which, for instance, encodes a conjunction as a sum of piecewise multivariate linear terms, unlike those in Gu et al. (1996), which encode a conjunction by a product of functions in one form or another.

Representing Boolean formulas by matrices is an established idea. Theoretically, we can represent any Boolean formula in n variables by a \(2^n \times 2^n\) or \(2n \times 2^n\) matrix (Cheng and Qi, 2010; Kobayashi and Hiraishi, 2014). Our matricized DNF representation can also require a matrix \({\textbf{C}}\) of similar size, for example \(2^{n-1} \times 2n\) to represent the n-parity function. The technique of learning and outputting Boolean formulas represented by matrices has already been applied to learning AND/OR BNs in Sato and Kojima (2021), but with a different purpose: Sato and Kojima (2021) aims at finding useful logical patterns in biological data, whereas the DNFs in this paper are learned to verify or suggest BNs.
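To make the \(2^{n-1} \times 2n\) figure concrete, the following sketch constructs a matricized DNF \(( {\textbf{C}} , {\textbf{D}} )\) for the n-parity function, assuming, as an illustrative convention of ours rather than the paper's exact encoding, that the 2n columns of \({\textbf{C}}\) encode the literals \(x_1,\ldots ,x_n, \lnot x_1,\ldots ,\lnot x_n\).

```python
import numpy as np
from itertools import product

def parity_matricized_dnf(n):
    """Matricized DNF (C, D) for the n-parity function.

    C has one row per odd-parity assignment (2^(n-1) rows) and 2n columns,
    here assumed to encode the literals x_1..x_n followed by ~x_1..~x_n.
    D is a single all-ones row: the disjunction of all conjunctions.
    """
    rows = []
    for bits in product([0, 1], repeat=n):
        if sum(bits) % 2 == 1:                 # minterms on which parity is true
            pos = np.array(bits)               # x_i occurs iff bit i is 1
            neg = 1 - pos                      # ~x_i occurs iff bit i is 0
            rows.append(np.concatenate([pos, neg]))
    C = np.array(rows)                         # shape (2^(n-1), 2n)
    D = np.ones((1, C.shape[0]), dtype=int)
    return C, D

C, D = parity_matricized_dnf(4)                # C.shape == (8, 8), D.shape == (1, 8)
```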

Mat_DNF is a simple neuro-symbolic system that explicitly represents DNFs. From this neuro-symbolic viewpoint, we note that several NNs have been proposed that can learn DNFs (Towell and Shavlik, 1994; Payani and Fekri, 2019; Katzir et al., 2021). However, they all implicitly embed DNFs in their NN architecture. In KBANN-net (Towell and Shavlik, 1994), for example, a conjunction containing k literals is encoded as a neuron represented by a tree with k leaves, each having a link weight \(\omega\) (such as 4) for a positive literal and \(-\omega\) for a negative one, and the neuron is activated when \(k\cdot \omega\) exceeds \(\text {bias} = (k-1/2) \cdot \omega\). In Neural Logic Networks (Payani and Fekri, 2019), conjunctions are represented by a product of linear functions of the form \(1-m(1-x)\) where \(0< m < 1\), embedded in a neural network isomorphically to a DNF. In Net-DNF (Katzir et al., 2021), conjunctions are encoded by a trainable AND function \(\text {AND}( {\textbf{x}} ) = \text {tanh}( ( {\textbf{c}} \bullet L( {\textbf{x}} )^T ) - \Vert {\textbf{c}} \Vert _1 + 1.5)\) where \(L( {\textbf{x}} ) = \text {tanh}( {\textbf{x}} ^TW + {\textbf{b}} )\). As a result, these approaches need an extra process to reconstruct a DNF from the learned parameters.
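For reference, the two soft-conjunction forms quoted above can be sketched directly from their formulas; shapes and parameter handling are simplified, and these are illustrative reimplementations, not the authors' code.

```python
import numpy as np

def nln_conjunction(x, m):
    """Soft conjunction of Neural Logic Networks (Payani and Fekri, 2019):
    the product over inputs of 1 - m_i * (1 - x_i), with membership weights 0 < m_i < 1."""
    return np.prod(1.0 - m * (1.0 - x))

def net_dnf_and(x, W, b, c):
    """Trainable AND of Net-DNF (Katzir et al., 2021):
    tanh(c . L(x) - ||c||_1 + 1.5) with L(x) = tanh(x W + b)."""
    L = np.tanh(x @ W + b)
    return np.tanh(c @ L - np.abs(c).sum() + 1.5)
```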

There are logical approaches to BN learning (Inoue et al., 2014; Tourret et al., 2017; Chevalier et al., 2019; Gao et al., 2022). Logically, our work can be considered a matricized version of “learning from interpretation transition” in logic programming, in which a BN is represented by a propositional normal logic program (Inoue et al., 2014; Gao et al., 2022). The most closely related work is NN-LFIT (Tourret et al., 2017), which performs two-stage DNF learning: first, a single layer feed-forward NN is trained on state transition data; then, learned parameters irrelevant to the output are filtered out and DNFs are extracted from the remaining parameters. However, since their performance evaluation is based on the error rate of learned rules, not on the recovery rate of the learned DNFs as ours is, a direct comparison is difficult.

8 Conclusion

We proposed a simple feed-forward neural network, Mat_DNF, for the end-to-end learning of Boolean functions. It learns a Boolean function and outputs a matricized DNF realizing the target function, searching for the DNF as a root of a non-negative cost function by minimizing that cost function to zero. We also established a new connection between neural learning and logical inference by proving the equivalence between DNF learning by Mat_DNF and the inference of logical interpolants between the positive and negative input data. We applied Mat_DNF to learning two synchronous BNs and one asynchronous BN from the biological literature and empirically confirmed the effectiveness of our approach.

In doing so, we introduced the “domain ratio” dr as an indicator of data scarcity and defined generalization w.r.t. dr. By examining the generalizability of DNFs learned from scarce data while varying dr, we discovered that two operations, noise-expansion (expanding input vectors with noise vectors) and over-iteration (continuing learning after the learning error reaches zero), can considerably improve generalizability by shifting the choice of the learned DNF. These two operations explain the high recovery rate of the original DNFs in our BN learning experiments.

Future work includes a reimplementation of Mat_DNF on GPUs, the refinement of noise-expansion and over-iteration, and the pursuit of the idea of a binary classifier as a logical interpolant.