1 Introduction

Boolean networks (BNs) are a simple yet effective model of gene regulatory networks where nodes are genes and their state transition is controlled by Boolean functions (Kauffman, 1969). They have been studied mathematically (Cheng and Qi, 2010; Kobayashi and Hiraishi, 2014), logically in AI (Inoue et al., 2014; Tourret et al., 2017; Chevalier et al., 2019; Gao et al., 2022) and from the viewpoint of deep learning (Zhang et al., 2019). Their learning is reduced to learning Boolean functions from a set of input–output pairs and can be carried out for example by the REVEAL algorithm (Liang et al., 1998) or by the BestFit extension algorithm (Lähdesmäki et al., 2003).

In this paper, we propose a new approach to learning Boolean functions. We introduce a simple ReLU neural network (NN) called Mat_DNF that learns Boolean functions and outputs Boolean formulas in disjunctive normal form (DNFs). We represent a DNF by a pair \(( {\textbf{C}} , {\textbf{D}} )\) of binary matrices, where \({\textbf{C}}\) represents conjunctions and \({\textbf{D}}\) a disjunction. Mat_DNF learns a matricized DNF \(( {\textbf{C}} , {\textbf{D}} )\) as network parameters from the learning data by minimizing a non-negative cost function \(\text {J}( {\textbf{C}} , {\textbf{D}} )\) to zero. As a result, every network parameter in Mat_DNF has a clear meaning, (potentially) denoting a literal or a conjunction (disjunct)Footnote 1 in the learned DNF.

Although there exist several ways to represent Boolean functions such as decision trees (Oliveira and Sangiovanni-Vincentelli, 1993), polynomial threshold functions (Hansen and Podolskii, 2015), Boolean circuits (Malach and Shalev-Shwartz, 2019) and support vector machines (Mixon and Peterson, 2015), we choose DNFs for two reasons: one is explainability and the other is to relate the learning process to logical inference. Explainability is guaranteed as our network parameters directly represent a matricized DNF. Moreover, since the learned output is a DNF, exploring the logical relationship between the learning data and the learned DNF becomes possible, and we find that the learned DNF is what is called an interpolant in logic (Craig, 1957), interpolating between the positive and negative input data. This uncovers a new connection between neural learning and symbolic inference.

Boolean function learning can be either discrete or continuous. One group of methods, such as SAT encoding with integer programming (Kamath et al., 1992) and stochastic local search (Ruckert and Kramer, 2003), works in discrete spaces. The other group uses NNs in continuous spaces, such as simulating Boolean circuits (Malach and Shalev-Shwartz, 2019), Neural Logic Networks (Payani and Fekri, 2019) and Net-DNF (Katzir et al., 2021). Our approach sits between the two. Unlike the former, Mat_DNF is differentiable.Footnote 2 Unlike the latter, it explicitly operates on matricized DNFs, i.e. discrete expressions, rather than embedding them implicitly in the neural network architecture.

In the context of BN learning, Mat_DNF offers a robust yet explainable end-to-end approach as an alternative to previous ones (Liang et al., 1998; Lähdesmäki et al., 2003; Inoue et al., 2014; Tourret et al., 2017; Gao et al., 2022). Compared to the REVEAL algorithm (Liang et al., 1998) and the BestFit extension algorithm (Lähdesmäki et al., 2003), Mat_DNF imposes no limit on the number of function variables. So if there are 18 genes (Irons, 2009), DNFs in 18 variables are considered. The LF1T algorithm (Inoue et al., 2014) symbolically learns a BN represented as a ground normal logic program from state transitions; generalization is done by resolution. The NN-LFIT algorithm (Tourret et al., 2017) adopts a two-stage approach that learns features by a feed-forward NN and extracts DNFs from the learned parameters. D-LFIT (Gao et al., 2022) takes a further elaborated approach of combining two neural networks to reduce the search space. By comparison, Mat_DNF is a much simpler single-layer NN whose learned parameters directly represent a DNF, so there is no need for post-processing.

To improve the accuracy of the DNF learned from insufficient data, we introduce two operations. The first one is “noise-expansion”. It appends a noise vector to the input learning vector.Footnote 3 The second one is “over-iteration” which keeps learning even after learning error becomes zero. Since adding a noise vector causes extra steps of parameter update while moving around local minima of the cost function \(\text {J}\), the net effect of the first one is attributable to the second one. The fact that these two operations can considerably improve accuracy means that the choice of a root of the cost function \(\text {J}( {\textbf{C}} , {\textbf{D}} ) = 0\), or more generally the choice of a local minimum significantly affects accuracy and generalizability.

Finally we confirm the effectiveness of our approach through three learning experiments with literature-curated BNs (Fauré et al., 2006; Irons, 2009; Krumsiek et al., 2011). We applied Mat_DNF to learning data generated from these BNs to see if Mat_DNF can recover the original DNFs in BNs. For the first two synchronous BNs (Fauré et al., 2006; Irons, 2009), the recovery rate is high. By detailed analysis of the learning results, it is suggested that this high recovery rate is due to the effect of over-iteration caused by implicit noise-expansion. However, the third asynchronous BN (Krumsiek et al., 2011; Ribeiro et al., 2021) presents a much more difficult case and only six DNFs are completely recovered out of 11 original DNFs, though this result is comparable to that of rfBFE (Gao et al., 2018), one of the state-of-the-art BN learning algorithms.

Thus our contributions are threefold: first, a proposal of a new approach to the end-to-end learning of Boolean functions by an explainable single-layer NN, Mat_DNF, together with its application to BN learning; second, the establishment of the equivalence between neural learning of DNFs by Mat_DNF and symbolic inference of DNFs as interpolants between the positive and negative data; and third, the introduction of two new operations, noise-expansion and over-iteration, that can improve accuracy by shifting the choice of a local minimum.

In what follows, after a preliminary section, we introduce Mat_DNF in Sect. 3. We then prove the relationship between the learning by Mat_DNF and the inference of interpolants in logic in Sect. 4. Section 5 examines the behavior of Mat_DNF w.r.t. insufficient learning data and introduces noise-expansion and over-iteration that improve accuracy. Section 6 reports three BN learning experiments and Sect. 7 discusses related work. Section 8 is the conclusion.

2 Preliminaries

Throughout this paper, bold italic capital letters such as \({\textbf{A}}\) stand for matrices and bold italic lower case letters such as \({\textbf{a}}\) for vectors. We equate a one-dimensional matrix with a vector. The i-th element of \({\textbf{a}}\) is designated by \({\textbf{a}} (i)\) and the ij-th element of \({\textbf{A}}\) by \({\textbf{A}} (i,j)\). Given two \(m \times n\) matrices \({\textbf{A}}\) and \({\textbf{B}}\), \([ {\textbf{A}} ; {\textbf{B}} ]\) represents the \(2m \times n\) matrix of \({\textbf{A}}\) stacked onto \({\textbf{B}}\). \(\Vert {\textbf{a}} \Vert _1 = \sum _i \mid {\textbf{a}} (i) \mid\) denotes the 1-norm of \({\textbf{a}}\) and \(\Vert {\textbf{A}} \Vert _F\) the Frobenius norm of \({\textbf{A}}\). Let \({\textbf{a}}\) and \({\textbf{b}}\) be n dimensional vectors. Then \(( {\textbf{a}} \bullet {\textbf{b}} )\) stands for their inner product (dot product) and \({\textbf{a}} \odot {\textbf{b}}\) for their Hadamard product, i.e., \(( {\textbf{a}} \odot {\textbf{b}} )(i) = {\textbf{a}} (i) {\textbf{b}} (i)\) for \(i (1\le i \le n)\). For a scalar \(\theta\), \(( {\textbf{a}} )_{\ge \theta }\) denotes a binary vector such that \(( {\textbf{a}} )_{\ge \theta }(i) = 1\) if \({\textbf{a}} (i) \ge \theta\) and \(( {\textbf{a}} )_{\ge \theta }(i) = 0\) otherwise for \(i (1\le i \le n)\). Similarly \(1 - {\textbf{a}}\) denotes the complement of \({\textbf{a}}\), i.e. \((1 - {\textbf{a}} )(i) = 1 - {\textbf{a}} (i)\) for \(i (1\le i \le n)\). These notations naturally extend to matrices like \(( {\textbf{A}} )_{\ge \theta }\) and \(1 - {\textbf{A}}\). \(\text{ min}_1(x) = \text{ min }(x,1)\) is a function returning the lesser of x and 1, and \(\text{ min}_1( {\textbf{A}} )\) is the component-wise application of \(\text{ min}_1(x)\) to \({\textbf{A}}\). We implicitly assume that all dimensions of vectors and matrices in various expressions are compatible. Let \(d_1 \vee \cdots \vee d_h\) be a DNF in n variables. If every disjunct \(d_i\) is a conjunction of n distinct literals, the DNF is said to be full. For a set S, \(\mid S \mid\) stands for the number of elements in S.

3 Learning DNFs in vector spaces

3.1 Evaluating matricized DNFs

Let \(\varphi = (x_1 \wedge x_2) \vee (x_1 \wedge \lnot x_3)\) be a DNF in three variables. \(\varphi\) has two disjuncts \((x_1 \wedge x_2)\) and \((x_1 \wedge \lnot x_3)\). We represent \(\varphi\) by a pair \(( {\textbf{C}} , {\textbf{D}} )\) of binary matrices:

$$\begin{aligned} {\textbf{C}} = \begin{array}{c} \begin{array}{cccccc} x_1 &{} x_2 &{} x_3 &{} \lnot x_1 &{} \lnot x_2 &{} \lnot x_3 \end{array} \\ \left[ \begin{array}{cccccc} 1 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 \\ 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 \end{array} \right] \end{array} \qquad {\textbf{D}} = \left[ 1 \;\; 1 \right] \end{aligned}$$

As can be seen, each row of \({\textbf{C}}\) represents a disjunct (conjunction of literals) of \(\varphi\). For example, the first row of \({\textbf{C}}\) represents the first disjunct \((x_1 \wedge x_2)\) by setting \({\textbf{C}} (1,1) = {\textbf{C}} (1,2) = 1\). \({\textbf{D}}\), on the other hand, represents the choice of conjunctions as disjuncts; in the current case, both conjunctions in \({\textbf{C}}\) are chosen as disjuncts of \(\varphi\), as designated by \({\textbf{D}} = [1 \; 1]\). If \({\textbf{D}} = [1 \; 0]\), \(\varphi\) would contain only the first disjunct \((x_1 \wedge x_2)\) in \({\textbf{C}}\). Generally, a DNF \(\varphi\) in n variables with at most h disjuncts is represented by an \(h \times 2n\) binary matrix \({\textbf{C}}\) and a \(1 \times h\) binary matrix \({\textbf{D}}\). By default, we consider a DNF \(\varphi\) and its matrix representation \(( {\textbf{C}} , {\textbf{D}} )\) interchangeable and call \(( {\textbf{C}} , {\textbf{D}} )\) the matricized DNF \(\varphi\).

Now we describe how \(\varphi\) is evaluated as a Boolean function \(\varphi ( {\textbf{x}} )\) over its domain \({\textbf{I}_0} = \{1,0\}^n\) of bit sequences. Each \({\textbf{x}} \in {\textbf{I}_0}\) is equated with a binary column vector called “interpretation vector” representing an interpretation (assignment) such that a variable \(x_j\) (\(1 \le j \le n\)) is mapped to \({\textbf{x}} (j) \in \{1,0\}\). Henceforth for convenience we treat \({\textbf{I}_0}\) as an \(n \times 2^n\) binary matrix packed with such \(2^n\) possible interpretation vectors and specifically call it the domain matrix for n variables.

Let \({\textbf{x}}\) be an interpretation vector in \({\textbf{I}_0}\). A matricized DNF \(\varphi = ( {\textbf{C}} (h \times 2n), {\textbf{D}} (1 \times h))\) is evaluated by \({\textbf{x}}\) as follows. First compute a column vector \({\textbf{N}} = {\textbf{C}} [(1- {\textbf{x}} ); {\textbf{x}} ]\). \({\textbf{N}} (j)\) (\(1 \le j \le h\)) denotes the number of literals contained in the j-th conjunction of \({\textbf{C}}\) and falsified by \({\textbf{x}}\), and hence \(\text{ min}_1( {\textbf{N}} )(j) = 0\) holds if-and-only-if the j-th conjunction is true in \({\textbf{x}}\). Next compute a column vector \({\textbf{M}} = 1 - \text{ min}_1( {\textbf{N}} )\), the bit inversion of \(\text{ min}_1( {\textbf{N}} )\); \({\textbf{M}} (j)\) gives the truth value \(\in \{0,1\}\) of the j-th conjunction in \({\textbf{C}}\). Finally compute a scalar \({\textbf{V}} = {\textbf{D}} {\textbf{M}}\). It denotes the number of disjuncts in \(\varphi\) satisfied by \({\textbf{x}}\). Hence \(( {\textbf{V}} )_{\ge 1} \in \{0,1\}\) gives the truth value of \(\varphi\) evaluated by \({\textbf{x}}\). Write \({\textbf{x}} \models \varphi\) when \(\varphi\) is true in \({\textbf{x}}\), i.e. \({\textbf{x}}\) satisfies \(\varphi\). In fact we have \({\textbf{x}} \models \varphi \;\text {if-and-only-if}\; ( {\textbf{V}} )_{\ge 1} = 1\).

Write \({\textbf{C}} = [ {\textbf{C}^P} \, {\textbf{C}} ^N]\) where \({\textbf{C}^P} (h \times n)\) (resp. \({\textbf{C}^N} (h \times n)\)) is a submatrix representing positive (resp. negative) occurrences of variables in \(\varphi\). Then the whole evaluation process is described by one line (1):

$$\begin{aligned} \varphi ( {\textbf{x}} )= & {} ( {\textbf{D}} ( {\textbf{1}_{h}} - \text{ min}_1( {\textbf{C}} [( {\textbf{1}_{n}} - {\textbf{x}} ); {\textbf{x}} ] )))_{\ge 1} \end{aligned}$$
(1)
$$\begin{aligned}= & {} ( {\textbf{D}} ( {\textbf{1}_{h}} - \text{ min}_1( ( {\textbf{C}^N} - {\textbf{C}^P} ) {\textbf{x}} + {\textbf{C}^P} {\textbf{1}_{n}} )))_{\ge 1} \nonumber \\= & {} ( {\textbf{D}} ( \textrm{ReLU}( ( {\textbf{C}^P} - {\textbf{C}^N} ) {\textbf{x}} + {\textbf{1}_h} - {\textbf{C}^P} {\textbf{1}_{n}} ) ))_{\ge 1} \nonumber \\ & \text { because ReLU } (x) = \textrm{max}(x,0) = 1- \text{ min}_1(1-x) \end{aligned}$$
(2)

where \(\varphi ( {\textbf{x}} )\) denotes the truth value \(\in \{0,1\}\) of \(\varphi\) as a Boolean function evaluated by \({\textbf{x}}\). This notation naturally extends to a set of interpretation vectors, as in \(\varphi ( {\textbf{I}_0} )\). \({\textbf{1}_{h}}\) and \({\textbf{1}_{n}}\) are all-one vectors of length h and n respectively. We rewrite (1) into (2). What the latter tells us is that our evaluation process is exactly a forward pass of a single-layer ReLU network consisting of a linear output layer and a hidden layer with weight matrix \({\textbf{C}} ^P- {\textbf{C}} ^N\) and bias vector \({\textbf{1}_h} - {\textbf{C}^P} {\textbf{1}_{n}}\). We name this ReLU network Mat_DNF. It is a simple NN specialized for DNFs, derived from the evaluation process of a DNF where the disjunction \(x \vee y\) is replaced by \(\text{ min}_1(x+y)\) as in Łukasiewicz’s many-valued logic.
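To make the evaluation concrete, here is a minimal NumPy sketch of (1); the function name eval_dnf and the worked example are ours, not part of the original formulation.

```python
import numpy as np

def eval_dnf(C, D, x):
    """Evaluate a matricized DNF (C, D) on a binary interpretation vector x.

    C is an (h x 2n) binary matrix whose first n columns hold positive
    literals and whose last n columns hold negative literals; D is a
    length-h binary vector; x is a length-n binary vector (equation (1)).
    """
    N = C @ np.concatenate([1 - x, x])   # falsified literals per conjunction
    M = 1 - np.minimum(N, 1)             # truth value of each conjunction
    V = D @ M                            # number of satisfied disjuncts
    return int(V >= 1)                   # truth value of the DNF

# phi = (x1 & x2) | (x1 & ~x3) from Sect. 3.1
C = np.array([[1, 1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1]])
D = np.array([1, 1])
print(eval_dnf(C, D, np.array([1, 1, 0])))   # 1: both disjuncts are satisfied
print(eval_dnf(C, D, np.array([0, 1, 1])))   # 0: x1 = 0 falsifies both disjuncts
```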

3.2 Learning DNFs by Mat_DNF

By adding a backward pass to the equation (1), Mat_DNF can learn Boolean functions. Here we describe how Mat_DNF learns them. Let f be a target Boolean function in n variables and \({\textbf{I}_0} = [ {\textbf{x}_1} \cdots {\textbf{x}_{2^n}} ]\) the domain matrix for n variables. In learning, we are given a submatrix \({\textbf{I}_1} (n \times l) = [ {\textbf{x}_{i_1}} \cdots {\textbf{x}_{i_l}} ]\) \((l \le 2^n)\) of \({\textbf{I}_0}\). \({\textbf{I}_1}\) is mapped by f to a \(1 \times l\) row vector \({\textbf{I}_2} = f( {\textbf{I}_1} ) = [f( {\textbf{x}_{i_1}} ) \cdots f( {\textbf{x}_{i_l}} )]\). \(( {\textbf{I}_1} , {\textbf{I}_2} )\) \(= ( {\textbf{I}_1} ,f( {\textbf{I}_1} ))\) is called an input–output pair for f and \({\textbf{I}_1}\) its input domain. Learning a DNF \(\varphi\) here thus means a learner receives an input–output pair \(( {\textbf{I}_1} , {\textbf{I}_2} ) = ( {\textbf{I}_1} ,f( {\textbf{I}_1} ))\) for a target Boolean function f and returns a DNF \(\varphi\) such that \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\). Mat_DNF receives \(( {\textbf{I}_1} , {\textbf{I}_2} )\) and returns a matricized DNF \(\varphi\) such that \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\) when it stops with learning error = 0.

Let \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) be real matrices. They are relaxed versions of \({\textbf{C}}\) and \({\textbf{D}}\). Introduce \(\text{ max}_0(x) = \textrm{max}(x,0)\) (ReLU), \(\widetilde{\textbf{N}} = \widetilde{\textbf{C}}[(1- {\textbf{I}_1} );\! {\textbf{I}_1} ]\), \(\widetilde{\textbf{M}} = 1-\text{ min}_1( \widetilde{\textbf{N}})\), \(\widetilde{\textbf{V}} = \widetilde{\textbf{D}} \widetilde{\textbf{M}}\), \(\textrm{Y} = \Vert \widetilde{\textbf{C}} \odot (1 - \widetilde{\textbf{C}}) \Vert _F^2\) and \(\textrm{Z} = \Vert \widetilde{\textbf{D}} \odot (1 - \widetilde{\textbf{D}}) \Vert _F^2\). Then define a non-negative cost function \(\text {J}( \widetilde{\textbf{C}}, \widetilde{\textbf{D}})\) by

$$\begin{aligned} \text {J}= & {} ( {\textbf{I}_2} \bullet (1 - \text{ min}_1( \widetilde{\textbf{V}}))) + ((1 - {\textbf{I}_2} ) \bullet \text{ max}_0( \widetilde{\textbf{V}})) +\, (1/2)\textrm{Y} + (1/2)\textrm{Z}. \end{aligned}$$
(3)

The first term \(( {\textbf{I}_2} \bullet (1 - \text{ min}_1( \widetilde{\textbf{V}})))\) is a non-negative scalar and deals with the case of \(f( {\textbf{x}_{i_j}} ) = {\textbf{I}_2} (i_j) = 1\) (\(1\le j \le l\)). Likewise the second term \(((1 - {\textbf{I}_2} ) \bullet \text{ max}_0( \widetilde{\textbf{V}}))\) is non-negative and takes care of the case of \(f( {\textbf{x}_{i_j}} ) = {\textbf{I}_2} (i_j) = 0\). Y and Z are penalty terms to make \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) binary respectively.
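A direct transcription of the cost (3) might look as follows; this is a sketch, with Ct and Dt standing for the relaxed matrices \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) (the function and argument names are ours).

```python
import numpy as np

def cost_J(Ct, Dt, I1, I2):
    """Cost function (3) for relaxed parameters Ct (h x 2n) and Dt (h,).

    I1 is an (n x l) binary matrix of interpretation vectors and I2 the
    corresponding (l,) binary output vector.
    """
    N = Ct @ np.vstack([1 - I1, I1])      # continuous falsified-literal counts
    M = 1 - np.minimum(N, 1)
    V = Dt @ M                            # continuous disjunction values
    pos = I2 @ (1 - np.minimum(V, 1))     # penalize outputs that should be 1
    neg = (1 - I2) @ np.maximum(V, 0)     # penalize outputs that should be 0
    Y = np.sum((Ct * (1 - Ct)) ** 2)      # push Ct toward a binary matrix
    Z = np.sum((Dt * (1 - Dt)) ** 2)      # push Dt toward a binary matrix
    return pos + neg + 0.5 * Y + 0.5 * Z
```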

Proposition 1

J(\(\widetilde{\textbf{C}}\), \(\widetilde{\textbf{D}}\)) = 0 \(\,\text {if-and-only-if}\,\) \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) are binary matrices representing a DNF \(\varphi\) such that \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\).

Proof

We prove the only-if part; the converse is obvious. Suppose J = J(\(\widetilde{\textbf{C}}\),\(\widetilde{\textbf{D}}\)) = 0. Then every term in (3) is zero. Y = Z = 0 immediately implies that \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) are binary. Let \(\varphi\) be the DNF represented by them. The first term deals with the case of \({\textbf{I}_2} (i_j) = f( {\textbf{x}_{i_j}} ) = 1\) \((1 \le j \le l)\). It is a sum of non-negative summands of the form \((1 - \text{ min}_1( \widetilde{\textbf{V}}(i_j)))\). Hence J = 0 implies \(\text{ min}_1( \widetilde{\textbf{V}}(i_j)) = 1\), i.e. \(\varphi\) is true in \({\textbf{x}_{i_j}} \in {\textbf{I}_1}\) when \({\textbf{I}_2} (i_j) = 1\). The second term is dual to the first, dealing with the case of \({\textbf{I}_2} (i_j) = 0\). Similarly to the first term, we can prove that \(\varphi\) is false in \({\textbf{x}_{i_j}} \in {\textbf{I}_1}\) when \({\textbf{I}_2} (i_j) = 0\). Combining the two, we conclude that \(\varphi\) gives \({\textbf{I}_2}\) when evaluated by \({\textbf{I}_1}\), i.e., \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\). \(\square\)

Learning by Mat_DNF is carried out based on Proposition 1 by minimizing \(\text {J}\) until \(\text {J} = 0\) using gradient descent. \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) are iteratively updated by their Jacobians, \({\textbf{J}} _a^{\tilde{C}}\) for \(\widetilde{\textbf{C}}\) and \({\textbf{J}} _a^{\tilde{D}}\) for \(\widetilde{\textbf{D}}\), for example like \(\widetilde{\textbf{C}} = \widetilde{\textbf{C}} - \alpha {\textbf{J}} _a^{\tilde{C}}\) where \(\alpha >0\) is a learning rate. To compute the Jacobians, we introduce \(\widetilde{\textbf{W}} = -( \widetilde{\textbf{V}})_{\le 1} \odot {\textbf{I}_2} + ( \widetilde{\textbf{V}})_{\ge 0} \odot (1 - {\textbf{I}_2} )\). Then \({\textbf{J}} _a^{\tilde{C}}\) and \({\textbf{J}} _a^{\tilde{D}}\) are computed by (4).

$$\begin{aligned} {\textbf{J}} _a^{\tilde{C}}= & {} (( -( \widetilde{\textbf{N}})_{\le 1}) \odot ( \widetilde{\textbf{D}}^T \widetilde{\textbf{W}}) )[(1- {\textbf{I}_1} );\! {\textbf{I}_1} ]^T +\; (1 - 2 \widetilde{\textbf{C}}) \odot {\textbf{Y}} \nonumber \\ {\textbf{J}} _a^{\tilde{D}}= & {} \widetilde{\textbf{W}} \widetilde{\textbf{M}}^T + (1 - 2 \widetilde{\textbf{D}}) \odot {\textbf{Z}} \end{aligned}$$
(4)

These Jacobians are derived as follows. We first derive \({\textbf{J}} _a^{\tilde{C}}\). Let \(\widetilde{\textbf{C}}_{pq} = \widetilde{\textbf{C}}(p,q)\) be an arbitrary element of \(\widetilde{\textbf{C}}\). Put \(\Delta _Y = (1 - 2 \widetilde{\textbf{C}}) \odot {\textbf{Y}}\). We have

$$\begin{aligned} \partial \widetilde{\textbf{M}}/\partial \widetilde{\textbf{C}}_{pq}= & {} - \partial \text{ min}_1( \widetilde{\textbf{N}})/\partial \widetilde{\textbf{C}}_{pq} \\= & {} - ( \widetilde{\textbf{N}})_{\le 1} \odot ( {\textbf{I}} _{pq}(1-[ {\textbf{I}} _1;\! (1- {\textbf{I}} _1)]) ) \end{aligned}$$

where \({\textbf{I}} _{pq}\) is a zero matrix except for the pq-th element which is 1. We use \(( {\textbf{A}} \bullet {\textbf{B}} ) = \sum _{i,j} {\textbf{A}} (i,j) {\textbf{B}} (i,j)\) to denote the dot product of \({\textbf{A}}\) and \({\textbf{B}}\). Note \(( {\textbf{A}} \bullet ( {\textbf{B}} \odot {\textbf{C}} )) = (( {\textbf{B}} \odot {\textbf{A}} ) \bullet {\textbf{C}} )\) and \(( {\textbf{A}} \bullet ( {\textbf{B}} {\textbf{C}} )) = (( {\textbf{B}} ^T {\textbf{A}} ) \bullet {\textbf{C}} ) = (( {\textbf{A}} {\textbf{C}} ^T) \bullet {\textbf{B}} )\) hold. Then put \(\delta _Y = (\Delta _Y \bullet {\textbf{I}} _{pq})\) and compute the partial derivative of J w.r.t. \(\widetilde{\textbf{C}}_{pq}\) as follows:

$$\begin{aligned}{} & {} \partial \text {J}/\partial \widetilde{\textbf{C}}_{pq} \\{} & {} \quad = ( {\textbf{I}} _2 \bullet (- ( \widetilde{\textbf{V}})_{\le 1}\odot (\partial \widetilde{\textbf{V}} / \partial \widetilde{\textbf{C}}_{pq})) ) + ( (1- {\textbf{I}} _2) \bullet ( ( \widetilde{\textbf{V}})_{\ge 0} \odot (\partial \widetilde{\textbf{V}} / \partial \widetilde{\textbf{C}}_{pq})) ) + \delta _Y \\{} & {} \quad = ( ( -( \widetilde{\textbf{V}})_{\le 1}\odot {\textbf{I}} _2) \bullet (\partial \widetilde{\textbf{V}} / \partial \widetilde{\textbf{C}}_{pq}) ) + ( (( \widetilde{\textbf{V}})_{\ge 0}\odot (1 - {\textbf{I}} _2)) \bullet (\partial \widetilde{\textbf{V}} / \partial \widetilde{\textbf{C}}_{pq}) ) + \delta _Y \\{} & {} \quad = ( (- ( \widetilde{\textbf{V}})_{\le 1} \odot {\textbf{I}} _2 + ( \widetilde{\textbf{V}})_{\ge 0} \odot (1- {\textbf{I}} _2)) \bullet ( \widetilde{\textbf{D}}( \partial \widetilde{\textbf{M}}/ \partial \widetilde{\textbf{C}}_{pq} )) ) + \delta _Y \\{} & {} \quad = ( (-( \widetilde{\textbf{N}})_{\le 1} \odot ( \widetilde{\textbf{D}}^T ( - ( \widetilde{\textbf{V}})_{\le 1} \odot {\textbf{I}} _2 + ( \widetilde{\textbf{V}})_{\ge 0} \odot (1- {\textbf{I}} _2) ) ) ) (1-[ {\textbf{I}} _1;\! (1- {\textbf{I}} _1)])^T \bullet {\textbf{I}} _{pq} ) \\{} & {} \quad + (\Delta _Y \bullet {\textbf{I}} _{pq} ) \\{} & {} \quad = ( ((-( \widetilde{\textbf{N}})_{\le 1} \odot ( \widetilde{\textbf{D}}^T \widetilde{\textbf{W}})) (1-[ {\textbf{I}} _1;\! (1- {\textbf{I}} _1)])^T + \Delta _Y) \bullet {\textbf{I}} _{pq} ) \end{aligned}$$

Since pq are arbitrary, we have

$$\begin{aligned} {\textbf{J}} _a^{\tilde{C}}= & {} \partial \text {J}/ \partial \widetilde{\textbf{C}} \\= & {} ( -( \widetilde{\textbf{N}})_{\le 1} \odot ( \widetilde{\textbf{D}}^T \widetilde{\textbf{W}})) (1-[ {\textbf{I}} _1;\! (1- {\textbf{I}} _1)])^T + \Delta _Y \\{} & {} \text {where} \;\; \widetilde{\textbf{W}} = - ( \widetilde{\textbf{V}})_{\le 1}\odot {\textbf{I}} _2 + ( \widetilde{\textbf{V}})_{\ge 0}\odot (1- {\textbf{I}} _2). \end{aligned}$$

Next we derive \({\textbf{J}} _a^{\tilde{D}} = \partial \text {J}/\partial \widetilde{\textbf{D}}\) similarly. Put \(\Delta _Z = (1 - 2 \widetilde{\textbf{D}}) \odot {\textbf{Z}}\) and \(\delta _Z = (\Delta _Z \bullet {\textbf{I}} _{pq} )\). Then for arbitrary p,q, we see

$$\begin{aligned} \partial \text {J}/\partial \widetilde{\textbf{D}}_{pq}= & {} ( {\textbf{I}} _2 \bullet -\partial \text{ min}_1( \widetilde{\textbf{V}}) /\partial \widetilde{\textbf{D}}_{pq} ) + ( 1- {\textbf{I}} _2 \bullet \partial \text{ max}_0( \widetilde{\textbf{V}}) /\partial \widetilde{\textbf{D}}_{pq} ) + \delta _Z \\= & {} ( (-( \widetilde{\textbf{V}})_{\le 1}\odot {\textbf{I}} _2) + ( \widetilde{\textbf{V}})_{\ge 0}\odot (1 - {\textbf{I}} _2) \bullet \partial \widetilde{\textbf{V}}/\partial \widetilde{\textbf{D}}_{pq} ) + \delta _Z \\= & {} ( ((-( \widetilde{\textbf{V}})_{\le 1}\odot {\textbf{I}} _2) + ( \widetilde{\textbf{V}})_{\ge 0}\odot (1 - {\textbf{I}} _2)) \widetilde{\textbf{M}}^T \bullet {\textbf{I}} _{pq} ) + \delta _Z \\= & {} ( \widetilde{\textbf{W}} \widetilde{\textbf{M}}^T \bullet {\textbf{I}} _{pq} ) + (\Delta _Z \bullet {\textbf{I}} _{pq} ) \\= & {} ( ( \widetilde{\textbf{W}} \widetilde{\textbf{M}}^T + \Delta _Z) \bullet {\textbf{I}} _{pq} ). \end{aligned}$$

So we reach \({\textbf{J}} _a^{\tilde{D}} = \partial \text {J}/\partial \widetilde{\textbf{D}} = \widetilde{\textbf{W}} \widetilde{\textbf{M}}^T + \Delta _Z\). In actual learning, we use an adaptive gradient method Adam (Kingma and Ba, 2015) instead of gradient descent with a constant learning rate.
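The Jacobians (4) translate almost directly into NumPy. The sketch below reads the bold Y and Z in (4) as the element-wise matrices \(\widetilde{\textbf{C}} \odot (1-\widetilde{\textbf{C}})\) and \(\widetilde{\textbf{D}} \odot (1-\widetilde{\textbf{D}})\) arising from the penalty terms; this reading, and the transcription itself, are ours.

```python
import numpy as np

def jacobians(Ct, Dt, I1, I2):
    """Jacobians (4) of the cost J w.r.t. the relaxed parameters Ct and Dt."""
    X = np.vstack([1 - I1, I1])                      # [(1 - I1); I1]
    N = Ct @ X
    M = 1 - np.minimum(N, 1)
    V = Dt @ M
    W = -(V <= 1).astype(float) * I2 + (V >= 0).astype(float) * (1 - I2)
    JC = (-(N <= 1).astype(float) * np.outer(Dt, W)) @ X.T \
         + (1 - 2 * Ct) * Ct * (1 - Ct)              # penalty-term contribution
    JD = W @ M.T + (1 - 2 * Dt) * Dt * (1 - Dt)
    return JC, JD
```

Plain gradient descent or Adam can then be applied to the returned matrices.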

3.3 Learning algorithm

Given an input–output pair \(( {\textbf{I}_1} , {\textbf{I}_2} )\) such that \(f( {\textbf{I}_1} ) = {\textbf{I}_2}\) for the target Boolean function f, Mat_DNF returns a matricized DNF \(\varphi = ( {\textbf{C}} , {\textbf{D}} )\) giving \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\), basically by running Algorithm 1 until \(\text {J} = 0\).

[Algorithm 1: the Mat_DNF learning procedure, shown as a figure in the original]

We however take a practical approach of thresholding \(( \widetilde{\textbf{C}}, \widetilde{\textbf{D}})\) to binary \(( {\textbf{C}}\), \({\textbf{D}} )\) even before \(\text {J} = 0\) is reached, assuming J is small and \(\widetilde{\textbf{C}}, \widetilde{\textbf{D}}\) are close to binary matrices. In more detail, the inner q-loop in Algorithm 1 iteratively updates \(( \widetilde{\textbf{C}}, \widetilde{\textbf{D}})\) at most \(max\_itr\) times while thresholding them optimally to binary \(( {\textbf{C}} , {\textbf{D}} )\) (lines 6-8)Footnote 4 and computing learning_error using them. If \(\varphi = ( {\textbf{C}} , {\textbf{D}} )\) achieves \(\text {learning\_error} = 0\), it exits from the q-loop and the p-loop and returns \(\varphi\). If \(\text {learning\_error} > 0\) still holds after \(max\_itr\) iterations, it restarts the next q-loop with \(( \widetilde{\textbf{C}}, \widetilde{\textbf{D}})\) perturbed by (5), where \(\Delta _a\) and \(\Delta _b\) are matrices of the same size as \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) respectively, whose elements are sampled from the standard normal distribution \(\mathcal{N}(0,1)\). The perturbed \(\widetilde{\textbf{C}}\) and \(\widetilde{\textbf{D}}\) are used as initial parameters in the next loop (line 16). This perturbation is intended to escape from a local minimum.

$$\begin{aligned} \begin{array}{ll} \widetilde{\textbf{C}_0 } = \sqrt{2/(h\cdot 2n)} \Delta _a + 0.5,\;\; \widetilde{\textbf{C}} = 0.5\cdot ( \widetilde{\textbf{C}} + \widetilde{\textbf{C}_0}) \\ \widetilde{\textbf{D}_0 } = \sqrt{2/h} \Delta _b + 0.5,\;\; \widetilde{\textbf{D}} = 0.5\cdot ( \widetilde{\textbf{D}} + \widetilde{\textbf{D}_0}) \end{array} \end{aligned}$$
(5)

Restart is allowed at most \(max\_try\) times. Note that Mat_DNF possibly fails to achieve \(\text {learning\_error} = 0\) within given h, \(max\_itr\) and \(max\_try\),Footnote 5 but when Mat_DNF returns a matricized DNF \(\varphi = ( {\textbf{C}} , {\textbf{D}} )\) with learning_error = 0, it is guaranteed that J(\({\textbf{C}}\),\({\textbf{D}}\)) = 0 and \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\) hold.
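For reference, the loop structure of Algorithm 1 (inner update loop, thresholding, restart with the perturbation (5)) can be sketched as follows. It reuses the jacobians sketch above, thresholds at a fixed 0.5 rather than "optimally", uses plain gradient descent instead of Adam, and the initialization mirroring (5) is our own assumption.

```python
import numpy as np

def mat_dnf(I1, I2, h, lr=0.1, max_itr=1000, max_try=10, seed=0):
    """A sketch of Algorithm 1: learn a matricized DNF (C, D) with learning_error = 0."""
    rng = np.random.default_rng(seed)
    n, _ = I1.shape
    Ct = np.sqrt(2 / (h * 2 * n)) * rng.standard_normal((h, 2 * n)) + 0.5
    Dt = np.sqrt(2 / h) * rng.standard_normal(h) + 0.5
    for _ in range(max_try):                       # p-loop: restarts
        for _ in range(max_itr):                   # q-loop: parameter updates
            JC, JD = jacobians(Ct, Dt, I1, I2)
            Ct, Dt = Ct - lr * JC, Dt - lr * JD
            C = (Ct >= 0.5).astype(int)            # threshold to binary matrices
            D = (Dt >= 0.5).astype(int)
            pred = (D @ (1 - np.minimum(C @ np.vstack([1 - I1, I1]), 1)) >= 1).astype(int)
            if np.sum(np.abs(pred - I2)) == 0:     # learning_error = 0
                return C, D
        # restart: perturb the relaxed parameters as in (5)
        Ct = 0.5 * (Ct + np.sqrt(2 / (h * 2 * n)) * rng.standard_normal((h, 2 * n)) + 0.5)
        Dt = 0.5 * (Dt + np.sqrt(2 / h) * rng.standard_normal(h) + 0.5)
    return None                                    # failed within max_itr and max_try
```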

4 Learning as logical interpolation: a logical perspective

Here we characterize the learning of a DNF \(\varphi\) by Mat_DNF from a logical perspective. Write \(\models \phi _1 \Rightarrow \phi _2\) if \(\phi _1 \Rightarrow \phi _2\) is a tautology. If we also have \(\models \phi _2 \Rightarrow \phi _3\), \(\phi _2\) is called an interpolant between \(\phi _1\) and \(\phi _3\). Roughly, Craig’s interpolation theorem (Craig, 1957) in first order logic states the existence of such an interpolant. We prove that our learning of \(\varphi\) from an input–output pair \(( {\textbf{I}_1} , {\textbf{I}_2} )\) such that \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\) can logically be viewed as the inference of an interpolant \(\varphi\).Footnote 6

Suppose \(( {\textbf{I}_1} , {\textbf{I}_2} )\) is an input–output pair for some n-variable Boolean function f and \(f( {\textbf{I}_1} ) = {\textbf{I}_2}\) holds. We divide the input binary matrix \({\textbf{I}_1} (n \times l)\) into two submatrices \({\textbf{I}_1^{P}} (n \times l_P)\) and \({\textbf{I}_1^N} (n \times l_N)\) where \(l_P+l_N = l\). \({\textbf{I}_1^{P}}\) (resp. \({\textbf{I}_1^N}\)) represents the positive (resp. negative) data: if \({\textbf{x}} \in {\textbf{I}_1^P}\) (resp. \({\textbf{x}} \in {\textbf{I}_1^N}\)), then \(f( {\textbf{x}} ) = 1\) (resp. \(f( {\textbf{x}} ) = 0\)) holds.

We consider \({\textbf{I}_1^{P}}\) as a full DNF, written \(\text {DNF}( {\textbf{I}_1^{P}} )\), in the following way. Let \({\textbf{x}}\) be an interpretation vector in \({\textbf{I}_1}\). Introduce \(\text {conj}( {\textbf{x}} )\) denoting the conjunction \(l_1 \wedge \cdots \wedge l_n\) of literals such that \(l_j = x_j\) if \({\textbf{x}} (j) = 1\), else \(l_j = \lnot x_j\) \((1 \le j \le n)\). For example, if \({\textbf{x}} = [1\; 0\; 1]^{T}\), then \(\text {conj}( {\textbf{x}} ) = x_1 \wedge \lnot x_2 \wedge x_3\). Put \(\text {DNF}( {\textbf{I}_1^{P}} ) = \bigvee _{ {\textbf{x}} \in {\textbf{I}_1^{P}} } \text {conj}( {\textbf{x}} )\) and call it the positive DNF for \(( {\textbf{I}_1} , {\textbf{I}_2} )\). Likewise we define \(\text {DNF}( {\textbf{I}_1^{N}} ) = \bigvee _{ {\textbf{x}} \in {\textbf{I}_1^{N}} } \text {conj}( {\textbf{x}} )\) and call it the negative DNF for \(( {\textbf{I}_1} , {\textbf{I}_2} )\). For simplicity, we equate \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\text {DNF}( {\textbf{I}_1^{N}} )\) respectively with the positive data \({\textbf{I}_1^{P}}\) and the negative data \({\textbf{I}_1^{N}}\).
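As a small illustration (the helper names conj and dnf_of are ours), the positive DNF can be built symbolically from the columns of \({\textbf{I}_1^{P}}\):

```python
import numpy as np

def conj(x):
    """conj(x): the conjunction of n distinct literals fixed by the binary vector x."""
    return " & ".join(f"x{j+1}" if b else f"~x{j+1}" for j, b in enumerate(x))

def dnf_of(I):
    """DNF(I): the full DNF whose disjuncts are conj(x) for each column x of I."""
    return " | ".join(f"({conj(I[:, k])})" for k in range(I.shape[1]))

# two positive interpretation vectors [1 0 1]^T and [0 1 1]^T as columns
I1P = np.array([[1, 0],
                [0, 1],
                [1, 1]])
print(dnf_of(I1P))   # (x1 & ~x2 & x3) | (~x1 & x2 & x3)
```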

Proposition 2

Let \(( {\textbf{I}_1} , {\textbf{I}_2} )\) be an input–output pair for a Boolean function f such that \(f( {\textbf{I}_1} ) = {\textbf{I}_2}\). Also let \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\text {DNF}( {\textbf{I}_1^{N}} )\) respectively be the positive and negative DNF for \(( {\textbf{I}_1} , {\textbf{I}_2} )\). For a DNF \(\varphi\), \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\) if-and-only-if \(\varphi\) is an interpolant between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\lnot \text{ DNF }( {\textbf{I}_1^{N}} )\).

Proof

We first prove the only-if part. Suppose \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\). Let \({\textbf{i}}\) be an interpretation vector over n variables satisfying \(\text {DNF}( {\textbf{I}_1^{P}} )\). It satisfies some disjunct conj(\({\textbf{x}}\)) in \(\text {DNF}( {\textbf{I}_1^P} )\). Since conj(\({\textbf{x}}\)) is a conjunction of n distinct literals, the fact that \({\textbf{i}}\) satisfies conj(\({\textbf{x}}\)) implies \({\textbf{i}} = {\textbf{x}}\) as vectors. On the other hand, we have \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2} = f( {\textbf{I}_1} )\) by assumption and hence \(\varphi ( {\textbf{x}} ) = f( {\textbf{x}} )\) as \({\textbf{x}} \in {\textbf{I}_1^P} \subseteq {\textbf{I}_1}\). We also have \(f( {\textbf{x}} ) = 1\) as \({\textbf{x}} \in {\textbf{I}_1^P}\). Putting the two together, we conclude \(\varphi ( {\textbf{i}} ) = \varphi ( {\textbf{x}} ) = f( {\textbf{x}} ) = 1\). Since \({\textbf{i}}\) is an arbitrary interpretation satisfying \(\text {DNF}( {\textbf{I}_1^{P}} )\) and it satisfies \(\varphi\), \(\models \text {DNF}( {\textbf{I}_1^{P}} ) \Rightarrow \varphi\) is proved. \(\models \varphi \Rightarrow \lnot \text {DNF}( {\textbf{I}_1^{N}} )\) is proved similarly by proving \(\models \text {DNF}( {\textbf{I}_1^{N}} ) \Rightarrow \lnot \varphi\).

To prove the if-part, recall that an interpolant \(\varphi\) satisfies \(\models \text {DNF}( {\textbf{I}_1^{P}} ) \Rightarrow \varphi\) and \(\models \text {DNF}( {\textbf{I}_1^{N}} ) \Rightarrow \lnot \varphi\). So if \({\textbf{x}} \in {\textbf{I}_1^P}\) (resp. \({\textbf{x}} \in {\textbf{I}_1^N}\)), then \(\text {DNF}( {\textbf{I}_1^{P}} )( {\textbf{x}} ) = 1\) and hence \(\varphi ( {\textbf{x}} )=1\) holds (resp. then \(\text {DNF}( {\textbf{I}_1^{N}} )( {\textbf{x}} ) = 1\) and hence \(\varphi ( {\textbf{x}} ) = 0\) holds). In other words, if \({\textbf{x}} \in {\textbf{I}_1^P}\), \(\varphi ( {\textbf{x}} ) = 1 = f( {\textbf{x}} )\) and if \({\textbf{x}} \in {\textbf{I}_1^N}\), \(\varphi ( {\textbf{x}} ) = 0 = f( {\textbf{x}} )\). So we reach \(\varphi ( {\textbf{I}_1} ) = f( {\textbf{I}_1} ) = {\textbf{I}_2}\). \(\square\)

By Proposition 2, we can say that the \(\varphi\) returned by Mat_DNF with learning_error = 0 is an interpolant between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\lnot \textrm{DNF}( {\textbf{I}_1^{N}} )\). We can also say, by combining Propositions 1 and 2, that finding a root of \(\text {J}( {\textbf{C}} , {\textbf{D}} ) = 0\) defined by (3), learning a DNF \(\varphi\) satisfying \(\varphi ( {\textbf{I}_1} ) = {\textbf{I}_2}\), and inferring an interpolant \(\varphi\) between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\lnot \textrm{DNF}( {\textbf{I}_1^{N}} )\) are one and the same thing: they are all equivalent.

The recognition of this equivalence has some interesting consequences. The first is that, from the viewpoint of classification, learning by Mat_DNF consists of learning the feature space of conjunctions \(\widetilde{\textbf{C}}\) and its linear separation by a hyperplane specified by a continuous disjunction \(\widetilde{\textbf{D}}\), as shown in equation (2). Hence it seems possible to modify Mat_DNF so that it can search for a “max-margin interpolant” corresponding to the max-margin hyperplane, which is expected to generalize well. Sharma et al. (2012) already proposed using SVMs to infer interpolants, where the SVM is applied to a predefined feature space. In our “max-margin interpolant” inference, if realized, the feature space itself will be learned by Mat_DNF.

The second one is the possibility of a neural end-to-end refutation prover. Let S be a set of ground clauses. Also let \(S = S_1 \cup S_2\) be any split of S such that \(\text {atom}(S_1) \cap \text {atom}(S_2) \ne \emptyset\) where \(\text {atom}(S_i)\) denotes the set of atoms in \(S_i\) (\(i=1,2\)). It can be proved that S is unsatisfiable if-and-only-if there is an interpolant \(\varphi\) between \(S_1\) and \(\lnot S_2\) (proof omitted as it is out of the scope of this paper (Vizel et al., 2015; McMillan et al., 2018)). We can apply Mat_DNF to infer this \(\varphi\) assuming that \(S_1\) is positive data (\(\varphi\) is true over \(S_1\)) and \(S_2\) is negative data (\(\varphi\) is false over \(S_2\)) respectively.

The third one concerns the generalizability of the DNF \(\varphi\) learned by Mat_DNF. It is observed that \(\varphi\) tends to overgeneralize the positive data \({\textbf{I}_1^{P}}\) in the input data. That is, \(\models \text {DNF}( {\textbf{I}_1^{P}} ) \rightarrow \varphi\) holds, but sometimes the degree of generalization by logical implication, measured by the distance between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\varphi\), is too high, which adversely affects the accuracy of \(\varphi\). Later in Sect. 5.5, we propose a way of controlling the distance between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\varphi\) and show that the accuracy of \(\varphi\) is actually improved.

5 Learning random DNFs

5.1 Performance measures and generalization

First we define some performance measures concerning Mat_DNF to clarify the meaning of generalization. Let f be a target Boolean function in n variables, \({\textbf{I}_0}\) the domain matrix for n variables and \(( {\textbf{I}_1} ,f( {\textbf{I}_1} ))\) (\({\textbf{I}_1} \subseteq {\textbf{I}_0}\)) an input–output pair for f supplied as learning data for Mat_DNF. We introduce the “domain ratio” \(dr = \frac{\mid {\textbf{I}_1} \mid }{\mid {\textbf{I}_0} \mid }\) (\(0 \le dr \le 1\)) where \({\mid {\textbf{I}} \mid }\) denotes the number of interpretation vectors in \({\textbf{I}}\). The domain ratio dr is the relative size of the learning data to the whole domain data. In what follows, purely for convenience, we use dr even when \(dr\cdot {\mid {\textbf{I}_0} \mid }\) is not an integer; in that case, \({\textbf{I}_1}\) contains \(\lfloor {dr\cdot \mid {\textbf{I}_0} \mid } \rfloor\) interpretation vectors of \({\textbf{I}_0}\).

Suppose we have obtained a DNF \(\varphi = ( {\textbf{C}} , {\textbf{D}} )\) with \(\text {learning\_error} = 0\) by running Mat_DNF on \(( {\textbf{I}_1} ,f( {\textbf{I}_1} ))\). Compute \(\varphi ( {\textbf{I}_0} ) = ( {\textbf{D}} (1-\text{ min}_1( {\textbf{C}} [(1- {\textbf{I}_0} ); {\textbf{I}_0} ])))_{\ge 1}\) (see (1)) and \(\text {exact\_error} = \Vert f( {\textbf{I}_0} ) - \varphi ( {\textbf{I}_0} ) \Vert _1\), which is the number of differing bits between \(f( {\textbf{I}_0} )\) and \(\varphi ( {\textbf{I}_0} )\). Introduce acc_DNF, the “exact accuracy” of \(\varphi\), by defining \(\text {acc\_DNF} = \displaystyle {1 - \text {exact\_error}/{2^n} }\). Since learning_error is zero, \(\varphi\) perfectly reproduces \(f( {\textbf{I}_1} )\) and hence it follows that \(\text {acc\_DNF} = dr + (1-dr)\cdot \text {acc\_pred}\), where acc_pred is the prediction accuracy of \(\varphi\) over the unseen domain data \({\textbf{I}_0} {\setminus } {\textbf{I}_1}\) not used for learning. Consequently we have \(\text {acc\_pred} = \displaystyle {(\text {acc\_DNF} - dr)/(1-dr) }\). Thus prediction accuracy and exact accuracy are mutually convertible. Finally we define generalization. Introduce \(\text {acc\_dr} = dr + 0.5 \cdot (1-dr) = 0.5\cdot (1+dr)\), the expected accuracy of a baseline learner that, given learning data with domain ratio dr, completely memorizes the learning data (contributing dr) and makes a random guess on unseen data (contributing \(0.5 \cdot (1-dr)\)). We say generalization occurs when \(\text {acc\_DNF} > \text {acc\_dr} = 0.5\cdot (1+dr)\), or equivalently \(\text {acc\_pred} > 0.5\) holds (because \(\text {acc\_DNF} - \text {acc\_dr} = (1-dr)\cdot (\text {acc\_pred} - 0.5)\)).
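These measures are easy to compute in NumPy; the following sketch uses our own function and argument names.

```python
import numpy as np

def accuracies(C, D, I0, f_I0, l):
    """Performance measures of Sect. 5.1 (a sketch).

    C, D: learned matricized DNF; I0: (n x 2^n) domain matrix; f_I0: target
    outputs over I0; l: number of interpretation vectors used for learning.
    """
    phi_I0 = (D @ (1 - np.minimum(C @ np.vstack([1 - I0, I0]), 1)) >= 1).astype(int)
    exact_error = np.sum(np.abs(f_I0 - phi_I0))
    acc_dnf = 1 - exact_error / I0.shape[1]
    dr = l / I0.shape[1]                            # domain ratio
    acc_pred = (acc_dnf - dr) / (1 - dr) if dr < 1 else 1.0
    acc_dr = 0.5 * (1 + dr)                         # memorization + random-guess baseline
    return acc_dnf, acc_pred, acc_dr                # generalization: acc_pred > 0.5
```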

5.2 Measuring accuracy for random DNFs

We conduct a learning experiment with small random DNFs to examine the learning behavior of Mat_DNF w.r.t. data scarcity controlled by the domain ratio dr and to see how generalization occursFootnote 7.Footnote 8 We first randomly generate a DNF \(\varphi _0\) in \(n = 5\) variables that consists of three disjuncts, each containing at most 5 literals, half of which are negative on average. We also generate a domain matrix \({\textbf{I}_0} (n \times 2^n)\) for \(n = 5\) variables. Next suppose a domain ratio dr is given. For this dr, we generate a binary matrix \({\textbf{I}_1} (n \times l)\) consisting of \(l = 2^n\cdot dr\) interpretation vectors randomly sampled without replacement from \({\textbf{I}_0}\). Then we run Mat_DNF on the learning data \(( {\textbf{I}_1} ,\varphi _0( {\textbf{I}_1} ))\)Footnote 9 and obtain a DNF \(\varphi _1\) that perfectly classifies the learning data, i.e. \(\varphi _1( {\textbf{I}_1} ) = \varphi _0( {\textbf{I}_1} )\), and compute the exact accuracy acc_DNF of \(\varphi _1\). We repeat this process 100 times and obtain the average acc_DNF of \(\varphi _1\) against dr.

By varying \(dr \in \{0.1,\ldots ,1.0\}\), we obtain a curve of exact accuracy w.r.t. dr, denoted acc_DNF in Fig. 1. There acc_dr denotes the expected accuracy of the baseline learner performing only memorization and random guessing. The other two curves, acc_DNF_noise and acc_DNF_over, are explained next. We observe that acc_DNF is always (if only slightly) above acc_dr for all dr's. So this experiment confirms that generalization in our sense actually occurs: the learned DNF does more than pure memorization and random guessing by detecting some logical pattern.

Fig. 1 “Exact accuracy” of the DNF learned with noise-expansion and over-iteration

5.3 Noise-expansion and over-iteration

The acc_DNF_noise and acc_DNF_over curves in Fig. 1 demonstrate that generalization occurs to a greater degree than with acc_DNF, i.e. \(\text {acc\_DNF\_noise} \approx \text {acc\_DNF\_over} > \text {acc\_DNF}\) holds at most dr's. They are obtained by two different operations: \(\text {acc\_DNF\_noise}\) by “noise-expansion” and \(\text {acc\_DNF\_over}\) by “over-iteration”.

The first operation, noise-expansion, means the expansion of an input vector in the learning data \({\textbf{I}_1}\) by a random bit vector. For example, a 5 bit input vector \({\textbf{x}} = [0\,1\,0\,1\,0]^T\) in \({\textbf{I}_1}\) is expanded into a 10 dimensional vector \({\textbf{x}} _\textrm{noise} = [ {\textbf{x}} ;\!\! {\textbf{n}} ] = [0\,1\,0\,1\,0\,1\,0\,0\,1\,1]^T\) by appending a random bit vector \({\textbf{n}} = [1\,0\,0\,1\,1]^T\) to \({\textbf{x}}\). In learning, each \({\textbf{x}}\) in \({\textbf{I}_1}\) is expanded into \({\textbf{x}} _{\textrm{noise}}\) and then used for learning. Although each input vector in \({\textbf{I}_1}\) gets longer (its length doubles) under noise-expansion, the number of input vectors remains the same. Noise-expansion simply gives Mat_DNF the additional task of identifying those variables in an input vector \({\textbf{x}} _{\textrm{noise}}\) that are relevant to the output, thereby causing additional update steps in Algorithm 1. So from the viewpoint of minimizing \(\text {J}\) to zero, the net effect of noise-expansion is to force Mat_DNF to keep searching for another root of \(\text {J}\) even when \(\text {J} = 0\) would already be reached in the original learning task. This point is made clear by comparison with “over-iteration”, explained below.
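Noise-expansion itself is a one-line preprocessing step; a sketch (the helper name noise_expand is ours):

```python
import numpy as np

def noise_expand(I1, seed=0):
    """Append a random bit vector to every input interpretation vector.

    I1 is an (n x l) matrix; the result is (2n x l): the number of training
    examples is unchanged while each input doubles in length.
    """
    rng = np.random.default_rng(seed)
    noise = rng.integers(0, 2, size=I1.shape)
    return np.vstack([I1, noise])

# Mat_DNF is then run on (noise_expand(I1), I2) instead of (I1, I2),
# with conjunctions over 2n variables instead of n.
```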

The second operation, over-iteration, forces Mat_DNF to skip the first root of \(\text {J} = 0\) it finds and to keep learning. Only after some prespecified number of extra steps (for example extra_update = 20 in the case of acc_DNF_over in Fig. 1) have been made is Mat_DNF allowed to return when a root of \(\text {J}\) is found again. Intuitively, this operation has the effect of avoiding a root near the initialization point, which often overfits the learning data, and of exploring a root in the relatively flat landscape of \(\text {J}\). In other words, over-iteration searches for a root of \(\text {J}\) closer to a global minimum such as the target DNF.
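Over-iteration only changes the stopping rule of the learning loop. A sketch of the modification, building on the jacobians sketch above (the counter logic and names are ours):

```python
import numpy as np

def mat_dnf_over(I1, I2, h, extra_update=20, lr=0.1, max_itr=2000, seed=0):
    """Skip the first zero-error DNF and return a later one (over-iteration)."""
    rng = np.random.default_rng(seed)
    n, _ = I1.shape
    Ct = np.sqrt(2 / (h * 2 * n)) * rng.standard_normal((h, 2 * n)) + 0.5
    Dt = np.sqrt(2 / h) * rng.standard_normal(h) + 0.5
    remaining = None                      # extra updates still to perform
    for _ in range(max_itr):
        JC, JD = jacobians(Ct, Dt, I1, I2)
        Ct, Dt = Ct - lr * JC, Dt - lr * JD
        C, D = (Ct >= 0.5).astype(int), (Dt >= 0.5).astype(int)
        pred = (D @ (1 - np.minimum(C @ np.vstack([1 - I1, I1]), 1)) >= 1).astype(int)
        if np.sum(np.abs(pred - I2)) == 0:
            if remaining is None:
                remaining = extra_update  # first root of J found: keep learning
            elif remaining == 0:
                return C, D               # a root found again after the extra updates
        if remaining is not None and remaining > 0:
            remaining -= 1
    return None
```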

Observe that, as the acc_DNF_noise and acc_DNF_over curves in Fig. 1 show, not only do both noise-expansion and over-iteration improve exact accuracy (or equivalently prediction accuracy), but they do so to a similar degree. Hence it seems reasonable to hypothesize that noise-expansion causes over-iteration and over-iteration causes the improvement of exact accuracy.

The result of this experiment also indicates the importance of an intentional choice of a local minimum (choosing a root in our case) which is independently suggested by “flooding” (Ishida et al., 2020) and “grokking” (Power et al., 2021). In flooding, learning is controlled by gradient descent and ascent to keep training error small but non-zero. In grokking, learning is continued even after learning accuracy is saturated, and then test accuracy suddenly rises to a high level. Our over-iteration has a similar effect of moving around local minima in a flat loss landscape, leading to better generalization.

5.4 The logical relations and over-iteration

When the learning target is a DNF \(\varphi _0\), we naturally ask the logical question of whether the consequence relation and the equivalence relation between \(\varphi _0\) and a learned DNF \(\varphi\) hold or not. We are also interested in the distance between themFootnote 10 because we expect \(\varphi\) to be logically related to \(\varphi _0\) when \(\varphi\) is close to \(\varphi _0\). So, for a 5-variable DNF \(\varphi _0\) generated as in the previous section, we estimate the probability p_conseq of \(\varphi\) being a logical consequence of \(\varphi _0\), i.e. \(\models \varphi _0 \Rightarrow \varphi\) in notation, and the probability p_equiv of \(\varphi\) being logically equivalent to \(\varphi _0\), i.e. \(\models \varphi _0 \Leftrightarrow \varphi\), together with the average distance between \(\varphi _0\) and \(\varphi\), by running Mat_DNF 100 timesFootnote 11, counting the number of runs in which these logical relations hold and computing the average distance. We obtain Table 1.

Table 1 Domain ratio dr, distance and the probability of logical consequence and equivalence

In Table 1, distance_itr is the same measure as distance, i.e. the distance between the target DNF \(\varphi _0\) and a learned DNF \(\varphi\), but obtained with over-iteration (extra_update = 60). The same applies to p_equiv_itr and p_equiv.

First, we can see in the table that more data gives a more exact solution. That is, the distance between the target DNF \(\varphi _0\) and a learned DNF \(\varphi\) monotonically decreases as dr gets closer to 1. Furthermore, the effect of over-iteration is clearly visible: it brings the learned DNF much closer to the target DNF, from 7.5 to 4.2 at \(dr = 0.5\) for example. In other words, over-iteration chooses a root of the cost function \(\text {J}\) near the target \(\varphi _0\).

Concerning the logical relations, observe that p_conseq and p_equiv in Table 1 more or less monotonically increase as dr increases. So again, more data gives a bigger chance of the logical relationship holding. Second, observe that p_conseq, the probability of \(\models \varphi _0 \Rightarrow \varphi\), is rather high through all dr's but is lowered considerably by over-iteration. Third, over-iteration has the opposite effect on p_equiv, the probability of \(\models \varphi _0 \Leftrightarrow \varphi\): it greatly improves the chance of \(\models \varphi _0 \Leftrightarrow \varphi\) once dr exceeds 0.5. For example, p_equiv jumps from 0.02 to 0.19 at \(dr = 0.7\) and from 0.20 to 0.55 at \(dr = 0.9\) (see the bold figures in Table 1). This positive effect of over-iteration on p_equiv becomes critical when applying Mat_DNF to Boolean network learning, because the primary purpose of our Boolean network learning is to recover the original DNFs in the target Boolean network, and over-iteration enhances the chance of discovering such DNFs.

5.5 Controlling logical generalization

Over-iteration wanders through the search space looking for a better local minimum. Here we introduce another, more proactive approach for the same purpose, based on Proposition 2 in Sect. 4. This approach has a sense of search direction, away from the negative data and toward the positive data, thus making it possible to control the degree of generalization of the learned DNF.

Let \(\varphi _0\) be a target DNF, \({\textbf{I}_0}\) the domain of \(\varphi _0\), and \(( {\textbf{I}_1} , {\textbf{I}_2} )\) an input–output pair for learning where \({\textbf{I}_1} \subseteq {\textbf{I}_0}\) and \({\textbf{I}_2} = \varphi _0( {\textbf{I}_1} )\). Also let \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\text {DNF}( {\textbf{I}_1^{N}} )\) respectively be the positive and negative DNF for \(( {\textbf{I}_1} , {\textbf{I}_2} )\) introduced in Sect. 4, associated with the positive data \({\textbf{I}_1^{P}}\) and the negative data \({\textbf{I}_1^{N}}\) in \({\textbf{I}_1}\).

Our idea is based on the empirical observation that, when learning random DNFs from insufficient data by Mat_DNF, although the target DNF \(\varphi _0\) and the learned DNF \(\varphi\) are both interpolants between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\lnot \text {DNF}( {\textbf{I}_1^{N}} )\) according to Proposition 2, their distances to \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\text {DNF}( {\textbf{I}_1^{N}} )\) often differ greatly. Since the learning data is randomly generated using the target DNF \(\varphi _0\), \(\varphi _0\) is usually located (almost) in the middle between \(\text {DNF}( {\textbf{I}_1^{P}} )\) and \(\text {DNF}( {\textbf{I}_1^{N}} )\) distance-wise. However, the learned \(\varphi\) is observed to be very close to the negative data \(\text {DNF}( {\textbf{I}_1^{N}} )\). In other words, due to the learning bias of Mat_DNF, \(\varphi\) tends to overgeneralize the positive data by yielding disjuncts outside the original positive data \(\text {DNF}( {\textbf{I}_1^{P}} )\).

To combat this overgeneralization of positive data by Mat_DNF, we add a special term \(\text {J}_{int}\) to the cost function \(\text {J}\) to suppress the generation of disjuncts in \(\varphi\). Concretely \(\text {J}_{int}\) is computed as follows.

$$\begin{aligned} {\textbf{I}_0^{P}}= & {} {\textbf{I}_{0}} \setminus {\textbf{I}_1^{P}} \\ \widetilde{\textbf{N}}^{P}= & {} \widetilde{\textbf{C}}[(1- {\textbf{I}_0^{P}} ); {\textbf{I}_0^{P}} ] \\ \widetilde{\textbf{M}}^{P}= & {} 1 - \text{ min}_1( \widetilde{\textbf{N}}^{P}) \\ \text {J}_{int}= & {} \sum \text{ max}_0( \widetilde{\textbf{D}} \widetilde{\textbf{M}}^{P}) \end{aligned}$$

Here \({\textbf{I}_0^{P}}\) is the set of interpretation vectors which, when considered as conjunctions, could be added to \(\text {DNF}( {\textbf{I}_1^{P}} )\) as disjuncts of the learned \(\varphi\). \(\widetilde{\textbf{M}}^{P}\) holds the truth values of the continuous conjunctions represented by \(\widetilde{\textbf{C}}\), and \(\widetilde{\textbf{D}} \widetilde{\textbf{M}}^{P}\) the truth values of the continuous DNF \(( \widetilde{\textbf{C}}, \widetilde{\textbf{D}})\), evaluated by the interpretation vectors in \({\textbf{I}_0^{P}}\). Minimizing \(\text {J}_{int}\) drives the positive elements of \(\widetilde{\textbf{D}} \widetilde{\textbf{M}}^{P}\), picked out by \(\text{ max}_0(\cdot )\), toward zero; since \(\widetilde{\textbf{M}}^{P}\) is non-negative, this pushes the positive elements of \(\widetilde{\textbf{D}}\) toward zero, leading to a small number of disjuncts in the thresholded disjunction \({\textbf{D}}\), i.e. a small number of disjuncts in \(\varphi\).
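A sketch of computing \(\text {J}_{int}\) (the column-wise membership test used to form \({\textbf{I}_0^{P}}\) is our own construction):

```python
import numpy as np

def cost_J_int(Ct, Dt, I0, I1P):
    """Penalty term J_int of Sect. 5.5 (a sketch).

    I0P keeps the interpretation vectors of the domain I0 that are not in the
    positive learning data I1P; continuous disjuncts made true on I0P are penalized.
    """
    in_I1P = (I0[:, :, None] == I1P[:, None, :]).all(axis=0).any(axis=1)
    I0P = I0[:, ~in_I1P]                              # I0 \ I1P, column-wise
    NP = Ct @ np.vstack([1 - I0P, I0P])
    MP = 1 - np.minimum(NP, 1)                        # continuous conjunction values
    return np.sum(np.maximum(Dt @ MP, 0))             # sum of max_0(Dt MP)

# The total cost becomes cost_J(Ct, Dt, I1, I2) + beta * cost_J_int(Ct, Dt, I0, I1P).
```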

We conduct a learning experiment with the 5-variable random DNFs, adding this penalty term to the cost function J in the form \(\beta \cdot \text {J}_{int}\) (\(\beta \ge 0\)) and varying \(\beta\) from 0 to 5.Footnote 12 We choose \(dr = 0.5\) and randomly generate a target DNF \(\varphi _0\) and the learning data \(( {\textbf{I}_1} ,\varphi _0( {\textbf{I}_1} ))\) as in Sect. 5.2. So half of the complete data necessary for identifying the target \(\varphi _0\) is supplied to the learner.

We run Mat_DNF on the learning data until learning error becomes zero and measure the exact accuracy of the learned DNF \(\varphi\) in each learning trial. Table 2 contains figures averaged over 100 trialsFootnote 13.

Table 2 The effect of \(\text {J}_{int}\) on the learned \(\varphi\)

Clearly, as \(\beta\) gets larger (while \(\models \text {DNF}( {\textbf{I}_1^{P}} ) \rightarrow \varphi\) continues to hold), the distance between the positive learning data \(\text {DNF}( {\textbf{I}_1^{P}} )\) and the learned DNF \(\varphi\) monotonically decreases, which verifies that the penalty term \(\text {J}_{int}\) can effectively control the degree of logical implication.

On the other hand, the distance between the target \(\varphi _0\) and the learned \(\varphi\) draws a convex curve w.r.t. \(\beta\), and \(\varphi\) achieves the maximum exact accuracy of 0.824 when dist(\(\varphi _0\),\(\varphi\)) reaches its minimum of 5.6. In other words, we can tune the distance between the target DNF and the learned DNF in vector spaces via the parameter \(\beta\) to obtain better generalization.

6 Learning Boolean networks

We apply Mat_DNF to learning Boolean networks (BNs), introduced by Kauffman (1969), which have been used to model gene regulatory networks in biology. A BN is a biological network whose nodes are genes with \(\{0,1\}\) states, and a state transition (activation of gene expression) of a gene occurs according to a Boolean formula associated with it. The learning task is to infer the Boolean formulas associated with the nodes from state transition data. Due to the general hardness results on learning Boolean formulas (Feldman, 2007), BN learning on a large scale is difficult. We select three BNs of moderate size from the literature for learning: one for the mammalian cell cycle from Fauré et al. (2006), one for the budding yeast cell cycle from Irons (2009) and one for myeloid differentiation from Krumsiek et al. (2011). Learning performance is evaluated in terms of the recovery rate of the original DNFs associated with a BN.

6.1 Learning a mammalian cell cycle BN

In the first learning experiment, we use a synchronous BN for the mammalian cell cycle having 10 nodes (genes) (Fauré et al., 2006), where state transitions occur simultaneously for all genes. A state of the BN is represented by a state vector \({\textbf{x}} \in \{0,1\}^{10}\); each gene_i is described by a Boolean variable \(x_i\) (\(1 \le i \le 10\)) and its state by \({\textbf{x}} (i) \in \{1,0\}\). A state transition of gene_i is controlled by a DNF \(\phi _i\) associated with it, i.e. the next state of gene_i is 1 if \(\phi _i( {\textbf{x}} ) = 1\) and 0 otherwise. We obtain from Fauré et al. (2006) the 10 DNFs associated with the 10 genes. For example, \(\phi _{6} = (\lnot x_1 \wedge \lnot x_4 \wedge \lnot x_5 \wedge \lnot x_{10}) \vee (\lnot x_1 \wedge \lnot x_4 \wedge x_6 \wedge \lnot x_{10}) \vee (\lnot x_1 \wedge \lnot x_5 \wedge x_6 \wedge \lnot x_{10})\) is associated with gene_6.

To see to what degree Mat_DNF can recover the original 10 DNFs, following (Inoue et al., 2014), we consider \(\phi _i\) (\(1 \le i \le 10\)) as a 10-variable Boolean function and prepare as learning data the complete input–output pair \(( {\textbf{I}_0^{(10)}} ,\phi _i( {\textbf{I}_0^{(10)}} ))\) for \(\phi _i\), where \({\textbf{I}_0^{(10)}}\) is the domain matrix for 10 variables containing 1024 interpretation vectors. Then we let Mat_DNF learn a DNF \(\varphi\) from \(( {\textbf{I}_0^{(10)}} ,\phi _i( {\textbf{I}_0^{(10)}} ))\)Footnote 14 and check whether \(\varphi\) is identical to the original \(\phi _i\). The result is encouraging: nine of the original 10 DNFs are successfully recovered (modulo renaming) and the remaining one is logically equivalent to the original DNF.
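As a concrete illustration of this data preparation, the complete input–output pair for gene_6 could be generated as follows (a sketch; phi_6 directly encodes the DNF shown above, and the variable names are ours):

```python
import numpy as np
from itertools import product

def phi_6(x):
    """phi_6 of Faure et al. (2006) as a Boolean function of the 10-gene state x."""
    x1, x4, x5, x6, x10 = x[0], x[3], x[4], x[5], x[9]
    return int((not x1 and not x4 and not x5 and not x10) or
               (not x1 and not x4 and x6 and not x10) or
               (not x1 and not x5 and x6 and not x10))

# domain matrix I0^(10): all 1024 states as columns, and the complete outputs for gene_6
I0_10 = np.array(list(product([0, 1], repeat=10))).T        # shape (10, 1024)
I2 = np.array([phi_6(I0_10[:, k]) for k in range(1024)])    # phi_6(I0^(10))
```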

To understand the origin of this high recovery rate, we pick the DNF \(\phi _{6}\) associated with gene_6 and examine the effect of noise-expansion on it. We consider \(\phi _{6}\) as a 5-variable Boolean function over the domain matrix \({\textbf{I}_0^{(5)}}\) and measure \(\text {acc\_DNF}\) w.r.t. dr. To measure \(\text {acc\_DNF\_noise}\), we append a 5 dimensional random bit vector to each interpretation vector in \({\textbf{I}_0^{(5)}}\). The learning result is shown in Fig. 2, where the figures are averages over 100 trials. We see that noise-expansion improves acc_DNF much more than in the case of Fig. 1. For example, it achieves acc_DNF = 0.817 at dr = 0.1, which means that, on average, given only 3 input–output pairs, Mat_DNF with noise-expansion learns a DNF that correctly predicts 26 of the 32 possible input–output pairs in \(( {\textbf{I}_0^{(5)}} ,\phi _{6}( {\textbf{I}_0^{(5)}} ))\). Such high accuracies plotted in Fig. 2 strongly suggest that noise-expansion helps Mat_DNF find a DNF with high generalizability, or even the original DNF. We can also point out that the big difference in the effect of noise-expansion between Fig. 1 and Fig. 2 might be attributed to the nature of the learning target \(\phi _{6}\), which is not randomly generated but comes from the biological literature.

Fig. 2 “Exact accuracy” of the DNF learned from \(\phi _6\) with noise-expansion

Now look again at the learning experiment with the mammalian cell cycle BN. Note that although \(\phi _6\) is a function of the 5 variables \(\{x_1,x_4,x_5,x_6,x_{10}\}\), it is treated as a function of the 10 variables \(\{x_1,\ldots ,x_{10}\}\) in the experiment. So the remaining 5 variables \(\{x_2,x_3,x_7,x_8,x_9\}\) behave as noise bits in learning, just like in noise-expansion. This implicit noise-expansion happens in the learning of all DNFs \(\{\phi _1,\ldots ,\phi _{10} \}\) because each of them contains at most 6 variables. Moreover, since they are not random DNFs, noise-expansion can be particularly effective, as shown in Fig. 2, and hence it is not unreasonable to assume that Mat_DNF is likely to be able to learn the original DNFs, which explains the high recovery rate of the original DNFs.

Table 3 Examples of DNFs learned from \(\phi _6\)

We conclude this section by looking at DNFs learned from insufficient data, to develop an insight into the syntactic aspects of learned DNFs and their logical relationship to the target DNF. Table 3 lists some DNFs learned from an input–output pair for \(\phi _6\) obtained by applying \(\phi _6\), as a 10-variable function, to \(2^{10} \cdot dr\) interpretation vectors sampled without replacement from the domain matrix \({\textbf{I}_0^{(10)}}\).Footnote 15

In Table 3, for \(dr \in \{1.0, 0.8, 0.5\}\), every dataset used for learning contains 32 different input–output pairs over the five relevant variables, i.e. complete information about \(\phi _6\). That is why all learned DNFs are logically equivalent to \(\phi _6\). At dr = 0.3, the learning data still contains all information on \(\phi _6\). Nonetheless, the learned DNF has extraneous variables not appearing in the original \(\phi _6(x_1,x_4,x_5,x_6,x_{10})\), which destroy the logical equivalence to \(\phi _6\), though it still remains a logical consequence of \(\phi _6\). When dr is further lowered to \(dr = 0.1\), the constraint imposed by the learning data is loosened further. So more conjunctions and extraneous variables are introduced into the learned DNF, and they stop the learned DNF from being either a logical consequence of or logically equivalent to \(\phi _6\).

6.2 Learning a budding yeast cell cycle BN

We conduct the second experiment with a synchronous BN for the budding yeast cell cycle taken from Irons (2009). Since it contains 18 genes (DNFs) and preparing gene expression data is very time-consuming, it is unrealistic to assume the whole domain matrix \({\textbf{I}_0^{(18)}}\), containing \(2^{18} = 262,144\) data points, as learning data for a Boolean formula \(\phi _i\) for gene_i in the BN (Irons, 2009) (\(1 \le i \le 18\)).

We instead randomly generate a set \({\textbf{I}_1^\textrm{rand}}\) of 1,000 state vectors and use \(( {\textbf{I}_1^{\textrm{rand}}} , \phi _i( {\textbf{I}_1^{\textrm{rand}}} ))\) (\(1 \le i \le 18\)) as learning data to learn a DNF for \(\phi _i\).Footnote 16
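For illustration, generating such learning data can be sketched as follows. This is a minimal sketch: make_learning_data is a hypothetical helper, and the formulas \(\phi _i\) from Irons (2009) are assumed to be given as callables rather than reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# I_1^rand: 1,000 randomly generated 18-bit state vectors
# (out of the 2^18 = 262,144 states of the whole domain matrix I_0^(18)).
I1_rand = rng.integers(0, 2, size=(1000, 18))

def make_learning_data(phi_i):
    """Learning data (I_1^rand, phi_i(I_1^rand)) for gene_i.

    phi_i is the Boolean formula for gene_i, given as a callable
    over (m, 18) 0/1 matrices.
    """
    return I1_rand, phi_i(I1_rand)
```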

In this experiment, 17 of the 18 original DNFs are successfully recovered in at most three trials, and the remaining DNF is logically equivalent to the original one. Considering the severe data scarcity, with only \(0.38\%\) (\(1000/2^{18}\)) of the whole data supplied as learning data, this success rate is somewhat surprising. It can, however, again be explained by the effect of implicit noise-expansion, as in the mammalian cell cycle case, because the set of variables relevant to a target gene is surely a proper subset of the 18 variables and the remaining irrelevant ones behave as noise.

6.3 Learning a myeloid differentiation BN

The last example is learning an asynchronous BN with 11 genes for the myeloid differentiation process (Krumsiek et al., 2011). In this “biologically more feasible” BN (Gao et al., 2018), state transition occurs asynchronously: a gene is nondeterministically chosen and the Boolean function (DNF) associated with that gene is applied to the current state to determine the next state of the BN.

Following Gao et al. (2018), we generate learning data for the asynchronous BN by simulating all possible asynchronous state transitions starting from an “early, unstable undifferentiated state, where only GATA-2, C/EBPa, and PU.1 are active” (Krumsiek et al., 2011). This simulation generates 160 distinct, hierarchically layered states containing four point attractors that correspond to four mature blood cells. For each gene, we generate state transition data of size 160 from these states and let Mat_DNF learn it with over-iteration (extra_update = 100). Since a learned DNF varies with initialization, we repeat this asynchronous BN data learning ten times and take the majority of the ten learned DNFs as the learned DNF for the target gene.
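A minimal sketch of how all reachable asynchronous transitions can be enumerated is given below. It is our own reconstruction of the simulation idea, not the data-generation code of Gao et al. (2018), and the helper name asynchronous_transitions is hypothetical.

```python
from collections import deque

def asynchronous_transitions(phis, init_state):
    """Enumerate all asynchronous state transitions reachable from init_state.

    phis is a list of 11 update functions, one per gene; phis[i] maps a state
    tuple to the next 0/1 value of gene i.  At each state, every gene may be
    the one nondeterministically chosen for update.
    """
    seen = {init_state}
    transitions = set()
    queue = deque([init_state])
    while queue:
        s = queue.popleft()
        for i, phi in enumerate(phis):
            t = list(s)
            t[i] = phi(s)            # update only gene i, keep the rest unchanged
            t = tuple(t)
            transitions.add((s, t))
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen, transitions
```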

Out of the 11 DNFs to be recovered, Mat_DNF correctly recovered the original DNFs for 6 genes (Table 4); they are all pure conjunctions. The DNFs for the remaining 5 genes are recovered partially, losing at most three variables from the original ones. We also performed further measurements, described next.

We now compare our results with those of rfBFE (Gao et al., 2018) in more detail. rfBFE is one of the state-of-the-art BN learning algorithms and is a refinement of the BestFit extension algorithm (Lähdesmäki et al., 2003)Footnote 17. Since the purpose of BN learning is to infer the Boolean formulas governing the state transition process, the recovery rate of target Boolean formulas is the most important criterion. From this viewpoint, it is to be noted that, when applied to complete data generated by the synchronous BN, both rfBFE and Mat_DNF recover all 11 original DNFs. However, there is a big difference in execution time. While rfBFE takes only 1.24 s to process 11 complete datasets (\(2^{11}\) data points) for 11 genes according to Gao et al. (2018), Mat_DNF takes 483.1 s, which suggests the need to improve the implementation of Mat_DNF, for example by parallelization.

We also observe differences in terms of the “score”, i.e. the number of genes whose domain (set of regulators) is correctly inferred when the learning data is not complete. We randomly sample m states together with their state transitions and measure scores for \(m = 80, 160\) by running Mat_DNF on the sampled transitions.Footnote 18 We repeat this trial five times and take the average. The results are score = 8.8 for m = 80 and score = 10.6 for m = 160, which are lower than those of rfBFE reported in Gao et al. (2018), namely score = 10.8 for m = 80 and score = 10.9 for m = 160. This may be due to the lack of a special mechanism in Mat_DNF for identifying regulators (the domain).
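Under our reading of this score, namely the count of genes whose regulator set is inferred exactly, it can be computed as follows; the helper below is purely illustrative.

```python
def score(true_regulators, inferred_regulators):
    """Number of genes whose domain (set of regulators) is inferred exactly.

    Both arguments are lists of sets; entry i is the set of variables
    occurring in the (original or learned) Boolean formula for gene i.
    """
    return sum(1 for t, p in zip(true_regulators, inferred_regulators) if t == p)
```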

Table 4 Recovered Boolean formulas for the asynchronous myeloid differentiation BN

For the asynchronous learning data described above, Mat_DNF and rfBFE return the Boolean formulas listed in Table 4.Footnote 19 Table 4 shows that Mat_DNF and rfBFE return exactly the same Boolean formulas except for the gene PU.1, and both successfully recover six original Boolean formulas. Concerning PU.1, however, while Mat_DNF successfully recovers one of the two original disjuncts, rfBFE recovers no original disjunct, or recovers only one of the four original conjuncts if the original formula is read as a CNF. So, as far as the target asynchronous BN (Krumsiek et al., 2011) is concerned, Mat_DNF seems qualitatively competitive with rfBFE, though its learning is considerably slower.

7 Related work

From a logical point of view, Mat_DNF infers a matricized DNF as an interpolant by numerical optimization, and, as far as we know, there is no previous work of the same kind. As Sect. 4 reveals, any interpolant represented by a matricized DNF \(\varphi = ( {\textbf{C}} , {\textbf{D}} )\) between the positive and negative data is translated to a single layer ReLU network described by (2) with network parameters \(( {\textbf{C}} , {\textbf{D}} )\), and vice versa. This mutual translation is expected to foster cross-fertilization between NNs and logic. For example, a logical characterization of interpolants with good generalizability can contribute to designing NNs with high generalizability.

On the optimization side, our approach is categorized as continuous, unconstrained global optimization applied to DNFs instead of CNFs (Gu et al., 1996). What differs from the traditional approaches surveyed in Gu et al. (1996) is Mat_DNF’s cost function, which, for instance, encodes a conjunction as a sum of piecewise multivariate linear terms, unlike those in Gu et al. (1996), which encode a conjunction by a product of functions in one form or another.

Representing Boolean formulas by matrices is an established idea. Theoretically, we can represent any Boolean formula in n variables by a \(2^n \times 2^n\) or \(2n \times 2^n\) matrix (Cheng and Qi, 2010; Kobayashi and Hiraishi, 2014). Our matricized DNF representation can also require a matrix \({\textbf{C}}\) of similar size, for example \(2^{n-1} \times 2n\) to represent the n-parity function. The technique of learning and outputting Boolean formulas represented by matrices has already been applied to learning AND/OR BNs in Sato and Kojima (2021), but with a different purpose: Sato and Kojima (2021) aims at finding useful logical patterns in biological data, whereas the DNFs in this paper are learned to verify or suggest BNs.
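To make the \(2^{n-1} \times 2n\) figure concrete, the following sketch constructs a matricized DNF \(( {\textbf{C}} , {\textbf{D}} )\) for the n-parity function, assuming, as an illustrative convention of ours rather than the paper's exact encoding, that the 2n columns of \({\textbf{C}}\) encode the literals \(x_1,\ldots ,x_n, \lnot x_1,\ldots ,\lnot x_n\).

```python
import numpy as np
from itertools import product

def parity_matricized_dnf(n):
    """Matricized DNF (C, D) for the n-parity function.

    C has one row per odd-parity assignment (2^(n-1) rows) and 2n columns,
    here assumed to encode the literals x_1..x_n followed by ~x_1..~x_n.
    D is a single all-ones row: the disjunction of all conjunctions.
    """
    rows = []
    for bits in product([0, 1], repeat=n):
        if sum(bits) % 2 == 1:                 # minterms on which parity is true
            pos = np.array(bits)               # x_i occurs iff bit i is 1
            neg = 1 - pos                      # ~x_i occurs iff bit i is 0
            rows.append(np.concatenate([pos, neg]))
    C = np.array(rows)                         # shape (2^(n-1), 2n)
    D = np.ones((1, C.shape[0]), dtype=int)
    return C, D

C, D = parity_matricized_dnf(4)                # C.shape == (8, 8), D.shape == (1, 8)
```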

Mat_DNF is a simple neuro-symbolic system that explicitly represents DNFs. From this neuro-symbolic viewpoint, we note that several NNs have been proposed that can learn DNFs (Towell and Shavlik, 1994; Payani and Fekri, 2019; Katzir et al., 2021). However, they all implicitly embed DNFs in their NN architecture. In KBANN-net (Towell and Shavlik, 1994), for example, a conjunction containing k literals is encoded as a neuron represented by a tree with k leaves, each having a link weight \(\omega\) (such as 4) for a positive literal and \(-\omega\) for a negative one, and the neuron is activated when \(k\cdot \omega\) exceeds \(\text {bias} = (k-1/2) \cdot \omega\). In Neural Logic Networks (Payani and Fekri, 2019), conjunctions are represented by a product of linear functions of the form \(1-m(1-x)\) where \(0< m < 1\), embedded in a neural network isomorphically to a DNF. In Net-DNF (Katzir et al., 2021), conjunctions are encoded by a trainable AND function \(\text {AND}( {\textbf{x}} ) = \text {tanh}( ( {\textbf{c}} \bullet L( {\textbf{x}} )^T ) - \Vert {\textbf{c}} \Vert _1 + 1.5)\) where \(L( {\textbf{x}} ) = \text {tanh}( {\textbf{x}} ^TW + {\textbf{b}} )\). As a result, these approaches need an extra process to reconstruct a DNF from the learned parameters.
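For reference, the two soft-conjunction forms quoted above can be sketched directly from their formulas; shapes and parameter handling are simplified, and these are illustrative reimplementations, not the authors' code.

```python
import numpy as np

def nln_conjunction(x, m):
    """Soft conjunction of Neural Logic Networks (Payani and Fekri, 2019):
    the product over inputs of 1 - m_i * (1 - x_i), with membership weights 0 < m_i < 1."""
    return np.prod(1.0 - m * (1.0 - x))

def net_dnf_and(x, W, b, c):
    """Trainable AND of Net-DNF (Katzir et al., 2021):
    tanh(c . L(x) - ||c||_1 + 1.5) with L(x) = tanh(x W + b)."""
    L = np.tanh(x @ W + b)
    return np.tanh(c @ L - np.abs(c).sum() + 1.5)
```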

There are logical approaches to BN learning (Inoue et al., 2014; Tourret et al., 2017; Chevalier et al., 2019; Gao et al., 2022). Logically, our work can be considered a matricized version of “learning from interpretation transition” in logic programming, in which a BN is represented by a propositional normal logic program (Inoue et al., 2014; Gao et al., 2022). The most closely related work is NN-LFIT (Tourret et al., 2017), which performs two-stage DNF learning: first, a single layer feed-forward NN is trained on state transition data; then, learned parameters irrelevant to the output are filtered out and DNFs are extracted from the remaining parameters. However, since their performance evaluation is based on the error rate of learned rules, not on the recovery rate of the learned DNFs as ours is, a direct comparison is difficult.

8 Conclusion

We proposed a simple feed-forward neural network, Mat_DNF, for the end-to-end learning of Boolean functions. It learns a Boolean function and outputs a matricized DNF realizing the target function, searching for the DNF as a root of a non-negative cost function by minimizing that cost function to zero. We also established a new connection between neural learning and logical inference by proving the equivalence between DNF learning by Mat_DNF and the inference of logical interpolants between the positive and negative input data. We applied Mat_DNF to learning two synchronous BNs and one asynchronous BN from the biological literature and empirically confirmed the effectiveness of our approach.

In doing so, we introduced the “domain ratio” dr as an indicator of data scarcity and defined generalization w.r.t. dr. By examining the generalizability of DNFs learned from scarce data while varying dr, we discovered that two operations, noise-expansion (expanding input vectors with noise vectors) and over-iteration (continuing learning after the learning error reaches zero), can considerably improve generalizability by shifting the choice of the learned DNF. These two operations explain the high recovery rate of the original DNFs in our BN learning experiments.

Future work includes a reimplementation of Mat_DNF on GPUs, the refinement of noise-expansion and over-iteration, and the pursuit of the idea of a binary classifier as a logical interpolant.