Differentiable learning of matricized DNFs and its application to Boolean networks

Boolean networks (BNs) are well-studied models of genomic regulation in biology where nodes are genes and their state transition is controlled by Boolean functions. We propose to learn Boolean functions as Boolean formulas in disjunctive normal form (DNFs) by an explainable neural network Mat_DNF and apply it to learning BNs. Directly expressing DNFs as a pair of binary matrices, we learn them using a single layer NN by minimizing a logically inspired non-negative cost function to zero. As a result, every parameter in the network has a clear meaning of representing a conjunction or literal in the learned DNF. Also we can prove that learning DNFs by the proposed approach is equivalent to inferring interpolants in logic between the positive and negative data. We applied our approach to learning three literature-curated BNs and confirmed its effectiveness. We also examine how generalization occurs when learning data is scarce. In doing so, we introduce two new operations that can improve accuracy, or equivalently generalizability for scarce data. The first one is to append a noise vector to the input learning vector. The second one is to continue learning even after learning error becomes zero. The first one is explainable by the second one. These two operations help us choose a learnable DNF, i.e., a root of the cost function, to achieve high generalizability.


Introduction
Boolean networks (BNs) are a simple yet effective model of gene regulatory networks where nodes are genes and their state transition is controlled by Boolean functions (Kauffman, 1969). They have been studied mathematically (Cheng and Qi, 2010;Kobayashi and Hiraishi, 2014), logically in AI (Inoue et al., 2014;Tourret et al., 2017;Chevalier et al., 2019;Gao et al., 2022) and from the viewpoint of deep learning (Zhang et al., 2019). Their learning is reduced to learning Boolean functions from a set of input-output pairs and can be carried out for example by the REVEAL algorithm (Liang et al., 1998) or by the BestFit extension algorithm (Lähdesmäki et al., 2003).
In this paper, we propose a new approach to learning Boolean functions. We introduce a simple ReLU neural network (NN) called Mat_DNF that learns Boolean functions and outputs Boolean formulas in disjunctive normal form (DNFs). We represent a DNF by a pair (C, D) of binary matrices where C stands for conjunctions and D for a disjunction. Mat_DNF learns a matricized DNF (C, D) as network parameters from the learning data by minimizing a non-negative cost function J(C, D) to zero. As a result, every network parameter in Mat_DNF has a clear meaning of (potentially) denoting a literal or a conjunction (disjunct) in the learned DNF.
Although there exist several ways to represent Boolean functions such as decision trees (Oliveira and Sangiovanni-Vincentelli, 1993), polynomial threshold functions (Hansen and Podolskii, 2015), Boolean circuits (Malach and Shalev-Shwartz, 2019) and support vector machines (Mixon and Peterson, 2015), we choose DNFs for two reasons: one is explainability and the other is to relate the learning process to logical inference. Explainability is guaranteed as our network parameters directly represent a matricized DNF. Moreover since the learned output is a DNF, exploring the logical relationship between the learning data and the learned DNF becomes possible and we find that the learned DNF is what is called an interpolant in logic (Craig, 1957) interpolating between the positive and negative input data, which uncovers a new connection that connects neural learning to symbolic inference.
Boolean function learning can be either discrete or continuous. One group, such as SAT encoding with integer programming (Kamath et al., 1992) and stochastic local search (Ruckert and Kramer, 2003), works in discrete spaces. The other group uses NNs in continuous spaces, such as simulating Boolean circuits (Malach and Shalev-Shwartz, 2019), Neural Logic Networks (Payani and Fekri, 2019) and Net-DNF (Katzir et al., 2021). Our approach sits between the two. Unlike the former, Mat_DNF is differentiable. Unlike the latter, it explicitly operates on matricized DNFs, which are discrete expressions and are not implicitly embedded in the neural network architecture.
In the context of BN learning, Mat_DNF offers a robust yet explainable end-to-end approach as an alternative to previous ones (Liang et al., 1998;Lähdesmäki et al., 2003;Inoue et al., 2014;Tourret et al., 2017;Gao et al., 2022). Compared to the REVEAL algorithm (Liang et al., 1998) and the BestFit extension algorithm (Lähdesmäki et al., 2003), Mat_DNF imposes no limit on the number of function variables. So if there are 18 genes (Irons, 2009), DNFs in 18 variables are considered. The LF1T algorithm (Inoue et al., 2014) symbolically learns a BN represented as a ground normal logic program from state transitions. Generalization is done by resolution. The NN-LFIT algorithm (Tourret et al., 2017) adopts a two-stage approach that learns features by a feed-forward NN and extracts DNFs from the learned parameters. D-LFIT (Gao et al., 2022) takes a further elaborated approach of combining two neural networks to reduce the search space. By comparison, Mat_DNF is a much simpler single layer NN whose learned parameters directly represent a DNF, so there is no need for post-processing.
To improve the accuracy of the DNF learned from insufficient data, we introduce two operations. The first one is "noise-expansion". It appends a noise vector to the input learning vector. The second one is "over-iteration", which keeps learning even after the learning error becomes zero. Since adding a noise vector causes extra steps of parameter update while moving around local minima of the cost function J, the net effect of the first one is attributable to the second one. The fact that these two operations can considerably improve accuracy means that the choice of a root of the cost function J(C, D) = 0, or more generally the choice of a local minimum, significantly affects accuracy and generalizability.
Finally we confirm the effectiveness of our approach through three learning experiments with literature-curated BNs (Fauré et al., 2006;Irons, 2009;Krumsiek et al., 2011). We applied Mat_DNF to learning data generated from these BNs to see if Mat_DNF can recover the original DNFs in BNs. For the first two synchronous BNs (Fauré et al., 2006;Irons, 2009), the recovery rate is high. By detailed analysis of the learning results, it is suggested that this high recovery rate is due to the effect of over-iteration caused by implicit noise-expansion. However, the third asynchronous BN (Krumsiek et al., 2011;Ribeiro et al., 2021) presents a much more difficult case and only six DNFs are completely recovered out of 11 original DNFs, though this result is comparable to that of rfBFE (Gao et al., 2018), one of the state-of-the-art BN learning algorithms.
Thus our contributions are threefold: first, a proposal of a new approach to the end-to-end learning of Boolean functions by an explainable single layer NN, Mat_DNF, together with its application to BN learning; second, the establishment of the equivalence between neural learning of DNFs by Mat_DNF and symbolic inference of DNFs as interpolants between the positive and negative data; and third, the introduction of two new operations, noise-expansion and over-iteration, that can improve accuracy by shifting the choice of a local minimum.
In what follows, after a preliminary section, we introduce Mat_DNF in Sect. 3. We then prove the relationship between the learning by Mat_DNF and the inference of interpolants in logic in Sect. 4. Section 5 examines the behavior of Mat_DNF w.r.t. insufficient learning data and introduces noise-expansion and over-iteration that improve accuracy. Section 6 reports three BN learning experiments and Sect. 7 discusses related work. Section 8 is the conclusion.

Preliminaries
Throughout this paper, bold italic capital letters such as $A$ stand for matrices and bold italic lower case letters such as $a$ for vectors. We equate a one-dimensional matrix with a vector. The $i$-th element of $a$ is designated by $a(i)$ and the $i,j$-th element of $A$ by $A(i,j)$. Given two $m \times n$ matrices $A$ and $B$, $[A;B]$ represents the $2m \times n$ matrix of $A$ stacked onto $B$. $\|a\|_1 = \sum_i |a(i)|$ denotes the 1-norm of $a$ and $\|A\|_F$ the Frobenius norm of $A$. Let $a$ and $b$ be $n$ dimensional vectors. Then $(a \bullet b)$ stands for their inner product (dot product) and $a \odot b$ their Hadamard product, i.e., $(a \odot b)(i) = a(i)b(i)$ ($1 \le i \le n$). For a scalar $\theta$, $(a)_{\ge\theta}$ denotes the binary vector such that $(a)_{\ge\theta}(i) = 1$ if $a(i) \ge \theta$ and $(a)_{\ge\theta}(i) = 0$ otherwise, and $1 - a$ denotes the vector whose $i$-th element is $1 - a(i)$. These notations naturally extend to matrices like $(A)_{\ge\theta}$ and $1 - A$. $\min_1(x) = \min(x, 1)$ is a function returning the lesser of 1 and $x$. $\min_1(A)$ is the component-wise application of $\min_1(x)$ to $A$. We implicitly assume that all dimensions of vectors and matrices in various expressions are compatible. Let $d_1 \vee \cdots \vee d_h$ be a DNF in $n$ variables. If every disjunct $d_i$ is a conjunction of $n$ distinct literals, the DNF is said to be full. For a set $S$, $|S|$ stands for the number of elements in $S$.

Evaluating matricized DNFs
Consider a DNF $\varphi = (x_1 \wedge x_2) \vee (x_1 \wedge \neg x_3)$ in three variables with two disjuncts, $(x_1 \wedge x_2)$ and $(x_1 \wedge \neg x_3)$. We represent $\varphi$ by a pair $(C, D)$ of binary matrices, where the columns of $C$ correspond to the literals $x_1, x_2, x_3, \neg x_1, \neg x_2, \neg x_3$ in this order:

$$C = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \qquad D = [1\ \ 1]$$

As can be seen, each row of $C$ represents a disjunct (conjunction of literals) of $\varphi$. For example, the first row of $C$ represents the first disjunct $(x_1 \wedge x_2)$ by setting $C(1,1) = C(1,2) = 1$. $D$ on the other hand represents the choice of a conjunction as a disjunct; in the current case, both conjunctions in $C$ are chosen as disjuncts of $\varphi$, as designated by $D = [1\ 1]$. If $D = [1\ 0]$, $\varphi$ will contain only the first disjunct $(x_1 \wedge x_2)$ in $C$. Generally a DNF in $n$ variables with at most $h$ disjuncts is represented by an $h \times 2n$ binary matrix $C$ and a $1 \times h$ binary matrix $D$. By default, we consider a DNF $\varphi$ and its matrix representation $(C, D)$ interchangeable and call $(C, D)$ a matricized DNF.

Now we describe how $\varphi$ is evaluated as a Boolean function $\varphi(x)$ over its domain $I_0 = \{1, 0\}^n$ of bit sequences. Each $x \in I_0$ is equated with a binary column vector called an "interpretation vector" representing an interpretation (assignment) such that a variable $x_j$ ($1 \le j \le n$) is mapped to $x(j) \in \{1, 0\}$. Henceforth for convenience we treat $I_0$ as an $n \times 2^n$ binary matrix packed with the $2^n$ possible interpretation vectors and specifically call it the domain matrix for $n$ variables.
Let $x$ be an interpretation vector in $I_0$. A matricized DNF $\varphi = (C(h \times 2n), D(1 \times h))$ is evaluated by $x$ as follows. First compute a column vector $N = C[(1 - x);x]$. $N(j)$ ($1 \le j \le h$) denotes the number of literals contained in the $j$-th conjunction of $C$ and falsified by $x$, and hence $\min_1(N)(j) = 0$ holds if-and-only-if the $j$-th conjunction is true in $x$. Next compute a column vector $M = 1 - \min_1(N)$, which is the bit inversion of $\min_1(N)$, so that $M(j)$ gives the truth value $\in \{0, 1\}$ of the $j$-th conjunction in $C$. Finally compute a scalar $V = DM$. It denotes the number of disjuncts in $\varphi$ satisfied by $x$. Hence $(V)_{\ge 1} \in \{0, 1\}$ gives the truth value of $\varphi$ evaluated by $x$. Write $x \models \varphi$ when $\varphi$ is true in $x$, i.e. $x$ satisfies $\varphi$; thus $x \models \varphi$ if-and-only-if $(V)_{\ge 1} = 1$. Write $C = [C^P\ C^N]$ where $C^P$ (resp. $C^N$) is the $h \times n$ submatrix representing positive (resp. negative) occurrences of variables in $\varphi$. Then the whole evaluation process is described by one line (1):

$$\varphi(x) = \big(D(1 - \min_1(C[(1 - x);x]))\big)_{\ge 1} \tag{1}$$

$$\varphi(x) = \big(D\,\mathrm{ReLU}((C^P - C^N)x + 1_h - C^P 1_n)\big)_{\ge 1} \tag{2}$$

where $\varphi(x)$ denotes the truth value $\in \{0, 1\}$ of $\varphi$ as a Boolean function evaluated by $x$, and $\mathrm{ReLU}(z) = \max_0(z) = \max(z, 0)$ is applied component-wise. This notation is naturally extended to a set of interpretation vectors like $\varphi(I_0)$. $1_h$ and $1_n$ are all-one vectors of length $h$ and $n$ respectively. We rewrite (1) to (2) using $N = C^P(1_n - x) + C^N x$ and $1 - \min_1(N) = \max_0(1_h - N)$. What the latter tells us is that our evaluation process is exactly a forward pass of a single layer ReLU network consisting of a linear output layer and a hidden layer with a weight matrix $C^P - C^N$ and a bias vector $1_h - C^P 1_n$. We name this ReLU network Mat_DNF. It is a simple NN specialized for DNFs, derived from the evaluation process of a DNF where the disjunction $x \vee y$ is replaced by $\min_1(x + y)$ as in Łukasiewicz's many valued logic.
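The evaluation process above can be checked on the three-variable example. The following is a minimal sketch in plain Python (the function names `eval_dnf`, `eval_relu` and `reference` are illustrative and not part of Mat_DNF); it evaluates the example DNF both by the min-based form (1) and by the ReLU form (2) and compares them with a direct evaluation of the formula.

```python
from itertools import product

# Example from the text: phi = (x1 AND x2) OR (x1 AND NOT x3).
# Columns of C: x1 x2 x3 | not-x1 not-x2 not-x3
C = [[1, 1, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 1]]
D = [1, 1]

def eval_dnf(C, D, x):
    """Form (1): threshold D(1 - min_1(C[(1 - x); x])) at 1."""
    stacked = [1 - b for b in x] + list(x)            # the vector [(1 - x); x]
    N = [sum(c * s for c, s in zip(row, stacked)) for row in C]  # falsified literals
    M = [1 - min(nj, 1) for nj in N]                  # truth value of each conjunction
    V = sum(d * m for d, m in zip(D, M))              # number of satisfied disjuncts
    return 1 if V >= 1 else 0

def eval_relu(C, D, x):
    """Form (2): hidden ReLU layer with weights C^P - C^N and bias 1_h - C^P 1_n."""
    n = len(x)
    V = 0
    for row, d in zip(C, D):
        cp, cn = row[:n], row[n:]
        pre = sum((p - q) * xi for p, q, xi in zip(cp, cn, x)) + 1 - sum(cp)
        V += d * max(pre, 0)
    return 1 if V >= 1 else 0

def reference(x):
    """Direct evaluation of the formula, used only as a cross-check."""
    x1, x2, x3 = x
    return int((x1 and x2) or (x1 and not x3))

table = {x: eval_dnf(C, D, x) for x in product((0, 1), repeat=3)}
```

On this example the two forms agree on all eight interpretations, and three of them satisfy the DNF.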

Learning DNFs by Mat_DNF
By adding a backward pass to the equation (1), Mat_DNF can learn Boolean functions. Here we describe how Mat_DNF learns them. Let $f$ be a target Boolean function in $n$ variables and $I_0 = [x_1 \cdots x_{2^n}]$ the domain matrix for $n$ variables. In learning, we are given a submatrix $I_1(n \times l) = [x_{i_1} \cdots x_{i_l}]$ ($l \le 2^n$) of $I_0$. $I_1$ is mapped by $f$ to a $1 \times l$ row vector $I_2 = f(I_1) = [f(x_{i_1}) \cdots f(x_{i_l})]$. $(I_1, I_2) = (I_1, f(I_1))$ is called an input-output pair for $f$ and $I_1$ its input domain. Learning a DNF here thus means that a learner receives an input-output pair $(I_1, I_2) = (I_1, f(I_1))$ for a target Boolean function $f$ and returns a DNF $\varphi$ such that $\varphi(I_1) = I_2$. Mat_DNF receives $(I_1, I_2)$ and returns a matricized DNF $\varphi$ such that $\varphi(I_1) = I_2$ when it stops with learning error = 0.
Let $\tilde{C}$ and $\tilde{D}$ be real matrices. They are relaxed versions of $C$ and $D$. Introduce $\tilde{V} = \tilde{D}(1 - \min_1(\tilde{C}[(1 - I_1);I_1]))$, the relaxed counterpart of $V$ computed over the input domain $I_1$, together with penalty terms $Y$ and $Z$ for $\tilde{C}$ and $\tilde{D}$. Then define a non-negative cost function $J(\tilde{C},\tilde{D})$ by (3), which sums two data terms and the penalty terms. The first term $(I_2 \bullet (1 - \min_1(\tilde{V})))$ is a non-negative scalar and deals with the case of $f(x_{i_j}) = I_2(i_j) = 1$ ($1 \le j \le l$). Likewise the second term $((1 - I_2) \bullet \max_0(\tilde{V}))$ is non-negative and takes care of the case of $f(x_{i_j}) = I_2(i_j) = 0$. $Y$ and $Z$ are penalty terms to make $\tilde{C}$ and $\tilde{D}$ binary respectively.
Proposition 1 $J(\tilde{C},\tilde{D}) = 0$ if-and-only-if $\tilde{C}$ and $\tilde{D}$ are binary matrices representing a DNF $\varphi$ such that $\varphi(I_1) = I_2$.
Proof We prove the only-if part. The converse is obvious. Suppose $J = J(\tilde{C},\tilde{D}) = 0$. Then every term in (3) is zero. $Y = Z = 0$ immediately implies that $\tilde{C}$ and $\tilde{D}$ are binary. Let $\varphi$ be the DNF represented by them. The first term deals with the case of $I_2(i_j) = 1$. It is a sum of non-negative summands of the form $(1 - \min_1(\tilde{V}(i_j)))$. Hence $J = 0$ implies $\min_1(\tilde{V}(i_j)) = 1$, i.e. $\varphi$ is true in $x_{i_j} \in I_1$ when $I_2(i_j) = 1$. The second term is dual to the first term, dealing with the case of $I_2(i_j) = 0$. Similarly to the first term, we can prove that $\varphi$ is false in $x_{i_j} \in I_1$ when $I_2(i_j) = 0$. By combining the two, we conclude that $\varphi$ gives $I_2$ when evaluated by $I_1$, i.e., $\varphi(I_1) = I_2$. ◻

Learning by Mat_DNF is carried out based on Proposition 1 by minimizing $J$ until $J = 0$ using gradient descent. $\tilde{C}$ and $\tilde{D}$ are iteratively updated by their Jacobians, $JC_a$ for $\tilde{C}$ and $JD_a$ for $\tilde{D}$, for example like $\tilde{C} = \tilde{C} - \alpha \cdot JC_a$ where $\alpha > 0$ is a learning rate. The Jacobians $JC_a$ and $JD_a$ are computed by (4).
These Jacobians are derived as follows. We first derive $JC_a$. Let $\tilde{C}_{pq} = \tilde{C}(p, q)$ be an arbitrary element of $\tilde{C}$. Then $\partial\tilde{C}/\partial\tilde{C}_{pq} = I_{pq}$ where $I_{pq}$ is a zero matrix except for the $p,q$-th element which is 1. We compute the partial derivative of $J$ w.r.t. $\tilde{C}_{pq}$ and, since $p, q$ are arbitrary, obtain $JC_a$. $JD_a$ is obtained in the same way from the partial derivative of $J$ w.r.t. an arbitrary element of $\tilde{D}$. In actual learning, we use an adaptive gradient method Adam (Kingma and Ba, 2015) instead of gradient descent with a constant learning rate.
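The cost function described in this section can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the two data terms follow the description in the text, whereas the exact form of the penalty terms is not recoverable here, so the sketch assumes a squared penalty on $\tilde{C} \odot (1 - \tilde{C})$ and $\tilde{D} \odot (1 - \tilde{D})$. The names `soft_value` and `cost_J` are hypothetical.

```python
from itertools import product

def soft_value(Ct, Dt, x):
    """Relaxed V for one interpretation x: Dt (1 - min_1(Ct [(1 - x); x]))."""
    stacked = [1 - b for b in x] + list(x)
    N = [sum(c * s for c, s in zip(row, stacked)) for row in Ct]
    M = [1 - min(nj, 1) for nj in N]
    return sum(d * m for d, m in zip(Dt, M))

def cost_J(Ct, Dt, inputs, labels):
    """Non-negative cost: two data terms plus an ASSUMED binarizing penalty."""
    data = 0.0
    for x, y in zip(inputs, labels):
        v = soft_value(Ct, Dt, x)
        data += y * (1 - min(v, 1))        # positive data: the value should reach 1
        data += (1 - y) * max(v, 0)        # negative data: the value should be 0
    # Assumption: penalty 1/2 * (||Ct*(1-Ct)||_F^2 + ||Dt*(1-Dt)||_F^2),
    # which vanishes exactly when every entry is 0 or 1.
    pen_C = sum((c * (1 - c)) ** 2 for row in Ct for c in row)
    pen_D = sum((d * (1 - d)) ** 2 for d in Dt)
    return data + 0.5 * (pen_C + pen_D)

# The example DNF (x1 AND x2) OR (x1 AND NOT x3) with complete data.
C = [[1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]]
D = [1, 1]
inputs = list(product((0, 1), repeat=3))
labels = [int((a and b) or (a and not c)) for a, b, c in inputs]
```

Under this assumed penalty, a binary pair (C, D) that reproduces the labels gives a cost of zero, while a mislabeled example or a non-binary entry makes the cost strictly positive, in line with Proposition 1.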

Learning algorithm
Given an input-output pair $(I_1, I_2)$ such that $f(I_1) = I_2$ for the target Boolean function $f$, Mat_DNF returns a matricized DNF $\varphi = (C, D)$ giving $\varphi(I_1) = I_2$, basically by running gradient descent on $J$ until $J = 0$. We however take a practical approach of thresholding $(\tilde{C},\tilde{D})$ to binary $(C, D)$ even before $J = 0$ is reached, assuming that $J$ is small and $\tilde{C}$, $\tilde{D}$ are close to binary matrices. In more detail, the inner q-loop in Algorithm 1 below iteratively updates $(\tilde{C},\tilde{D})$ at most max_itr times while thresholding them optimally to binary $(C, D)$ (lines 6, 7 and 8) and computing learning_error using them. If $\varphi = (C, D)$ achieves learning_error = 0, it exits from the q-loop and p-loop and returns $\varphi$. If learning_error > 0 still holds after max_itr iterations, it restarts the next q-loop with $(\tilde{C},\tilde{D})$ perturbed by (5), where $\Delta_a$ and $\Delta_b$ are matrices of the same size as $\tilde{C}$ and $\tilde{D}$ respectively. They are comprised of elements sampled from the standard normal distribution $N(0, 1)$. The perturbed $\tilde{C}$ and $\tilde{D}$ are used as initial parameters in the next loop (line 16). This perturbation is intended to escape from a local minimum.
Restart is allowed at most max_try times. Note that Mat_DNF possibly fails to achieve learning_error = 0 within the given h, max_itr and max_try, but when Mat_DNF returns a matricized DNF $\varphi = (C, D)$ with learning_error = 0, it is guaranteed that $J(C, D) = 0$ and $\varphi(I_1) = I_2$ hold.
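The control flow described in this section (an inner loop of updates with thresholding, and an outer loop of restarts with perturbation) can be sketched as below. This is only a skeleton under stated assumptions: the callables `update`, `binarize`, `error` and `perturb` are hypothetical stand-ins for the Adam step, the thresholding, the learning error and the perturbation (5), and the toy problem at the bottom replaces DNF learning with a two-element vector so that the block stays self-contained.

```python
def learn_with_restarts(params, update, binarize, error, perturb, max_itr, max_try):
    """Return (binary candidate, restart index p, step index q), or None on failure."""
    for p in range(max_try):                  # outer p-loop: restarts
        for q in range(max_itr):              # inner q-loop: parameter updates
            params = update(params)
            candidate = binarize(params)      # threshold before the cost reaches zero
            if error(candidate) == 0:
                return candidate, p, q
        params = perturb(params)              # try to escape from a local minimum
    return None, max_try, max_itr

# Toy stand-ins (assumptions for illustration only).
TARGET = [1, 0]

def update(v):
    return [a + 0.3 * (t - a) for a, t in zip(v, TARGET)]

def binarize(v):
    return [1 if a >= 0.5 else 0 for a in v]

def error(b):
    return sum(x != t for x, t in zip(b, TARGET))

def perturb(v):
    return [a + 0.1 for a in v]

result = learn_with_restarts([0.0, 0.0], update, binarize, error, perturb,
                             max_itr=10, max_try=3)
```

In the toy run the candidate reaches zero error on the second update of the first try; with `max_itr=1` the first try fails and the perturbed parameters succeed on the second try.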

Learning as logical interpolation: a logical perspective
Here we characterize the learning of DNFs by Mat_DNF from a logical perspective. Write $\models \varphi_1 \Rightarrow \varphi_2$ if $\varphi_1 \Rightarrow \varphi_2$ is a tautology. If we also have $\models \varphi_2 \Rightarrow \varphi_3$, $\varphi_2$ is called an interpolant between $\varphi_1$ and $\varphi_3$. Roughly, Craig's interpolation theorem (Craig, 1957) in first order logic states the existence of such an interpolant. We prove that our learning of $\varphi$ from an input-output pair $(I_1, I_2)$ such that $\varphi(I_1) = I_2$ is logically viewed as an inference of an interpolant $\varphi$.

Suppose $(I_1, I_2)$ is an input-output pair for some $n$-variable Boolean function $f$ and $f(I_1) = I_2$ holds. We divide the input binary matrix $I_1(n \times l)$ into two submatrices $I_1^P(n \times l_P)$ and $I_1^N(n \times l_N)$ where $l_P + l_N = l$. $I_1^P$ (resp. $I_1^N$) represents the positive (resp. negative) data: $x \in I_1^P$ if $f(x) = 1$ and $x \in I_1^N$ if $f(x) = 0$. We consider $I_1^P$ as a full DNF, $\mathrm{DNF}(I_1^P)$ in notation, in the following way. Let $x$ be an interpretation vector in $I_1$. Introduce $\mathrm{conj}(x)$ denoting a conjunction $l_1 \wedge \cdots \wedge l_n$ of literals such that $l_j = x_j$ if $x(j) = 1$ and $l_j = \neg x_j$ otherwise ($1 \le j \le n$). Put $\mathrm{DNF}(I_1^P) = \bigvee_{x \in I_1^P} \mathrm{conj}(x)$ and call it the positive DNF for $(I_1, I_2)$. Likewise put $\mathrm{DNF}(I_1^N) = \bigvee_{x \in I_1^N} \mathrm{conj}(x)$ and call it the negative DNF for $(I_1, I_2)$. For simplicity, we equate $\mathrm{DNF}(I_1^P)$ and $\mathrm{DNF}(I_1^N)$ respectively with the positive data $I_1^P$ and negative data $I_1^N$.
Proposition 2 Let $(I_1, I_2)$ be an input-output pair for a Boolean function $f$ such that $f(I_1) = I_2$. Also let $\mathrm{DNF}(I_1^P)$ and $\mathrm{DNF}(I_1^N)$ respectively be the positive and negative DNF for $(I_1, I_2)$. For a DNF $\varphi$, $\varphi(I_1) = I_2$ if-and-only-if $\varphi$ is an interpolant between $\mathrm{DNF}(I_1^P)$ and $\neg\mathrm{DNF}(I_1^N)$.

Proof
We first prove the only-if part. Suppose $\varphi(I_1) = I_2$. Let $i$ be an interpretation vector over $n$ variables satisfying $\mathrm{DNF}(I_1^P)$. It satisfies some disjunct $\mathrm{conj}(x)$ in $\mathrm{DNF}(I_1^P)$ where $x \in I_1^P$. Since $\mathrm{conj}(x)$ is a conjunction of $n$ distinct literals, the fact that $i$ satisfies $\mathrm{conj}(x)$ implies $i = x$ as vectors. On the other hand, we have $\varphi(I_1) = I_2 = f(I_1)$ by assumption and hence $\varphi(x) = f(x)$. We also have $f(x) = 1$ as $x \in I_1^P$. Putting the two together, we conclude $\varphi(i) = \varphi(x) = 1$, i.e. $i \models \varphi$. Since $i$ is arbitrary, $\models \mathrm{DNF}(I_1^P) \Rightarrow \varphi$. Dually, any $i$ satisfying $\mathrm{DNF}(I_1^N)$ coincides with some $x \in I_1^N$, for which $\varphi(x) = f(x) = 0$, so $\models \mathrm{DNF}(I_1^N) \Rightarrow \neg\varphi$, i.e. $\models \varphi \Rightarrow \neg\mathrm{DNF}(I_1^N)$. For the if part, suppose $\varphi$ is an interpolant between $\mathrm{DNF}(I_1^P)$ and $\neg\mathrm{DNF}(I_1^N)$. Every $x \in I_1^P$ satisfies $\mathrm{conj}(x)$ and hence $\varphi$, so $\varphi(x) = 1 = f(x)$. Every $x \in I_1^N$ satisfies $\mathrm{DNF}(I_1^N)$ and hence falsifies $\varphi$, so $\varphi(x) = 0 = f(x)$. Therefore $\varphi(I_1) = I_2$. ◻

By Proposition 2, we can say that $\varphi$ returned by Mat_DNF with learning_error = 0 is an interpolant between $\mathrm{DNF}(I_1^P)$ and $\neg\mathrm{DNF}(I_1^N)$. We can also say, by combining Propositions 1 and 2, that finding a root of $J(C, D) = 0$ defined by (3), learning a DNF $\varphi$ satisfying $\varphi(I_1) = I_2$ and inferring an interpolant $\varphi$ between $\mathrm{DNF}(I_1^P)$ and $\neg\mathrm{DNF}(I_1^N)$ are one and the same thing: they are all equivalent.
The recognition of this equivalence has some interesting consequences. The first one is that, from the viewpoint of classification, learning by Mat_DNF consists of learning the feature space of conjunctions $C$ and its linear separation by a hyperplane specified by a continuous disjunction $D$, as shown in the equation (2). Hence it seems possible to modify Mat_DNF so that it can search for a "max-margin interpolant" corresponding to the max-margin hyperplane, which is expected to generalize well. Sharma et al. already proposed to use SVM to infer interpolants (Sharma et al., 2012), where SVM is applied to a predefined feature space. In our "max-margin interpolant" inference, if realized, the feature space itself will be learned by Mat_DNF.
The second one is the possibility of a neural end-to-end refutation prover. Let $S$ be a set of ground clauses. Also let $S = S_1 \cup S_2$ be any split of $S$ such that $\mathrm{atom}(S_1) \cap \mathrm{atom}(S_2) \ne \emptyset$ where $\mathrm{atom}(S_i)$ denotes the set of atoms in $S_i$ ($i = 1, 2$). It can be proved that $S$ is unsatisfiable if-and-only-if there is an interpolant $\varphi$ between $S_1$ and $\neg S_2$ (proof omitted as it is out of the scope of this paper (Vizel et al., 2015;McMillan et al., 2018)). We can apply Mat_DNF to infer this $\varphi$, treating $S_1$ as positive data ($\varphi$ is true over $S_1$) and $S_2$ as negative data ($\varphi$ is false over $S_2$).
The third one concerns the generalizability of the DNF $\varphi$ learned by Mat_DNF. It is observed that $\varphi$ tends to overgeneralize the positive data $I_1^P$ in the input data. That is, $\models \mathrm{DNF}(I_1^P) \Rightarrow \varphi$ holds but sometimes the degree of generalization by logical implication, measured by the distance between $\mathrm{DNF}(I_1^P)$ and $\varphi$, is too high, which adversely affects the accuracy of $\varphi$. Later in Sect. 5.5, we propose a way of controlling the distance between $\mathrm{DNF}(I_1^P)$ and $\varphi$ and show that the accuracy of $\varphi$ is actually improved.
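Proposition 2 can be checked by brute force on small examples. The sketch below is an illustration under stated assumptions, not part of Mat_DNF: a DNF is represented as a Python predicate over bit tuples, and the names `dnf_of`, `is_interpolant` and `fits` are hypothetical.

```python
from itertools import product

def dnf_of(data):
    """Full DNF of a data set: conj(x) is satisfied by x and by nothing else."""
    vectors = {tuple(x) for x in data}
    return lambda i: tuple(i) in vectors

def is_interpolant(phi, pos, neg, n):
    """Check |= DNF(pos) => phi and |= phi => not DNF(neg) over all 2^n interpretations."""
    pos_dnf, neg_dnf = dnf_of(pos), dnf_of(neg)
    for i in product((0, 1), repeat=n):
        if pos_dnf(i) and not phi(i):
            return False
        if phi(i) and neg_dnf(i):
            return False
    return True

def fits(phi, pos, neg):
    """phi(I1) = I2: phi is true on all positive and false on all negative data."""
    return all(phi(x) for x in pos) and not any(phi(x) for x in neg)

# phi = (x1 AND x2) OR (x1 AND NOT x3); psi = x1 AND x2 misses one positive example.
phi = lambda x: bool((x[0] and x[1]) or (x[0] and not x[2]))
psi = lambda x: bool(x[0] and x[1])
pos = [(1, 1, 0), (1, 0, 0)]
neg = [(0, 1, 1)]
```

For both formulas the two checks agree: `phi` fits the data and is an interpolant, while `psi` fails on the positive example (1, 0, 0) and is not an interpolant.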

Performance measures and generalization
First we define some performance measures concerning Mat_DNF to clarify the meaning of generalization. Let $f$ be a target Boolean function in $n$ variables, $I_0$ the domain matrix for $n$ variables and $(I_1, f(I_1))$ ($I_1 \subseteq I_0$) an input-output pair for $f$ supplied as learning data for Mat_DNF. We introduce the "domain ratio" $dr = |I_1| / |I_0|$ where $|I|$ denotes the number of interpretation vectors in $I$. The domain ratio $dr$ is the relative size of the learning data to the whole domain data. In what follows, purely for convenience, we use $dr$ even when $dr \cdot |I_0|$ is not an integer. In such a case, it means that $I_1$ contains $\lfloor dr \cdot |I_0| \rfloor$ interpretation vectors of $I_0$.

Measuring accuracy for random DNFs
We conduct a learning experiment with small random DNFs to examine the learning behavior of Mat_DNF w.r.t. data scarcity controlled by the domain ratio $dr$ and see how generalization occurs (footnotes 7 and 8). We first randomly generate a DNF $\varphi_0$ in $n = 5$ variables that consists of three disjuncts, each containing at most 5 literals, half of which are negative on average. We also generate a domain matrix $I_0(n \times 2^n)$ for $n = 5$ variables. Next suppose a domain ratio $dr$ is given. For this $dr$, we generate a binary matrix $I_1(n \times l)$ consisting of $l = 2^n \cdot dr$ interpretation vectors randomly sampled without replacement from $I_0$. Then we run Mat_DNF on the learning data $(I_1, \varphi_0(I_1))$ (footnote 9), obtain a DNF $\varphi_1$ that perfectly classifies the learning data, i.e. $\varphi_1(I_1) = \varphi_0(I_1)$, and compute the exact accuracy acc_DNF of $\varphi_1$. We repeat this process 100 times and obtain the average acc_DNF of $\varphi_1$ against $dr$.
By varying $dr \in \{0.1, \ldots, 1.0\}$, we obtain a curve of exact accuracy w.r.t. $dr$ denoted as acc_DNF in Fig. 1. There acc_dr denotes the expected accuracy of the baseline learner performing only memorization and random guess. The other two curves, acc_DNF_noise and acc_DNF_over, are explained next. We observe that acc_DNF is always, if only slightly, above acc_dr for all $dr$'s. So this experiment confirms that generalization in our sense actually occurs and that the learned DNF does more than pure memorization and random guess by detecting some logical pattern.
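The sampling step of this experiment can be sketched as follows. It is a minimal illustration assuming that interpretation vectors are stored as tuples rather than matrix columns; the helper name `sample_inputs` is hypothetical, and the floor follows the convention stated for non-integer $dr \cdot |I_0|$.

```python
import math
import random
from itertools import product

def sample_inputs(n, dr, rng):
    """Return (I0, I1): the full domain and floor(dr * 2^n) vectors drawn without replacement."""
    domain = list(product((0, 1), repeat=n))   # all 2^n interpretation vectors
    size = math.floor(dr * len(domain))
    return domain, rng.sample(domain, size)

domain, learning_inputs = sample_inputs(5, 0.5, random.Random(0))
```

With n = 5 the domain has 32 vectors, so dr = 0.5 yields 16 distinct learning inputs and dr = 0.1 yields 3.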
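As a rough illustration of the baseline, assume (this is not spelled out in the text) that accuracy is measured over the whole domain $I_0$, that the memorizing learner is correct on the fraction $dr$ of inputs it has seen, and that it guesses uniformly at random on the remaining fraction $1 - dr$. Under these assumptions the expected accuracy is:

```latex
\mathrm{acc\_dr} \;=\; dr \cdot 1 + (1 - dr)\cdot\tfrac{1}{2} \;=\; \frac{1 + dr}{2},
\qquad \text{e.g. } dr = 0.5 \;\Rightarrow\; \mathrm{acc\_dr} = 0.75 .
```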

Noise-expansion and over-iteration
The acc_DNF_noise and acc_DNF_over curves in Fig. 1 demonstrate that generalization occurs to a greater degree than with acc_DNF, i.e. acc_DNF_noise ≈ acc_DNF_over > acc_DNF holds at most $dr$'s. They are obtained by two different operations, acc_DNF_noise by "noise-expansion" and acc_DNF_over by "over-iteration", respectively.
The first operation, noise-expansion, means the expansion of an input vector in the learning data $I_1$ by a random bit vector. For example, a 5 bit input vector $x = [0\ 1\ 0\ 1\ 0]^T$ in $I_1$ is expanded into a 10 dimensional vector $x_{noise} = [x; n] = [0\ 1\ 0\ 1\ 0\ 1\ 0\ 0\ 1\ 1]^T$ by appending a random bit vector $n = [1\ 0\ 0\ 1\ 1]^T$ to $x$. In learning, each $x$ in $I_1$ is expanded into $x_{noise}$ and then used for learning. Although each input vector in $I_1$ gets longer (length doubled) by noise-expansion, the number of input vectors remains the same. It simply means that Mat_DNF has an additional task of identifying those variables in an input vector $x_{noise}$ that are relevant to the output, thereby causing additional update steps in Algorithm 1. So from the viewpoint of minimizing $J$ to zero, the net effect of noise-expansion is to force Mat_DNF to find another root of $J$ even when $J = 0$ is reached in the original learning task. This point is made clear by comparing with "over-iteration" explained below.

Footnote 7. All programs used in this paper are written in GNU Octave 4.2.2 and run on a PC with Intel(R) Core(TM) i7-10700@2.90GHz CPU with 26GB memory. Due to the naive nature of our implementation of Mat_DNF, the experiment scale is small.
Footnote 8. We also implemented Mat_DNF by PyTorch and conducted a learning experiment for the 7-parity function from complete data. We chose the parity function because it is known to be hard to learn. As average over 5 trials, the PyTorch version took 42.9 s (10.5) on Google Colaboratory (GPU) while the Octave version (CPU) took only 9.6 s (11.5). Although the difference may be due to our naive use of PyTorch, it seems likely that our matrix-based implementation is suitable for Octave.
Footnote 9. The learning parameters are set to $\alpha$ = 0.1, max_try = 20, max_itr = 500 and h = 1000.
The second operation, over-iteration, forces Mat_DNF to skip the root of $J = 0$ found first and keep learning. Only after some prespecified extra steps (for example extra_update = 20 in the case of acc_DNF_over in Fig. 1) have been made is Mat_DNF allowed to return when a root of $J$ is found again. Intuitively, this operation has the effect of avoiding a root near the initializing point, which often overfits the learning data, and exploring a root in the relatively flat landscape of $J$. In other words, over-iteration searches for a root of $J$ closer to a global minimum such as the target DNF.
Observe that, as the acc_DNF_noise and acc_DNF_over curves in Fig. 1 show, noise-expansion and over-iteration both improve exact accuracy, or equivalently prediction accuracy, and do so to a similar degree. Hence it seems reasonable to hypothesize that noise-expansion causes over-iteration and over-iteration causes the improvement of exact accuracy.
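Noise-expansion as described above can be sketched in a few lines. The sketch assumes that each input vector receives its own independently drawn noise vector of the same length (the text only says that a random bit vector is appended); `noise_expand` is an illustrative name.

```python
import random

def noise_expand(inputs, rng):
    """Append a random bit vector of equal length to every input vector."""
    expanded = []
    for x in inputs:
        noise = [rng.randint(0, 1) for _ in x]   # fresh random bits for this vector
        expanded.append(list(x) + noise)
    return expanded

inputs = [[0, 1, 0, 1, 0], [1, 1, 0, 0, 1]]
expanded = noise_expand(inputs, random.Random(0))
```

The number of vectors is unchanged, each vector doubles in length, and the original bits are kept as the first half.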
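The stopping rule of over-iteration can be sketched as a small change to an ordinary update loop. This is a control-flow illustration only: `update`, `binarize` and `error` are hypothetical stand-ins for the parameter update, the thresholding and the learning error, and the toy problem below is not a DNF learning task.

```python
def learn_over_iteration(params, update, binarize, error, max_itr, extra_update):
    """Return (candidate, step of first root, step of returned root), or None on failure."""
    first_root = None
    for q in range(max_itr):
        params = update(params)
        candidate = binarize(params)
        if error(candidate) == 0:
            if first_root is None:
                first_root = q                     # remember the first root, keep going
            if q - first_root >= extra_update:     # enough extra steps have been made
                return candidate, first_root, q
    return None, first_root, max_itr

# Toy stand-ins (assumptions for illustration only).
TARGET = [1, 0]

def update(v):
    return [a + 0.3 * (t - a) for a, t in zip(v, TARGET)]

def binarize(v):
    return [1 if a >= 0.5 else 0 for a in v]

def error(b):
    return sum(x != t for x, t in zip(b, TARGET))

over = learn_over_iteration([0.0, 0.0], update, binarize, error, max_itr=50, extra_update=3)
plain = learn_over_iteration([0.0, 0.0], update, binarize, error, max_itr=50, extra_update=0)
```

With `extra_update=0` the loop behaves like ordinary learning and returns at the first root; with `extra_update=3` it returns three steps later, at the next step where the error is zero again.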
The result of this experiment also indicates the importance of an intentional choice of a local minimum (choosing a root in our case) which is independently suggested by "flooding" (Ishida et al., 2020) and "grokking" (Power et al., 2021). In flooding, learning is controlled by gradient descent and ascent to keep training error small but non-zero. In grokking, learning is continued even after learning accuracy is saturated, and then test accuracy

The logical relations and over-iteration
When the learning target is a DNF $\varphi_0$, we naturally ask the logical question of whether the consequence relation and the equivalence relation between $\varphi_0$ and a learned DNF $\varphi$ hold or not. We are also interested in the distance between them because we expect $\varphi$ to be logically related to $\varphi_0$ when $\varphi$ is close to $\varphi_0$. So we estimate the probability p_conseq (resp. p_equiv) of $\varphi$ being a logical consequence of $\varphi_0$, i.e., $\models \varphi_0 \Rightarrow \varphi$ in notation (resp. $\varphi$ being logically equivalent to $\varphi_0$, i.e., $\models \varphi_0 \Leftrightarrow \varphi$), for a 5-variable DNF $\varphi_0$ generated as in the previous section, together with the average distance between $\varphi_0$ and $\varphi$, by running Mat_DNF 100 times, counting the number of runs that make these logical relations hold and computing the average distance. We obtain Table 1.
In Table 1, distance_itr is the same as distance, i.e., the distance between the target DNF $\varphi_0$ and a learned DNF $\varphi$, but obtained by over-iteration with extra_update = 60. The same applies to p_equiv_itr and p_equiv.
First we can recognize in the table that larger data gives us a more exact solution. That is, the distance between the target DNF $\varphi_0$ and a learned DNF $\varphi$ monotonically decreases as $dr$ gets closer to 1. Furthermore the effect of over-iteration is clearly visible. It brings the learned DNF much closer to the target DNF, from 7.5 to 4.2 at $dr = 0.5$ for example. In other words, it chooses a root of the cost function $J$ near the target $\varphi_0$.
Concerning logical relations, first observe that p_conseq and p_equiv in Table 1 more or less monotonically increase as $dr$ increases. So again, larger data gives a bigger chance of the logical relationship. Second, observe that p_conseq, the probability of $\models \varphi_0 \Rightarrow \varphi$, is rather high for all $dr$'s but is lowered considerably by over-iteration. Third, over-iteration has the opposite effect on p_equiv, the probability of $\models \varphi_0 \Leftrightarrow \varphi$. It greatly improves the chance of $\models \varphi_0 \Leftrightarrow \varphi$ for $dr > 0.5$. For example, p_equiv suddenly jumps from 0.02 to 0.19 at $dr = 0.7$ and from 0.20 to 0.55 at $dr = 0.9$ (see bold figures in Table 1). This positive effect of over-iteration on p_equiv becomes critical when applying Mat_DNF to Boolean network learning. This is because the primary purpose of our Boolean network learning is to recover the original DNFs in the target Boolean network, and over-iteration enhances the chance of discovering such DNFs.
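For small n, the logical relations estimated in this section can be decided by comparing truth tables. The sketch below is illustrative: DNFs are given as Python predicates, and the `hamming` function is only one possible distance (the number of interpretations on which the two formulas disagree), which is an assumption since the paper's definition of distance is not reproduced in this section.

```python
from itertools import product

def truth_table(phi, n):
    """Truth values of phi over all 2^n interpretations, in a fixed order."""
    return [int(bool(phi(x))) for x in product((0, 1), repeat=n)]

def is_consequence(t0, t):
    """|= phi0 => phi: phi is true wherever phi0 is true."""
    return all(a <= b for a, b in zip(t0, t))

def is_equivalent(t0, t):
    """|= phi0 <=> phi: identical truth tables."""
    return t0 == t

def hamming(t0, t):
    """ASSUMED distance: number of interpretations on which the formulas disagree."""
    return sum(a != b for a, b in zip(t0, t))

# Target phi0 = x1 AND x2; learned phi = x1 overgeneralizes it.
t0 = truth_table(lambda x: x[0] and x[1], 3)
t = truth_table(lambda x: x[0], 3)
```

Here the learned formula is a logical consequence of the target but is not equivalent to it, and the two disagree on two of the eight interpretations. Averaging such checks over repeated runs gives estimates in the spirit of p_conseq and p_equiv.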

Controlling logical generalization
Over-iteration wanders in the search space for a better local minimum. Here we introduce another, more proactive approach for the same purpose based on Proposition 2 in Sect. 4. This approach has a sense of search direction, away from negative data and toward positive data, thus making it possible to control the degree of generalization of the learned DNF. Let $\varphi_0$ be a target DNF, $I_0$ the domain of $\varphi_0$, and $(I_1, I_2)$ an input-output pair for learning where $I_1 \subseteq I_0$ and $I_2 = \varphi_0(I_1)$. Also let $\mathrm{DNF}(I_1^P)$ and $\mathrm{DNF}(I_1^N)$ respectively be the positive and negative DNF for $(I_1, I_2)$ introduced in Sect. 4, associated with the positive data $I_1^P$ and negative data $I_1^N$ in $I_1$. Our idea is based on the empirical observation that, when learning random DNFs from insufficient data by Mat_DNF, although the target DNF $\varphi_0$ and the learned DNF $\varphi$ are both interpolants between $\mathrm{DNF}(I_1^P)$ and $\neg\mathrm{DNF}(I_1^N)$ according to Proposition 2, their distances to $\mathrm{DNF}(I_1^P)$ and $\mathrm{DNF}(I_1^N)$ often differ greatly. Since the learning data is randomly generated using the target DNF $\varphi_0$, usually $\varphi_0$ is located (almost) in the middle between $\mathrm{DNF}(I_1^P)$ and $\mathrm{DNF}(I_1^N)$ distance-wise. However, it is observed that the learned $\varphi$ is very close to the negative data $\mathrm{DNF}(I_1^N)$. In other words, due to the learning bias of Mat_DNF, $\varphi$ tends to overgeneralize positive data by yielding disjuncts outside the original positive data $\mathrm{DNF}(I_1^P)$. To combat this overgeneralization of positive data by Mat_DNF, we add a special term $J_{int}$ to the cost function $J$ to suppress the generation of disjuncts in $\varphi$. Concretely $J_{int}$ is computed as follows.
Here $I_0^P$ is the set of interpretation vectors which, when considered as conjunctions, can be added to $\mathrm{DNF}(I_1^P)$ as disjuncts in the learned $\varphi$. $M^P$ holds the truth values of the continuous conjunctions represented by $\tilde{C}$, and $\tilde{D}M^P$ holds the truth values of the continuous DNF $(\tilde{C},\tilde{D})$ evaluated by the interpretation vectors in $I_0^P$. Minimizing $J_{int}$ drives the positive elements of $\tilde{D}M^P$, sifted out by $\max_0(\cdot)$, to zero. Since $M^P$ is non-negative, this pushes positive elements in $\tilde{D}$ to zero, leading to a small number of disjuncts in the thresholded disjunction $D$, i.e. a small number of disjuncts in $\varphi$.
We conduct a learning experiment with the 5-variable random DNF with this penalty term $J_{int}$ added to the cost function $J$ in the form of $\lambda \cdot J_{int}$ ($\lambda \ge 0$) while varying $\lambda$ from 0 to 5. We choose $dr = 0.5$ and randomly generate a target DNF $\varphi_0$ and the learning data $(I_1, \varphi_0(I_1))$ as in Sect. 5.2. So half of the complete data necessary for identifying the target $\varphi_0$ is supplied to the learner. We run Mat_DNF on the learning data until the learning error becomes zero and measure the exact accuracy of the learned DNF $\varphi$ in each learning trial. Table 2 contains figures averaged over 100 trials.
Clearly as $\lambda$ gets larger (while $\models \mathrm{DNF}(I_1^P) \Rightarrow \varphi$ continues to hold), the distance between the positive learning data $\mathrm{DNF}(I_1^P)$ and the learned DNF $\varphi$ monotonically decreases, which verifies the effectiveness of the penalty term $J_{int}$ in manipulating the degree of logical implication.
On the other hand, the distance between the target $\varphi_0$ and the learned $\varphi$ draws a convex curve w.r.t. $\lambda$, and the maximum exact accuracy 0.824 is achieved when $\mathrm{dist}(\varphi_0, \varphi)$ is at its minimum, 5.6. In other words, we can change the distance between the target DNF and the learned DNF by a parameter $\lambda$ in vector spaces for better generalization.

Learning Boolean networks
We apply Mat_DNF to learning Boolean networks (BNs), introduced by Kauffman (1969), which have been used to model gene regulatory networks in biology. A BN is a biological network where nodes are genes with {0, 1} states and a state transition (activation of gene expression) of a gene occurs according to a Boolean formula associated with it. The learning task is to infer the Boolean formulas associated with nodes from state transition data. Due to the general hardness results of learning Boolean formulas (Feldman, 2007), BN learning on a large scale is difficult. We select three BNs of moderate size from the literature for learning: one for mammalian cell cycle from Fauré et al. (2006), one for budding yeast cell cycle from Irons (2009) and one for myeloid differentiation from Krumsiek et al. (2011). Learning performance is evaluated in terms of the recovery rate of the original DNFs associated with a BN.

Learning a mammalian cell cycle BN
In the first learning experiment, we use a synchronous BN for the mammalian cell cycle with 10 nodes (genes) (Fauré et al., 2006), where state transitions occur simultaneously for all genes. A state of the BN is represented by a state vector x ∈ {0, 1}^10 and the state of each gene_i is described by a Boolean variable x_i (1 ≤ i ≤ 10) and its
To see to what degree Mat_DNF can recover the original 10 DNFs, following Inoue et al. (2014), we consider the DNF of each gene_i (1 ≤ i ≤ 10) as a 10-variable Boolean function and prepare as learning data a complete input-output pair for it, consisting of the domain matrix I_0^(10) for 10 variables, which contains 1024 interpretation vectors, and the DNF's outputs on I_0^(10). Then we let Mat_DNF learn a DNF from this pair (footnote 14) and check whether the learned DNF is identical to the original one. The result is encouraging. Nine of the original 10 DNFs are successfully recovered (modulo renaming) and the remaining one is logically equivalent to the original DNF.
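The construction of complete learning data and the recovery check described above can be sketched as follows. This is a minimal illustration and not the authors' Octave implementation: the list-of-dictionaries DNF encoding and the helper names `eval_dnf`, `domain_matrix` and `logically_equivalent` are assumptions made for the example. The target is the DNF given in the dr = 1.0 row of Table 3, and `variant` is a hypothetical learned DNF carrying one redundant conjunction.

```python
from itertools import product

# Assumed encoding: a DNF is a list of conjunctions, and a conjunction maps a
# 1-based variable index to the truth value its literal requires.
TARGET = [
    {1: False, 4: False, 5: False, 10: False},
    {1: False, 4: False, 6: True, 10: False},
    {1: False, 5: False, 6: True, 10: False},
]

def eval_dnf(dnf, x):
    """Return True if the 0/1 interpretation vector x satisfies the DNF."""
    return any(all(x[v - 1] == int(val) for v, val in conj.items())
               for conj in dnf)

def domain_matrix(n):
    """All 2**n interpretation vectors over n Boolean variables."""
    return list(product((0, 1), repeat=n))

def logically_equivalent(dnf_a, dnf_b, n):
    """Exhaustive equivalence check over the whole domain."""
    return all(eval_dnf(dnf_a, x) == eval_dnf(dnf_b, x)
               for x in domain_matrix(n))

# Complete input-output pair for a 10-variable function: 1024 rows.
inputs = domain_matrix(10)
outputs = [int(eval_dnf(TARGET, x)) for x in inputs]

# Hypothetical learned DNF: TARGET plus a subsumed conjunction, so it is
# logically equivalent without being syntactically identical.
variant = TARGET + [{1: False, 2: True, 4: False, 5: False, 10: False}]

print(len(inputs), sum(outputs))                  # 1024 128
print(variant == TARGET)                          # False
print(logically_equivalent(TARGET, variant, 10))  # True
```

The sketch only separates syntactic identity from logical equivalence; it does not implement the "modulo renaming" comparison used in the experiment.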
To understand the origin of this high recovery rate, we pick the DNF associated with gene_6 and examine the effect of noise-expansion on it. We consider this DNF as a 5-variable Boolean function over the domain matrix I_0^(5) and measure acc_DNF w.r.t. dr. To measure acc_DNF_noise, we append a 5-dimensional random bit vector to each interpretation vector in I_0^(5). The learning result is shown in Fig. 2, where figures are averages over 100 trials. There we see that the acc_DNF curve shows a large improvement in acc_DNF by noise-expansion compared to the case of Fig. 1. For example, it achieves acc_DNF = 0.817 at dr = 0.1, which means that on average, given only 3 input-output pairs, Mat_DNF learns by noise-expansion a DNF that correctly predicts 26 of the 32 input-output pairs of the target over I_0^(5). The high accuracies plotted in Fig. 2 strongly suggest that noise-expansion helps Mat_DNF find a DNF with high generalizability, or the original DNF. We can also point out that the big difference in the effect of noise-expansion between Fig. 1 and Fig. 2 might be attributed to the nature of the learning target, which is not randomly generated but comes from the biological literature.
Now consider the learning experiment on the mammalian cell cycle BN again. Note that although the DNF of gene_6 is a function of the 5 variables {x_1, x_4, x_5, x_6, x_10}, it is treated as a function of the 10 variables {x_1, …, x_10} in the experiment. So the remaining 5 variables {x_2, x_3, x_7, x_8, x_9} behave as noise bits in learning, just like noise-expansion. This implicit noise-expansion happens in the learning of all 10 DNFs because they contain at most 6 variables each. Moreover, since they are not random DNFs, noise-expansion can be particularly effective, as shown in Fig. 2, and hence it is not unreasonable to assume that Mat_DNF is likely to be able to learn the original DNFs, which explains the high recovery rate of the original DNFs.
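As a rough illustration of noise-expansion and of the arithmetic behind the dr = 0.1 example, the following sketch samples a fraction dr of the 5-variable domain and appends 5 random bits to each sampled vector. The function names, the use of `round` to turn 2^5 × dr into a sample size, and the fixed seed are assumptions of this example, not details confirmed by the text.

```python
import random
from itertools import product

def sample_inputs(n, dr, rng):
    """Sample round(2**n * dr) interpretation vectors without replacement."""
    domain = list(product((0, 1), repeat=n))
    return rng.sample(domain, round(len(domain) * dr))

def noise_expand(vectors, n_noise, rng):
    """Append n_noise random bits to every interpretation vector."""
    return [x + tuple(rng.randint(0, 1) for _ in range(n_noise))
            for x in vectors]

rng = random.Random(0)            # fixed seed, only for reproducibility
inputs = sample_inputs(5, 0.1, rng)
expanded = noise_expand(inputs, 5, rng)

print(len(inputs))                # 3, since round(32 * 0.1) = 3
print(len(expanded[0]))           # 10 = 5 original bits + 5 noise bits
print(round(0.817 * 32))          # 26 correctly predicted pairs out of 32
```

The output labels are computed from the original 5-bit part only, so the appended bits carry no information about the target.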
We conclude this section by looking at DNFs learned from insufficient data to develop an insight into the syntactic aspect of learned DNFs and their logical relationship to the target DNF. Table 3 lists some DNFs learned from input-output pairs for the DNF of gene_6, obtained by applying it as a 10-variable function to 2^10 × dr interpretation vectors sampled without replacement from the domain matrix I_0^(10) (footnote 15). In Table 3, for dr ∈ {1.0, 0.8, 0.5}, every data set used for learning contains 32 different input-output pairs, i.e. contains complete information about the target. That is why all learned DNFs are logically equivalent to it. At dr = 0.3, the learning data still contains all information on the target. Nonetheless the learned DNF has extraneous variables not appearing in the original DNF over (x_1, x_4, x_5, x_6, x_10), which destroy the logical equivalence to the target, though the learned DNF still continues to be a logical consequence of it. When dr is further lowered to dr = 0.1, the constraint imposed by the learning data is loosened further. So more conjunctions and extraneous variables are introduced into the learned DNF and they stop the learned DNF from being either a logical consequence of or logically equivalent to the target.
Footnote 15: Learning parameters are = 0.1, max_try = 20, max_itr = 500 and h = 1000.
Table 3 (dr = 1.0 row): learned DNF (¬x_1 ∧ ¬x_4 ∧ ¬x_5 ∧ ¬x_10) ∨ (¬x_1 ∧ ¬x_4 ∧ x_6 ∧ ¬x_10) ∨ (¬x_1 ∧ ¬x_5 ∧ x_6 ∧ ¬x_10); relation to the original DNF: Identical.
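Because the target depends on only five variables, a sample carries complete information about it exactly when all 2^5 = 32 assignments to those variables occur in the sample. The sketch below counts these assignments for several values of dr. The helper name, the rounding of 2^10 × dr and the seed are assumptions of this example; the counts for dr < 1.0 depend on the random sample and are therefore not annotated.

```python
import random
from itertools import product

RELEVANT = (1, 4, 5, 6, 10)   # variables of the gene_6 DNF, as stated in the text

def distinct_relevant_inputs(dr, rng, n=10, relevant=RELEVANT):
    """Count distinct assignments to the relevant variables in a sample."""
    domain = list(product((0, 1), repeat=n))
    sample = rng.sample(domain, round(len(domain) * dr))
    return len({tuple(x[v - 1] for v in relevant) for x in sample})

rng = random.Random(0)        # fixed seed, only for reproducibility
counts = {dr: distinct_relevant_inputs(dr, rng)
          for dr in (1.0, 0.8, 0.5, 0.3, 0.1)}
for dr, k in counts.items():
    print(dr, k, "of 32")     # dr = 1.0 always gives 32 of 32
```

Since the output is a function of the five relevant bits, the number of distinct relevant inputs equals the number of distinct input-output pairs seen by the learner.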

Learning a budding yeast cell cycle BN
We conduct the second experiment with a synchronous BN for the budding yeast cell cycle taken from Irons (2009). Since it contains 18 genes (DNFs) and preparing gene expression data is very time-consuming, it is unrealistic to assume the whole domain matrix I_0^(18), containing 2^18 = 262,144 data points, as learning data for learning the Boolean formula of each gene_i in the BN (Irons, 2009). We instead randomly generate a set of state vectors I_1^rand of size 1,000 and use I_1^rand together with the outputs of the i-th formula on it (1 ≤ i ≤ 18) as learning data to learn a DNF for gene_i (footnote 16). In this experiment, 17 of the 18 original DNFs are successfully recovered in at most three trials and the remaining DNF is logically equivalent to the original one. Considering the severe data scarcity, in that only 0.38% (1000/2^18) of the whole data is supplied as learning data, this success rate is somewhat surprising, but it can again be explained as the effect of implicit noise-expansion, as in the mammalian cell cycle case, because the set of variables relevant to a target gene is surely a proper subset of the 18 variables and the remaining irrelevant ones would behave as noise.
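The scale of this data scarcity can be checked with a few lines. The sketch draws 1,000 random 18-bit state vectors and computes their share of the full domain; the variable names and the seed are assumptions of this example, and independent draws may repeat a state, which the text does not say is excluded.

```python
import random

N_GENES = 18
N_SAMPLES = 1000

rng = random.Random(0)  # fixed seed, only for reproducibility
states = [tuple(rng.randint(0, 1) for _ in range(N_GENES))
          for _ in range(N_SAMPLES)]

domain_size = 2 ** N_GENES
fraction = N_SAMPLES / domain_size

print(domain_size)          # 262144
print(f"{fraction:.2%}")    # 0.38%
```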

Learning a myeloid differentiation BN
The last example is learning an asynchronous BN with 11 genes for myeloid differentiation process (Krumsiek et al., 2011). In this "biologically more feasible" BN (Gao et al., 2018), state transition occurs asynchronously where a gene is nondeterministically chosen and the Boolean function (DNF) associated with the gene is applied to the current state to decide the next state of the BN.
Following Gao et al. (2018), we generate learning data for the asynchronous BN by simulating all possible asynchronous state transitions starting from an "early, unstable undifferentiated state, where only GATA-2, C/EBPa, and PU.1 are active" (Krumsiek et al., 2011). This simulation generates 160 distinct hierarchically layered states containing four point attractors that correspond to four mature blood cells. For each gene, we generate state transition data of size 160 from these states and let Mat_DNF learn it with over-iteration (extra_update = 100). Since a learned DNF varies with initialization, we repeat this asynchronous BN data learning ten times and consider the majority of the ten learned DNFs as the learned DNF for the target gene.
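The data generation procedure above can be sketched on a toy network. The three update rules below are hypothetical and are not the myeloid model of Krumsiek et al. (2011); the function names and the tie-breaking rule of the majority vote are likewise assumptions of this example. The sketch enumerates all states reachable by single-gene updates, finds point attractors, builds per-gene transition data, and takes a majority vote over repeated runs.

```python
from collections import Counter, deque

# Hypothetical 3-gene rules, NOT the myeloid differentiation model.
RULES = (
    lambda s: s[0],                      # gene 0 keeps its value
    lambda s: int(s[0] and not s[2]),    # gene 1 = gene 0 AND NOT gene 2
    lambda s: int(s[1] or s[2]),         # gene 2 = gene 1 OR gene 2
)

def async_successors(state, rules):
    """States obtained by updating exactly one gene (self-loops dropped)."""
    succ = set()
    for g, rule in enumerate(rules):
        nxt = state[:g] + (int(rule(state)),) + state[g + 1:]
        if nxt != state:
            succ.add(nxt)
    return succ

def reachable_states(initial, rules):
    """Breadth-first search over all asynchronous transitions."""
    seen, queue = {initial}, deque([initial])
    while queue:
        s = queue.popleft()
        for t in async_successors(s, rules):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def majority(candidates):
    """Most frequent item; ties go to the one seen first."""
    return Counter(candidates).most_common(1)[0][0]

states = reachable_states((1, 0, 0), RULES)
attractors = {s for s in states if not async_successors(s, RULES)}
# Per-gene learning data: (current state, next value of that gene).
data = {g: [(s, int(rule(s))) for s in sorted(states)]
        for g, rule in enumerate(RULES)}

print(sorted(states))   # [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
print(attractors)       # {(1, 0, 1)}
print(majority(["x1 & ~x3", "x1", "x1 & ~x3"]))  # x1 & ~x3
```

In the experiment the same kind of enumeration yields 160 states, so each gene receives 160 input-output pairs.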
Out of 11 DNFs to be recovered, Mat_DNF correctly recovered the original DNFs for 6 genes (Table 4). They are all pure conjunctions. DNFs for the remaining 5 genes are recovered partially in such a way that they lost at most three variables from the original ones. We also measured execution time and the regulator-inference score, which are reported in the comparison with rfBFE that follows.
We now compare our results with those of rfBFE (Gao et al., 2018) in more detail. rfBFE is one of the state-of-the-art BN learning algorithms and is a refinement of the Best-Fit extension algorithm (Lähdesmäki et al., 2003) (footnote 17). Since the purpose of BN learning is to infer the Boolean formulas governing the state transition process, the recovery rate of target Boolean formulas is the most important criterion. From this viewpoint, it is to be noted that when applied to complete data generated by a synchronous BN, both rfBFE and Mat_DNF recover all 11 original DNFs. However, there is a big difference in execution time. While rfBFE takes only 1.24 s to process 11 complete datasets (2^11 data points) for 11 genes according to Gao et al. (2018), Mat_DNF takes 483.1 s, which suggests the need to improve the implementation of Mat_DNF, for example by parallel technologies.
We also observe differences in terms of "score", which is the number of genes whose domain (regulators) is correctly inferred when the learning data is not complete. We randomly sample m states and their state transitions and measure scores for m = 80, 160 by running Mat_DNF on the sampled transitions (footnote 18). We repeat this trial five times and take the average. The results are score = 8.8 for m = 80 and score = 10.6 for m = 160, which are lower than those of rfBFE reported in Gao et al. (2018), namely score = 10.8 for m = 80 and score = 10.9 for m = 160. This may be due to the lack of a special mechanism in Mat_DNF for identifying regulators (domain).
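One way to compute such a score is sketched below. It assumes that "correctly inferred" means the set of variables in the learned formula equals the set in the original formula; this reading, the DNF encoding and the two-gene example are assumptions of the sketch and are not taken from Table 4.

```python
def regulators(dnf):
    """Set of variable indices occurring in a DNF."""
    return {v for conj in dnf for v in conj}

def score(learned, original):
    """Number of genes whose regulator set is inferred exactly."""
    return sum(regulators(learned[g]) == regulators(original[g])
               for g in original)

# Hypothetical two-gene example: a conjunction maps a variable index to the
# truth value its literal requires.
original = {"A": [{1: True, 2: False}], "B": [{3: True}]}
learned = {"A": [{1: True, 2: False}], "B": [{3: True, 4: True}]}
print(score(learned, original))   # 1

# Averaging over trials, as done in the text for five trials.
trial_scores = [score(learned, original), score(original, original)]
print(sum(trial_scores) / len(trial_scores))   # 1.5
```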
In the case of the asynchronous learning data described above, Mat_DNF and rfBFE return the Boolean formulas listed in Table 4 (footnote 19). Table 4 shows that Mat_DNF and rfBFE return exactly the same Boolean formulas except for gene PU.1 and both successfully recover six original Boolean formulas. Concerning PU.1, however, while Mat_DNF successfully recovers one of the two original disjuncts, rfBFE recovers no original disjunct or recovers only one of the four original conjuncts (assuming the original one is in CNF). So, as far as the target asynchronous BN (Krumsiek et al., 2011) is concerned, Mat_DNF seems qualitatively competitive with rfBFE, though learning is considerably slower.
Footnote 17: rfBFE is a combination of two algorithms, random forest for feature selection and the BestFit extension algorithm (Lähdesmäki et al., 2003) for Boolean formula discovery.
Footnote 18: Parameters are set to max_try = 10, max_itr = 1000, h = 10000 and over-iteration with extra_update = 20.
Footnote 19: The table format and Boolean formulas learned by rfBFE are borrowed from Gao et al. (2018). Fact denotes the original Boolean formulas. We run Mat_DNF with = 0.005, max_try = 10, max_itr = 1000, h = 4000 and over-iteration (extra_itr = 100).

Related work
From a logical point of view, Mat_DNF infers a matricized DNF as an interpolant by numerical optimization, and there is no previous work of the same kind as far as we know. As Sect. 4 reveals, any interpolant represented by a matricized DNF (C, D) between the positive and negative data is translated to a single layer ReLU network described by (2) with network parameters (C, D), and vice versa. This mutual translation is expected to contribute to cross-fertilization of NNs and logic. For example, a logical characterization of interpolants with good generalizability can contribute to designing NNs with high generalizability.
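The interpolant reading can be made concrete with a small sketch. It assumes one possible matricized encoding: each row of C selects literals in the column order [x_1..x_n, ¬x_1..¬x_n], and D selects the conjunctions that enter the disjunction. The column order and the ReLU-style evaluation are assumptions of this example and are not claimed to coincide with equation (2). On data, a DNF is an interpolant when it is true on every positive input and false on every negative one.

```python
def literal_vector(x):
    """Literals in the assumed order [x_1..x_n, not x_1..not x_n]."""
    return list(x) + [1 - b for b in x]

def relu(z):
    return max(0, z)

def eval_matricized(C, D, x):
    """Truth value (0 or 1) of the DNF (C, D) on interpretation x."""
    lits = literal_vector(x)
    # A conjunction holds iff none of the literals it selects is false.
    conj = [relu(1 - sum(c * (1 - l) for c, l in zip(row, lits))) for row in C]
    # The disjunction holds iff at least one selected conjunction holds.
    return min(1, sum(d * c for d, c in zip(D, conj)))

def is_interpolant(C, D, positives, negatives):
    """True on every positive input and false on every negative one."""
    return (all(eval_matricized(C, D, x) == 1 for x in positives)
            and all(eval_matricized(C, D, x) == 0 for x in negatives))

positives = [(1, 0)]
negatives = [(0, 0), (1, 1)]          # (0, 1) is left unobserved

dnf_a = ([[1, 0, 0, 1]], [1])                       # x1 AND NOT x2
dnf_b = ([[1, 0, 0, 1], [0, 1, 1, 0]], [1, 1])      # XOR of x1 and x2
dnf_c = ([[1, 0, 0, 0]], [1])                       # x1 alone

print(is_interpolant(*dnf_a, positives, negatives))  # True
print(is_interpolant(*dnf_b, positives, negatives))  # True
print(is_interpolant(*dnf_c, positives, negatives))  # False
print(eval_matricized(*dnf_a, (0, 1)), eval_matricized(*dnf_b, (0, 1)))  # 0 1
```

Both `dnf_a` and `dnf_b` fit the data yet disagree on the unobserved input (0, 1), which is why the choice among interpolants matters for generalization.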
On the optimization side, our approach is categorized as continuous, unconstrained global optimization applied to DNFs instead of CNFs (Gu et al., 1996). What differs from the traditional approaches surveyed in Gu et al. (1996) is Mat_DNF's cost function, which, for instance, encodes a conjunction as a sum of piecewise multivariate linear terms, unlike those in Gu et al. (1996), which encode a conjunction by a product of some functions in one form or another.
Representing Boolean formulas by matrices is an established idea. Theoretically we can represent any Boolean formula in n variables in terms of a 2 n × 2 n or 2n × 2 n matrix (Cheng and Qi, 2010; Kobayashi and Hiraishi, 2014). Our matricized DNF representation also requires a matrix C of similar size, for example 2^(n−1) × 2n to represent the n-parity function. The technique of learning and outputting Boolean formulas represented by matrices has already been applied to learning AND/OR BNs in Sato and Kojima (2021), but with different purposes. Sato and Kojima (2021) aims at finding useful logical patterns in biological data, whereas DNFs in this paper are learned to verify or suggest BNs.
Mat_DNF is a simple neuro-symbolic system that explicitly represents DNFs. From this neuro-symbolic viewpoint, we notice that several NNs have been proposed that can learn DNFs (Towell and Shavlik, 1994; Payani and Fekri, 2019; Katzir et al., 2021). However, they all implicitly embed DNFs in their NN architecture. In KBANN-net (Towell and Shavlik, 1994), for example, a conjunction containing k literals is encoded as a neuron represented by a tree with k leaves, each having a link weight (such as 4) for a positive literal and the negated weight for a negative one, and the neuron is activated when k times the weight exceeds the bias, which is (k − 1/2) times the weight. In Neural Logic Networks (Payani and Fekri, 2019), conjunctions are represented by a product of linear functions of the form 1 − m(1 − x) where 0 < m < 1 and are embedded in a neural network isomorphically to a DNF. In Net-DNF (Katzir et al., 2021), a trainable AND function, AND(x) = tanh((c · L(x)^T) − ‖c‖_1 + 1.5) where L(x) = tanh(x^T W + b), is used to encode conjunctions. As a result, they need an extra process to reconstruct a DNF from the learned parameters.
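To make the contrast with these implicit encodings concrete, the two formulas quoted above can be evaluated directly. The sketch below is only a numerical illustration of those expressions: the function names are hypothetical, the membership weights m are fixed by hand rather than learned, and the Net-DNF literals L(x) are passed in as given values in [−1, 1] instead of being computed from tanh(x^T W + b).

```python
import math

def nln_conjunction(x, m):
    """Product of 1 - m_i * (1 - x_i), the Neural Logic Network form."""
    out = 1.0
    for xi, mi in zip(x, m):
        out *= 1.0 - mi * (1.0 - xi)
    return out

def netdnf_and(literals, c):
    """tanh(c . L - ||c||_1 + 1.5), the Net-DNF soft AND."""
    dot = sum(ci * li for ci, li in zip(c, literals))
    l1 = sum(abs(ci) for ci in c)
    return math.tanh(dot - l1 + 1.5)

# m close to 1 includes a variable in the conjunction, m close to 0 drops it.
m = [0.99, 0.01, 0.99]                      # roughly x1 AND x3
print(nln_conjunction([1, 0, 1], m))        # close to 1
print(nln_conjunction([0, 0, 1], m))        # close to 0

# c selects the first two literals; +1 encodes true and -1 encodes false.
c = [1, 1, 0]
print(netdnf_and([1, 1, -1], c) > 0)        # True: both selected literals hold
print(netdnf_and([1, -1, -1], c) > 0)       # False: one selected literal fails
```

In both cases the conjunction has to be read off from real-valued parameters (m or c), which is the extra reconstruction step mentioned above.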
There are logical approaches to BN learning (Inoue et al., 2014;Tourret et al., 2017;Chevalier et al., 2019;Gao et al., 2022). Logically our work can be considered as a matricized version of "learning from interpretation transition" in logic programming in which a BN is represented by a propositional normal logic program (Inoue et al., 2014;Gao et al., 2022). The most related work is NN-LFIT proposed by Tourret et al. (2017) which performs two-stage DNF learning. First a single layer feed-forward NN is trained by state transition data. Then learned parameters irrelevant to the output are filtered out and DNFs are extracted from the remaining parameters. However since their performance evaluation is based on error rate of learned rules, not recovery rate of the learned DNFs like ours, direct comparison is difficult.

Conclusion
We proposed a simple feed-forward neural network Mat_DNF for the end-to-end learning of Boolean functions. It learns a Boolean function and outputs a matricized DNF realizing the target function. It searches for a DNF as a root of a non-negative cost function by minimizing the cost function to zero. We also established a new connection between neural learning and logical inference. We proved the equivalence between DNF learning by Mat_DNF and the inference of interpolants in logic between the positive and negative input data. We applied Mat_DNF to learning two synchronous BNs and one asynchronous BN from biological literature and empirically confirmed the effectiveness of our approach.
While doing so, we introduced the "domain ratio" dr as an indicator of data scarcity and defined generalization w.r.t. dr. By examining the generalizability of DNFs learned from scarce data while varying dr, we discovered that two operations, noise-expansion (expanding input vectors with noise vectors) and over-iteration (continuing learning after the learning error reaches zero), can considerably improve generalizability by shifting the choice of a learned DNF. These two operations explain the high recovery rate of original DNFs in our BN learning experiments.
Future work includes a reimplementation of Mat_DNF on GPUs, the refinement of noise-expansion and over-iteration, and pursuing the idea of a binary classifier as a logical interpolant.
Authors' contributions
TS is a major contributor in writing the manuscript. KI assisted in preparing the manuscript and with financial support. All authors read and approved the final manuscript.
Funding
This work is supported by JSPS KAKENHI Grant Number JP21H04905 and JST CREST Grant Number JPMJCR22D3.

Availability of data and material
Not applicable.
Code availability
Mat_DNF is available upon request as an Octave program.

Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Ethics approval
Not applicable.

Consent for publication
Not applicable.
Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.