1 Introduction

State-of-the-art deep neural networks, while powerful, face inherent limitations like the need for vast training examples and a deficiency in generalisation and interpretability (Chollet, 2019). Inductive logic programming (ILP) offers a promising alternative with over three decades of history, emphasising data-efficient methods to learn first-order logic rules, which has the potential for greater generalisability (d’Avila Garcez et al., 2019; Muggleton & De Raedt, 1994). However, the ILP approach is not without its challenges, such as handling non-linearity and continuous properties, the computational intensity of predicate rule generation, and dependence on expert-defined background knowledge (Cropper & Dumančic, 2020).

The recent surge in neuro-symbolic ILP has aimed to establish hybrid frameworks that address these challenges, enabling features like noise handling, recursion-based generalisation, and supporting predicate invention. Such innovations have paved the way for research combining deep learning with relational ILP models, which not only enhances interpretability but also positions decision boundaries using logical rules (Evans & Grefenstette, 2017; Payani & Fekri, 2019; Yang & Song, 2019). However, the intricacy of addressing non-linearity in continuous domains remains an underexplored area.

Our work takes on this challenge, aiming to design a neuro-symbolic ILP framework tailored to modelling non-linearity and continuous properties. Using differentiable Neural Logic (dNL) networks as a foundation (Payani & Fekri, 2019), we focus on extracting non-linear rules in mixed discrete-continuous spaces; more concretely, we aim to design an ILP-based framework for learning in a mixed discrete-continuous space for the purpose of non-linear function extraction. This task necessitates adapting current systems to efficiently derive non-linear rules from data generated by non-linear functions specified in advance.

Understanding the significance of our proposed framework becomes clearer when considering its application to real-world problems. Take, for instance, the domain of physics: suppose we gathered data from a device measuring the relationship \(E = m \times c^2\), the iconic mass-energy equation. The extended capabilities of ILP, with its focus on non-linear continuous predicates, could intuitively model such a relationship. This is accomplished by teaching the system to recognise operations like squaring a variable (\(c^2\)) and multiplying variables (\(m \times c^2\)). With this, we are not just adding to the toolset of data analysis; we are providing a path for data-efficient neuro-symbolic methods that can further scientific discoveries by extracting and understanding non-linear equations.

We introduce a three-step methodology. The first step discretises the continuous data and learns per-variable transformations using differentiable Neural Logic Non-Linear (dNL-NL) networks, repurposed from the standard dNL; each of these networks defines transformations using the provided non-linear predicates. The second step uses a separate dNL-NL network to represent operational predicates such as addition or multiplication, extracting operations between individual features from an augmented dataset transformed by the first module's non-linear predicates. The first and second modules are each end-to-end differentiable, but the overall pipeline is not. In the third step, to ensure accurate non-linear function extraction, we compile a non-linear function from the extracted rules and calculate the true loss on the continuous target. The approach as a whole centres on extracting non-linear functions from mixed-domain data.

While the current study focuses on supervised learning, future endeavours might extend this proposal to dynamic contexts, such as discerning state dynamics within reinforcement learning environments. Our findings showcase our architecture’s ability to retrieve non-linear functions from continuous data where clausal rules describe the function. This venture unveils a unique direction in neural-symbolic research, suggesting that ILP-based frameworks are adaptable to continuous data, addressing regression challenges. To our knowledge, this proposed work is the first to focus on the differentiable logical learning of non-linear functions.

2 Related work

Logic programs, by their very nature, offer a high degree of interpretability. As models become increasingly complex, there is a growing need in many domains for models that can be understood, validated, and trusted by humans (Barredo Arrieta et al., 2020). Logic programs allow for the symbolic representation of knowledge, which can be useful in scenarios where domain knowledge needs to be incorporated or where explanations are required in symbolic form (Hitzler & Sarker, 2022). While neural networks and kernel machines are powerful function approximators, they often act as black-box models. The trade-off between accuracy and interpretability is a well-known challenge (Barredo Arrieta et al., 2020). Our approach seeks to bridge this gap by providing a mechanism to learn non-linear relationships in data while retaining the interpretability of symbolic methods.

The foundational principles of ILP, especially as detailed in Muggleton and de Raedt (1994), played a significant role in guiding this research. Classic ILP systems excel at learning logic-based rules and first-order logical predicates from structured, discrete, and relational data. However, they struggle when faced with non-linear predicates in continuous domains. Several classical ILP systems, including FOIL (Quinlan, 1990), Progol (Muggleton, 1995), Claudien (De Raedt & Dehaspe, 1997), Aleph (Srinivasan, 2001), XHAIL (Ray, 2009), and Atom (Ahlgren & Yuen, 2013), have showcased their proficiency in handling noisy data and addressing problems within infinite domains. While they are adept at processing certain recursive rules, they do not consistently deliver optimal results and lack support for predicate invention. To illustrate a specific challenge, Aleph’s heuristic search, with its greedy nature, runs the perpetual risk of entrapment in local optima. Notably, these systems harness the set covering algorithm, learning hypotheses one clause at a time.

Another approach to note is Metagol (Muggleton et al., 2018), which, in contrast to its predecessors, boasts capabilities in predicate invention, recursion handling, and producing optimal programs across infinite domains. One disadvantage of Metagol is its inability to manage noise. Newer systems, such as Popper (Cropper & Morel, 2021a) and Poppi (Cropper & Morel, 2021b), which builds upon Popper to enable automatic predicate invention, fill this gap. They adeptly handle noise, predicate invention, recursion, and can craft optimal programs within infinite domains (Cropper & Morel, 2021a, b).

ILP systems are inherently designed for symbolic representations, specifically for relational data in discrete domains. They are built to identify patterns using logical predicates. This design makes them unsuitable for continuous domains, which rely on numerical values and mathematical operations, differing greatly from ILP’s logic-based foundation. Tackling non-linear predicates in these domains is challenging due to the complex mathematical computations and the vast function space it requires to search, which is computationally intensive.

Early ILP research was primarily focused on reasoning within discrete spaces, with notable contributions from Chavira and Darwiche (2008); Kersting et al. (2000); Richardson and Domingos (2006); Kimmig et al. (2012). A shift towards exploring mixed discrete-continuous spaces was evident in works such as Speichert and Belle (2018), where piecewise polynomials were harnessed to model continuous distributions, subsequently informing target predicate definitions. In a parallel vein, Nitti et al. (2016) proposed a probabilistic approach, leveraging Gaussian base atoms to derive rules.

Non-linear predicates present clear challenges in traditional ILP systems. This has led to growing interest in neuro-symbolic methods for better modelling in continuous domains. Our proposed method introduces a new ILP technique, learning from magic values with lazy evaluation. A magic value in a program refers to a constant symbol vital for the program’s proper execution, even if its selection lacks a clear rationale (Hocquette & Cropper, 2023). In our system, the lower and upper bound weights (detailed in Sect. 4.1) for the non-linear and operation predicates serve as these magic values. While these bounds ultimately become constant symbols in predicate definitions post-training, they start as trainable parameters. Lazy evaluation, as adopted by Aleph for constant refinement, is also utilised in our method (Srinivasan & Camacho, 1999). Aleph refines bottom clauses by seeking variable substitutions and executing a partial hypothesis on both positive and negative examples. Essentially, rather than exploring all constant symbols, lazy evaluation focuses solely on symbols derived from the examples. Similarly, we evaluate our rules using evolving lower/upper bounded weights and predicate membership weights, making our approach aligned with the principles of lazy evaluation. Our work expands ILP’s learning capabilities to cover large and unbounded domains, and our tests show its advantages over previous systems.

Recent advancements in the ILP field have embraced a neuro-symbolic approach, as highlighted by Evans and Grefenstette (2017). This study introduced dILP, a novel neural ILP solver with a differentiable architecture for deduction. The emergence of such neuro-symbolic ILP systems has spurred a trend in benchmarking ILP based on various features, including noise resilience, compatibility with infinite domains, recursion, and predicate invention. As this field matures, richer feature sets will be introduced, setting the stage for more nuanced evaluations and progress benchmarking. Neuro-symbolic ILP methods, while presenting numerous enhancements, have not prioritized learning non-linear predicates in continuous domains. In contributing to this narrative, our work introduces the modelling of non-linearity in continuous or mixed domains, broadening the comparative landscape.

Neuro-symbolic ILP has seen various advancements, with several noteworthy contributions pushing the boundaries of the field. Shindo et al. employ differentiable logic modules that softly compose logic programs. Instead of utilizing MLPs, they manage multiple clauses with function symbols to enhance interpretability. Additionally, they incorporate predicate operations such as negation and preservation, which bolster flexibility (Shindo et al., 2021). This method’s differentiable aspects, such as tensor encoding and inference, function on discrete logic symbols and their respective truth values. Sen et al. expanded upon logical neural networks to derive rules in first-order logic. They demonstrate the joint learning of rules and logical connectives (Sen et al., 2021). The flexibility of their learning algorithm accommodates various linear inequality and equality constraints. Owing to adaptable parametrisation, their approach outperforms others on multiple benchmarks. Krishnan et al. build upon the differentiable ILP framework introduced by Evans et al. Their primary objective is to learn recursive programs and conduct predicate invention with stratified and safe negation (Krishnan et al., 2021).

Addressing the learning challenge in our work, we acknowledge parallels in other research that prioritise distinct component learning for data modelling, such as the approach detailed in Duvenaud et al. (2013). They innovatively redefine kernel learning as a structure discovery challenge, automating kernel form choices and composing kernel structures using base components. This approach offers an expressive modelling language, capturing widely-used kernel construction methods. Their emphasis on Gaussian process regression with the kernel as a covariance function leverages the Bayesian framework for streamlined structure discovery, using marginal likelihood for evaluation.

In this initial study, the focus has been on establishing the dNL-NL model as a foundational approach for extracting non-linear equations from continuous data within a regression framework. This work serves as a proof of concept, laying the groundwork for future research which will aim to scale the model and undertake comprehensive comparative evaluations with other neuro-symbolic approaches to further validate its applicability and effectiveness. Given that, we note other models which could be considered in evaluating future iterations of the dNL-NL architecture in program synthesis with continuous domains. TerpreT, a probabilistic programming language, integrates neural networks with traditional search techniques in Inductive Program Synthesis (IPS) (Gaunt et al., 2016). It uses models inspired by compiler representations, which are trainable via gradient descent, enabling the handling of complex control flows and external storage interactions. TerpreT’s unique architecture supports defining execution models like Turing Machines, using parameterized programs and interpreters. It accommodates a variety of back-end inference algorithms, facilitating the synthesis of interpretable source code with intricate control structures. This setup not only aids in learning complex programs but also permits comparisons across different inference techniques and representations.

Building on this foundation, DeepCoder emerges as an innovative IPS method, harnessing neural networks to decode patterns in problem descriptions for guiding search-based synthesis (Balog et al., 2017). DeepCoder redefines IPS as a big data challenge, training extensively on IPS problems. Its framework establishes a versatile programming language, easily predictable from input–output examples, and devises models to link these examples to program attributes. This leads to considerable speed enhancements in program synthesis, particularly for the complex problems typical in competitive programming. DeepCoder’s use of machine learning improves the efficiency and effectiveness of program synthesis by predicting program attributes and influencing the synthesis process with neural network insights.

Other approaches to program synthesis using elements of non-linear bias include DreamCoder (Ellis et al., 2020). DreamCoder is an innovative program learning system that specializes in program induction across multiple domains, utilizing self-supervised learning, bootstrapping, and domain-specific languages. It employs a unique "Wake/Sleep" architecture for program induction, combining generative models and neural networks to efficiently synthesize and refactor programs. DreamCoder stands out for its ability to discover specialized abstractions, enabling the expression of complex solutions to tasks at hand and achieving significant advancements in the field of program learning.

3 Background

3.1 ILP

Inductive logic programming (ILP) is a method of symbolic computing which can automatically construct logic programs provided a background knowledge base (KB) (Muggleton & De Raedt, 1994). An ILP problem is represented as a tuple \(({\mathcal {B}}, {\mathcal {P}}, {\mathcal {N}})\) of ground atoms, where \({\mathcal {B}}\) defines the background assumptions, \({\mathcal {P}}\) is the set of positive instances which help define the target predicate to be learned, and \({\mathcal {N}}\) is the set of negative instances of the target predicate. The aim of ILP is to construct a logic program that explains all provided positive instances and rejects the negative ones. Given an ILP problem \(({\mathcal {B}}, {\mathcal {P}}, {\mathcal {N}})\), the aim is to identify a set of hypotheses (clauses) \({\mathcal {H}}\) such that (Muggleton & De Raedt, 1994):

  • \({\mathcal {B}} \wedge {\mathcal {H}} \models \gamma\) for all \(\gamma \in {\mathcal {P}}\)

  • \({\mathcal {B}} \wedge {\mathcal {H}} \not \models \gamma\) for all \(\gamma \in {\mathcal {N}}\)

where \(\models\) denotes logical entailment. That is, the conjunction of the background knowledge and the hypothesis should entail all positive instances and should not entail any negative instance. Assume, for example, a KB with provided constants \(\{bob, carol, volvo, jacket, pants, skirt, \cdots \}\), where the task is to learn the predicate \(\texttt {Passenger}(X)\). Then the ILP problem is defined as:

  • \({\mathcal {B}} =\{ \texttt {Car}(ford), \texttt {Clothing}(jacket), \texttt {On}(jacket, bob), \texttt {Inside}(carol, volvo), \cdots \}\)

  • \({\mathcal {P}} = \{\texttt {Passenger}(bob), \texttt {Passenger}(carol), \cdots \}\)

  • \({\mathcal {N}} = \{ \texttt {Passenger}(volvo), \texttt {Passenger}(jacket), \cdots \}\)

The outcome of the induction performed is a hypothesis of the form:

$$\begin{aligned} \texttt {Passenger}(X) \leftarrow \texttt {Inside}(X,Y_1) \wedge \texttt {Car}(Y_1) \wedge \texttt {On}(Y_2, X) \wedge \texttt {Clothing}(Y_2) \end{aligned}$$

The learned first-order logic rule from the KB states “if an object is inside the car with clothing on it, then it is a passenger”.

The ILP problem may also contain a language frame \({\mathcal {L}}\) and program template \(\Pi\) (Evans & Grefenstette, 2017). The language frame is a tuple which contains information on the target predicate, the set of extensional predicates, arity of each predicate, and a set of constants, while the program template describes the range of programs that can be generated.

ILP relates to non-linearity in that predicate rules can be equated with regions of a non-linear function. For example, consider the equation for the mass-energy equivalence, \(E=m \times c^2\), which takes as input the mass (m) and multiplies it by the square of the constant for the speed of light (c). For the sake of the example, we treat both m and c as random variables within the range [0, 1). As non-linear equations produce continuous output, we can discretise the output into specified ranges. Discretising the continuous output transforms the non-linear regression problem into a classification problem, aligning with the discrete reasoning of ILP. By discretising the output of a function such as \(E=m \times c^2\) into distinct intervals, we can also interpret the distinct intervals as level sets of the equation. These level sets, which can be arbitrarily shaped and even disconnected, correspond to the range within specific bins and serve as target predicates for ILP to learn. Associating each FOL rule with specific level sets enriches our understanding of the function’s representation, clarifying the relationship between logic rules and the function’s behaviour.

The ILP problem would then focus on learning target predicates which represent an output range; here Class\(_1\)(m, c) maps to the output range \((0 \le E \le 3.07\times 10^{-1})\). As before, the tuple of ground atoms \(({\mathcal {B}}, {\mathcal {P}}, {\mathcal {N}})\) would contain background assumptions, but in the non-linear context the predicates in the KB would be associated with inequalities (LessThan), inequalities on transformed input (SquareLessThan), and inequalities on operations between variables, such as taking the product of two variables (ProdLessThan). Assuming a KB with well-defined predicates consisting of operations and transformations on continuous data, a hypothesis for the first class can be induced:

$$\begin{aligned}&\texttt {Class}_1(m,c)_{(0 \le E \le 3.07\times 10^{-1})} \leftarrow \texttt {LessThan}(m,0.6) \wedge \texttt {SquareLessThan}(c, 0.5) \wedge \\&\quad \quad \quad \texttt {ProdLessThan}(m, c^2, 0.3) \end{aligned}$$

Here, the logic rule for the classification can be interpreted as “if m is less than 0.6 and the square of c is less than 0.5 and the product of m and \(c^2\) is less than 0.3, then the value of E will be greater than or equal to 0 and less than or equal to \(3.07\times 10^{-1}\)”. The interpretation, while distant from the original non-linear function, can be parsed to derive the true equation. The inequalities on the continuous values assist in defining the discrete range associated with our Class\(_1\) predicate, while the associated transformations (\(c^2\)) and operations (\(m \times c^2\)) lend themselves to defining the non-linear equation for E. We note that the reliability of the non-linear and operation predicate rules we derive hinges on the granularity of the target’s discretisation. A sufficiently fine-grained discretisation ensures the robustness of the learned rules, as a coarse discretisation scheme risks overgeneralising the non-linear output.
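To make the discretisation step concrete, the following minimal sketch (our own illustration, not the paper's implementation; the bin count d and the uniform sampling are assumptions) bins the continuous output of \(E = m \times c^2\) into equal-width intervals and treats membership in the lowest bin as the positive examples for Class\(_1\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample m and c uniformly from [0, 1), as in the running example.
m = rng.random(1000)
c = rng.random(1000)
E = m * c ** 2                      # continuous target of the non-linear function

# Discretise the continuous output into d equal-width bins (level sets).
d = 3
edges = np.linspace(E.min(), E.max(), d + 1)
labels = np.digitize(E, edges[1:-1])          # class index 0 .. d-1 per instance

# Class_1 corresponds to the lowest output range; its positive examples are the
# instances falling in bin 0, and every other instance is a negative example.
positives = np.stack([m, c], axis=1)[labels == 0]
negatives = np.stack([m, c], axis=1)[labels != 0]
print(f"range of Class_1: {edges[0]:.3e} <= E <= {edges[1]:.3e}")
print(len(positives), "positive and", len(negatives), "negative examples")
```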

3.2 dNL

Initially, ILP was constrained to program induction on non-noisy symbolic data. While data-efficient, traditional ILP was limited in its applications. In Evans and Grefenstette (2017), ILP was bridged with neural networks, to create an end-to-end differentiable architecture which could learn on noisy data. The core ideas have been used in subsequent proposals, including the approach presented here. Payani et al. proposed a further extension of ILP called a differentiable neural logic (dNL) network which utilises differentiable neural logic layers to learn Boolean functions (Payani & Fekri, 2019), building upon ideas proposed by Evans et al. The concept is to define Boolean functions that can be combined in a similar cascading architecture akin to neural networks. This gives deep learning an explicit symbolic representation that is interpretable. It also redefines ILP as an optimisation problem. The dNL architecture uses membership weights and conjunctive and disjunctive layers with forward chaining to remove the need for the rule template to solve ILP problems.

In the construction of the logical framework, Boolean values (\(true = 1\), \(false = 0\)) are mapped to the real interval [0, 1]. Payani et al. define fuzzy unary and dual Boolean functions of variables x and y as follows:

  • \({\overline{x}} = 1 - x\)

  • \(x \wedge y = xy\)

  • \(x \vee y = 1 - ( 1 - x)(1 - y)\)
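As a quick illustration (a minimal sketch of the fuzzy operators above, not the dNL implementation itself), the soft negation, conjunction, and disjunction agree with classical logic at the extremes 0 and 1 while remaining differentiable in between:

```python
import numpy as np

def f_not(x):            # negation: 1 - x
    return 1.0 - x

def f_and(x, y):         # fuzzy conjunction: x * y
    return x * y

def f_or(x, y):          # fuzzy disjunction: 1 - (1 - x)(1 - y)
    return 1.0 - (1.0 - x) * (1.0 - y)

# At the corners the operators agree with classical logic ...
assert f_and(1.0, 0.0) == 0.0 and f_or(1.0, 0.0) == 1.0
# ... and for soft truth values they interpolate smoothly.
print(f_and(0.9, 0.8), f_or(0.9, 0.8))   # 0.72, 0.98
```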

The core component of the dNL network is their use of differentiable neural logic layers to learn Boolean functions (Payani & Fekri, 2019). The dNL architecture uses membership weights and conjunctive and disjunctive layers to learn a target predicate or Boolean function. Learning a target predicate p requires the construction of a Boolean function \({\mathcal {F}}_p\), which passes a Boolean vector \({{\textbf {x}}}\) of size N, with elements \(x^{(i)}\), into a neural conjunction function \(f_{conj}\) (see Eq. 1a), which is itself defined by the conjunction Boolean function \(F_{conj}\) (see Eq. 1b).

$$\begin{aligned} f_{conj}({{\textbf {x}}}) & = \prod _{i=1}^N F_{conj}(x^{(i)}, m^{(i)}) \end{aligned}$$
(1a)
$$\begin{aligned} F_{conj}(x^{(i)}, m^{(i)}) & = \overline{\overline{x^{(i)}} m^{(i)}} = 1 - m^{(i)}(1 - x^{(i)}) \end{aligned}$$
(1b)

A predicate defined by a Boolean function in this manner is extracted by parsing the architecture for membership weights above a given threshold, where membership weights are obtained from trainable weights (\(w^{(i)}\)) via a sigmoid \(m^{(i)} = \sigma (c w^{(i)})\) with constant \(c \ge 1\). The constant c effectively acts as a "scaling factor" for the sigmoid’s argument. As the constant is greater than 1, it makes the sigmoid curve steeper, meaning that the transition from the lower asymptote to the upper asymptote of the sigmoid function occurs over a narrower range of input values. Membership weights are paired with continuous lower and upper bound predicate functions (discussed in Sect. 3.4) which are eventually interpreted as atoms in the body of the predicate being learned. These same Boolean predicate functions are used to transform non-Boolean data into a Boolean format for the logic layers.

Similarly, a neural disjunction function \(f_{disj}\) (see Eq. 2a) can be constructed using the disjunction Boolean function \(F_{disj}\) (see Eq. 2b).

$$\begin{aligned} f_{disj}({{\textbf {x}}}) & = 1 - \prod _{i =1}^N(1 - F_{disj}(x^{(i)}, m^{(i)})) \end{aligned}$$
(2a)
$$\begin{aligned} F_{disj}(x^{(i)}, m^{(i)}) & = x^{(i)} m^{(i)} \end{aligned}$$
(2b)

By combining different neural Boolean functions, a multi-layered structure can be created. For example, cascading a conjunction function with a neural disjunction function (see Eq. 2a) creates a layer in disjunctive normal form (DNF), so-called dNL-DNF. Alternatively, cascading a disjunctive function with a neural conjunction function reinterprets the architecture in conjunctive normal form (CNF), forming a dNL-CNF.

Each rule to be learned corresponds to a dNL-DNF (or dNL) function, a differentiable symbolic Boolean function with parameterized membership weights in its conjunction and disjunction layers. A rule’s body is represented by a Boolean dNL function, determined by membership weights in its neural conjunction and disjunction functions, given by \({\mathcal {F}}_p \leftarrow f_{disj}(f_{conj}({{\textbf {x}}}))\). The conjunction layer’s membership weights, \(m^{(i)}\), relate to specific Boolean inputs \(x^{(i)}\) from vector \({\textbf{x}}\), signifying the presence or absence of a Boolean feature. Meanwhile, the disjunction layer’s membership weights, which are separate weights from the conjunction layer, map to the conjunction layer’s rows, offering multiple definitions for rule p.
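The following sketch (our own numpy illustration with hypothetical sizes and fixed raw weights; in the actual model the raw weights are trained by gradient descent) composes Eqs. (1a) and (2a) into the dNL-DNF function \({\mathcal {F}}_p \leftarrow f_{disj}(f_{conj}({{\textbf {x}}}))\), with membership weights obtained through the scaled sigmoid \(m = \sigma (c w)\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnl_dnf(x, W_conj, W_disj, c=20.0):
    """Differentiable DNF: a disjunction over N_p conjunctions of the inputs x.

    x       : Boolean (or soft) input vector of size N
    W_conj  : raw conjunction weights, shape (N_p, N)
    W_disj  : raw disjunction weights, shape (N_p,)
    """
    m_conj = sigmoid(c * W_conj)                      # membership weights in [0, 1]
    m_disj = sigmoid(c * W_disj)
    # Eq. (1a): each conjunction neuron multiplies the terms 1 - m * (1 - x).
    conj = np.prod(1.0 - m_conj * (1.0 - x), axis=1)  # shape (N_p,)
    # Eq. (2a): the disjunction neuron combines the conjunction outputs.
    return 1.0 - np.prod(1.0 - m_disj * conj)

# Toy example: the first conjunction neuron selects inputs 0 and 2 (large raw
# weights); the second neuron is effectively switched off by the disjunction.
x = np.array([1.0, 0.0, 1.0, 0.0])
W_conj = np.array([[ 5.0, -5.0,  5.0, -5.0],
                   [-5.0,  5.0, -5.0, -5.0]])
W_disj = np.array([5.0, -5.0])
print(dnl_dnf(x, W_conj, W_disj))   # close to 1: x satisfies "x0 AND x2"
```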

The optimization of the model is performed by evaluating the extracted rules’ membership weights, where evaluation is done by applying the extracted rules to the background knowledge so that negative examples are rejected and positive examples are entailed. The weights are optimized based on the loss functions detailed in Sect. 3.5. As we extend to the continuous case where inputs are continuous and the target is discrete, each class in a tabular dataset is associated with a target predicate Boolean function \({\mathcal {F}}_p\), which is defined by bounded continuous Boolean predicates derived from each continuous feature, as explained in Sect. 3.4.

3.3 Notation

In this proposal, mathematical equations play a vital role in detailing our approach. To ensure clarity, we provide tables that list and define the crucial hyperparameters and notations. Please refer to Table 2 for hyperparameters and Table 1 for the specific notations employed throughout this proposal.

We also clarify that, for a given instance i, we define a discrete output label \(y_i\) and a set of continuous feature values \({{\textbf {x}}}_i \leftarrow \{x_i^{(1)},x_i^{(2)},\cdots ,x_i^{(m)}\}\) representing the data entry for label \(y_i\). We note that the \(x^{(i)}\) notation is useful for understanding the input with respect to Eqs. (1a) and (2a); however, we will use the following notation to refer to a continuous random variable \(X_i\), and for a given instance i in the data, \(\{X_1,X_2,\cdots ,X_m\}_i\), as this uppercase format reflects the final logical rule syntax produced by the architecture (seen in the results section).

Table 1 Notation and definitions for implemented equations
Table 2 Listed user defined hyperparameters and definitions

3.4 Continuous predicates

The original dNL architecture made use of continuous predicates but the investigation was limited. To handle continuous inputs, Boolean predicate functions are applied to the continuous variables as a series of lower and upper bound predicates (Payani & Fekri, 2019; Speichert & Belle, 2018; Belle et al., 2016; Bueff et al., 2021). The resulting continuous Boolean predicates take continuous values as input and return either true or false depending on whether the input meets the condition. In this interpretation, each continuous variable x is defined by k pairs of boundary predicates: a Boolean predicate \(gt^i_{x}(x, u_{xi})\), which states whether “x is greater than \(u_{xi}\)" is true, and a predicate \(lt^i_{x}(x, l_{xi})\), which states whether “x is less than \(l_{xi}\)" is true, where \(i \in \{1,2,\cdots ,k\}\). The boundary values \(l_{xi}\) and \(u_{xi}\) are also treated as trainable weights, allowing for the following definition of the predicates:

$$\begin{aligned} {\mathcal {F}}_{gt^i_{x}} = \sigma (c(x-u_{xi})), \quad {\mathcal {F}}_{lt^i_{x}} = \sigma (-c(x-l_{xi})) \end{aligned}$$
(3)

In Eq. (3), the sigmoid (\(\sigma\)) and constant \(c \ge 1\) (set to \(c=20\)) are applied to ensure that the output is approximately Boolean. While the sigmoid constrains the output to [0, 1], the constant controls the steepness of the sigmoid curve: a larger constant increases the steepness, resulting in greater confidence when the input is clearly greater or less than the corresponding boundary value.
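A small numerical sketch of Eq. (3) (illustrative only; during learning the bounds \(u_{xi}\) and \(l_{xi}\) are trainable parameters): with \(c = 20\) the sigmoid acts as a soft threshold, returning values close to 0 or 1 once the input is clearly on one side of the bound.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gt_pred(x, u, c=20.0):
    """Soft 'greater than' boundary predicate F_gt = sigmoid(c * (x - u))."""
    return sigmoid(c * (x - u))

def lt_pred(x, l, c=20.0):
    """Soft 'less than' boundary predicate F_lt = sigmoid(-c * (x - l))."""
    return sigmoid(-c * (x - l))

x = np.array([0.10, 0.48, 0.90])
print(gt_pred(x, u=0.5))   # ~[0.00, 0.40, 1.00]: only 0.90 clearly exceeds the bound
print(lt_pred(x, l=0.5))   # ~[1.00, 0.60, 0.00]: only 0.10 is clearly below it
```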

3.5 dNL loss function

The following loss functions are used in the dNL network to ensure that training accounts for loss on the predictive output and heuristic loss related to the interpretability of the learned clauses. The primary loss measure is the average cross-entropy between predictions on \({\mathcal {X}}_p[e]\), where e is a ground truth from the set of negative and positive class examples \({\mathcal {N}}_p, {\mathcal {P}}_p\), and the true classification of e. The loss function takes into account all predicates p in the set of target predicates \({\mathbb {P}}\) (see Eq. 4):

$$\begin{aligned} \begin{aligned} {\mathcal {L}} = -{\mathbb {E}}_{p \in {\mathbb {P}}} {\mathbb {E}}_{(e,\lambda _p)\in \Lambda _p} \Big \{ \lambda _p \log {\mathcal {X}}_p[e] + (1-\lambda _p) \log (1-{\mathcal {X}}_p[e]) \Big \} \\ \text {where,} \; \Lambda _p = \Big \{(\gamma ,1)|\gamma \in {\mathcal {P}}_p \Big \} \bigcup \Big \{ (\gamma , 0)|\gamma \in {\mathcal {N}}_p \Big \} \end{aligned} \end{aligned}$$
(4)

Additional loss metrics include a measure of interpretability (see Eq. 5), which accounts for the fact that membership weights may not converge to 0 or 1 in the final network. Such cases often occur when no formal Boolean rule definition can model the discrete target classification in the background knowledge. To increase interpretability, the dNL model is optimised such that membership weights converge to 0 or 1. The membership weights m reflect the weights from both the disjunction and conjunction layers used to define \({\mathcal {F}}_p\).

$$\begin{aligned} {\mathcal {L}}_{int} = {\mathbb {E}}_{p \in {\mathbb {P}}} {\mathbb {E}}_{m \in {\mathcal {F}}_p}m(1-m) \end{aligned}$$
(5)

The final loss term is designed to reduce the number of terms in each clause via pruning (see Eq. 6). During training, it is possible to encounter cases of redundancy, so for each predicate function \({\mathcal {F}}_p\), the dNL model prunes the definitions by subtracting \(N_{max}\), the maximum number of allowed terms in a definition, from the sum of membership weights and passing the result through a ReLU.

$$\begin{aligned} {\mathcal {L}}_{prn} = \sum _{p \in {\mathbb {P}}}relu \Big ( \sum _{m_i \in {\mathcal {F}}_p} m_i - N_{max} \Big ) \end{aligned}$$
(6)

In Eq. 7, all loss terms are combined into a final aggregated loss metric. Note the included hyperparameters on the pruning loss (\(\lambda _{prn}\)) and on the interpretability loss (\(\lambda _{int}\)).

$$\begin{aligned} {\mathcal {L}}_{agg} = {\mathcal {L}} + \lambda _{prn} {\mathcal {L}}_{prn} + \lambda _{int} {\mathcal {L}}_{int} \end{aligned}$$
(7)
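A compact sketch of how Eqs. (4)–(7) could be combined for a single target predicate (shapes and the values of \(\lambda _{prn}\), \(\lambda _{int}\) and \(N_{max}\) are assumptions for illustration; the model aggregates these terms over all target predicates and batches):

```python
import numpy as np

def aggregated_loss(preds, targets, memberships, n_max=8,
                    lambda_prn=1e-3, lambda_int=1e-2, eps=1e-7):
    """Cross-entropy + interpretability + pruning losses, per Eqs. (4)-(7).

    preds       : predicted truth values X_p[e] in (0, 1), shape (n_examples,)
    targets     : 1 for positive examples, 0 for negative ones
    memberships : membership weights m of the predicate's conj/disj layers
    """
    p = np.clip(preds, eps, 1.0 - eps)
    ce = -np.mean(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))  # Eq. (4)
    interp = np.mean(memberships * (1.0 - memberships))                     # Eq. (5)
    prune = np.maximum(np.sum(memberships) - n_max, 0.0)                    # Eq. (6), relu
    return ce + lambda_prn * prune + lambda_int * interp                    # Eq. (7)

preds = np.array([0.92, 0.88, 0.07, 0.15])
targets = np.array([1.0, 1.0, 0.0, 0.0])
memberships = np.array([0.98, 0.03, 0.55, 0.01, 0.97])
print(aggregated_loss(preds, targets, memberships))
```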

4 Contributions

The following section discusses the extensions applied to baseline dNL networks to derive dNL-NL networks, as well as the transformation and operation layers used in the overall pipeline to extract non-linear functions from continuous data. We first provide a general overview of the various components prior to discussing them in more depth with regard to their algorithmic and mathematical implementation.

Non-linear and Operation predicates: Our dNL-NL model efficiently handles non-linear transformations of continuous inputs. We apply standard non-linear transformations like power, exponential, and sine functions. Additionally, the model uses operation predicates, symbolizing basic mathematical operations, enhancing its computational capabilities.

Handling of Non-linear Continuous data: To manage the non-linear continuous data, we redefined it within the classification framework. This involved basing our target predicate on continuous Boolean predicates. For each classification within a dataset with discrete targets and continuous variables, we efficiently separated positive and negative examples.

Input Matrix and dNL architecture: Our model’s architecture prominently features an input matrix. This matrix converts continuous inputs into a Boolean interpretation suitable for dNL-NL processing. The activations from the input matrix are fed through a conjunctive and disjunctive layer each with respective membership weights.

Loss function: The model employs a loss function designed for rule consistency and reduced complexity. Additionally, the ‘true loss’ metric measures the difference between the extracted non-linear function and the original continuous target.

Subsequent sections delve into transformation and operation layers and the rule extraction pipeline, integrating the aforementioned techniques into a unified framework.

4.1 Non-linear and operation predicates

Consider the initial example for the equation \(E=m \times c^2\). Our objective is to learn logical rules that define a particular region of its output. Using these rules, we can reconstruct a non-linear equation with the given predicates. In our framework, m is addressed by predicates from Eq. 3, while c necessitates the transformation \(c^2\). Additionally, predicates capturing the multiplication operation \(m \times c^2\) are paramount. Though inequalities define the range \(0 \le E \le 3.07 \times 10^{-1}\) for the associated discrete class predicate, our primary focus is on variable transformations and operations. Subsequent continuous predicates are derived from this need.

To handle continuous inputs, we employ non-linear transformations, specifically focusing on functions like the power (\(f(x) = x^2\)), exponential (\(f(x) = exp(x)\)), and sine (\(f(x) = sin(x)\)) function. We introduce k upper and lower boundary transformation predicates for each (non-linear transformation predicates):

$$\begin{aligned} \text {square: }{\mathcal {F}}_{gt^i_{Sqr(x)}} & = \sigma (c(x^2-u_{xi})), \ {\mathcal {F}}_{lt^i_{Sqr(x)}} = \sigma (-c(x^2-l_{xi})) \\ \text {exponential: }{\mathcal {F}}_{gt^i_{Exp(x)}} & = \sigma (c(exp(x)-u_{xi})), \ {\mathcal {F}}_{lt^i_{Exp(x)}} = \sigma (-c(exp(x)-l_{xi})) \\ \text {sine: }{\mathcal {F}}_{gt^i_{Sin(x)}} & = \sigma (c(sin(x)-u_{xi})), \ {\mathcal {F}}_{lt^i_{Sin(x)}} = \sigma (-c(sin(x)-l_{xi})) \end{aligned}$$

Here, \(i \in \{1,\ldots ,k\}\), \(\sigma\) denotes the sigmoid function, c is a constant, and \(u_{xi}\) and \(l_{xi}\) represent upper and lower bounds for the \(i^{th}\) predicate. The bounded weights are trainable parameters specific to each transformation predicate; that said, they are shared among the learned target predicates in \({\mathbb {P}}\) to ensure the learned literals are consistent across rule definitions. During training, we split the transformed continuous features into k bins of equal width.
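One way the equal-width split could be used to place the k boundary pairs is sketched below (an assumed initialisation scheme consistent with the description above, not necessarily the exact implementation): the transformed feature is binned and the bin edges provide starting values for the trainable bounds.

```python
import numpy as np

def init_bounds(x_transformed, k):
    """Initialise k (lower, upper) bound pairs from equal-width bins of a
    transformed feature, e.g. x**2, np.exp(x) or np.sin(x)."""
    edges = np.linspace(x_transformed.min(), x_transformed.max(), k + 1)
    # Illustrative assignment: consecutive edges seed the trainable l_xi / u_xi.
    lower = edges[:-1].copy()
    upper = edges[1:].copy()
    return lower, upper

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
lower, upper = init_bounds(np.exp(x), k=7)   # starting bounds for the Exp(x) predicates
print(lower[:3], upper[:3])
```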

Furthermore, we define operation predicates, capturing arithmetic operations between variables. For variables x and y, we present k upper and lower boundary predicates (operation predicates):

$$\begin{aligned} \text {addition: }&{\mathcal {F}}_{gt^i_{Add(x,y)}} = \sigma (c((x+y)-u_{(x,y)i})), \\&{\mathcal {F}}_{lt^i_{Add(x,y)}} = \sigma (-c((x+y)-l_{(x,y)i})) \\ \text {subtraction: }&{\mathcal {F}}_{gt^i_{Sub(x,y)}} = \sigma (c((x-y)-u_{(x,y)i})), \\&{\mathcal {F}}_{lt^i_{Sub(x,y)}} = \sigma (-c((x-y)-l_{(x,y)i})) \\ \text {multiplication: }&{\mathcal {F}}_{gt^i_{Prod(x,y)}} = \sigma (c((x \times y)-u_{(x,y)i})), \\&{\mathcal {F}}_{lt^i_{Prod(x,y)}} = \sigma (-c((x \times y)-l_{(x,y)i})) \end{aligned}$$

These boundaries stem from the resulting outputs of variable operations. Combined, our dNL model, equipped with both transformation and operation predicates, can adeptly infer non-linear relationships from data.
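Putting the transformation and operation predicates together, the body of a rule such as the Class\(_1\) clause from Sect. 3.1 becomes a product of soft inequalities; the sketch below is illustrative only, using fixed bounds rather than trained ones:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lt(v, bound, c=20.0):
    """Soft 'less than' predicate applied to any (possibly transformed) value."""
    return sigmoid(-c * (v - bound))

# Soft truth value of the Class_1 rule body from Sect. 3.1:
#   LessThan(m, 0.6) AND SquareLessThan(c, 0.5) AND ProdLessThan(m, c^2, 0.3)
def class1_body(m, c):
    return lt(m, 0.6) * lt(c ** 2, 0.5) * lt(m * c ** 2, 0.3)

print(class1_body(m=0.4, c=0.5))   # ~0.96: all three literals are clearly satisfied
print(class1_body(m=0.9, c=0.9))   # ~0.00: the instance violates the rule body
```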

4.2 Non-linear continuous data

In the dNL framework and similar ILP-inspired architectures, a Knowledge Base (KB) is formed with positive instances \({\mathcal {P}}\) and negative instances \({\mathcal {N}}\), supported by general background assumptions \({\mathcal {B}}\). For dNL’s continuous input handling, continuous attributes undergo a transformation into Boolean predicate functions via discretisation. We recast continuous data \(D_{c,c}\) to fit a classification problem, resulting in discrete targets, where subscripts c and d indicate continuous and discrete data, respectively. Our target predicate’s body consists of continuous Boolean predicates, with its head representing a discrete class, mirroring a non-linear function’s output range, \((a \le {\textsf{F}}(x) \le b)\). Given a dataset \(D_{d,c}\), with a discrete target \(Y_d\) and continuous variables \(X_i\) (where i ranges from 1 to m), instances belonging to a specific classification are viewed as positive examples, and the rest are considered negative.

Using the dNL-NL model, we craft background knowledge, as illustrated in Fig. 1. Inputs from an example dataset are transformed according to our transformation KB \(\textsf{KB}_T\), resulting in our background knowledge \({\mathcal {B}}\). This transformation process is further elaborated in Algorithm 1, where background knowledge is crafted for each predicate p in the set of target predicates \({\mathbb {P}}\). Depending on our model’s emphasis (either operation or transformation), background knowledge incorporates either operations from \(\textsf{KB}_O\) or transformations from \(\textsf{KB}_T\), but not both. In Algorithm 1, the ‘OR’ reflects this option and is not an executable operation. Training leverages \({\mathcal {B}}\) to determine positive and negative instances corresponding to a class. As an example, for the predicate linked to \(class_1\), we derive \({\mathcal {N}}_{class_1}\) and \({\mathcal {P}}_{class_1}\) from \({\mathcal {B}}\), a process depicted in Fig. 1. During training iterations, each data batch provides positive and negative instances for every target class.
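A simplified sketch of the objects Algorithm 1 produces (hypothetical data structures and synthetic labels; the actual algorithm operates on the model's internal tensors): the background knowledge pairs each transformation in \(\textsf{KB}_T\) with the corresponding transformed column, and each target class splits the instances into positive and negative examples.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 2))      # two continuous features X1, X2
y = rng.integers(0, 3, size=200)               # discretised target with d = 3 classes

# Transformation knowledge base KB_T: name -> elementwise transformation.
KB_T = {"Sqr": np.square, "Exp": np.exp, "Sin": np.sin}

# Background knowledge B: transformed columns for every feature and transformation.
B = {(f"X{j + 1}", name): fn(X[:, j])
     for j in range(X.shape[1]) for name, fn in KB_T.items()}

# Positive / negative example indices for each target predicate class_p.
examples = {p: {"pos": np.flatnonzero(y == p), "neg": np.flatnonzero(y != p)}
            for p in range(3)}
print(sorted(B), {p: len(v["pos"]) for p, v in examples.items()})
```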

Algorithm 1 Building background knowledge

Fig. 1 Example depiction of creating background knowledge from original data and the representation of positive and negative examples for continuous data with respect to target predicate \(class_1\)

4.3 Input matrix and architecture structure

The Boolean target predicate function, \({\mathcal {F}}_p\), is constructed of alternating conjunction and disjunction layers, arranged in DNF form, as described in Sect. 3.2. These layers process Boolean predicate functions derived from continuous variables. Specifically, within \({\mathcal {F}}_p\), the number of the conjunction layer’s membership weight columns, \(N_e\), is defined as \(N_e=2 \times k \times |C_p|\), where the factor of 2 corresponds to separate lower and upper bound predicates, and \(C_p\) denotes the continuous predicates derived from each variable \(X_i\) and its affiliated knowledge base \(\textsf{KB}_i\). The number of stacked conjunction neurons at the disjunction layer is defined by the hyperparameter \(N_p\).

The input matrix, \({\textbf{I}}\), defined in Eq. 8, consolidates continuous lower and upper bound predicate functions. The column space is defined by \(N_e\) and the row space is defined by the batch size. During training, batch size, \({\textbf{b}}\), is left as a hyperparameter. This matrix facilitates the transformation of continuous data into a Boolean form suitable for the dNL-NL network’s reasoning.

$$\begin{aligned} {\textbf{I}} = \Bigg [ \bigvee _{e \in C_p} \bigvee ^{k}_{i=1} \Big ( {\mathcal {F}}_{gt^i_{e}} \vee {\mathcal {F}}_{lt^i_{e}} \Big ) \Bigg ]_{{{\textbf {b}}}} \end{aligned}$$
(8)

\({\mathcal {F}}_p\) is defined by \({\textbf{I}}\) and encompasses continuous Boolean predicates, membership weights, disjunction layer \(F_{disj}\), and conjunction layer \(F_{conj}\), as characterized in Eq. 9.

$$\begin{aligned} \qquad \quad {\mathcal {F}}_p|_{{\textbf{I}},N_e,N_p} = F_{disj}(N_p, F_{conj}(N_e, {\textbf{I}} )) \end{aligned}$$
(9)
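Equation (8) amounts to stacking, for every continuous predicate in \(C_p\) and every boundary index \(i \le k\), the two soft truth values \({\mathcal {F}}_{gt^i_{e}}\) and \({\mathcal {F}}_{lt^i_{e}}\) into one row per batch instance. A sketch with assumed shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_matrix(values, lower, upper, c=20.0):
    """Assemble the Boolean input matrix I for a batch.

    values : (b, |C_p|)  transformed/operated continuous predicate inputs
    lower  : (|C_p|, k)  trainable lower bounds l
    upper  : (|C_p|, k)  trainable upper bounds u
    returns (b, N_e) with N_e = 2 * k * |C_p|
    """
    v = values[:, :, None]                       # (b, |C_p|, 1)
    gt = sigmoid(c * (v - upper[None]))          # (b, |C_p|, k)
    lt = sigmoid(-c * (v - lower[None]))         # (b, |C_p|, k)
    return np.concatenate([gt, lt], axis=2).reshape(values.shape[0], -1)

b, n_pred, k = 4, 3, 7
values = np.random.default_rng(0).uniform(0.0, 10.0, size=(b, n_pred))
lower = np.tile(np.linspace(0.0, 8.0, k), (n_pred, 1))
upper = lower + 10.0 / k
I = input_matrix(values, lower, upper)
print(I.shape)    # (4, 42) == (b, 2 * k * |C_p|)
```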

The neurons in the primary dNL function initialize the membership weights close to zero using random Gaussian distributions to prevent gradients from becoming exceedingly small. Equations (10a) and (10b) detail how conjunction and disjunction functions interact with the input matrix and membership weights.

$$\begin{aligned} \quad {F}_{conj} & = \prod _{i=1}^{N_e} \big [1 - m^{conj}_i(1- {\textbf{I}}_i) \big ] \nonumber \\ \text {where,} \; {\textbf{m}}^{conj} & = \sigma (c{\textbf{W}}^{conj}), \ {\textbf{W}}^{conj}_{N_p, N_e} \sim {\mathcal {N}} \end{aligned}$$
(10a)
$$\begin{aligned} \quad {F}_{disj} & = 1 - \prod _{i=1}^{N_p} \big [1 - m_i^{disj}{F}_{conj}^i \big ] \nonumber \\ \text {where,} \; {\textbf{m}}^{disj} & = \sigma (c{\textbf{W}}^{disj}), \ {\textbf{W}}^{disj}_{1, N_p} \sim {\mathcal {N}} \end{aligned}$$
(10b)

Note that the Boolean membership weights \({\textbf{m}}\) are derived from a sigmoid transformation of weights \({\textbf{W}}\), and \({\textbf{W}}^{conj}\) is a matrix with dimensions \((N_p \times N_e)\) while \({\textbf{W}}^{disj}\) is a vector of size \(N_p\).

Algorithm 2 encapsulates the dNL-NL model’s single-step design, merging Eqs. (8), (10a), and (10b). In this model, each grounding e updates in one step based on the background knowledge. Following a single step on an input batch, we ground the weighted matrix \({\mathcal {X}}_p\) for predicate p. This method integrates the logical entailments from our input matrix. Upon training all predicate functions, a set \({\mathbb {X}}\) of evaluated target predicate functions is returned.

Algorithm 2 Non-linear single step forward chain model (dNL-NL model)

4.4 Loss function

The loss function design ensures that rules for discrete classes are consistent and that the complexity is minimized (refer to Eq. 7). The true loss, \(Loss_{true}\), while not employed during dNL-NL network training, acts as a measure of how closely the non-linear function approximation, \({\textsf{F}}^*\), derived from the network, matches the original continuous target. \(Loss_{true}\) represents the average absolute disparity between our approximated function and the continuous output \(Y_c\), see Eq. 11.

$$\begin{aligned} {Loss}_{true} = \frac{1}{N}\sum _{i=1}^N \Big | Y_c[i] - {\textsf{F}}^*(\{X_1,X_2,\cdots ,X_m \}_i)\Big | \end{aligned}$$
(11)
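Equation (11) is simply the mean absolute error between the compiled approximation \({\textsf{F}}^*\) and the continuous target, for example (with a hypothetical extracted function):

```python
import numpy as np

def true_loss(y_continuous, f_star, X):
    """Mean absolute difference between F*(x) and the continuous target (Eq. 11)."""
    return np.mean(np.abs(y_continuous - f_star(X)))

# Hypothetical extracted approximation for the running example E = m * c**2.
f_star = lambda X: X[:, 0] * X[:, 1] ** 2
X = np.random.default_rng(0).random((100, 2))
y = X[:, 0] * X[:, 1] ** 2 + 0.01            # pretend the data is slightly off
print(true_loss(y, f_star, X))               # ~0.01
```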

4.5 Transformation layer

The objective of the transformation layer is to learn the mathematical transformations applied to individual variables. The layer takes as input the discretized data associated with a variable \(X_i\), denoted \(D_{d,c}^{(i)}\), and the knowledge base on transformations, \(\textsf{KB}_{T}\); it outputs the transformation layer accuracy \(acc_{X_i}^T\) and a set of optimised target predicates \({\mathbb {X}}_{X_i}^T\).

The dNL-NL architecture extracts rules that capture the non-linear relationship between continuous variables using discrete classification of the target. The rules, however, might sometimes deviate from the original non-linear function \({\textsf{F}}\) from which the data was derived, especially when access to all features and transformations is granted. This deviation might result in either the inclusion of non-existent non-linear transformations or the omission of certain variables. To overcome these challenges, it is preferable to adopt a more constrained approach. In this context, the transformation layer applies the dNL-NL model to individual variables to determine their mathematical transformations. This is done in a manner where the knowledge base can be updated or specific transformations removed if required.

Seen in Algorithm 3, for input variable \(X_i\), a list of continuous predicates \(C_p\) is created using the available transformations from \(\textsf{KB}_{T}\). Subsequently, the knowledge base is instantiated using the dataset, target predicates, and potential continuous predicates. The dNL-NL model \({\mathbb {X}}\) is then constructed and trained for a number of iterations (\(I_{max}\)) using batches (\({\mathcal {B}}^{(I)}\)). The optimization focuses on the average cross-entropy between the ground truth and the individual class predictions but also uses the loss function \({\mathcal {L}}_{agg}\) (see Eq. 7). The learning rate starts at 0.001 and is adjusted for faster convergence.

Algorithm 3 Transformation layer

4.6 Operation layer

The objective of the operation layer is to learn rules reflecting mathematical operations between variables after their transformations have been learned. The layer takes as input a dataset with discretized target \(D_{d,c}\), the operation knowledge base \(\textsf{KB}_O\), and the optimised transformation layer models for all variables, \(\{{\mathbb {X}}^T_{X_1},...,{\mathbb {X}}^T_{X_m}\}\); it outputs the operation layer accuracy \(acc^O\) and a set of optimised target predicates \({\mathbb {X}}^O\).

Following the transformation layers, the operation layer comes into play. It differs from the transformation layer primarily in its knowledge base and dataset: it uses an operation-associated knowledge base and includes all variables in the dataset. After the transformations have been learned for each feature, they are used to inform the operation layer’s learning process.

$$\begin{aligned} {\textsc {transformData}}(D_{d,c},\{ {\mathbb {X}}^T_{X_1},\cdots , {\mathbb {X}}^T_{X_m} \}) & = e_{X_i}(D_{d,c}^{(i)}) \nonumber \\ \text {where,} \; e_{X_i} & \leftarrow {\textsc {maxPredicate}}( {\mathbb {X}}^T_{X_i}), \ \forall {\mathbb {X}}^T_{X_i} \in \{ {\mathbb {X}}^T_{X_1}, \cdots , {\mathbb {X}}^T_{X_m}\} \end{aligned}$$
(12)
$$\begin{aligned} {\textsc {maxPredicate}}({\mathbb {X}}^T_{X_i}) = \mathop {\arg \max }\limits _{e} \left[ \sum _{ {\mathcal {X}}_j \in {\mathbb {X}}_{X_i}^T } \sum _{i=1}^{2 \times k} \mathbbm {1}\left( m_{{\mathcal {X}}_j,i}^e > \epsilon \right) \bigg | \; \forall e \in {\mathcal {X}}_j \right] \end{aligned}$$
(13)

As seen in Algorithm 4, the dataset is transformed based on the predicates from the transformation layer with the highest confidence. The confidence of a predicate is gauged by its associated weights. The function transformData (see Eq. 12) takes the original dataset and the transformation layer’s results to derive a transformed dataset \(D_{d,T}\). For maxPredicate (see Eq. 13), we iterate over all associated transformation predicates, ignoring the upper/lower bound distinction, and count the number of transformation predicates with membership weights greater than the hyperparameter \(\epsilon\) to determine which transformation to apply to the variable. Like the transformation layer, the operation layer builds its knowledge base and subsequently trains the dNL-NL model. The final optimised architecture, \({\mathbb {X}}^O\), captures operations between transformed variables and is pivotal in deducing the true non-linear function.
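A sketch of the maxPredicate selection in Eq. (13) (hypothetical data layout and an assumed value for \(\epsilon\)): for each candidate transformation we count, across all class predicates, how many of its \(2 \times k\) membership weights exceed \(\epsilon\), and keep the transformation with the highest count.

```python
import numpy as np

def max_predicate(membership_weights, eps=0.5):
    """Pick the transformation whose boundary atoms the model is most confident in.

    membership_weights : dict mapping a transformation name (e.g. 'Sqr', 'Exp',
                         'Sin') to an array of shape (n_classes, 2 * k) holding
                         the conjunction-layer membership weights of its atoms.
    """
    counts = {name: int(np.sum(w > eps)) for name, w in membership_weights.items()}
    return max(counts, key=counts.get), counts

rng = np.random.default_rng(0)
weights = {
    "Sqr": rng.uniform(0.0, 0.3, size=(3, 14)),               # mostly low confidence
    "Exp": np.where(rng.random((3, 14)) < 0.4, 0.95, 0.05),   # several confident atoms
    "Sin": rng.uniform(0.0, 0.3, size=(3, 14)),
}
print(max_predicate(weights))   # ('Exp', {...}): exp(x) is applied to this feature
```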

Algorithm 4 Operation layer

4.7 Rule extraction pipeline

In Algorithm 5, we integrate the transformation layer, the operation layer, and the construction of the non-linear function approximation \({\textsf{F}}^*\) to compute the true loss \(Loss_{true}\) (see Eq. 11). In Fig. 2, a high-level structure of the transformation and operation layers is displayed along with the general pipeline connecting the layers. Our initial step discretizes the continuous dataset \(D_{c,c}\) into d classes using equal-width binning, yielding \(D_{d,c}\). We copy the transformation knowledge base as \(\textsf{KB}^*_{T}\), updating it during training if extraction fails. We introduce Boolean flags for each variable, \(flag_{\{X_1,X_2,\cdots , X_m\}}\), to determine the use of the modified transformation knowledge base \(\textsf{KB}^*_{T}\). A data structure \({\textbf{F}}^*\) stores non-linear function approximations for updating the operation knowledge base \(\textsf{KB}^*_{O}\).

The pipeline’s main loop calculates \(Loss_{true}\) after the operation and transformation layers. We continue until \(Loss_{true}\) falls below a threshold \(\beta\) (set at 0.05). Transformation layers, associated with each feature \(X_i\), are trained first. Depending on the feature flag \(flag_{X_i}\), we use either the reduced or the full knowledge base. After training, we save the model \({\mathbb {X}}_{X_i}^T\) and its accuracy score \(acc_{X_i}^T\). Using the accuracy scores, the worst-performing layer is identified, and its associated predicate is removed from the updated knowledge base for the next iteration.

The true loss’s value guides whether to update the knowledge base. This is done using the calcLoss function. Beforehand, we check the data structure \({\textbf{F}}^*\) for any derived function approximation \({\textsf{F}}^*\). The method \({\textsc {checkOperations}}\) compares functions to see if certain transformation pairings have been previously derived. Recognised pairings lead to the removal of the corresponding operations from \(\textsf{KB}_{O}^*\) to avoid redundancy. A Boolean flag \(flag_{operation}\) indicates whether the operation layer uses the full operation knowledge base \(\textsf{KB}_{O}\) or its updated version \(\textsf{KB}_{O}^*\).

The calcLoss function, using the optimised models from each layer, constructs the non-linear function approximation from the highest-confidence predicates (Eqs. 12 and 13) and computes the loss on dataset \(D_{c,c}\) with the continuous target per Eq. (11), yielding \(Loss_{true}\). After calculating the loss, the non-linear approximation is stored in \({\textbf{F}}^*\).
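At a high level, the main loop of Algorithm 5 can be summarised as follows. This is a structural sketch only: train_transformation_layer, train_operation_layer and compile_function are hypothetical stand-ins for the dNL-NL training and rule-parsing steps, stubbed here so the control flow is runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the trained layers: each returns (model, accuracy).
def train_transformation_layer(x_j, kb_t):
    return {"kb": list(kb_t)}, rng.random()

def train_operation_layer(X, kb_o, t_models):
    return {"kb": list(kb_o)}, rng.random()

def compile_function(t_models, o_model):
    return lambda X: X[:, 0] * X[:, 1] ** 2   # placeholder for the parsed F*

def pipeline(X, y_c, kb_t, kb_o, beta=0.05, max_rounds=10):
    kb_t_star = {j: list(kb_t) for j in range(X.shape[1])}   # per-feature copy of KB_T
    for _ in range(max_rounds):
        results = {j: train_transformation_layer(X[:, j], kb_t_star[j])
                   for j in range(X.shape[1])}
        t_models = {j: m for j, (m, _) in results.items()}
        o_model, _ = train_operation_layer(X, kb_o, t_models)
        f_star = compile_function(t_models, o_model)
        loss = np.mean(np.abs(y_c - f_star(X)))               # Eq. (11)
        if loss < beta:
            break
        # Identify the worst-performing layer and shrink its KB before retrying;
        # pop() stands in for removing the spurious predicate found in its rules.
        worst = min(results, key=lambda j: results[j][1])
        if len(kb_t_star[worst]) > 1:
            kb_t_star[worst].pop()
    return f_star, loss

X = rng.random((100, 2)) * 10.0
f_star, loss = pipeline(X, X[:, 0] * X[:, 1] ** 2,
                        ["Sqr", "Exp", "Sin"], ["Add", "Sub", "Prod"])
print(loss)
```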

Algorithm 5 Pipeline

Fig. 2 High-level view of transformation and operation layers

5 Experiments

The capacity of the dNL-NL pipeline to extract non-linear functions \({\textsf{F}}^*\) is evaluated on synthetically generated datasets, where the datasets were generated by non-linear functions \({\textsf{F}}\). As dNL-NL represents a unique application of ILP-based frameworks, we do not include comparative models, but instead focus on the capacity of dNL-NL to extract the correct non-linear functions from the generated data. Ultimately, the goal is to extract non-linear rules using dNL-NL reasoning in order to construct a function that simulates the original (\({\textsf{F}}^*\sim {\textsf{F}}\)). With the following experiments, we demonstrate that the dNL-NL framework can be used to extract non-linear functions from data provided a relevant KB.

Prior to discussing the dNL-NL extracted rules, we briefly explain the syntax in the rule hypotheses. As seen in Fig. 3, each dNL-NL network produces fitted predicate functions associated with each class, class1() in the figure. The clausal body is defined by a conjunction of atoms where the associated membership weight is placed before the atom in brackets, extracted from the conjunction neuron. If the membership weight is above 0.95, i.e. the model is confident it should be in the definition, then the weight is not listed next to the atom. As we stack two disjunction neurons in our dNL-NL architecture, we can have at most two hypotheses for a given class. Again, the value in brackets before the rule represents the membership weights from the disjunction neuron. If no rule is listed then no prior membership weights in the associated conjunction vector were above a reasonable confidence threshold.

Fig. 3 Explanation of target predicate outputs

5.1 Performance on noisy data

While the primary focus of this proposal is to employ our dNL-NL architecture for non-linear function extraction, we also tested it on noisy data to illustrate its capacity for discerning algorithmic patterns. This assists in determining which non-linear transformations or operations between features are pertinent for classification.

We evaluated our dNL-NL architecture using the Yacht Hydrodynamics dataset, available on the UCI machine learning repository (Dheeru & Karra Taniskidou, 2017). In this case, we explored non-linear transformations across all features and then defined operation predicates based on the two features present within the class rules. For comparison, the baseline model was built on dNL, relying solely on continuous Boolean predicate functions.

The yacht hydrodynamics dataset encompasses a range of adimensional parameters, chiefly concerned with hull geometry coefficients and the influential Froude number. These parameters provide insights into the yacht’s hydrodynamic performance and its interactions with water. The parameters included in the dataset are the longitudinal position of the center of buoyancy, the prismatic coefficient, the length-displacement ratio, the beam-draught ratio, the length-beam ratio, and the Froude number. The target variable is the residuary resistance per unit weight of displacement, a vital metric as it quantifies the additional resistance a yacht encounters, beyond frictional resistance, due to its hull shape and other hydrodynamic factors. As the target feature is continuous, we use equal-frequency binning to parse the data into three discrete classes corresponding to the following output ranges: \(class1() \sim 0.001 \le {\textsf{F}}(x) < 1.286\), \(class2() \sim 1.286\le {\textsf{F}}(x) < 7.806\), and \(class3() \sim 7.806\le {\textsf{F}}(x) \le 62.42\).

In Table 3 we observe the performance of the baseline dNL model when using just continuous Boolean predicates. We note the accuracy is lower than that of our dNL-NL model using non-linear transformation predicates, see Table 4. In both cases, we observe that the Froude number and Beam-draught ratio features are prevalent in the clausal definitions, indicating that these two features are more significant in determining the classification.

Table 3 Each dNL learned rule for the discrete classes on the Yacht Hydrodynamics dataset

In Table 4, we further discern the specific non-linear transformations applied to each feature. Notably, the Beam-draught ratio feature predominantly undergoes the square function transformation. Meanwhile, the Froude number is subjected to a variety of transformations: sine, exponential, and square, with the exponential transformation being the most pronounced.

Table 4 Each dNL-NL learned rule for the discrete classes on the Yacht Hydrodynamics dataset, using non-linear transformation predicates

In Table 5, we focus on the two significant features: Beam-draught ratio and Froude number, aiming to discern the relevant operations between them. The Froude number is subjected to sine, exponential, and square transformations, while the Beam-draught ratio is primarily squared. We identify several operation predicates denoting specific operations. Notably, for class1(), class2(),  and class3(), the prevailing operation is \((Beamdraught^2 \times Froude^2)\). It is also worth noting that this approach outperforms the baseline dNL model.

Table 5 Each dNL-NL learned rule for the discrete classes on the Yacht Hydrodynamics dataset, using operation predicate functions

5.2 Learning two variable functions

We use the dNL-NL framework to extract non-linear equations from data synthetically generated on two continuous variables using various transformations and operations. We tested the proposed dNL-NL networks on 6 separate non-linear equations. Each equation contains a transformation on a continuous variable and a mathematical operation between the two transformed variables.

In Table 6, the value ranges on each variable used are continuous float values in [0, 10]. Each synthetically generated equation contains 200 instances. For the majority of equations, the continuous target was discretized into three classes (\(d = 3\)) using equal-width binning, and we set the number of boundaries to \(k=7\) for the continuous variables. The extracted predicates for Table 6 are the result of 5-fold cross-validation for each dNL-NL layer in the framework. We also take the average run time over 10 training sessions to demonstrate the speed in deriving the non-linear rules. From the results, we see that we were able to extract the correct predicates representing the non-linear relationship of the synthetic data, but in some cases the computation required a significant amount of time. This is notable with equation \(exp({X1})-(X2)^2\), where the time to completion was far larger than for the other equations. Similarly, \(sin(X1)-exp({X2})\) also had a long computation time. In both cases the equations share the subtraction operation, suggesting an area for investigation. We also note that the heuristic for updating the KB is another potential research avenue.

Table 6 Average computation time (in seconds) and results for extracting non-linear equations composed of two variables with float values between [0, 10]

To illustrate how the class rules are read, we explain the meaning of the definitions in the context of the example non-linear function (\(\mathbf {exp({X1}) \times (X2)^2}\)) and one of its classes, see Table 7. Each target predicate (class1(), class2(), class3()) represents an output range of the non-linear function \({\textsf{F}}\) which generated the data (ranges determined by equal-width binning), so the clausal body states which inequalities on the transformed features must hold for the non-linear function’s output to fall within a given range. The transformation layers operate on a single feature, so their rule definitions contain just one variable. For the transformation layer on X1, the output can be read as “class 3 is true if the exponentiation of variable X1 is greater than 17065.14 and the square of variable X1 is greater than 91.8”, and this rule, in conjunction with the rules for class1() and class2(), is accurate \(77.5\%\) of the time. For the transformation layer on X2, we could state “class 3 is true if the square of variable X2 is greater than 55.75 and the square of variable X2 is less than 72.94”.

In the operation layer, which defines the discrete classes based on operations between the two features, the key point is that the features have already been transformed using the best-performing transformations from the preceding transformation layers. The rules learned in the operation layer include operation atoms which describe the operation between the transformed variables. In Table 7, the variable X1 is transformed by the exponential function and X2 by the square function, resulting in rules that can be interpreted as “class 3 is true if the product of exp(X1) and \((X2)^2\) is greater than 754844.25, greater than 943555.38, and greater than 1132266.38”. In various rule definitions we observe redundant literals (e.g. \(X1X2Prod< 94355.38 \wedge X1X2Prod < 1132266.38\) for class2() of the operation layer). This is due to the stochastic training of the membership weights: some literals supersede others based on their inequalities, yet the redundant literals remain in the rule definition because the model is not penalised for including them. While they do not impact model performance, future research may consider a post-processing step that removes redundant literals for better interpretability. Note that the output in Table 7 has been propositionalised for clarity; for inference on unseen instances, the predicate logic format of the rules is used to input values of X1 and X2.
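As a simple illustration of such a post-processing step, the following sketch prunes dominated inequality literals from a rule body; the tuple representation of literals is a hypothetical choice made for the example.

```python
# Hypothetical post-processing sketch: keep only the tightest lower-bound (>)
# and upper-bound (<) literal per term in a rule body. Literals are represented
# here as (term, op, threshold) tuples purely for illustration.
def prune_redundant_literals(body):
    lower, upper = {}, {}
    for term, op, thr in body:
        if op == ">":
            lower[term] = max(lower.get(term, float("-inf")), thr)
        elif op == "<":
            upper[term] = min(upper.get(term, float("inf")), thr)
    return ([(t, ">", v) for t, v in lower.items()]
            + [(t, "<", v) for t, v in upper.items()])

# The class2() example above collapses to a single literal:
# prune_redundant_literals([("X1X2Prod", "<", 94355.38), ("X1X2Prod", "<", 1132266.38)])
# -> [("X1X2Prod", "<", 94355.38)]
```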

The original output from the various dNL-NL layers can be seen in Table 7, which depicts the resulting belief state for the equation (\(\mathbf {exp({X1}) \times (X2)^2}\)) after 100 iterations for each associated dNL-NL layer. The three discrete classes correspond to the following feature ranges, \(class1() \sim 1.69\times 10^{-2}\le {\textsf{F}}(x) < 4.43\times 10^5\), \(class2() \sim 4.43\times 10^5\le {\textsf{F}}(x) < 8.86\times 10^5\), and \(class3() \sim 8.86\times 10^5\le {\textsf{F}}(x) \le 1.33\times 10^6\). The table shows that the resulting rules were stronger at the operation layer than at the individual transformation layers. Within the transformation layers, the transformation on the second variable (\((X2)^2\)) was better able to define the discrete classification predicates, as indicated by its higher accuracy. Across the layers, the learned predicate rules for each class were still able to identify the correct transformations based on the confidence associated with each grounding. We do note that atoms in some of the definitions are not representative of the original non-linear function; for example, in the transformation layer for X2, the first definition for class1() is defined by bounded atoms of the sine function. The pipeline can still learn atoms which model the target predicate yet do not represent the true non-linear function, but taken as a whole across all definitions and layers, the pipeline is able to extract the relevant transformations and operations.

Table 7 Each dNL-NL layer’s associated learned rules, where rules are defined by extracted non-linear predicates for the function (\(exp(X_1)\times (X_2)^2\)) where the target was discretized into three classes and the variable space was floats in the range [0, 10]
Table 8 Each dNL-NL layer’s associated learned rules, where rules are defined by extracted non-linear predicates for the function (\(sin(X_1)+(X_2)^2\)) where the target was discretized into three classes and the variable space was floats in the range [0, 10]
Table 9 Each dNL-NL layer’s associated learned rules, where rules are defined by extracted non-linear predicates for the function (\(exp(X_1)\times sin(X_2)\)) where the target was discretized into three classes and the variable space was floats in the range [0, 10]
Table 10 Each dNL-NL layer’s associated learned rules, where rules are defined by extracted non-linear predicates for the function (\(exp(X_1)- (X_2)^2\)) where the target was discretized into three classes and the variable space was floats in the range [0, 10]
Table 11 Each dNL-NL layer’s associated learned rules, where rules are defined by extracted non-linear predicates for the function (\((X_1)^2 + exp(X_2)\)) where the target was discretized into three classes and the variable space was floats in the range [0, 10]
Table 12 Each dNL-NL layer’s associated learned rules, where rules are defined by extracted non-linear predicates for the function (\(sin(X_1) - exp(X_2)\)) where the target was discretized into three classes and the variable space was floats in the range [0, 10]

In Table 8 the dNL-NL layers extract the non-linear equation (\(\mathbf {sin({X1}) + (X2)^2}\)), again with features sampled from a uniform distribution between [0, 10]. The three discrete classes correspond to the following feature ranges, \(class1() \sim -0.92\le {\textsf{F}}(x) < 32.0\), \(class2() \sim 32.0\le {\textsf{F}}(x) < 64.9\), and \(class3() \sim 64.9\le {\textsf{F}}(x) \le 97.8\). The synthetic dataset again comprises 200 instances with three classes and 7 equal-width boundaries on the continuous predicates. From Table 8, we are again able to extract the correct non-linear equation. Note that the accuracy scores are 100 percent for the X2 transformation layer, indicating that the definitions for the three target classes in this layer entail all positive and negative examples successfully, even if not all atoms are relevant to the true non-linear function (e.g. \(X2Exp>3911.76\)). Also note that the transformation layer accuracy for X1 is rather poor, such that the definitions for the various classes do not model the discrete target predicates well. This can be attributed to the sine function’s periodic output, so that greater emphasis is given to the non-linear output of the second transformation \((X2)^2\) when modelling the discrete classes independently. As a whole, the pipeline is able to extract the non-linear function which simulates the data.

In Table 9 the dNL-NL layers extract the non-linear equation (\(\mathbf {exp({X1}) \times sin(X2)}\)). The three discrete classes correspond to the following feature ranges, \(class1() \sim -1.26\times 10^{4}\le {\textsf{F}}(x) < -2.06\times 10^3\), \(class2() \sim -2.06\times 10^3\le {\textsf{F}}(x) < 8.43\times 10^3\), and \(class3() \sim 8.43\times 10^3\le {\textsf{F}}(x) \le 1.89\times 10^4\). Note that the accuracy of the first transformation layer (exp(X1)) outperforms that of the second transformation layer, and that the second layer contains various irrelevant continuous transformation atoms. Even so, the model’s confidence in the sine function is far stronger across the classification predicates. In general, the periodic output of the sine function leads to specific challenges when trying to model the discrete ranges associated with the target predicates.

In Table 10 the dNL-NL layers extract the non-linear equation (\(\mathbf {exp({X1}) - (X2)^2}\)). The five discrete classes correspond to the following feature ranges, \(class1() \sim -84.6\le {\textsf{F}}(x) < 4.09\times 10^3\), \(class2() \sim 4.09\times 10^3\le {\textsf{F}}(x) < 8.26\times 10^3\), \(class3() \sim 8.26\times 10^3\le {\textsf{F}}(x) < 1.24\times 10^4\), \(class4() \sim 1.24\times 10^4\le {\textsf{F}}(x) < 1.66\times 10^4\), and \(class5() \sim 1.66\times 10^4\le {\textsf{F}}(x) \le 2.07\times 10^4\). Compared to the other experiments, extraction of this equation required five discrete target predicates; running the pipeline with three classification predicates resulted in the model failing to learn the true non-linear function. The accuracy for the second transformation layer (X2) is lower than that of the other layers, even though the model is confident in the presence of bounded atoms associated with the power function. The subtraction operation was also a challenge for the operation layer to extract, as seen by the computational time in Table 6.

In Table 11 the dNL-NL layers extract the non-linear equation (\(\mathbf {({X1})^2 + exp(X2)}\)). The three discrete classes correspond to the following feature ranges, \(class1() \sim 1.94\le {\textsf{F}}(x) < 6.93\times 10^3\), \(class2() \sim 6.93\times 10^3\le {\textsf{F}}(x) < 1.39\times 10^4\), and \(class3() \sim 1.39\times 10^4\le {\textsf{F}}(x) \le 2.08\times 10^4\). The transformation layers are able to extract the correct transformations, as seen by the various bounded predicates and their associated confidences. The transformation on feature X1 achieved a lower accuracy, and the difficulty in modelling the discrete classes can be seen in the various bounded continuous predicates in this layer’s definitions. The lower performance on X1 can likely be attributed to the perfect performance of the second transformation layer (X2).

In Table 12 the dNL-NL layers extract the non-linear equation (\(\mathbf {sin({X1}) - exp(X2)}\)). The three discrete classes correspond to the following feature ranges, \(class1() \sim -2.08\times 10^{4}\le {\textsf{F}}(x) < -1.39\times 10^4\), \(class2() \sim -1.39\times 10^4\le {\textsf{F}}(x) < -6.93\times 10^3\), and \(class3() \sim -6.93\times 10^3\le {\textsf{F}}(x) \le -3.29\times 10^{-1}\). Training is again notably slower, as seen in Table 6. The subtraction operation, which also led to slow training for Table 10, is difficult for the operation layer to extract efficiently; we therefore propose that future research investigate improving the handling of subtraction.

Future Research:

As indicated, there are areas for improvement in the dNL-NL layers and the overall pipeline. Specifically, regarding the logical reasoning of the dNL-NL layers, we found that extracting subtraction operations was computationally expensive, an issue potentially related to the logical structure of the continuous Boolean predicates. Periodic functions such as the sine function also posed challenges when modelling the target predicates, and we leave this as an area of investigation for future research.

A more notable direction for future investigation is the heuristics used to update the knowledge bases, which determine which continuous predicates are available to the dNL-NL layers. In the pipeline proposed here, we rely on a simple heuristic of removing predicates based on their layer’s accuracy scores. Future research might consider a modelling-based approach which uses the accuracy scores and the true loss to train a model for selecting the transformation and operation functions used for continuous predicate instantiation.
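One possible instantiation of such an accuracy-based heuristic is sketched below, where transformation predicates whose layer accuracy falls below a cut-off are dropped; the data structures and the threshold value are hypothetical, shown only to make the idea concrete.

```python
# A hypothetical instantiation of the accuracy-based KB-update heuristic: drop
# transformation predicates whose dNL-NL layer accuracy falls below a threshold
# before instantiating the operation layer. Structures and threshold are assumed.
def update_knowledge_base(kb: dict, layer_accuracy: dict, threshold: float = 0.5) -> dict:
    return {pred: defn for pred, defn in kb.items()
            if layer_accuracy.get(pred, 0.0) >= threshold}

# kb = {"X1Exp": ..., "X1Sin": ..., "X2Square": ...}
# layer_accuracy = {"X1Exp": 0.78, "X1Sin": 0.31, "X2Square": 1.0}
# update_knowledge_base(kb, layer_accuracy)   # drops "X1Sin"
```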

The architecture proposed here focuses on demonstrating the capacity of ILP-inspired models to extract non-linear functions from mixed discrete-continuous domains. Our experiments focus on the two-variable case as a proof of concept; a further extension would be to consider three or more variables. This extension would also need to factor in the heuristics for updating the knowledge base and the structuring of the transformation layers, explore applications where more complex rules are needed, and devise a comparison strategy relating the approach to other learning methods. To the best of our knowledge, the key difficulty in devising a comparison strategy is that little research is available on learning explainable non-linear models, and we may be among the first to do so.

Given that, future work could reconfigure our dNL-NL model for a direct comparison with Duvenaud et al. (2013). A prospective extension to our dNL-NL framework could involve transitioning from our transformation/operation Boolean predicate functions to kernel Boolean predicate functions. By leveraging compositional techniques to define structures via kernels, there is potential to incorporate the method of creating composite kernel spaces from sums and products of base kernels into dNL-NL networks. This could entail embedding kernel predicates while retaining our operational Boolean predicates, enabling a comprehensive and direct model comparison.
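To make the kernel-based extension more concrete, the snippet below shows, purely as an illustration outside our current implementation, how composite kernels are built from sums and products of base kernels using scikit-learn; each base kernel could, in principle, back a kernel Boolean predicate within a dNL-NL layer.

```python
# Illustrative only: composing base kernels through sums and products, in the
# spirit of Duvenaud et al. (2013). This is not part of the dNL-NL implementation.
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, DotProduct

# A composite kernel: (smooth x periodic) + linear structure.
composite_kernel = (RBF(length_scale=1.0)
                    * ExpSineSquared(length_scale=1.0, periodicity=3.0)
                    + DotProduct(sigma_0=1.0))
```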

While the experiments in this work focused on transformations of, and operations between, variables, future extensions could incorporate constant continuous Boolean predicates, where a constant multiplies a variable as seen in Eq. (14). Here the constant C could be a predetermined value provided by the modeller, or a trainable weight akin to the lower and upper bound values (\(u_{Cxi},l_{Cxi}\)).

$$\begin{aligned} {\mathcal {F}}_{gt^i_{Cx}} = \sigma (c((C \times x)-u_{Cxi})), \quad {\mathcal {F}}_{lt^i_{Cx}} = \sigma (-c((C \times x) -l_{Cxi})) \end{aligned}$$
(14)
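As a rough illustration, the membership functions of Eq. (14) could be implemented as in the sketch below; the choice of PyTorch, the fixed sharpness constant c, and the parameter initialisations are assumptions made for the example, not a description of our implementation.

```python
# A hedged sketch of the constant continuous Boolean predicates in Eq. (14):
# fuzzy membership values for "C*x > u" and "C*x < l", with C, u, and l treated
# as trainable weights and c as a fixed sigmoid sharpness (all choices assumed).
import torch
import torch.nn as nn

class ConstantBoundedPredicate(nn.Module):
    def __init__(self, c: float = 20.0):
        super().__init__()
        self.C = nn.Parameter(torch.tensor(1.0))  # multiplicative constant C
        self.u = nn.Parameter(torch.tensor(1.0))  # upper-bound weight u_Cxi
        self.l = nn.Parameter(torch.tensor(0.0))  # lower-bound weight l_Cxi
        self.c = c                                # sigmoid sharpness

    def forward(self, x: torch.Tensor):
        cx = self.C * x
        f_gt = torch.sigmoid(self.c * (cx - self.u))    # F_gt in Eq. (14)
        f_lt = torch.sigmoid(-self.c * (cx - self.l))   # F_lt in Eq. (14)
        return f_gt, f_lt
```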

6 Conclusion

We expand on a differentiable extension of ILP, the so-called differentiable Neural Logic (dNL) networks, by focusing on the extraction of non-linear rules from synthetic data generated by various non-linear functions. We provide a scheme for learning non-linear transformations of, and operations between, variables using the logical framework of dNL, and we assess the success of the derived logical rules by comparing them to the continuous target values. This work allows dNL networks to derive logical solutions in mixed discrete and continuous domains, and our results demonstrate that a pipeline of dNL networks can successfully extract non-linear functions from mixed discrete-continuous domains. As possible future extensions, we intend to investigate more extensive domain applications, such as those common to other scientific disciplines. As there is no current work of a comparable nature, future investigations will also look into possible model comparisons, as well as integration with reinforcement learning. Combining dNL-NL with reinforcement learning opens the way to applications in machine vision and possibly natural language processing.