1 Introduction

Logic-based machine learning (Muggleton, 1991; Law et al., 2019) learns interpretable knowledge expressed in the form of a logic program, called a hypothesis, that explains labelled examples in the context of (optional) background knowledge. Recent logic-based machine learning systems have demonstrated the ability to learn highly complex and noise-tolerant hypotheses in a data-efficient manner [e.g., Learning from Answer Sets (LAS) (Law et al., 2019)]. However, they require labelled examples to be specified in a structured logical form, which limits their applicability to many real-world problems. On the other hand, differentiable learning systems, such as (deep) neural networks, are able to learn directly from unstructured data, but they require large amounts of training data and their learned models are difficult to interpret (Gilpin et al., 2018).

Within neural-symbolic artificial intelligence, many approaches aim to integrate neural and symbolic systems with the goal of preserving the benefits of both paradigms (Besold et al., 2017; Garcez & Lamb, 2020). Most neural-symbolic integrations assume the existence of pre-defined knowledge expressed symbolically, or logically, and focus on training a neural network to extract symbolic features from raw unstructured data (Manhaeve et al., 2018; Yang et al., 2020; Serafini & d’Avila Garcez, 2016; Cohen, 2016; Riegel et al., 2020). In this paper, we introduce Feed-Forward Neural-Symbolic Learner (FFNSL), a neural-symbolic learning framework that assumes the opposite. Given a pre-trained neural network, FFNSL uses a logic-based machine learning system robust to noise to learn a logic-based hypothesis whose symbolic features are constructed from neural network predictions. The motivation is to enable logic-based machine learning systems to utilise pre-trained neural networksFootnote 1 to learn symbolic features from unstructured data, and use these features to learn interpretable knowledge needed to solve a downstream classification task. FFNSL preserves the benefits of both paradigms, increasing the scope of the tasks logic-based machine learning systems can be applied to. The challenge in performing such an integration is that neural networks are vulnerable to distributional shifts, where unstructured data belonging to a distribution different from that used for training often leads to incorrect predictions (Ovadia et al., 2019; Sensoy et al., 2018; Amodei et al., 2016). By using a logic-based machine learning system that is robust to noise, such as a LAS system, FFNSL is capable of learning robust logic-based hypotheses from examples generated from labelled unstructured data, which may contain incorrect or noisy features as a result of incorrect neural network predictions.

The novel aspect of our FFNSL framework is the Data-to-Knowledge (D2K) generator that bridges the neural and symbolic learning components. The D2K generator automatically constructs a symbolic representation of the features predicted from the unstructured data, and weights such knowledge with a level of truthfulness that reflects the confidence score of the neural network predictions. The symbolic features can then be used by the symbolic learning component to automatically generate weighted examples from which to learn general and interpretable knowledge needed to solve the given downstream task.

FFNSL is general enough to support the integration of any neural component capable of making discrete predictions from unstructured data (binary or multi-class classification), with any logic-based machine learning system capable of learning from noisy examples. In this paper, we present four instances of our framework, where the LAS systems, ILASP (Law, 2018) and FastLAS (Law et al., 2020), are used as the symbolic learning component, and different neural network architectures are used as the neural component. The LAS systems have been shown to learn optimal hypotheses from noisy examples (Law et al., 2018), and to be suitable for different forms of symbolic learning tasks. In these systems, a noisy example includes a weight, which defines the penalty paid by a hypothesis for not covering that example. FFNSL interprets this weight as a level of certainty of the example, and computes it using the confidence score of the related neural network predictions. In this way, the LAS systems become biased towards learning a hypothesis that has minimal penalty, i.e., a hypothesis that covers examples generated from high confidence neural network predictions (examples with high weights). 
For each proposed instance of our FFNSL framework, we investigate: (1) whether FFNSL can learn an accurate and interpretable hypothesis from incorrect feature predictions of the neural component, (2) how robust the learned hypothesis is in the presence of distributional shifts applied to an increasing percentage of the unstructured data, (3) the impact of using an uncertainty-aware neural network component that provides more robust confidence estimates when distributional shifts are applied to the unstructured data, and (4) how FFNSL performs in comparison to other hybrid systems where the same pre-trained neural networks, used for predicting features from the unstructured data, are integrated with a random forest and deep neural networks trained to learn the knowledge required to solve the downstream task.

To evaluate our FFNSL framework, we use four neural-symbolic classification tasks, one for each proposed instance.Footnote 2 Firstly, the Follow Suit Winner task is a card game where 4 players each play a card and the goal is to predict the winning player. In order to solve the task, the neural network predicts the rank and suit of the playing card images and the rules of the game are learned as symbolic knowledge, where the winner is the player that plays the highest ranked card with the same suit as player 1. The second task is Sudoku Grid Validity classification, which consists of observing a sequence of images of handwritten MNIST digits, corresponding to the digits in a Sudoku grid, and predicting if the grid is valid or not. The neural network classifies each digit and the symbolic knowledge required to be learned is the definition of valid (or invalid) Sudoku grids. The final tasks are Crop Yield Prediction and Indoor Scene Classification, which demonstrate the applicability of FFNSL to real-world problems and datasets. The Crop Yield Prediction task requires predicting the quality of crop yield from an image containing potentially diseased crops, where the neural network predicts the crop’s species and disease status, and the learned symbolic knowledge predicts the quality of yield. In the Indoor Scene Classification task, the neural network is pre-trained to predict the scene class from an image, and the learned symbolic knowledge maps scene classes to high-level super-classes. In the first task, the neural network is pre-trained on images of playing cards from a standard deck, but our FFNSL framework is applied on card images subject to distributional shifts, where a percentage of standard card images are replaced with images from alternative card decks. 
In the second task, the neural network is pre-trained on the standard MNIST dataset, and our FFNSL system is applied on an out-of-distribution MNIST dataset generated by rotating MNIST digits \(90^\circ \) clockwise. In the Crop Yield Prediction task, we pre-train the neural network on the Plant Village dataset (Hughes & Salathé, 2015), and apply distributional shifts using a hue filter. Finally, in the Indoor Scene Classification task, we adopt a neural network model pre-trained on the MIT Indoor Scene dataset (Quattoni & Torralba, 2009), and apply distributional shift using blur, hue, and rotation filters.

Our evaluation demonstrates that FFNSL outperforms the baselines on all four tasks. The hypotheses learned from unstructured data, subject to distributional shifts, are more interpretable and more accurate than those learned by the random forest and deep neural networks even when these baselines are trained with significantly more data. We have also evaluated the robustness of the FFNSL instances when applied to a test set that is also subject to distributional shifts. The results show that FFNSL outperforms the baselines, trained with the same amount of unstructured data, when up to \(\sim \)80% of the test set is subject to distributional shifts.

The paper is structured as follows. Section 2 provides necessary background material on the LAS framework, alongside further discussion of the drawbacks of the standard neural network Softmax layer for providing robust confidence estimates, and details of the uncertainty-aware neural networks used in this paper. Section 3 presents our general FFNSL framework, followed by four instances discussed in detail in Sect. 4. We introduce our evaluation methodology in Sect. 5 and present the results of each FFNSL instance on the Follow Suit Winner and Sudoku Grid Validity tasks in Sects. 6 and 7 respectively, followed by the Crop Yield Prediction and Indoor Scene Classification tasks in Sect. 8. Related work is discussed in Sect. 9, and Sect. 10 concludes the paper.

2 Background

This section provides an overview of the LAS framework and the neural network approaches used in FFNSL. We discuss the difference between confidence estimates of uncertainty-aware neural networks versus that of the standard Softmax layer, when applying these trained networks to out-of-distribution data. This is particularly relevant to our FFNSL framework, as FFNSL relies upon neural network predictions and their confidence scores to learn interpretable knowledge for solving a downstream task.

2.1 Learning from answer sets

LAS (Law et al., 2019) is a logic-based machine learning approach that extends the field of Inductive Logic Programming (ILP) (Muggleton, 1991) with systems ILASP (Law, 2018) and FastLAS (Law et al., 2020). ILASP and FastLAS are capable of learning interpretable knowledge, expressed in the language of Answer Set Programming (ASP) (Gelfond & Kahl, 2014), from noisy labelled examples in an effective and scalable manner. Typically, an ASP program includes four types of rules: normal rules, choice rules, and hard and weak constraints. In this paper, we consider ASP programs composed of normal rules only.Footnote 3 A normal rule is of the form \({\mathtt {h {{\,\mathrm{\mathtt {:-}}\,}}b_1,\ldots , b_n, {{\,\mathrm{\,\texttt{not}\,}\,}}c_{1},\ldots ,{{\,\mathrm{\,\texttt{not}\,}\,}}c_{m}}}\), where \({\texttt{h}}, {\mathtt {b_1}},\ldots , {\mathtt {b_n}}, {\mathtt {c_{1}}},\ldots ,{\mathtt {c_{m}}}\) are atoms, “\(\texttt{not}\)” is negation as failure, \({\texttt{h}}\) is the head of the rule and \({\mathtt {b_1,\ldots , b_n, {{\,\mathrm{\,\texttt{not}\,}\,}}c_{1},\ldots , {{\,\mathrm{\,\texttt{not}\,}\,}}c_{m}}}\) is the body of the rule. The Herbrand Base of an ASP program P, denoted \(HB_P\), is the set of ground (variable free) atoms that can be formed from predicates and constants in P. Subsets of \(HB_P\) are called interpretations of P. The semantics of an ASP program P is defined in terms of answer sets: a subset, denoted as AS(P), of all interpretations of P that satisfy every rule in P. Given an answer set A, a ground normal rule is satisfied if its head is satisfied by A whenever all positive atoms and none of the negated atoms of its body are in A, that is, when the body is satisfied. A partial interpretation, \(e_{\textrm{pi}}\), is a pair of sets of ground atoms \(\left\langle e^{\textrm{inc}}_{\textrm{pi}}, e^{\textrm{exc}}_{\textrm{pi}} \right\rangle \), called the inclusion and exclusion sets respectively.
An interpretation I extends \(e_{\textrm{pi}}\) iff \(e_{\textrm{pi}}^{\textrm{inc}} \subseteq I\) and \(e_{\textrm{pi}}^{\textrm{exc}} \cap I = \emptyset \).
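As a minimal illustration of this extension check, the following Python sketch encodes ground atoms as strings and a partial interpretation as a pair of sets (an encoding assumed here for illustration only):

```python
# Sketch: ground atoms encoded as strings (an assumption for illustration);
# a partial interpretation e_pi is a pair (inc, exc) of sets of ground atoms.
def extends(interpretation, inc, exc):
    """I extends e_pi iff inc is a subset of I and exc is disjoint from I."""
    return inc <= interpretation and not (exc & interpretation)

I = {"winner(2)", "plays(1, card(10, hearts))"}
print(extends(I, {"winner(2)"}, {"winner(1)"}))  # True: inclusion atom present, exclusion atoms absent
print(extends(I, {"winner(1)"}, set()))          # False: inclusion atom missing
```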

In the LAS framework, labelled examples are specified as Context-Dependent Partial Interpretations (CDPIs). A CDPI example e is a pair \(\langle e_{\textrm{pi}}, e_{\textrm{ctx}}\rangle \), where \(e_{\textrm{pi}}\) is a partial interpretation and \(e_{\textrm{ctx}}\) is an ASP program called the context of e. An ASP program P is said to accept e if there is at least one answer set A of \(P \cup e_{\textrm{ctx}}\) that extends \(e_{\textrm{pi}}\). Essentially, a CDPI states that a learned program P, together with \(e_{\textrm{ctx}}\), should bravely entailFootnote 4 all inclusion atoms and none of the exclusion atoms of e. When a CDPI example is noisy, that is, the truthfulness of its context and/or partial interpretation is not guaranteed, it has a weight or penalty assigned to it, in the form of a positive integer. A Weighted Context-Dependent Partial Interpretation (WCDPI) is therefore a CDPI weighted with a penalty. It is formally defined as a tuple \(e=\langle e_{\textrm{id}}, e_{\textrm{pen}}, e_{\textrm{pi}}, e_{\textrm{ctx}} \rangle \) where \(e_{\textrm{id}}\) is a unique identifier of e, \(e_{\textrm{pen}}\) is the penalty of e, and \(e_{\textrm{pi}}\) and \(e_{\textrm{ctx}}\) represent a CDPI. A LAS system that is noise-tolerant learns an ASP program H, called a hypothesis, from WCDPI examples. If a hypothesis H does not accept a WCDPI example, we say that it pays the penalty of that example. Informally, penalties are used to calculate the cost associated with a hypothesis for not covering examples. The cost function of a hypothesis H is the sum over the penalties of all of the examples that are not covered by H, augmented with the length of the hypothesis. A LAS learning task with noisy examples consists of an ASP program denoting background knowledge, a hypothesis space defined by a language bias,Footnote 5 expressing the set of rules that can be used to construct a solution of the task, and a set of WCDPI examples.
The goal of such a task is to find a hypothesis H in the hypothesis space that minimises a cost function with respect to a given set of noisy examples. This is formally defined below, adapted from Law (2018).

Definition 1

An \(\textrm{ILP}^{\textrm{noise}}_{\textrm{LAS}}\) task T is a tuple \(T=\langle B, S_M, E\rangle \), where B is an ASP program, \(S_M\) is a hypothesis space, and E is a set of WCDPIs. Given a hypothesis \(H \subseteq S_M\),

  1. \(\textrm{UNCOV}(H, T)\) is the set consisting of all examples \(e \in E\) such that \(B \cup H\) does not accept e.

  2. The penalty of H, denoted as \(\textrm{PEN}(H, T)\), is the sum \(\sum _{e \in \textrm{UNCOV}(H, T)} e_{\textrm{pen}}\).

  3. The score of H, denoted as \({\mathcal {S}}(H, T)\), is calculated as \(\vert H\vert + \textrm{PEN}(H, T)\).

  4. H is an optimal inductive solution of T if and only if \(\not \exists H' \subseteq S_M\) such that \({\mathcal {S}}(H', T) < {\mathcal {S}}(H, T)\).

ILASP and FastLAS are two state-of-the-art systems capable of solving an \(\textrm{ILP}^{\textrm{noise}}_{\textrm{LAS}}\) task. The optimisation function used by both systems aims at learning a hypothesis H that jointly minimises the total penalty paid for the uncovered examples and its length. In practice, this creates a bias towards shorter, and therefore more general solutions that cover examples with a high penalty value.
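The cost function of Definition 1 can be sketched as follows, under stated assumptions: examples are dictionaries with a `pen` field, and an opaque predicate `accepts` stands in for the check that \(B \cup H\) accepts an example (in practice this requires an ASP solver).

```python
# Sketch of Definition 1's cost function, with `accepts(e)` as a stand-in
# for the (solver-based) check that B ∪ H accepts example e.
def score(hypothesis_length, examples, accepts):
    uncovered = [e for e in examples if not accepts(e)]   # UNCOV(H, T)
    penalty = sum(e["pen"] for e in uncovered)            # PEN(H, T)
    return hypothesis_length + penalty                    # S(H, T) = |H| + PEN(H, T)

examples = [{"id": "e1", "pen": 95}, {"id": "e2", "pen": 10}]
# Suppose a hypothesis of length 3 covers e1 but not e2:
print(score(3, examples, lambda e: e["id"] == "e1"))  # 13
```

Note how an uncovered high-weight example (here e1, weight 95) would dominate the score, which is exactly the bias towards covering high-confidence examples described above.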

2.2 Uncertainty-aware neural networks

Our FFNSL framework relies on pre-trained neural networks to extract symbolic features from unstructured data. The neural network prediction and its confidence score may therefore affect the accuracy of a learned hypothesis. In this paper, we consider two different types of neural networks as FFNSL neural components: a standard Convolutional Neural Network (CNN) that uses a Softmax layer, and an uncertainty-aware CNN that provides more robust confidence estimates when given data outside the training distribution.

Uncertainty can be formulated as either aleatoric or epistemic uncertainty (Hüllermeier & Waegeman, 2021; Pearce et al., 2021). In a machine learning classification task, aleatoric uncertainty can be thought of as the uncertainty along the class decision boundary, whereas epistemic uncertainty can be thought of as whether the sample falls into any of the classes at all. The confidence estimates output by a neural network Softmax layer in a classification task often only capture aleatoric uncertainty, as these outputs are based on a single probability distribution over a set of classes squashed into real values between 0 and 1. For example, given neural network output logits \({\varvec{l}}\) and k possible classes, the Softmax output \(\sigma ({\varvec{l}})\) for class i, where \(1\le i\le k\), is calculated as:

$$\begin{aligned}\sigma ({\varvec{l}})_{i}=\frac{{\varvec{e}}^{l_i}}{\sum _{j=1}^{k}{\varvec{e}}^{l_j}}\end{aligned}$$

where \({\varvec{e}}=2.71828...\) is Euler's number.Footnote 6 There are three challenges with this approach in terms of uncertainty quantification. Firstly, the exponent applied to the neural network outputs inflates the confidence estimate. Secondly, as the Softmax output is a point-wise, multinomial distribution, it is only possible to compare the confidence of the predicted class among other classes, as opposed to estimating the predictive distribution variance (Sensoy et al., 2018). Finally, when Softmax is paired with the commonly used cross-entropy loss, the network is only trained to minimise prediction error, as opposed to expressing uncertainty robustly.
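To make the first point concrete, the following sketch computes the Softmax formula above on hypothetical logits, showing that a modest logit gap already yields a near-certain confidence score:

```python
import math

def softmax(logits):
    """Softmax over a list of logits, as in the equation above."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([4.0, 1.0, 0.0])          # hypothetical logits
print(round(probs[0], 2))  # 0.94: a logit gap of 3 already looks near-certain
print(round(sum(probs), 2))  # 1.0: a single point-wise distribution, no variance estimate
```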

To address these challenges, many techniques have been proposed in the literature (Rasmussen, 2003; Mackay, 1995; Blundell et al., 2015; Abdar et al., 2021). In this paper we consider the EDL-GEN (Sensoy et al., 2020) approach, which is a neural network based on generative models of  Evidential Deep Learning (EDL) systems (Sensoy et al., 2018) that have been shown to achieve state-of-the-art performance in handling epistemic uncertainty. An EDL (Sensoy et al., 2018) system replaces the Softmax layer in a neural network with a linear layer that represents the parameters of a Dirichlet distribution, a second-order distribution that inherently models the variance of a predictive distribution as opposed to the single point-wise output provided by Softmax. It then uses a new loss function that jointly minimises prediction error and the variance of the Dirichlet distribution, to reduce aleatoric uncertainty on the class decision boundary. EDL-GEN (Sensoy et al., 2020) extends this approach to also capture epistemic uncertainty by firstly treating the output of each class as a binary decision and secondly, using a variational auto-encoder to automatically generate out-of-distribution samples for training, in order to help the network discriminate between samples within and outside the training distribution.

To better understand how the uncertainty estimation of neural network predictions impacts the overall accuracy of our FFNSL framework, we analyse in the evaluation Sects. 6 and 7, the predicted confidence scores generated by a standard CNN with a Softmax layer and an EDL-GEN neural network and evaluate how they affect the accuracy of FFNSL when increasing percentages of input training data are subject to distributional shifts.

3 FFNSL framework

In this section we present our general FFNSL framework. It consists of three components, a pre-trained neural network, a symbolic (logic-based) learning system and a D2K generator that bridges the neural and symbolic learning components. It takes as input a dataset D of labelled (sequences of) unstructured data, alongside a background knowledge B (if any) and a search space \(S_{M}\). The output is a hypothesis H in the search space \(S_{M}\) (\(H\subseteq S_{M}\)), that predicts the labels of (sequences of) unstructured data. An overview of the FFNSL architecture is presented in Fig. 1.

Fig. 1  FFNSL architecture and data flow generated for a single data point \(\langle {\varvec{x}},y\rangle \), where \({\varvec{x}}\) is a sequence of images and y is a label for the sequence. B is the background knowledge, \(S_{M}\) is the hypothesis search space and H is the learned hypothesis. In practice, the architecture is applied on a set of data points from which the D2K generator produces a set of symbolic examples passed in as input to the symbolic learner.

We now define each of the three components of our FFNSL architecture. Let us assume that the training dataset D is given by a finite set \(D=\{\langle {\varvec{x}}_{w}, y_{w}\rangle \mid 1 \le w\le \vert D\vert \}\). The downstream task is a classification task where the objective is to predict the target label \(y\!\in \! {\mathcal {Y}}\) given a sequence of unstructured data \({\varvec{x}}\!\in \! {\mathcal {X}}_{1}\!\times \!\ldots \!\times \! {\mathcal {X}}_{n}\). Note that \({\mathcal {X}}_{i}\) could refer to different types of unstructured inputs and the sequence could also contain only a single input. The neural component of FFNSL contains up to n pre-trained neural network(s).Footnote 7 Each neural network \(g_{i}:{\mathcal {X}}_{i}\rightarrow [0,1]^{k_{i}}\) returns a vector denoting relative assignment to \(k_{i}\) possible classes for an unstructured input \(x_{i} \in {\varvec{x}}\). Each possible class \(z_{i}\in \left\{ 1,\ldots ,k_{i}\right\} \) represents a set of symbolic feature and value pairs from a given set \(F_{g_{i}}\) of symbolic feature mappings associated with the neural network \(g_{i}\). For example, in the Follow Suit Winner task, \(F_{g_{i}}\) contains all possible suit and rank values corresponding to the possible predictions of the neural network, when given an image of a playing card.

The second component of FFNSL is the D2K generator that outputs a symbolic representation of the sequence of neural network predictions, together with an aggregated confidence value. Specifically, for a given sequence of unstructured data \({\varvec{x}}=\langle x_{1},\ldots ,x_{n}\rangle \), the D2K generator takes each neural network output \(g_{i}(x_{i})\), and computes the corresponding prediction \(z_{i}\). Each \(z_{i}\), for \(1\le i\le n\), is obtained by using the standard “arg max” function, i.e., the class with the maximum confidence score:

$$\begin{aligned}z_{i} = {\arg \max }_{j\in \{1,\ldots ,k_{i}\}} (g_{i}(x_{i})[j])\end{aligned}$$

The D2K generator then uses the set \(F_{g_{i}}\), associated with \(g_{i}\) and generates the set \(f^{z_{i}}_{g_{i}}\subseteq F_{g_{i}}\) of symbolic feature and value pairs corresponding to the prediction \(z_{i}\). As an example, in the Follow Suit Winner task, \(z_{i}\) is an identifier for one of 52 playing cards, and \(f^{z_{i}}_{g_{i}}\) contains two feature and value pairs, one for the suit, and one for the rank of the card \(z_{i}\). The D2K generator also generates a set \(l_{i}\) of pairs containing additional symbolic meta-data, associated with each input \(x_{i}\). Again, each pair in \(l_{i}\) contains a name and a value. In the Follow Suit Winner task, \(l_{i}\) contains one pair indicating which player played the card \(z_{i}\). The generated set of tuples \(\{\langle m_{g_{i}}, f^{z_{i}}_{g_{i}}, l_{i} \rangle \mid x_{i} \in {\varvec{x}}\}\), where \(m_{g_{i}}\) is a unique identifier for the neural network \(g_{i}\), defines the symbolic features extracted from a sequence of unstructured data \({\varvec{x}}\), based on the neural network predictions. Finally, the D2K generator computes an aggregated confidence value \(W({\varvec{x}})\) for the generated symbolic features, representing the combined confidence scores of the neural network predictions:

$$\begin{aligned} W({\varvec{x}}) = \textrm{min}(\left\{ g_{i}(x_{i})[z_{i}] \mid x_{i} \in {\varvec{x}}\right\} ) \end{aligned}$$
(1)

\(W({\varvec{x}})\) is a generalisation of the binary Gödel t-norm used in fuzzy logic to encode fuzzy conjunctions (Metcalfe et al., 2008). So, given a sequence of unstructured inputs \({\varvec{x}}=\langle x_{1},\ldots ,x_{n}\rangle \) and the predicted vector \(\langle g_{1}(x_1)[z_1],\ldots ,g_{n}(x_n)[z_{n}]\rangle \) from the neural network, the output of the D2K generator is formally defined as:

$$\begin{aligned} D2K({\varvec{x}}) = \langle W({\varvec{x}}), \left\{ \langle m_{g_{i}}, f^{z_{i}}_{g_{i}}, l_{i} \rangle \mid x_{i} \in {\varvec{x}} \right\} \rangle \end{aligned}$$
(2)

A pseudo-code implementation of the D2K generator is presented in Algorithm 1. Note that some aspects are task specific, such as the set of feature value pairs \(f^{z_{i}}_{g_{i}}\), and meta-data \(l_{i}\). These are left general in Algorithm 1, and specified in more detail for each task in Sect. 4.

Algorithm 1  Pseudo-code of the D2K generator
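As a complement to the pseudo-code, the following Python sketch implements Equations 1 and 2. The data structures are assumptions made for illustration: `outputs` holds one confidence vector per unstructured input, `feature_maps` pairs each network identifier m_g with its mapping F_g from class index to feature/value pairs, and `meta` holds the meta-data l_i for each input.

```python
# Sketch of the D2K generator (Eqs. 1 and 2); data structures are assumptions.
def d2k(outputs, feature_maps, meta):
    preds = [max(range(len(o)), key=o.__getitem__) for o in outputs]   # arg max per input
    w = min(o[z] for o, z in zip(outputs, preds))                      # Eq. 1: Gödel t-norm
    feats = [(m_g, f_g[z], l)                                          # Eq. 2: symbolic tuples
             for (m_g, f_g), z, l in zip(feature_maps, preds, meta)]
    return w, feats

# Two hypothetical card inputs with two classes each:
F = {0: {("rank", "10"), ("suit", "hearts")}, 1: {("rank", "jack"), ("suit", "hearts")}}
outputs = [[0.95, 0.05], [0.08, 0.92]]
meta = [{("player", 1)}, {("player", 2)}]
w, feats = d2k(outputs, [("card", F), ("card", F)], meta)
print(w)  # 0.92: the least confident prediction bounds the example's weight
```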

The third component of our FFNSL framework is a symbolic logic-based machine learning system. For each labelled unstructured data point \(\langle {\varvec{x}}, y\rangle \in D\), the symbolic learning system takes as input \(D2K({\varvec{x}})\) and the label y, and generates a weighted symbolic labelled example denoted as the tuple \(\langle W^{\prime }({\varvec{x}}), e_{\langle {\varvec{x}}, y\rangle }\rangle \) where \(W^{\prime }({\varvec{x}})\) is a penalty for the example, calculated from the aggregated confidence score \(W({\varvec{x}})\), and \(e_{\langle {\varvec{x}}, y\rangle }\) is a labelled example. The syntactic form of \(e_{\langle {\varvec{x}}, y\rangle }\) and the calculation of the penalty \(W^{\prime }({\varvec{x}})\) depend on the specific symbolic learning system used in the instantiation of the framework. In Sect. 4 we present two specific instances of FFNSL where the symbolic learning systems are LAS systems and we show how weighted symbolic labelled examples are defined as WCDPI examples. We denote with E the set of weighted symbolic labelled examples defined by the symbolic learning system for all \(\langle {\varvec{x}}, y\rangle \in D\). A symbolic learning task \(T=\langle B, S_{M}, E\rangle \) is then generated, where B and \(S_{M}\) are respectively the background knowledge and a search space given as input to FFNSL. The symbolic learner then computes an optimal solution H for this task T as the output of FFNSL.

Formally, an FFNSL learning task is a tuple \(T=\langle B, S_{M}, D\rangle \) where D is a set of labelled unstructured data, B is a set of optional background knowledge and \(S_{M}\) is a search space of possible solutions for T. A hypothesis \(H\subseteq S_{M}\) is an inductive solution of T if and only if H is an optimal inductive solution of the symbolic learning task \(\langle B, S_{M}, E\rangle \), where E is the set of weighted symbolic labelled examples automatically generated relative to the given set D. In the next section, we present four specific instances of our FFNSL framework and give specific examples of the components described here.

4 FFNSL with LAS systems

The generality of our FFNSL framework allows it to be instantiated differently, using alternative neural and/or symbolic learning components, depending on the nature of the classification task at hand. We have considered four different classification tasks, called Follow Suit Winner, Sudoku Grid Validity, Crop Yield Prediction and Indoor Scene Classification respectively. The first requires the learning of concepts that are not directly observed in the labels, but linked to the labels through the background knowledge, whereas the other tasks require the learning of concepts that define the classification label. Because of these different types of symbolic learning, we consider instantiations of our FFNSL framework with different LAS systems. In what follows, we introduce these tasks and their datasets, define the respective FFNSL learning tasks, and describe in more detail the FFNSL instances we have implemented to solve these tasks. Firstly, let us define the weighted symbolic labelled examples within a LAS system, based on the output from the D2K generator. Essentially, the predicted symbolic features and meta-data define the context of a LAS example, represented as a conjunction of facts, and the aggregated confidence score \(W({\varvec{x}})\) is used to calculate the associated weight penalty:

$$\begin{aligned} W^{\prime }({\varvec{x}}) = \lfloor 100 \times W({\varvec{x}})\rfloor + 1 \end{aligned}$$
(3)

which converts \(W({\varvec{x}})\) to an integer \(W^{\prime }({\varvec{x}})\!>\!0\) as required by the LAS systems. Given the output generated by D2K, a LAS system constructs a weighted symbolic labelled example \(\langle W^{\prime }({\varvec{x}}), e_{\langle {\varvec{x}}, y\rangle }\rangle \) as a WCDPI of the form \(\langle e_{\textrm{id}}, e_{\textrm{pen}}({\varvec{x}}), e_{\textrm{pi}}(y), e_{\textrm{ctx}}({\varvec{x}})\rangle \), where \(e_{\textrm{id}}\) is a unique identifier, \(e_{\textrm{pen}}({\varvec{x}}) = W^{\prime }({\varvec{x}})\), \(e_{\textrm{pi}}(y)\) is the partial interpretation \(\langle \{y\}, {\mathcal {Y}}\setminus \{y\}\rangle \), defined in terms of the label y and its domain \({\mathcal {Y}}\), and the context \(e_{\textrm{ctx}}({\varvec{x}})\) is a conjunction of facts created from the predicted symbolic features and meta-data. The components \(e_{\textrm{pi}}(y)\) and \(e_{\textrm{ctx}}({\varvec{x}})\) together constitute the labelled example \(e_{\langle {\varvec{x}}, y\rangle }\).
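A direct transcription of Equation 3 illustrates that even a zero-confidence prediction yields the strictly positive integer weight the LAS systems require:

```python
import math

def example_penalty(w):
    """Eq. 3: map an aggregated confidence W(x) in [0, 1] to an integer weight > 0."""
    return math.floor(100 * w) + 1

print(example_penalty(0.80))  # 81
print(example_penalty(0.0))   # 1: zero confidence still yields a positive weight
print(example_penalty(1.0))   # 101
```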

Given a set \(E^{\prime }\) of WCDPIs, a background knowledge B, and a search space \(S_{M}\), a hypothesis \(H \subseteq S_{M}\) is learned such that H is an optimal inductive solution of the task \(T^{\textrm{noise}}_{\textrm{LAS}}=\langle B, S_{M}, E^{\prime }\rangle \). Let us now present the tasks used in our evaluation, alongside examples of each instantiated FFNSL component.

4.1 Follow suit winner

This is a classification task where 4 players each play 1 card and the goal is to predict the winning player. The symbolic knowledge required to solve the task defines the rules of the game, that is the winner is the player that plays the highest ranked card with the same suit as player 1. Each \(\langle {\varvec{x}}, y\rangle \in D\) is composed of a sequence \({\varvec{x}}\) of 4 card images corresponding to the cards played by players \(1,\ldots ,4\), and a label \(y\in \{1,2,3,4\}\) denoting the player who wins the 4 card trick.

Let us assume a sequence \({\varvec{x}}\) which contains images of the cards 10 of hearts, jack of hearts, 4 of clubs and 8 of spades, played by players 1, 2, 3 and 4 respectively. For this trick, the ground truth label is \(y=2\), indicating that player 2 is the winner since player 2 has played the highest ranked card with the same suit as player 1. Since the unstructured inputs in the sequence \({\varvec{x}}\) are of the same type (i.e., card images), FFNSL can simply use a single neural network g pre-trained to predict the features of a card image, that is, the rank and suit of each card. Therefore, g has two associated symbolic features rank and suit, with values \(\{2,\ldots ,10, jack, queen, king, ace\}\) and \(\{hearts, clubs, spades, diamonds\}\) respectively. For each input \(x_{i}\), there are 52 possible predictions, one for each combination of rank and suit, i.e., \(g:{\mathcal {X}}\rightarrow [0,1]^{52}\), where \({\mathcal {X}}\) is the set of possible card images. g has an associated feature value mapping \(F_{g}\) which gives, for each card prediction \(z_{i}\in \{1,\ldots ,52\}\), a unique set of two pairs, each containing a feature and value, i.e., \(f^{z_{i}}_{g}=\{\langle rank, \nu _{rank}\rangle , \langle suit, \nu _{suit}\rangle \}\), where \(\nu _{rank}\) is one of the 13 rank values and \(\nu _{suit}\) is one of the 4 suit values. Furthermore, each input \(x_{i}\) also has associated symbolic meta-data \(l_{i}=\{\langle player, \nu _{player}\rangle \}\) where \(\nu _{player}\in \{1,2,3,4\}\) indicates the player that has played card \(x_{i}\).
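The text leaves the concrete encoding of \(F_{g}\) implicit; one plausible decoding, chosen here as an assumption consistent with the card identifiers used in Example 1 (e.g., prediction 10 denotes the 10 of hearts), is:

```python
# Hypothetical decoding of F_g: a prediction z in {1, ..., 52} maps to one
# rank/suit pair. The orderings below are an assumption for illustration,
# chosen to agree with the card identifiers in Example 1.
RANKS = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10", "jack", "queen", "king"]
SUITS = ["hearts", "clubs", "spades", "diamonds"]

def f_g(z):
    return {("rank", RANKS[(z - 1) % 13]), ("suit", SUITS[(z - 1) // 13])}

print(sorted(f_g(10)))  # [('rank', '10'), ('suit', 'hearts')]
print(sorted(f_g(34)))  # [('rank', '8'), ('suit', 'spades')]
```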

We instantiate our FFNSL framework as follows. Given a sequence \({\varvec{x}}\) of 4 card images, the neural component of FFNSL generates 4 vectors \(g(x_{i})\), where \(1\le i\le 4\). The D2K component generates for each \(x_{i}\), the card prediction \(z_{i}\) and its corresponding symbolic features and meta-data, thus computing the tuple \(D2K(x_{i})=\langle {\texttt{card}}, f^{z_{i}}_{g},l_{i}\rangle \), where \({\texttt{card}}\) is the identifier for the network g (i.e., \(m_{g}={\texttt{card}}\)).

Example 1

Consider the sequence \({\varvec{x}}\) described above and \(y=2\). Let us assume that the neural network g computes the outputs \(g(x_{1}),\ldots ,g(x_{4})\) from which the D2K generator generates the correct card predictions \(z_{1}=10\), \(z_{2}=11\), \(z_{3}=17\), and \(z_{4}=34\). Let us also assume the neural network confidence scores for these predictions are:

$$\begin{aligned} \begin{array}{llll} g(x_{1})[z_{1}] = 0.95;&g(x_{2})[z_{2}] = 0.92;&g(x_{3})[z_{3}] = 0.80;&g(x_{4})[z_{4}] = 0.94; \end{array} \end{aligned}$$

\(D2K({\varvec{x}})\) is given by the following tuple:

$$\begin{aligned} \begin{array}{lll} D2K({\varvec{x}}) = &{} \langle 0.80, \{ &{}\langle {\texttt{card}}, \{ \langle {\texttt{rank}}, {\texttt{10}} \rangle , \langle {\texttt{suit}}, {\texttt{hearts}}\rangle \}, \{ \langle {\texttt{player}}, {\texttt{1}} \rangle \}\rangle , \\ &{} &{} \langle {\texttt{card}}, \{ \langle {\texttt{rank}}, {\texttt{jack}} \rangle , \langle {\texttt{suit}}, {\texttt{hearts}}\rangle \}, \{ \langle {\texttt{player}}, {\texttt{2}} \rangle \}\rangle ,\\ &{} &{} \langle {\texttt{card}}, \{ \langle {\texttt{rank}}, {\texttt{4}} \rangle , \langle {\texttt{suit}}, {\texttt{clubs}}\rangle \}, \{ \langle {\texttt{player}}, {\texttt{3}} \rangle \}\rangle , \\ &{} &{} \langle {\texttt{card}}, \{ \langle {\texttt{rank}}, {\texttt{8}} \rangle , \langle {\texttt{suit}}, {\texttt{spades}}\rangle \}, \{ \langle {\texttt{player}}, {\texttt{4}} \rangle \}\rangle \;\;\;\}\rangle . \end{array} \end{aligned}$$

In this task, FFNSL uses the symbolic learner ILASP. The concept to be learned is not directly expressed as a label, but is related to it: the label is a single winning player for a trick, but the learned concept requires reasoning over the suit and rank values of the other players’ cards. We encode as background knowledge the possible suit and rank values, the four players, and the definition of a higher rank predicate. ILASP is particularly suited to such learning tasks, known as non-observational predicate learning. The full background knowledge B and the language bias used to construct the search space \(S_{M}\) for this classification task are given in Appendix F. To generate the ILASP learning task, the set \(E^{'}\) of WCDPI examples is constructed from the output of the D2K component. For example, the WCDPI generated from the D2K output and the corresponding label in Example 1 is:

$$\begin{aligned}\langle e_{\textrm{id}}, 81, \langle \{ 2 \}, \{ 1,3,4\}\rangle , e_{\textrm{ctx}} \rangle \end{aligned}$$

where \(e_{\textrm{id}}\) is a unique identifier and \(e_{\textrm{ctx}}\) is the set of facts \(\{{\mathtt {card(1,10,hearts).}}, {\mathtt {card(2,jack,hearts).}}, {\mathtt {card(3,4,clubs).}}, {\mathtt {card(4,8,spades).}}\}\).
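The translation of the D2K symbolic output into the context facts of a WCDPI example can be sketched as follows. The tuple layout and the `card(player,rank,suit)` fact format mirror the example above; the function and argument names are illustrative, and the weight scaling from the aggregated confidence to the integer penalty is not reproduced here.

```python
def d2k_to_context(d2k_tuples):
    """Convert D2K symbolic output tuples into ASP context facts.
    Each tuple is (network_id, feature_values, meta_data), where
    feature_values and meta_data are dicts mapping feature -> value."""
    facts = set()
    for net_id, fv, meta in d2k_tuples:
        # Assumed fact layout card(player, rank, suit), matching the
        # WCDPI context shown above for the Follow Suit Winner task.
        facts.add(f"{net_id}({meta['player']},{fv['rank']},{fv['suit']}).")
    return facts
```

Applied to the symbolic output of Example 1, this yields exactly the four facts of \(e_{\textrm{ctx}}\) shown above.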

4.2 Sudoku grid validity

Our second classification task is Sudoku Grid Validity. This consists of observing a sequence of images of handwritten MNIST digits, corresponding to the digits in a Sudoku grid, and predicting whether the grid is valid or not.Footnote 8 The learned symbolic knowledge required to solve this task is the definition of a valid Sudoku grid. In this task, each \(\langle {\varvec{x}},y\rangle \in D\) contains a sequence of digit images \({\varvec{x}}\) with a label \(y\in \{0, 1\}\) for valid and invalid respectively. The length of the sequence depends on the size of the grid. We consider \(4\times 4\) and \(9\times 9\) Sudoku grids as two separate tasks, with respective datasets \(D_{4\times 4}\) and \(D_{9\times 9}\), where the maximum input sequence length is \(n=16\) and \(n=81\) respectively. As the images are all MNIST digits, FFNSL uses two neural networks, \(g_{4\times 4}\) and \(g_{9\times 9}\), depending on the grid size, pre-trained to predict the feature digit of a single image \(x_{i}\) in \({\varvec{x}}\). So \(g_{k\times k}:{\mathcal {X}}\rightarrow [0,1]^{k}\), where \({\mathcal {X}}\) is the set of MNIST images. In the case of \(D_{4\times 4}\), \(n=16\) and \(k=4\), whereas in the case of \(D_{9\times 9}\), \(n=81\) and \(k=9\). The neural network \(g_{k\times k}\) has an associated feature value mapping \(F_{g_{k\times k}}\) which gives, for each digit prediction \(z_{i}\in \{1,\ldots ,k\}\), a unique set of pairs \(f^{z_{i}}_{g_{k\times k}} = \{\langle value, \nu \rangle \}\), where \(\nu \) is one of the k digits that can appear in a Sudoku grid of size \(k\times k\). The meta-data related to each \(x_{i}\) is a set of two feature value pairs denoting the row and column that the image \(x_{i}\) occupies in the Sudoku grid, i.e., \(l_{i}=\{\langle row, \nu _{row}\rangle , \langle col, \nu _{col}\rangle \}\), where \(\nu _{row},\nu _{col}\in \{1,\ldots ,k\}\).

The instantiated FFNSL framework for this classification task is defined as follows. Given a sequence \({\varvec{x}}\) of MNIST digit images, for each \(x_i\in {\varvec{x}}\), the pre-trained neural network \(g_{k \times k}\) computes the vector \(g_{k \times k}(x_{i})\). The D2K component generates for each \(x_{i}\) the tuple \(D2K(x_{i})=\langle {\texttt{digit}}, f^{z_{i}}_{g_{k\times k}}, l_{i}\rangle \) where \({\texttt{digit}}\) is the network identifier, \(f^{z_{i}}_{g_{k\times k}}\) is the set of symbolic feature values associated with the prediction \(z_{i}\), and \(l_{i}\) is the set of symbolic meta-data feature value pairs associated with \(x_{i}\).

Example 2

Consider the task of predicting the validity of a \(4\times 4\) Sudoku grid. Let \({\varvec{x}}\) be a sequence of five digit images \(x_{1},\ldots ,x_{5}\), with label \(y=1\), and associated symbolic meta-data:

$$\begin{aligned} \begin{array}{ll} l_{1} = &{} \{ \langle {\texttt{row}},{\texttt{1}} \rangle , \langle {\texttt{col}},{\texttt{1}} \rangle \}\\ l_{2} = &{} \{ \langle {\texttt{row}},{\texttt{1}} \rangle , \langle {\texttt{col}},{\texttt{3}} \rangle \}\\ l_{3} = &{} \{ \langle {\texttt{row}},{\texttt{1}} \rangle , \langle {\texttt{col}},{\texttt{4}} \rangle \}\\ l_{4} = &{} \{\langle {\texttt{row}},{\texttt{3}} \rangle , \langle {\texttt{col}},{\texttt{2}} \rangle \}\\ l_{5} = &{} \{\langle {\texttt{row}},{\texttt{4}} \rangle , \langle {\texttt{col}},{\texttt{3}} \rangle \} \end{array} \end{aligned}$$

Let us assume the neural network \(g = g_{4 \times 4}\) computes the outputs \(g(x_{1}),\ldots ,g(x_{5})\) from which the D2K generator generates the correct digit predictions \(z_{1}=2\), \(z_{2}=4\), \(z_{3}=1\), \(z_{4}=3\), and \(z_{5}=4\). Let us also assume the neural network confidence scores for these predictions are: \(g(x_{1})[z_{1}] = 0.88\), \( g(x_{2})[z_{2}] = 0.93\), \(g(x_{3})[z_{3}] = 0.87\), \(g(x_{4})[z_{4}] = 0.97\), and \(g(x_{5})[z_{5}] = 0.99\). The aggregated confidence score is \(W({\varvec{x}}) = 0.87\), the minimum of the five scores. \(D2K({\varvec{x}})\) is given by the following tuple:

$$\begin{aligned} \begin{array}{lll} D2K({\varvec{x}}) = &{} \langle 0.87, \{ &{} \langle {\texttt{digit}}, \{ \langle {\texttt{value}}, {\texttt{2}} \rangle \}, \{ \langle {\texttt{row}}, {\texttt{1}} \rangle ,\langle {\texttt{col}}, {\texttt{1}} \rangle \}\rangle ,\\ &{} &{} \langle {\texttt{digit}}, \{ \langle {\texttt{value}}, {\texttt{4}} \rangle \}, \{ \langle {\texttt{row}}, {\texttt{1}} \rangle ,\langle {\texttt{col}}, {\texttt{3}} \rangle \}\rangle ,\\ &{} &{} \langle {\texttt{digit}}, \{ \langle {\texttt{value}}, {\texttt{1}} \rangle \}, \{ \langle {\texttt{row}}, {\texttt{1}} \rangle , \langle {\texttt{col}}, {\texttt{4}} \rangle \}\rangle ,\\ &{} &{} \langle {\texttt{digit}}, \{ \langle {\texttt{value}}, {\texttt{3}} \rangle \}, \{ \langle {\texttt{row}}, {\texttt{3}} \rangle ,\langle {\texttt{col}}, {\texttt{2}} \rangle \}\rangle , \\ &{} &{} \langle {\texttt{digit}}, \{ \langle {\texttt{value}}, {\texttt{4}} \rangle \}, \{ \langle {\texttt{row}}, {\texttt{4}} \rangle ,\langle {\texttt{col}}, {\texttt{3}} \rangle \}\rangle \;\;\}\;\rangle . \end{array} \end{aligned}$$

In this task FFNSL uses the FastLAS symbolic learner because the task is to learn the definition of the classification label, and FastLAS has been shown, for these types of learning tasks, to be more scalable than ILASP (Law et al., 2020). For both \(4\times 4\) and \(9\times 9\) Sudoku grids, the knowledge of the grid is encoded as part of the background knowledge B, given in Appendix F together with the language bias used to construct the search space \(S_{M}\). For each \(\langle {\varvec{x}}, y\rangle \), FastLAS takes as input \(D2K({\varvec{x}})\) and generates a WCDPI example. For instance, the WCDPI generated for the D2K output and the corresponding label in Example 2 is:

$$\begin{aligned}\langle e_{\textrm{id}}, 88, \langle \{ 1 \}, \{ 0\}\rangle , e_{\textrm{ctx}} \rangle \end{aligned}$$

where \(e_{\textrm{id}}\) is a unique identifier and \(e_{\textrm{ctx}}\) is given by the set of facts \(\{{\mathtt {digit(1,1,2).}}, {\mathtt {digit(1,3,4).}},\; {\mathtt {digit(1,4,1).}},\; {\mathtt {digit(3,2,3).}},\; {\mathtt {digit(4,3,4).}}\}\).
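As a point of reference for the target concept (a Python paraphrase, not the ASP hypothesis that FastLAS actually learns), an invalid grid can be characterised as follows, under the usual Sudoku constraints that values are distinct within each row, column, and block.

```python
def invalid_sudoku(digits, k=4):
    """Check whether a (possibly partial) k x k Sudoku grid is invalid,
    i.e. whether two cells in the same row, column or block share a value.
    `digits` is a set of (row, col, value) triples, mirroring the
    digit(row, col, value) facts in the WCDPI context."""
    b = int(k ** 0.5)  # block size: 2 for 4x4 grids, 3 for 9x9 grids
    cells = list(digits)
    for i, (r1, c1, v1) in enumerate(cells):
        for r2, c2, v2 in cells[i + 1:]:
            same_block = ((r1 - 1) // b == (r2 - 1) // b and
                          (c1 - 1) // b == (c2 - 1) // b)
            if v1 == v2 and (r1 == r2 or c1 == c2 or same_block):
                return True
    return False
```

On the grid of Example 2, the cells \((1,3)\) and \((4,3)\) both hold the value 4 in the same column, so the grid is classified invalid, consistent with the label \(y=1\).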

4.3 Crop yield prediction

To demonstrate the application of FFNSL to a real-world problem and dataset, consider the Crop Yield Prediction task. The goal is to classify the quality of yield, given an image and the location of a particular crop. The symbolic knowledge required to solve the task defines the quality of yield according to the crop’s location, species, and any disease that may be present. Each \(\langle {\varvec{x}},y \rangle \in D\) is composed of a sequence \({\varvec{x}}\) containing a single image, and a label \(y\in \{0,1,2\}\) denoting the quality of yield as poor, moderate, and strong respectively.

Let us assume a sequence \({\varvec{x}}\) which contains an image of a peach crop with the bacterial spot disease. Given symbolic meta-data denoting the location of this crop, let us assume the label \(y=0\), indicating poor yield. In this task, we use one neural network g to predict the features of a crop image, which are the crop species and disease. In total, there are 38 possible combinations of crop species and diseases, and g is trained to classify each combination. To assist with neural network training, the image dataset also contains a background class with unrelated images.Footnote 9 Therefore, \(g : {\mathcal {X}} \rightarrow [0,1]^{39}\), where \({\mathcal {X}}\) is the set of possible crop and background images. g has an associated feature value mapping \(F_{g}\), which specifies, for each crop prediction \(z_{i} \in \{ 1,\ldots ,38\}\), a unique set of feature-value pairs \(f_{g}^{z_{i}}= \{ \langle species, \nu _{species} \rangle , \langle disease, \nu _{disease} \rangle \}\), where \(\nu _{species}\) and \(\nu _{disease}\) are the crop species and disease values respectively. Also, each input \(x_{i}\) has associated symbolic meta-data \(l_{i} = \{ \langle location, \nu _{location}\rangle \}\), where \(\nu _{location} \in \{ 1,\ldots ,19\}\) is the location of the crop.Footnote 10

We instantiate our FFNSL framework as follows. Given a sequence \({\varvec{x}}\) containing a single crop image, the neural component generates a single vector \(g(x_{i})\). The D2K component generates the prediction \(z_{i}\) and its corresponding symbolic features and meta-data, thus computing the tuple \(D2K(x_{i}) = \langle {\texttt{crop}}, f_{g}^{z_{i}}, l_{i}\rangle \), where \({\texttt{crop}}\) is the identifier for the network g (i.e., \(m_{g}={\texttt{crop}}\)).

Example 3

Consider the sequence \({\varvec{x}}\) described above and \(y=0\). Let us assume the neural network g computes the output \(g(x_{1})\) from which the D2K generator generates the correct crop prediction \(z_{1}=17\). Let us also assume the neural network predicts with confidence \(g(x_{1})[z_{1}]=0.98\), and that this crop is in location 5. \(D2K({\varvec{x}})\) is given by the following tuple:

$$\begin{aligned} \begin{array}{llll} D2K({\varvec{x}}) = &{} \langle 0.98, \{ &{} \langle {\texttt{crop}}, &{} \{ \langle {\texttt{species}}, {\texttt{peach}} \rangle , \langle {\texttt{disease}}, {\mathtt {bacterial\_spot}} \rangle \},\\ {} &{} &{} &{} \{\langle {\texttt{location}}, {\texttt{5}} \rangle \}\rangle \;\;\}\;\rangle . \end{array} \end{aligned}$$

In this task FFNSL uses the FastLAS symbolic learner, which has been shown to be more scalable than ILASP for this type of task. The background knowledge B contains a rule that ensures a classification is performed, i.e., given a crop species, disease, and location, the learned hypothesis should output only one class of crop yield. This rule, alongside the language bias used to construct the search space \(S_{M}\), is given in Appendix F. For each \(\langle {\varvec{x}},y\rangle \), FastLAS takes as input \(D2K({\varvec{x}})\) and generates a WCDPI example. For instance, the WCDPI generated from the D2K output and the corresponding label given in Example 3 is:

$$\begin{aligned}\langle e_{\textrm{id}}, 99, \langle \{ 0 \}, \{ 1,2\}\rangle , e_{\textrm{ctx}} \rangle \end{aligned}$$

where \(e_{\textrm{id}}\) is a unique identifier and \(e_{\textrm{ctx}}\) is given by the set of facts \(\{{\mathtt {species(peach).}},\; {\mathtt {disease(bacterial\_spot).}},\; {\mathtt {location(5).}}\}\).

4.4 Indoor scene classification

Our final instantiation of FFNSL is with the Indoor Scene Classification task, where both neural and symbolic components are trained with real data. The goal is to learn symbolic knowledge that maps indoor scene classes (e.g., bedroom, bathroom, kitchen) into higher level super-classes (e.g., home), given images of indoor scenes. Each \(\langle {\varvec{x}},y \rangle \in D\) is composed of a sequence \({\varvec{x}}\) containing a single indoor scene image, and a label \(y\in \{0,\ldots ,4\}\) denoting the super-class as store, home, public space, leisure, and working place respectively.

Fig. 2
figure 2

Example bookstore image from the MIT Indoor Scenes dataset. Quattoni and Torralba (2009)

Let us assume a sequence \({\varvec{x}}\) which contains an image of a bookstore (shown in Fig. 2). The label for this example is \(y=0\) (i.e., store). We use one neural network g to predict the scene class. In total, there are 67 different classes of indoor scenes, and therefore \(g:{\mathcal {X}}\rightarrow [0,1]^{67}\), where \({\mathcal {X}}\) is the set of possible images in the MIT Indoor Scenes dataset. g has an associated feature value mapping \(F_{g}\), which, for each scene prediction \(z_{i} \in \{1,\ldots ,67\}\), gives a pair that denotes the symbolic scene name \(\nu _{scene}\), i.e., \(f^{z_{i}}_{g}=\{\langle scene, \nu _{scene} \rangle \}\). In this task there is no symbolic meta-data associated with each input \(x \in {\varvec{x}}\). Given a sequence \({\varvec{x}}\) containing a single scene image, the neural component generates a single vector \(g(x_{i})\). The D2K component generates the prediction \(z_{i}\) and its corresponding symbolic feature, thus computing the tuple \(D2K(x_{i})=\langle {\texttt{image}}, f^{z_{i}}_{g}, \{ \}\rangle \), where \({\texttt{image}}\) is the identifier for the network g (i.e., \(m_{g}={\texttt{image}}\)).

Example 4

Consider the sequence \({\varvec{x}}\) described above and \(y=0\). Let us assume the neural network g computes the output \(g(x_{1})\) from which the D2K generator generates the correct scene prediction \(z_{1}=8\). Let us also assume the neural network predicts with confidence \(g(x_{1})[z_{1}]=0.96\). \(D2K({\varvec{x}})\) gives as output the following tuple:

$$\begin{aligned} \begin{array}{llll} D2K({\varvec{x}}) = \langle 0.96, \{&\langle {\texttt{image}}, \{ \langle {\texttt{scene}}, {\texttt{bookstore}} \rangle \}, \{\} \rangle&\; \}\rangle . \end{array} \end{aligned}$$

The FastLAS symbolic learner is also used in this task. No background knowledge is required, and the language bias is given in Appendix F. For each \(\langle {\varvec{x}}, y\rangle \), FastLAS takes as input \(D2K({\varvec{x}})\) and generates a WCDPI example. For instance, the WCDPI generated for the D2K output in Example 4 is:

$$\begin{aligned}\langle e_{\textrm{id}}, 96, \langle \{ 0 \}, \{ 1,2,3,4\}\rangle , e_{\textrm{ctx}} \rangle \end{aligned}$$

where \(e_{\textrm{id}}\) is a unique identifier and \(e_{\textrm{ctx}}\) is given by the set \(\{{\mathtt {scene(bookstore).}}\}\).

5 Evaluation methodology

In this section we describe the methodology used to evaluate the FFNSL framework. In the first two tasks, the focus is on learning complex first-order knowledge involving negation as failure and predicate invention, which are essential aspects of common-sense learning and reasoning. In the second two tasks, we demonstrate FFNSL's applicability to real-world problems and datasets. For each of the four classification tasks, we divide the evaluation into two types. Firstly, we evaluate the symbolic learning capability of FFNSL, where the goal is to learn interpretable knowledge from symbolic features extracted from pre-trained neural network predictions. Secondly, we evaluate the inference capability of FFNSL, where the pre-trained neural networks together with the learned knowledge are used to make a downstream classification of unseen unstructured data. We refer to the first type of evaluation as the learned hypothesis evaluation and the second type as the FFNSL framework evaluation, since the latter targets both neural and symbolic components. Let us now describe each evaluation type in more detail.

5.1 Learned hypothesis evaluation

We evaluate the learned hypothesis in terms of accuracy, interpretability and learning time. To measure accuracy, we use a symbolic test set containing ground truth symbolic features. This ensures that the evaluation only targets the accuracy of the learned hypothesis. For each example in the test set, the symbolic features are used with the learned hypothesis to make a prediction of the downstream label. This prediction is compared to the ground truth label in the test set and accuracy is computed using the standard measure. Since FFNSL learns knowledge from a pre-trained neural network, we consider the hypotheses that have been learned at each (increasing) percentage of distributional shift and evaluate the accuracy of the knowledge that FFNSL learns in the presence of incorrect neural network predictions. Note that the symbolic test set remains unchanged and is not affected by the distributional shifts, as we want to evaluate in this case just the accuracy of the learned hypotheses.

To perform a deeper analysis of the accuracy of the learned hypotheses, we take the following measures into consideration. Firstly, the accuracy and confidence score distribution of the pre-trained neural network(s) in classifying unstructured data in the training set D. Since the neural networks were pre-trained on a dataset different from D, this measure enables us to understand the reliability of the pre-trained neural network predictions over new unseen input data (for more dataset details, see Appendix C). Secondly, we measure the percentage of WCDPI examples generated by the LAS system that contain context features which are incorrect with respect to the label in the inclusion set. This enables us to understand the relationship between incorrect neural network predictions and the accuracy of the learned hypotheses, as well as to analyse how many correct WCDPI examples are needed to learn hypotheses with a certain level of accuracy. Thirdly, we calculate the weight penalty ratio r over the generated WCDPI examples, defined as

$$\begin{aligned}r = \frac{\sum _{e \in E^{\prime }_{\textrm{correct}}}{e_{\textrm{pen}}}}{\sum _{e \in E^{\prime }}{e_{\textrm{pen}}}}\end{aligned}$$

where \(E^{\prime }_{\textrm{correct}}\) is the set of correctly generated WCDPI examples (i.e., WCDPI examples with context features that are consistent with the label in the inclusion set) and \(E^{\prime }\) is the complete set of generated WCDPI examples. This enables us to measure the bias given to the LAS system by the weights of the WCDPI examples, which are based on the neural network confidence scores. Ideally, FFNSL should allocate a higher proportion of the total weight penalty to WCDPI examples that contain correct neural network predictions. As a baseline, we compare the accuracy of the knowledge learned from these WCDPI examples with that of knowledge learned from corresponding WCDPI examples where the penalty is fixed to a constant for all examples. To measure interpretability, we count the total number of atoms in a learned hypothesis: a hypothesis with fewer atoms is considered more interpretable (Lakkaraju et al., 2016). Finally, we measure the wall-clock time taken to learn a hypothesis at each percentage of distributional shift.
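A minimal sketch of the ratio r, assuming each generated WCDPI example is represented simply by its penalty and a flag indicating whether its context is consistent with its label:

```python
def weight_penalty_ratio(examples):
    """Weight penalty ratio r over generated WCDPI examples.
    `examples` is a list of (penalty, is_correct) pairs, where
    is_correct marks examples whose context features are consistent
    with the label in the inclusion set."""
    total = sum(pen for pen, _ in examples)
    correct = sum(pen for pen, ok in examples if ok)
    return correct / total
```

A ratio close to 1 means almost all of the total penalty weight is concentrated on correctly generated examples, which is the behaviour FFNSL ideally exhibits.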

5.2 FFNSL framework evaluation

When a hypothesis has been learned, the entire FFNSL framework can be evaluated using a test set containing unseen labelled unstructured data. In this case, the neural network component of FFNSL classifies each element of a sequence of unstructured data. The symbolic features predicted from the neural network classification are added to the background knowledge alongside the learned hypothesis. The symbolic component of FFNSL is used to compute the downstream prediction. This is compared to the ground-truth label associated with the sequence of unstructured data, and the accuracy is computed with the standard measure. To assist the evaluation, and to provide insight into where mistakes are being made, we evaluate the neural network accuracy in predicting the symbolic features from the unstructured data in the test set with respect to ground truth information. This enables us to identify whether any downstream classification error is due to neural network feature prediction, the learned hypothesis, or both. We also evaluate FFNSL under distributional shifts. We inject into the test data the same percentages of distributional shifts used during the learning of hypotheses, and evaluate the accuracy of FFNSL. This evaluates the performance of FFNSL in realistic scenarios where distributional shifts occur during learning and inference.

5.3 Experimental setting

In the next four sections, we present the results of the Follow Suit Winner, Sudoku Grid Validity, Crop Yield Prediction, and Indoor Scene Classification tasks, using the evaluation methodology outlined in this section. In the first three tasks we pre-train a Softmax CNN and an EDL-GEN neural network; when used in combination with a symbolic learning system in FFNSL, we refer to these as FFNSL Softmax and FFNSL EDL-GEN respectively. For the Indoor Scene Classification task we adopt a pre-trained network, called Semantic Aware Scene Recognition (SASR), tailored to the task of scene classification. We use a Random Forest (RF) and a neural network as baseline rule learning approaches in all tasks; both use the same pre-trained Softmax neural network for feature extraction as FFNSL Softmax, and are trained to learn the knowledge needed to predict the downstream label given the Softmax neural network predictions. In the Indoor Scene Classification task the SASR network is used. The RF is chosen as a powerful decision tree approach, known for being lightweight, quick to train, and somewhat interpretable. In the Follow Suit Winner and Crop Yield Prediction tasks, the baseline neural network is a Fully Connected Network (FCN), chosen to evaluate a deeper architecture, and in the Sudoku Grid Validity task, it is a Convolutional Neural Network-Long Short Term Memory (CNN-LSTM) designed for sequence classification problems, where the CNN component can learn spatial dependencies in the Sudoku grid. Full details of the baseline architectures are given in Appendix D. To measure the interpretability of the RF baseline, we use the first tree in the forest and extract a rule from each branch (from root to leaf) of this tree. For the neural network baselines, we fit a surrogate decision tree model (Molnar, 2019) to approximate the black-box predictions, and apply the same rule extraction method as that used for the RF.
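The branch-based rule extraction used for the interpretability measurement can be sketched as follows. The nested-dictionary tree representation is purely illustrative (the actual baselines use scikit-learn style trees, and a surrogate tree for the neural networks); the function names are our own.

```python
def extract_rules(tree, conditions=()):
    """Extract one rule per root-to-leaf branch of a decision tree.
    A tree is either a class label (leaf) or a dict
    {'feature': f, 'true': subtree, 'false': subtree} testing f.
    Each rule is a (conditions, predicted_class) pair."""
    if not isinstance(tree, dict):  # leaf: emit the accumulated conditions
        return [(conditions, tree)]
    f = tree['feature']
    return (extract_rules(tree['true'], conditions + (f,)) +
            extract_rules(tree['false'], conditions + (f"not {f}",)))
```

Counting the conditions across all extracted rules gives a measure comparable to the atom count used for the FFNSL hypotheses.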
Let us now present our results.

6 Follow suit winner

In this section we present the results of the Follow Suit Winner task. We start with Softmax and EDL-GEN neural networks pre-trained on standard playing card images and apply minor and major distributional shifts by substituting standard playing card images with images from alternative decks. Example images of the queen of hearts card are shown in Fig. 3, taken from the Standard (Fig. 3a), Batman Joker (Fig. 3b), Captain America (Fig. 3c), Adversarial Standard (Fig. 3d), Adversarial Batman Joker (Fig. 3e) and Adversarial Captain America (Fig. 3f) decks.

Fig. 3
figure 3

Example playing card images

The Batman Joker and Captain America decks represent minor distributional shifts; the adversarial decks instead represent major distributional shifts, where card images from each of the Standard, Batman Joker and Captain America decks are placed against a background containing additional card images from the Standard deck. These adversarial decks are designed to trick the neural networks into predicting incorrectly, since the background images from the Standard deck belong to the same distribution as the card images used during neural network pre-training. In order to understand the challenge faced by the LAS system when learning from neural network feature predictions in the presence of distributional shifts, Fig. 4 presents the accuracy and confidence score distribution of the pre-trained neural networks when evaluated on playing card decks different from the one used for pre-training.

Fig. 4
figure 4

Neural network performance under distributional shifts

In Fig. 4, each row shows the type of neural network, the playing card deck used for evaluation, the predictive accuracy, and the confidence score distribution. As one would expect, the accuracy was very high when classifying playing card images from the standard deck, as this was the deck used for pre-training. For the Softmax neural network the confidence score was also very high in this case, whereas EDL-GEN had more distributed confidence scores. When evaluating the pre-trained networks on decks different from the one used in training, the Softmax neural network still reported high confidence despite its overall low accuracy, whereas the EDL-GEN network reported comparable low accuracy but with much lower confidence. For example, evaluating the networks over the Captain America deck (see 3rd and 9th rows), 96% of Softmax predictions were made with confidence in the interval [0.95, 1], despite an accuracy of 0.0697, whereas only 10% of EDL-GEN predictions were made within this same confidence interval. As for the overall accuracy, EDL-GEN performed slightly better than Softmax over decks representing minor distributional shifts, whereas both networks performed in a similar way when applied to decks representing major distributional shifts. This highlights the challenge for our FFNSL framework in learning knowledge when presented with out-of-distribution data, as neural network predictions are likely to be incorrect, and may potentially be made with high confidence.

6.1 Learned hypothesis evaluation

Figure 5 presents the accuracy of the learned hypotheses when an increasing percentage of labelled unstructured data were subject to distributional shifts, applied with cards from the alternative decks. The reported accuracy is the mean accuracy over 5 repeats and the error bars indicate standard error.

Fig. 5
figure 5

Accuracy of learned hypotheses with increasing percentages of data subject to distributional shifts, Follow Suit Winner task. 5 repeats

FFNSL outperformed the baselines and learned far superior hypotheses when up to 90% of labelled unstructured data were subject to distributional shifts. This was the case for both instances of FFNSL. The baselines required \(100\times \) the number of examples in order to perform close to FFNSL, and despite this significant increase in the amount of data used by the baselines, FFNSL still learned more accurate hypotheses. Figures 5a and b refer to the injection of minor distributional shifts. In these two cases, when the percentage of distributional shift was very high (above \(90\%\)), the accuracy of the FFNSL learned hypotheses decreased, but still remained between \(\sim 70{-}100\%\), whereas the accuracy of the baselines trained with the same amount of data dropped to \(\sim 40\%\). Figure 5c–e refer to the injection of major distributional shifts. FFNSL Softmax had similar performance in Fig. 5c and d, but a much lower accuracy than that shown with minor distributional shifts when \(90\%\) or more of the unstructured data were subject to distributional shifts. FFNSL EDL-GEN, instead, maintained a higher accuracy in these cases. We now perform a more in-depth analysis to explore the reasons for the drop in accuracy in the presence of high percentages of distributional shifts. Given the two groups of similar behaviours, we consider only two representative cases: Batman Joker as a minor distributional shift, and Adversarial Batman Joker as a major distributional shift. A full set of analysis results, with respect to all the other card decks, is given in Appendix A.

In particular, we explore whether FFNSL EDL-GEN provides a performance benefit over FFNSL Softmax, and if so, what the contributing factors are. Specifically, we analyse the accuracy performance in relation to either or both of (i) better neural network predictive accuracy when classifying out-of-distribution data, and (ii) more informative weight penalties of the generated WCDPI examples, calculated from the neural network confidence scores. For this analysis we focus on high percentages of distributional shifts, 95–100%, as this was when the FFNSL instances deteriorated in their learned hypothesis accuracy. We ran 50 experimental repeats to generate statistically significant results. In order to isolate the effect of the example weight penalties, we also ran two additional baseline FFNSL instances where the weight penalties of the generated WCDPI examples are all constant and equal to 10. The results are shown in Fig. 6. We have also included the performance with respect to distributional shifts given by the Adversarial Captain America deck (Fig. 6c), since Fig. 5e shows that in this case the accuracy of both FFNSL instances decreased to around \(40\%\) when nearly \(100\%\) of the data were subject to distributional shifts. Full analysis of this deck is presented in Appendix A.

Fig. 6
figure 6

FFNSL Softmax vs. FFNSL EDL-GEN. Accuracy of learned hypotheses with 95–100% distributional shifts using 50 repeats. Follow Suit Winner task

For both Batman Joker and Adversarial Batman Joker, FFNSL EDL-GEN outperformed FFNSL Softmax. Note, however, the difference in y-axis scale between Fig. 6a and b and the corresponding difference in FFNSL performance. This was due to the fact that both the Softmax and EDL-GEN neural networks predicted more accurately on the Batman Joker deck than on the Adversarial Batman Joker deck, as presented in Fig. 4. The improved performance of FFNSL EDL-GEN over FFNSL Softmax did not seem to depend on the more informative weights of the WCDPI examples, since the accuracy of the FFNSL instances with weights calculated from neural network confidence scores (denoted “...with NN penalties”) was similar to that of the respective baselines with constant weight penalties. However, Fig. 6c shows that when the distributional shift was more severe,Footnote 11 the decrease in accuracy of FFNSL EDL-GEN was less drastic than that of its corresponding baseline with constant penalty, whereas there was no such difference in the case of FFNSL Softmax. Even though the overall accuracy of the framework was lower than that reported for less drastic forms of distributional shift, the more informative weight penalties of the WCDPI examples, calculated from the EDL-GEN neural network confidence scores, provided a clear benefit compared to using constant weights, in particular when the percentage of distributional shift was very high.

The question remains as to why FFNSL EDL-GEN performed better than FFNSL Softmax in Fig. 6a and b. For percentages of distributional shift between 95 and 100%, both pre-trained neural networks reported low average accuracy. A natural question to ask is therefore whether EDL-GEN led to more consistent symbolic feature predictions than Softmax. This is important to investigate because LAS systems are capable of learning accurate hypotheses from few “good” examples. So, we investigated the percentage of incorrect WCDPI examples generated when 95–100% of the unstructured data were subject to distributional shifts. These were examples whose contextual symbolic features were inconsistent with the ground-truth label due to incorrect neural network predictions.

Fig. 7

The effect of distributional shifts on percentage of incorrect WCDPI examples generated. Follow Suit Winner task

Figure 7 shows, first of all, that the percentage of incorrect WCDPI examples generated was lower than the corresponding percentage of data subject to distributional shifts. This indicates that some correct WCDPI examples could be generated even when the neural networks made incorrect predictions: incorrect predictions over the 4 cards played could, in combination, still yield predicted symbolic features for the trick whose winning player matched the ground-truth label. Secondly, more correct WCDPI examples were generated under the distributional shift given by the Batman Joker deck than under that given by the Adversarial Batman Joker deck. This was because, as indicated in Fig. 4, the neural network accuracy for the former was better than for the latter. Furthermore, EDL-GEN led to fewer incorrect WCDPI examples than Softmax under both forms of distributional shift, and this difference was larger in the case of the Adversarial Batman Joker deck. Given the relatively small number of WCDPI examples used by the LAS system (104), this difference contributed to the larger gap in accuracy between FFNSL Softmax and FFNSL EDL-GEN in Fig. 6b than in Fig. 6a.
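To make this consistency check concrete, the sketch below (our own illustrative Python, not the paper's implementation; the card encoding and function names are assumptions) labels a generated example as incorrect when the winner computed from the four predicted cards disagrees with the ground-truth label. It also shows how mispredictions can cancel out: a trick in which three suits are predicted incorrectly can still yield the correct winner.

```python
# Card ranks for the Follow Suit Winner rules (illustrative encoding).
RANKS = {str(n): n for n in range(2, 11)}
RANKS.update({"j": 11, "q": 12, "k": 13, "a": 14})

def follow_suit_winner(cards):
    """Winner of a trick: the player holding the highest-ranked card
    whose suit matches player 1's card. `cards` is a list of
    (rank, suit) pairs indexed by player (index 0 = player 1)."""
    lead_suit = cards[0][1]
    followers = [p for p in range(len(cards)) if cards[p][1] == lead_suit]
    return max(followers, key=lambda p: RANKS[cards[p][0]]) + 1

def incorrect_example_rate(predicted_tricks, labels):
    """Fraction of generated examples whose predicted cards are
    inconsistent with the ground-truth winner label."""
    wrong = sum(follow_suit_winner(trick) != label
                for trick, label in zip(predicted_tricks, labels))
    return wrong / len(labels)
```

For instance, if the network mispredicts the suits of players 2–4 but player 1's ace of hearts is read correctly, the computed winner still matches a ground-truth label of player 1, so a correct WCDPI example is generated despite three wrong card predictions.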

Now, how did the weight penalty, generated from the neural network confidence score, affect the accuracy of the learned hypotheses? Clearly, EDL-GEN provided better confidence scores than the Softmax neural network, which in turn improved the accuracy of FFNSL. This explains why the accuracy of each FFNSL approach was higher in Fig. 6a than in Fig. 6b. However, Fig. 6 shows that for FFNSL Softmax, using WCDPI example weight penalties calculated from Softmax neural network confidence scores appeared to have no benefit compared to using WCDPI examples with constant weight penalties. This was different in the case of FFNSL EDL-GEN. To investigate further, we calculated the weight penalty ratio of the WCDPI examples generated from both Softmax and EDL-GEN neural network confidence scores. The analysis is shown in Fig. 8 for each deck and 95–100% distributional shifts.

Fig. 8

WCDPI example weight penalty ratio. Follow Suit Winner task

Figure 8 shows that the weight penalty ratio calculated from EDL-GEN confidence scores provided a clear benefit over that calculated from the Softmax neural network confidence scores, which was instead very similar to the weight penalty ratio given by constant penalties. At 100% distributional shift, the benefit of calculating WCDPI example weight penalties from neural network confidence scores diminished, as there were very few correct examples. This explains why the gap between the accuracy of FFNSL EDL-GEN with neural network penalties and that of FFNSL EDL-GEN with constant penalties, in both decks, narrowed as the distributional shift increased towards 100% (see Fig. 6). In summary, more accurate neural network predictions led to a higher percentage (even if small) of correct WCDPI examples, and better neural network confidence scores led to an improved penalty ratio of correct WCDPI examples. Together, these provided a stronger bias for the LAS system: even when it was left with only a small percentage of correct examples, those examples carried higher penalty weights to guide the search for optimal solutions.
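The intuition behind the weight penalty ratio can be sketched as follows. This is our own illustration, not the paper's exact scheme: we assume, for demonstration only, that an example's penalty is the scaled product of the confidence scores of its predicted features, and that the ratio measures the share of total penalty mass carried by correct examples.

```python
import math

def example_penalty(confidences, scale=100):
    # Illustrative weighting: scaled product of the confidence scores of
    # the example's predicted features (an assumed scheme, not the
    # paper's exact definition). Clamped to at least 1.
    return max(1, round(scale * math.prod(confidences)))

def penalty_ratio(examples):
    """Share of the total penalty mass carried by correct examples.
    `examples` is a list of (confidences, is_correct) pairs."""
    total = sum(example_penalty(conf) for conf, _ in examples)
    correct = sum(example_penalty(conf) for conf, ok in examples if ok)
    return correct / total

# Calibrated confidence (high on correct predictions, low on incorrect
# ones) concentrates penalty mass on correct examples; constant weights
# spread it evenly.
calibrated = [([0.9] * 4, True), ([0.2] * 4, False)]
constant = [([1.0] * 4, True), ([1.0] * 4, False)]
```

With well-calibrated confidences, as with EDL-GEN, `penalty_ratio(calibrated)` is close to 1, whereas constant penalties give 0.5: the optimisation signal from correct examples dominates, which matches the benefit observed in Fig. 8.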

Let us now investigate the interpretability of the hypotheses learned using our FFNSL framework compared to that of the baseline approaches. Figure 9 shows the results, where interpretability was measured in terms of the number of atoms that formed the learned hypothesis.

Fig. 9

Interpretability of the learned hypotheses, Follow Suit Winner task

FFNSL learned significantly more interpretable knowledge than the baseline approaches (note the logarithmic scale on the y-axis). In the case of the minor form of distributional shift (see Fig. 9a), the interpretability of the baseline models trained with \(100\times \) the amount of data decreased as distributional shift increased. These models reached high accuracy by training over a much larger dataset (see Fig. 5a), but they did so at the cost of much lower interpretability. This was because they learned a more complex mapping between input and output, instead of learning general rules, as was the case for our FFNSL approach. The FCN trained with the same amount of data as FFNSL had similar interpretability to that of FFNSL, because the model learned to largely predict the same class and the surrogate decision tree was very small. This was reflected in the poor performance of the FCN shown in Fig. 5d for the Adversarial Batman Joker deck. Examples of interpretable knowledge learned by our FFNSL approaches are presented in Appendix A.

Finally, to investigate the scalability of FFNSL, we have also computed the time required to learn an interpretable hypothesis. The results are shown in Fig. 10.

Fig. 10

Learning time. Follow Suit Winner task

When no distributional shifts were applied to the data, both FFNSL approaches learned in a time of the same order of magnitude as the FCN trained with the same number of examples. As distributional shifts increased, FFNSL took longer because the ILASP system required more iterations to prove optimality with respect to minimising the total penalty over the examples. However, the learning time of FFNSL EDL-GEN did not increase as quickly as that of FFNSL Softmax. This was because its WCDPI example weight penalties were much more informative (see Fig. 8), so the ILASP learning system required fewer iterations overall to prove optimality.

In conclusion, our analysis shows that FFNSL outperformed the baseline approaches in terms of accuracy and interpretability, even when the baselines were trained with \(100\times \) the amount of data. FFNSL EDL-GEN outperformed FFNSL Softmax in the accuracy of the learned hypotheses, as the EDL-GEN neural network predictions were more accurate, and this influenced the downstream performance of the FFNSL framework more than the neural network confidence scores did. When major distributional shifts were applied, the uncertainty-aware EDL-GEN neural network led to significantly more informative WCDPI example weight penalties than the Softmax neural network, although this benefit diminished as the percentage of input data subject to distributional shifts approached 100%. Finally, we have shown that more informative WCDPI example weight penalties resulted in faster hypothesis learning times when the iterative ILASP system was used.

6.2 FFNSL framework evaluation

Figure 11 presents the accuracy of the entire FFNSL framework when evaluated over test data subject to the same types of distributional shifts used during the learning of interpretable knowledge. The mean accuracy is reported, and the error bars denote standard error over 5 repeats.

Fig. 11

Accuracy of the FFNSL framework when training and test data were subject to distributional shifts. Follow Suit Winner

FFNSL outperformed the baselines trained with the same amount of data at every percentage of distributional shift on the Batman Joker deck, and up to \(\sim \)80% distributional shift on the Adversarial Batman Joker deck. The baselines required \(100\times \) the amount of data in order to match or outperform FFNSL. On the Adversarial Batman Joker deck, performance was lower for all approaches when the percentage of distributional shift was high, due to the neural networks predicting with lower accuracy (see Fig. 4). The baselines trained with the same amount of data outperformed FFNSL for \(>80\%\) distributional shifts. This was because they largely predicted player 1, which was sufficient to reach approximately \(40\%\) accuracy on the test set: the Follow Suit Winner task is biased towards player 1, because winning depends on playing the highest-ranked card with the same suit as player 1. In the test set, 38.6% of the data was indeed labelled with player 1 as the winner, which roughly corresponds to the performance of the baselines trained with the same number of examples at 100% shift in Fig. 11b. FFNSL EDL-GEN outperformed FFNSL Softmax for two reasons. Firstly, the rules learned by FFNSL EDL-GEN in the presence of high percentages of distributional shift were more accurate (see Figs. 5a and d), because of the lower number of incorrect WCDPI examples when the EDL-GEN neural network was used (Fig. 7). Secondly, the EDL-GEN neural network provided a more informative bias to the LAS system through better WCDPI example weight penalties (Fig. 8). Finally, the decrease in performance of the FFNSL approaches over unseen data subject to distributional shift appeared to be linear in the percentage of applied distributional shift. This was primarily due to the accuracy of the neural network feature predictions.
Figure 12 shows that indeed the accuracy of neural network predictions over unseen card images decreased linearly with the increase of the percentage of distributional shifts.

Fig. 12

Neural network card accuracy when test data points were subject to distributional shifts. Follow Suit Winner task

As shown in Fig. 12, the EDL-GEN neural network was more accurate than Softmax at predicting unseen playing cards under the minor distributional shift given by the Batman Joker deck, whereas the accuracy of the two neural networks was the same under the major distributional shift given by the Adversarial Batman Joker deck. This is why, in Fig. 11, FFNSL EDL-GEN showed better performance on the Batman Joker deck. For the Adversarial Batman Joker deck, FFNSL EDL-GEN's better performance over FFNSL Softmax was primarily due to more accurate hypotheses.

7 Sudoku grid validity

Having presented in detail the performance of our FFNSL approaches on the Follow Suit Winner task, we now explore whether the approach generalises to other tasks. We have applied our approach to a different classification task, the Sudoku Grid Validity task, and we present the results in this section. We consider two cases. The first is a \(4\times 4\) Sudoku grid, for which the sequence \({\varvec{x}}\) of unstructured data is much longer than that used for the Follow Suit Winner task; each generated WCDPI example therefore contains more contextual features that are likely to be predicted incorrectly as a result of distributional shifts applied to the input images. The second evaluates the scalability of the FFNSL framework even further by considering \(9\times 9\) Sudoku grids. For the Sudoku Grid Validity tasks, the FFNSL instance makes use of the FastLAS system, which has been shown to scale to large hypothesis spaces (Law et al., 2020).
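The downstream label in this task is whether a grid is valid, i.e., whether no digit repeats within any row, column, or block. A minimal reference check (our own sketch; the grid representation as a list of rows with `None` for empty cells is an assumption, not the paper's encoding) is:

```python
def valid_sudoku(grid):
    """Check a square Sudoku grid (4x4 or 9x9) for validity: no digit
    repeats within any row, column, or block. Empty cells are None."""
    n = len(grid)
    b = int(n ** 0.5)  # block side length: 2 for 4x4, 3 for 9x9
    units = []
    units += [[(r, c) for c in range(n)] for r in range(n)]           # rows
    units += [[(r, c) for r in range(n)] for c in range(n)]           # columns
    units += [[(br + r, bc + c) for r in range(b) for c in range(b)]  # blocks
              for br in range(0, n, b) for bc in range(0, n, b)]
    for unit in units:
        seen = [grid[r][c] for r, c in unit if grid[r][c] is not None]
        if len(seen) != len(set(seen)):
            return False
    return True
```

Note that a single misread digit can flip this label, which is why longer sequences of unstructured inputs make the generated examples more sensitive to distributional shift.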

We first pre-trained both Softmax and EDL-GEN neural networks on standard images from the MNIST training set. In all experiments, we used MNIST digits 1–4 and 1–9 for the respective Sudoku grid size tasks. Figure 13 shows the accuracy and confidence score distribution of the pre-trained neural networks for the \(4\times 4\) and \(9\times 9\) grid tasks on two test sets: a standard MNIST test set, and a test set where the MNIST digits have been rotated 90\(^{\circ }\) clockwise, representing a distributional shift. The test sets also contain MNIST digits 1–4 or 1–9, depending on the Sudoku grid size.
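The rotated-digit shift amounts to a simple array rotation; a minimal pure-Python sketch (our own, with an image given as a 2D list of pixel rows) is:

```python
def rotate90_cw(image):
    """Rotate a 2D pixel array 90 degrees clockwise, producing a
    rotated-digit distributional shift of the kind used here."""
    return [list(row) for row in zip(*image[::-1])]
```

Applying the rotation four times returns the original image, so the shift is a fixed, reversible transformation of the input distribution rather than added noise.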

Fig. 13

Neural network performance under distributional shifts, Sudoku Grid Validity task

The results are similar to those for the Follow Suit Winner task. The Softmax neural network predicted with high confidence even over data subject to distributional shift, despite its low test set accuracy. The EDL-GEN neural network predicted more accurately than Softmax on data subject to distributional shift, but Softmax was slightly more accurate on the standard test sets.

7.1 Learned hypothesis evaluation

Figure 14 presents the accuracy of the hypotheses learned from unstructured data with increasing percentages of distributional shift, given by rotating MNIST digit images. We plot the mean accuracy over 5 repeats, and the error bars denote standard error. In both Sudoku Grid Validity tasks, the FFNSL approaches have as input a background knowledge that encodes the concept of a Sudoku grid (see Appendix F for details). For the \(4\times 4\) task, we created an additional, more challenging task with a reduced background knowledge, where facts about column, row and block were not given but implicitly inferred from a more general notion of division and cell coordinates (given as meta-data). For the \(9\times 9\) task, we also created an additional training task for the RF (the best performing baseline), where pre-trained neural network predictions were post-processed into 3 Boolean features, indicating whether digits were in the same row, column or block, which were given as input to the RF. This type of input effectively encoded the Sudoku grid knowledge into the RF learning task, and constituted even more information than what was provided to our FFNSL approaches. We demonstrate that FFNSL performed similarly to this baseline with additional background information. Full FastLAS task listings are given in Appendix F.

Fig. 14

Accuracy, over 5 repeats, of the learned hypotheses with increasing percentages of data subject to distributional shifts, Sudoku Grid Validity task

FFNSL approaches outperformed the baselines in both the \(4\times 4\) and \(9\times 9\) tasks by learning far more accurate hypotheses. In the \(4\times 4\) task, the baselines required \(100\times \) the amount of data to reach an accuracy close to that of FFNSL, whereas in the \(9\times 9\) task, the baselines failed completely. In Fig. 14a, the purple line is FFNSL with the explicit background knowledge about the Sudoku grid removed. In this case, the FFNSL approach used the EDL-GEN neural network, and it outperformed the baselines. It also outperformed the other two FFNSL approaches, which used explicit background knowledge about the Sudoku grid, when 90% and 100% distributional shifts were applied to the data. This was because, with fewer explicit facts about the grid, the symbolic learner FastLAS was less constrained, and alternative hypotheses could be learned that better accommodated the (incorrect) predictions of the neural networks. With explicit facts about the Sudoku grid, the hypothesis space contained rules that performed either very well or very poorly. In Fig. 14b, the brown line shows the accuracy of the RF with the 3 Boolean input features, post-processed from the pre-trained neural network predictions, indicating whether digits were in the same row, column or block. FFNSL approaches performed similarly to this baseline that used extra input knowledge.

We investigated our results further to understand whether, in this domain also, using the pre-trained EDL-GEN neural network provides a benefit over Softmax in the presence of high percentages of distributional shift. We focused on 80–96% distributional shifts for the \(4\times 4\) Sudoku Grid Validity task and 95–99% distributional shifts for the \(9\times 9\) task, as this was where the performance of FFNSL deteriorated. Similarly to the Follow Suit Winner task, we ran 50 experimental repeats and ran two baseline FFNSL approaches with constant weight penalties. Figure 15 shows our further experimental results.

Fig. 15

FFNSL Softmax versus FFNSL EDL-GEN. Average accuracy of learned hypotheses over 50 repeats. Sudoku Grid Validity task

Firstly, for both \(4 \times 4\) and \(9 \times 9\) grids, the FFNSL Softmax and FFNSL EDL-GEN instances that used WCDPI example weight penalties calculated from neural network confidence scores outperformed the corresponding FFNSL instances with constant weight penalties. To investigate this further, we explored the WCDPI example weight penalty ratio for the \(4\times 4\) and \(9\times 9\) tasks.

Fig. 16

ILP example weight penalty ratio, Sudoku Grid Validity task

Figure 16 shows that both FFNSL Softmax and FFNSL EDL-GEN with neural network penalties had a larger weight penalty ratio than the corresponding FFNSL with constant penalties. For FFNSL EDL-GEN this was expected, and for FFNSL Softmax, this is explained by the fact that, as shown in Fig. 13, the Softmax neural network had a more varied confidence score distribution over data subject to distributional shift (rotated digits). The difference between the WCDPI example weight penalty ratio with neural network penalties and constant penalties explains the performance gain of FFNSL with neural network penalties versus FFNSL with constant penalties in Fig. 15. Similarly, the difference between the example weight penalty ratio of FFNSL EDL-GEN and that of FFNSL Softmax, with neural network penalties, also explains why FFNSL EDL-GEN outperformed FFNSL Softmax in Fig. 15a for \(4 \times 4\) grids. For the \(9 \times 9\) task the difference between the example weight penalty ratio of FFNSL EDL-GEN and that of FFNSL Softmax, with neural network penalties, is very small and this explains why FFNSL EDL-GEN and FFNSL Softmax show a similar performance in Fig. 15b.

Now, in Fig. 15a, FFNSL Softmax with constant weight penalties outperformed FFNSL EDL-GEN with constant weight penalties, whereas in Fig. 15b these two approaches performed similarly. To investigate this further, we consider the percentage of incorrect ILP examples when distributional shifts were applied. The results are shown in Fig. 17.

Fig. 17

The effect of applying distributional shifts on the percentage of incorrect ILP examples, Sudoku Grid Validity task

For the \(4 \times 4\) task, despite the pre-trained EDL-GEN neural network predicting on average more accurately (see Fig. 13), it led to a higher percentage of incorrect WCDPI examples than the pre-trained Softmax neural network, as shown in Fig. 17a. This explains the lower performance in Fig. 15a of FFNSL EDL-GEN with constant penalties. Using WCDPI example weight penalties calculated with EDL-GEN neural network confidence scores was able to rectify this and bias the LAS system to focus on learning a hypothesis from WCDPI examples containing correct neural network predictions. For the \(9\times 9\) task both pre-trained neural networks led to a similar percentage of incorrect WCDPI examples (see Fig. 17b), which explains the similar performance of FFNSL EDL-GEN and FFNSL Softmax with constant penalties (shown in Fig. 15b).

Let us now investigate the interpretability of FFNSL compared to the baseline approaches. The results are shown in Fig. 18.

Fig. 18

Interpretability of the learned hypotheses, Sudoku Grid Validity task

Again, FFNSL learned significantly more interpretable hypotheses than the baseline approaches. Examples of learned hypotheses are presented in Appendix A. As for the learning time, the results are shown in Fig. 19.

Fig. 19

Hypothesis learning time, Sudoku Grid Validity task

In both \(4\times 4\) and \(9 \times 9\) tasks, the learning time for FFNSL did not increase as the percentage of input data subject to distributional shifts increased. This was because the FastLAS learning system used by FFNSL learned a hypothesis by solving an optimisation problem with respect to all generated WCDPI examples. This was not the case for the Follow Suit Winner task where the ILASP system learned an optimal hypothesis iteratively over the examples. It is interesting to note that in Fig. 19a, FFNSL’s learning time had the same order of magnitude as that of the CNN-LSTM trained with \(100\times \) the amount of data, which had lower accuracy up to \(90\%\) of distributional shifts.

In conclusion, we have shown that for the learned hypotheses evaluation, FFNSL outperformed the baseline approaches in terms of accuracy and interpretability, even when the baselines were trained with \(100\times \) the amount of data. Furthermore, in this task, WCDPI example weight penalties had a larger impact on the performance of FFNSL. We have also shown that FFNSL can scale to learning hypotheses where many more unstructured data points \(x_{i}\) are observed per labelled input \(\langle {\varvec{x}},y\rangle \), and in these cases FFNSL learns in a timely manner.

7.2 FFNSL framework evaluation

The final evaluation is the accuracy of the overall FFNSL framework when it is applied to a test set of unseen data also subject to distributional shifts. Figure 20 shows the mean accuracy over 5 repeats and the error bars denote standard error.

Fig. 20

Accuracy of the FFNSL framework when training and test data were subject to distributional shifts. Sudoku Grid Validity task

In the \(4 \times 4\) task, FFNSL outperformed the baselines until 80% of the test data were subject to distributional shifts, even when the baselines were trained with \(100\times \) the amount of data. In the \(9\times 9\) task, FFNSL outperformed all baselines, with the exception of the RF with additional background knowledge, which performed similarly to FFNSL Softmax. It is interesting to analyse why, for the \(9\times 9\) task, FFNSL Softmax outperformed FFNSL EDL-GEN, especially when low percentages of test data were subject to distributional shift.

Aside from the accuracy of the learned hypotheses, there were two contributing factors to the test set accuracy shown in Fig. 20b: firstly, the ability to correctly predict test examples when input data was subject to distributional shifts, and secondly, the ability to correctly predict test examples when no distributional shifts were applied. In the \(9\times 9\) task, there were many more digit images on the grid for the neural network to predict. For test examples that were not subject to distributional shift, a single incorrect neural network prediction may have led to a misclassified example. At \(0\%\) shifts in Fig. 20b, FFNSL Softmax outperformed FFNSL EDL-GEN. The Softmax neural network accuracy over unseen and non-rotated MNIST digits was 0.9927, whereas that of the EDL-GEN neural network was 0.9861. This explains the drop in performance for FFNSL EDL-GEN at \(0\%\) shifts. As distributional shifts were applied to the test set at percentages ranging between 10 and 80%, both FFNSL Softmax and FFNSL EDL-GEN failed to classify most examples subject to distributional shifts, but FFNSL EDL-GEN also failed to classify more of the examples that were not subject to distributional shifts, compared to FFNSL Softmax. At distributional shifts \(> 80\%\), the accuracy of the learned rules also became a factor, and the performance of both approaches deteriorated towards \(50\%\) accuracy.
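This sensitivity to grid size can be quantified with a simple calculation: assuming independent per-digit errors, the probability that every digit image in an example is predicted correctly is \(p^{d}\), where \(p\) is the per-digit accuracy and \(d\) the number of digit images (the digit counts below are illustrative, not taken from the paper).

```python
def all_correct_prob(per_digit_acc, num_digits):
    """Probability that every digit image in an example is predicted
    correctly, assuming independent per-digit errors."""
    return per_digit_acc ** num_digits

# With the per-digit accuracies reported above (Softmax 0.9927,
# EDL-GEN 0.9861), even a small per-digit gap compounds as the number
# of digits per grid grows.
```

For example, with 20 given digits (an assumed count), Softmax's example-level probability of a fully correct reading exceeds EDL-GEN's by several percentage points, consistent with FFNSL Softmax's edge at low shift percentages in Fig. 20b.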

8 Real-world datasets

In order to demonstrate FFNSL's applicability to real-world problems and datasets, in this section we present evaluations on two additional tasks: (1) Crop Yield Prediction, and (2) Indoor Scene Classification, introduced in Sects. 4.3 and 4.4 respectively. Let us now summarise each task.

Crop yield prediction The goal is to classify the quality of yield given an image of a particular crop, together with symbolic information denoting the crop's location. Softmax and EDL-GEN neural networks were trained to output species and disease information for each crop image, and the symbolic learner learned knowledge identifying which predicted crop features correspond to different qualities of yield. We used the Plant Village dataset, containing images of healthy and diseased crops (Hughes & Salathé, 2015), and generated a synthetic symbolic dataset for yield prediction. A distributional shift was applied to crop images using a hue filter, after the neural networks were pre-trained on the unaltered images. Example images are shown in Fig. 21 for a grape crop: Fig. 21a and b show standard images of a healthy crop and of one affected by black measles, respectively, and Fig. 21c and d show the same images with the distributional shift applied.
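A hue-filter shift of this kind can be sketched per pixel with the standard library's `colorsys` module (our own sketch; the shift amount is illustrative and the paper's exact filter is not reproduced here):

```python
import colorsys

def shift_hue(pixel, shift=0.25):
    """Shift the hue of one RGB pixel (0-255 channels) by `shift`, a
    fraction of a full hue rotation; saturation and value are kept.
    Applied to every pixel, this simulates a hue-filter shift."""
    r, g, b = (c / 255.0 for c in pixel)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    r2, g2, b2 = colorsys.hsv_to_rgb((h + shift) % 1.0, s, v)
    return tuple(round(c * 255) for c in (r2, g2, b2))
```

Because only hue changes, image structure is preserved while colours move out of the training distribution, e.g. a shift of one third of a rotation maps pure red to pure green while leaving greys untouched.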

Fig. 21

Example crop images from the Plant Village dataset with and without distributional shift

Indoor scene classification The goal is to learn knowledge that maps scene-level classifications (e.g., bedroom, bathroom, living room) into higher-level super-classes that correspond to a collection of scenes (e.g., home). In this task, we used a state-of-the-art neural network called Semantic Aware Scene Recognition (SASR) (López-Cifuentes et al., 2020). SASR is a dual-branch CNN trained to output scene-level classifications, utilising semantic segmentation information and raw image RGB pixel data on the two CNN branches respectively. The symbolic learner then learned the super-class of each scene. Both neural and symbolic datasets are real and were constructed from the MIT Indoor Scenes dataset (Quattoni & Torralba, 2009). To apply a distributional shift, we transformed each image using a Gaussian blur, hue shift, and 180\(^\circ \) rotation, after the neural network was trained on unaltered images. An example image for a bedroom scene is shown in Fig. 22. In order to obtain results in a timely manner, in this task we imposed a timeout on FastLAS, returning the best hypothesis found after 10 min. Also, in contrast to the other tasks, all models were trained with the same dataset size, as the baselines performed strongly when no distributional shift was applied. Finally, only one experimental repeat was performed, as the image train/test split was already defined in the dataset (Quattoni & Torralba, 2009).

Fig. 22

Example bedroom image from the MIT Indoor scene dataset with (b) and without (a) distributional shift applied

Figure 23 presents the neural network performance on both tasks, in terms of both accuracy and confidence score distribution, when classifying unseen images with and without distributional shift. For the Crop Yield Prediction task in Fig. 23a, the Softmax neural network achieved 88.18% accuracy on the standard images and performed poorly when classifying hue-shifted images; on both datasets, its predictions were made with very high confidence. The EDL-GEN neural network achieved 83.1% accuracy on the standard images and 31.8% accuracy on the hue-shifted images, which is much higher than Softmax. Crucially, the EDL-GEN neural network predicted with much lower confidence than Softmax on both datasets, which better reflects the predictive accuracy. However, the confidence for standard images was somewhat lower than expected, as 58% of predictions were made with less than 25% confidence, despite 83% accuracy. For the Indoor Scene Classification task in Fig. 23b, the distributional shift reduced the network accuracy from 87.01% to 11.19%, although the confidence scores from the SASR network appropriately reflected the reduced accuracy when the distributional shift was applied. Although SASR does not have an uncertainty-aware architecture like the EDL-GEN networks used in the other tasks, it was able to predict with low confidence under our distributional shift. We suspect this was because the shifted samples fell between the decision boundaries of the 67 scene classes, rather than being completely out-of-distribution, enabling the network to better reflect its uncertainty amongst the possible classes. We now present our evaluation of the learned hypotheses in each task.

Fig. 23

Neural network performance under distributional shifts

8.1 Learned hypothesis evaluation

Figure 24 shows the accuracy of the learned hypotheses in each task, when an increasing percentage of labelled unstructured data were subject to distributional shift. In the Crop Yield Prediction task (Fig. 24a), the reported accuracy is the mean over 5 repeats, and the error bars indicate standard error. In this task, both instances of FFNSL learned accurate hypotheses until 90% of the data were subject to distributional shift, and outperformed all the baseline approaches, even when the baselines were trained with significantly more data. In the Indoor Scene Classification task (Fig. 24b), FFNSL also outperformed the baseline approaches, learning the correct hypothesis for up to 50% of the data subject to distributional shift. Setting the weight penalties of the examples based on the neural network confidence scores led to more accurate hypotheses than using constant penalties. This is because the neural network-based weight penalties enabled FastLAS to find a better optimal solution within the 10 min timeout, as the more informative weight penalties gave a clearer optimisation signal for the final solving stage. With constant weight penalties, the optimisation took significantly longer as the distributional shift increased (see Fig. 27b).

Fig. 24

Learned hypothesis accuracy

To explore further the effect of using example weight penalties set by neural network confidence scores, compared to using constant penalties, we ran 50 experimental repeats between 95 and 100% shifts. Figure 25 shows the accuracy, weight penalty ratio, and hypothesis length comparison for the Crop Yield Prediction task.

Fig. 25

The effect of setting ILP example weight penalties based on neural network confidence scores, compared to constant penalties, Crop Yield Prediction task. 95–100% shifts, 50 repeats

The EDL-GEN instances of FFNSL clearly outperformed the Softmax instances (see Fig. 25a). This is because the EDL-GEN neural network predicted with greater accuracy than Softmax when distributional shifts were applied (see Fig. 23a). However, setting the weight penalties of the examples for the symbolic learner based on neural network confidence scores made very little difference in the FFNSL EDL-GEN instances. We therefore investigated the weight penalty ratio (Fig. 25b). As expected, both FFNSL Softmax instances had a similar weight penalty ratio, due to the Softmax neural network predicting with high confidence when distributional shift was applied (see Fig. 23a). The EDL-GEN instances did, however, show a difference, and the neural network-based weight penalties did provide a more informative signal. The question, therefore, is why this did not translate into an improvement in learned hypothesis accuracy. It turns out that the benefit was realised in the length of the learned hypothesis (Fig. 25c): FFNSL EDL-GEN with neural network weight penalties learned a shorter hypothesis than when constant penalties were used. Comparing Fig. 25c with Fig. 25a, one can see that at 99% shifts, when the accuracy of FFNSL EDL-GEN with constant penalties decreased, the length of the learned hypotheses also decreased, whilst FFNSL EDL-GEN with neural network weight penalties achieved higher accuracy with a shorter hypothesis. With constant penalties, to account for the level of noise, the symbolic learner had to learn more rules mapping additional values of location type, plant species and disease to crop yield, in order to maintain the same level of accuracy as when neural network-based penalties were used.

Finally, Figs. 26 and 27 present the interpretability and learning time results for both tasks. FFNSL learned significantly more interpretable hypotheses than the baseline approaches in both tasks. In terms of learning time, FFNSL learned a hypothesis faster than the baselines trained with more examples in the Crop Yield Prediction task, and was slower than the baselines in the Indoor Scene Classification task. Figure 27b clearly shows the computational benefit of setting ILP example weight penalties based on neural network confidence scores, as a hypothesis was learned significantly faster than when constant penalties were used. The near constant learning times at 80–100% shifts for FFNSL SASR with neural network-based penalties, and 30–100% shifts with constant penalties, was due to the 10 min timeout imposed on each FastLAS learning task.

Fig. 26

Interpretability of the learned hypotheses

Fig. 27

Hypothesis learning time

8.2 FFNSL framework evaluation

Figure 28 presents the accuracy of the entire FFNSL framework when evaluated over test data also subject to the same percentage of distributional shift as used during learning.

Fig. 28

Accuracy of the FFNSL framework when training and test data were subject to distributional shift

In the Crop Yield Prediction task (Fig. 28a), FFNSL EDL-GEN outperformed all other methods, and FFNSL Softmax outperformed all other methods trained with the same amount of data. The next best approach, the random forest, required significantly more data to match the performance of FFNSL Softmax. The superior performance of FFNSL EDL-GEN compared to FFNSL Softmax was due to the EDL-GEN neural network predicting more accurately for images subject to distributional shift (see Fig. 23a). In the Indoor Scene Classification task, FFNSL performed similarly to the best baseline approach, and all approaches degraded gracefully as the percentage of data points subject to distributional shift increased.

To conclude, this evaluation of FFNSL on real-world datasets shows that the framework can support a wide range of neural modules, and that the D2K component is flexible enough to support the interface between different neural and symbolic modules. Taking into account the Follow Suit Winner and Sudoku Grid Validity results, we have also shown that FFNSL can learn complex, first-order symbolic knowledge, using essential aspects of common-sense learning and reasoning such as negation as failure and predicate invention. In the next section, we discuss related work before concluding the paper.

9 Related work

Our proposed FFNSL approach is a specific form of a neural-symbolic learning and reasoning system which, unlike other neural-symbolic methods, uses pre-trained neural networks and logic-based machine learning systems to learn interpretable, logic-based knowledge from unstructured data that can be used to solve a given task. Most of the recently proposed neural-symbolic approaches focus on ways in which given, fixed knowledge can be used to improve the training of a neural network (Serafini & d’Avila Garcez, 2016; Donadello et al., 2017; Riegel et al., 2020; Manhaeve et al., 2018; Yang et al., 2020; Tsamoura et al., 2021). These approaches leverage the notions of Real Logic (Serafini & d’Avila Garcez, 2016; Donadello et al., 2017) or t-norm functions (Flaminio & Marchioni, 2006) to enable the injection of logical reasoning into data-driven relational machine learning. This is the case, for instance, of the Logic Tensor Network approaches proposed in Serafini and d’Avila Garcez (2016); Donadello et al. (2017). Our FFNSL approach also uses a similar notion of t-norms, not to embed logic into the differentiable setting, but rather to “combine” neural network predictive approximations with logic-based learning optimisation, thus enabling the composition of these two different machine learning paradigms.
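The conjunctive role a t-norm plays when combining several neural predictions can be illustrated with a minimal sketch. The product t-norm shown here is one standard choice; the function name and the example confidence values are illustrative assumptions, not taken from the paper.

```python
from functools import reduce

def product_tnorm(confidences):
    """Product t-norm: combine independent per-feature confidence
    scores in [0, 1] into a single conjunctive score. Satisfies the
    t-norm axioms (commutative, associative, monotone, identity 1)."""
    return reduce(lambda a, b: a * b, confidences, 1.0)

# e.g. combining the confidences of three separate feature predictions
w = product_tnorm([0.9, 0.8, 0.95])
print(w)  # ~0.684: the joint confidence that all three features are correct
```

Combining confidences this way means one uncertain prediction lowers the weight of the whole symbolic example, which is exactly the behaviour wanted when downstream logic-based optimisation must discount examples built from doubtful neural features.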

Neural-symbolic approaches that preserve the composition of neural and symbolic inference include DeepProbLog (Manhaeve et al., 2018), NeurASP (Yang et al., 2020) and NeuroLog (Tsamoura et al., 2021). They compose deep learning architectures with symbolic reasoning in order to use existing background knowledge, expressed as logic programs, to train deep learning models. DeepProbLog (Manhaeve et al., 2018) uses ProbLog (De Raedt et al., 2007) to interpret network outputs as probabilistic atoms, and compiles the symbolic knowledge into an arithmetic circuit to train the network. NeurASP extends ASP with neural predicates, expressed as choice rules, to symbolically capture possible network outputs. The probability of each model of the ASP program is computed based on the network predictions, which is in turn used to optimise a semantic loss function for training the network (Xu et al., 2018). NeuroLog also trains the neural network using a semantic loss function, although it uses abduction to prune the space of possible pseudo-label revisions for the latent concepts, instead of considering all possibilities as in NeurASP. Although compositional in their architectural solution, and novel in their end-to-end approach for differentiable training of the neural networks, these methods require the logic-based knowledge to be manually engineered. Our FFNSL approach, on the other hand, enables the learning of logic-based knowledge from unstructured data by exploiting pre-trained neural models. The semantics of the underlying logic-based learning algorithm in FFNSL is the Answer Set semantics, as is the case for the symbolic component of the NeurASP system, but with the advantage that in FFNSL the knowledge expressed in ASP programs is learned instead of being fully encoded as input.

In contrast to the end-to-end nature of the DeepProbLog and NeurASP neural-symbolic systems, FFNSL adopts a pipeline approach. It is therefore somewhat related to the Concept Bottleneck Model architecture proposed in Koh et al. (2020), which advocates first training a model to predict “primary” concepts and then using these concepts to train a downstream model for predicting the labels. These models are, however, differentiable, and even though they can be trained in an end-to-end fashion to improve the overall accuracy (Koh et al., 2020), the trained downstream model is not interpretable: its interpretability is limited to extracting correlations between the primary concepts and the final label. In our FFNSL approach, the use of LAS logic-based machine learning systems allows the learning of knowledge that is fully interpretable and more robust to distributional shifts and noise in the data. In fact, the CNN-LSTM and FCN baselines used in our tasks could be considered independent concept bottleneck models, and FFNSL outperformed both of these models in our evaluation.

The compositional aspect of our framework could, in principle, make it amenable to instantiations where the symbolic component is a probabilistic rule learning system. Different probabilistic rule learning and statistical relational learning systems have been proposed, such as ProbFOIL (De Raedt et al., 2015), SLIPCOVER (Bellodi & Riguzzi, 2013), Markov Logic Networks (Richardson & Domingos, 2006) and Credal-FOIL (Tuckey et al., 2020). They adopt a probabilistic notion of uncertainty which differs from the notion of WCDPI example weight penalties used in our FFNSL approach. Using such systems would, however, make FFNSL inapplicable to tasks where non-observational predicate learning with negation as failure is required, such as our Follow Suit Winner task, and would limit its scalability, since it remains to be shown whether current probabilistic rule learning systems scale to a large number of probabilistic facts and large hypothesis search spaces.

Related approaches that support the learning of interpretable knowledge from (unstructured) data in a neural-symbolic manner include \(\delta \)ILP (Evans & Grefenstette, 2018) and NeuralLP (Yang et al., 2017). They make use of rule templates and differentiable reasoning to approximate the inference process, learning instances of the rule templates that cover given labelled examples or answer given queries. Such approaches preserve the symbolic, logic-based representation of the knowledge, but replace the logic-based inference process with a purely differentiable one. Our FFNSL approach instead uses a purely symbolic inference process to learn interpretable knowledge, leveraging state-of-the-art logic-based machine learning systems such as ILASP and FastLAS. The composition of these systems with differentiable feature extraction from unstructured data enables FFNSL to learn knowledge that is more expressive than the definite clausal form supported by \(\delta \)ILP and NeuralLP, broadening the applicability of FFNSL to real-world problems where non-monotonicity and preference learning are required. Results in Law et al. (2018) have already demonstrated that, in the case of structured data, the ILASP system used by our FFNSL framework outperforms \(\delta \)ILP when learning interpretable knowledge from noisy examples.

Neural-symbolic systems such as Neural-Theorem Prover (Rocktäschel & Riedel, 2017) and its extensions adopt instead a counterpart approach, whereby knowledge is expressed as dense vector embeddings that are learned in a differentiable manner using a symbolically inspired backward-chaining algorithm and (soft) unification. In these systems, the knowledge is represented in a high-dimensional differentiable space and the inference is symbolically inspired. More recently, a fully differentiable rule induction approach based on Logical Neural Networks has been proposed (Sen et al., 2021) that uses differentiable operators from fuzzy and real logic to learn rules from structured data within a very controlled search space expressed using templates. Although some of these systems have recently been shown to be somewhat scalable over large knowledge bases (Minervini et al., 2020), they are all limited in the expressivity of the knowledge they can learn, and they are not guaranteed to learn (mathematically provable) optimal solutions. These are two main properties that our FFNSL framework benefits from, making our approach particularly suited to safe and trusted AI applications where data are unstructured and complex, and interpretable knowledge must be learned to solve complex tasks.

Recent approaches train a neural network to extract primary concepts from raw data, whilst learning interpretable symbolic knowledge in an end-to-end fashion (Dai et al., 2019; Dai & Muggleton, 2021). These methods do not require labels for the primary concepts, and train a neural network from scratch whilst simultaneously learning knowledge. The Abductive Learning framework (ABL) (Dai et al., 2019) learns ground operation facts that complete a symbolic knowledge base, mapping neural network outputs to downstream labels. This knowledge is then used to abduce revised pseudo-labels to improve the training of the neural network. Crucially, ABL (Dai et al., 2019) cannot perform program induction, and assumes monotonicity of the background knowledge, as ground operation facts are abduced and accumulated during an iterative sampling process over the training data. In contrast, our approach learns first-order rule-based programs, which contain universally quantified variables and are therefore applicable to a range of input sizes greater than the sizes used for training. We can also handle non-monotonicity, thus enabling the learning of more complex knowledge. The \(Meta_{Abd}\) approach (Dai & Muggleton, 2021) extends Dai et al. (2019) to perform rule induction using the Metagol symbolic learner (Muggleton et al., 2015). The key drawback of \(Meta_{Abd}\) is that Metagol can only learn symbolic knowledge expressed as definite logic programs without function symbols, which can compute only polynomial functions (Dantsin et al., 2001). \(Meta_{Abd}\) cannot learn more expressive knowledge involving defaults, exceptions, constraints and choice, which are essential aspects of common-sense learning and reasoning. In FFNSL, we learn first-order complex knowledge expressed in the language of ASP, which is more general than symbolic learning of definite clauses (Law, 2018; Law et al., 2020, 2018), and can solve computationally harder problems (Karp, 1972).
Also, due to the high level of difficulty of such an end-to-end neural-symbolic task, \(Meta_{Abd}\) has only been applied to very simple classification problems. Our architecture is motivated by a completely different requirement: that of using already trained, and therefore possibly much more complex, neural components for extracting features from challenging raw data.

10 Conclusion

This paper introduces a neural-symbolic learning framework, FFNSL, that learns interpretable knowledge from unstructured data and is robust to distributional shifts. Three main instantiations of this framework have been presented, which use the ILASP and FastLAS logic-based machine learning systems, according to the type of symbolic learning task required. In each instantiation, pre-trained neural networks have been used to extract symbolic features from the unstructured data. The novel component of FFNSL is the D2K generator, which generates symbolic features, weighted by neural network confidence scores, that together with a label form the input to the logic-based machine learning system, which then learns the interpretable knowledge needed to solve the given downstream task.

Our evaluation on four neural-symbolic classification tasks, Follow Suit Winner, Sudoku Grid Validity, Crop Yield Prediction and Indoor Scene Classification, demonstrates that FFNSL is robust to distributional shifts in the input data, outperforming random forest and deep neural network baselines. FFNSL learns more accurate and interpretable knowledge than the baselines even when the latter are trained with significantly more data. The application of FFNSL-learned knowledge to unseen data, also subject to similar proportions of distributional shift, shows that FFNSL again outperforms the baseline approaches trained with the same amount of data for up to \(\sim \)80% of data subject to distributional shifts. A detailed analysis of the accuracy of our FFNSL framework shows that using an uncertainty-aware neural network provides an improved bias to the logic-based machine learning system compared to Softmax neural networks, with a greater proportion of the total weight penalty allocated to WCDPI examples containing correct contextual information extracted from the unstructured data.