1 Introduction

Understanding visual scenes is a fundamental problem in building an intelligent agent. Deep Neural Networks such as Convolutional Neural Networks (CNNs) have succeeded in many visual-perception benchmarks but produce poor performance in complex visual scenes, where several objects appear in an image, and the agent needs to reason and learn about the attributes and relations. CNN-based models do not explicitly encode objects and relations, and thus often fail to capture the patterns defined in complex visual scenes.

Kandinsky patterns (Holzinger et al., 2019; Müller & Holzinger, 2021; Holzinger et al., 2021) have been proposed to assess the ability of intelligent systems to explain complex visual scenes. In a similar vein, CLEVR-Hans (Stammer et al., 2021) has been proposed to assess the ability of a model to understand confounded visual scenes. CNN-based models cannot produce proper explanations in such cases and can also suffer from the problem of confounding factors. Moreover, they are data-hungry and struggle to learn abstract visual relations (Kim et al., 2018). A natural question thus arises: How can we build an intelligent system avoiding these pitfalls?

To overcome the shortcomings of CNN-based models, Neuro-Symbolic approaches (Besold et al., 2017; d’Avila Garcez & Lamb, 2020; Tsamoura et al., 2021) have emerged, where symbolic computations are integrated with neural networks. Many logic-based neuro-symbolic frameworks have been proposed, e.g., DeepProblog (Manhaeve et al., 2018, 2021), NeurASP (Yang et al., 2020), and \(\partial\)ILP (Evans & Grefenstette, 2018). However, these systems are either not capable of complete structure learning from visual input (Manhaeve et al., 2018, 2021; Yang et al., 2020) or not capable of handling complex rules and visual scenes (Evans & Grefenstette, 2018). Therefore, structure learning on complex scenes such as Kandinsky patterns (Holzinger et al., 2019; Müller & Holzinger, 2021; Holzinger et al., 2021) and CLEVR-Hans (Stammer et al., 2021) problems is difficult, if not impossible, using these frameworks.

To mitigate this issue, we propose \(\alpha\)ILP, a novel differentiable Inductive Logic Programming (ILP) framework that combines object-centric perception with ILP (Muggleton, 1991, 1995; Nienhuys-Cheng et al., 1997; Cropper et al., 2022), making it one of the first systems of the fourth type of neuro-symbolic system, i.e., Neuro:Symbolic\(\rightarrow\)Neuro, as proposed by Kautz (2022). \(\alpha\)ILP maps the output of neural networks (Neuro) to symbolic representations (Symbolic), and gradient-based learning is then performed on top of them (Neuro). \(\alpha\)ILP performs structure learning, i.e., learns discrete logic programs from complex visual scenes. To this end, our system is an extension of Neuro-Symbolic systems such as DeepProblog (Manhaeve et al., 2018, 2021) and \(\partial\)ILP (Evans & Grefenstette, 2018).

\(\alpha\)ILP has an end-to-end reasoning architecture from visual input, which consists of three main components: (i) visual perception module, (ii) facts converter, and (iii) differentiable reasoning module. The facts converter converts the output of the visual perception module into the form of probabilistic facts, which can be fed into the reasoning module. Then, the reasoning module performs differentiable forward-chaining inference from a given set of facts. It computes the set of facts that can be deduced from the given set of facts and weighted logical rules (Evans & Grefenstette, 2018; Shindo et al., 2021). The final prediction can be made based on the result of the forward-chaining inference. \(\alpha\)ILP learns logic programs that encode high-level scene information by differentiable ILP techniques (Shindo et al., 2021). It generates candidates of clauses by top-k beam search and learns the weights for the clauses by backpropagation.

Overall, we make a number of key contributions: (1) We propose \(\alpha\)ILP, a novel framework that performs differentiable ILP from visual scenes. (2) To establish \(\alpha\)ILP, we propose an end-to-end reasoning architecture from visual inputs. It performs differentiable forward-chaining inference for visual scenes by using perception models and a facts-converting algorithm. (3) We also propose a learning scheme for \(\alpha\)ILP to perform differentiable ILP for complex visual scenes. It integrates differentiable ILP techniques with the visual domain, i.e., generates clauses efficiently and performs gradient-based optimization from complex visual scenes. (4) We empirically show the following advantages of \(\alpha\)ILP: (i) \(\alpha\)ILP solves ILP problems in visual scenes, i.e., Kandinsky patterns (Holzinger et al., 2019; Müller & Holzinger, 2021; Holzinger et al., 2021) and CLEVR-Hans (Stammer et al., 2021), with high accuracy, outperforming neural baseline models. (ii) \(\alpha\)ILP can generate explanations, i.e., produces a readable solution in the form of logic programs. (iii) \(\alpha\)ILP is robust to confounding, i.e., avoids being over-fitted to confounding factors. (iv) \(\alpha\)ILP is data-efficient, i.e., reports no performance drop even when using 10% of the training data. (v) \(\alpha\)ILP can perform fast inference. It supports efficient parallelized batch computation on GPUs and can therefore classify a large number of instances in a large dataset quickly.

2 Background and related work

We use bold lowercase letters \(\textbf{v}, \textbf{w}, \ldots\) for vectors. We use bold capital letters \(\textbf{X}, \ldots\) for tensors. We use calligraphic letters \({\mathcal{C}}, {\mathcal{A}}, \ldots\) for (ordered) sets and typewriter font \(\mathtt{p(X,Y)}\) for terms and predicates in logical expressions.

2.1 Preliminaries on logic and ILP

Language \({\mathcal{L}}\) is a tuple \(({\mathcal{P}}, {\mathcal{F}}, {\mathcal{T}}, {\mathcal{V}})\), where \({\mathcal{P}}\) is a set of predicates, \({\mathcal{F}}\) is a set of function symbols, \({\mathcal{T}}\) is a set of constants, and \({\mathcal{V}}\) is a set of variables. A term is a constant, a variable, or an expression \(\mathtt{f(t_1, \ldots , t_n)}\) where \(\texttt{f}\) is an n-ary function symbol and \(\mathtt{t_1, \ldots , t_n}\) are terms. We denote an n-ary predicate \(\texttt{p}\) by \(\texttt{p}/(n,[\mathtt{dt_1}, \ldots , \mathtt{dt_n}])\), where \(\mathtt{dt_i}\) is the datatype of the i-th argument. An atom is a formula \(\mathtt{p(t_1, \ldots , t_n) }\), where \(\texttt{p}\) is an n-ary predicate symbol and \(\mathtt{t_1, \ldots , t_n}\) are terms. A ground atom or simply a fact is an atom with no variables. A literal is an atom or its negation. A positive literal is just an atom. A negative literal is the negation of an atom. A clause is a finite disjunction (\(\vee\)) of literals. A definite clause is a clause with exactly one positive literal. If \(A, B_1, \ldots , B_n\) are atoms, then \(A \vee \lnot B_1 \vee \cdots \vee \lnot B_n\) is a definite clause. We write definite clauses in the form of \(A\,\hbox{:-}\,B_1,\ldots ,B_n\). Atom A is called the head, and the set of atoms \(\{B_1, \ldots , B_n\}\), which appear negated in the disjunctive form, is called the body. We denote the special constant \(\textit{true}\) as \(\top\) and \(\textit{false}\) as \(\bot\). An atom is an atomic formula. For formulas F and G, \(\lnot F\), \(F \wedge G\), and \(F \vee G\) are also formulas.
An interpretation of language \({\mathcal{L}}\) is a tuple \(({\mathcal{D}}, {\mathcal{I}}_{\mathcal{A}}, {\mathcal{I}}_{\mathcal{F}}, {\mathcal{I}}_{\mathcal{P}})\), where \({\mathcal{D}}\) is the domain, \({\mathcal{I}}_{\mathcal{A}}\) is the assignment of an element in \({\mathcal{D}}\) for each constant \(\texttt{a} \in {\mathcal{A}}\), \({\mathcal{I}}_{\mathcal{F}}\) is the assignment of a function from \({\mathcal{D}}^n\) to \({\mathcal{D}}\) for each n-ary function symbol \(\texttt{f} \in {\mathcal{F}}\), and \({\mathcal{I}}_{\mathcal{P}}\) is the assignment of a function from \({\mathcal{D}}^n\) to \(\{ \top , \bot \}\) for each n-ary predicate \(\texttt{p} \in {\mathcal{P}}\). For language \({\mathcal{L}}\) and formula F, an interpretation \({\mathcal{I}}\) is a model if the truth value of F w.r.t. \({\mathcal{I}}\) is true. Formula F is a logical consequence or logical entailment of a set of formulas \({\mathcal{S}}\), denoted \({\mathcal{S}} \models F\), if, for every interpretation \({\mathcal{I}}\) of \({\mathcal{L}}\), \({\mathcal{I}}\) being a model for \({\mathcal{S}}\) implies that \({\mathcal{I}}\) is a model for F.
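As a concrete reading of these definitions, the following Python sketch encodes terms, atoms, and definite clauses as plain data classes. The class names (`Const`, `Var`, `Atom`, `Clause`) and the `is_ground` helper are illustrative choices that do not come from the paper, and function terms are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Const:
    name: str  # e.g. "obj1", "red"


@dataclass(frozen=True)
class Var:
    name: str  # e.g. "X", "O1"


# Function terms f(t_1, ..., t_n) are left out of this sketch.
Term = Union[Const, Var]


@dataclass(frozen=True)
class Atom:
    pred: str
    args: Tuple[Term, ...]

    def is_ground(self) -> bool:
        # A ground atom (a fact) contains no variables.
        return all(isinstance(t, Const) for t in self.args)

    def __str__(self) -> str:
        return f"{self.pred}({','.join(t.name for t in self.args)})"


@dataclass(frozen=True)
class Clause:
    head: Atom              # the single positive literal A
    body: Tuple[Atom, ...]  # B_1, ..., B_n (negated in the disjunctive form)

    def __str__(self) -> str:
        return f"{self.head} :- {', '.join(map(str, self.body))}."


X, O1 = Var("X"), Var("O1")
rule = Clause(
    head=Atom("pos", (X,)),
    body=(Atom("in", (O1, X)), Atom("color", (O1, Const("red")))),
)
fact = Atom("color", (Const("obj1"), Const("red")))
print(rule)              # pos(X) :- in(O1,X), color(O1,red).
print(fact.is_ground())  # True
```

The frozen data classes are hashable, so atoms can be used directly as dictionary keys when facts need to be indexed.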

An ILP problem \({\mathcal{Q}}\) is a tuple \(({\mathcal{E}}^+, {\mathcal{E}}^-, {\mathcal{B}}, {\mathcal{L}})\), where \({\mathcal{E}}^+\) is a set of positive examples, \({\mathcal{E}}^-\) is a set of negative examples, \({\mathcal{B}}\) is background knowledge, and \({\mathcal{L}}\) is a language. Background knowledge can be given in the form of the set of facts or clauses. The solution to an ILP problem is a set of definite clauses \({\mathcal{H}} \subseteq {\mathcal{L}}\) that satisfies the following conditions:

\(\forall A \in {\mathcal{E}}^+ \, {\mathcal{H}} \cup {\mathcal{B}} \models A\) and \(\forall A \in {\mathcal{E}}^- \,{\mathcal{H}} \cup {\mathcal{B}} \not \models A.\)

2.2 Related work towards visual ILP

Over 50 years ago, M. M. Bongard, a Russian computer scientist, invented a collection of one hundred human-designed visual recognition tasks (Bongard & Hawkins, 1970), now named the Bongard Problems (BPs), to demonstrate the gap between high-level human cognition and computerized pattern recognition. Inspired by BPs, the Bongard-LOGO (Nie et al., 2020) problem has been proposed as a benchmark for the machine learning community. Kandinsky patterns (Holzinger et al., 2019; Müller & Holzinger, 2021; Holzinger et al., 2021) have been proposed to assess the ability of intelligent systems to explain complex visual scenes. In a similar vein, CLEVR-Hans (Stammer et al., 2021) has been proposed to assess the ability of the model to understand the confounded visual scenes. These benchmarks present a challenge to CNN-based recognition models.

Logic, both propositional and first-order, is an established framework for performing reasoning on machines (Lloyd, 1984; Kowalski, 1988). A pioneering study of inductive inference on logic was done in the early 1970s (Plotkin, 1971). The Model Inference System (MIS) (Shapiro, 1983) has been implemented as an efficient search algorithm for logic programs. Inductive Logic Programming (Muggleton, 1991, 1995; Nienhuys-Cheng et al., 1997; Cropper et al., 2022) has emerged at the intersection of machine learning and logic programming. Many ILP frameworks have been developed, e.g., FOIL (Quinlan, 1990), Progol (Muggleton, 1995), ILASP (Law et al., 2014), Metagol (Cropper & Muggleton, 2016; Cropper et al., 2019), and Popper (Cropper & Morel, 2021). Symbolic ILP systems are dedicated to symbolic inputs. \(\alpha\)ILP deals with visual inputs by having an end-to-end neuro-symbolic reasoning architecture. \(\alpha\)ILP employs structure-learning techniques similar to those developed for probabilistic logic programs (Bellodi & Riguzzi, 2015; Nguembang Fadja & Riguzzi, 2019) and performs learning on complex visual scenes. Different settings of probabilistic ILP approaches have been introduced in De Raedt et al. (2008). \(\alpha\)ILP is based on the learning from entailment setting, where the logical entailment is computed from probabilistic inputs. \(\alpha\)ILP computes the logical entailment with probabilistic values for facts and clauses in a differentiable manner.

The integration of symbolic programs and neural networks, which is called Neuro-Symbolic computation (Besold et al., 2017; d’Avila Garcez & Lamb, 2020; Tsamoura et al., 2021), has previously been addressed, e.g., DeepProblog (Manhaeve et al., 2018, 2021), NeurASP (Yang et al., 2020), \(\partial\)ILP (Evans & Grefenstette, 2018; Jiang & Luo, 2019), NS-CL (Mao et al., 2019), integration with abductive learning (Dai et al., 2019), and differentiable theorem provers (Rocktäschel & Riedel, 2017; Minervini et al., 2020). Kandinsky patterns and CLEVR-Hans cannot be solved easily by these frameworks because they require complete structure learning from complex visual scenes. DeepProblog supports structure learning but is limited for the sketching setting (Solar-Lezama, 2008; Bošnjak et al., 2017). \(\alpha\)ILP supports object-centric perception models, differentiable forward reasoning, and efficient clause search for solving tasks in complex visual scenes. Some neuro-symbolic models have been developed for Visual Question Answering (VQA) (Antol et al., 2015; Johnson et al., 2017; Santoro et al., 2018; Mao et al., 2019; Amizadeh et al., 2020). In VQA-based models, the symbolic programs are determined by the natural language sentences that represent questions, but \(\alpha\)ILP does not have that assumption. Moreover, \(\alpha\)ILP stands in the line of probabilistic logic programming (De Raedt et al., 2016; Raedt et al., 2020). Therefore, \(\alpha\)ILP can employ methods for probabilistic logic programming, which have been developed in the community. Similar concepts of some key components of \(\alpha\)ILP have been investigated in previous studies, e.g., Neural Predicates (Diligenti et al., 2017; Donadello et al., 2017; Badreddine et al., 2022), weighted forward-chaining reasoning (Sourek et al., 2018; Si et al., 2019), and differentiable structure learning (Evans & Grefenstette, 2018; Sourek et al., 2017). 
\(\alpha\)ILP is the first framework that integrates these concepts consistently for the visual object-centric domain. Logic Tensor Networks (LTNs) (Badreddine et al., 2022) provide a unified differentiable language for first-order logic. LTNs map each term in first-order logic to numerical representations in place of interpretation. Then predicates are grounded to functions that take numerical representations of terms and return a truth value in [0, 1]. \(\alpha\)ILP takes a similar approach to connect the sub-symbolic and symbolic representations.

Object-centric learning is an approach to decomposing an input image into representations in terms of objects (Dittadi et al., 2022). This problem has been widely addressed in the computer vision community. Another approach is the unsupervised approach (Burgess et al., 2019; Engelcke et al., 2020; Locatello et al., 2020), where the models acquire the ability of object-perception without or with fewer annotations. \(\alpha\)ILP uses these object-centric models as a perception module.

Differentiable solvers for dynamic programming problems have been developed (Cuturi & Blondel, 2017; Mensch & Blondel, 2018). \(\alpha\)ILP adopts some of these techniques to achieve differentiable implementations of the discrete operations for first-order logic. Various types of differentiable logical operations have also been investigated (van Krieken et al., 2022; Sen et al., 2022).

3 \(\alpha\)ILP

We now introduce \(\alpha\)ILP in the following steps. First, we give an overview of the problem setting and the framework. Second, we explain the reasoning architecture of \(\alpha\)ILP consisting of (i) the visual-perception module, (ii) facts converter, an algorithm to convert object-centric representations into probabilistic facts, and (iii) the differentiable forward-reasoning mechanism. Finally, we describe the learning strategy on \(\alpha\)ILP to perform differentiable ILP on visual scenes.

What is visual ILP? We address the ILP problem in visual scenes, which we call the visual ILP problem, where each example is given as an image containing several objects. The classification pattern is defined on high-level concepts such as attributes and relations of objects.

3.1 Architecture overview

Figure 1 gives an overview of \(\alpha\)ILP, which consists of a reasoning module and a learning module. We now introduce these in detail.

Fig. 1

An overview of \(\alpha\)ILP. (Reasoning) \(\alpha\)ILP has an end-to-end reasoning architecture from visual input based on differentiable forward reasoning. In the reasoning step, (1) the raw input images are factorized in terms of objects using the visual-perception model, (2) the object-centric representation is converted into a set of probabilistic facts, and (3) differentiable forward reasoning is performed using weighted clauses. (Learning) To solve the classification problem of visual scenes, we provide positive examples, negative examples, and background knowledge. Each example is given as a visual scene. \(\alpha\)ILP performs two-step learning as follows: (Step 1) A set of candidate clauses is generated by top-k beam search. The search is conducted from examples of visual scenes using the end-to-end reasoning architecture. (Step 2) The weights for the generated clauses are trained to minimize the loss function. By using the end-to-end reasoning architecture, \(\alpha\)ILP finds a logic program that explains the complex visual scenes by gradient descent.

3.1.1 Reasoning

\(\alpha\)ILP has an end-to-end reasoning architecture, which works as follows: (i) The raw input images are factorized in terms of objects using the visual-perception model. (ii) The object-centric representation is converted into a set of probabilistic facts. (iii) The differentiable forward reasoning is performed using weighted clauses. The bottom row of Fig. 1 illustrates the reasoning architecture in \(\alpha\)ILP.

3.1.2 Learning

\(\alpha\)ILP learns logic programs from visual inputs by performing differentiable ILP, i.e., we provide positive examples, negative examples, and background knowledge. Each example is given as a visual scene. The top row of Fig. 1 illustrates the learning pipeline in \(\alpha\)ILP. Learning with \(\alpha\)ILP proceeds as follows: (Step 1) A set of candidate clauses is generated by top-k beam search. The search is conducted from examples of visual scenes using the end-to-end reasoning architecture. (Step 2) The weights for the generated clauses are trained to minimize the loss function. By using the end-to-end reasoning architecture, \(\alpha\)ILP finds a logic program that explains the complex visual scenes by gradient descent. We now describe our architecture in detail.

3.2 Visual perception

We make the minimal assumption that the perception network takes an image and returns a set of object-centric vectors, where each dimension represents an attribute of the object, e.g., colors, shapes, and positions. Thus, any type of neural network that segments the input images into the individual objects present in the image can be utilized. For example, \(\alpha\)ILP can employ a slot attention model (Locatello et al., 2020) for 3D scenes. For natural images, \(\alpha\)ILP can instead employ other established object-detection models such as YOLO (Redmon et al., 2016), Faster-RCNN (Ren et al., 2015), and Mask-RCNN (He et al., 2017). The visual-perception module is trained on randomly-generated figures with annotations about each object, i.e., the number of objects and the attributes of the objects are randomly determined.

3.3 Facts converter: lifting to symbolic representation

After the object-centric perception, \(\alpha\)ILP generates a logical representation, i.e., a set of probabilistic facts. We propose a new type of predicate that can refer to differentiable functions to compute the probability. We also present an algorithm to convert the perception result into probabilistic facts.

3.3.1 Neural predicate

To build a bridge between the sub-symbolic and symbolic representations, we provide a new type of predicate, which we term neural predicates. A neural predicate is associated with a differentiable function, which we call a valuation function, that produces the probability of the grounded facts.

Definition 1

A neural predicate \(\texttt{p} / (n, [\mathtt{dt_1, \ldots , dt_n}])\) is an n-ary predicate associated with a valuation function \(v_\texttt{p}: {\mathbb{R}}^{d_1} \times \cdots \times {\mathbb{R}}^{d_n} \rightarrow {\mathbb{R}}\), where \(\mathtt{dt_i}\) is the datatype of the i-th argument and \(d_i \in {\mathbb{N}}\) is the dimension of the vector representation of the term whose datatype is \(\mathtt{dt_i}\).

Intuitively, we give the first-order logic interpretation for neural predicates and terms as follows: (i) each neural predicate is assigned to a function in a vector space, (ii) each term in the arguments of neural predicates is assigned to a vector. The vector can be an output of neural networks, or an encoding of the term, e.g., one-hot encoding of the attributes.

3.3.2 Facts-converting algorithm

The facts converter produces a set of probabilistic facts from the output of the perception module. Let \({\mathcal{G}}\) be the set of all facts; then the conversion proceeds as follows: For each fact \(\mathtt{p(t_1, \ldots , t_n)} \in {\mathcal{G}}\), if it consists of a neural predicate, then the corresponding valuation function \(v_\texttt{p}\) is called to compute the probability of the fact. Otherwise, zero is given as the probability of the fact. If the fact is in the background knowledge, one is given as the probability of the fact. The valuation function maps each term \(\mathtt{t_1, \ldots , t_n}\) to vector representations according to the interpretation. The forward reasoning function requires a vector that maps each fact to a probabilistic value to achieve the differentiable computation. Thus, the probabilistic values are computed for all of the facts.
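The conversion loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: facts are encoded as `(predicate, arguments)` tuples, `valuation_fns` is a hypothetical dictionary from neural-predicate names to callables, background facts are checked first (the text does not fix the precedence), and treating `in(obj, img)` as background knowledge is only an example.

```python
from typing import Callable, Dict, List, Set, Tuple

Fact = Tuple[str, Tuple[str, ...]]  # (predicate, ground arguments)


def convert_facts(
    all_facts: List[Fact],
    valuation_fns: Dict[str, Callable[..., float]],
    background: Set[Fact],
) -> List[float]:
    """Return one probability per fact, in the order of `all_facts`."""
    values = []
    for pred, args in all_facts:
        if (pred, args) in background:
            values.append(1.0)  # background knowledge holds with probability one
        elif pred in valuation_fns:
            values.append(valuation_fns[pred](*args))  # neural predicate
        else:
            values.append(0.0)  # everything else starts at zero
    return values


# Toy perception output: per-object color probabilities (made-up numbers).
perception = {
    "obj1": {"red": 0.9, "blue": 0.1},
    "obj2": {"red": 0.2, "blue": 0.8},
}
valuation_fns = {"color": lambda obj, col: perception[obj][col]}

facts = [
    ("pos", ("img",)),
    ("in", ("obj1", "img")),
    ("in", ("obj2", "img")),
    ("color", ("obj1", "red")),
    ("color", ("obj2", "red")),
]
background = {("in", ("obj1", "img")), ("in", ("obj2", "img"))}

v0 = convert_facts(facts, valuation_fns, background)
print(v0)  # [0.0, 1.0, 1.0, 0.9, 0.2]
```

The resulting list plays the role of the initial valuation vector that the forward-reasoning function consumes.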

Figure 2 illustrates an example of the implementation of the facts converter. We assume that the perception model produces the probabilities of the attributes (color, shape, position) for each object. (1) For neural predicate \(\texttt{color}/(2,[\texttt{object}, \texttt{color}])\), we compute the probability of fact \(\mathtt{color(obj3,red)}\) by calling the valuation function \(v_\texttt{color}\). Term \(\texttt{obj3}\) is mapped to the output of the perception module, and term \(\texttt{red}\) is mapped to its one-hot encoding. By using these vector representations of the terms, \(v_\texttt{color}\) computes the probability of the atom by simply performing tensor multiplication and summation. (2) For neural predicate \(\texttt{closeby}/(2,[\texttt{object}, \texttt{object}])\), we compute the probability of fact \(\mathtt{closeby(obj1, obj3) }\) by calling the valuation function \(v_\texttt{closeby}\). Terms \(\texttt{obj1}\) and \(\texttt{obj3}\) are mapped to the corresponding outputs of the perception model. Then the positional information is extracted, and logistic regression is performed on the distance between the two data points. By adapting the weights of the linear transformation, the facts converter can learn the concept of \(\texttt{closeby}\) flexibly. We note that the valuation functions of neural predicates are defined by the user, and parameterized valuation functions are trained before performing structure learning.
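The two valuation functions described above can be sketched as follows. The attribute vocabulary `COLORS`, the 2D positions, and the logistic-regression parameters `w` and `b` are placeholder assumptions for illustration; in the setting described here the parameters would be trained before structure learning.

```python
import math
from typing import List, Sequence

COLORS = ["red", "green", "blue"]  # assumed attribute vocabulary


def one_hot(value: str, vocab: Sequence[str]) -> List[float]:
    return [1.0 if v == value else 0.0 for v in vocab]


def v_color(obj_color_probs: Sequence[float], color_one_hot: Sequence[float]) -> float:
    # Element-wise multiplication followed by summation (a dot product).
    return sum(p * c for p, c in zip(obj_color_probs, color_one_hot))


def v_closeby(pos_a: Sequence[float], pos_b: Sequence[float],
              w: float = -4.0, b: float = 2.0) -> float:
    # Logistic regression on the Euclidean distance between two 2D positions.
    # w and b are placeholder values, not trained parameters.
    dist = math.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1])
    return 1.0 / (1.0 + math.exp(-(w * dist + b)))


# color(obj3, red): obj3 is mapped to its predicted color distribution,
# red is mapped to its one-hot encoding.
obj3_colors = [0.85, 0.05, 0.10]
p_red = v_color(obj3_colors, one_hot("red", COLORS))

# closeby(obj1, obj3): nearby objects score high, distant ones low.
p_near = v_closeby((0.20, 0.30), (0.25, 0.35))
p_far = v_closeby((0.10, 0.10), (0.90, 0.90))
```

With a negative `w`, the probability decreases monotonically with distance, which is the behavior one would expect from a learned notion of closeness.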

Fig. 2

An illustration of the facts converter in \(\alpha\)ILP. It decomposes the raw-input images into object-centric representations (left). The valuation functions are called to compute the probability of facts. Each term in the arguments is mapped to a vector representation (middle). The result is converted into the form of probabilistic facts (right)

3.4 Differentiable forward-chaining inference

Forward-chaining inference is a type of inference in first-order logic to compute logical entailment (Russell & Norvig, 2009). For example, let \({\mathcal{C}}\) be a set of clauses and \({\mathcal{G}}\) be a set of all known facts. Then, forward-chaining inference can compute the set of facts \({\mathcal{F}}\) such that \({\mathcal{C}} \cup {\mathcal{G}} \models {\mathcal{F}}\). Differentiable forward-chaining inference (Evans & Grefenstette, 2018; Shindo et al., 2021) computes the logical entailment in a differentiable manner. We briefly summarize the steps: (Step 1) A tensor that holds the relationships between clauses and facts is computed. (Step 2) Each clause is compiled into a differentiable function that performs forward reasoning using the tensor. (Step 3) A differentiable logic program is composed of the clause functions and their weights. T-time step inference is computed by amalgamating the inference results recursively.
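For reference, the discrete (non-differentiable) version of forward chaining over ground clauses can be sketched as below. The string encoding of atoms and the function name `forward_chain` are illustrative assumptions; the differentiable variant replaces the set membership tests with operations on continuous valuations.

```python
from typing import List, Set, Tuple

GroundClause = Tuple[str, Tuple[str, ...]]  # (head, body atoms), all ground


def forward_chain(clauses: List[GroundClause], facts: Set[str], steps: int) -> Set[str]:
    """Apply all clauses repeatedly, adding every head whose body already holds."""
    derived = set(facts)
    for _ in range(steps):
        new = {head for head, body in clauses if all(b in derived for b in body)}
        if new <= derived:
            break  # fixpoint reached, nothing new can be deduced
        derived |= new
    return derived


# Ground instances of pos(X) :- in(O1,X), color(O1,red) for two objects.
clauses = [
    ("pos(img)", ("in(obj1,img)", "color(obj1,red)")),
    ("pos(img)", ("in(obj2,img)", "color(obj2,red)")),
]
facts = {"in(obj1,img)", "in(obj2,img)", "color(obj2,red)"}

entailed = forward_chain(clauses, facts, steps=2)
print("pos(img)" in entailed)  # True
```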

3.4.1 Tensor encoding

Following Shindo et al. (2021), we build a tensor holding relationships between clauses \({\mathcal{C}}\) and facts \({\mathcal{G}}\). We assume that \({\mathcal{C}}\) and \({\mathcal{G}}\) are ordered sets, i.e., every element has its own index. Let L be the maximum body length in \({\mathcal{C}}\), S be the maximum number of substitutions for existentially quantified variables in clauses \({\mathcal{C}}\), \(C = \vert {\mathcal{C}}\vert\) and \(G = \vert {\mathcal{G}}\vert\). Index tensor \(\textbf{I} \in {\mathbb{N}}^{C \times G \times S \times L}\) contains the indices of the facts to compute forward inferences. Intuitively, \(\textbf{I}_{i,j,k,l}\) is the index of the l-th fact (subgoal) in the body of the i-th clause to derive the j-th fact with the k-th substitution for existentially quantified variables.

Example

Let \(R_0 = \mathtt{pos(X)} \,\hbox{:-}\, \mathtt{in(O1,X),color(O1,red)} \in {\mathcal{C}}\) and \(F_2 = \mathtt{pos(img)} \in {\mathcal{G}}\), and we assume that terms of objects are \(\{ \texttt{obj1, obj2} \}\). To compute the subgoals for fact \(F_2\) and clause \(R_0\), \(F_2\) and the head atom can be unified by substitution \(\theta = \{ \mathtt{X = img} \}\). By applying \(\theta\) to body atoms, we get clause \(\mathtt{pos(img)} \,\hbox{:-}\, \mathtt{in(O1,img),color(O1,red).}\), which has an existentially quantified variable \(\texttt{O1}\). By considering the possible substitutions for \(\texttt{O1}\), namely \(\mathtt{O1/obj1}\) and \(\mathtt{O1/obj2 }\), we have grounded clauses, as shown at the top of Table 1. The bottom rows of Table 1 show elements of tensor \(\textbf{I}_{0,:,0,:}\) and \(\textbf{I}_{0,:,1,:}\). Facts \({\mathcal{G}}\) and the indices are represented on the upper rows in the table. For example, \(\textbf{I}_{0,2,0,:} = [3,5]\) because \(R_0\) entails \(\mathtt{pos(img)}\) with substitution \(\theta = \{ \texttt{O1} = \texttt{obj1}\}\). Then the subgoal atoms are \(\{ \mathtt{in(obj1,img),color(obj1,red)}\}\), which have indices [3, 5], respectively. The atoms which have a different predicate, e.g., \(\mathtt{shape(obj1,square)}\), will never be entailed by clause \(R_0\). Therefore, the corresponding values are filled with 0, which represents the index of the false atom.

Table 1 Example of grounded clauses (top) and elements in the index tensor (bottom)
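The construction of the index tensor for this example can be sketched in Python. The fact ordering below is an assumption chosen to agree with the indices quoted in the text (0 for the false atom, 2 for `pos(img)`, 3 and 5 for the `obj1` subgoals); placing the true atom at index 1 and the `obj2` facts at indices 4 and 6 is a guess. The helper names are hypothetical, and variables are recognized by an initial uppercase letter.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate, arguments)

# Assumed ordering of facts; index 0 is the false atom.
FACTS: List[Atom] = [
    ("false", ()),               # 0
    ("true", ()),                # 1
    ("pos", ("img",)),           # 2
    ("in", ("obj1", "img")),     # 3
    ("in", ("obj2", "img")),     # 4
    ("color", ("obj1", "red")),  # 5
    ("color", ("obj2", "red")),  # 6
]
OBJECTS = ["obj1", "obj2"]
HEAD: Atom = ("pos", ("X",))
BODY: List[Atom] = [("in", ("O1", "X")), ("color", ("O1", "red"))]


def is_var(term: str) -> bool:
    return term[0].isupper()


def unify_head(head: Atom, fact: Atom) -> Optional[Dict[str, str]]:
    """Return a substitution making `head` equal to the ground `fact`, or None."""
    if head[0] != fact[0] or len(head[1]) != len(fact[1]):
        return None
    theta: Dict[str, str] = {}
    for h, f in zip(head[1], fact[1]):
        if is_var(h):
            if theta.setdefault(h, f) != f:
                return None  # same variable bound to two different constants
        elif h != f:
            return None
    return theta


def substitute(atom: Atom, theta: Dict[str, str]) -> Atom:
    return (atom[0], tuple(theta.get(t, t) for t in atom[1]))


def build_index(head: Atom, body: List[Atom], facts: List[Atom],
                objects: List[str]) -> List[List[List[int]]]:
    """Index tensor slice for one clause, as nested lists of shape G x S x L."""
    fact_index = {f: i for i, f in enumerate(facts)}
    body_vars = sorted({t for a in body for t in a[1] if is_var(t)})
    free_vars = [v for v in body_vars if v not in head[1]]  # existential variables
    subs = list(product(objects, repeat=len(free_vars)))
    index = []
    for fact in facts:
        theta = unify_head(head, fact)
        rows = []
        for sub in subs:
            if theta is None:
                rows.append([0] * len(body))  # clause cannot derive this fact
                continue
            full = {**theta, **dict(zip(free_vars, sub))}
            # Subgoals that are not in the fact list fall back to index 0 (false).
            rows.append([fact_index.get(substitute(a, full), 0) for a in body])
        index.append(rows)
    return index


I0 = build_index(HEAD, BODY, FACTS, OBJECTS)
print(I0[2])  # [[3, 5], [4, 6]]
```

Every row other than the one for `pos(img)` contains only zeros, matching the convention that facts a clause cannot derive point at the false atom.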

3.4.2 Valuation

The valuation vector \(\textbf{v}^{(t)} \in {\mathbb{R}}^{G}\) maps each fact into a continuous value at each time step t. Each value \(\textbf{v}^{(t)}_{i}\) represents the probability of fact \(F_i \in {\mathcal{G}}\). The differentiable inference is performed based on valuation vectors. To compute the T-step forward-chaining inference, we compute the sequence of valuation vectors \(\textbf{v}^{(0)}, \ldots , \textbf{v}^{(T)}\). We denote a batch of valuation vectors at time step t as \(\textbf{V}^{(t)} \in {\mathbb{R}}^{B \times G}\), where B is the batch size. In logical reasoning, the parallelized batch computation is non-trivial. Thus, we explicitly denote the dimension of the batch in this section.

3.4.3 Clause function

Each clause \(R_i\in {\mathcal{C}}\) is compiled into a clause function. The clause function takes a valuation vector \(\textbf{V}^{(t)}\), and returns a valuation vector \(\textbf{C}_i^{(t)} \in {\mathbb{R}}^{B \times G}\), which is the result of 1-step forward reasoning using \(R_i\) and \(\textbf{V}^{(t)}\). The clause function is computed as follows. Let \(\textbf{I} \in {\mathbb{N}}^{C \times G \times S \times L}\) be an index tensor. First, tensor \(\textbf{I}_i \in {\mathbb{N}}^{G \times S \times L}\) is extended for batches, i.e., \(\tilde{\textbf{I}}_i \in {\mathbb{N}}^{B \times G \times S \times L}\), and \(\textbf{V}^{(t)} \in {\mathbb{R}}^{B \times G}\) is extended to the same shape, i.e., \(\tilde{\textbf{V}}^{(t)} \in {\mathbb{R}}^{B \times G \times S \times L}\). Using these tensors, the clause function is computed as:

$$\begin{aligned} \textbf{C}_i^{(t)} = \textit{softor}^{\gamma }_2 (\textit{prod}_3 ( \textit{gather}_1 (\tilde{\textbf{V}}^{(t)}, \tilde{\textbf{I}}_i))), \end{aligned}$$
(1)

where \(\textit{gather}_1(\textbf{X}, \textbf{Y})_{i,j,k,l} =\textbf{X}_{i,\textbf{Y}_{i,j,k,l},k,l}\), and \(\textit{prod}_3\) returns the product along dimension 3. \(\textit{softor}^\gamma _d\) is a function for taking logical or softly along dimension d:

$$\begin{aligned} \textit{softor}^\gamma _d (\textbf{X}) = \frac{1}{S} \gamma \log \left( \textit{sum}_d \exp \left( \textbf{X}/\gamma \right) \right) , \end{aligned}$$
(2)

where \(\gamma > 0\) is a smoothing parameter, \(\textit{sum}_d\) is the sum function along dimension d, and \(S = \textit{max}(1.0, \textit{max} \left( \gamma \log \textit{sum}_d \exp \left( \textbf{X} / \gamma \right) \right))\). The normalization term S ensures that the function returns normalized probabilistic values. More details of the function are in Appendix D. In Eq. 1, applying the \(\textit{softor}^\gamma _2\) function corresponds to considering all possible substitutions for existentially quantified variables in the body atoms of the clause and taking logical or softly over the results of possible substitutions. The results from each clause are stacked into tensor \(\textbf{C}^{(t)} \in {\mathbb{R}}^{C \times B \times G}\), i.e., \(\textbf{C}^{(t)} =\textit{stack}_0 (\textbf{C}^{(t)}_1, \ldots , \textbf{C}^{(t)}_C)\), where \(\textit{stack}_0\) is the stack function for tensors along dimension 0.
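Eqs. 1 and 2 can be traced on a single example with plain Python lists. The sketch below drops the batch dimension, uses a numerically stable form of the log-sum-exp, and hard-codes an index tensor and valuation vector whose fact ordering is an assumption (index 0 false, 1 true, 2 `pos(img)`, 3 and 4 the `in` facts, 5 and 6 the `color` facts). It is an illustration of the computation, not the authors' batched tensor implementation.

```python
import math
from typing import List


def scaled_logsumexp(xs: List[float], gamma: float) -> float:
    # gamma * log(sum(exp(x / gamma))), shifted by max(xs) for stability.
    m = max(xs)
    return m + gamma * math.log(sum(math.exp((x - m) / gamma) for x in xs))


def softor(rows: List[List[float]], gamma: float = 0.01) -> List[float]:
    """Soft logical or over each inner list, following Eq. 2."""
    raw = [scaled_logsumexp(r, gamma) for r in rows]
    s = max(1.0, max(raw))  # normalization term S
    return [x / s for x in raw]


def clause_function(v: List[float], index: List[List[List[int]]],
                    gamma: float = 0.01) -> List[float]:
    """One step of forward reasoning with one clause (Eq. 1, no batch dimension)."""
    body_probs = [
        # gather the subgoal valuations, then take the product over the body
        [math.prod(v[i] for i in subgoals) for subgoals in subs]
        for subs in index
    ]
    return softor(body_probs, gamma)  # soft or over the substitutions


ZERO = [[0, 0], [0, 0]]
I0 = [ZERO, ZERO, [[3, 5], [4, 6]], ZERO, ZERO, ZERO, ZERO]  # shape G x S x L
v0 = [0.0, 1.0, 0.0, 1.0, 1.0, 0.9, 0.1]                      # initial valuations

c0 = clause_function(v0, I0)
print(round(c0[2], 3))  # 0.9
```

Note that the soft or of all-zero inputs equals \(\gamma \log 2\) rather than exactly zero when there are two substitutions, so a small \(\gamma\) keeps this residue close to zero for facts the clause cannot derive.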

Figure 3 illustrates the clause function. A clause function computes the forward-chaining inference for a clause. The perception module and facts converter produce an initial valuation vector \(\textbf{V}^{(0)}\). For each grounded clause, the probability for the subgoals is extracted by the \(\textit{gather}\) function. Then the product for the body atoms is computed, then the logical \(\textit{or}\) is computed softly to amalgamate the results from different groundings of clauses.

Fig. 3

An illustration of the clause function for clause \(R_0 = \mathtt{pos(X)\hbox{:-}in(O1,X),color(O1,red). }\) with a schematic illustration of the forward reasoning (top-right). The perception module and facts converter produce initial valuation vector \(\textbf{V}^{(0)}\). For each grounded clause, the probability for the subgoals is extracted by the \(\textit{gather}\) function. Then the product for the body atoms is computed, then the logical \(\textit{or}\) is computed softly to amalgamate the results from different groundings of clauses. For simplicity, the first dimension for the batch is removed in the figure (Color figure online)

3.4.4 Soft (logic) program composition

In \(\alpha\)ILP, a logic program is represented smoothly as a weighted sum of the clause functions following Shindo et al. (2021). Intuitively, \(\alpha\)ILP has M distinct weights for each clause, i.e., \(\textbf{W} \in {\mathbb{R}}^{M \times C}\). By taking softmax of \(\textbf{W}\) along dimension 1, M clauses are softly chosen from C clauses. The weighted sum of clause functions is computed as follows. First, we take the softmax of the clause weights \(\textbf{W} \in {\mathbb{R}}^{M \times C}\): \(\textbf{W}^* = \textit{softmax}_1(\textbf{W})\) where \(\textit{softmax}_1\) is a softmax function over dimension 1. The clause weights \(\textbf{W}^* \in {\mathbb{R}}^{M \times C}\) and the output of the clause function \(\textbf{C}^{(t)} \in {\mathbb{R}}^{C \times B \times G}\) are expanded to the same shape \(\tilde{\textbf{W}}^*, \tilde{\textbf{C}}^{(t)} \in {\mathbb{R}}^{M \times C \times B \times G}\). Then we compute tensor \(\textbf{H}^{(t)} \in {\mathbb{R}}^{M \times B \times G}\): \(\textbf{H}^{(t)} = \textit{sum}_1(\tilde{\textbf{W}}^* \odot \tilde{\textbf{C}}^{(t)}),\) where \(\odot\) is element-wise multiplication, and \(\textit{sum}_1\) is a summation along dimension 1. Each value \(\textbf{H}^{(t)}_{i,j,k}\) represents the result for the k-th fact using the i-th clause weights for the j-th example in the batch. Finally, since a logic program is a set of clauses, we compute tensor \(\textbf{R}^{(t)} \in {\mathbb{R}}^{B \times G}\) as \(\textbf{R}^{(t)} = \textit{softor}^\gamma _0(\textbf{H}^{(t)})\), taking logical or softly over the M chosen clauses. To compute the multi-step reasoning, \(\textbf{V}^{(t+1)}\) is computed as: \(\textbf{V}^{(t+1)} = \textit{softor}^\gamma _{1} (\textit{stack}_1 (\textbf{V}^{(t)}, \textbf{R}^{(t)}))\). The reasoning process is illustrated in Fig. 4.
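One reasoning step of this composition can be sketched for a single example with plain Python lists. The batch dimension is dropped, the clause outputs and weights are made-up numbers, and the function name `infer_step` is hypothetical; the sketch only mirrors the sequence softmax, weighted sum, soft or over the M weight vectors, and amalgamation with the previous valuation.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def softor(rows: List[List[float]], gamma: float = 0.01) -> List[float]:
    """Soft logical or over each inner list, with the normalization term S."""
    raw = []
    for r in rows:
        m = max(r)
        raw.append(m + gamma * math.log(sum(math.exp((x - m) / gamma) for x in r)))
    s = max(1.0, max(raw))
    return [x / s for x in raw]


def infer_step(v: List[float], clause_out: List[List[float]],
               weights: List[List[float]], gamma: float = 0.01) -> List[float]:
    """Compute V^(t+1) from V^(t) for one example (no batch dimension)."""
    num_facts = len(v)
    w_star = [softmax(row) for row in weights]  # shape M x C
    # H[m][g]: weighted sum of the clause outputs under the m-th weight vector
    h = [
        [sum(w * c[g] for w, c in zip(w_row, clause_out)) for g in range(num_facts)]
        for w_row in w_star
    ]
    # R[g]: soft or over the M softly chosen clauses
    r = softor([[h_m[g] for h_m in h] for g in range(num_facts)], gamma)
    # Amalgamate the new results with the previous valuation
    return softor([[v[g], r[g]] for g in range(num_facts)], gamma)


v_t = [0.0, 0.1, 0.5]                 # V^(t), G = 3 facts
clause_out = [[0.0, 0.9, 0.0],        # C^(t), C = 2 clauses
              [0.0, 0.2, 0.0]]
weights = [[5.0, -5.0],               # W, M = 2 weight vectors
           [-5.0, 5.0]]

v_next = infer_step(v_t, clause_out, weights)
```

In this toy setting the first weight vector selects the first clause almost exclusively, so the second fact ends up close to 0.9, while the third fact keeps its previous valuation of about 0.5.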

Fig. 4

An illustration of differentiable forward-chaining reasoning. Each clause is compiled into a clause function. Each clause has M distinct weights. The input valuation vector \(\textbf{V}^{(t)}\) is fed to the clause function. By applying tensor operations, the forward-chaining reasoning is computed using weighted clauses. More details are in the main text. For simplicity, the dimension for the batch is removed in the figure

3.4.5 Prediction

We assume that language \({\mathcal{L}}\) has a constant that represents the input image and a predicate to compose an atom representing that the input is positive, e.g., \(\mathtt{pos(img)} \in {\mathcal{G}}\). For given visual input e, \(\alpha\)ILP simply extracts the value from the result of the forward reasoning to predict class label \(y \in \{0, 1 \}\) as follows:

$$\begin{aligned} p(y \,\vert \, e, {\mathcal{C}}, {\mathcal{B}}, {\mathcal{W}}, \Theta _{\textit{per}}, \Theta _{\textit{np}}) =\textbf{v}^{(T)}[I_{\mathcal{G}}(\mathtt{pos(img)})], \end{aligned}$$
(3)

where \({\mathcal{C}}\) is a set of clauses, \({\mathcal{B}}\) is background knowledge, \({\mathcal{W}}\) is a set of clause weights, \(\Theta _{\textit{per}}\) is a set of parameters for the visual-perception model, \(\Theta _{\textit{np}}\) is a set of parameters for neural predicates, \(I_{\mathcal{G}}(x)\) is a function that returns the index of x in \({\mathcal{G}}\), and \(\textbf{v}[i]\) is the i-th element of \(\textbf{v}\), i.e., \(\textbf{v}_i\). \(\alpha\)ILP accepts background knowledge as a set of facts and clauses.
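Equation (3) amounts to an index lookup in the final valuation vector. The short sketch below illustrates this with a hypothetical list of ground atoms and made-up valuation values; the 0.5 decision threshold is also an assumption, since the text only states that the value is read off as the class probability.

```python
import numpy as np

# Hypothetical set of ground atoms G for a scene with two objects.
ground_atoms = ["pos(img)", "in(obj1,img)", "in(obj2,img)",
                "color(obj1,red)", "color(obj2,red)"]
atom_index = {atom: i for i, atom in enumerate(ground_atoms)}  # plays the role of I_G

def predict_positive(v_T, target="pos(img)"):
    """Read the probability of the target atom from the final valuation v^(T)."""
    return float(v_T[atom_index[target]])

v_T = np.array([0.93, 0.99, 0.98, 0.95, 0.02])  # made-up result of T reasoning steps
p = predict_positive(v_T)
label = int(p >= 0.5)   # assumed threshold for turning p into a class label
```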

3.5 Program induction from visual scenes

\(\alpha\)ILP learns differentiable logic programs that describe complex visual scenes. We follow the differentiable ILP setting (Evans & Grefenstette, 2018; Shindo et al., 2021), where an ILP problem is formulated as an optimization problem that has the following general form:

$$\begin{aligned} \min _{\mathcal{W}} \textit{loss}({\mathcal{Q}}, {\mathcal{C}}, {\mathcal{W}}), \end{aligned}$$
(4)

where \({\mathcal{Q}}\) is an ILP problem, \({\mathcal{C}}\) is a set of candidates of clauses, \({\mathcal{W}}\) is a set of weights for clauses, and \(\textit{loss}\) is a loss function that returns a penalty when training constraints are violated. We note that we solve visual ILP problems, where each positive and negative example is an image containing several objects.

figure a

Algorithm 1 describes the learning process of \(\alpha\)ILP. (Line 1) The perception model is trained using perception dataset \({\mathcal{D}}_{\textit{perception}}\), which consists of pattern-free figures. The dataset is annotated for objects, e.g., class labels. Parameterized neural predicates can also be trained by visual input with a trained perception module or by scene graphs. (Lines 2–4) Finally, the logic program that describes the visual scene is learned by performing differentiable ILP, as illustrated in the top row in Fig. 1. The process mainly consists of two steps: (i) clause generation by top-k beam search and (ii) learning of clause weights by backpropagation. We now describe each step in detail.

3.5.1 Top-k beam search of clauses

Let \({\mathcal{Q}}\) be a visual ILP problem. \(\alpha\)ILP generates promising candidates of clauses using top-k beam search. Promising candidates of clauses for an ILP problem are those that entail a majority of positive examples but few negative examples. Figure 5 illustrates the clause generation steps from visual scenes. We start from given initial clauses and iteratively refine the top-k clauses based on the following evaluation score:

$$\begin{aligned} \textit{eval}(R, {\mathcal{Q}}) =\sum _{e \in {\mathcal{E}}^+} p(y \,\vert \, e, \{R\}, {\mathcal{B}}, \{ \textbf{1} \}, \Theta _{\textit{per}}, \Theta _{\textit{np}}) , \end{aligned}$$
(5)

where \({\mathcal{E}}^+ \in {\mathcal{Q}}\) is a set of positive examples and \(\textbf{1}\) is a \(1\times 1\) identity matrix. If clause R can entail the majority of positive examples combined with background knowledge, then clause R gets a high evaluation score. In each step, \(\alpha\)ILP evaluates clauses in parallel using a variant of the reasoning module designed for the evaluation of clauses, i.e., without loops in terms of clauses. The generation of new clauses is conducted using the downward refinement operator (Nienhuys-Cheng et al., 1997), which is a fundamental clause-generation tool in ILP. The downward refinement operator weakens (specializes) the clauses, i.e., the new clauses that are produced by the operator entail fewer examples with background knowledge than the original clause (Nienhuys-Cheng et al., 1997). Thus fewer negative examples are entailed by the newly generated clauses. Therefore we evaluate clauses only by positive examples and repeatedly generate clauses by the downward refinement operator to produce a search space that contains general and specific clauses. During the clause generation, \(\alpha\)ILP adopts mode declarations (Muggleton, 1995; Ray & Inoue, 2007) to manage the search space, i.e., the clauses that are inconsistent with mode declarations are pruned. The visual-perception module enables \(\alpha\)ILP to evaluate each clause using visual input. In this way, \(\alpha\)ILP utilizes symbolic learning techniques while dealing with complex visual scenes.
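The search loop described here (score by positive examples only, keep the top k, refine, repeat) can be illustrated with a deliberately simplified toy. In the sketch below, which is an assumption-laden illustration and not the authors' code, a clause is reduced to a tuple of propositional body atoms, each positive example is a dictionary of made-up atom probabilities standing in for the perception output, refinement adds one body atom, and the score of Eq. (5) is approximated by the product of the body-atom probabilities instead of a full forward-reasoning pass. Mode declarations are not modeled.

```python
from math import prod

def eval_clause(body, positives):
    """Toy version of Eq. (5): sum over positive examples of the soft truth of the body."""
    return sum(prod(e[a] for a in body) for e in positives)

def refine(body, atoms):
    """Toy downward refinement: specialize a clause by adding one body atom."""
    return [tuple(sorted(body + (a,))) for a in atoms if a not in body]

def beam_search(positives, atoms, k=2, depth=2):
    """Return every clause selected into the beam during the search."""
    beam, selected = [()], []            # start from the clause with an empty body
    for _ in range(depth):
        candidates = sorted({c for b in beam for c in refine(b, atoms)})
        candidates.sort(key=lambda c: eval_clause(c, positives), reverse=True)
        beam = candidates[:k]            # keep the top-k clauses
        selected.extend(c for c in beam if c not in selected)
    return selected

atoms = ["red", "triangle", "large"]     # hypothetical body atoms
positives = [                            # made-up per-image atom probabilities
    {"red": 0.90, "triangle": 0.8, "large": 0.1},
    {"red": 0.95, "triangle": 0.9, "large": 0.2},
    {"red": 0.85, "triangle": 0.7, "large": 0.9},
]
selected = beam_search(positives, atoms, k=2, depth=2)
```

Because each refinement multiplies in another probability, the score can only stay equal or decrease along a refinement path, which mirrors the observation that refined clauses entail fewer examples.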

Fig. 5

Clause generation by beam search using input images. In each step, clauses that explain a majority of positive examples are selected (red-dotted rectangles), and refined to generate new clauses. After expanding them for a certain depth, the set of all of the selected clauses in the process is returned to perform differentiable ILP (Color figure online)

3.5.2 Learning weights

\(\alpha\)ILP assigns weights for generated clauses. Clause weights are optimized by gradient descent. Let \({\mathcal{Q}} = ({\mathcal{E}}^+, {\mathcal{E}}^-, {\mathcal{B}}, {\mathcal{L}})\) be a visual ILP problem, \({\mathcal{C}}\) be a set of generated clauses, \({\mathcal{W}}\) be a set of clause weights, \(\Theta _{\textit{per}}\) be the parameters for the perception model, and \(\Theta _{\textit{np}}\) be the parameters for the neural predicates. We solve visual ILP problem \({\mathcal{Q}}\) by minimizing cross-entropy loss with respect to \({\mathcal{W}}\), defined as:

$$\begin{aligned} \textit{loss}&= -{\mathbb{E}}_{(e, y) \sim {\mathcal{Y}}} [ y \log p(y \,\vert \, e, {\mathcal{C}}, {\mathcal{B}}, {\mathcal{W}}, \Theta _{\textit{per}}, \Theta _{\textit{np}}) \\&\quad + (1-y) \log (1 - p(y \,\vert \, e, {\mathcal{C}}, {\mathcal{B}}, {\mathcal{W}}, \Theta _{\textit{per}}, \Theta _{\textit{np}}))], \end{aligned}$$
(6)

where \({\mathcal{Y}} = \{(e, 1) \,\vert \, e \in {\mathcal{E}}^+ \} \cup \{(e, 0) \,\vert \, e \in {\mathcal{E}}^- \}\), which is a set of tuples of an example and the label indicating positive or negative.
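To make the weight-learning step concrete, the following sketch minimizes the cross-entropy loss of Eq. (6) over clause weights in a heavily simplified setting. The assumptions are stated plainly: there is a single weight vector (M = 1), the prediction is the softmax-weighted sum of fixed per-clause outputs without soft or and without multi-step reasoning, the clause outputs and labels are made-up numbers, and the gradient is derived by hand for this toy model instead of being obtained by backpropagation through the reasoner. The learning rate and step count are illustrative as well.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy of Eq. (6), averaged over the examples."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def train_clause_weights(clause_out, y, lr=0.5, steps=200):
    """clause_out: (N, C) probability of pos(img) under each clause alone."""
    w = np.zeros(clause_out.shape[1])
    history = []
    for _ in range(steps):
        s = softmax(w)
        p = clause_out @ s                                    # (N,) soft prediction
        history.append(cross_entropy(p, y))
        dL_dp = -(y / p - (1.0 - y) / (1.0 - p)) / len(y)     # dLoss/dp per example
        # Chain rule through the softmax: dp_e/dw_j = s_j * (c_ej - p_e).
        grad = s * ((clause_out - p[:, None]).T @ dL_dp)      # (C,)
        w -= lr * grad
    return w, history

# Made-up outputs of three candidate clauses on four images.
clause_out = np.array([[0.90, 0.9, 0.1],    # positive image
                       [0.95, 0.9, 0.2],    # positive image
                       [0.05, 0.9, 0.1],    # negative image
                       [0.10, 0.9, 0.1]])   # negative image
y = np.array([1.0, 1.0, 0.0, 0.0])
w, history = train_clause_weights(clause_out, y)
best_clause = int(np.argmax(w))
```

In this toy, the first clause separates the classes, the second is too general, and the third is too specific, so gradient descent shifts the weight mass toward the first clause.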

4 Experimental evaluation

We empirically demonstrate the following desired properties of \(\alpha\)ILP on two different datasets: (i) \(\alpha\)ILP solves ILP problems in visual scenes with high accuracy. (ii) \(\alpha\)ILP can explain, i.e., it produces a readable solution in the form of logic programs. (iii) \(\alpha\)ILP is robust to confounding. (iv) \(\alpha\)ILP is data-efficient, unlike CNNs. (v) \(\alpha\)ILP performs fast inference.

All experiments were performed in the following environment: CPU: AMD EPYC 7742 64-Core Processor, RAM: 2000 GB, GPU: NVIDIA A100-SXM4-40GB GPU with 40 GB of RAM.

4.1 Solving Kandinsky patterns

4.1.1 Dataset

We adopted Kandinsky pattern datasets (Holzinger et al., 2019; Müller & Holzinger, 2021; Holzinger et al., 2021), a relatively new benchmark for object-centric reasoning tasks. Kandinsky-20k contains 10k training examples for each of the positive and negative classes. The validation and test splits each contain 5k examples per class. For the Kandinsky-2k dataset, we reduced the amount of training data by randomly sampling from the Kandinsky-20k dataset, so that the training split contains 1k examples per class. The validation and test splits are the same as in the Kandinsky-20k dataset.

We use 4 Kandinsky patterns: twopairs, closeby, red-triangle, and online-pair. Figure 6 shows a positive example for each pattern. For the clause generation step, we used 500 examples from the validation split for each dataset.

Fig. 6

Positive examples for each dataset in Kandinsky patterns. Each pattern is described as follows: (twopairs) The Kandinsky figure has two pairs of objects with the same shape. In one pair, the objects have the same color; in the other pair, they have different colors. The two pairs are always disjoint, i.e., they do not share objects. (closeby) The Kandinsky figure has a pair of objects that are close to each other. (red-triangle) The Kandinsky figure has a pair of objects that are close to each other. One object of the pair is a red triangle, and the other object has a different color and a different shape. (online-pair) The Kandinsky figure has five objects that are aligned on a line, and it contains at least one pair of objects that have the same shape and the same color (Color figure online)

4.1.2 Pre-training

For pre-training of the visual perception module, we randomly generated 15k Kandinsky figures with annotations about each object, i.e., the number of objects and the attributes of the objects are randomly determined. We used YOLO (Redmon et al., 2016) as a perception module. Each object has the class label and the bounding box as an annotation. The neural predicates \(\texttt{closeby}\) and \(\texttt{online}\) are each trained on 10k Kandinsky figures that represent the respective concept, e.g., figures that consist of two objects that are close to each other are generated as positive examples for \(\texttt{closeby}\).

4.1.3 Baselines

We adopted ResNet (He et al., 2016) as a CNN-based baseline and also compared against YOLO+MLP, where the input figure is fed to the pre-trained YOLO model, and a simple MLP module predicts the class label from the YOLO outputs. The whole network is jointly trained.

4.1.4 Results

Table 2 shows the results for the test split in each Kandinsky dataset. The CNN model overfits while training and thus performs poorly in every Kandinsky pattern. The YOLO+MLP model performs comparatively better and achieves greater than 90% accuracy in twopairs. However, in the relatively complex patterns of closeby, red-triangle, and online-pair, its performance degrades. In contrast, \(\alpha\)ILP outperforms the considered baselines significantly and achieves perfect classification in all of the patterns.

Table 2 The mean classification accuracy in the test split in the Kandinsky patterns dataset over 5 random seeds

In the smaller dataset, Kandinsky-2k, the performance of the neural baselines drops because there is too little training data for them to generalize. In contrast, \(\alpha\)ILP still achieves perfect accuracy. This shows the data efficiency of \(\alpha\)ILP.

Figure 7 shows the classification rules discovered by \(\alpha\)ILP, which are obtained by taking argmax of the rule weights. \(\alpha\)ILP successfully produced interpretable results in all of the datasets. After the training step, we observed that the distribution of the weights over clauses became sharp, i.e., one clause gets a weight of nearly 1.0 and the others get almost 0.0.

Fig. 7

The classification rules discovered by \(\alpha\)ILP, which are obtained by taking argmax of the rule weights. The top four lines show the classification rules for the four patterns in Kandinsky patterns dataset, respectively. The bottom three lines show the classification rules for the three classes in CLEVR-Hans3 dataset, respectively

Figure 8 shows the accuracy for the test split in the first 600 iterations in the Kandinsky-20k dataset. To have a fair comparison, we used the same learning rate \(1\hbox{e}{-}2\) for \(\alpha\)ILP and the YOLO+MLP model and \(1\hbox{e}{-}5\) for the CNN baseline (Footnote 2) to plot the figure. The line represents the mean, and the shadow represents the standard deviation of 5 trials, respectively. One iteration corresponds to one weight update using a batch of training examples. The result shows that \(\alpha\)ILP achieves high classification accuracy with fewer iterations compared to neural-based baselines.

Fig. 8

Accuracy for iterations of weight updates in twopairs and online-pair dataset in Kandinsky-20k. The line represents the mean, and the shadow represents the standard deviation of 5 trials, respectively. One iteration corresponds to one weight update using a batch of training examples. \(\alpha\)ILP achieves high classification accuracy with fewer iterations compared to neural-based baselines

4.2 Solving CLEVR-Hans problems

4.2.1 Dataset

The CLEVR-Hans dataset (Stammer et al., 2021) contains confounded CLEVR (Johnson et al., 2017) images, and each image is associated with a class label. We adopted the CLEVR-Hans3 dataset, which has three classes, as shown in Fig. 9. Each class has a corresponding classification rule. For each class, we create an ILP problem, where positive examples are the images that belong to the class, and negative examples are images that belong to the other classes. As a result, we have three classification problems: class1, class2, and class3. CLEVR-Hans problems involve confounding data. For example, in the training and validation split of class1, the large cube in the positive examples is always gray, but in the test split, it has different colors. To achieve good performance in the test split, the model needs to learn the exact classification rule without overfitting to the confounder. For the clause generation step, we used 500 examples from the validation split for each dataset.

Fig. 9

Examples in CLEVR-Hans3 dataset. The dataset consists of three classes. (class 1) Each figure contains a large cube and a large cylinder. In the training and validation split, the large cube is always gray. (class 2) Each figure has a small metal cube and a small sphere. In the training and validation split, the small sphere is always made of metal. (class 3) Each figure contains a large blue sphere and a small yellow sphere. (Color figure online)

4.2.2 Pre-training

The slot attention model was pre-trained following Locatello et al. (2020) using the set prediction setting on the CLEVR (Johnson et al., 2017) dataset.

4.2.3 Baselines

The considered baselines are the ResNet-based CNN model (He et al., 2016), and the Neuro-Symbolic (NeSy) model (Stammer et al., 2021). The NeSy model has a visual perception module based on slot attention (Locatello et al., 2020) and a reasoning module based on Set Transformer (Lee et al., 2019). The NeSy model was trained in two different settings: (1) training using classification rules (NeSy), and (2) the right for the right reasons (Ross et al., 2017) setting, i.e., the model is trained using supervision about confounding factors (NeSy-XIL). NeSy-XIL is the SOTA model in the CLEVR-Hans dataset.

4.2.4 Results

Table 3 shows the classification accuracy in the CLEVR-Hans dataset. The results of the baselines were presented in Stammer et al. (2021). \(\alpha\)ILP achieved more than 97% in each split. Note that the NeSy-XIL model exploits supervision of the confounding factors. In contrast, \(\alpha\)ILP is unsupervised in terms of confounding factors. This shows that \(\alpha\)ILP is robust to confounding. \(\alpha\)ILP can control the complexity of the solution (logic programs) by controlling the depth of top-k beam search, i.e., \(\alpha\)ILP can prevent overfitting by using a proper depth of top-k beam search, which can be determined on the validation split by starting from a small depth and increasing it. Figure 7 shows the classification rules discovered by \(\alpha\)ILP, which are obtained by taking argmax of the rule weights. After the training step, we observed that the distribution of the weights over clauses became sharp, i.e., one clause gets a weight of nearly 1.0, and the others get almost 0.0.

Table 3 The mean classification accuracy for CLEVR-Hans3 dataset compared to baselines over 5 random seeds

4.3 Ablation study

We analyze the efficiency of \(\alpha\)ILP for the clause generation step, the weight learning step, and the reasoning step, respectively. Finally, we discuss limitations of \(\alpha\)ILP.

4.3.1 Running time and number of clauses in clause generation

We analyze the clause generation step of \(\alpha\)ILP in terms of the running time and the number of clauses to be generated. We used 500 examples from the validation split for each dataset. The size of the search beam is 20. The depth of the search is [5, 2, 6, 4] for twopairs, closeby, red-triangle, and online-pair in Kandinsky Patterns, respectively, and [5, 6, 7] for class1, class2, and class3 in CLEVR-Hans, respectively. Table 4 shows the running time of top-k beam search and the number of generated clauses in \(\alpha\)ILP. The clause generation step takes about 58 s in the best case and about 1000 s in the worst case. \(\alpha\)ILP searched a space that contains several thousands of candidate clauses and successfully found promising clauses from visual scenes. This empirically shows that \(\alpha\)ILP performed an efficient search for clauses, which is necessary to solve the visual ILP tasks of complex visual scenes.

Table 4 Running time of top-k beam search and number of generated clauses in \(\alpha\)ILP using 500 examples from the validation split in each dataset

4.3.2 Running time of weight learning

We compare the running time of weight learning of \(\alpha\)ILP with neural baselines. Table 5 shows the running time of weight learning for one epoch in Kandinsky patterns datasets. \(\alpha\)ILP achieved comparably fast weight-learning iterations for problems with simple rules, e.g., closeby. For the problems that have a complex search space, e.g., red-triangle, \(\alpha\)ILP takes longer to compute the gradient and update the clause weights. This is because, as shown in Table 4, \(\alpha\)ILP deals with a large number of clauses for difficult problems, which require a deeper search of clauses. We note that, as shown in Fig. 8, \(\alpha\)ILP can achieve high accuracy with fewer iterations compared to neural-based baselines.

Table 5 The running time (sec) of weight learning per epoch in \(\alpha\)ILP and baseline models. \(\alpha\)ILP has a reasoning process in the forward pass, and thus it takes longer than the baselines per epoch

4.3.3 Running time of reasoning

We show that \(\alpha\)ILP can perform fast inference by parallelized GPU-based batch computation. Figure 10 shows the prediction time with different batch sizes in Kandinsky datasets. We measured the inference time for all training examples in twopairs, closeby, and online-pair datasets for Kandinsky patterns, and class1 and class3 datasets for CLEVR-Hans. We used different batch sizes of [1, 8, 24, 64, 128, 256, 512]. We extracted the learned clause after the training of \(\alpha\)ILP, and fed it to the forward-reasoning module, i.e., the reasoning module handles one clause for each dataset, enabling a large batch size on a single GPU.

Fig. 10

Prediction time of \(\alpha\)ILP on the complete dataset. We report the running time to predict all of the training examples for each dataset using the learned rules. The bottom labels specify the datasets. For each dataset, different colors correspond to different batch sizes. We used batch sizes of 1, 8, 24, 64, 128, 256 and 512. Kandinsky patterns have 20k training samples for each pattern, and CLEVR-Hans has 9k training samples for each dataset. \(\alpha\)ILP predicts quickly for large datasets by parallelized batch computation

For each dataset, \(\alpha\)ILP predicted faster with larger batch sizes. In the twopairs dataset, with a batch size of 1, it takes 2380 s to classify all of the 20k training visual examples, whereas with a batch size of 512, \(\alpha\)ILP classified them in 31 s, a roughly 77-fold speedup. This empirical result shows that \(\alpha\)ILP can perform fast reasoning using batch computation, which is essential for being tightly coupled with deep neural networks.

4.3.4 Limitations

We discuss limitations of \(\alpha\)ILP. The approach is memory-intensive because the size of the index tensor is not linear with respect to the number of facts and that of clauses. In contrast to backward reasoning, all of the possible solutions are computed in forward-chaining reasoning. To handle large knowledge bases, a more memory-efficient mechanism is necessary. In the experiments, \(\alpha\)ILP assumed that the perception model is pre-trained. It also assumed that there exists a set of rules that can perfectly classify the examples in the search space. \(\alpha\)ILP requires the language bias to limit the search space, e.g., mode declarations. The learning algorithm is a hybrid approach of top-k beam-search and gradient descent. Before performing numerical optimization, the search space, i.e., the set of candidates of rules, needs to be identified by the top-k beam search.

5 Conclusion and future work

We proposed \(\alpha\)ILP, a novel differentiable ILP framework for visual scenes. \(\alpha\)ILP learns logic programs that explain complex visual scenes from visual inputs based on object-centric perception and differentiable ILP. In our experiments, \(\alpha\)ILP outperformed CNN-based baselines in Kandinsky patterns and CLEVR-Hans datasets, where the classification rules are defined on high-level concepts. \(\alpha\)ILP provides the following advantages over CNN-based models. Firstly, \(\alpha\)ILP solves complex patterns in visual scenes such as Kandinsky patterns and CLEVR-Hans datasets, which cannot be solved by CNN-based models. Secondly, \(\alpha\)ILP produces explicit classification rules as a logic program. Thirdly, \(\alpha\)ILP is data-efficient, i.e., it can achieve high performance even from a small training set. Lastly, \(\alpha\)ILP is robust to confounding, i.e., it generalizes even if some features are confounded in the training dataset. These advantages highlight that \(\alpha\)ILP can overcome some significant limitations of neural models for complex visual scenes. Moreover, \(\alpha\)ILP performs fast differentiable inference for a large number of instances of complex visual scenes. This feature is critical for logical reasoning to be tightly coupled with neural networks. In this sense, \(\alpha\)ILP is an extension of Neuro-Symbolic systems, e.g., DeepProblog (Manhaeve et al., 2018, 2021) and \(\partial\)ILP (Evans & Grefenstette, 2018), for structure learning in visual domains.

A common criticism of ILP also applies to \(\alpha\)ILP, e.g., hand-crafted background knowledge and language bias are crucial. A promising direction of future research is to develop a neuro-symbolic pipeline to generate proper background knowledge and language bias from data by incorporating ILP techniques such as predicate invention (Cropper et al., 2022). Solving compositional reasoning tasks (Vedantam et al., 2021) will be another direction of future research. The algorithmic supervision setting (Petersen et al., 2021), where neural networks are trained combined with differentiable implementations of discrete algorithms, is also a promising approach with \(\alpha\)ILP, because the reasoning of \(\alpha\)ILP is compatible with neural networks in terms of the running time. Moreover, differentiable implementations of the top-k operator (Goyal et al., 2018; Xie et al., 2020; Pietruszka et al., 2021) could turn \(\alpha\)ILP into an end-to-end learning system.