𝛼ILP: thinking visual scenes as differentiable logic programs

Deep neural learning has shown remarkable performance at learning representations for visual object categorization. However, deep neural networks such as CNNs do not explicitly encode objects and relations among them. This limits their success on tasks that require a deep logical understanding of visual scenes, such as Kandinsky patterns and Bongard problems. To overcome these limitations, we introduce 𝛼ILP, a novel differentiable inductive logic programming framework that learns to represent scenes as logic programs: intuitively, logical atoms correspond to objects, attributes, and relations, and clauses encode high-level scene information. 𝛼ILP has an end-to-end reasoning architecture from visual inputs. Using it, 𝛼ILP performs differentiable inductive logic programming on complex visual scenes, i.e., the logical rules are learned by gradient descent. Our extensive experiments on Kandinsky patterns and CLEVR-Hans benchmarks demonstrate the accuracy and efficiency of 𝛼ILP in learning complex visual-logical concepts.


Introduction
Understanding visual scenes is a fundamental problem in building an intelligent agent. Deep Neural Networks such as Convolutional Neural Networks (CNNs) have succeeded in many visual-perception benchmarks but produce poor performance in complex visual scenes, where several objects appear in an image, and the agent needs to reason and learn about the attributes and relations. CNN-based models do not explicitly encode objects and relations, and thus often fail to capture the patterns defined in complex visual scenes.
Kandinsky patterns (Holzinger et al., 2021) have been proposed to assess the ability of intelligent systems to explain complex visual scenes. In a similar vein, CLEVR-Hans (Stammer et al., 2021) has been proposed to assess the ability of a model to understand confounded visual scenes. CNN-based models cannot produce proper explanations in such cases and can also suffer from the problem of confounding factors. Moreover, they are data-hungry and struggle to learn abstract visual relations (Kim et al., 2018). A natural question thus arises: How can we build an intelligent system avoiding these pitfalls? To build a system overcoming the shortcomings of CNN-based models, Neuro-Symbolic approaches (Besold et al., 2017; d'Avila Garcez & Lamb, 2020; Tsamoura et al., 2021) have emerged, where symbolic computations are integrated with neural networks. Many logic-based neuro-symbolic frameworks have been proposed, e.g., DeepProblog (Manhaeve et al., 2018, 2021), NeurASP (Yang et al., 2020), and ∂ILP (Evans & Grefenstette, 2018). However, previous studies are either not capable of complete structure learning from visual input (Manhaeve et al., 2018, 2021; Yang et al., 2020) or not capable of handling complex rules and visual scenes (Evans & Grefenstette, 2018). Therefore, structure learning on complex scenes such as Kandinsky patterns (Holzinger et al., 2021) and CLEVR-Hans (Stammer et al., 2021) problems is difficult, if not impossible, using these frameworks.
To mitigate this issue, we propose 𝛼ILP, a novel differentiable Inductive Logic Programming (ILP) framework that combines object-centric perception with ILP (Muggleton, 1991, 1995; Nienhuys-Cheng et al., 1997; Cropper et al., 2022), establishing one of the first systems of the fourth type of neuro-symbolic system, i.e., Neuro:Symbolic→Neuro, as proposed by Kautz (2022). 𝛼ILP maps the output of neural networks (Neuro) to symbolic representations (Symbolic); gradient-based learning is then performed on top of them (Neuro). 𝛼ILP performs structure learning, i.e., it learns discrete logic programs from complex visual scenes. To this end, our system is an extension of Neuro-Symbolic systems such as DeepProblog (Manhaeve et al., 2018, 2021) and ∂ILP (Evans & Grefenstette, 2018).
𝛼ILP has an end-to-end reasoning architecture from visual input, which consists of three main components: (i) visual perception module, (ii) facts converter, and (iii) differentiable reasoning module. The facts converter converts the output of the visual perception module into the form of probabilistic facts, which can be fed into the reasoning module. Then, the reasoning module performs differentiable forward-chaining inference from a given set of facts. It computes the set of facts that can be deduced from the given set of facts and weighted logical rules (Evans & Grefenstette, 2018; Shindo et al., 2021). The final prediction can be made based on the result of the forward-chaining inference. 𝛼ILP learns logic programs that encode high-level scene information by differentiable ILP techniques (Shindo et al., 2021). It generates candidate clauses by top-k beam search and learns the weights for the clauses by backpropagation.
Overall, we make a number of key contributions: (1) We propose 𝛼ILP, a novel framework that performs differentiable ILP from visual scenes. (2) To establish 𝛼ILP, we propose an end-to-end reasoning architecture from visual inputs. It performs differentiable forward-chaining inference for visual scenes by using perception models and a facts-converting algorithm. (3) We also propose a learning scheme for 𝛼ILP to perform differentiable ILP for complex visual scenes. It integrates differentiable ILP techniques with the visual domain, i.e., it generates clauses efficiently and performs gradient-based optimization from complex visual scenes. (4) We empirically show the following advantages of 𝛼ILP: (i) 𝛼ILP solves ILP problems in visual scenes, i.e., Kandinsky patterns (Holzinger et al., 2021) and CLEVR-Hans (Stammer et al., 2021), with high accuracy, outperforming neural baseline models. (ii) 𝛼ILP can generate explanations, i.e., it produces a readable solution in the form of logic programs. (iii) 𝛼ILP is robust to confounding, i.e., it avoids overfitting to confounding factors. (iv) 𝛼ILP is data-efficient, i.e., it reports no performance drop even when using 10% of the training data. (v) 𝛼ILP can perform fast inference. It supports efficient parallelized batch computation on GPUs; therefore, it can quickly classify a large number of instances in a large dataset.

Background and related work
We use bold lowercase letters v, w, … for vectors. We use bold capital letters X, … for tensors. We use calligraphic letters C, A, … for (ordered) sets and typewriter font for terms and predicates in logical expressions.

Preliminaries on logic and ILP
Language L is a tuple (P, A, F, V), where P is a set of predicates, A is a set of constants, F is a set of function symbols, and V is a set of variables. A term is a constant, a variable, or an expression f(t_1, …, t_n), where f is an n-ary function symbol and t_1, …, t_n are terms. We denote an n-ary predicate p by p/(n, [dt_1, …, dt_n]), where dt_i is the datatype of the i-th argument. An atom is a formula p(t_1, …, t_n), where p is an n-ary predicate symbol and t_1, …, t_n are terms. A ground atom or simply a fact is an atom with no variables. A literal is an atom or its negation. A positive literal is just an atom. A negative literal is the negation of an atom. A clause is a finite disjunction (∨) of literals. A definite clause is a clause with exactly one positive literal. If A, B_1, …, B_n are atoms, then A ∨ ¬B_1 ∨ ⋯ ∨ ¬B_n is a definite clause. We write definite clauses in the form A :- B_1, …, B_n. Atom A is called the head, and the set of negative atoms {B_1, …, B_n} is called the body. We denote the special constant true as ⊤ and false as ⊥. An atom is an atomic formula. For formulas F and G, ¬F, F ∧ G, and F ∨ G are also formulas. An interpretation of language L is a tuple (D, I_A, I_F, I_P), where D is the domain, I_A is the assignment of an element in D to each constant a ∈ A, I_F is the assignment of a function from D^n to D to each n-ary function symbol f ∈ F, and I_P is the assignment of a function from D^n to {⊤, ⊥} to each n-ary predicate p ∈ P. For language L and formula F, an interpretation I is a model if the truth value of F w.r.t. I is true. Formula F is a logical consequence or logical entailment of a set of formulas S, denoted S ⊧ F, if, for every interpretation I of L, I being a model for S implies that I is a model for F.
An ILP problem Q is a tuple (E+, E−, B, L), where E+ is a set of positive examples, E− is a set of negative examples, B is background knowledge, and L is a language. Background knowledge can be given in the form of a set of facts or clauses. The solution to an ILP problem is a set of definite clauses H ⊆ L that satisfies the following conditions: (i) H ∪ B ⊧ A for every A ∈ E+, and (ii) H ∪ B ⊭ A for every A ∈ E−.

Related work towards visual ILP
Over 50 years ago, M. M. Bongard, a Russian computer scientist, invented a collection of one hundred human-designed visual recognition tasks (Bongard & Hawkins, 1970), now named the Bongard Problems (BPs), to demonstrate the gap between high-level human cognition and computerized pattern recognition. Inspired by BPs, the Bongard-LOGO (Nie et al., 2020) problem has been proposed as a benchmark for the machine learning community. Kandinsky patterns (Holzinger et al., 2021) have been proposed to assess the ability of intelligent systems to explain complex visual scenes. In a similar vein, CLEVR-Hans (Stammer et al., 2021) has been proposed to assess the ability of a model to understand confounded visual scenes. These benchmarks present a challenge to CNN-based recognition models.
Logic, both propositional and first-order, is an established framework for performing reasoning on machines (Lloyd, 1984; Kowalski, 1988). A pioneering study of inductive inference on logic was done in the early 1970s (Plotkin, 1971). The Model Inference System (MIS) (Shapiro, 1983) has been implemented as an efficient search algorithm for logic programs. Inductive Logic Programming (Muggleton, 1991, 1995; Nienhuys-Cheng et al., 1997; Cropper et al., 2022) has emerged at the intersection of machine learning and logic programming. Many ILP frameworks have been developed, e.g., FOIL (Quinlan, 1990), Progol (Muggleton, 1995), ILASP (Law et al., 2014), Metagol (Cropper & Muggleton, 2016; Cropper et al., 2019), and Popper (Cropper & Morel, 2021). Symbolic ILP systems are dedicated to symbolic inputs. 𝛼ILP deals with visual inputs by having an end-to-end neuro-symbolic reasoning architecture. 𝛼ILP employs structure-learning techniques similar to those developed for probabilistic logic programs (Bellodi & Riguzzi, 2015; Nguembang Fadja & Riguzzi, 2019) and performs learning on complex visual scenes. Different settings of probabilistic ILP approaches have been introduced in De Raedt et al. (2008). 𝛼ILP is based on the learning from entailment setting, where the logical entailment is computed from probabilistic inputs. 𝛼ILP computes the logical entailment with probabilistic values for facts and clauses in a differentiable manner.
The integration of symbolic programs and neural networks, which is called Neuro-Symbolic computation (Besold et al., 2017; d'Avila Garcez & Lamb, 2020; Tsamoura et al., 2021), has previously been addressed, e.g., DeepProblog (Manhaeve et al., 2018, 2021), NeurASP (Yang et al., 2020), ∂ILP (Evans & Grefenstette, 2018; Jiang & Luo, 2019), NS-CL (Mao et al., 2019), integration with abductive learning (Dai et al., 2019), and differentiable theorem provers (Minervini et al., 2020). Kandinsky patterns and CLEVR-Hans cannot be solved easily by these frameworks because they require complete structure learning from complex visual scenes. DeepProblog supports structure learning but is limited to the sketching setting (Solar-Lezama, 2008; Bošnjak et al., 2017). 𝛼ILP supports object-centric perception models, differentiable forward reasoning, and efficient clause search for solving tasks in complex visual scenes. Some neuro-symbolic models have been developed for Visual Question Answering (VQA) (Antol et al., 2015; Johnson et al., 2017; Santoro et al., 2018; Mao et al., 2019; Amizadeh et al., 2020). In VQA-based models, the symbolic programs are determined by the natural language sentences that represent questions, but 𝛼ILP does not make that assumption. Moreover, 𝛼ILP stands in the line of probabilistic logic programming (De Raedt et al., 2016; Raedt et al., 2020). Therefore, 𝛼ILP can employ methods for probabilistic logic programming that have been developed in the community. Similar concepts to some key components of 𝛼ILP have been investigated in previous studies, e.g., Neural Predicates (Diligenti et al., 2017; Donadello et al., 2017; Badreddine et al., 2022), weighted forward-chaining reasoning (Sourek et al., 2018; Si et al., 2019), and differentiable structure learning (Evans & Grefenstette, 2018; Sourek et al., 2017). 𝛼ILP is the first to integrate these concepts for the visual object-centric domain as a consistent framework.
Logic Tensor Networks (LTNs) (Badreddine et al., 2022) provide a unified differentiable language for first-order logic. LTNs map each term in first-order logic to numerical representations in place of interpretation. Then predicates are grounded to functions that take numerical representations of terms and return a truth value in [0, 1]. 𝛼ILP takes a similar approach to connect the sub-symbolic and symbolic representations.
Object-centric learning is an approach to decomposing an input image into representations in terms of objects (Dittadi et al., 2022). This problem has been widely addressed in the computer vision community. Another approach is the unsupervised approach (Burgess et al., 2019; Engelcke et al., 2020; Locatello et al., 2020), where the models acquire the ability of object perception with few or no annotations. 𝛼ILP uses these object-centric models as a perception module.
Differentiable solvers for dynamic programming problems have been developed (Cuturi & Blondel, 2017; Mensch & Blondel, 2018). 𝛼ILP adopts some of these techniques to achieve differentiable implementations of the discrete operations for first-order logic. Various types of differentiable logical operations have also been investigated (van Krieken et al., 2022; Sen et al., 2022).

𝛼ILP
We now introduce 𝛼ILP in the following steps. First, we give an overview of the problem setting and the framework. Second, we explain the reasoning architecture of 𝛼ILP, consisting of (i) the visual-perception module, (ii) the facts converter, an algorithm to convert object-centric representations into probabilistic facts, and (iii) the differentiable forward-reasoning mechanism. Finally, we describe the learning strategy of 𝛼ILP to perform differentiable ILP on visual scenes.
What is visual ILP? We address the ILP problem in visual scenes, which we call the visual ILP problem, where each example is given as an image containing several objects. The classification pattern is defined on high-level concepts such as attributes and relations of objects. Figure 1 illustrates an overview of 𝛼ILP, which consists of a Reasoning module and a Learning module. We now introduce these in detail.

Reasoning
𝛼ILP has an end-to-end reasoning architecture, which works as follows: (i) The raw input images are factorized in terms of objects using the visual-perception model. (ii) The object-centric representation is converted into a set of probabilistic facts. (iii) The differentiable forward reasoning is performed using weighted clauses. The bottom row of Fig. 1 illustrates the reasoning architecture in 𝛼ILP.

Learning
𝛼ILP learns logic programs from visual inputs by performing differentiable ILP, i.e., we provide positive examples, negative examples, and background knowledge. Each example is given as a visual scene. The top row of Fig. 1 illustrates the learning pipeline in 𝛼ILP. Learning with 𝛼ILP proceeds as follows: (Step 1) A set of candidate clauses is generated by top-k beam search. The search is conducted from examples of visual scenes using the end-to-end reasoning architecture. (Step 2) The weights for the generated clauses are trained to minimize the loss function. By using the end-to-end reasoning architecture, 𝛼ILP finds a logic program that explains the complex visual scenes by gradient descent. We now describe our architecture in detail.

Visual perception
We make the minimal assumption that the perception network takes an image and returns a set of object-centric vectors, where each dimension represents an attribute of the object, e.g., colors, shapes, and positions. Thus, any type of neural network that segments the input image into the individual objects present in it can be utilized. For example, 𝛼ILP can employ a slot attention model (Locatello et al., 2020) for 3D scenes. For natural images, 𝛼ILP can employ other established object-detection models such as YOLO (Redmon et al., 2016), Faster-RCNN (Ren et al., 2015), and Mask-RCNN (He et al., 2017). The visual-perception module is trained on randomly generated figures with annotations about each object, i.e., the number of objects and the attributes of the objects are randomly determined.

Facts converter: lifting to symbolic representation
After the object-centric perception, 𝛼ILP generates a logical representation, i.e., a set of probabilistic facts. We propose a new type of predicate that can refer to differentiable functions to compute the probability. We also present an algorithm to convert the perception result into probabilistic facts.

Neural predicate
To build a bridge between the sub-symbolic and symbolic representations, we provide a new type of predicate, which we term neural predicates. A neural predicate is associated with a differentiable function, which we call a valuation function, that produces the probability of the grounded facts.

Definition 1
A neural predicate p/(n, [dt_1, …, dt_n]) is an n-ary predicate associated with a valuation function v_p : ℝ^{d_1×⋯×d_n} → ℝ, where dt_i is the datatype of the i-th argument and d_i ∈ ℕ is the dimension of the vector representation of the term whose datatype is dt_i.
Intuitively, we give the first-order logic interpretation for neural predicates and terms as follows: (i) each neural predicate is assigned to a function in a vector space, (ii) each term in the arguments of neural predicates is assigned to a vector. The vector can be an output of neural networks, or an encoding of the term, e.g., one-hot encoding of the attributes.

Facts-converting algorithm
The facts converter produces a set of probabilistic facts from the output of the perception module. Let G be the set of all facts; then the conversion proceeds as follows. For each fact p(t_1, …, t_n) ∈ G, if it consists of a neural predicate, then the corresponding valuation function v_p is called to compute the probability of the fact. Otherwise, zero is given as the probability of the fact. If the fact is in the background knowledge, one is given as the probability of the fact. The valuation function maps each term t_1, …, t_n to its vector representation according to the interpretation. The forward reasoning function requires a vector that maps each fact to a probabilistic value to achieve the differentiable computation. Thus, the probabilistic values are computed for all of the facts. Figure 2 illustrates an example of the implementation of the facts converter. We assume that the perception model produces the probabilities of the attributes (color, shape, position) for each object. (1) For a binary neural predicate over an object and an attribute value, we compute the probability of the fact by calling its valuation function. The object term is mapped to the output of the perception module, and the attribute term is mapped to its one-hot encoding. Using these vector representations of the terms, the valuation function computes the probability of the atom by simply performing tensor multiplication and summation.
(2) For a binary neural predicate over two objects, such as closeby, we compute the probability of the fact by calling its valuation function. Both object terms are mapped to the corresponding outputs of the perception model. Then the positional information is extracted, and logistic regression is performed on the distance between the two data points. By adapting the weights of the linear transformation, the facts converter can learn the concept flexibly. We note that the valuation functions of neural predicates are defined by the user, and parameterized valuation functions are trained before performing structure learning.
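The two kinds of valuation function described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `attribute_valuation` and `closeby_valuation`, the three-way attribute vector, and the logistic-regression parameters `weight` and `bias` are all assumptions chosen for the example.

```python
import math


def attribute_valuation(attr_probs, one_hot):
    """Multiply the perception output by the one-hot term encoding and sum."""
    return sum(p * h for p, h in zip(attr_probs, one_hot))


def closeby_valuation(pos_a, pos_b, weight=-10.0, bias=2.0):
    """Logistic regression on the Euclidean distance between two objects.

    weight and bias are hypothetical values; in practice they would be
    trained before structure learning.
    """
    dist = math.dist(pos_a, pos_b)
    return 1.0 / (1.0 + math.exp(-(weight * dist + bias)))


# Hypothetical perception output for one object: probabilities of three values.
obj1_attr = [0.9, 0.05, 0.05]
first_value = [1.0, 0.0, 0.0]  # one-hot encoding of the attribute term

p_attr = attribute_valuation(obj1_attr, first_value)
p_close = closeby_valuation((0.10, 0.10), (0.15, 0.10))
p_far = closeby_valuation((0.10, 0.10), (0.90, 0.90))
```

Both functions are differentiable in their inputs, which is what allows the probabilities of the resulting facts to be used in gradient-based learning.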

Differentiable forward-chaining inference
Forward-chaining inference is a type of inference in first-order logic to compute logical entailment (Russell & Norvig, 2009). For example, let C be a set of clauses and G be a set of all known facts. Then, forward-chaining inference can compute the set of facts F such that C ∪ G ⊧ F . Differentiable forward-chaining inference (Evans & Grefenstette, 2018;Shindo et al., 2021) computes the logical entailment in a differentiable manner. We briefly summarize the steps: (Step 1) A tensor that holds the relationships between clauses and facts is computed. (Step 2) Each clause is compiled into a differentiable function that performs forward reasoning using the tensor. (Step 3) A differentiable logic program is composed of the clause functions and their weights. T-time step inference is computed by amalgamating the inference results recursively.
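For intuition, the crisp (non-differentiable) version of T-step forward chaining can be sketched as below. The representation of ground clauses as `(head, body)` pairs and the atom names are assumptions made for this example; the differentiable variant replaces set membership with probabilities and the logical connectives with smooth tensor operations.

```python
def forward_chain(ground_rules, facts, steps):
    """Crisp forward chaining over ground definite clauses for `steps` steps."""
    known = set(facts)
    for _ in range(steps):
        # A head is derived when every atom in its body is already known.
        derived = {head for head, body in ground_rules
                   if all(atom in known for atom in body)}
        if derived <= known:  # fixed point reached early
            break
        known |= derived
    return known


# Toy ground program: q :- p.   r :- q, s.
rules = [("q", ["p"]), ("r", ["q", "s"])]
one_step = forward_chain(rules, {"p", "s"}, steps=1)
two_steps = forward_chain(rules, {"p", "s"}, steps=2)
```

In this toy program, `r` only becomes derivable at the second step because it depends on `q`, which illustrates why the inference is iterated T times.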

Tensor encoding
Following Shindo et al. (2021), we build a tensor holding relationships between clauses C and facts G . We assume that C and G are ordered sets, i.e., every element has its own index. Let L be the maximum body length in C , S be the maximum number of substitutions for existentially quantified variables in clauses C , C = |C| and G = |G| . Index tensor I ∈ ℕ^{C×G×S×L} contains the indices of the facts to compute forward inferences. Intuitively, I_{i,j,k,l} is the index of the l-th fact (subgoal) in the body of the i-th clause to derive the j-th fact with the k-th substitution for existentially quantified variables. Example ∈ C and F 2 = ( ) ∈ G , and we assume that terms of objects are { , } . To compute the subgoals for fact F 2 and clause R 0 , F 2 and the head atom can be unified by substitution = { = } . By applying to body atoms, we get clause . , which has an existentially quantified variable . By considering the possible substitutions for , namely ∕ and ∕ , we have grounded clauses, as shown at the top of Table 1. The bottom rows of Table 1 show elements of tensor I_{0,:,0,:} and I_{0,:,1,:} . Facts G and the indices are represented in the upper rows of the table. For example, , respectively. The atoms which have a different predicate, e.g., ( , ) , will never be entailed by clause R 0 . Therefore, the corresponding values are filled with 0, which represents the index of the false atom.

Valuation
The valuation vector v (t) ∈ ℝ G maps each fact into a continuous value at each time step t. Each value v (t) i represents the probability of fact F i ∈ G . The differentiable inference is performed based on valuation vectors. To compute the T-step forward-chaining inference, we compute the sequence of valuation vectors v (0) , … , v (T) . We denote a batch of valuation vectors at time step t as V (t) ∈ ℝ B×G , where B is the batch size. In logical reasoning, the parallelized batch computation is non-trivial. Thus, we explicitly denote the dimension of the batch in this section.

Clause function
Table 1 Example of grounded clauses (top) and elements in the index tensor (bottom). Each fact has its index, and the index tensor contains the indices of the facts to compute forward inferences.

Each clause R_i ∈ C is compiled into a clause function. The clause function takes a valuation vector V^(t) and returns a valuation vector C_i^(t) ∈ ℝ^{B×G}, which is the result of 1-step forward reasoning using R_i and V^(t). The clause function is computed as follows. Let I ∈ ℕ^{C×G×S×L} be an index tensor. First, tensor I_i ∈ ℕ^{G×S×L} is extended for batches, i.e., Ĩ_i ∈ ℕ^{B×G×S×L}, and V^(t) ∈ ℝ^{B×G} is extended to the same shape, i.e., Ṽ^(t) ∈ ℝ^{B×G×S×L}. Using these tensors, the clause function is computed as:

C_i^(t) = softor_2(prod_3(gather_1(Ṽ^(t), Ĩ_i))),    (1)

where gather_1(X, Y)_{i,j,k,l} = X_{i, Y_{i,j,k,l}, k, l}, prod_3 returns the product along dimension 3, and softor_d is a function for taking logical or softly along dimension d. In Eq. 1, applying the softor_2 function corresponds to considering all possible substitutions for existentially quantified variables in the body atoms of the clause and taking logical or softly over the results of the possible substitutions. The results from each clause are stacked into tensor C^(t) ∈ ℝ^{C×B×G}, i.e., C^(t) = stack_0(C_1^(t), …, C_C^(t)), where stack_0 is the stack function for tensors along dimension 0. Figure 3 illustrates the clause function. A clause function computes the forward-chaining inference for a clause. The perception module and facts converter produce an initial valuation vector V^(0). For each grounded clause, the probabilities of the subgoals are extracted by the gather function. Then the product over the body atoms is computed, and the logical or is computed softly to amalgamate the results from different groundings of the clause.
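The gather, product, and soft-or steps of the clause function can be sketched for a single example (no batch dimension) with plain Python lists. Several details here are assumptions made for illustration only: the soft or is taken to be a scaled log-sum-exp with smoothing parameter `gamma`, clipped at 1; index 1 is assumed to hold the always-true atom used to pad short bodies; and the toy valuation and index values are invented.

```python
import math


def softor(values, gamma=0.01):
    """Assumed smooth logical or: scaled log-sum-exp, clipped to at most 1."""
    m = max(values)
    lse = m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in values))
    return min(lse, 1.0)


def clause_function(valuation, index):
    """One-step forward reasoning for one clause and one example.

    valuation[g] is the probability of fact g; index[j][k][l] is the index of
    the l-th body atom needed to derive fact j under the k-th substitution.
    """
    out = []
    for substitutions in index:  # one entry per target fact j
        # "gather" the body-atom probabilities and take their product.
        body_scores = [math.prod(valuation[g] for g in body)
                       for body in substitutions]
        # Soft or over the substitutions of existentially quantified variables.
        out.append(softor(body_scores))
    return out


# Facts: 0 = false, 1 = true (assumed padding atom), 2..4 = body atoms, 5 = head.
valuation = [0.0, 1.0, 0.9, 0.8, 0.1, 0.0]
no_derivation = [[0, 0], [0, 0]]  # facts this clause can never entail
index = [no_derivation] * 5 + [[[2, 3], [4, 1]]]
result = clause_function(valuation, index)
```

Here fact 5 can be derived either from facts 2 and 3 (product 0.72) or from fact 4 alone (0.1), and the soft or returns a value close to the larger of the two.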

Soft (logic) program composition
Fig. 3 Schematic illustration of the forward reasoning. For simplicity, the first dimension for the batch is removed in the figure.

In 𝛼ILP, a logic program is represented smoothly as a weighted sum of the clause functions, following Shindo et al. (2021). Intuitively, 𝛼ILP has M distinct weights for each clause, i.e., W ∈ ℝ^{M×C}. By taking the softmax of W along dimension 1, M clauses are softly chosen from C clauses. The weighted sum of clause functions is computed as follows. First, we take the softmax of the clause weights W ∈ ℝ^{M×C}: W* = softmax_1(W), where softmax_1 is a softmax function over dimension 1. The clause weights W* ∈ ℝ^{M×C} and the output of the clause function C^(t) ∈ ℝ^{C×B×G} are expanded to the same shape, W̃*, C̃^(t) ∈ ℝ^{M×C×B×G}. Then we compute tensor H^(t) ∈ ℝ^{M×B×G}:

H^(t) = sum_1(W̃* ⊙ C̃^(t)),

where ⊙ is element-wise multiplication and sum_1 is a summation along dimension 1. Each value H^(t)_{i,j,k} represents the result for the k-th fact using the i-th clause weights for the j-th example in the batch. Finally, we compute tensor R^(t) ∈ ℝ^{B×G}, corresponding to the fact that a logic program is a set of clauses: R^(t) = softor_0(H^(t)), taking logical or softly over the M chosen clauses. To compute the multi-step reasoning, V^(t+1) is computed as: V^(t+1) = softor_1(stack_1(V^(t), R^(t))). The reasoning process is illustrated in Fig. 4.
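One reasoning step of the soft program composition can be sketched as follows for a single example (the batch dimension is dropped). As an assumption for this sketch, the soft or is a scaled log-sum-exp clipped at 1, and the weights and clause outputs are invented toy values.

```python
import math


def softor(values, gamma=0.01):
    """Assumed smooth logical or: scaled log-sum-exp, clipped to at most 1."""
    m = max(values)
    lse = m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in values))
    return min(lse, 1.0)


def softmax(row):
    m = max(row)
    exps = [math.exp(w - m) for w in row]
    total = sum(exps)
    return [e / total for e in exps]


def reasoning_step(weights, clause_outputs, valuation):
    """weights: M x C, clause_outputs: C x G, valuation: G (one example)."""
    w_star = [softmax(row) for row in weights]  # softly choose M clauses
    num_facts = len(valuation)
    # H[m][g]: weighted sum over clauses for the m-th weight vector.
    h = [[sum(w * out[g] for w, out in zip(row, clause_outputs))
          for g in range(num_facts)] for row in w_star]
    # R[g]: soft or over the M softly chosen clauses.
    r = [softor([h_m[g] for h_m in h]) for g in range(num_facts)]
    # V^(t+1): soft or of the previous valuation and the newly derived values.
    return [softor([valuation[g], r[g]]) for g in range(num_facts)]


weights = [[5.0, -5.0]]                   # M = 1, C = 2
clause_outputs = [[0.0, 0.8], [0.0, 0.2]]  # output of each clause function
v_next = reasoning_step(weights, clause_outputs, [0.0, 0.1])
```

Because the softmax puts almost all of its mass on the first clause, the updated value of the second fact is close to that clause's output of 0.8.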

Prediction
We assume that language L has a constant that represents the input image and a predicate to compose an atom representing that the input is positive, e.g., ( ) ∈ G . For given visual input e, 𝛼ILP simply extracts the value from the result of the forward reasoning to predict class label y ∈ {0, 1} as follows: where C is a set of clauses, B is background knowledge, W is a set of clause weights, Θ_per is a set of parameters for the visual-perception model, Θ_np is a set of parameters for neural predicates, I_G(x) is a function that returns the index of x in G , and v[i] is the i-th element of v , i.e., v_i . 𝛼ILP accepts background knowledge as a set of facts and clauses.

Program induction from visual scenes
𝛼ILP learns differentiable logic programs that describe complex visual scenes. We basically follow the differentiable ILP setting (Evans & Grefenstette, 2018; Shindo et al., 2021), where an ILP problem is formulated as an optimization problem that has the following general form:

min_W loss(Q, C, W),

where Q is an ILP problem, C is a set of candidate clauses, W is a set of weights for clauses, and loss is a loss function that returns a penalty when training constraints are violated. We note that we solve visual ILP problems, where each positive and negative example is an image containing several objects.

Fig. 4 An illustration of differentiable forward-chaining reasoning. Each clause is compiled into a clause function. Each clause has M distinct weights. The input valuation vector V^(t) is fed to the clause function. By applying tensor operations, the forward-chaining reasoning is computed using weighted clauses. More details are in the main text. For simplicity, the dimension for the batch is removed in the figure.
Algorithm 1 describes the learning process of 𝛼ILP. (Line 1) The perception model is trained using perception dataset D_perception, which consists of pattern-free figures. The dataset is annotated for objects, e.g., with class labels. Parameterized neural predicates can also be trained by visual input with a trained perception module or by scene graphs. (Lines 2-4) Finally, the logic program that describes the visual scene is learned by performing differentiable ILP, as illustrated in the top row of Fig. 1. The process mainly consists of two steps: (i) clause generation by top-k beam search and (ii) learning of clause weights by backpropagation. We now describe each step in detail.

Top-k beam search of clauses
Let Q be a visual ILP problem. 𝛼ILP generates promising candidate clauses using top-k beam search. Promising candidate clauses for an ILP problem are those that entail a majority of positive examples but few negative examples. Figure 5 illustrates the clause generation steps from visual scenes. We start from given initial clauses and iteratively refine the top-k clauses based on the following evaluation score: where E+ ∈ Q is a set of positive examples and 1 is a 1 × 1 identity matrix. If clause R can entail the majority of positive examples combined with background knowledge, then clause R gets a high evaluation score. In each step, 𝛼ILP evaluates clauses in parallel using a variant of the reasoning module designed for the evaluation of clauses, i.e., without loops over clauses. The generation of new clauses is conducted using the downward refinement operator (Nienhuys-Cheng et al., 1997), which is a fundamental clause-generation tool in ILP. The downward refinement operator specializes (weakens) the clauses, i.e., the new clauses that are produced by the operator entail fewer examples with background knowledge than the original clause (Nienhuys-Cheng et al., 1997). Thus, fewer negative examples are entailed by the newly generated clauses. Therefore, we evaluate clauses only on positive examples and repeatedly generate clauses with the downward refinement operator to produce a search space that contains both general and specific clauses. During the clause generation, 𝛼ILP adopts mode declarations (Muggleton, 1995; Ray & Inoue, 2007) to manage the search space, i.e., the clauses that are inconsistent with the mode declarations are pruned. The visual-perception module enables 𝛼ILP to evaluate each clause using visual input. 𝛼ILP thus utilizes symbolic learning techniques while dealing with complex visual scenes.
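The control flow of top-k beam search with a refinement operator can be sketched generically. Everything concrete in this example is an assumption for illustration: clauses are represented as sorted tuples of body atoms, `refine` adds one atom (a simple stand-in for downward refinement), and `score` counts covered positive examples crisply instead of summing differentiable entailment probabilities.

```python
def beam_search(initial, refine, score, k, depth):
    """Collect the top-k refinements at each level of the search."""
    beam = list(initial)
    candidates = list(initial)
    for _ in range(depth):
        refined = {c for clause in beam for c in refine(clause)}
        if not refined:
            break
        # Highest score first; ties are broken by the clause itself.
        beam = sorted(refined, key=lambda c: (-score(c), c))[:k]
        candidates.extend(beam)
    return candidates


atoms = ["a", "b", "c"]
positives = [{"a", "b"}, {"a", "b", "c"}, {"a"}]  # toy positive examples


def refine(clause):
    """Specialize a clause by adding one body atom it does not yet contain."""
    return [tuple(sorted(clause + (atom,))) for atom in atoms
            if atom not in clause]


def score(clause):
    """Number of positive examples in which every body atom holds."""
    return sum(all(atom in example for atom in clause) for example in positives)


candidates = beam_search([()], refine, score, k=2, depth=2)
```

The returned list contains clauses from every level of the search, so it mixes general and specific candidates, which is the search space the weight-learning step then operates on.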

Learning weights
𝛼ILP assigns weights to the generated clauses. Clause weights are optimized by gradient descent. Let Q = (E+, E−, B, L) be a visual ILP problem, C be a set of generated clauses, W be a set of clause weights, Θ_per be the parameters for the perception model, and Θ_np be the parameters for the neural predicates. We solve visual ILP problem Q by minimizing cross-entropy loss with respect to W , defined as: where Y = {(e, 1) | e ∈ E+} ∪ {(e, 0) | e ∈ E−} , which is a set of tuples of an example and the label indicating positive or negative.
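As a reference point, a standard binary cross-entropy over labeled examples can be sketched as below. This is a generic formulation under stated assumptions: the loss is averaged over the examples, each prediction is the probability the reasoning pipeline assigns to the positive class, and the clamping constant `eps` is added only for numerical safety.

```python
import math


def cross_entropy(predictions, eps=1e-7):
    """Mean binary cross-entropy over (probability, label) pairs."""
    total = 0.0
    for prob, label in predictions:
        prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total / len(predictions)


# Hypothetical predicted probabilities for one positive and one negative example.
good = cross_entropy([(0.9, 1), (0.1, 0)])
bad = cross_entropy([(0.1, 1), (0.9, 0)])
```

A program that assigns high probability to positive examples and low probability to negative ones yields a small loss, so gradient descent on the clause weights pushes the softmax towards clauses with that behavior.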

Experimental evaluation
We empirically demonstrate the following desired properties of 𝛼ILP on two different datasets: (i) 𝛼ILP solves ILP problems in visual scenes with high accuracy. (ii) 𝛼ILP can explain, i.e., it produces a readable solution in the form of logic programs. (iii) 𝛼ILP is robust to confounding. (iv) 𝛼ILP is data-efficient, unlike CNNs. (v) 𝛼ILP performs fast inference.

Dataset
We adopted the Kandinsky pattern datasets (Holzinger et al., 2021), a relatively new benchmark for object-centric reasoning tasks. Kandinsky-20k contains 10k training examples for each of the positive and negative classes. The validation and test splits each contain 5k examples per class. For the Kandinsky-2k dataset, we reduced the amount of training data by randomly sampling from the Kandinsky-20k dataset. Its training split contains 1k examples for each of the positive and negative classes. The validation and test splits are the same as in the Kandinsky-20k dataset.
We use 4 Kandinsky patterns: twopairs, closeby, red-triangle, and online-pair. Figure 6 shows a positive example for each pattern. For the clause generation step, we used 500 examples from the validation split for each dataset.

Pre-training
For pre-training of the visual perception module, we randomly generated 15k Kandinsky figures with annotations for each object, i.e., the number of objects and the attributes of the objects are randomly determined. We used YOLO (Redmon et al., 2016) as the perception module. Each object is annotated with its class label and bounding box. The neural predicates closeby and online are each trained on 10k Kandinsky figures that represent the respective concept, e.g., figures that consist of two objects that are close by each other are generated as the positive examples for closeby.

Baselines
We adopted ResNet (He et al., 2016) as a CNN-based baseline and also compared against YOLO+MLP, where the input figure is fed to the pre-trained YOLO model, and a simple MLP module predicts the class label from the YOLO outputs. The whole network is jointly trained.
Table 2 shows the results for the test split in each Kandinsky dataset. The CNN model overfits during training and thus performs poorly in every Kandinsky pattern. The YOLO+MLP model performs comparatively better and achieves greater than 90% accuracy in twopairs. However, in the relatively complex patterns closeby, red-triangle, and online-pair, its performance degrades. In contrast, αILP outperforms the considered baselines significantly and achieves perfect classification in all of the patterns. In the smaller dataset, Kandinsky-2k, the neural baselines lose accuracy because the training data is insufficient for them to generalize. In contrast, αILP still achieves perfect accuracy. This shows the data efficiency of αILP.
Figure 7 shows the classification rules discovered by αILP, which are obtained by taking the argmax of the rule weights. αILP successfully produced interpretable results in all of the datasets. After the training step, we observed that the distribution of the weights over clauses became sharp, i.e., one clause gets nearly 1.0 and the others get almost 0.0.
Figure 8 shows the accuracy for the test split in the first 600 iterations on the Kandinsky-20k dataset. For a fair comparison, we used the same learning rate of 1e−2 for αILP and the YOLO+MLP model, and 1e−5 for the CNN baseline, to plot the figure. The line represents the mean, and the shadow represents the standard deviation over 5 trials. One iteration corresponds to one weight update using a batch of training examples. The result shows that αILP achieves high classification accuracy with fewer iterations than the neural baselines.

Dataset
The CLEVR-Hans dataset (Stammer et al., 2021) contains confounded CLEVR (Johnson et al., 2017) images, and each image is associated with a class label. We adopted the CLEVR-Hans3 dataset, which has three classes, as shown in Fig. 9. Each class has a corresponding classification rule. For each class, we create an ILP problem, where the positive examples are the images that belong to that class, and the negative examples are the images that belong to the other classes. As a result, we have three classification problems: class1, class2, and class3. CLEVR-Hans problems involve confounded data. For example, in the training and validation splits of class1, the large cube in the positive examples is always gray, but in the test split, it has different colors. To achieve good performance in the test split, the model needs to know the exact classification rule without

Pre-training
The slot attention model was pre-trained following Locatello et al. (2020) using the set prediction setting on the CLEVR (Johnson et al., 2017) dataset.

Baselines
The considered baselines are the ResNet-based CNN model (He et al., 2016) and the Neuro-Symbolic (NeSy) model (Stammer et al., 2021). The NeSy model has a visual perception module based on slot attention (Locatello et al., 2020) and a reasoning module based on Set Transformer (Lee et al., 2019). The NeSy model was trained in two different settings: (1) training using classification rules (NeSy), and (2) the right for the right reasons (Ross et al., 2017) setting, i.e., the model is trained using supervision about confounding factors (NeSy-XIL). NeSy-XIL is the state-of-the-art model on the CLEVR-Hans dataset.
Table 3 shows the classification accuracy on the CLEVR-Hans dataset. The results of the baselines are taken from Stammer et al. (2021). αILP achieved more than 97% accuracy in each split. Note that the NeSy-XIL model exploits supervision on the confounding factors. In contrast, αILP is unsupervised in terms of confounding factors. This shows that αILP is robust to confounding. αILP can control the complexity of the solution (logic programs) by controlling the depth of the top-k beam search, i.e., αILP can prevent overfitting when given a proper search depth, which can be determined by trying values from a small number upward on the validation split. Figure 7 shows the classification rules discovered by αILP, which are obtained by taking the argmax of the rule weights. After the training step, we observed that the distribution of the weights over clauses became sharp, i.e., one clause gets nearly 1.0, and the others get almost 0.0.

Ablation study
We analyze the efficiency of αILP for the clause generation step, the weight learning step, and the reasoning step, respectively. Finally, we discuss the limitations of αILP.

Running time and number of clauses in clause generation
We analyze the clause generation step of αILP in terms of the running time and the number of generated clauses. We used 500 examples from the validation split for each dataset. The size of the search beam is 20. The depth of the search is [5, 2, 6, 4] for twopairs, closeby, red-triangle, and online-pair in Kandinsky patterns, respectively, and [5, 6, 7] for class1, class2, and class3 in CLEVR-Hans, respectively. Table 4 shows the running time of the top-k beam search and the number of generated clauses in αILP. The clause generation step takes about 58 s in the best case and about 1000 s in the worst case. αILP searched a space that contains several thousand candidate clauses and successfully found promising clauses from visual scenes. This empirically shows that αILP performed an

Running time of weight learning
We compare the running time of weight learning in αILP with that of the neural baselines. Table 5 shows the running time of weight learning for one epoch on the Kandinsky pattern datasets. αILP achieved comparably fast weight-learning iterations for problems with simple rules, e.g., closeby. For problems that have a complex search space, e.g., red-triangle, αILP takes longer to compute the gradient and update the clause weights. This is because, as shown in Table 4, αILP deals with a large number of clauses for difficult problems, which require a deeper search of clauses. We note that, as shown in Fig. 8, αILP can achieve high accuracy with fewer iterations than the neural baselines.

Running time of reasoning
We show that αILP can perform fast inference by parallelized GPU-based batch computation. Figure 10 shows the prediction time with different batch sizes on the Kandinsky datasets. We measured the inference time for all training examples in the twopairs, closeby, and online-pair datasets for Kandinsky patterns, and the class1 and class3 datasets for CLEVR-Hans. We used batch sizes of 1, 8, 24, 64, 128, 256, and 512. We extracted the learned clause after the training of αILP and fed it to the forward-reasoning module, i.e., the reasoning module handles one clause for each dataset, enabling a large batch size on a single GPU. For each dataset, αILP achieved faster prediction with larger batch sizes. In the twopairs dataset, with a batch size of 1, it takes 2380 s to classify all of the 20k training visual examples. However, with a batch size of 512, αILP classified them in 31 s. The empirical result shows that αILP can perform fast reasoning using batch computation, which is an essential function for being tightly coupled with deep neural networks.
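To put the twopairs figures in perspective, the short calculation below converts the two reported running times into throughput and speedup. Only the three input numbers come from the text; the variable names are illustrative.

```python
# Reported figures for the twopairs dataset.
num_examples = 20_000
seconds_batch_1 = 2380.0
seconds_batch_512 = 31.0

# Examples classified per second at each batch size.
throughput_batch_1 = num_examples / seconds_batch_1
throughput_batch_512 = num_examples / seconds_batch_512

# Relative speedup obtained by batching.
speedup = seconds_batch_1 / seconds_batch_512

print(f"batch size 1:   {throughput_batch_1:.1f} examples/s")
print(f"batch size 512: {throughput_batch_512:.1f} examples/s")
print(f"speedup:        {speedup:.1f}x")
```

That is, batching raises throughput from roughly 8 to roughly 645 examples per second, a speedup of about 77 times.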

Limitations
We discuss the limitations of αILP. The approach is memory-intensive because the size of the index tensor is not linear in the number of facts and the number of clauses. In contrast to backward reasoning, forward-chaining reasoning computes all of the possible solutions. To handle large knowledge bases, a more memory-efficient mechanism is necessary. In the experiments, αILP assumed that the perception model is pre-trained. It also assumed that the search space contains a set of rules that can perfectly classify the examples. αILP requires a language bias, e.g., mode declarations, to limit the search space. The learning algorithm is a hybrid of top-k beam search and gradient descent. Before numerical optimization is performed, the search space, i.e., the set of candidate rules, needs to be identified by the top-k beam search.

Conclusion and future work
We proposed αILP, a novel differentiable ILP framework for visual scenes. αILP learns logic programs that explain complex visual scenes from visual inputs based on object-centric perception and differentiable ILP. In our experiments, αILP outperformed CNN-based baselines on the Kandinsky patterns and CLEVR-Hans datasets, where the classification rules are defined on high-level concepts. αILP provides the following advantages over CNN-based models. Firstly, αILP solves complex patterns in visual scenes such as the Kandinsky patterns and CLEVR-Hans datasets, which cannot be solved by CNN-based models. Secondly, αILP produces explicit classification rules as a logic program. Thirdly, αILP is data-efficient, i.e., it can achieve high performance even from a small training set. Lastly, αILP is robust to confounding, i.e., it generalizes even if some features are confounded in the training dataset. These advantages highlight that αILP can overcome some significant limitations of neural models for complex visual scenes. Moreover, αILP performs fast differentiable inference for a large number of instances of complex visual scenes. This feature is critical for logical reasoning to be tightly coupled with neural networks. In this sense, αILP is an extension of Neuro-Symbolic systems, e.g., DeepProbLog (Manhaeve et al., 2018, 2021) and ∂ILP (Evans & Grefenstette, 2018), for structure learning in visual domains.
A common criticism of ILP also applies to αILP, e.g., hand-crafted background knowledge and language bias are crucial. A promising direction of future research is to develop a neuro-symbolic pipeline that generates proper background knowledge and language bias from data by incorporating ILP techniques such as predicate invention (Cropper et al., 2022). Solving compositional reasoning tasks (Vedantam et al., 2021) is another direction of future research. The algorithmic supervision setting (Petersen et al., 2021), where neural networks are trained in combination with differentiable implementations of discrete algorithms, is also a promising approach with αILP, because the reasoning of αILP is compatible with neural networks in terms of the running time. Moreover, differentiable implementations of the top-k operator (Goyal et al., 2018; Xie et al., 2020; Pietruszka et al., 2021) could turn αILP into an end-to-end learning system.
The mode declarations (Muggleton, 1995; Cropper et al., 2022) we used are shown in Table 6. Table 7 shows the data types and constants, and Table 8 shows the predicates for Kandinsky patterns. Hyperparameters for the clause generation are shown in Table 9. #obj represents the number of objects to focus on for the classification; it can be identified by starting from the smallest number, evaluating on the validation split, and increasing it if the performance is insufficient. We set the initial clause as the root node in the beam search, where n is the number of objects to focus on, i.e., #obj in Table 9. Background knowledge given for Kandinsky patterns is shown in Table 10.
Table 8 (predicate explanations): the object is in the image; the object has the shape of the second argument; the object has the color of the second argument; the two objects are located close by each other; the objects are aligned on a line.
For the neural predicate closeby, we used a valuation function in which Z^(i)_center represents the center coordinate of the bounding box for the i-th object, and w is a parameter to be trained.
For the neural predicate online, we used a valuation function in which the function f_reg computes the closed-form solution of the linear regression in batch and returns the error values, and w is a parameter to be trained.
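The two valuation functions above, one based on bounding-box centers and one based on the error of a closed-form linear regression, can be sketched as follows. The exact functional forms are not reproduced here; the sigmoid of a linear function of the distance or error, the parameter values, and every name in the code are assumptions made for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def center_distance_valuation(a: Point, b: Point, w: float = -10.0, bias: float = 2.0) -> float:
    """Assumed form: high when the two bounding-box centers are near each other."""
    distance = math.hypot(a[0] - b[0], a[1] - b[1])
    return sigmoid(w * distance + bias)


def line_fit_error(points: Sequence[Point]) -> float:
    """Mean squared residual of the least-squares line y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Vertical lines (var_x == 0) are not handled in this sketch.
    slope = cov_xy / var_x if var_x > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / n


def line_fit_errors(batch: Sequence[Sequence[Point]]) -> List[float]:
    """Apply the closed-form fit to every scene in a batch."""
    return [line_fit_error(points) for points in batch]


def line_fit_valuation(points: Sequence[Point], w: float = -100.0, bias: float = 2.0) -> float:
    """Assumed form: high when the object centers lie close to one line."""
    return sigmoid(w * line_fit_error(points) + bias)


aligned = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
scattered = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]
errors = line_fit_errors([aligned, scattered])
```

With a trained w, such functions map continuous object positions to truth values in (0, 1), which is what the reasoning module consumes.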

A.2 Kandinsky-2k
CNN. We trained ResNet18 for 300 epochs with a batch size of 64. We used the Adam optimizer with a learning rate of 1e−5.
YOLO+MLP. We used an MLP with two hidden layers with a non-linearity. The output of the pre-trained YOLO model is reshaped and fed into the MLP to predict the class label. We trained the whole YOLO+MLP network jointly for 1000 epochs with a batch size of 64. We used the Adam optimizer (Kingma & Ba, 2015; Ruder, 2016) with a learning rate of 1e−5.
αILP. We trained in the same setting as for Kandinsky-20k.

A.3 CLEVR-Hans
We trained the αILP model for 100 epochs with a batch size of 256. We used the RMSProp optimizer (Ruder, 2016) with a learning rate of 1e−2. We used 500 positive examples in the validation split to generate clauses by beam search. The mode declarations (Muggleton, 1995; Cropper et al., 2022) we used are shown in Table 11. Hyperparameters for the clause generation are shown in Table 12. Table 13 shows the data types and constants, and Table 14 shows the predicates for CLEVR-Hans. We set the initial clause to be the root node in the beam search as: ( ) :-( , ), ( , ) . We did not provide any background knowledge for CLEVR-Hans tasks.

Appendix B: Perception models in experiments
We describe the experimental setting of the pre-training of the perception models in our experiments.
Table 14 (predicate explanations): the object has the shape of the second argument; the object has the color of the second argument; the object has the material of the second argument; the object has the size of the second argument.

B.1 YOLO for Kandinsky patterns
Model We used the YOLOv5 model, whose implementation is publicly available. We adopted the YOLOv5s variant, which has 7.3 M parameters.
Dataset We generated 15,000 pattern-free figures for training and 5000 figures for validation. The class labels and positions are generated randomly. The original image size is 620 × 620, and each image is resized to 128 × 128. The label consists of the class label and the bounding box for each object. The class label is the combination of the shape and the color of the object, e.g., red circle and blue square. The number of classes is 9. Each image contains at least 2 and at most 10 objects.
Optimization We trained the YOLOv5s model by stochastic gradient descent (SGD) for 400 epochs, starting from the pre-trained weights. We used the loss function that approximates detection performance, presented in Redmon et al. (2016). We set the learning rate to 0.01 and the batch size to 64. The SGD optimizer used a momentum of 0.937. We set the weight decay to 0.0005 and used 3 warmup epochs for training.

B.2 Slot Attention for CLEVR-Hans
We used the same model and training setup as the pre-training of the slot-attention module in Stammer et al. (2021). In the preprocessing, we downscaled the CLEVR-Hans images to a dimension of 128 × 128 and normalized the pixel values to lie between −1 and 1. For training the slot-attention module, an object is represented as a vector of binary values for the shape, size, color, and material attributes and continuous values between 0 and 1 for the x, y, and z positions. We trained the slot attention model with the set prediction architecture following Locatello et al. (2020), using a loss function based on the Hungarian algorithm. We refer to Stammer et al. (2021) for more details.
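The idea behind a matching-based set prediction loss can be illustrated with a small sketch. For brevity it swaps in a brute-force search over all permutations, which returns the same minimum-cost assignment that the Hungarian algorithm finds but is only practical for a handful of objects, and it uses a squared-error pair cost as a stand-in. All names and the toy vectors are hypothetical.

```python
from itertools import permutations
from typing import Sequence

Vector = Sequence[float]


def pair_cost(pred: Vector, target: Vector) -> float:
    """Squared error between one predicted and one target object vector (assumed cost)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def set_prediction_loss(preds: Sequence[Vector], targets: Sequence[Vector]) -> float:
    """Cost of the best one-to-one matching, averaged over objects.

    Brute force over permutations; assumes len(preds) == len(targets).
    """
    n = len(targets)
    best = min(
        sum(pair_cost(preds[i], targets[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / n


# The predicted slots are in a different order than the targets;
# the matching makes the loss invariant to that order.
targets = [[0.0, 1.0, 0.8], [1.0, 0.0, 0.2]]
exact = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.8]]
shifted = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.7]]

loss_exact = set_prediction_loss(exact, targets)
loss_shifted = set_prediction_loss(shifted, targets)
```

The permutation invariance matters because the slots of an object-centric model carry no fixed order, so predictions must be matched to ground-truth objects before any per-object error is computed.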
where γ > 0 is a smooth parameter, sum_d is the sum function along dimension d, and S = max(1.0, max(γ log sum_d exp(X/γ))). The normalization term S ensures that the softor_d function returns normalized probabilistic values. The dimension d specifies the dimension to be removed.
A popular choice is the probabilistic sum function: f_prob_sum(X, Y) = X + Y − X ⊙ Y, which was adopted in Evans and Grefenstette (2018) and Jiang and Luo (2019). We plot the various functions for logical or in Fig. 11 for comparison. From left to right, the plots correspond to max, probabilistic sum, softor with γ = 0.1, and softor with γ = 0.01, respectively, with respect to 2-dimensional input x, y ∈ [0, 1]. The maximum and minimum values of each plot are shown above it, and the values are represented by colors ranging from blue to red. The softor_d function with a sufficiently small smooth parameter approximates the logical or function well for probabilistic values.
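A small numerical sketch makes the comparison concrete. It assumes that softor is the scaled log-sum-exp that appears inside the normalization term, divided by that term; the function names, the list-of-rows input format, and the name gamma for the smooth parameter are choices made for this example.

```python
import math
from typing import List, Sequence


def scaled_log_sum_exp(values: Sequence[float], gamma: float) -> float:
    """gamma * log(sum(exp(v / gamma))), computed with a shift for stability."""
    scaled = [v / gamma for v in values]
    shift = max(scaled)
    return gamma * (shift + math.log(sum(math.exp(s - shift) for s in scaled)))


def softor(rows: Sequence[Sequence[float]], gamma: float = 0.01) -> List[float]:
    """Smooth logical or along each row, normalized so that no output exceeds 1."""
    raw = [scaled_log_sum_exp(row, gamma) for row in rows]
    norm = max(1.0, max(raw))  # the normalization term over the whole batch
    return [r / norm for r in raw]


def prob_sum(x: float, y: float) -> float:
    """Probabilistic sum, a common alternative for logical or."""
    return x + y - x * y


rows = [[0.1, 0.9], [1.0, 1.0], [0.0, 0.0]]
smooth = softor(rows, gamma=0.01)
hard = [max(row) for row in rows]
```

For small values of the smooth parameter the outputs stay close to the element-wise maximum, whereas the probabilistic sum of 0.1 and 0.9 is 0.91 and grows further as more terms are combined.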

Appendix E: Mode declaration
Mode declarations (Muggleton, 1995; Ray & Inoue, 2007) are a common form of language bias. We used mode declarations defined as follows. A mode declaration is either a head declaration modeh(r, p(mdtype_1, …, mdtype_n)) or a body declaration modeb(r, p(mdtype_1, …, mdtype_n)), where r ∈ ℕ is an integer, p is a predicate, and mdtype_i is a mode datatype. A mode datatype is a tuple (pm, dtype), where pm is a place-marker and dtype is a datatype. A place-marker is either #, which represents constants, or + (resp. −), which represents input (resp. output) variables. r represents the number of usages of the predicate to compose a solution.
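How such declarations prune the search space can be sketched with a simple consistency check. The data structures, the function name `respects_modes`, the example predicates, and the Prolog-style convention that variables start with an uppercase letter are all assumptions of this illustration, and the check covers only arity, place-markers for constants, and the usage bound.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ModeDeclaration:
    kind: str  # "head" or "body"
    recall: int  # maximum number of usages of the predicate
    predicate: str
    args: Tuple[Tuple[str, str], ...]  # (place-marker, datatype) per argument


@dataclass(frozen=True)
class Atom:
    predicate: str
    terms: Tuple[str, ...]


def is_variable(term: str) -> bool:
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def respects_modes(body: List[Atom], modes: List[ModeDeclaration]) -> bool:
    """Check a clause body against the body declarations."""
    body_modes = {m.predicate: m for m in modes if m.kind == "body"}
    usage = Counter(atom.predicate for atom in body)
    for atom in body:
        mode = body_modes.get(atom.predicate)
        if mode is None or len(atom.terms) != len(mode.args):
            return False
        for term, (marker, _dtype) in zip(atom.terms, mode.args):
            # "#" demands a constant; "+" and "-" demand a variable.
            if (marker == "#") == is_variable(term):
                return False
    return all(count <= body_modes[pred].recall for pred, count in usage.items())


MODES = [
    ModeDeclaration("body", 2, "in", (("-", "object"), ("+", "image"))),
    ModeDeclaration("body", 1, "color", (("+", "object"), ("#", "color"))),
]

ok_body = [Atom("in", ("O1", "X")), Atom("in", ("O2", "X")), Atom("color", ("O1", "red"))]
too_many = ok_body + [Atom("in", ("O3", "X"))]
```

A refinement step would discard any candidate for which the check fails, so the beam never spends evaluation time on clauses outside the declared language.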