1 Introduction

Galileo’s Sidereus Nuncius (Galilei 2004) describes the first ever telescopic observations of the moon. Using sketches of shadow patterns, Galileo conjectured the existence of mountains containing hollow areas (i.e. craters) on a celestial body previously thought perfectly spherical. His reasoned description, derived from a handful of observations, relies on knowledge of (i) classical geometry, (ii) the straight-line movement of light and (iii) the Sun as an out-of-view light source. This paper investigates the use of Inductive Logic Programming (ILP) (Muggleton et al. 2011) to derive logical hypotheses, related to those of Galileo, from a small set of real-world images. Figure 1 illustrates part of the generic background knowledge used by ILP for interpreting object convexity in Experiment 1 (Sect. 5.1).

Fig. 1 Interpretation of light source direction: a waxing crescent moon (Credit: UC Berkeley), b concave/convex illusion, c concave and d convex photon reflection models, e Prolog recursive model of photon reflection

Figure 1a shows an image of the crescent moon in the night sky, in which the convexity of the overall surface implies the position of the Sun as a hidden light source beyond the lower right corner of the image. Figure 1b shows an illusion in which assuming a light source in the lower right leads to perception of convex circles on the leading diagonal. Conversely, a light source in the upper left implies that they are concave. Figure 1c shows how interpretation of a convex feature, such as a mountain, comes from illumination of the right side of a convex object. Figure 1d shows that perception of a concave feature, such as a crater, comes from illumination of the left side. Figure 1e shows how Prolog background knowledge encodes a simple recursive definition of the reflected path of a photon.
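To make the idea of Fig. 1e concrete, the following is a minimal sketch of such recursive background knowledge, written by us for illustration rather than taken from the figure; light_source/1, unobstructed/2 and reflects/2 are assumed primitives.

```prolog
% Minimal sketch (illustrative, not the figure's exact code): a photon is
% observed at a point either directly from the light source or after a
% chain of reflections off intermediate surface points.
photon_path(Source, Point) :-
    light_source(Source),
    unobstructed(Source, Point).        % direct straight-line path
photon_path(Source, Point) :-
    photon_path(Source, SurfacePoint),  % photon first reaches a surface point...
    reflects(SurfacePoint, Point).      % ...and is reflected on towards Point
```

Under such a theory, the observed brightness pattern of a surface constrains where the (possibly out-of-view) light source can be.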

This paper explores the phenomenon of knowledge-based perception using an extension of Logical Vision (LV) (Dai et al. 2015). In the previous work LV was shown to accurately learn a variety of polygon classes from artificial images with low sample requirements compared to statistical learners. LV generates logical hypotheses concerning images using an ILP technique called Meta-Interpretive Learning (MIL) (Muggleton et al. 2015; Cropper and Muggleton 2016).

Contributions of this paper The main contributions are:

1. We describe a generalisation of LV (Dai et al. 2015) which is tolerant to both classification noise and attribute noise.

2. We show that, even in the presence of noise in real images [absent in the artificial images of Dai et al. (2015)], effective learning can be achieved from as few as one image.

3. We demonstrate that, in all cases studied, the combination of a logic-based learner with a statistical estimator requires far fewer images (sometimes only one) to achieve accuracies that statistical machine learning on its own reaches only with large numbers of images.

4. We demonstrate that LV can use, as well as invent, generic background knowledge about the reflection of photons when providing explanations of visual features.

5. We demonstrate that LV has potential in real application domains such as RoboCup.

RoboCup domain In Experiment 2 (Sect. 5.2) we investigate LV in the context of robotics. Figure 2 shows images from the RoboCup Soccer Standard Platform League, a competition in which two teams of five Aldebaran Nao robots are placed on a 9 m \(\times \) 6 m field and operate autonomously to play soccer. The robots use cameras to detect the ball, field lines, goals and other robots. In Fig. 2a the ball can be seen distinctly, whereas in Fig. 2b, c the ball is partially occluded. The difficulty in recognising the ball is that it consists of several patches of black and white, while many other objects on the field also contain white regions. However, background knowledge concerning the geometry of a sphere projected onto a 2D plane guarantees that a ball has a circular appearance. If three edge points can be found, our approach fits a circle through them and, if that circle contains the expected proportions of black and white pixels, the system concludes it is a ball.
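The circle fit from three edge points is elementary plane geometry: the centre is the intersection of the perpendicular bisectors of two chords. The following SWI-Prolog sketch is our own illustration of this step (the predicate name and the subsequent black/white check are ours, not the system's actual code).

```prolog
% Fit a circle through three (non-collinear) edge points.
% circle_through_3(+P1, +P2, +P3, -circle([Cx,Cy], R))
circle_through_3([X1,Y1], [X2,Y2], [X3,Y3], circle([Cx,Cy], R)) :-
    A is X2 - X1,  B is Y2 - Y1,
    C is X3 - X1,  D is Y3 - Y1,
    G is 2 * (A*D - B*C),
    abs(G) > 1.0e-9,                        % fails if the points are collinear
    E is A*(X1+X2) + B*(Y1+Y2),             % perpendicular bisector of P1P2
    F is C*(X1+X3) + D*(Y1+Y3),             % perpendicular bisector of P1P3
    Cx is (D*E - B*F) / G,
    Cy is (A*F - C*E) / G,
    R  is sqrt((X1-Cx)^2 + (Y1-Cy)^2).
```

A candidate circle produced in this way would then be accepted as a ball only if the pixels inside it show the expected proportions of black and white.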

Fig. 2 Robot’s view of: a another robot and ball clearly separated, b the ball partially occluded by a robot, c the ball within the bounds of a robot

The paper is organised as follows. Section 2 describes related work. The theoretical framework for LV is provided in Sect. 3. Section 4 describes the implementation of LV, including the recursive background knowledge for describing radiation and reflection of light. In Sect. 5 we describe experiments on (1) learning abstract definitions of polygons from artificial images, (2) predicting light source direction and identifying ambiguities in images of the moon and in microscopic images of illuminated micro-organisms, and (3) identifying the ball in the RoboCup domain. Finally, we conclude and discuss further work in Sect. 6.

2 Related work

Statistical machine learning based on low-level feature extraction has been increasingly successful in image classification (Rautaray and Agrawal 2015). However, high-level vision, involving interpretation of objects and their relations in the external world, is still relatively poorly understood (Cox 2014). Since the 1990s perception-by-induction (Gregory 1998) has been the dominant model within computer vision, in which human perception is viewed as inductive inference of hypotheses from sensory data. The idea originated in the work of the nineteenth-century physiologist Hermann von Helmholtz (von Helmholtz 1962). The approach described in this paper is in line with perception-by-induction in using ILP to generate high-level perceptual hypotheses by combining sensory data with a strong bias in the form of explicitly encoded background knowledge. Whilst Gregory (1974) was one of the earliest to demonstrate the power of Helmholtz’s perception model for explaining human visual illusions, recent experiments (Heath and Ventura 2016) show that Deep Neural Networks fail to reproduce human-like perception of illusions. This contrasts with the results in Sect. 5.2, in which LV achieves outcomes analogous to human vision.

Early work in Computer Vision investigated the interaction between visual analysis, linguistic descriptions and geometric models (Waltz 1980; Huffman 1971). In some such approaches visual illusions were identified by testing logical models of images for contradictions (Barrow and Tenenbaum 1981). However, these techniques were based on preformulated models and did not use machine learning augmented by background knowledge in the fashion described in this paper. Preformulated models are also used in more recent work to capture, for instance, the movement of a human being walking (Hogg 1983) or a hyperbolic curve involved in analysing images from penetrating radar (Olhoeft 2000). However, these techniques lack the flexibility of our Logical Vision approach, which combines a set of primitive models in a modular fashion to form composite, structured and re-usable models from an image.

Shape-from-shading (Horn 1989; Zhang et al. 1999) is a key computer vision technology for estimating low-level surface orientation in images. Unlike our approach for identifying concavities and convexities, shape-from-shading generally requires observation of the same object under multiple lighting conditions. By using background knowledge as a bias we reduce the number of images required for accurate perception of high-level shape properties, such as the identification of convex and concave image areas.

ILP has previously been used for learning concepts from images. For instance, in Cohn et al. (2006) object recognition is carried out using existing low-level computer vision approaches, with ILP being used to learn general relational concepts from this already symbolised starting point. Farid and Sammut (2014a, b) adopted a similar approach, extracting planar surfaces from 3D images of objects encountered by urban search and rescue robots, as well as of household objects, and then using ILP to learn relational descriptions of those objects. By contrast, LV (Dai et al. 2015) uses ILP to provide a bridge from very low-level features, such as high-contrast points, to high-level interpretation of objects. The present paper extends the earlier work on LV by implementing a noise-proofing technique applicable to real images and by extending the use of generic background knowledge to allow the identification of objects, such as light sources, not directly identifiable within the image itself.

Various statistics-based techniques making use of high-level vision have been proposed for one- or even zero-shot learning (Palatucci et al. 2009; Vinyals et al. 2016). They usually start from an existing model pre-trained on a large corpus of instances and then adapt the model to data with unseen concepts. These approaches fall into two categories. The first exploits a mapping from images to a set of semantic attributes, with high-level models then learned on top of these attributes (Lampert et al. 2014; Mensink et al. 2011; Palatucci et al. 2009). The second uses statistics-based methods, pre-trained on a large corpus, to find localised attributes belonging to objects rather than the entire image, and then exploits the semantic or spatial relationships between the attributes for scene understanding (Hu et al. 2016; Li et al. 2014; Duan et al. 2012). Unlike these approaches, we focus on one-shot learning from scratch, i.e. high-level vision based only on very low-level features such as high-contrast points.

Machine learning is used extensively in robotics, mainly to learn perceptual and motor skills. Current approaches for learning perceptual tasks include Deep Learning and Convolutional Neural Networks (Krizhevsky et al. 2012; Redmon et al. 2016). The different approaches to vision in RoboCup can be seen in the SPQR team’s use of convolutional neural networks (Suriani et al. 2016) and the ad hoc but effective method used by the 2016 SPL champions, B-Human (Rofer et al. 2016). The latter approach clearly depends on domain knowledge acquired by its human designers. However, the approach described in this paper promises the possibility that similar knowledge could be acquired through machine learning.

3 Framework

The framework for LV is a special case of MIL.

Meta-Interpretive Learning Given background knowledge B and examples E, the aim of a MIL system is to learn a hypothesis H such that \(B, H \models E\), where \(B = B_{p} \cup M\), \(B_{p}\) is a set of Prolog definitions and M is a set of metarules (see Table 1). MIL (Muggleton et al. 2014b, a, 2015; Cropper and Muggleton 2015, 2016) is a form of ILP based on an adapted Prolog meta-interpreter. A standard Prolog meta-interpreter proves goals by repeatedly fetching first-order clauses whose heads unify with the goals. By contrast, a MIL learner proves the set of all examples by fetching higher-order metarules (Table 1) whose heads unify with the goals. The resulting meta-substitutions are saved, allowing them to be used to generate a hypothesised program which proves all the examples by substituting the meta-substitutions into the corresponding metarules. The use of metarules and background knowledge helps minimise the number of clauses n of the minimal consistent hypothesis H and consequently the number of examples m required to achieve error below a given bound \(\epsilon \); Cropper and Muggleton (2016) show that n dominates the upper bound for m.
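The fetch-and-save behaviour of a MIL learner can be illustrated with a small sketch. The following is our own minimal SWI-Prolog illustration in the style of Metagol, not the actual Metagol source; the chain metarule, the prim/1 declarations and the toy grandparent example are illustrative assumptions.

```prolog
% Metarule: chain, i.e. P(X,Y) :- Q(X,Z), R(Z,Y).
metarule(chain, [P,Q,R], [P,X,Y], [[Q,X,Z], [R,Z,Y]]).

% prove(+Atoms, +Prog0, -Prog): prove all example atoms, accumulating the
% meta-substitutions that make up the hypothesised program.
prove([], Prog, Prog).
prove([Atom|Atoms], Prog0, Prog) :-
    prove_aux(Atom, Prog0, Prog1),
    prove(Atoms, Prog1, Prog).

prove_aux([P|Args], Prog, Prog) :-            % case 1: call background knowledge
    prim(P),
    Goal =.. [P|Args],
    call(Goal).
prove_aux(Atom, Prog0, Prog) :-               % case 2: fetch a metarule whose head
    metarule(Name, Subs, Atom, Body),         % unifies with the goal, save its
    save_sub(sub(Name, Subs), Prog0, Prog1),  % meta-substitution, prove the body
    prove(Body, Prog1, Prog).

save_sub(Sub, Prog, Prog)       :- member(Sub, Prog).
save_sub(Sub, Prog, [Sub|Prog]) :- \+ member(Sub, Prog).

% Toy background knowledge (hypothetical):
prim(mother).  prim(father).
mother(ann, amy).  father(amy, bob).
% ?- prove([[grandparent, ann, bob]], [], Prog).
% Prog = [sub(chain, [grandparent, mother, father])]
% i.e. the clause grandparent(X,Y) :- mother(X,Z), father(Z,Y).
% (The real Metagol additionally bounds the number of clauses, which both
%  guarantees termination and favours minimal hypotheses.)
```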

Table 1 Metarules used in this paper. Uppercase letters P, Q, R and S denote existentially quantified variables; lowercase letters u, x, y and z are universally quantified

Logical Vision In LV (Dai et al. 2015), the background knowledge B, in addition to Prolog definitions, contains a set of one or more named images I. The examples describe properties associated with I.

4 Implementation

4.1 Noise tolerant Meta-Interpretive Learning

The MIL framework described in the previous section has been implemented in a system called Metagol (Muggleton et al. 2014a, b, 2015; Cropper and Muggleton 2015, 2016). In this section we describe a noise-tolerant version of Metagol called \(Metagol_{NT}\). The standard Metagol implementation uses a modified Prolog meta-interpreter to backtrack through the space of hypotheses which prove all training examples. This strategy is consistent with an assumption of noise-free examples. Because of backtracking, standard methods for handling noise, such as accepting a user-defined maximum number of negative examples (used in the ILP systems Progol and Aleph), are inefficient for Metagol. For this reason, a more efficient noise-handling method is required.

The noise-tolerant version of Metagol (i.e. \(Metagol_{NT}\)) used in this paper finds hypotheses consistent with randomly selected subsets of the training examples and evaluates each resulting hypothesis on the remaining training set, returning the hypothesis with the highest score. The size of the training samples and the number of iterations (i.e. the number of random samples) are user-defined parameters. As shown in Algorithm 1, \(Metagol_{NT}\) is implemented as a wrapper around Metagol and returns the highest-scoring hypothesis \(H_{max}\) learned from randomly sampled examples from E after n iterations. The sample size is controlled by \(\nu =(k^+, k^-)\), where \(k^+\) and \(k^-\) are the numbers of sampled positive and negative examples respectively, reflecting the noise level in the dataset.

Algorithm 1 \(Metagol_{NT}(B, E, \nu , n)\): noise-tolerant wrapper around Metagol
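As a concrete illustration of Algorithm 1, the following SWI-Prolog sketch gives our reading of the wrapper; the learner and coverage-test hooks (Learner, Covers) and all predicate names are our own illustrative assumptions, not Metagol's actual API.

```prolog
:- use_module(library(random)).   % random_permutation/2
:- use_module(library(lists)).    % subtract/3, max_member/2
:- use_module(library(apply)).    % include/3, exclude/3

% metagol_nt(+Learner, +Covers, +Pos, +Neg, +KP-KN, +N, -Hmax)
%   Learner : called as call(Learner, SamplePos, SampleNeg, H); stands in for a
%             call to plain Metagol with background knowledge B (assumed hook)
%   Covers  : called as call(Covers, H, Example)                (assumed hook)
%   KP-KN   : the sample-size parameter nu = (k+, k-)
%   N       : number of iterations (random samples)
metagol_nt(Learner, Covers, Pos, Neg, KP-KN, N, Hmax) :-
    findall(Score-H,
            ( between(1, N, _),
              random_subset(KP, Pos, SPos),
              random_subset(KN, Neg, SNeg),
              call(Learner, SPos, SNeg, H),       % hypothesis consistent with the sample
              subtract(Pos, SPos, RPos),          % evaluate on the remaining examples
              subtract(Neg, SNeg, RNeg),
              score(Covers, H, RPos, RNeg, Score) ),
            Scored),
    max_member(_-Hmax, Scored).                   % return the highest-scoring hypothesis

random_subset(K, List, Subset) :-
    random_permutation(List, Perm),
    length(Subset, K),
    append(Subset, _, Perm).

score(Covers, H, Pos, Neg, Score) :-
    include(call(Covers, H), Pos, TP),            % covered positives
    exclude(call(Covers, H), Neg, TN),            % rejected negatives
    length(TP, NTP),  length(TN, NTN),
    Score is NTP + NTN.
```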

4.2 Logical Vision

Our implementation of Logical Vision, called LogVis, is shown in Algorithm 2. The input consists of a set of images I, background knowledge B including both Prolog primitives \(B_p\) and metarules M, a set of training examples E of the target concept, and \(Metagol_{NT}\)’s parameters \(\nu \) and n.

The procedure of LogVis is divided into two stages. The first stage extracts symbolic background knowledge from images, which is done by the visualAbduce function. By including abductive theories in \(B_p\), visualAbduce can abduce points, lines, ellipses and even complex mid-level visual representations such as super-pixels (see Sect. 5.3). In our implementation, visualAbduce can take logic rules, statistical models and functions from a computer vision toolbox as background knowledge, which provide visual primitives. This makes LogVis flexible in learning many kinds of concepts. More details about visual abduction are given in Sect. 5.2.

The second stage of LogVis simply calls the noise-tolerant MIL system \(Metagol_{NT}\) to induce a hypothesis for the target concept, as both abduced visual primitives \(B_v\) and training examples E from an image dataset can be noisy.

Algorithm 2 LogVis: the two-stage Logical Vision procedure

Visual abduction The target of visual abduction is to obtain a symbolic interpretation of images for further learning. The abduced logical facts are groundings of primitives defined in the background knowledge \(B_p\). For example, in order to learn the concept of a polygon one needs at least to extract points and edges from an image. When the data is noise-free, this can be done by sampling high-contrast pixels from the image, as in the background knowledge about edge_point applied in Dai et al. (2015).

However, for real images that contain a degree of noise, we can include a statistical model in visualAbduce and use it to implement a noise-robust version of edge_point. For example, in the Protist and Moon experiments of Sect. 5, the edge_point/1 predicate calls a pre-trained statistical image background model which can categorise pixels into foreground or background points using Gaussian models or image segmentation.
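As an illustration only (the helper predicates grey_histogram/3 and classify_fg_bg/2 stand in for the pre-trained statistical model and are not the system's actual interface), a noise-robust edge_point/1 along these lines might look as follows.

```prolog
% A pixel is treated as an edge point if the statistical background model
% classifies the grey-level histogram of its local patch as foreground.
edge_point([X, Y]) :-
    grey_histogram([X, Y], 10, Hist),      % histogram of the 10x10 patch around (X,Y)
    classify_fg_bg(Hist, foreground).      % pre-trained Gaussian/segmentation model
```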

Fig. 3 Object detection: a sampled lines with edge points; b fitting of an initial ellipse centred at O, with the hypothesis tested using new edge points halfway between existing adjacent points; c the revised hypothesis is tested until it passes the test

Furthermore, we can use an abductive theory about shapes to abduce objects. For example, in real images many objects of interest are composed of curves and can be approximated by ellipses or circles. We can therefore include background knowledge about them in visualAbduce to perform ellipse and circle abduction, as shown in Fig. 3. The abduced objects take the form elps(Centre, Parameter) or circle(Centre, Radius), where \(Centre=[X,Y]\) is the shape’s centre, \(Parameter=[A,B,Tilt]\) gives the axis lengths and tilt angle, and Radius is the circle radius. The computational complexity of the abduction procedure is O(rkn), where n is the number of edge_points, k is the number of iterations of the ellipse-fitting algorithm, and r is the number of resamplings performed when the fitted object is not accurate enough, a constant reflecting the noise level of the input image.
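A sketch of the hypothesise-and-test loop of Fig. 3 for the circle case, in our own words: fit_circle/2, midpoints_on/3 and edge_point/1 are assumed helpers (fit_circle/2 could be the three-point fit sketched earlier), and a real implementation bounds the number of refits, which corresponds to the factor r in the complexity above.

```prolog
% abduce_circle(+SupportPoints, +MaxRefits, -Circle)
abduce_circle(_, 0, _) :- !, fail.              % give up after MaxRefits attempts
abduce_circle(Support, MaxRefits, Circle) :-
    fit_circle(Support, Hyp),                   % current hypothesis
    midpoints_on(Hyp, Support, TestPts),        % probe points halfway between neighbours
    include(edge_point, TestPts, Confirmed),
    (   Confirmed == TestPts                    % every probe lies on an edge:
    ->  Circle = Hyp                            %   accept the hypothesis
    ;   append(Support, Confirmed, Support1),   % otherwise keep confirmed points,
        MaxRefits1 is MaxRefits - 1,            %   refit and test again
        abduce_circle(Support1, MaxRefits1, Circle)
    ).
```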

In LogVis, background knowledge about visual primitives is implemented as logical predicates in a library, including basic geometrical concepts and extractors for low-level computer vision features such as the colour histogram and super-pixels. Users can implement their own background knowledge for visual abduction based on these primitives to address different kinds of problems flexibly.

5 Experiments

5.1 Experiment 1

In the first experiment [reported in detail in Dai et al. (2015)] we compared a noise-free variant of the LogVis algorithm (referred to as \({LV_{Poly}}\)) with statistics-based approaches on the task of learning simple geometrical concepts (see example images in Fig. 4).

Materials and methods We used Inkscape to randomly generate three labelled image datasets, one for each of three polygon shape learning tasks. Each training set contains 40 examples. For simplicity the images are binary-coloured and each image contains one polygon. The target concepts are: (1) triangle/1, quadrangle/1, pentagon/1 and hexagon/1; (2) regular_poly/1 (regular polygon); (3) right_tri/1 (right triangle). Each dataset was partitioned into five folds, four used for training and the remaining one for testing, so each experiment was conducted 5 times.

Fig. 4 Examples of the concept regular polygon used in Experiment 1. The first two rows are positive examples and the last two rows negative examples

Results and discussion Table 2 compares the predictive accuracy of an implementation of \({LV_{Poly}}\) versus several statistics-based computer vision algorithms. We used a popular statistics-based computer vision toolbox, VLFeat (Vedaldi and Fulkerson 2008), to implement the statistical learning algorithms. The experiments were carried out with different kinds of features. Because the datasets are small, we used a support vector machine [libSVM (Chang and Lin 2011)] as the classifier, with parameters selected by fivefold cross-validation. The features used in the experiments are: Histogram of Oriented Gradients (HOG) (Dalal and Triggs 2005), Dense-SIFT (Scale Invariant Feature Transform) (Lowe 2004), Local Binary Pattern (LBP) (Ojala et al. 2002) and a Convolutional Neural Network (CNN) (Simonyan and Zisserman 2015). We also compare against a combination of the above feature sets (i.e. C+d+L). According to Table 2, given 40 training examples the prediction accuracies of \(LV_{Poly}\) are significantly better than those of the other approaches.

Table 2 Predictive accuracy of learning simple geometrical shapes from single-object training sets of size 40

5.2 Experiment 2

This subsection describes experiments comparing one-shot LV with multi-shot statistics-based learning. In these experiments we investigate the following null hypothesis:

Null hypothesis One-shot LV cannot learn models with accuracy comparable to thirty-shot statistics-based learning.

Materials We collected two real image datasets for the experiments: (1) Protists, drawn from a microscope video of a protist micro-organism, and (2) Moons, a collection of images of the moon drawn from Google images. The instances in Protists are coloured images, while the images in Moons come from various sources and some of them are grey-scale. For the purpose of classification, we generated the two datasets by rotating the images through 12 clock angles. Each dataset consists of 30 images for each angle, giving a total of 360 images. Each image carries one of four labels: \(North=\{11, 12, 1\}\) o’clock, \(East=\{2, 3, 4\}\) o’clock, \(South=\{5, 6, 7\}\) o’clock, and \(West=\{8, 9, 10\}\) o’clock. Examples of the data and the labelling are shown in Fig. 5. As can be seen from the figure, there is high variance in the image sizes and colours.
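For illustration, the labelling rule can be written as a small Prolog mapping from clock angle to class (our own helper, not part of the system).

```prolog
% angle_class(+ClockAngle, -Class): the four classes used in the experiments.
angle_class(A, north) :- member(A, [11, 12, 1]).
angle_class(A, east)  :- member(A, [2, 3, 4]).
angle_class(A, south) :- member(A, [5, 6, 7]).
angle_class(A, west)  :- member(A, [8, 9, 10]).
```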

Fig. 5 Illustrations of the Moons and Protists data: a examples from the datasets, b the four classes for the twelve light source positions

Methods The aim is to learn a model that predicts the correct category of light source angle from real images. For each dataset, we randomly divided the 360 images into training and test sets with 128 and 232 examples respectively. To evaluate performance, models were trained on random samples of 1, 2, 4, 8, 16, 32, 64 and 128 images from the training set. The sequences of training and test instances are shared by all compared methods. The random partitioning of the data and the learning were repeated 5 times.

Logical Vision In the experiments we used the grey intensity of both image datasets for LV. The hyper-parameter T in Algorithm 2 is set to 11 by validating one-shot learned models on the rest of the training data. To handle image noise, we use a background model as the statistics-based estimator for the predicate edge_point/1. When edge_point([X,Y]) is called, a colour-distribution vector (represented as a histogram of grey-scale values) of the \(10\times 10\) region centred at (X, Y) is computed, and the background model is then applied to determine whether this vector represents an edge point. The neighbourhood region size of 10 was chosen as a compromise between accuracy and efficiency after testing values ranging from 5 to 20. The background model is trained from five randomly sampled images in the training set by providing the bounding boxes of the objects.

Statistics-based Classification The experiments with statistics-based classification were conducted in different colour spaces combined with various features. First, we performed feature extraction to transform images into fixed-length vectors. Next, SVMs [libSVM (Chang and Lin 2011)] with an RBF kernel were applied to learn a multiclass classifier, with the SVM parameters chosen by cross-validation on the training set. Like LV, we used grey intensity from both image datasets for the experiments. For the coloured Protists dataset, we transformed the images to HSV and Lab colour spaces to improve performance. Since the image sizes in the dataset are irregular, we used background models and computer graphics techniques (e.g. curve fitting), as in the object detection stage of LV, to extract the main objects and unify them into same-sized patches for feature extraction. The sizes of the object patches were \(80\times 80\) and \(401\times 401\) in Protists and Moons respectively. For feature extraction we avoided descriptors which are insensitive to scale and rotation; instead we selected the luminance-sensitive features HOG and LBP. The Histogram of Oriented Gradients (HOG) (Dalal and Triggs 2005) is known for its ability to describe the local gradient orientation in an image, and is widely used in computer vision and image processing for object detection. The Local Binary Pattern (LBP) (Ojala et al. 2002) is a powerful feature for texture classification which converts the local texture of an image into a binary number.

In the Moons task, LV and the compared statistics-based approach both used geometrical background knowledge for fitting circles (though in different forms) during object extraction. However, in the Protists task, the noise in the images consistently caused poor performance in automatic object extraction for the statistics-based method. We therefore provided additional supervision to the statistics-based method, consisting of bounding boxes labelling the position of the main objects in both training and test images during feature extraction. By comparison, LV discovers the objects from the raw images without any label information.

Fig. 6 Classification accuracy on the two datasets: a Moons, b Protists

Results Figure 6a shows the results for Moons. Note that the performance of the statistics-based approach only surpasses one-shot LV after 100 training examples. In this task, the circle-fitting background knowledge exploited by LV and by the statistics-based approaches is similar, though the low-level features used by the statistics-based approach are first-order information (grey-scale gradients), which is stronger than the zeroth-order information (grey-scale values) used by LV. Results on Protists are shown in Fig. 6b. After 30+ training examples only one statistics-based approach outperforms one-shot LV. Since the statistics-based approaches have additional supervision (the bounding box of the main object) in the experiments, improved performance is unsurprising. The results of LV in Fig. 6a, b are represented by horizontal lines. When the number of training examples exceeds one, LV performs multiple one-shot learning and selects the most frequent output (see Algorithm 2), which we found is always in the same equivalence class in LV’s hypothesis space. This suggests that LV learns the optimal model in its hypothesis space from a single example. The learned program is shown in Fig. 7.

Fig. 7 Abductive program learned by LV: clock_angle/3 denotes the clock angle from B (highlight) to A (object); high_light/2 is a built-in predicate meaning that B is the highlight part of A; light_source_angle/3 is an abducible predicate and the learning target. Given background knowledge about lighting, and comparing this program with Fig. 8, we can interpret the invented predicate clock_angle2 as convex and clock_angle3 as light_source_name

The results in Fig. 6 demonstrate that Logical Vision can learn an accurate model from a single training example. By comparison, the statistics-based approaches require 40 or even 100 training examples to reach similar accuracy, which refutes the null hypothesis. However, the performance of LV relies heavily on the accuracy of the statistical estimator behind edge_point/1, because mistakes in edge-point detection harm the shape fitting and consequently the accuracy of main object extraction. Unless a better edge_point/1 classifier is trained, the best performance of LV is limited, as Fig. 6 shows.

LV is implemented in SWI-Prolog (Wielemaker et al. 2012) with multi-threaded processing. Experiments were executed on a laptop with an Intel i5-3210M CPU (2.50 GHz); object discovery takes 9.5 and 6.4 s per image on the Protists and Moons datasets respectively, and the average running time of the Metagol procedure is 0.001 s on both datasets.

Protists and Moons contain only convex objects. If instead we provide images with concave objects (such as Fig. 9), LV learns a program such as that in Fig. 8. Here the invented predicate clock_angle2/1 can be interpreted as concave because it is related to the appearance of opposite_angle/2.

Fig. 8 Program learned by LV when concave objects are given as training examples

Discussion: Learning ambiguity

Fig. 9 An image of a crater on Mars and the \(180^\circ \) rotated version: a crater, b flipped crater (Credit: NASA/JPL/University of Arizona)

Figure 9 shows two images of a crater on Mars, where Fig. 9b is a \(180^\circ \) rotated version of Fig. 9a. Human perception often confuses the convexity of the crater in such images. This phenomenon, called the crater/mountain illusion, occurs because human vision usually interprets pictures under the default assumption that light comes from the top of the image.

LV can use MIL to perform abductive learning. We show below that incorporation of generic recursive background knowledge concerning light enables LV to generate multiple mutually inconsistent perceptual hypotheses from real images. To the authors’ knowledge, such ambiguous prediction has not been demonstrated previously with machine learning.

Recall the programs learned in Figs. 7 and 8 in the previous experiments. If we rename the invented predicates, we obtain the general theory about lighting and convexity shown in Fig. 10.

Fig. 10 Interpreted BK learned by LV

Now we can use this program as part of the interpreted background knowledge for LV to perform abductive learning; the abducible predicates and the rest of the background knowledge are shown in Fig. 11.

Fig. 11 Background knowledge for learning ambiguity from images

If we input Fig. 9a to LV, it outputs four different abductive hypotheses for the image, as shown in Fig. 12. From the first two results we see that, by considering different possibilities for the light source direction, LV can predict that the main object (the crater) is either convex or concave, which shows the power of learning ambiguity. The last two results are even more interesting: they suggest that obj2 (the highlighted part of the crater) might itself be the light source, which is indeed possible, though it seems unlikely.

Fig. 12 Depiction of abduced hypotheses from Fig. 9a

5.3 Experiment 3

In this subsection we describe experiments conducted on real images from RoboCup soccer, where the task is to locate the football. We address this task in two stages: first we approximately locate the football in the image, and then we use the model-driven technique of Logical Vision to abduce its location and shape. By doing this, one can estimate the size of the football, recognise occluded footballs and deduce depth information from the images.

Dataset and task The dataset contains 377 colour images sampled from a video of the robot’s camera view of the football field. As Fig. 13 shows, the scene in this dataset contains the green field, a robot and a football. The original size of the images is \(480\times 720\); in this experiment they have been scaled to \(240\times 360\) to reduce the computational cost.

Fig. 13 Examples of football images: a the football is clearly separated from other objects, b part of the football lies outside the image, c the football is occluded by the robot

This task is more difficult than those in the previous experiments. The objects in the images are more complex and contain more noise, so it is difficult to learn a hypothesis using simple primitives such as "edge_point". For example, the robot and football contain many edges, so the original line-sampling-based abduction used by Logical Vision becomes a large-scale combinatorial optimisation problem. Moreover, in 41 of the images the football is either occluded by or connected to other objects, and in 40 images there is no football at all.

To address these challenges, we consider a two-stage learning procedure. The first sub-task is to quickly find candidate locations for the football, which reduces the search space for fine-grained football discovery. The second sub-task is to use Logical Vision to abduce the location and shape of the football from the candidate positions.

Fig. 14 Super-pixel segmented versions of the images in Fig. 13, where the blue boxes are the original bounding boxes and the super-pixels filled with red are the positive super-pixels according to the bounding boxes. Note that in (b) and (c), although the footballs have been split into multiple super-pixels, these are all labelled as positive examples (Color figure online)

For the first sub-task, we use a super-pixel algorithm (Achanta et al. 2012) to segment the images into small regions, which serve as primitives for estimating the location of the football. Super-pixel algorithms group pixels into atomic regions that capture image redundancy, greatly reducing the complexity of subsequent image processing tasks. The super-pixel implementation we used is from OpenCV_contrib (Bradski 2000). The tuned parameter is the size of each super-pixel, which ranges from 10 to 30 with step size 5. During data transformation, we use the football bounding boxes shipped with the original images to label the super-pixels: those which have at least 95% of their area inside a football bounding box (the label information in the original data) are labelled as positive examples of the predicate "ball_sp"; the rest are labelled as negatives. Examples from the dataset are shown in Fig. 14. The second sub-task, model-driven football abduction, directly takes "ball_sp" and an abductive theory as input and outputs the circle parameters (centre and radius), where "ball_sp" is the result produced by the classification model learned in the first stage.
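The 95% rule can be summarised by a small sketch (our illustration; overlap_fraction/3, giving the fraction of a super-pixel's area that falls inside a bounding box, is an assumed helper).

```prolog
% label_superpixel(+SP, +BallBoxes, -Label)
label_superpixel(SP, Boxes, ball_sp) :-
    member(Box, Boxes),
    overlap_fraction(SP, Box, F),
    F >= 0.95, !.                 % at least 95% of the super-pixel lies in a ball box
label_superpixel(_, _, negative).
```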

Experiment: Football super-pixel classification This experiment relates to the first sub-task described above, i.e. locating the football in super-pixel segmented images. In this experiment we compare the performance of \(Metagol_{NT}\) against a statistical learner [the CART algorithm (Breiman et al. 1984)] and investigate the same null hypothesis used in Sect. 5.2.

Materials and methods In this experiment we use the super-pixel dataset described above. Each super-pixel is regarded as a symbolic object in the background knowledge. We extract basic properties, such as size, location and colour distribution, as features. The colour distribution is represented by the proportions of white, grey, black and green pixels inside a super-pixel, identified by the Lab values of the pixels. Moreover, we exploit the neighbourhood relationship between super-pixels, represented by the "next_to/2" predicate.

In this experiment we randomly sample 128 images for training and use the remaining 249 images for testing. As in the Protists and Moons experiments in Sect. 5.2, we randomly sample 1, 2, 4, 8, 16, 32, 64 and 128 images from the training set for learning the classification model. Random data partitioning is performed 5 times. The positive training examples (for both the statistical learner and the relational learner) are football super-pixels from each set of 1, 2, 4, 8, 16, 32, 64 or 128 images, and the same number of negative examples (i.e. non-football super-pixels) are randomly sampled from the same set of training images. Similarly, for the test data the negative examples are randomly sampled from the non-football super-pixels in the test images. For relational learning (i.e. \(Metagol_{NT}\)), background predicates \(mostly\_white/1\), \(partly\_white/1\), \(mostly\_black/1\), \(partly\_black/1\), etc. were defined based on the colour distribution of super-pixels. For example, the following background definitions describe a super-pixel which is mostly white or partly white:

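The original listing is not reproduced here; the following is a reconstruction sketch of the kind of definitions intended, with the thresholds and the white_proportion/2 helper being our own assumptions based on the feature description above.

```prolog
% A super-pixel is mostly white if a large fraction of its pixels are white,
% and partly white if a smaller but non-negligible fraction are white.
mostly_white(SP) :- white_proportion(SP, P), P >= 0.6.
partly_white(SP) :- white_proportion(SP, P), P >= 0.2, P < 0.6.
```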

The background knowledge for the relational learner also includes the neighbourhood relationship between super-pixels, i.e. “next_to/2” predicates.

In this experiment the following parameters were used for the relational learner, i.e. \(Metagol_{NT}(B,E,\nu ,n)\) in Algorithm 1. In addition to the above-mentioned background knowledge, B includes the Pre2 and Post2 metarules from Table 1.

E is the set of positive and negative training examples described above. The size of the randomly selected training sample \(Tr_i \subset E\) in each iteration i of Algorithm 1 and the number of iterations n can be set according to the expected degree of noise. Given that the expected error rate in the training data is not known for this problem, we chose an extreme case in which \(Tr_i\) contains one randomly selected positive example (and one or two randomly selected negative examples) in different experiments. The number of iterations n was set to the number of positive examples in E.

For the statistics-based learner we use the CART decision tree algorithm (Breiman et al. 1984). The goal is to create a model that predicts the value of a target variable by splitting the feature space. We chose CART as the compared method because we want to ensure the statistical model uses the same features as the relational model. Since the number of features, i.e. the green/white/grey/black pixel proportions, is relatively small, it is natural to choose a decision tree as the statistical learner. The maximum number of splits is automatically selected by fivefold cross-validation on the training data.

A second reason for the choice of decision trees is efficiency of execution. The robots in RoboCup soccer must operate in real time, which means that all vision, localisation, decision making and locomotion tasks must be completed in the time it takes to capture the next camera frame, typically 1/30th of a second. Thus, the classifier in the vision system must be extremely efficient to execute. A decision tree, with only a few comparisons leading to a decision in the leaf node, satisfies these stringent timing requirements.

Results Figure 15 compares the predictive accuracy of the relational learner (\(Metagol_{NT}\)) versus the statistics-based learner (CART). As shown in the figure, \(Metagol_{NT}\) achieves consistently higher accuracy than CART, with the accuracy difference particularly high for small numbers of training examples. An example of the hypotheses found by the relational learner is as follows:

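The learned clauses themselves are not reproduced here; purely for illustration, a hypothesis of the kind described (using the colour predicates, the next_to/2 relation and an invented predicate) might take the following form. This is our own hypothetical example, not the actual output.

```prolog
ball_sp(A)   :- partly_white(A), ball_sp_1(A).
ball_sp_1(A) :- next_to(A, B), partly_black(B).
```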
Fig. 15 Accuracy of \(Metagol_{NT}\) vs CART in the task of football super-pixel classification

Model-driven football abduction After narrowing down the candidate location of the football, Logical Vision can exploit geometrical background knowledge to perform model-driven abduction of the football’s exact shape and position (i.e. its centre and radius as a circle). This is important in robotic football games since the robot can use this information to infer the distance between itself and the football. More importantly, by modelling the football as a circle the robot can detect occlusion of the football by other robots and choose appropriate actions accordingly. We apply Logical Vision with an abductive theory for this task, whose abducible is "football/3". To sample edge points, Logical Vision draws random straight lines inside a super-pixel and its neighbourhood and returns the points associated with a colour transition. Examples of football abduction are shown in Fig. 16.

Fig. 16 Ball abduction results for the images in Fig. 13. The blue points are the "edge_points" sampled by Logical Vision and the red curves are the abduced circles (Color figure online)

6 Conclusions and further work

Human beings often learn visual concepts from single image presentations (so-called one-shot learning) (Lake et al. 2011). This phenomenon is hard to explain from a standard Machine Learning perspective, given that it is unclear how to estimate any statistical parameter from a single randomly selected instance drawn from an unknown distribution. In this paper we show that learnable generic logical background knowledge can be used to generate high-accuracy logical hypotheses from single examples. This compares with similar demonstrations concerning one-shot MIL on string transformations (Lin et al. 2014) as well as previous concept learning from artificial images (Dai et al. 2015). The experiments in Sect. 5 show that the LV system can accurately identify the position of a light source from a single real image, in a way analogous to scientists such as Galileo observing the moon for the first time through a telescope, or Hooke observing micro-organisms for the first time through a microscope. In Sect. 5.2 we show that logical theories learned by LV from labelled images can also be used to predict concavity and convexity, predicated on the assumed position of a light source. Section 5.3 shows how LV can be used effectively in real-time robot vision. Ball recognition in robot soccer is challenging because the ball is frequently occluded by other robots and the similarity in colours of the ball, robots and field lines makes the ball difficult to distinguish.

We have studied LV’s failure cases carefully. The main cause of misclassification is noise in the images. This noise can cause misclassification by edge_point/1, since it is implemented with statistical models, and mistakes in edge_point detection in turn affect edge detection and shape fitting. As a result, the accuracy of main object extraction is limited by both the noise level of the input images and the power of the statistical model behind edge_point/1, and LV fails when the wrongly extracted objects become its inputs. However, training stronger models for detecting edge_points will not by itself increase the accuracy of LV, since that accuracy remains limited by the noise in the input images.

In further work we aim to investigate broader sets of visual phenomena which can naturally be treated using background knowledge. For instance, the effects of object obscuration; the interpretation of shadows in an image to infer the existence of out-of-frame objects; the existence of unseen objects reflected in a mirror found within the image. All these phenomena could possibly be considered in a general way from the point of view of a logical theory describing reflection and absorption of light, where each image pixel is used as evidence of photons arriving at the image plane. In this further work we aim to compare our approach once more against a wider variety of competing methods.

The authors believe that LV has long-term potential as an AI technology for unifying the disparate areas of logic-based learning and visual perception.