In many respects, machine learning’s current concerns are reminiscent of those which heralded the rejection of GOFAI. Since it was evidently not possible to construct a suitably malleable model of the world a priori in terms of rigid logical facts, the solution was surely to induce the required representations, ideally from raw data? Given the challenges previously discussed, the requirement to create robust models is just as pressing today as it was in the early 1990s, when GOFAI was nominally supplanted by the ‘Physical Grounding Hypothesis’ [35]. In that sense, AI still needs learning algorithms that can do more than ‘skim off the surface’ of the world they attempt to represent. By this, we mean that knowledge representation should enjoy both robustness and malleability. By ‘robustness’ we mean Gestalt compositional interpretation in the presence of noise, so that turtles are not considered to be rifles [9] even if their images have some similarity at a local scale. By ‘malleability’ we mean the ability to envision a range of alternative hypotheses which are compatible with some context. In order to achieve this, we believe that machine learning needs to undergo the same fundamental shift that took place in the philosophy of science in the mid 20th century.

1 The Problem of Machine Induction

The discussion of Chaps. 4 and 5 illustrates that a major concern for DL and RL is the ability to obtain robust generalizations with high sample efficiency. However, the ubiquity of domains with long-tailed distributions is antithetical to the very notion that learning can be predominantly driven via sampling. For example, a widely-respected pioneer of autonomous vehicles [339] has stated:

To build a self-driving car that solves 99% of the problems it encounters in everyday driving might take a month. Then there’s 1% left. So 1% would still mean that you have a fatal accident every week. One hundredth of one percent would still be completely unacceptable in terms of safety.

Human drivers, though naturally fallible, are considerably more robust to the combinatorially vast range of situations they might encounter. What change in perspective might be required in order to imbue a system with analogous capabilities?

The essential practice of supervised learning is to tabulate samples of input/output observations and fit a regression model based on a numerical loss function. Likewise, the RL framework optimizes an objective function sampled iteratively from the environment, and deep RL uses deep learning tools for function approximation. As previously observed [344, 212], in terms of the philosophy of science, this is very much in the empiricist tradition, in which observations are the primary entities. In DL, treating observations as primary has led to the notion of model induction as a curve-fitting process, independent of the domain of discourse. However, the incorporation of sufficient analytic information can obviate the need for sampling in both cases. As observed by Medawar, ‘theories destroy facts’ [221]: in order to predict e.g. future planetary motion, we do not need to tabulate the position of all the molecules that compose a celestial body, but rather apply Newton’s Laws to a macroscale approximation of its center of gravity [326]. Similarly, under certain conditions a description of the behavior of the simple pendulum can be obtained in closed form [160], demoting empirical sampling of orbits to the role of fitting a distribution to any remaining noise. Indeed, science itself can arguably be characterized as the progressive transformation of ‘noise’ into ‘signal’: replacing, insofar as human comprehension permits, uncertainty and nondeterminism with coherent (i.e. relationally-consistent) structure. The resulting structures yield a much stronger notion of ‘compression’ than expressed by the corresponding use of the term in RL.
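
To make the pendulum example concrete, the following minimal sketch (illustrative code, not drawn from [160]; the parameter values and variable names are assumptions) contrasts the closed-form small-angle solution with empirical sampling: once the analytic hypothesis is in place, observations serve only to characterize the residual noise.

```python
import numpy as np

# Closed-form (small-angle) hypothesis: theta(t) = theta0 * cos(sqrt(g/L) * t).
# Given this analytic structure, sampling is only needed to fit the residual noise.

g, L, theta0 = 9.81, 1.0, 0.1            # assumed parameters; theta0 in radians
omega = np.sqrt(g / L)

def predicted_theta(t):
    return theta0 * np.cos(omega * t)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 500)
observed = predicted_theta(t) + rng.normal(scale=0.005, size=t.shape)   # noisy 'measurements'

residual = observed - predicted_theta(t)
print(f"residual noise: mean={residual.mean():.4f}, std={residual.std():.4f}")
```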

Although the empiricist perspective has prevailed since the Renaissance, it was inextricably bound to a deep philosophical problem: the ‘Problem of Induction’. The problem asks what firm basis we have to believe that past inferences will continue to be valid, e.g., that the Sun will continue to rise in the morning. Epistemologically, we cannot hypothesize that past distributions of observations will resemble future distributions, since this begs the question. The problem resisted all solution attempts until Karl Popper provided one in the mid 20th century [268]. Popper’s solution was to show that conclusions could be well-founded without requiring that ‘laws’ or distributions be somehow Platonically propagated through time. Instead, he argued that although our hypotheses may be inspired by observation, they are altogether of a higher-order, being predominantly characterized by their explanatory power. As subsequently further developed by Deutsch [63], the key objects of discourse for science are therefore not observations but the explanatory characteristics of the hypotheses which they motivate. Hence, consistent with the definition of Sect. 2.2, we can consider a hypothesis as an inter-related system of statements intended to account for empirical observations. The tentative, self-correcting nature of the scientific method means that:

  • At any given instant, the collection of statements is not necessarily entirely self-consistent (cf. the longstanding inability to reconcile quantum mechanics and general relativity).

  • Falsifiability via observation is not the primary driver. Although Popper emphasized falsification via observations, a subsequent refinement [186] emphasized that the prevailing hypothesis may not even agree with all observations, provided that it is a rich enough source of alternative hypotheses which potentially could do so. It seems reasonable to consider this to be the spirit underlying the opening Feynman quote.

A famous demonstration that such heuristics are part of common scientific practice is the discrepancy between the predictions of general relativity and the observed rotational velocities of galaxies [246], relativity being a hypothesis which has repeatedly been vindicated in other wide-ranging experiments. Hence, while there is still some global inconsistency between different local models (which will continue to motivate further, hopefully ultimately unifying, hypotheses), this still allows the useful application of local models at the appropriate scale. Over the years, the philosophy of science has conjectured various heuristics for confronting rival hypotheses:

  • Parsimony: this heuristic is exemplified by ‘Occam’s Razor’. However, it must be stressed that this is not merely a domain-independent measure such as that advocated by Algorithmic Information Theory [43], but something that is achieved via reflective interpretation of the hypothesis in order to reconcile causal inconsistencies.

  • ‘Hard to Vary’: This notion was introduced by Deutsch [63]. Hypotheses which are so unconstrained as to permit the generation of many roughly-equivalent alternatives are unlikely to capture the pertinent causal aspects of a situation. Conversely, when a hypothesis which has been preferred to many others generates few or no alternatives, that is an indication that it is a good hypothesis. An initial investigation of the role played by ‘hard to vary’ heuristics in AI is given by Elton [79].

It is also important to note that, by virtue of compositionality, the notion of hypothesis here is stronger than ‘distribution over outcomes’ [64]. For example, suppose that six in a hundred patients who are flu sufferers were to hold a crystal and experience a subsequent improvement. Despite statistical significance, an experimenter would not (in the absence of some other deeply compelling reason) subscribe immediately to the notion that the crystal was the cause, because of the end-to-end consistency of existing explanations about how viral infections and crystals actually operate. Such inferences therefore operate at a different level than purely statistical notions, in which claims of causality must anyway be justified in terms of priors known to the domain-aware researcher when they frame the experiment. Hence the researcher here has two privileges that traditional ML lacks: firstly, prior semantic knowledge about the type of variables (the displacement of the pendulum bob is measured in radians, the color of the pendulum bob is a property of the material with which it is coated, etc.). Secondly, in the case that prior knowledge (or hypothetical interventions such as Pearl’s ‘do operator’ [254]) does not adequately make the case for causality, the researcher has the potential to clarify further via alternative experiments.
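
The distinction can be illustrated with a toy structural causal model (a hedged sketch of our own; the ‘optimism’ confounder and the probabilities are invented purely for illustration): an observational association between crystal-holding and recovery appears, yet the interventional quantities obtained via the do-operator coincide, so no causal claim is licensed.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

def simulate(do_crystal=None):
    # Toy SCM: optimism -> holds_crystal, optimism -> recovers; the crystal has no causal effect.
    optimism = rng.random(N) < 0.5
    if do_crystal is None:
        holds_crystal = rng.random(N) < np.where(optimism, 0.8, 0.2)   # observational regime
    else:
        holds_crystal = np.full(N, do_crystal)                          # do(crystal := value)
    recovers = rng.random(N) < np.where(optimism, 0.10, 0.02)           # crystal plays no role
    return holds_crystal, recovers

# Observational association: P(recovers | crystal) differs across groups...
c, r = simulate()
print("P(recover | crystal)    =", r[c].mean())
print("P(recover | no crystal) =", r[~c].mean())

# ...but the interventional quantities coincide, so the crystal is not a cause.
_, r1 = simulate(do_crystal=True)
_, r0 = simulate(do_crystal=False)
print("P(recover | do(crystal))    =", r1.mean())
print("P(recover | do(no crystal)) =", r0.mean())
```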

Scientific explanations have proved remarkably effective in describing the world [363]; for example, our understanding of force and motion at the ‘human scale’ (i.e., between quantum and relativistic) has remained robust since Newton. Most significantly, such understanding is emphatically not in general a quantitative function of the causal chain (e.g., some loss or objective function), but is instead dependent on the overall consistency of explanation. ‘Consistency’ here means not only consistency with respect to empirical observation, but the ‘internal consistency’ of the entire causal chain described by the hypothesis.

The solution to the ‘Problem of Machine Induction’ should therefore precisely mirror Popper’s solution to the ‘Problem of Induction’, i.e., to reject empiricism in favor of explanatory power and attempt to afford suitably curious machine learners the same privileges in determining causality as are presently enjoyed only by human experimenters. In the remainder of this chapter, we describe ‘Semantically Closed Learning’, a framework proposed to support this.

2 Semantically Closed Learning (SCL)

Just as the logical expressions of GOFAI could be said to be too ‘rigid’ with respect to their ability to model complex environments, so the parameter space of DL architectures is too ‘loose’. Hence, while it is relatively computationally inexpensive to fit a deep learning model to almost any naturally-occurring observations [203], generalization is certainly not assured [374]. It would appear that something intermediate is required, in which there is no requirement for a priori provision of either an arbitrarily complex objective function or an exponentially large collection of rules. To that end, we describe below a set of operations intended to support principled and scalable scientific reasoning. In particular, the ‘scientific’ aspect of reasoning can be characterized by the gradual progression from an extensional representation (i.e., pointwise tabulation of the effects of operators) to an intensional one (i.e., representable as an expression tree with a knowable semantic interpretation), as with the analytic treatment of the pendulum described above. These operations are invoked by a granular inference architecture, a reference version of which is described in the next chapter. We tie these together under the heading of ‘Semantically Closed Learning’ (SCL), the name having been chosen with reference to the property of ‘semantic closure’. In the context of open-ended evolution, the term semantic closure was coined by Pattee [252, 253], who described it as:

An autonomous closure between (A) the dynamics (physical laws) of the material aspects of an organization and (B) the constraints (syntactic rules) of the symbolic aspects of said organization.

An implementation of such open-ended evolution is given in Clark et al. [51]. They describe a ‘Universal Constructor Architecture’, in which genomes both contain an expressor and are decoded via it. The decoding process is stateful (being analogous to gene transcription), and may experience degradation via contact with the environment. Notably, they state:

The cleanest possible example of demonstrating semantic closure: the genome originally encoded the seed expressor, now it encodes a different expressor, but the genome string itself is not altered at any time, only the meaning of the genome string has been altered.

Abstracting from the concrete implementation of Clark et al., we consider a semantically closed system to be one equipped with a stateful interpreter, such that:

  • The next step in the state trajectory is determined via the application of the interpreter.

  • The interpreter is jointly a function of the state trajectory of the system and its interaction with the environment.

This suffices to describe systems capable of open-ended evolution, but is closer to the notion of ‘AI as organism’ than the required one of ‘AI as tool’. Hence, we must additionally cater for the achievement of goals (self-imposed or otherwise). We therefore define a semantically closed learner as a semantically closed system whose interpreter state is adapted, via interaction with the environment, so as to reduce the discrepancy between expected and actual states.
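
A minimal sketch of this definition follows (hypothetical code; the single-parameter ‘gain’ interpreter and the doubling environment are assumptions made purely for illustration): the interpreter is a first-class, inspectable value that determines the next step, and is repaired whenever expected and actual states diverge.

```python
# Minimal sketch of a semantically closed learner: a stateful interpreter
# (here just a dict with one parameter) produces expectations and actions,
# and is repaired whenever expected and actual states diverge.

def make_learner():
    interpreter = {"gain": 1.0}                  # interpreter state (first-class, inspectable)
    last_obs = {"value": None}

    def step(observation):
        expected = (interpreter["gain"] * last_obs["value"]
                    if last_obs["value"] is not None else None)
        if expected is not None:
            discrepancy = observation - expected
            if abs(discrepancy) > 1e-9:          # expected != actual: repair the interpreter
                interpreter["gain"] = observation / last_obs["value"]
        last_obs["value"] = observation
        action = interpreter["gain"] * observation   # act via the (possibly repaired) interpreter
        return action, dict(interpreter)

    return step

step = make_learner()
for obs in [1.0, 2.0, 4.0, 8.0, 8.0]:            # environment doubles, then stops doubling
    print(step(obs))
```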

When the discrepancy between actual and expected states is determined via interaction with a ‘sufficiently complex’ environment (see Prop-3 of Sect. 7.3 below for more details), the above notion of semantic closure affords situatedness. The interpreter maps from system state to predictions and actions, being subject to repair when predictions are not met. Learning thus takes place as a function of the discrepancy between the actual and desired state of affairs. The important aspect for general intelligence purposes is support for open-ended learning. Such learning is considered here as an operationalization of the scientific method, which requires that an agent can generate hypotheses which it can process as first-class entities.

In one sense, all human beings are scientists (cf. Sloman on ‘Toddler theorems’ [317]). For example, even at a nascent level of cognition, concept formation [262] can be seen as abstracting from an iterated process of hypothesis generation/validation. One may consider higher levels of cognition to be hierarchical, in the sense that they make use of lower-level hypotheses (such as object permanence) as elements. A certain amount of introspection into one’s own problem-solving activity will reveal that higher levels of human reasoning are an ‘end-to-end white-box’ activity: arbitrary (and even novel) features of both a problem and its proposed solutions can be confronted with one another. These features are of course ultimately grounded in experience of the real world [187]. As such, the hypotheses evoked by any confrontation of features are so strongly biased towards the real world that events from the long-tail of impossible/vanishingly unlikely worlds are never even entertained. However, to talk purely in terms of bias as a means of efficient inference is missing a key aspect of human cognition in general, and the scientific method in particular: a hierarchy of compositional representations offers the potential to reason at a much coarser granularity than the micro-inferences from which it was constructed. Therefore, reasoning can be considered to occur in a Domain-Specific Language (DSL) which has been abstracted from the environment by virtue of the ubiquity and robustness of its terms [87, 187]. This is in contrast to the prevailing approach in ML, in which inference is entirely mediated through numerical representations, as biased via some loss or reward function. There, some level of generality is achieved by reducing the notion of feedback to the ‘lowest common denominator’ across problem domains.

It is therefore instructive to consider compositional learning in the context of the historical development of intelligent systems. The early cyberneticists understood that ‘purposeful’ behavior must be mediated via feedback from the goal [299]. Since they were predominantly concerned with analog systems, the feedback and the input signal with which it was combined were commensurate: they could be said to be ‘in the same language’. In the move to digital systems, this essential grounding property is typically lost: feedback is often mediated via a numeric vector that is typically a surjective mapping from some more richly-structured source of information. Useful information is therefore lost at the boundary between the learner and the feedback mechanism. In Sect. 9.4, we describe a compositional mechanism for hybrid symbolic-numeric feedback at arbitrary hierarchical levels that does not intrinsically require any such information loss.

3 Baseline Properties of SCL

As detailed subsequently, while the specific choice of expression language for the DSL is rightfully an open-ended research problem, a number of elementary properties are required:

Support for Strong Typing (Prop-1)

At the most elementary level of representation, labels for state space dimensions can be used to index into code (e.g. stored procedures) and/or data (e.g. ontologies). Building upon this explicit delineation and naming of state space dimensions, a defining property of SCL is the use of a strongly-typed ‘expression language’ [265] which can be used to represent both constrained subregions of the state space and the ‘transfer functions’ that map between such regions (see Fig. 7.1). Types therefore form the basis for a representation language which, at a minimum, constrains inference to compose only commensurate (i.e. type-compatible) objects. Unlike testing, which can only prove the presence of errors, the absence of errors (and indeed, more general safety guarantees) can be witnessed via strong typing. An elementary example is the construction of causal models which are physically consistent with respect to dimensional analysis.

Fig. 7.1 Concepts involved in the reasoning process. (a) A rule (colored circle) implements a relation between typed values (shapes on either side). For forward inference, rules are read left-to-right: an object of one type is transformed into an object of another type via a transfer function. (b) A type may be structured in terms of other types. (c) A repertoire of rules and types. Rules are values and may be composed, such as in the blue and gray rules. Rule firing is also a value (here depicted on the left side of the yellow rule), and so the reasoning process (i.e., the production of inferences) can be reasoned about. (d) A possible unfolding of forward inferences produced by the repertoire. (e) Inferences can produce new rules; they can also produce new types (not depicted).

In software engineering, such modeling has well-understood safety implications; for example, the bug that led to the loss of NASA’s ‘Mars Climate Orbiter’ in 1999 was due to an invalid conversion between two different dimensional representations of impulse [215]. However, this example only scratches the surface of what can be expressed [265, 263]: the rich body of work in type theory is an ongoing investigation into which aspects of system behavior can be expressed statically, i.e., without requiring actual program execution. For example, certain invariants of the application of transfer functions to subregions of the state space can be modeled via refinement types, which use predicates to define which subset of the possible values representable via a tuple actually correspond to valid instances of the type; as pedagogical examples, one can explicitly define the type of all primes or the type of pairs of even integers. As well as constructing arbitrary new dimensions of singleton type, it is possible to create new dimensions via other type-theoretic constructions, e.g. tupling or disjoint union of existing types [176]. Since there are intrinsic trade-offs between expressiveness, decidability, and learnability, the specific choice of type system is intentionally left open, being rightfully a matter for continuing research (a concrete example is nonetheless provided in Sect. 10.1). Fortunately, the constructions for compositional inference given in Chap. 9 can be defined in a manner that is agnostic with respect to the underlying type system.
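
As a hedged sketch of Prop-1 (Python has no native refinement types, so the predicates below are checked at construction time; the class names and conversion helper are ours): carrying units in the type rules out the Mars Climate Orbiter class of error, and a refinement-style type such as ‘pair of even integers’ makes its defining predicate part of the type itself.

```python
from dataclasses import dataclass

# Dimensional safety: impulse values carry their unit, so the invalid
# pound-second / newton-second confusion cannot pass silently.
@dataclass(frozen=True)
class Impulse:
    newton_seconds: float

    @staticmethod
    def from_pound_seconds(lbf_s: float) -> "Impulse":
        return Impulse(newton_seconds=lbf_s * 4.4482216)

def total_impulse(a: Impulse, b: Impulse) -> Impulse:
    return Impulse(a.newton_seconds + b.newton_seconds)

# Refinement-style type: the subset of integer pairs whose components are both even.
@dataclass(frozen=True)
class EvenPair:
    x: int
    y: int
    def __post_init__(self):
        if self.x % 2 or self.y % 2:
            raise TypeError("EvenPair requires both components to be even")

print(total_impulse(Impulse(10.0), Impulse.from_pound_seconds(10.0)))
print(EvenPair(2, 4))
# EvenPair(2, 3)  -> raises TypeError: the predicate is part of the type
```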

Fig. 7.2 The property of ‘endogenous situatedness’ imbues an agent with knowledge of its own causal abilities, which includes various proxies for the capabilities of its own reasoning process. This of course also requires a reflective representation and declarative goals and constraints.

Reflective State Space Representation (Prop-2)

As a minimum, the state space includes the actionables and observables of the environment and/or the system. As discussed in Chap. 6, it must be possible to explicitly declare objectives (‘goals’) as delineated regions within the state space. While the base dimensions of the state space (corresponding to sensors and actuators for a situated agent) are specified a priori, the representation may also permit the construction of synthetic dimensions (e.g. to denote hidden variables or abstractions, as described below), similar to Drescher’s ‘synthetic items’ [71]. As discussed under ‘Work on Command’, the property of reflection obviates the need for sampling of rewards, and allows for dynamic changes to goal specification, since state space constraints are available to the agent. A reflective state space is also key for enabling the creation of new types through abstraction.
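
The following sketch (illustrative only; the dimension names and the goal predicate are assumptions) shows a reflective state space in the sense of Prop-2: base dimensions are declared explicitly, goals are delineated regions expressed as constraints over named dimensions, and synthetic dimensions can be added at runtime.

```python
# Sketch of a reflective state space: dimensions are declared (and inspectable),
# goals are regions expressed as predicates over named dimensions, and synthetic
# dimensions can be added at runtime (cf. Drescher's synthetic items).

class StateSpace:
    def __init__(self, **dimensions):
        self.dimensions = dict(dimensions)     # name -> type/descriptor or derivation
        self.goals = {}                        # name -> predicate over a state dict

    def add_synthetic_dimension(self, name, derive):
        """Add a dimension computed from existing ones (e.g. an abstraction)."""
        self.dimensions[name] = derive

    def declare_goal(self, name, predicate):
        self.goals[name] = predicate           # a delineated region of the state space

    def project(self, state):
        """Extend a raw state with its synthetic dimensions."""
        out = dict(state)
        for name, dim in self.dimensions.items():
            if callable(dim) and name not in out:
                out[name] = dim(out)
        return out

space = StateSpace(angle="radians", angular_velocity="radians/s")
space.add_synthetic_dimension("energy_proxy", lambda s: s["angular_velocity"] ** 2)
space.declare_goal("at_rest", lambda s: abs(s["angle"]) < 0.01 and s["energy_proxy"] < 0.001)

s = space.project({"angle": 0.005, "angular_velocity": 0.01})
print(s, space.goals["at_rest"](s))
```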

Endogenous Situatedness (Prop-3)

A system has ‘grounded meaning’ if its symbols are a context-sensitive function of the system’s experience, a property typically lacked by GOFAI systems. In his seminal work, Harnad considers a system to be grounded [134] if reasoning is driven by sensory inputs (or invariants induced therefrom) from the real world. Wang [355] argues that this notion of ‘real world’ can be relaxed to apply to ‘real-time operation in any complex, uncertain dynamic environment’, provided that (1) symbol interpretation is contextually driven by information obtained from the environment and (2) information is updated via a feedback loop that includes the action of the system. The reflective state space representation of end-to-end hypothesis chains of Prop-2 thus suffices for the system to be situated in its environment. That is, the mapping between grounded sensors and effectors proceeds via a world model in which feedback from effectors is reflectively inspectable.

However, there is a yet stronger notion of ‘situated’ which more closely captures a system’s causal capabilities: being endogenously situated. This arises from the observation that “an organism’s own patterns […] are also stimuli” [97]. A system is therefore endogenously situated when (at least some of) its internal representations are themselves considered part of the environment in which the system operates, and these endogenous stimuli are given meaning via their ultimate participation in causal sensor-effector chains.

Open-Ended Continual Granular Inference (Prop-4)

As discussed, our pragmatic definition of general intelligence emphasizes the need for flexibility of response. This requires that an intelligent system avoids the ‘perceive–act–update’ cycle of traditional RL and GOFAI, in which it is effectively assumed that the system and the world progress in lockstep. Since system deliberation time will necessarily increase with environment and task complexity, the lockstep approach will not scale. As per previous work on time-bounded inference [242, 243], the alternative is to perform many simultaneous inferences of smaller scope, each inference having a WCET (worst-case execution time) that is both small and bounded—hence ‘granular’.

In some of our previous work [242], scheduling based on dynamic priorities is used to explore a vast number of lines of reasoning in parallel, while retaining flexibility and responsiveness. As described in more detail in the next chapter, by virtue of scheduling over granular inferences, attention is then an emergent process: the priorities play the role of attention weights [16], and are dynamically updated as a function of the expected value of inference chains.
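
A minimal sketch of such scheduling follows (ours, not the reference architecture of the next chapter; the job names and the decaying-value heuristic are invented): each inference job is small and bounded, and dynamically updated priorities play the role of attention weights.

```python
import heapq

# Sketch of granular inference scheduling: many small, bounded inference jobs
# compete via dynamic priorities; the priorities act as emergent attention.

class Scheduler:
    def __init__(self):
        self._queue = []      # entries: (negative priority, tie-breaker, job)
        self._counter = 0

    def submit(self, priority, job):
        heapq.heappush(self._queue, (-priority, self._counter, job))
        self._counter += 1

    def run(self, max_jobs=100):
        executed = 0
        while self._queue and executed < max_jobs:
            _, _, job = heapq.heappop(self._queue)
            # Each job is small and bounded (its own 'WCET'); it may return
            # follow-up jobs together with their expected value (new priority).
            for expected_value, follow_up in job():
                self.submit(expected_value, follow_up)
            executed += 1

def make_job(name, depth=0):
    def job():
        print(f"inference step: {name} (depth {depth})")
        if depth < 2:
            # expected value decays with depth, so attention drifts elsewhere
            return [(1.0 / (depth + 2), make_job(name + "'", depth + 1))]
        return []
    return job

sched = Scheduler()
sched.submit(1.0, make_job("hypothesis-A"))
sched.submit(0.8, make_job("hypothesis-B"))
sched.run()
```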

4 High-Level Inference Mechanisms of SCL

Building on the support for strong typing (Prop-1) and reflective state space representation (Prop-2), SCL makes use of four methods of compound inference: hypothesis generation, abduction, abstraction, and analogy. All inference steps in SCL can be considered to be the application of some rule \(r: A \rightarrow B\), for types A and B. If no such rule exists, then it is necessary to synthesize it, as described in more detail in Chap. 10. This synthesis process may involve any combination of the following:

Abstraction

For purposes of SCL, abstraction is considered to be the process of factorizing commonality from two or more hypotheses. One can view this factorization as a parametric or ‘partially instantiated’ hypothesis. A suitable choice of parameters may allow the motivating hypotheses to be (at least approximately) recovered. In a numerical domain, approaches such as PCA/SVD [25] compress a set of empirical observations into a basis set from which observations can be reconstructed via a specific weighting of the basis vectors. In SCL, the methods used for decomposition and reconstruction of the state space must be applicable at the symbolic level of expression trees, as well as for any numerical expressions at their leaves. In Sect. 9.3, we describe one possible compositional mechanism for abstraction.
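
One simple symbolic realization of such factorization is anti-unification over expression trees, sketched below (a hedged illustration, not the mechanism of Sect. 9.3; the tuple encoding of expressions is an assumption): the common structure of two hypotheses becomes a parametric template, and the per-hypothesis bindings allow each original to be recovered.

```python
# Sketch of abstraction as anti-unification over expression trees
# (tuples are internal nodes, other values are leaves). The result is a
# parametric template plus the bindings that recover each input hypothesis.

def anti_unify(a, b, bindings, counter):
    """Return a template generalizing expressions a and b; record differing
    subterms as parameters in `bindings`."""
    if a == b:
        return a
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and len(a) == len(b) and a[0] == b[0]):
        return (a[0],) + tuple(anti_unify(x, y, bindings, counter)
                               for x, y in zip(a[1:], b[1:]))
    var = f"?x{counter[0]}"            # differing subterms become a shared parameter
    counter[0] += 1
    bindings[var] = (a, b)
    return var

# Two hypotheses: the period formulas T = 2*pi*sqrt(L/g) and T = 2*pi*sqrt(m/k)
h1 = ("mul", ("const", "2*pi"), ("sqrt", ("div", "L", "g")))
h2 = ("mul", ("const", "2*pi"), ("sqrt", ("div", "m", "k")))

bindings, counter = {}, [0]
template = anti_unify(h1, h2, bindings, counter)
print(template)   # ('mul', ('const', '2*pi'), ('sqrt', ('div', '?x0', '?x1')))
print(bindings)   # {'?x0': ('L', 'm'), '?x1': ('g', 'k')}
```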

Hypothesis Generation

This is the means by which salient hypotheses are generated. Hypothesis generation interprets an existing hypothesis to yield a new one intended to have fewer relational inconsistencies. It is a broader notion than the counterfactual reasoning conducted using structural causal models (SCM), since rather than merely taking different actions, it considers the overall consistency of alternative models. Informally, this can be seen as the imposition of semantic/pragmatic constraints on expressions in a generative grammar. As an example from the domain of natural language, the famous phrase “Colorless green ideas sleep furiously” is syntactically valid, but not semantically consistent: neither color nor the ability to sleep is a property typically associated with ideas. The semantic inconsistency here is immediately obvious to the human reader, but in general an artificial learning system must use feedback from the environment to discover any semantic inconsistencies in its interpretation. Hypothesis generation is described in detail in Sect. 9.2.
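
A toy sketch of this filtering (ours; the lexicon and affordance table are invented for illustration): a grammar enumerates syntactically valid ADJ NOUN VERB candidates, and semantic constraints on the kinds of properties and actions a noun can bear reject combinations such as ‘green ideas sleep’.

```python
import itertools

# Syntactically valid sentences: ADJ NOUN VERB. Semantic constraints then filter
# out candidates whose property/action types are incompatible with the noun.

adjectives = {"green": "color", "heavy": "mass", "abstract": "conceptual"}
nouns      = {"ideas": "conceptual", "rocks": "physical"}
verbs      = {"sleep": "animate", "exist": "any"}

noun_affords = {
    "conceptual": {"conceptual", "any"},        # properties/actions ideas can bear
    "physical":   {"color", "mass", "any"},     # properties/actions rocks can bear
}

def semantically_consistent(adj, noun, verb):
    afford = noun_affords[nouns[noun]]
    return adjectives[adj] in afford and verbs[verb] in afford

for adj, noun, verb in itertools.product(adjectives, nouns, verbs):
    status = "ok" if semantically_consistent(adj, noun, verb) else "rejected"
    print(f"{status:>8}: {adj} {noun} {verb}")
```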

Fig. 7.3 Bidirectional rules. Rules support both induction and abduction; depending on their denotational semantics, their inputs and outputs (marked ‘?’) are ascribed particular meanings. (a) Induction: the output can be a prediction or a timeless entailment (e.g., an instance of a subtyping relation). The inputs may be (counter)factual (e.g., sensory inputs or absence thereof), induced or abducted. (b) Abduction: the input can be a goal, an assumption, or a (counter)fact. The outputs can be subgoals, subassumptions, or timeless premises; they are not necessarily unique. (c) The choice of outputs is constrained by an input.

Analogical Reasoning

Analogy has been argued to be the “core of cognition” [146]. It can be considered as a generative mechanism that factors out a common ‘blend’ [84] between two situations. There is considerable literature on cognitive and computational models of analogy: in-depth surveys can be found in Gentner and Forbus [106] and Prade and Richard [271]. Analogy is generally considered to be either predictive or proportional. Predictive analogy is concerned with inferring properties of a target object as a function of its similarity to a source object (e.g. the well-known association between the orbits of planets in the solar system and electron shells in the Rutherford model of the atom). Study of proportional analogy extends at least as far back as Aristotle [52]. A proportional analogy problem, denoted:

$$\begin{aligned} \texttt {A : B\,\, {:}{:}\,\, C : D} \end{aligned}$$

is concerned with finding D such that D is to C as B is to A. For example, “gills are to fish as what is to mammals” is notated as:

$$\begin{aligned} \texttt {fish : gills\,\, {:}{:}\,\, mammals : \,\,???} \end{aligned}$$

In Sect. 9.3, we describe one possible computational approach to proportional analogy.
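
A minimal sketch of proportional analogy over an explicit relational store (our illustration, not the approach of Sect. 9.3; the relation names and entries are assumptions): identify the relation linking A to B, then apply the same relation to C.

```python
# Sketch: proportional analogy A : B :: C : ? over an explicit relational store.
# The relation linking A to B is identified and then applied to C.

relations = {
    ("fish", "gills"):    "breathes_with",
    ("mammals", "lungs"): "breathes_with",
    ("birds", "wings"):   "moves_with",
    ("fish", "fins"):     "moves_with",
}

def solve(a, b, c):
    rel = relations.get((a, b))
    if rel is None:
        return None
    candidates = [y for (x, y), r in relations.items() if x == c and r == rel]
    return candidates[0] if candidates else None

print(solve("fish", "gills", "mammals"))   # -> lungs
print(solve("fish", "fins", "birds"))      # -> wings
```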

Abduction

By virtue of the reflective representation, it is possible to perform inverse inference. A hypothesis can thereby be updated directly by working backwards from its effects, rather than via the indirection of a sampled objective function, which was the primary objection raised in both Sects. 4.3 and 5.2.

With a bidirectional reasoning process (illustrated in Fig. 7.3) it is possible to ‘backpropagate’ actions directly along the hypothesis chain from effects (‘failure of an upside-down table to provide support’) to counterfactuals (‘if the table were turned the other way up ...’). In DL and RL, the ‘representation language’ is untyped and noncompositional, so this kind of direct modification of hypotheses is not possible. In Sect. 9.4, we describe a compositional mechanism for abduction.
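
The following hedged sketch (ours, not the compositional mechanism of Sect. 9.4; the rule base is invented) illustrates abduction over bidirectional rules: a desired effect is ‘backpropagated’ into the premise chains, i.e. the subgoals or counterfactuals, that would bring it about.

```python
# Sketch of abduction over bidirectional rules: running a rule backwards maps a
# desired effect to candidate premises (subgoals / counterfactuals).

rules = [
    # (premise, effect): read left-to-right for prediction, right-to-left for abduction
    ("table_upright",  "table_provides_support"),
    ("shelf_attached", "table_provides_support"),
    ("legs_on_floor",  "table_upright"),
]

def abduce(goal, depth=0, max_depth=3):
    """Enumerate premise chains that would bring about `goal`."""
    if depth >= max_depth:
        return
    for premise, effect in rules:
        if effect == goal:
            yield [premise]                                  # direct counterfactual premise
            for chain in abduce(premise, depth + 1, max_depth):
                yield chain + [premise]                      # deeper subgoal chain

for chain in abduce("table_provides_support"):
    print(" -> ".join(chain + ["table_provides_support"]))
```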

5 Intrinsic Motivation and Unsupervised Learning

In contrast to the a priori problem formulation required for supervised learning, the scientific method is an iterative process of problem formulation and solving. Such an iterative approach performs both supervised and unsupervised learning, the former corresponding to the meeting of objectives supplied a priori, the latter being the search for more compelling hypotheses, potentially via new experiments. In this wider framework, hypotheses have the non-monotonic property of the scientific method itself, i.e., they are potentially falsifiable by subsequent observation or experiment.

The aspiring robot scientist must therefore decide how to interleave the processes of observation and hypothesis generation. Prior art on this in the (simply-stated but open-ended) domain of number sequence extrapolation is Hofstadter et al.’s Seek-Whence [144], which decides when to take further samples as a function of the consistency of its hypothesis. In a more general setting, the self-modifying PowerPlay framework searches a combined task and solver space, until it finds a solver that can solve all previously learned tasks [307, 323]. In more recent work, Lara-Dammer et al. [190] induce invariants in a ‘molecular dynamics’ microdomain in a psychologically-credible manner.

In particular, our chosen definition of general intelligence acknowledges that resources (compute, expected solution time, relevant inputs, etc.) are finitely bounded. At the topmost level, the corresponding resource-bounded framework for the scientific method is simple: within supplied constraints, devote resources to finding good hypotheses, balancing the predicted merits of hypothesis refinement against the ability of the refinement to further distinguish signal from noise. The presence of such bounds is an intrinsic guard against the kind of ‘pathologically mechanical’ behaviors that one might expect from algorithms which do not incorporate finite concerns about their own operation, as detailed further in the next chapter.
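
As a closing hedged sketch (purely illustrative; the gain estimates and update rules are assumptions): within a fixed budget, effort is devoted to whichever of hypothesis refinement or further observation is predicted to reduce the remaining unexplained ‘noise’ more.

```python
# Sketch: within a fixed resource budget, interleave hypothesis refinement and
# observation according to which is expected to reduce unexplained 'noise' more.

def run(budget, expected_gain_refine, expected_gain_observe,
        refine, observe, noise=1.0):
    while budget > 0 and noise > 1e-3:
        if expected_gain_refine(noise) >= expected_gain_observe(noise):
            noise = refine(noise)          # improve the hypothesis itself
        else:
            noise = observe(noise)         # take more samples / run an experiment
        budget -= 1
    return noise

residual = run(
    budget=20,
    expected_gain_refine=lambda n: 0.5 * n,        # refinement pays off while noise is large
    expected_gain_observe=lambda n: 0.05,          # observation gives small constant gains
    refine=lambda n: n * 0.5,
    observe=lambda n: max(n - 0.05, 0.0),
)
print(f"unexplained noise after budget: {residual:.4f}")
```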