Deep learning (DL) has emerged as the dominant branch of machine learning, becoming the state of the art for machine intelligence in various domains. As discussed in the previous chapter, this has led some researchers to believe that deep learning could hypothetically scale to achieve general intelligence. However, there is increasing consensus (e.g. [57, 210, 230]) that the techniques do not scale to harder problems as well as was anticipated.

In particular, deep learning methods find their strength in automatically synthesizing distributed quantitative features from data. These features are useful insofar as they enable mostly reliable classification and regression, and in some limited cases also few- or zero-shot transfer to related tasks. However, it is increasingly questionable whether deep learning methods are appropriate for autonomous roles in environments that are not strongly constrained. While there are still countless use-cases for narrow artificial intelligence, many of the truly transformative use-cases can only be realized by general intelligence.

Fig. 4.1 In this chapter we argue that the lack of structure in representation languages created via deep learning is in conflict with the requirements of general intelligence

We recall from Sect. 2.2 that, while we do not know the internal mechanisms of human general intelligence, we observe that ‘science as extended mind’ is a pragmatic description of how a general intelligence models its environment. However, neural network representations are not readily interpretable, either to humans or, more importantly (as we subsequently argue at length), to the learning process itself.

This chapter has two purposes: the first is to explain what properties are wanting (their relationship to the entire book is shown in Fig. 4.1) and to identify the fundamental obstacles posed by deep learning. The second purpose is to argue that ‘science as extended mind’ offers a more effective perspective for designing the desired system. The latter is further developed in Sect. 7.2 and is the foundation of our proposed inference mechanisms in Chap. 9. The following claims contrast deep learning with the requirements for the operationalization of the scientific method:

  • Representations are not compositional, which makes them inefficient for modeling long-tailed distributions or hierarchical knowledge.

  • Representations are not strongly typed, which prevents verification against adversarial scenarios and hinders generalization to new domains.

  • Representations are generated by models which do not support reflection, which restricts model improvement to gradient-based methods.

1 Compositionality

There has recently been increasing emphasis on the importance of compositionality [120] for machine learning. To take a famous example from AI history [149, 304], humans do not require an a priori hypothesis to react to the outlier case of ‘goat enters restaurant’. Knowledge about goats can be freely composed with knowledge about restaurants; for example, the sudden arrival of a goat would not generally be expected to preserve a tranquil dining atmosphere or good standards of hygiene [281]. It has recently been stated by some of the most renowned exponents of deep learning [21] that:

We believe that deep networks excel because they exploit a particular form of compositionality in which features in one layer are combined in many different ways to create more abstract features in the next layer.

However, this notion is far weaker than is actually required, in particular for purposes of AI safety but (as we subsequently argue) also for scalable inference via greater sample efficiency. The weakness of this notion of compositionality is evidenced by numerous challenges for deep learning (discussed in more detail in this and subsequent chapters):

  • Adversarial examples.

  • Weak generalization capability.

  • Inability to explicitly induce or propagate more than a small number of types of invariant (translation, rotation, etc.).

Indeed, it could be said that DL is closer to the merely syntactic notion of composability than the semantic notion of compositionality. In its most degenerate form, the syntactic notion is merely the observation that feature ensembles are instances of the ‘composite’ design pattern [101, 369] and hence hierarchically aggregated features are syntactically substitutable for individual ones. However, that does not impose any intrinsic constraints on what the features represent or what the ensembles compute. In contrast, compositionality is defined as [207]:

The algebraic capacity to understand and produce novel combinations from known components.

The term ‘algebraic’ here effectively means ‘having well-defined semantics’, in the sense that the behaviour of a composite exhibits constraints that are a function of those of its component parts. The alleged compositionality of DL is lacking in almost every respect of this definition: in algebraic terminology, the feature representations in DL layers can only be ‘freely composed’, with no semantic constraints imposed on the composite. In contrast, in Chap. 9 we describe a mechanism for imposing a denotational semantics on composite representations.
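To make the distinction concrete, the following is a minimal, purely illustrative sketch (the names are ours, not drawn from any cited system) contrasting syntactic composability, where parts may be aggregated without any constraint on what the aggregate means, with typed composition, where only semantically admissible combinations are accepted:

```python
from dataclasses import dataclass
from typing import List

# Syntactic composability (the 'composite' pattern): any feature can be nested in
# any ensemble, and the aggregate is substitutable for a single feature, but
# nothing constrains what the combined activation denotes.
class Feature:
    def activation(self) -> float:
        return 0.0

class Ensemble(Feature):
    def __init__(self, parts: List[Feature]):
        self.parts = parts

    def activation(self) -> float:
        # The meaning of this sum is unconstrained by the construction itself.
        return sum(p.activation() for p in self.parts)

# Semantic compositionality: the behaviour of a composite is a well-defined
# function of its parts, and the types admit only meaningful combinations.
@dataclass(frozen=True)
class Distance:
    metres: float

@dataclass(frozen=True)
class Duration:
    seconds: float

@dataclass(frozen=True)
class Speed:
    metres_per_second: float

def average_speed(d: Distance, t: Duration) -> Speed:
    # Composing two Distances here would be rejected by a static type checker.
    return Speed(d.metres / t.seconds)

print(average_speed(Distance(100.0), Duration(9.58)).metres_per_second)
```

Nothing in the `Ensemble` constrains what its summed activation denotes, whereas `average_speed(Distance(3.0), Distance(4.0))` is simply not a well-typed expression.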

Hence, the only property in ML for which there is a guarantee of generalized, end-to-end compositionality is differentiability [90]. If, as seems likely, it is necessary to express more directly whether or not some desired property is compositional, then this requires extending DL far beyond ‘differentiable programming’. In common practice, composability in DL consists of assembling a network from constituent parts which may be trained ‘end-to-end’. Usually, this follows the encoder-decoder pattern, where the encoder is responsible for generating vectorized features and the decoder maps those features to a classification or regression output (a minimal sketch of this pattern follows Claim 1 below). This paradigm is common in deep learning applied to sequential data or labels. Example domains are text-to-text [66, 276, 330, 370], image-to-text (and vice-versa) [169, 173, 283], and program synthesis [46, 123, 124, 284]. When explicitly tasked with the generation of compositional representations, neural networks have been observed to exhibit better generalization performance [5]. However, as observed throughout the literature (e.g. Liska et al. [204]), the more complex architectures designed for this purpose tend not to scale well and remain limited in scope. We argue the following as the most salient consequence:

Claim 1: DL and Compositionality

Deep learning appears to be fundamentally limited in its ability to create compositional knowledge representations. This severely inhibits the effective formation and use of structured, hierarchical knowledge, which in turn results in weak performance in domains with long-tailed distributions.
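To illustrate the encoder-decoder composability referred to above, the following is a minimal PyTorch sketch (an illustrative toy of ours, not any particular published architecture). The two modules are interchangeable parts glued together only by differentiability; nothing in the construction constrains what the latent features denote:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps raw inputs to vectorized latent features."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Maps latent features to class logits."""
    def __init__(self, latent_dim: int, n_classes: int):
        super().__init__()
        self.net = nn.Linear(latent_dim, n_classes)

    def forward(self, z):
        return self.net(z)

# The only 'compositional' glue between the parts is differentiability:
# gradients flow end-to-end through both modules during training.
model = nn.Sequential(Encoder(784, 32), Decoder(32, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```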

Hierarchical Behavior and Representations

In advance of more detailed discussion in the following chapter, we briefly consider here some work from deep reinforcement learning (DRL), so-called for its use of DL for knowledge representation. Deep reinforcement learning has seen much work on encapsulating learned behaviors into skills [94, 199, 222, 312] and options [13] to be composed in a hierarchical fashion. Researchers are motivated by the potential for hierarchy to reduce planning horizons and branching factors, and to improve sample efficiency. Despite interesting results that point to progress in these directions (e.g. [170]), it is not clear whether these approaches scale to more difficult problems. Nachum et al. [235] study these methods in particular and find that the benefits of hierarchical policy composition have more to do with exploration than with the imposed structure, and that the same benefits can be obtained with a modified exploration technique and a ‘flat’ policy.

Other work [65] explores embeddings for tasks which can be composed arithmetically, in a similar manner to deep word embeddings [223] (see the sketch below). However, subsequent work on sentence and document embeddings [58, 125, 256] suggests that arithmetic compositionality of properties encoded via embeddings is a difficult constraint to enforce, and that little besides differentiability scales for compositional representations. In certain settings, recursion can effectively be used to hierarchically compose the interfaces of deep learning architectures [41, 244]. However, this composition is still at a coarse granularity, and it appears unlikely that arbitrary properties can be composed by this means. As such, it cannot be said that DL gives scalable solutions for building hierarchical knowledge, and this will almost certainly limit DL’s overall scalability toward general intelligence.
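The kind of arithmetic composition referred to above can be illustrated with a toy word-embedding analogy; the vectors here are hand-set purely for illustration, whereas real embeddings are learned:

```python
import numpy as np

# Hand-set 4-dimensional vectors, for illustration only.
emb = {
    "king":  np.array([0.8, 0.9, 0.1, 0.2]),
    "man":   np.array([0.7, 0.1, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.7, 0.1]),
    "queen": np.array([0.2, 0.9, 0.7, 0.2]),
}

def nearest(query, vocab):
    # Return the vocabulary item with the highest cosine similarity to the query.
    def cosine(w):
        v = vocab[w]
        return float(np.dot(query, v) /
                     (np.linalg.norm(query) * np.linalg.norm(v) + 1e-9))
    return max(vocab, key=cosine)

# Arithmetic composition of properties: 'royalty' minus 'male' plus 'female'.
query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query, emb))  # with these toy vectors, 'queen' is the closest item
```

The difficulty noted in the literature is that, beyond a handful of such regularities, few properties of interest compose reliably under vector arithmetic.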

Robustness in Long-Tailed Distributions

It is important to note that practically all domains of interest contain long-tailed distributions, particularly if they are grounded in the real world. Indeed, it can be argued that if long tails are not encountered in the data, then that represents a limitation of the dataset and subsequently the evaluation of the model. For example, an autonomous vehicle in an unconstrained environment will need to deal with an endless Borgesian list of edge cases [274]:

  • Telling the difference between a shallow puddle and an impassable flood.

  • Obeying signage written in natural language and intended for an intelligent human reader.

  • Figuring out what another driver means when they flash their headlights.

  • Coping with badly scuffed or missing road/lane markings.

  • Pulling on to a kerb or through a red light to allow an emergency vehicle to pass.

  • Correctly determining from context whether a traffic light is knocked askew or genuinely pointed at a different lane of traffic from this one.

  • ...and so on, ad infinitum.

It has been observed that symbolic representations are well-suited for long-tailed distributions because of the potential to map recursive expression syntax into complex semantics [214]. In contrast, the issue of long-tailed distributions is not sufficiently emphasized in current deep learning research. This is evident both in the confidence placed in natural language processing models and in the scrutiny now emerging around them. Despite more comprehensive benchmarks such as GLUE [354], the combinatorial nature of natural language expressions acts in direct opposition to the notion that a ‘representative’ training corpus can be of reasonable size. When highly-parameterized models such as GPT-2 [276] are thoroughly analyzed [206, 212], they reveal an understanding merely at the level of association, far from the depth required for anything like human-level understanding of what has been parsed. This problem has also been repeatedly identified in neural program synthesis, where program induction should be robust to all unseen inputs. For example, the Neural GPU [163] was trained to do long addition and multiplication. While results suggested robustness for problems hundreds of digits in size, further work [272] revealed weaknesses when performing arithmetic involving many consecutive carry operations. Many other examples of neural networks attempting program induction for arithmetic emerged concurrently [183, 284, 303, 373], showing the same pattern of failure on outlier cases. These examples clearly show that, even in domains which can be formally characterized, deep learning in its current form will not be of much use in the many cases where crucial inputs come from long-tailed distributions.
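The carry-chain weakness mentioned above is easy to make concrete: inputs requiring many consecutive carries occupy a vanishingly small fraction of uniformly sampled operand pairs, yet a correct adder must handle them. A small probe-case generator (ours, purely illustrative):

```python
def carry_chain_case(n_carries: int):
    """Build an addition whose evaluation requires n_carries consecutive carries,
    e.g. 99...9 + 1. Uniformly sampled operands almost never contain such runs,
    so these inputs sit deep in the tail of a typical training distribution."""
    a = int("9" * n_carries)
    b = 1
    return a, b, a + b

for k in (2, 8, 32):
    a, b, c = carry_chain_case(k)
    print(f"{a} + {b} = {c}")
```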

2 Strong Typing

In Chap. 2, we introduced two concepts claimed essential to general intelligence: ‘work on command’ and ‘science as extended mind’. In the previous section we argued for the necessity of compositionality. The primary motivation for that argument is the need for scalability and robustness in the presence of long-tailed distributions. In this section, we argue another claim regarding representations, stated below.

Claim 2: Deep Learning and Types

Deep learning is not designed for generating typed representations. This deficiency is prohibitive for developing general intelligence, since strong typing is essential for invariant propagation, inheritance, verification, and rapid adaptation of existing inferences to new observations.

Types can be used to explicitly delineate subregions of a state space, which is important for specifying constraints and objectives given to an agent as well as the hypotheses constructed by an agent for explaining causal mechanisms. Deep learning essentially concerns itself with only a single type, that of numeric vectors, even for the incredibly large models which are increasingly used. The meanings of intermediate representations remain opaque and, we argue, underconstrained. For example, the statistical and observational nature of supervised learning means that training and test error can converge favorably without any constraint that intermediate representations capture causal relationships. This observation has raised widespread concern about the biases that may be present in deployed models used in sensitive situations such as loan approvals and prison sentencing [132, 359].
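The contrast can be sketched as follows; the field names and constraints are purely illustrative of how a typed state delineates an admissible subregion of the state space, whereas a raw feature vector carries no such commitments:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VehicleState:
    """A typed state: the constructor rejects values outside the admissible region."""
    speed_mps: float       # non-negative by construction
    lane_index: int        # must refer to an existing lane
    signal_visible: bool

    def __post_init__(self):
        if self.speed_mps < 0:
            raise ValueError("speed cannot be negative")
        if self.lane_index < 0:
            raise ValueError("lane index must be non-negative")

state = VehicleState(speed_mps=13.9, lane_index=1, signal_visible=True)

# The same information as an untyped feature vector: nothing prevents a negative
# speed or a fractional lane index from propagating silently through a network.
raw_state = [-3.2, 1.5, 0.0]
```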

Adversarial Examples and i.i.d. Assumptions

Instead of type constraints, deep learning is built upon assumptions about the distributions of data to which it is applied. Most consequential is the requirement that training and test data are independent and identically distributed (i.i.d.), i.e. drawn from the same underlying distribution. This condition is essential for the strong convergence guarantees derived in statistical learning theory. In that context, such assumptions are perfectly reasonable, but they are ill-matched to general knowledge representation and learning. The clearest indication of this is the existence of adversarial examples.

In their most common formulation, adversarial examples are minutely perturbed inputs specifically designed to severely reduce the accuracy of supervised deep learning models [121, 335]. Adversarial training has been developed in response, but new weaknesses have emerged [345] and an all-encompassing solution remains elusive. Meanwhile, this vulnerability has been confirmed in real-world image classification scenarios [184, 341], with variants applicable to reinforcement learning agents [112, 151] and natural language models [2, 158, 233, 249].
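As an illustration of how small such perturbations can be, a well-known attack of this kind, the fast gradient sign method, nudges every input dimension by a fixed small amount in the direction that increases the loss. A minimal PyTorch-style sketch, assuming an arbitrary differentiable classifier `model` with inputs normalized to [0, 1]:

```python
import torch
import torch.nn as nn

def fgsm_perturb(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                 epsilon: float = 0.03) -> torch.Tensor:
    """Fast gradient sign method: shift each input dimension by +/- epsilon in
    the direction that increases the classification loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Clamp to keep the perturbed input in the valid [0, 1] range.
    return (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()
```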

Adversarial examples are intriguing to humans since our perceptual systems are much more robust to such attacks. We argue that typed representations would prevent such catastrophic misrecognitions, which have no clearly explainable origin in terms of DL parameter weightings. Objects can essentially be conceptualized as a set of invariances, such as shape being invariant to the brightness of incident light, texture being invariant to orientation, etc. These invariances anchor qualitative descriptions and allow us to construct part-whole and inheritance relationships through deduction [332].

Importantly, we cannot entirely eliminate high-dimensional inputs, since grounding is essential for a science-oriented agent. Instead, we argue that inducing types from raw high-dimensional data should be prioritized to occur at the lowest possible hierarchical level, since any higher-level inference would benefit from the stability and clarity of typed language elements. We contrast this proposal with current techniques still bound to the i.i.d. paradigm, such as domain randomization [343]. While this has shown positive results in complex scenarios [31, 247, 255], there are currently no compelling reasons to believe that the method scales to the levels required by general intelligence: it provides only a relatively crude way to learn invariances, one which does not involve the distillation of new types. This of course is an open challenge and we discuss it further in Chap. 9.
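The idea of domain randomization can be sketched in a few lines; the parameter names below are illustrative and not taken from any particular simulator. Rather than inducing an explicit invariant, the technique simply forces the learner to average over nuisance variation:

```python
import random

def randomized_episode_params() -> dict:
    """Sample fresh nuisance parameters for each training episode."""
    return {
        "friction":     random.uniform(0.5, 1.5),
        "object_mass":  random.uniform(0.1, 2.0),
        "light_level":  random.uniform(0.2, 1.0),
        "camera_noise": random.uniform(0.0, 0.05),
    }

# Typical usage (the environment API here is hypothetical):
# for episode in range(num_episodes):
#     env.reset(**randomized_episode_params())
#     ...collect rollouts and update the policy as usual...
```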

Out-of-Domain Generalization and Meta-learning

An essential trait for general intelligence is the ability to efficiently leverage learned knowledge when facing a novel yet related domain. The existing literature describes techniques for domain adaptation, in which a model is adapted to perform in another domain given few or no labeled examples. Domain-invariant feature learning [102, 232] and adversarial training methods [348, 352] have shown positive results for deep networks. Transfer learning and semi-supervised learning for deep neural networks are also well-studied topics.

Nonetheless, there is consensus that there remains much to be desired from deep learning in this regard. We argue that typed representations are the natural way to address this requirement for general intelligence. Humans are naturally capable of seeing new situations as modified versions of previous experience. In other words, there is an abstract type of which both the prior observation and the current stimuli are examples, but with certain attributes differing. Given enough new observations, it may be appropriate to reify a different type altogether. Rapid domain adaptation can also be modeled as a scientific exercise of determining an unknown type with minimal experimentation. We expand on this perspective in Sect. 7.2.

Meta-learning has emerged as a popular research topic aiming to expand the generalization of deep learning systems [86, 200, 227, 239, 286]. These methods train models to a location in parameter space which allows for efficient adaptation to unseen tasks, as opposed to unseen data points (a minimal sketch follows below). Conceptually, this may appear to expand generalization capacity. However, the framework assumes that tasks come from a common distribution, much as individual data points in a dataset are assumed to do. As such, it suffers similar issues, such as inflexibility to non-stationarity in tasks. More importantly for generalization, meta-learning does not yield transferable abstractions; rather, it gives an optimized starting point for creating adaptable models. As argued by Chao et al. [45], meta-learning is not fundamentally all that different from supervised learning. This makes it unlikely to truly resolve the challenges of generalization when the scope or nature of tasks is broadened.
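The ‘location in parameter space’ framing can be made concrete with a minimal MAML-style sketch on toy linear-regression tasks (our illustration, not any cited implementation): the outer loop optimizes an initialization purely so that a single inner gradient step adapts well to each sampled task, i.e. an optimized starting point rather than a transferable abstraction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
weight = torch.randn(1, 4, requires_grad=True)   # the meta-learned initialization
bias = torch.zeros(1, requires_grad=True)
meta_opt = torch.optim.SGD([weight, bias], lr=1e-2)
inner_lr = 0.1

def sample_task():
    # Hypothetical task distribution: each task is a random linear map.
    w_true = torch.randn(4, 1)
    x_s, x_q = torch.randn(16, 4), torch.randn(16, 4)
    return (x_s, x_s @ w_true), (x_q, x_q @ w_true)

for _ in range(200):
    (x_s, y_s), (x_q, y_q) = sample_task()
    # Inner step: adapt to the support set, keeping the graph for the meta-update.
    support_loss = nn.functional.mse_loss(x_s @ weight.t() + bias, y_s)
    g_w, g_b = torch.autograd.grad(support_loss, (weight, bias), create_graph=True)
    w_adapt, b_adapt = weight - inner_lr * g_w, bias - inner_lr * g_b
    # Outer step: the meta-loss is how well the *adapted* parameters fit the query set.
    meta_loss = nn.functional.mse_loss(x_q @ w_adapt.t() + b_adapt, y_q)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```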

3 Reflection

In the previous two sections, we made the case for compositionality and strong typing as necessary properties for representing knowledge for general intelligence. This section is concerned with what is needed in order to adapt that knowledge to make it more accurate and comprehensive. In deep learning, this process is handled as an optimization problem with the target being minimal error, which fits neatly with a purely numerical class of models. The incorporation of symbols via typed expressions complicates this but also offers new opportunities, which we now discuss.

The notion of scalability is applied in various ways. In this section we will draw attention to scalability in terms of sample efficiency with respect to training data. We expect that as an agent grows more intelligent, it should be able to evaluate and compare increasingly complex models with roughly the same efficiency as when it was less developed and learning about simpler phenomena. This is one of the merits of the scientific method: given two competing theories for physical reality, e.g. Newtonian mechanics and Einsteinian relativity, a single experiment (indeed, even a ‘thought experiment’) may suffice to decisively favor one model over the other.

In this section we first characterize what learning looks like for deep neural networks and consider the choices that researchers make in order to support this sort of learning. Ultimately, we find that the resulting formulation is not well suited to the kind of rapid, scalable learning that general intelligence requires, and we present the corresponding claim below:

Claim 3: DL and Reflection

Lacking compositionality and strong typing, deep learning also cannot support meaningful reflection over proposed models of its target domain. The property of reflection allows for direct, structured updates to knowledge, which compares favorably to deep learning’s exponentially growing requirements for data and an undesirable dependence on end-to-end model design choices.

Knowledge and Optimization in Neural Networks

Much effort has been directed toward developing networks and optimization procedures which result in reliable training. This effort has paid off, since deep neural networks are universal function approximators: a class of models that can approximate any continuous function to arbitrary precision [56, 198]. Regardless, training remains nontrivial because of the high-dimensional, non-convex optimization involved. Consequently, neural networks continue to get larger at a rapid pace (e.g. [313]), often resulting in dramatic overparameterization. Work shows that models are able to fit almost any set of input-output pairs [203], including completely randomized ones [374]. Other interesting examples include the ability to compress networks while retaining accuracy [155], as well as the ‘lottery ticket hypothesis’ [93], which states that large networks regularly contain sparse subnetworks roughly one tenth the size which can themselves achieve equal test accuracy. Analysis of neural networks from an information-theoretic perspective [302] shows that generalization follows from a great deal of internal data compression, which is also consistent with the notion that networks are larger than they need to be.
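The overparameterization picture can be illustrated with one-shot magnitude pruning, the basic operation behind lottery-ticket experiments; this is a generic sketch of the idea, not the exact procedure of [93] (which additionally rewinds surviving weights to their initial values before re-training):

```python
import torch
import torch.nn as nn

def magnitude_mask(layer: nn.Linear, keep_fraction: float = 0.1) -> torch.Tensor:
    """Keep only the largest-magnitude weights of a layer (one-shot pruning)."""
    flat = layer.weight.detach().abs().flatten()
    k = max(1, int(keep_fraction * flat.numel()))
    threshold = flat.topk(k).values.min()
    return (layer.weight.detach().abs() >= threshold).float()

layer = nn.Linear(256, 256)
mask = magnitude_mask(layer, keep_fraction=0.1)
with torch.no_grad():
    layer.weight.mul_(mask)   # zero out roughly 90% of the weights
# During any subsequent re-training, the mask would be re-applied after each
# update so that pruned weights remain at zero.
```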

Gradient-based optimization can be considered a very simple form of reflection, wherein credit is assigned to individual parameters with respect to the error. For models using typed and compositional features, changes instead require maintenance of the associated semantics. Whereas the knowledge of deep learning models is tuned in subtle and unpredictable ways, the ability to reflect on representations provides a basis for ensuring semantic consistency. We argue that this stability during learning and this amenability to direct updates are necessary properties of learning in general intelligence.

Furthermore, consider that deep learning is designed to result in convergence of the values of parameters. After convergence, adaptation to new tasks or modifying the knowledge is difficult and/or costly. Hence, deep learning with gradient descent does not adequately account for the necessity of open-ended learning for general intelligence. A more reflective learning approach is necessary to re-calibrate existing representations and render them actionable with respect to new goals and constraints. After some exposure to the new domain, a reflective approach can consolidate two models which share underlying similarities or abstractions back into a single, self-consistent and more general model. We explore more details and examples of these propositions in Chap. 9.

4 Implications and Summary

We reiterate that the claims given at the beginning of this chapter do not detract from the merits of deep learning. To date, these methods have far outshone their predecessors in their ability to learn features from observational data and make valuable predictions from them. Most standard neural network architectures are also very amenable to parallelization and acceleration, making them practical for their current use cases. Hence, for some, the preceding discussion may not provide sufficient impetus to look beyond deep learning. For many practical applications in narrow domains, it suffices to have training methods which are applicable in the presence of relatively massive computational resources. In theory, recent work [81, 99, 236, 278, 309, 375] could yield representations which are causally disentangled and enjoy greater compositionality, but it seems likely that even the exponents of such forms of causal representation learning would admit that much progress is yet to be made.

We therefore recall the motivation which opened this chapter: the challenges for deep learning we have discussed arise in the context that the paradigm was not designed with general intelligence in mind. We can tie the challenges of machine learning together as being the symptoms of a ‘narrow framing of the problem’. The most salient part of the framing is that deep learning model parameters are intended to converge to some satisfactory optimum given the dataset and iterative learning procedure. Training a model with a priori knowledge of the desired outcome is fundamentally at odds with the notion of open-ended learning [340], an essential part of general intelligence.

It should be emphasized that, even for the task of constructing general intelligence, we do believe that deep learning may be the most sensible and practical way to implement very basic layers of perception on high-dimensional sensory inputs such as visual and audio feeds. Although we have emphasized the importance of compositionality and strong typing, we also acknowledge that they may not always be applicable at the level of individual pixels or waveform amplitudes. Instead, it should be clear that compositionality and strong typing become increasingly relevant when the subjects being represented can usefully be compressed into ‘concepts’ or ‘expressions’ rather than mere sensory samples.