1 Introduction

Cross-disciplinary investigations, such as when philosophers put artificial intelligence under scrutiny, are healthy, if not crucial. Any discipline has its blind spots, and sometimes it takes a new set of eyes to push research horizons onward. Needless to say, cross-disciplinary investigations require considerable knowledge of at least two scientific fields, and it is both brave and praiseworthy when researchers embark on such endeavors.

Landgrebe and Smith (2021) present a very critical analysis of contemporary language-centric artificial intelligence (natural language processing)—in particular of models based on the Transformer architecture (Vaswani et al., 2017). Their article has two parts: In Sects. 1 and 2, they present their analysis of Transformer models; in Sect. 3, they present an alternative approach to modeling language. I will focus mostly on Sects. 1 and 2, but also briefly discuss the approach advocated for in Sect. 3. In these sections, Landgrebe and Smith (2021) argue that Transformer models are insufficiently expressive, exhibit poor generalization, and will never acquire linguistic semantics, never ’understand’ language. There are many reasons to be critical of recent developments in artificial intelligence, but in this paper, I will argue that the diagnosis presented by Landgrebe and Smith (2021) is misleading in important respects, and I will show why, on the contrary, there are reasons to believe that Transformers suffer from none of the above weaknesses.

2 Understanding transformers

The most widely used models in natural language processing today rely on the Transformer architecture (Vaswani et al., 2017). This includes most popular pretrained language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020), and GPT-3 (Brown et al., 2020), but Transformer models are also used across a wide range of downstream applications, including syntactic parsing (Mohammadshahi & Henderson, 2021), summarization (Gu et al., 2019), and semantic parsing (Shiv & Quirk, 2019). We first present a rough outline of how Transformer models work, and then review how they are presented in Landgrebe and Smith (2021).

Transformer models are deep neural networks, typically composed of dozens of layers. Layers are commonly referred to as Transformer blocks. The neurons of each layer connect to the neurons of the next layer, leading to models in which learning amounts to carefully adjusting millions of numerical knobs. The input to a Transformer is typically a short text.Footnote 1 The first step is so-called tokenization of the text, a translation of the original input into a series of more meaningful entities, called tokens. There are many ways to do this, but the meaningful units roughly correspond to what we understand as words or morphemes. It is important to note that already at this point, the Transformer architecture injects an inductive bias by segmenting the input into words and morphemes. This inductive bias is linguistically motivated: While learned tokenizations do not always align perfectly with how linguists would break a text into meaningful units, they correlate much better with a gold-standard segmentation than a random segmentation would.
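To make the tokenization step concrete, here is a minimal sketch in Python of greedy longest-match subword segmentation over a toy vocabulary. The vocabulary and the example inputs are illustrative assumptions only; deployed tokenizers (e.g., BPE or WordPiece) learn their vocabularies from data and differ in detail.

```python
# A minimal sketch of greedy longest-match subword tokenization.
# The toy vocabulary below is purely illustrative; real tokenizers
# (BPE, WordPiece, unigram LM) learn their vocabularies from data.

TOY_VOCAB = {"un", "believ", "able", "phone", "good", "is", "huawei", "'s", "new", "the"}

def tokenize_word(word, vocab):
    """Split a single word into the longest vocabulary items, left to right."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible substring first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Unknown character: emit it as a single-character token.
            tokens.append(word[i])
            i += 1
    return tokens

def tokenize(text, vocab):
    return [tok for word in text.lower().split() for tok in tokenize_word(word, vocab)]

print(tokenize("unbelievable", TOY_VOCAB))               # ['un', 'believ', 'able']
print(tokenize("Huawei 's new phone is good", TOY_VOCAB))
```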

The input tokens are then translated into vectors that represent their meaning out of context. The token vectors are combined with vectors that represent where in the short text the different tokens were located. These vectors are called position encodings. It is the combined vectors, representing both the tokens and their positions in the text, that are sent through the stacked Transformer blocks.
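As an illustration, here is a minimal NumPy sketch of how token vectors and position encodings might be combined before entering the first Transformer block. The sinusoidal scheme follows Vaswani et al. (2017), but the toy dimensions and the random stand-in embeddings are assumptions made purely for illustration.

```python
import numpy as np

def sinusoidal_position_encodings(seq_len, d_model):
    """Position encodings as in Vaswani et al. (2017):
    even dimensions use sine, odd dimensions use cosine."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                    # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                      # (seq_len, d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16                 # toy sizes; real models use e.g. 512 or 1024
token_embeddings = rng.normal(size=(seq_len, d_model))    # stand-in for learned embeddings

# The input to the first Transformer block: token meaning plus token position.
block_input = token_embeddings + sinusoidal_position_encodings(seq_len, d_model)
print(block_input.shape)                 # (6, 16)
```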

What is a Transformer block then? A Transformer block is first and foremost a way of combining information about different tokens that takes into account that tokens may be more or less important in a particular context and with a specific purpose in mind. The word good has a particular meaning in the sentence Huawei’s new phone is good, and is particularly important if we were to decide whether the review provides a rationale for buying the phone or not. Consider now the sentence Many say that Huawei’s new phone is good, but I think it is average. The word good obviously has the same meaning, but is less important in deciding the sentiment of the sentence. The importance of the word good depends on the other words in the sentence, and the Transformer architecture presents a particular way of combining the encodings of the different words with a specific purpose in mind.

This is where the calculations in Transformer blocks get a little complicated. In brief, the vectors that represent situated tokens are multiplied by three different matrices of numbers. This gives us three new vectors \(u_i,v_i,w_i\) for each token \(t_i\). These are now combined with the vectors of other tokens. For each token \(t_i\), the first vector \(u_i\) is multiplied by the second vector \(v_j\) of each other token \(t_j\), giving us a scalar value that is used to weight that token’s third vector \(w_j\). Each token \(t_i\) is now represented by the weighted sum of these vectors \(w_j\). This summed vector contains not only information about the original word, but also about the context in which it appeared. In each stacked Transformer block, the same operation is repeated, so each layer contains increasingly abstract vector representations of the original text, and these vector representations have been found to contain useful information for a wide range of applications in natural language processing.
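The following is a minimal NumPy sketch of this operation for a single attention head, with the three learned matrices producing what are usually called queries, keys, and values (the \(u_i\), \(v_i\), and \(w_i\) above). The softmax normalization and the scaling by \(\sqrt{d_k}\) follow Vaswani et al. (2017); the toy sizes and random inputs are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (Vaswani et al., 2017).

    X       : (seq_len, d_model)  situated token vectors
    W_q/k/v : (d_model, d_k)      the three learned matrices from the text;
                                  queries/keys/values play the roles of u_i, v_i, w_i.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # one scalar per token pair (t_i, t_j)
    weights = softmax(scores, axis=-1)        # how important t_j is when encoding t_i
    return weights @ V                        # each token: weighted sum of the w_j vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 16, 8              # toy sizes for illustration
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (6, 8)
```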

Misunderstanding Transformers Landgrebe and Smith (2021), in their criticism, focus on what they see as the limited capacity of Transformer models, but their description of the capacity of Transformer models is factually wrong. They describe, for example, how a Transformer model ’encodes each single-sentence input into an encoding vector of 1024 real numbers.’ This is simply not true. In a Transformer, the input is a series of, say, 512 or 1024 tokens, each represented as, say, a 512- or 1024-dimensional vector, and these are then combined through stacked Transformer blocks of multi-headed self-attention into complex sentence representations.Footnote 2 Landgrebe and Smith (2021) also claim that ’these sentence embeddings lose relations between words within the sentence.’ As my presentation of the inner workings of Transformer blocks just showed, this is also false: Transformer models were specifically designed to model the relations between words. Note also that the dimensionality of a neural network is a hyper-parameter that can be modified, and that the capacity of such networks is bounded only by the available computational resources.

Landgrebe and Smith (2021) also point to a more serious limitation of (some) Transformer models, namely ’the discarding of all information pertaining to the contexts of the input sentences.’ As already mentioned, some applications of the Transformer architecture model sentences independently of each other. This clearly is an important limitation, preventing resolution of inter-sentential anaphora and bridging, or disambiguation based on preceding discourse. However, most applications of the Transformer architecture do model context, when relevant, typically in one of three ways: (a) simply processing larger chunks of text sequentially (Liu et al., 2019); (b) conditioning on context representations (Wang et al., 2020); or (c) applying Transformer models hierarchically (Ging et al., 2020). In sum, Landgrebe and Smith (2021) misrepresent Transformers in three ways: by claiming that Transformers have limited expressivity, that they fail to capture relations between words, and that they fail to model inter-sentential context. None of these claims are true. Note how all three pertain to inferential semantics.

3 Transformers’ understanding

The main argument presented by Landgrebe and Smith (2021) against using models such as Transformers in natural language processing has to do with referential semantics, and is reminiscent of Searle (1980) and similar thought experiments in philosophy of mind.Footnote 3 Landgrebe and Smith (2021) claim that a Transformer model is necessarily shallow because ’the vector space it uses is merely morpho-syntactical and lacks semantic dimensions.’ This is clearly a false statement under most definitions of morpho-syntax and semantics, since Transformer models obtain very good results on tasks such as topic classification and machine translation. If Transformer models only encoded morpho-syntactic information, they would not be able to distinguish between I just ate an apple and I never painted a lion,Footnote 4 making topic classification and machine translation blind guessing.

Here is what I think they mean, though: Transformer models at best learn inferential semantics, not referential semantics (Marconi, 1997).Footnote 5 Landgrebe and Smith (2021) define understanding language as seeing its relevance to actions and thoughts, and argue that this is what Transformer models cannot do. Seeing the relevance of words and phrases to actions and thoughts seems to decompose into the following two properties:

(a) Lexical representations are grounded in representations of the (physical, social, and mental) world; and

(b) the agent is aware or conscious of such grounding.Footnote 6

Inferential semantics refers to the part of semantics that is concerned with valid inferences. In lexical semantics, this involves establishing relations of synonymy, antonymy, hyponymy, etc. The output of such lexicographic exercises is often a database, which is best thought of as a graph with lexemes as nodes and with edges corresponding to lexical relations. The situation gets more complex at the sentence or discourse level, revolving around discourse relations such as entailment, contrast, consequence and explanation. Referential semantics is the part of semantics that concerns denotation, whether in terms of truth-conditions, mental representations or situations in which using a word is deemed appropriate.

Both (a) and (b) clearly concern referential semantics. Marconi (1997) remarks that Searle’s Chinese Room only applies to referential semantics, not inferential semantics, since category effects, word associations, etc., are unconscious processes. Landgrebe and Smith (2021) thus seem to agree with Searle (1980) on the inherent limitations of Transformer models: lack of proper grounding and lack of consciousness.Footnote 7 In Sect. 4, I will discuss handwritten grammars, which Landgrebe and Smith (2021) claim do not suffer from such limitations. (Searle, however, would say they do.)

Grounding The grounding problem (Harnad, 1990; Jackson & Sharkey, 1996) is the problem of learning a mapping from words and phrases to the objects and events they refer to, or to cognitive representations thereof. How can deep neural network models such as Transformer models learn to ground their representations in this way? Inferential semantics, i.e., the relations between words and sentences, is induced implicitly by most learning objectives used to train these models. The most commonly used learning objective today is perhaps masked language modeling (Devlin et al., 2019), but the same holds for the translation objective in Vaswani et al. (2017), for example. Transformer models can therefore be straightforwardly evaluated as models of inferential semantics. How can Transformer models encode referential semantics, though?

Well, the details of this will depend on what we think it is that linguistic expressions refer to. Let us, for example, assume linguistic expressions refer to vectors in an embedding space of neural activations or (fMRI/EEG) images thereof. If mental imagery is defined broadly enough, this should be compatible with some forms of internalist semantics (Rapaport, 1994; Schank & Colby, 1973), but note the vector space could also be a perceptual or physical space. Referential semantics or grounding now amounts to learning a mapping between the Transformer model vector space and this target space.Footnote 8

But why, you may ask, would language model vector spaces be isomorphic to representations of our physical, mental and social world? After all, language model vector spaces are induced merely from higher-order co-occurrence statistics. I think the answer is straightforward: Words that are used together tend to refer to things that, in our experience, occur together. When you tell someone about your recent hiking trip, you are likely to use words like mountain, trail, or camping. Such words, as a consequence, end up close in the vector space of a language model, while also being intimately connected in our mental representations of the world. If we accept the idea that our mental organization maps (is approximately isomorphic to) the structure of the world,Footnote 9 the world-model isomorphism follows straightforwardly (by closure of isomorphisms under composition) from the distributional hypothesis.
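To illustrate how a vector space can be induced from co-occurrence statistics alone, here is a minimal sketch that counts which words occur together in a toy corpus and embeds them via a truncated SVD. The corpus, the sentence-level co-occurrence window, and the two-dimensional embedding size are all illustrative assumptions; modern language models induce such structure implicitly rather than by explicit counting.

```python
import numpy as np
from itertools import combinations

# A toy corpus about hiking and phones; real distributional models use billions of tokens.
corpus = [
    "we hiked a mountain trail and went camping",
    "the trail up the mountain was steep",
    "camping near the trail after the hike",
    "the phone has a good camera",
    "a good new phone with a good screen",
]

# Count how often word pairs co-occur within the same sentence.
tokenized = [s.split() for s in corpus]
vocab = sorted({w for s in tokenized for w in s})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for sent in tokenized:
    for a, b in combinations(sent, 2):
        counts[index[a], index[b]] += 1
        counts[index[b], index[a]] += 1

# Low-dimensional embeddings via truncated SVD of the co-occurrence matrix.
U, S, _ = np.linalg.svd(counts)
embeddings = U[:, :2] * S[:2]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

# Words used together ("mountain", "trail") should end up closer than unrelated words.
print(cosine(embeddings[index["mountain"]], embeddings[index["trail"]]))
print(cosine(embeddings[index["mountain"]], embeddings[index["phone"]]))
```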

There is plenty of evidence that Transformer-based language models encode words in ways that are near-isomorphic to where neural activation occurs when listening to or reading these words (Pereira et al., 2018; Caucheteux & King, 2021),Footnote 10 to how our perceptual spaces are organized (Abdou et al., 2021; Patel & Pavlick, 2022), as well as to how physical spaces are organized (Liétard et al., 2021).Footnote 11 Such global similarities can also be induced from local ones: Wu et al. (2021) show how brain activity patterns of individual words are encoded in a way that facilitates analogical reasoning—the same analogical reasoning that language models facilitate. Such a property would in the limit entail that brain encodings are isomorphic to language model representations (Peng et al., 2020). To see this, consider an example of analogical reasoning: ’Berlin is to Germany, what Copenhagen is to ____’. In a language model, this is computed by subtracting the vector for Berlin from the sum of the vectors for Germany and Copenhagen, and returning the nearest neighbor of the resulting vector. For today’s language models, the result would most likely be the vector for Denmark. If you can compute all possible analogies using vector offset this way, you have induced a structure that is isomorphic to (the current geopolitical) reality. If you can compute the same analogies by offset of brain imaging vectors, these two spaces must be near-isomorphic. And language models can thus be grounded in brain imaging spaces.
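The vector-offset computation can be illustrated with a minimal sketch. The hand-crafted three-dimensional vectors below are an assumption made purely for exposition; in a real language model, the corresponding geometry is learned from corpora.

```python
import numpy as np

# Toy hand-crafted vectors (dimensions: "capital-ness", "Germany-ness", "Denmark-ness").
# Purely illustrative; real models learn such geometry from co-occurrence patterns.
vectors = {
    "Berlin":     np.array([1.0, 1.0, 0.0]),
    "Germany":    np.array([0.0, 1.0, 0.0]),
    "Copenhagen": np.array([1.0, 0.0, 1.0]),
    "Denmark":    np.array([0.0, 0.0, 1.0]),
    "Oslo":       np.array([1.0, 0.0, 0.0]),
}

def analogy(a, b, c, vectors):
    """a is to b what c is to ___  ->  nearest neighbor of b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: vectors[w] @ target /
                             (np.linalg.norm(vectors[w]) * np.linalg.norm(target) + 1e-9))

print(analogy("Berlin", "Germany", "Copenhagen", vectors))   # expected: Denmark
```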

To flesh out my argument that Transformers (and similar neural network architectures) can learn grounding, I present the Color Radio thought experiment, about the grounding of color terms:Footnote 12 Consider a common AM/FM radio receiver tuned in to a talk radio channel. The engineer who built the receiver augmented the device with a pattern recognition module or a modern language model, as well as a one-pixel camera. The radio wants nothing more than to learn the meaning of color terms. It therefore starts to consider the linguistic contexts in which these terms occur. Since the talk radio channel signal is not aligned with the input of its camera, it cannot use co-occurrence statistics to ground these terms in its color perception. Notice also that the problem of grounding color terms, in the eyes of Searle (1980), should be as impossible as learning to understand language in general. The representations of the receiver’s language model were induced ’in a vat’, so to speak. Pursuing its goal, nevertheless, the radio notices how terms such as yellow and turquoise occur in slightly different contexts, but also how other color terms such as violet and purple occur in very similar contexts. Technically, it computes the co-occurrence statistics of these color terms and embeds them in a low-dimensional vector space. After years of practice, it learns to represent colors in a way that is near-isomorphic to how humans perceive colors. Because its language model is contextualized, it even learns to correct for possible reporting biases. It has now learned the inferential semantics of color terms. The radio wants more, though. It also wants to learn the referential semantics of color terms, i.e., the mapping of color terms onto pixel values. If the color term representation is isomorphic to the camera’s representation of colors, it follows that, unless the color terms lie equidistantly on a sphere, we can induce a mapping, even in the absence of supervision, by straightforward methods that humans also seem to be endowed with.Footnote 13 The Color Radio thought experiment is designed to suggest the plausibility of unsupervised grounding, and is as such intended as a rebuttal of both Searle (1980) and Landgrebe and Smith (2021).

In sum, my argument for why (unsupervised) grounding of Transformers is possible, goes as follows:

Premise (P1): ’Transformer language model vector spaces are near-isomorphic across languages, and often also near-isomorphic to brain imaging, perceptual, and physical spaces.’

Premise (P2): ’Two near-isomorphic vector spaces can be aligned with minimal supervision, and often without supervision.’

Conclusion: ’Transformer language model vector spaces can be aligned with brain imaging, perceptual, and physical spaces with minimal supervision, and often without supervision.’

Both premises have empirical support, and the conclusion is derived by a simple application of modus ponens.
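To give a sense of why premise (P2) is plausible, here is a minimal sketch of aligning two near-isomorphic vector spaces with a handful of anchor pairs, using orthogonal Procrustes. The simulated spaces, the noise level, and the number of anchors are assumptions for illustration; fully unsupervised variants (e.g., adversarial or point-set registration methods) exist but are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Space A: e.g. a language model's word vectors (toy random stand-ins).
A = rng.normal(size=(50, 5))

# Space B: a near-isomorphic "grounded" space, here simulated as a rotation
# of A plus a little noise (standing in for perceptual or brain-imaging vectors).
true_rotation = np.linalg.qr(rng.normal(size=(5, 5)))[0]
B = A @ true_rotation + 0.01 * rng.normal(size=A.shape)

# Minimal supervision: a handful of anchor pairs (same indices in both spaces).
anchors = np.arange(10)

# Orthogonal Procrustes: the rotation W minimizing ||A[anchors] W - B[anchors]||.
U, _, Vt = np.linalg.svd(A[anchors].T @ B[anchors])
W = U @ Vt

# The learned mapping generalizes to the unseen points.
error = np.linalg.norm(A[10:] @ W - B[10:]) / np.linalg.norm(B[10:])
print(f"relative alignment error on held-out points: {error:.4f}")
```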

Awareness Landgrebe and Smith (2021)’s definition of understanding as seeing the relevance of words and phrases to actions and thoughts was shown above to decompose into grounding and awareness. I will argue, with Dennett (1987), that seeing awareness as a prerequisite for understanding rests on a category mistake (Ryle, 1938).Footnote 14 The category mistake of Searle, as well as of Landgrebe and Smith (2021), is to assume that language understanding can be equated with what we experience when we are aware of our language understanding. Understanding language, I argue, or linguistic meaning, if you prefer, does not belong to the category of private, conscious experiences, but to categories of processes that are orthogonal to consciousness. It is easy to conflate the two, because our introspection suffers from a severe sampling bias: When we think of instances of our own language production, we naturally tend to think about instances in which we were conscious of our language production.

Now ask yourself this: Does linguistic meaning really imply awareness of linguistic meaning? Does understanding really imply awareness of understanding? Dennett (1987) argued that Searle (1980) conflates understanding and awareness of understanding. Leibniz already emphasized the importance of processes of understanding and reflection that we are unaware of. It certainly seems possible to produce semantically fluent sentences in the absence of conscious thought, e.g., during sleep or under anesthesia (Webster, 2017). Patients that are unconscious—as defined by the Glasgow Coma Scale—reportedly react to and can remember verbal communication, even if they are not able to respond. Comatose patients also seem to comprehend language. Van den Bussche et al. (2009) present several experiments that suggest the possibility of unconscious language understanding, even when participants are fully awake. One of them is a lexical decision task, in which participants were exposed to sequences of letters and asked to classify these as words or non-words. Subliminal primes preceded the exposure. Some primes were semantically related, while others were completely unrelated. Semantically related primes were shown to lead to faster and more accurate responses. In another experiment, participants were asked to read target words aloud, and related subliminal primes were again shown to facilitate reading.Footnote 15 This all suggests that meaning does not require conscious reflection on relevance and attribution. And if so, machines simply do not need consciousness to acquire linguistic meaning.

That we are prone to this category mistake is unsurprising: When we recollect memories of communicating with others, memories of understanding what others were saying, we almost by definition recall events in which we were in fact aware of this process of understanding. It is much easier to recall events you were conscious of than events you were not. Our introspection thus suffers from severe sampling bias, so to speak. This holds true for things we do. Consider the common experience of unconscious driving. You jump on your bike or get into your car to drive to work, but quickly find yourself immersed in thoughts. Perhaps you are preparing yourself for a meeting later that day, or you are thinking about the movie you saw last night. Moments later you park in front of your office, with no recollection of how you made it there. Presumably you navigated through crossings and roundabouts, stopped at traffic lights, etc., but none of this required conscious effort. Nevertheless, if you were asked what it feels like to bike to work, you would likely recall events in which you were conscious of biking to work.

My argument for why awareness is irrelevant to the ability of Transformers to learn referential semantics is thus simple: understanding does not require awareness of understanding. This follows directly from the empirical observation that language understanding can be unconscious.

Approximation Transformer models are induced from finite amounts of data and hence approximative. If trained on more (representative) data, they will likely learn to better approximate the inferential and referential aspects of semantics. Landgrebe and Smith (2021) find this disturbing and write: ’Even at their very best, they remain approximative, and so any success they achieve is still, in the end, based on luck.’ This, though, is a fallacy. Mastery of archery or talent for counseling is also approximative, but not a matter of luck. While no one—neither a chess computer nor a grand champion—is able to compute the optimal next chess move in real time, because of the doubly exponential search space, a skilled chess player will nevertheless beat me a hundred games in a row. While their craft is in the same sense approximative, any attempt to reduce the difference between us to luck would be ridiculous. Human language acquisition, by the way, is also approximative. This was the most prominent counter-argument against another classical argument for the impossibility of machine understanding of language, namely Gold’s Theorem (Gold, 1967). In fact, Gold’s Theorem seems to provide some motivation for saying that approximation is necessary for language understanding. This follows from the fact that language is a moving target, and that members of a linguistic community exhibit a great deal of variation, speaking slightly different dialects, sociolects, and idiolects. A learning algorithm that iterated through all possible grammars and only discriminated between (exactly) correct and incorrect ones would never terminate in the face of such variation.

That is: I argue that the approximative nature of Transformer models, like their possible lack of awareness, is orthogonal to their ability to learn referential semantics. This follows from two relatively uncontroversial assumptions, namely that language exhibits drift and inter-speaker variation, and that this makes it impossible to identify a language exactly:

Premise (P1): ’Language is a moving target, over time and between speakers.’

Premise (P2): ’Moving targets can be approximated, not modeled exactly.’

Conclusion: ’Language models can only approximate language.’

The Robustness of Transformers In a final point of criticism, Landgrebe and Smith (2021) also suggest that Transformer models

  • will quickly become invalid if the input-output relationship changes on either side even in some minor way. This is because the model does not generalize. Once fed with data as input that do not correspond to the distribution it was trained with, the model will fail.

Landgrebe and Smith (2021) are here concerned with the robustness of deep neural networks, e.g., Transformer models, under distributional shift. This is an important subfield of artificial intelligence, and many researchers have devoted their careers to learning good models under distributional shift. Sometimes this literature is referred to as domain adaptation or transfer learning (Søgaard, 2013). While domain adaptation remains a challenge, language models based on Transformers are among the most robust models in artificial intelligence, and it is certainly false to say that they become invalid if the input-output relationship changes moderately.Footnote 16 In other words: The claim that Transformers generally exhibit poor generalization and low performance is inconsistent with empirical observations.

In Sect. 2, I showed how Landgrebe and Smith (2021) misrepresented how inference works in Transformers in three ways. In this section, I have discussed two further claims by Landgrebe and Smith (2021), pertaining to the learning capacity of Transformers: that they cannot acquire referential semantics, and that they cannot learn to generalize outside of their training data. I presented a mixture of arguments and empirical evidence in an attempt to refute both claims. Moreover, along the way, I also discussed how awareness is generally not a prerequisite for understanding, and how the fact that machine learning models, including Transformer models, are approximate by nature by no means disqualifies them as models of language. I summarize my discussion of Landgrebe and Smith (2021)’s critique of Transformers in the table below.

Table 1 Landgrebe and Smith (2021)’s claims about Transformer architectures. Transformers are approximative, but this is not prohibitive of language understanding

4 Handwritten grammars

Earlier critiques of Transformers and related architectures in natural language processing focused on showing that language understanding is unlearnable from raw text, i.e., in the absence of supervision (Bender & Koller, 2020), that language models based on Transformers are uninterpretable (Boge, 2021), or that they tell us nothing about linguistic competencies (Dupre, 2021). Landgrebe and Smith (2021) argue that language understanding is unlearnable for Transformers, even with supervision. They are not interested in interpretability or the ability to distill linguistic theories of competence, merely in the learning of inferential and referential semantics.Footnote 17 This section briefly discusses the alternative that Landgrebe and Smith (2021) propose to such deep neural learning architectures: handwritten grammars mapping sentences to logical form. I will argue that if language understanding is out of reach for deep neural network architectures, it must also be out of reach for handwritten grammars with logical form.

The approach of Landgrebe and Smith (2021) is a pipeline approach. They first use a shallow form of syntactic analysis called part-of-speech tagging to induce the syntactic categories of the input words in context. The authors then rely on a ’proprietary AI-algorithm chain that uses world knowledge in the form of a dictionary of lexemes and their word forms along with associated rules, relating, for example, to the transitivity and intensionality of verbs’.Footnote 18 This proprietary algorithm chain maps the input to logical form, a process which ’requires world knowledge, for example about temporal succession, which is stored in the computer using ontologies’.

How would this approach to text processing or text generation be more meaningful than Transformer models? One argument that perhaps it is not runs as follows: Assume a handwritten grammar g, following the pipeline approach of Landgrebe and Smith (2021). Assume also that g ’understands’ language. The Transformer architecture is Turing-complete (Pérez et al., 2019). This means that there is a translation function \(\tau \) from any handwritten grammar that can be implemented as a Turing machine into an isomorphic Transformer, i.e., \(\tau (g)=t\). If g ’understands’ language, so does t. So for any handwritten grammar that understands language, there exists a Transformer model that also understands language. Q.E.D.

In fact, the steps of the pipeline approach in Landgrebe and Smith (2021) have all been modeled by Transformer architectures.Footnote 19 Probing experiments suggest that even moderate-sized Transformer-based language models learn similar pipelines from just doing masked language modeling at scale (Tenney et al., 2019). Transformers could also be trained specifically to simulate the pipeline approach of Landgrebe and Smith (2021). Since this form of teacher-student training (Fan et al., 2018) can be done on raw text, the Transformer model would in the limit become functionally indistinguishable from the pipeline system. For Searle (1980), none of these steps would capture linguistic meaning. For Landgrebe and Smith (2021), however, the trouble is that they cannot have it both ways: If you think a grammar mapping sentences into logical form can capture linguistic meaning, you have to admit that the same is possible for Transformer models and other forms of deep neural networks.
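To make the teacher-student idea concrete, here is a minimal sketch of the distillation objective: the student is pushed towards the teacher's output distribution on raw text. The toy logits and the temperature value are assumptions for illustration; this is not the specific setup of Fan et al. (2018).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's.
    Minimizing this over raw text pushes the student (e.g. a Transformer)
    towards the teacher's input-output behaviour (e.g. a pipeline system)."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    return np.sum(t * (np.log(t + 1e-12) - np.log(s + 1e-12)), axis=-1).mean()

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))   # toy outputs of the "teacher" over 10 labels
student_logits = rng.normal(size=(4, 10))   # toy outputs of the "student"
print(distillation_loss(student_logits, teacher_logits))
print(distillation_loss(teacher_logits, teacher_logits))  # zero when student matches teacher
```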

Somewhat surprisingly, Landgrebe and Smith (2021) do not discuss the fact that the classical arguments of Searle and Dreyfus against the possibility of machine understanding of language were presented with such handwritten grammars in mind. I think Transformers and related neural architectures present real advantages over handwritten grammars. These advantages have nothing to do with expressivity, word-word interactions, or context-sensitivity, but with explanatory power. Transformers can be used to make theories of learning testable, while handwritten grammars cannot. Consider, for example, the hypothesis that the semantics of directionals is not learnable from next-word prediction alone. Such a hypothesis can be falsified by training Transformer language models and seeing whether their representation of directionals is isomorphic to directional geometry; see Patel and Pavlick (2022) for details. Transformers and related architectures, in this way, provide us with practical tools for evaluating hypotheses about the learnability of linguistic phenomena.

5 Concluding remarks

I have argued that Transformers and related architectures seem able to learn both inferential and referential semantics. Clearly, you can do more with language than inferential and referential semantics, and some of these things are well beyond what you can ask a language model to do. If I ask you to walk like a penguin, I ask you to do something that language models cannot do. What we do with language is to many an important part of its meaning, and if so, language models learn only part of the meaning of language. Many linguists and philosophers have tried to distinguish between referential semantics and such embedded practices. Wittgenstein (1953), for example, would think of referential semantics—or the ability to point—as a non-privileged practice. While Wittgenstein does not give special attention to this ’pointing game’, it has played an important role in psycholinguistics and anthropology, for example. Language models play many language games better than us, e.g., writing poetry or jokes, translating or summarizing texts, or spotting grammatical errors—but the pointing game has been the litmus test for machine understanding of language since Searle’s Chinese Room, and it is widely used to probe for lexical semantics.

Language models have other limitations: You cannot encode the precise semantics of second-order quantifiers like most of in vector space. For a finite set of pairs of sets, a language model can learn the right inferences, e.g., that most members of A are also members of B, but only for a limited set of cases. So what do we make of a language model that can do the pointing game, as well as the other games just mentioned, but can only decide whether most members of A are also in B if A and B are sufficiently small? My answer is: Well, what would we make of, say, a 14-year-old child with the same skills? If a 14-year-old child can point to the referents of Italian nouns, translate Italian sentences into another language, and summarize documents written in Italian, but can only decide whether la maggior parte delle A sono B for small A and B, would you not say this child still speaks Italian? The requirement that you can apply all words correctly in all cases is a very high bar for saying someone understands a language, just as knowing that a strawberry is a nut is not generally seen as a test of one’s ability to understand English. Also, recall that Landgrebe and Smith (2021) are not claiming that Transformers have insufficient levels of referential semantics. Rather, they claim Transformers have no referential semantics. In other words, any signs of referential semantics would challenge their claims.

I have other, more serious quarrels with Transformers: They are slow and costly to train, with terrible carbon footprints, and they exhibit slow inference times. They generally require GPUs, which are inaccessible in many parts of the world. The word segmentation algorithms and positional encoding schemes typically used in conjunction with the Transformer architecture are biased toward fusional (mostly Indo-European) languages. Each of these points is reason to consider alternatives to Transformer models. The arguments put forward by Landgrebe and Smith (2021) against Transformer models, however, are problematic.

One contribution of this work was the defense of Transformers and related neural architectures against a series of false claims, namely that they exhibit limited expressivity, are unable to capture word-word dependencies, are not sensitive to context, and do not generalize well. Another contribution was an in-depth discussion of a further claim presented by Landgrebe and Smith (2021), namely that Transformer models are incapable of understanding language, in the sense of ’latching on’ to the world. I introduced a distinction between inferential and referential semantics, originally presented by Marconi (1997), making it clear that this argument only concerns referential semantics. I then pointed to a recent finding in the artificial intelligence literature: the observation that unsupervised alignment of isomorphic representations enables grounding of language models in mental representations or representations of the physical world. This observation makes referential semantics in neural networks possible, under very permissive assumptions. All that such grounding requires is learning a linear projection into the mental or physical space. This is sufficient, since language model vector spaces have been shown to be near-isomorphic to mental, perceptual, and physical spaces. Projections into such spaces can easily be learned when supervision is available, using point-set registration or graph alignment algorithms, but it has also been shown that this can be done even in the absence of supervision, e.g., with generative adversarial networks. I offered the Color Radio thought experiment to give some intuition for how such grounding could be obtained in practice.

In Sect. 4, I addressed the hybrid pipeline approach to natural language understanding advanced by Landgrebe and Smith (2021), showing that any of the components of their pipeline could be replaced by Transformers without changing the underlying function. Finally, I discussed other limitations of modern-day language models: They are slow and costly to train, have terrible carbon footprints, exhibit slow inference, and require costly GPUs. This is all orthogonal to my discussion of Landgrebe and Smith (2021), of course. There are also obvious limitations to what you can cram into a vector, e.g., the semantics of second-order quantifiers. The question is whether this is relevant for language understanding. I leave this question—as well as the question of how such semantics would be represented in our long-term memory—open for now.