1 Introduction

The introduction of transformers (Vaswani et al., 2017) with multiple attention heads, pre-trained with large-scale word embeddings, has revolutionised NLP. They have yielded near or above human performance on a variety of core NLP tasks, which include, among others, machine translation, natural language generation, question answering, image captioning, text summarisation, and natural language inference (NLI). Transformers operate as autoregressive token predictors (GPT-1 through GPT-4, OpenAI), or as bidirectional predictors of masked tokens in context, such as BERT (Devlin et al., 2019). They significantly outperform earlier deep learning architectures, like Long Short-Term Memory Recurrent Neural Networks (LSTMs) and Convolutional Neural Networks (CNNs), on most natural language tasks, and on many non-linguistic applications, such as image recognition.
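
To make the contrast between these two training objectives concrete, the following is a minimal sketch, assuming the Hugging Face transformers library and its publicly released gpt2 and bert-base-uncased checkpoints (names not drawn from the text); it illustrates next-token prediction versus masked-token prediction, not any specific system discussed here.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library and
# publicly released checkpoints ("gpt2", "bert-base-uncased"); it contrasts
# the two training objectives rather than reproducing any cited system.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForMaskedLM

# Autoregressive (GPT-style): predict the next token from the left context only.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gpt_tok("The committee has approved the", return_tensors="pt")
with torch.no_grad():
    next_logits = gpt(**ids).logits[0, -1]            # scores for the next token
print("GPT-2 continuation:", gpt_tok.decode([next_logits.argmax().item()]))

# Bidirectional (BERT-style): predict a masked token from both left and right context.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
masked = bert_tok("The committee has [MASK] the proposal.", return_tensors="pt")
with torch.no_grad():
    logits = bert(**masked).logits
pos = (masked["input_ids"][0] == bert_tok.mask_token_id).nonzero(as_tuple=True)[0].item()
print("BERT fill for [MASK]:", bert_tok.decode([logits[0, pos].argmax().item()]))
```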

Transformers provide the LLMs that drive chatbots. The rapid success of these bots in generating extended sequences of coherent, human-like discourse in response to prompts has produced vigorous debate in both the scientific literature and the popular media. Some of this discussion consists of exaggerated claims about the capabilities of LLMs. Other comments offer pat dismissals of these systems as nothing more than artificial parrots repeating training data. It is important to consider LLMs in a critical and informed way, in order to understand their abilities and their limitations.

In Sect. 2 I take up several of the more prominent weak arguments that have been brought against LLMs. These include

(i) the view that they simply return their training data,
(ii) the claim that they cannot capture linguistic meaning due to the absence of semantic grounding,
(iii) the assertion that they do not acquire symbolic representations of knowledge, and
(iv) the statement that they do not learn in the way that humans do.

In Sect. 3 I consider some of the strong arguments concerning the limitations of LLMs. These involve

(i) important constraints on LLMs as sources of insight into human cognitive processes,
(ii) the lack of robustness in LLM performance on NLI tasks,
(iii) the unreliability of LLMs as a source of factually sound information,
(iv) the inability of LLMs to identify universal patterns characteristic of natural languages,
(v) the consequences of the large quantities of data required to train LLMs for control over the architecture and development of these systems, and
(vi) the opacity of these systems.

Section 4 draws conclusions concerning the capacities and limitations of LLMs. It suggests possible lines of future research in deep learning in light of this discussion.

2 Weak Arguments Against LLMs

2.1 Generalisation, Innovation, and Semantic Grounding

A common criticism of LLMs is that they do little more than synthesise elements of their training data to produce the most highly valued response to a prompt, as determined by their probability distribution over the prompt and the data. Bender et al. (2021) and Chomsky et al. (2023) offer recent versions of this view. This claim is false, given that transformers exhibit subtle and sophisticated pattern identification and inference.

This inductive capacity permits them to excel at medical image analysis and diagnostics (Shamshad et al., 2023). Transformers have revolutionised computational biology by predicting properties of proteins and new molecular structures (Chandra et al., 2023). This has opened the way for the use of deep learning for the development of new medications and clinical treatments.

Dasgupta et al. (2023) use Reinforcement Learning (RL) to train artificial agents with LLMs to respond appropriately to complex commands in a simulated visual environment. These commands do not correspond to information or commands in their training set.

Bender and Koller (2020) argue that LLMs cannot capture meaning because they are not semantically grounded: their word embeddings are generated entirely from text. Hence they cannot identify speakers’ references to objects in the world or recognise communicative intentions. Sahlgren and Carlsson (2021), Piantadosi and Hill (2022), and Søgaard (2023) reply to this argument by observing that learning the distributional properties of words in text does provide access to central elements of interpretation. These properties specify the topology of lexical meaning.
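
As a toy illustration of how distributional co-occurrence patterns encode aspects of lexical meaning, the sketch below builds co-occurrence vectors from an invented mini-corpus and compares them with cosine similarity; the corpus, window size, and word choices are illustrative assumptions, not drawn from the cited work.

```python
# A toy sketch (invented mini-corpus) of how distributional co-occurrence
# statistics place words with related meanings close together in vector space.
from collections import Counter
import math

corpus = ("the cat chased the mouse . the dog chased the cat . "
          "the cat drank milk . the dog drank water .").split()

def cooccurrence_vector(target, window=2):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for i, w in enumerate(corpus):
        if w == target:
            for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                if j != i:
                    counts[corpus[j]] += 1
    return counts

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in set(u) | set(v))
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

cat, dog, milk = (cooccurrence_vector(w) for w in ("cat", "dog", "milk"))
print("cat ~ dog :", round(cosine(cat, dog), 2))   # similar contexts, higher score
print("cat ~ milk:", round(cosine(cat, milk), 2))  # different contexts, lower score
```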

Even if one accepts Bender and Koller’s view that grounding is a necessary element of interpretation, this does not establish their claim that LLMs are unable to represent interpretations of natural language expressions. It is possible to construct multi-modal word embeddings that capture the distributional patterns of expressions in visual and other modalities. Multi-modal BERT (Lu et al., 2019) and GPT-4 (OpenAI, 2023) are pre-trained on such embeddings. Multi-modal transformers can identify elements of a graphic image and respond to questions about them. The dialogue in Fig. 1, from OpenAI (2023), illustrates the capacity of an LLM to reason about a complex photographic sequence, and to identify the humour in its main image.

Fig. 1 ChatGPT-4 Multi-Modal Dialogue, from OpenAI (2023)

Bender and Koller (2020) do raise important questions about what a viable computational model of meaning and interpretation must achieve, and their paper has stimulated a fruitful debate on this issue. The classical program of formal semantics (Davidson, 1967; Montague, 1974) seeks to construct a recursive definition of a truth predicate that entails the truth conditions of the declarative sentences of a language. Lappin (2021) observes that a generalised multi-modal deep neural network (DNN) achieves a major part of this program by pairing a suitable graphic (or other modal) representation with a sentence that describes a situation.
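
As a concrete, if highly simplified, illustration of the classical program (not taken from the cited sources), the sketch below gives a recursive truth definition for a toy propositional fragment, evaluated against a simple set-based model of a situation.

```python
# A toy sketch of a recursive truth definition for a propositional fragment.
# The fragment and the "situation" (model) are invented for illustration.
from typing import Union

# Formulas: a proposition letter, or ("not", f), ("and", f, g), ("or", f, g).
Formula = Union[str, tuple]

def true_in(model: set, phi: Formula) -> bool:
    """Recursively defined truth predicate: true_in(M, phi) iff M satisfies phi."""
    if isinstance(phi, str):                      # atomic sentence
        return phi in model
    op = phi[0]
    if op == "not":
        return not true_in(model, phi[1])
    if op == "and":
        return true_in(model, phi[1]) and true_in(model, phi[2])
    if op == "or":
        return true_in(model, phi[1]) or true_in(model, phi[2])
    raise ValueError(f"unknown operator: {op}")

# A situation in which it is raining and the ground is wet.
situation = {"it_is_raining", "ground_is_wet"}
print(true_in(situation, ("and", "it_is_raining", "ground_is_wet")))  # True
print(true_in(situation, ("not", "it_is_raining")))                   # False
```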

2.2 Symbolic Representations, Grammars, and Hybrid Systems

Marcus (2022) and Chomsky et al. (2023) maintain that LLMs are defective because they do not represent the symbolic systems, specifically grammars, that humans acquire to express linguistic knowledge. They regard grammars as the canonical form for expressing linguistic knowledge.

This claim is question-begging. It assumes what is at issue in exploring the nature of human learning. It is entirely possible that humans acquire and encode knowledge of language in non-symbolic, distributed representations of linguistic (and other) regularities, rather than through symbolic algebraic systems. Smolensky (1987) and McClelland (2016), among others, suggest a view of this kind. Some transformers implicitly identify significant amounts of hierarchical syntactic structure and long-distance relations (Lappin, 2021; Goldberg, 2019; Hewitt & Manning, 2019; Wilcox et al., 2023; Lasri, 2023). In fact Lappin (2021) and Baroni (2023) argue that DNNs can be regarded as alternative theoretical models of linguistic knowledge.

Marcus (2022) argues that, to learn as effectively as humans do, DNNs must incorporate symbolic rule-based components, as well as the layers of weighted units that process vectors through the functions that characterise deep learning. It is widely assumed that such hybrid systems will substantially enhance the performance of DNNs on tasks requiring complex linguistic knowledge. In fact, this is not obviously the case.

Tree DNNs incorporate syntactic structure into a DNN, either directly through its architecture, or indirectly through its training data and knowledge distillation. Socher et al. (2011), Bowman et al. (2016), Yogatama et al. (2017), Choi et al. (2018), Williams et al. (2018), Maillard et al. (2019), and Ek et al. (2019) consider LSTM-based Tree DNNs. These have been applied to NLP tasks like sentiment analysis, NLI, and the prediction of human sentence acceptability judgments. In general they do not significantly improve performance relative to their non-tree counterparts, and, in at least one case (Ek et al., 2019), performance was degraded in comparison with the non-enriched models.
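
To make concrete what incorporating syntactic structure directly into the architecture involves, the sketch below composes constituent vectors over a parse tree in the spirit of the recursive networks of Socher et al. (2011); the dimensions, random weights, and example parse are invented for illustration.

```python
# A minimal sketch of tree-structured composition in the spirit of recursive
# neural networks (e.g. Socher et al., 2011): each constituent's vector is
# computed from its children, so the computation mirrors the parse tree.
# Dimensions, random weights, and the example parse are invented.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                         # embedding dimension
W = rng.normal(scale=0.1, size=(d, 2 * d))    # composition weights
b = np.zeros(d)
embed = {w: rng.normal(size=d) for w in ["the", "cat", "chased", "a", "mouse"]}

def compose(node):
    """node is either a word (leaf) or a pair (left_subtree, right_subtree)."""
    if isinstance(node, str):
        return embed[node]
    left, right = node
    child_vecs = np.concatenate([compose(left), compose(right)])
    return np.tanh(W @ child_vecs + b)        # parent vector from its children

# Parse tree for "the cat chased a mouse": ((the cat) (chased (a mouse)))
tree = (("the", "cat"), ("chased", ("a", "mouse")))
sentence_vector = compose(tree)
print(sentence_vector.shape)                  # (8,): one vector for the root constituent
```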

More recent work has incorporated syntactic tree structure into transformers like BERT, and applied these systems to a broader range of tasks (Sachan et al., 2021; Bai et al., 2021). The experimental evidence on the extent to which this addition improves the capacity of transformers to handle these tasks remains inconclusive. Future work may well show that hybrid systems of the sort that Tree DNNs represent do offer genuine advantages over their non-enriched counterparts. The results achieved with these systems to date have not yet motivated this claim.

2.3 Humans Don’t Learn Like That

Marcus (2022) and Chomsky et al. (2023) claim that DNNs do not capture the way in which humans achieve knowledge of their language. In fact this remains an open empirical question. We do not yet know enough about human learning to exclude strong parallels between the ways in which DNNs and humans acquire and represent linguistic knowledge, and other types of information.

But even if the claim is true, it need not constitute a flaw in DNN design. These are engineering devices for performing NLP (and other) tasks. Their usefulness is evaluated on the basis of their success in performing these tasks, rather than on the way in which they achieve these results. From this perspective, criticising DNNs on the grounds they do not operate like humans is analogous to objecting to aircraft because they do not fly like birds.

Moreover, the success of transformers across a broad range of NLP tasks does have cognitive significance. These devices demonstrate one way in which the knowledge required to perform these tasks can be obtained, even if it does not correspond to the procedures that humans apply. These results have important consequences for debates over the types of biases that are, in principle, needed for language acquisition and other kinds of learning.

3 Strong Arguments Against LLMs

3.1 LLMs as Models of Human Learning

Piantadosi (2023) claims that LLMs provide a viable model of human language acquisition. He suggests that they provide evidence against Chomsky’s domain-specific innatist view, which posits a “language faculty” that encodes a Universal Grammar. This is due to their ability to learn implicit representations of syntactic structure and lexical semantics without strong linguistic learning biases.

In Sect. 2.3 I argued that the fact that LLMs may learn and represent information differently from humans does not entail a flaw in their design. Moreover, they do contribute insight into cognitive issues concerning language acquisition by indicating what can, in principle, be learned from available data with the inductive procedures that DNNs apply to this input. However, Piantadosi’s claims go well beyond the evidence. His conclusions seem to imply that humans do, in fact, learn in the way that DNNs do. This is not obviously the case.

Moreover, although we do not know precisely how humans acquire and represent their linguistic knowledge, we do know that they learn with far less data than DNNs require. As Warstadt and Bowman (2023) observe, LLMs are trained on orders of magnitude more data than human learners have access to. Humans also require interaction in order to achieve knowledge of their language (Clark, 2023), while LLMs are trained non-interactively, with the possible exception of reinforcement learning from human feedback, which only partially simulates such interaction.
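
A rough, back-of-the-envelope comparison makes the scale gap concrete; the figures below are illustrative order-of-magnitude assumptions (roughly GPT-3-scale training and commonly cited estimates of child-directed speech), not numbers taken from the paper.

```python
# Back-of-the-envelope comparison of training data scales. The figures are
# rough, illustrative assumptions, not taken from the paper: a child hearing
# on the order of ten million words per year, versus an LLM trained on
# hundreds of billions of tokens.
words_per_year_child = 10_000_000            # assumed order of magnitude
years_to_adolescence = 13
human_exposure = words_per_year_child * years_to_adolescence   # ~1.3e8 words

llm_training_tokens = 300_000_000_000        # assumed, roughly GPT-3 scale

ratio = llm_training_tokens / human_exposure
print(f"Human exposure : ~{human_exposure:.1e} words")
print(f"LLM training   : ~{llm_training_tokens:.1e} tokens")
print(f"Ratio          : ~{ratio:,.0f}x (several orders of magnitude)")
```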

Warstadt and Bowman argue convincingly that in order to assess the extent to which deep learning corresponds to human learning it is necessary to restrict LLMs to the sort of data that humans use in language acquisition. This involves reconfiguring both the training data and the learning environment for DNNs, to simulate the language learning situation that humans encounter. Only an experimental context of this kind will illuminate the extent to which there is an analogy between deep and human learning.

3.2 Problems with Robust NLI

Transformers have scored well on natural language inference benchmarks. However, they are easily derailed, and their performance can be reduced to near chance by adversarial testing involving lexical substitutions. Talman and Chatzikyriakidis (2019) show that BERT does not generalise well to new data sets for NLI. Conversely, Talman et al. (2021) report that BERT continues to achieve high scores when fine-tuned and tested on corrupted data sets containing nonsense sentence pairs. These results suggest that BERT is not learning inference through semantic relations between premisses and conclusions. Instead it appears to be identifying certain lexical and structural patterns in the inference pairs.
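
The sketch below illustrates the kind of lexical-substitution probe at issue; nli_predict is a hypothetical placeholder for any fine-tuned NLI classifier (not a real library call), and the sentence pairs are invented.

```python
# A sketch of an adversarial lexical-substitution probe for NLI. `nli_predict`
# is a hypothetical stand-in for any fine-tuned NLI classifier (not a real
# library call); the sentence pairs are invented for illustration.
from typing import Callable

def probe_substitutions(nli_predict: Callable[[str, str], str]):
    premise = "A man is buying a violin at the market."
    hypothesis = "A man is buying an instrument at the market."      # gold: entailment
    # Swapping the hypernym for an unrelated noun should remove the entailment;
    # a model relying mainly on word overlap often fails to notice.
    perturbed_hypothesis = "A man is buying a trumpet at the market."  # gold: non-entailment

    original = nli_predict(premise, hypothesis)
    perturbed = nli_predict(premise, perturbed_hypothesis)
    return original, perturbed

# Example with a dummy overlap-based "model" that ignores lexical semantics:
def overlap_baseline(premise: str, hypothesis: str) -> str:
    shared = set(premise.lower().split()) & set(hypothesis.lower().split())
    return "entailment" if len(shared) >= 6 else "neutral"

# Labels both pairs "entailment", so the prediction fails to flip under the substitution.
print(probe_substitutions(overlap_baseline))
```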

Dasgupta et al. (2022) point out that humans also frequently make mistakes in inference. However, their reasoning abilities are more robust, even under adversarial testing. Human performance declines more gracefully than that of transformers under generalisation challenges, and it is more sharply degraded by nonsense pairs, as one would expect if inference is driven by meaning. Dasgupta et al. (2022) discuss work that pre-trains models on abstract reasoning templates to improve their performance in NLI. It is not clear to what extent the natural language inference abilities of transformers constitute more than superficial inductive generalisation over labelled patterns in their training corpora. Mahowald et al. (2023) review substantial evidence for the claim that LLMs do not perform well on extra-linguistic reasoning and real-world knowledge tasks.

3.3 LLMs Hallucinate

LLMs are notorious for hallucinating plausible-sounding narratives that have no factual basis. In a particularly dramatic example of this phenomenon, ChatGPT recently went to court as a legal expert. A lawyer representing a passenger suing the airline Avianca cited six legal precedents as part of the evidence for his case (Weiser, 2023). The judge was unable to verify any of the precedents. When the judge demanded an explanation, the lawyer admitted to using ChatGPT to identify the legal precedents that he required. He went on to explain that he “even asked the program to verify that the cases were real. It had said yes.”

The fact that LLMs do not reliably distinguish fact from fiction makes them dangerous sources of (mis)information. Notice that semantic grounding through multi-modal training does not, in itself, solve this problem. The images, sounds, etc. to which multi-modal transformers key text do not ensure that the text is factually accurate. The non-linguistic representations may themselves be artificially generated. A description of the image of a unicorn may accurately describe that image; it does not characterise an animal in the world.

3.4 Linguistic Universals

Natural languages display universal patterns, or tendencies, involving word order, morphology, phonology, and lexical semantic classes. Many of these can be expressed as conditional probabilities specifying that a property will hold with a certain likelihood, given the presence of other features in the language.
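
As an illustration of what such a conditional statement looks like, the sketch below estimates a Greenberg-style word-order tendency from a toy language sample; the counts and feature values are invented, not real typological data.

```python
# A toy illustration of a statistical universal stated as a conditional
# probability, in the style of Greenbergian word-order tendencies. The
# language sample and feature values are invented, not real typological data.
toy_sample = [
    # (has_prepositions, verb_before_object)
    (True,  True), (True,  True), (True,  True), (True,  False),
    (False, False), (False, False), (False, True), (False, False),
]

with_prep = [vo for prep, vo in toy_sample if prep]
p_vo_given_prep = sum(with_prep) / len(with_prep)

without_prep = [vo for prep, vo in toy_sample if not prep]
p_vo_given_postp = sum(without_prep) / len(without_prep)

# The "universal" is the asymmetry between the two conditional probabilities.
print(f"P(VO | prepositions)  = {p_vo_given_prep:.2f}")    # 0.75 in this toy sample
print(f"P(VO | postpositions) = {p_vo_given_postp:.2f}")   # 0.25 in this toy sample
```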

Gibson et al. (2019) and Kågebäck et al. (2020) suggest that information theoretic notions of communicative efficiency can explain many of these universals. This type of efficiency involves optimising the balance between brevity of expression and complexity of content. As there are alternative strategies for achieving such optimisations, languages will exhibit different clusters of properties.

Chaabouni et al. (2019) report experiments showing that LSTM communication networks display a preference for anti-efficient encoding of information in which the most frequent expressions are the longest, rather than the shortest. They experiment with additional learning biases to promote DNN preference for more efficient communication systems. Similarly, Lian et al. (2021) describe LSTM simulations in which the network tends to preserve the distribution patterns observed in the training data, rather than to maximise efficiency of communication.
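
The following sketch makes “anti-efficient” precise by computing expected message length under an illustrative Zipf-like frequency distribution for two length assignments; the figures are invented for illustration.

```python
# Expected message length under efficient vs. anti-efficient coding, for an
# illustrative Zipf-like frequency distribution over five message types.
# Efficient coding (Zipf's law of abbreviation) gives frequent messages the
# shortest forms; anti-efficient coding does the reverse.
freqs = [0.5, 0.25, 0.12, 0.08, 0.05]     # most to least frequent (sums to 1)
lengths = [1, 2, 3, 4, 5]                 # available form lengths (in symbols)

efficient      = sum(f * l for f, l in zip(freqs, lengths))            # short forms for frequent messages
anti_efficient = sum(f * l for f, l in zip(freqs, reversed(lengths)))  # long forms for frequent messages

print(f"Expected length, efficient coding:      {efficient:.2f} symbols")
print(f"Expected length, anti-efficient coding: {anti_efficient:.2f} symbols")
# The efficient assignment minimises expected length; the finding reported by
# Chaabouni et al. is that emergent LSTM codes pattern with the longer option.
```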

If this tendency carries over to transformers, then they will be unable to distinguish between input from plausible and implausible linguistic communication systems. They will not recognise the class of likely natural languages. Therefore, they will not provide insight into the information theoretic biases that shape natural languages.

3.5 Large Data and the Control of Deep Learning Architecture

Transformers require vast amounts of training data for their word and multi-modal embeddings. Each significant improvement in performance on a range of tasks is generally driven by a substantial increase in training data, and an expansion in the size of the LLM. GPT-2 has 1.5 billion parameters, GPT-3 has 175 billion, and GPT-4 is thought to be roughly six times larger than GPT-3, which would put it on the order of a trillion parameters.

Only large tech companies have the computing capacity, infrastructure, and funds to develop and train transformers of this size. This concentrates the design and development of LLMs in a very limited number of centres, to the exclusion of most universities and smaller research agencies. As a result, there are limited resources for research on alternative deep learning architectures that focuses on issues which are not central to the economic concerns of tech companies. Many, perhaps most, researchers working outside these companies effectively become clients using their products, which they apply to AI tasks through fine-tuning and minor modifications. This is an unhealthy state of affairs. While enormous progress has been made on deep learning models over a relatively short period of time, it has been largely restricted to a narrow range of tasks in NLP. In particular, work on the relation of DNNs to human learning and representation is increasingly limited. Similarly, learning from small data with more transparent and agile systems is not a major focus of current research on deep learning.

It is important to note that Reinforcement Learning does not alleviate the need for large data sets and massive LLMs. RL can significantly improve the performance of transformers across a large variety of tasks (Dasgupta et al., 2023). It can also facilitate zero-, one-, and few-shot learning, where a DNN performs well on a new task with limited or no prior training. However, it does not eliminate the need for large amounts of training data, which are still required for pre-trained word and multi-modal embeddings.

3.6 LLMs are Opaque

Transformers, and DNNs in general, are largely opaque systems, for which it is difficult to identify the procedures through which they arrive at the patterns that they recognise. This is, in large part, because the functions, like ReLU, that they apply to activate their units are non-linear. Autoregressive generative language models also use softmax to map their output vectors into probability distributions.

These functions cause the vectors that a DNN produces to be, in the general case, non-compositional. This is because the mapping from input vectors to output vectors is not a homomorphism. A mapping \(f:A \rightarrow B\) from a group \(A\) to a group \(B\) is a homomorphism iff, for every \(v_i, v_j \in A\), \(f(v_i \cdot v_j) = f(v_i) \cdot f(v_j)\), where \(\cdot \) denotes the respective group operations. As a result, it is not always possible to predict the composite vectors that the units of a transformer generate from their inputs, or to reconstruct these inputs from the output vectors in a uniform and regular way.
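
A simple numerical counterexample, constructed for illustration, shows how an activation function such as ReLU fails the homomorphism condition when the operation is vector addition.

```python
# A numerical counterexample, constructed for illustration, showing that ReLU
# is not a homomorphism with respect to vector addition:
# relu(u + v) != relu(u) + relu(v) in general.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

u = np.array([ 1.0, -2.0])
v = np.array([-3.0,  4.0])

lhs = relu(u + v)            # relu([-2, 2])   -> [0, 2]
rhs = relu(u) + relu(v)      # [1, 0] + [0, 4] -> [1, 4]
print(lhs, rhs, np.allclose(lhs, rhs))   # [0. 2.] [1. 4.] False
```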

Probes (Hewitt & Manning, 2019), and selective de-activation of units and attention heads (Lasri, 2023) can provide insight into the structures that transformers infer. These methods remain indirect, and they do not fully illuminate the way in which transformers learn and represent patterns from data.

Bernardy and Lappin (2023) propose Unitary Recurrent Networks (URNs) to solve the problem of model opacity. URNs apply multiplication to orthogonal matrices. The matrices that they generate are strictly compositional. These models are fully transparent, and all input is recoverable from the output at each phase in a URN’s processing sequence. They achieve good results for deeply nested bracket-matching tasks in Dyck languages, a class of artificial context-free languages. They do not perform well on context-sensitive cross-serial dependencies in artificial languages, or on agreement in natural languages. One of their limitations is the truncation of matrices to reduce the size of their rows. This is necessary to facilitate efficient computation of matrix multiplication, but it degrades the performance of a URN on the tasks to which it is applied.
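
The sketch below, with invented dimensions and random matrices, illustrates the property that makes this kind of composition transparent: orthogonal matrices compose by multiplication into another orthogonal matrix, and the inverse of each step is simply its transpose, so earlier states are exactly recoverable.

```python
# A sketch, with invented dimensions and random matrices, of why composition
# by orthogonal matrix multiplication is transparent: the product of
# orthogonal matrices is orthogonal, and each step can be undone exactly by
# multiplying with its transpose (its inverse).
import numpy as np

rng = np.random.default_rng(1)

def random_orthogonal(n):
    """Draw a random orthogonal matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

# One orthogonal matrix per "token" in a three-step sequence.
A, B, C = (random_orthogonal(4) for _ in range(3))

state_after_AB  = A @ B               # composition of the first two steps
state_after_ABC = state_after_AB @ C

# Transparency: undo the last step by multiplying with C's transpose.
recovered = state_after_ABC @ C.T
print(np.allclose(recovered, state_after_AB))                        # True
print(np.allclose(state_after_ABC @ state_after_ABC.T, np.eye(4)))   # still orthogonal
```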

4 Conclusions and Future Research

LLMs are not simply “stochastic parrots” synthesising fluent-sounding responses to prompts from previously observed training data. They achieve a sophisticated level of inductive learning and inference, with transferable skills, across a wide range of tasks. They are able to identify hierarchical syntactic structure and complex semantic relations. Through reinforcement learning on multi-modal input, they can be trained to respond appropriately to new questions and commands outside the domain of their training data. This involves significant generalisation and few-shot learning. Work on LLMs has yielded dramatic progress across a broad set of AI problems, in a comparatively short period of time. It far exceeds the achievements of symbolic rule-based AI over many decades.

It is not clear to what extent LLMs illuminate human cognitive abilities in the areas of language learning and linguistic representation. While they surpass human performance on many cognitively interesting NLP tasks, they require far more data for language learning than humans do. It is not obvious that they learn and encode linguistic knowledge in the way that humans perform these operations.

LLMs are far from human abilities in natural language inference, analogical reasoning, and interpretation, particularly for figurative language. Their performance in domain-general dialogue, while fluent in appearance, remains informationally limited and frequently unreliable. They do not distinguish fact from fiction, but freely generate inaccurate claims. They also do not optimise informational efficiency in communication.

The large scale of training data and model size that LLMs require has created a situation in which large tech companies control the design and development of these systems. This has skewed research on deep learning in a particular direction, and disadvantaged scientific work on machine learning with a different orientation.

LLMs are opaque in the way that they generalise from data, which poses serious problems of explainability. At present we can only understand their inference procedures and knowledge representations indirectly, through probes, and through selective ablation of heads and other units in the network.

These conclusions suggest the following lines of future research on deep learning. To compare LLMs to human learners it is necessary to modify the data to which they have access, and to alter their training regimen. This will permit us to examine the extent to which there are correspondences and disanalogies between the two learning processes. It will also be necessary to study the internal procedures applied by each type of learner, computational for LLMs and neurological for humans, more closely to identify the precise mechanisms that drive inference, generalisation, and representation, for each kind of acquisition.

It would be useful to experiment with additional learning biases for LLMs to see if these biases will improve their capacity for robust NLI, and for communicative efficiency. This work may provide deeper understanding of what is involved in both abilities. If it is successful, it will produce more intelligent and effective DNNs, which are better able to handle complex NLP tasks.

It is imperative that we develop procedures for testing the factual accuracy of the text produced by LLMs. Without them we are exposed to the threat of disinformation on an even larger scale than we are currently encountering in bot-saturated social media. The need to combat disinformation is compounded by the urgency of filtering racial, ethnic, religious, and gender bias in AI systems powered by LLMs that are used in decision-making applications (hiring, lending, university admission, etc.). These research concerns should be a focus of public policy discussion.

Developing smaller, more lightweight models that can be trained on less data would encourage more work on alternative architectures, among a larger number of researchers, distributed more widely across industrial and academic centres. This would facilitate the pursuit of more varied scientific objectives in the area of machine learning.

Finally, designing and implementing fully transparent DNNs will improve our understanding of both artificial and human learning. Scientific insight of this kind should be no less a priority than the engineering success driving current LLMs. Ultimately, good engineering depends on a solid scientific understanding of the principles embodied in the systems that the engineering creates.