1 Introduction

Deep learning has led to remarkable progress in artificial intelligence (AI) research [1]. We define deep learning here as the use of differentiable computation steps for the end-to-end training of multi-layered artificial neural networks. Deep learning has significantly advanced performance across traditional research areas such as image recognition, natural language processing and game play [2]. Its successful application across different domains has led to some debate and speculation on whether deep learning might be the key to fulfilling the ambition articulated by McCarthy et al. in 1955: to describe every aspect of learning and feature of intelligence so precisely that a machine could be made to simulate it [3]. Some researchers now proclaim that domain-general artificial intelligence could be achieved within a decade [4] and that deep learning will be central to its development [5].

Knowing whether or not such drastic progress is to be expected within short timeframes matters. Advanced AI will bring many challenges [6,7,8,9,10], ranging from privacy to safety concerns. Deep learning plays a role in this question: if currently used methods face few fundamental limits, progress is likely to be faster. But if limitations, as articulated, for example, by Cantwell Smith [11], remain a challenge, progress is likely to be slower. Investigating the limitations of deep learning therefore matters: it allows us to prepare for the challenges posed by advanced AI ahead of time.

Foremost, the question is academically and strategically interesting in its own right. Different research teams strategically decide where to place their focus. They thereby place bets on which techniques are most likely to lead to results, be those insights into the nature of intelligence or engineered applications. Their choices can reflect their relative credence in the potential of deep learning. Take, for example, two companies which explicitly aim to engineer intelligence. OpenAI places a focus on scaling existing deep learning approaches [12,13,14,15] and emphasises the importance of increasing computing resources [4, 16]. DeepMind, in contrast, pays more attention to transferable lessons from the study of biological systems [17,18,19] and previous paradigms such as symbolic approaches [20, 21].

Surveys that captured the differences between expert expectations by asking experts for quantitative predictions [22, 23] on high-level machine intelligence (or HLMI, defined in [22] as “when unaided machines can accomplish every task better and more cheaply than human workers”) show that indeed many experts do not rule out continued and drastic progress within the next decade and century. Grace et al. [22] asked 352 machine learning experts (21% of the 1634 authors who were contacted at NIPS and ICML 2015) to estimate the probability of reaching HLMI in future years. The aggregated responses provided a widely reported [24, 25] 50% chance of reaching HLMI within 45 years (from 2016). Yet a closer look (Fig. 1 in [22]) shows strong disagreement between experts. Forecasts by individuals are widely dispersed, ranging from high confidence of attaining HLMI within a decade to attributing only a 10% chance of HLMI by the end of the century. Predictions differ so widely that they suggest quite different futures and actions. Without knowledge of the sophistication of the arguments underneath these predictions, we cannot discern whose estimate is likely to be more accurate. Quantitative surveys provide no insight into the substance behind predictions.

Previous studies also chose a narrow selection of experts. It is clear who is an expert on machine learning, but less clear who is qualified to answer questions on the nature of intelligence. While the sample in [23] was wide ranging in academic discipline (but still weighted towards computer science), [22] only surveyed attendees of machine learning conferences. This excludes much of the available expertise on intelligence (artificial and biological) from neighbouring disciplines and subfields. And as Page [26, 27] and Landemore [28] argue, the diversity of prediction models in a sample can make a substantial difference to prediction outcomes. Lastly, these surveys rely on an expertise that is not verifiably held by the experts they surveyed. Experts who did not receive training in forecasting tend to have poor predictive judgement [29,30,31], even in their field of expertise and especially over long timescales. This calls into question the reliability of previous quantitative surveys on AI progress.

Our approach addresses the limitations of quantitative surveys in three ways: first, by refraining from demanding quantitative predictions from AI experts; second, by diversifying the expertise in our sample, expanding the notion of an AI expert to include expertise relevant to AI from subfields of neuroscience, cognitive science, philosophy and mathematics; and third, by focusing on the extraction of the reasons and arguments behind expert disagreement.

Our aim is neither to predict what deep learning will or will not be able to do, nor when HLMI will be achieved. Both are questions that will be answered by research, not forecasts. We use the concept of HLMI only to stimulate the discussion on deep learning limitations. Our aim is to identify those research projects that will most likely advance our understanding of intelligence. We do this by mapping the disagreement without resolving or picking a position within it. The hope is to show that we can clarify, foster and use debate for the overall advancement of progress in AI—a goal common to both sides.

We conducted 25 expert interviews, resulting in the identification of 40 limitations of the deep learning approach and 5 origins of expert disagreement. These origins are open scientific questions that partially explain the differing interpretations by experts and thereby elucidate central issues in AI research. They are: abstraction, generalisation, explanatory models, emergence of planning and intervention. We explore the optimistic and pessimistic arguments related to each of these five key questions, as well as the common beliefs that underpin optimistic and pessimistic argumentation. Our data provide a basis upon which to construct a research agenda that addresses key deep learning limitations.

This paper makes several novel contributions. First, we systematically collect a list of significant limitations of deep learning from a diverse set of experts. Second, we make these expert estimates legible and transparent by revealing the reasons and arguments that underlie them. We thereby reduce the information asymmetries between insiders and outsiders to the debate. Third, we use our map of expert disagreement to identify central open research questions, which can be used to build a strategic research agenda to advance progress on current limitations.

This paper outlines the methodology, followed by the results, which include a list of limitations and the analysis of arguments derived from the interviews. The analysis comprises a description of the common beliefs associated with optimistic and pessimistic viewpoints and a description of five origins of disagreement, with evidential excerpts from the interviews. We conclude with a discussion section which notes the limitations of this investigation and outlines the subsequent use of expert disagreement in the construction of an AI research agenda.

2 Methods

We used the Consolidated Criteria for Reporting Qualitative Research (COREQ) checklist for reporting qualitative research [32]. Our sample selection does not aim to be representative of the frequency distribution of expert opinion [33]. It targets a diversity and variety of expert arguments, not a report on how frequently a view is held. Our aim is to display the arguments in the most objective way possible and to minimise any interference with how arguments are perceived. We want arguments to stand for themselves. To reduce bias and interference we made four methodological choices. First, in line with previous surveys, we present positions anonymously to reduce any biases which the recognition of names might introduce. However, we do make transparent what level and area of expertise is found in our sample of experts. Second, we did not supplement each argument or limitation with potentially relevant references from the literature. This ensures that the arguments are not weighted by more or less numerous references. Moreover, an unbiased, full literature review across all disciplines we surveyed is outside the scope of this single study. We do not aim to present only correct arguments but to report on all reasonable arguments brought forward in the academic community, so that the expert community as a whole, rather than one subjective view, can, with clarifications and analyses such as this one, converge on the correct arguments over time. Third, we supplement arguments with quotations from interviews that represent the viewpoint directly. Note that these quotations are only a concise selection of the evidence found in the interview transcripts. Fourth, we do not give a verdict on the correctness of either viewpoint in the discussion section of this paper.

To ensure consistency in what experts were assessing, we provided the same prompt to all experts: the limitations of deep learning with respect to giving rise to HLMI. We avoided the ambiguous and often colloquially used term “AGI” and generated the following definition of HLMI: a general or specific algorithmic system that collectively performs like average adults on cognitive tests that evaluate the cognitive abilities required to perform economically relevant tasks. Our definition differs from that of Grace et al. [22] in that we ask for the cognitive skills needed to do human-level economic tasks and ignore the economic cost of performing the task. Note that cognitive tests of this kind, specifically for algorithms, do not yet exist. The fact that tests for HLMI, as well as HLMI itself, are currently an ill-defined fiction does not render our study less useful, but instead provides a degree of freedom for experts to choose which particular cognitive skills to assess deep learning on. This is reflected in the different levels of abstraction, ambition and available scientific understanding of the limitations that were named (see, e.g. adversarial attacks vs. conscious perception).

We combined the non-probabilistic, purposive method and the stratified sampling method, as described by [34] in chapter six. Stratified sampling recognises distinct subcategories within the sample population and samples within subcategories. We sampled 25 experts from the following subcategories: discipline (cognitive science, mathematics, philosophy and neuroscience), rank (degree attained) and sector (industry or academia). We selected those disciplines because their expertise is relevant to the study of intelligence. Cognitive and neural sciences examine biological intelligence, while subfields of mathematics and philosophy study formal notions of intelligence. We aimed to include experts across sectors because experts under different incentive structures (academia vs industry) might have different perspectives. We followed purposive sampling within subcategories: selecting experts with particular relevance (institution, expertise, research focus). We sampled researchers within subcategories by targeting those who had given a relevant talk at a conference, worked for an organisation aiming to engineer or study HLMI or had written related journal articles. Experts who were geographically close and/or familiar with the notion of HLMI (had publicly spoken about advanced AI or were recommended to us by other interviewees) were preferred. The majority of interviewees were familiar with the notion of HLMI and located in London, Oxford or Cambridge in England and San Francisco or Berkeley in the USA. Experts were approached via email or personally at conference venues and, if receptive, met at an office or conference venue. No participant dropped out of the study, but several female researchers (and no male researchers) declined the invitation to be interviewed. One of the interviewed experts was known to the interviewer in a personal capacity.

The sample covered eight researchers in machine learning with specialisations in, for example, natural language processing, interpretability, fairness, robotics, AI progress metrics and game play; seven researchers in computer science, all with specialisations in AI, specifically robustness and safety, progress metrics, natural language processing, machine learning, computational models of consciousness, symbolic AI and causal representation; two researchers in cognitive psychology with specialisations in developmental psychology and the cognition behind concepts and rationality; three philosophers with specialisations in comparative cognition between animal brains and machine learning, philosophy of mind, of AI and of causality; two mathematicians with specialisations in machine learning, optimisation, Bayesian inference and robustness in AI; two computational neuroscientists with specialisations in neural network applications to neural theory; and one engineer with specialisation in AI and computation.

Some interviewees had interdisciplinary backgrounds, such as having worked in philosophy, computer science and animal cognition. Interviews were conducted from 2019 to early 2020 with seven professors, nine postdoctoral or senior researchers in academia, six researchers (with at least Master's degrees and several years of research experience) in companies or institutes and three PhD candidates. Interviewees sometimes held positions in both academia and industry or had done so in the past; they held the listed position at the time of the interview. The sample is diverse in institutional prestige and expert seniority, with deliberate inclusion of junior researchers. Our sample has a bias towards white, male researchers (23/25), reflecting the prevalence of males in higher academic positions and in the discipline of computer science.

We conducted individual semi-structured interviews. Semi-structured interviews use an interview guide with core questions and themes, explored through open-ended questions that allow interviewees to explain their position freely [33, 35]. Each participant was provided with information about the purpose of the interview, signed a consent form and was given an extended version of the given definition of HLMI (see Appendix 1). Interviews lasted 30–60 min, were recorded and were conducted in English. Notes were taken during and after the interviews, using the recordings. During interviews, only the interviewee and the interviewer were present. Interviews, note-taking and interview coding were all done by one person. No repeat interviews were carried out. The author devised a questionnaire (Appendix 1) as a guide, with questions like: What do you believe deep learning will never be able to do? Does image recognition show that neural networks understand concepts? Why can we not yet automate the writing of discussion sections of scientific papers? Do you see limitations of deep learning that others appear not to notice? In response to these and similar questions, all interviewees named their perceived limitations. Note that the questions were posed in recognition of the expertise of the interviewee. For example, a philosopher was not asked about adversarial attacks in machine learning and a machine learning expert was not asked about cognitive development in children.

We used this interview data to identify issues that play a central role in the disagreement between the experts in our study. We identified these issues following guidelines for conventional, inductive content analysis. A content analysis is the “subjective interpretation of text data through the systematic classification process of coding and identifying themes or patterns” [36]. This approach does not rely on preconceived theories and analyses both manifest (literally in the text) and latent (implied) content [37]. All limitations named by interviewees are manifest interview content and were collated. The list of perceived limitations provided in the results section includes all limitations, with shortened, paraphrased explanations by the author. They are deliberately not ordered into categories and are anonymised. We conserved each expert’s preferred terminology, despite recognising that some limitations may turn out to refer to the same problem under different names.

We provide brief examples of how we implemented the traditional coding schemes. Some limitations of content analysis approaches can be found in [38]. We combined coding of verbal designations (e.g. does the interviewee use the word “abstraction”?), scaling (e.g. is the argument provided by the interviewee a more pessimistic than optimistic argument?) and simulation of hypothesis testing (e.g. does the text support or refute the hypothesis that “abstraction” is an origin of the disagreement?). The variables, as required by the methodological instructions, are in our study the origins of disagreement (e.g. “abstraction”). Each variable is thus found in one of two instantiations of an issue (e.g. “abstraction in ANNs”): a pessimistic one (e.g. “artificial neural networks (ANNs) do not abstract, thus have limits”) and an optimistic one (e.g. “ANNs do abstract, thus have potential”).

We categorised arguments into pessimistic and optimistic instantiations and highlighted recurring themes, in accordance with the previous studies mentioned above. Themes occurred across interviews. A theme could be a theory, a justification, an open question, an experiment or a study. We identified themes that were used as justifications in both more optimistic and more pessimistic arguments to make two opposing points. These are our variables. If both pessimistic and optimistic arguments made use of the same variable to support at least two different positions, we highlighted this variable as an origin underpinning the disagreement, as illustrated in the sketch below. We demonstrate this methodology of identifying origins and propose its use in generating research projects that can address algorithmic limitations. This study was approved by the Ethics Committee for the School of the Humanities and Social Sciences at the University of Cambridge.
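To make the identification rule concrete, the following is a minimal, hypothetical sketch; the participant labels, excerpts and theme names below are purely illustrative and are not our actual coding data. Coded excerpts are tagged with a theme (variable) and a stance, and a theme counts as a candidate origin of disagreement only if it is invoked by both optimistic and pessimistic arguments.

```python
from collections import defaultdict

# Illustrative coded excerpts: (participant, theme/variable, stance).
# These entries are invented examples, not the study's coding output.
coded_excerpts = [
    ("P8",  "abstraction",    "optimistic"),
    ("P23", "abstraction",    "pessimistic"),
    ("P21", "generalisation", "optimistic"),
    ("P10", "generalisation", "pessimistic"),
    ("P2",  "benchmarking",   "pessimistic"),
]

def find_origins(excerpts):
    """Return themes that appear in both optimistic and pessimistic
    arguments, i.e. candidate origins of disagreement."""
    stances_per_theme = defaultdict(set)
    for _, theme, stance in excerpts:
        stances_per_theme[theme].add(stance)
    return [theme for theme, stances in stances_per_theme.items()
            if {"optimistic", "pessimistic"} <= stances]

print(find_origins(coded_excerpts))
# ['abstraction', 'generalisation'] -- 'benchmarking' is excluded because
# only pessimistic arguments invoke it in this toy example.
```

In the actual study this filtering was carried out manually on the interview transcripts rather than computationally; the sketch merely restates the decision rule for identifying an origin.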

3 Results

3.1 Interview data

In Table 1 we present 40 limitations of deep learning as perceived by experts at the time of the interviews. Experts differed in their estimates of the number and severity of these limitations. All interviewees held the view that deep learning is useful.

Table 1 Limitations of deep learning according to experts

This set should be understood as a temporary best estimate of the true number and nature of limitations. Empirical progress will determine whether the number of limitations is actually larger (including currently unknown limitations) or much smaller (some limitations overlap and are not independent). Many limitations appear specific to deep learning; others apply to AI research generally. Many perceived limitations already receive significant attention from researchers. Interviewees recognised this, but considered these problems insufficiently solved at the time of their interviews.

3.2 Common beliefs: scale and insight

We observed common beliefs amongst both the optimistic and the pessimistic viewpoints. The perceived limitations of deep learning related to the wider debate about the possibility and timeline of engineering intelligence. An expert’s position in the disagreement depended on how numerous and how difficult they thought the limitations to be. Interviewees showed a nuanced consideration of the complexity of the question at hand. They rarely took a definitive stance and mostly differed in how much credence they attributed to a particular position being correct and in how they distributed credence over plausible ways of achieving HLMI. One inclination, however, united interviewees across the optimism–pessimism spectrum: each believed their own view could more accurately reflect and incorporate the uncertainties inherent in the study of intelligence (Footnote 1). Emblematic of that view is the following quote from one (more optimistic) interviewee:

“There’s a bunch of world models, some of which predict 50% success in engineering HLMI in 10 years. Someone predicting a 0.1% probability of success has insufficient model uncertainty. They are over 99% sure many world models are wrong, even though they usually strongly agree we should be highly uncertain about progress over long time frames.” (P8).

Optimists attributed higher probabilities to reaching HLMI in shorter timespans than pessimists did. Optimists were often impressed by how well trained ANNs generalise to test data. They attributed the success and potential of ANNs to their ability to identify useful representations of the environment without using preconceptions based on human domain knowledge. Optimists stressed how much progress resulted from augmenting data and computational resources and warned not to underestimate the potential performance gains derived from scaling existing methods (Footnote 2). They suggested qualitative differences, such as perceptibly new skills like reasoning and planning, might emerge from quantitative scaling (Footnote 3). While pessimists interpreted limitations such as grammar or disentangled representations as signs of missing insights, optimists saw these either as achievable within the deep learning framework or as not essential for intelligence.

Optimists believed a trial-and-error research approach could lead to rapid progress towards HLMI, even without substantial improvements in the theoretical understanding of deep learning (Footnote 4). They found it plausible that all foundational insights have already been discovered and that incremental improvements of deep learning techniques could suffice to build HLMI (Footnote 5). Deep learning, they posited, may have stumbled upon the core components of intelligence (Footnote 6).

“I think existing techniques would certainly work given evolutionary amounts of compute. [...] A model [of how to achieve HLMI could be]: it’ll require some ingenuity and good ideas but it will be business as usual and it will rely on much more compute.” (P8)

“[For instance] using convolutional attention mechanisms and applying it to graphs structures and training to learn how to represent code by training it on GitHub corpora…that kind of incremental progress would carry us to [..] superintelligence.” (P21).

3.2.1 Insight

Experts tending towards pessimism also shared beliefs. They considered deep learning useful but expected that essential insights are missing (Footnote 7) and that paradigmatic shifts in research methods may be required. Missing skills include, for example, generalisation, language and reasoning, and the use of abstract, disentangled representations. Pessimists often drew on their understanding of animal intelligence and stressed the difficulties of studying intelligence (Footnote 8). Pessimists seldom suggested that deep learning captures central components of intelligence. They rarely believed that data and computing availability alone would lead to the emergence of new skills (Footnote 9). Scaling deep learning was no solution to them because they deemed it infeasible or inefficient. Pessimists believed new algorithmic innovations must overcome the problem of deep learning requiring disproportionate additions of data for each new feature (Footnote 10).

Previous successes were seen as only a weak indicator of future performance. The low-hanging fruit of deep learning applications might soon have been harvested.

“Those people who say that’s going to continue are saying it as more of a form of religion. It’s blind faith unsupported by facts. But if you have studied cognition, if you have studied the properties of language… [...] you recognise that there are many things that deep learning [...] right now isn’t doing.” (P23).

“My hunch is that deep learning isn’t going anywhere. It has very good solutions for problems where you have large amounts of labelled data, and fairly well-defined tasks and lots of compute thrown at problems. This doesn’t describe many tasks we care about.” (P10).

Pessimists pointed out norms and practices in research communities (e.g. ineffective benchmarking and unpublished negative results) that could delay progress towards HLMI (Footnote 11). They particularly noted the lack of a scientific understanding of deep learning techniques (Footnote 12) and thought that trial-and-error approaches have limited power in navigating researchers towards an ill-defined goal like HLMI (Footnote 13).

“If you think you can build the solution even if you don’t know what the problem is, you probably think you can do AI” (P2).

3.3 Origins of disagreement

We identify key origins and scientific questions that underpin expert disagreement about the potential of deep learning approaches to achieve HLMI. They are: abstraction, generalisation, explanatory models, emergence of planning and intervention. These are scientific questions with incomplete evidence, about which experts propose different hypotheses or interpretations and thus end up disagreeing. For each origin, we show which arguments lead to pessimistic or optimistic views on deep learning. Origins of disagreement depend on perceived limitations: disagreement exists because limitations persist. For instance, ANNs are currently limited in representing higher-order concepts. This creates uncertainty about whether they are capable of doing so, an uncertainty within which the disagreement resides. Disagreement can be resolved as solutions are found and uncertainty is reduced.

We present a non-exhaustive list of origins that can be used to make progress towards expert agreement. They are a subset of the origins of disagreement and do not map the dispute or limitations exhaustively. We identified an origin by noting when both optimistic and pessimistic experts referred to the same underlying issue to support opposing positions. Each section states the open question that gives rise to the disagreement and proceeds by paraphrasing the pessimistic and optimistic arguments that experts reported as reasons for their position. Note that arguments given by different experts may be mutually exclusive but still point in the same direction. Quotes from interviews are provided as evidence where they capture the argument succinctly.

3.3.1 Abstraction

Do current artificial neural networks (ANNs) form abstract representations effectively? (Table 2).

Table 2 Abstraction

3.3.2 Generalisation

Should ANNs’ ability to generalise inspire optimism about deep learning? (Table 3).

Table 3 Generalisation

3.3.3 Explanatory, causal models

Is it necessary, possible and feasible to construct compressed, causal, explanatory models of the environment as described in [39] using deep learning? (Table 4).

Table 4 Explanatory, causal models

3.3.4 Emergence of planning

Will sufficiently complex environments enable deep learning algorithms to develop the capacity for hierarchical, long-term reasoning and planning? (Table 5).

Table 5 Emergence of planning

3.3.5 Intervention

Will deep learning support and require learning by intervening in a complex, real environment? (Table 6).

Table 6 Intervention

4 Discussion

4.1 Summary

Scientific uncertainty generates expert disagreement. A lack of conclusive data leads experts to make reasonable but opposing interpretations of, and extrapolations from, the existing data. Limitations contribute to these origins. Both the uncertainty about which skills are required for HLMI and the uncertainty about whether deep learning can support each prerequisite lead to disagreement. This disagreement can guide research efforts aimed at overcoming limitations.

4.2 Limitations

Our study, too, has limitations. Our focus on deep learning as it is now means that one must be careful in using these data for estimates of AI progress beyond a few years. As our understanding of deep learning improves, its definition will change. Indeed, even now, interviewees disagreed significantly over the definition of deep learning. Deep learning might soon signify algorithms different from those discussed here, or, of course, AI progress could occur without deep learning.

This study cannot show how many of the listed limitations are true limitations or how fundamental they are. Several limitations appear to overlap (e.g. representation and variable-binding), partially due to the high level of abstraction at which participants named limitations. A full literature review is beyond the scope and the non-interference goal of this paper.

We noticed that experts might be using different notions of key terms that feature in the discussion (e.g. abstraction, priors and generalisation), which might result in significant semantic disagreement. We do not resolve this here and encourage researchers to expand our analysis, to define terms or to suggest additions to this list of origins. Our list of origins is by nature subjective and would therefore improve as more interpretations and viewpoints are added. Similarly, even though a debate-based research agenda will reduce expert disagreement, it is unlikely to lead to total agreement. As we mention, some expert differences depended on different interpretations of the available data, different views on the nature of intelligence and different views on what intelligence must and could be capable of. Some disagreement will likely remain. But our goal here is to point out how to use these differences, not to minimise them.

Because we conducted interviews, our sample size was constrained to the argumentation of 25 experts. It is necessarily not a full representation of all expert arguments. Future research could expand the list of arguments and sample even more sub-disciplines, increase the level of detail provided and investigate the validity of the arguments provided by experts. Finally, and most importantly, we hope researchers will use these origins to generate specific experiments, concise definitions, benchmarks and a collective research agenda to test the hypotheses that underlie the theories of intelligence presented here.

4.3 Progress in artificial intelligence

Expert disagreement offers a rich landscape upon which to construct a research agenda to overcome the current limitations of deep learning. A research agenda guided by expert disagreement can (a) define key terminology in origins of disagreement, (b) dissect origins into tractable, feasible research questions, (c) collect optimistic and pessimistic hypotheses on origins, (d) specify experiments that could falsify either of the hypotheses, (e) generate benchmarks and competitions that encourage coordinated experimentation on origins and (f) conduct experiments and falsify hypotheses.

We give some examples of the types of research questions which, if addressed, can contribute to reducing uncertainty. For instance: developing tests to distinguish whether deep reinforcement learning agents learn heuristics or planning; identifying games which can only be solved using higher-order reasoning and planning, to test the emergence hypothesis; developing different measurements for degrees of abstraction and concept formation in ANNs; testing the correlation between the levels of abstraction formed and computing resources; defining what prediction performed by an ANN would indicate that “model-building” has been achieved; defining desirable levels and types of generalisation that resemble human generalisation; and investigating the causes and extent of unexpected, unspecified, emergent skills in predictive language models.

Coordinated experimentation will reduce the uncertainty that gives rise to origins of disagreement and advance agreement between experts. This research agenda should extend beyond computer science research, as it will benefit from interdisciplinary efforts in, for example, psychology, philosophy or animal cognition, as demonstrated by [40,41,42]. Collaborations in which pessimists set the tasks that would make them more optimistic and in which optimists try to solve those tasks using deep learning could be fruitful. We suggest that expert disagreement provides a road map for progress towards artificial intelligence.