We investigate expert disagreement over the potential and limitations of deep learning. We conducted 25 expert interviews to reveal the reasons and arguments that underlie disagreement about the limitations of deep learning, evaluated here with respect to high-level machine intelligence. Experts in our sample named 40 limitations of deep learning. Using the interview data, we identify and explore five crucial, unresolved research subjects that underpin this scholarly disagreement: abstraction, generalisation, explanatory models, emergence of planning, and intervention. We suggest that such origins of disagreement can be used to form a research road map to guide efforts towards overcoming the limitations of deep learning.
Deep learning has led to remarkable progress in artificial intelligence (AI) research. We define deep learning here as using steps of differentiable computations for end-to-end training of multi-layered artificial neural networks. Deep learning has significantly advanced performance across traditional research areas such as image recognition, natural language processing and game play. Its successful application across different domains has led to some debate and speculation on whether deep learning might be the key to fulfilling the ambition articulated by McCarthy et al. in 1955: to describe every aspect of learning and feature of intelligence so precisely that a machine could be made to simulate it. Some researchers now proclaim that domain-general artificial intelligence could be achieved within a decade and that deep learning will be central to its development.
Knowing whether or not such drastic progress is to be expected within short timeframes matters. Advanced AI will bring many challenges [6,7,8,9,10], ranging from privacy to safety concerns. Deep learning plays a role in this question: if currently used methods face few fundamental limits, progress is likely to be faster. But if limitations, such as those articulated by Cantwell Smith, remain a challenge, progress is likely to be slower. Investigating the limitations of deep learning matters to prepare for the challenges posed by advanced AI ahead of time.
Foremost, the question is academically and strategically interesting in its own right. Different research teams strategically decide where to place their focus. They thereby place bets on which techniques are most likely to lead to results—be those insights into the nature of intelligence or engineered applications. Their choices can reflect their relative credence in the potential of deep learning. Take, for example, two companies which explicitly aim to engineer intelligence. OpenAI places a focus on scaling existing deep learning approaches [12,13,14,15] and emphasises the importance of increasing computing resources [4, 16]. DeepMind, in contrast, pays more attention to transferable lessons from the study of biological systems [17,18,19] and previous paradigms such as symbolic approaches [20, 21].
Surveys that captured the differences between expert expectations by asking experts for quantitative predictions [22, 23] on high-level machine intelligence (or HLMI, defined in  as “when unaided machines can accomplish every task better and more cheaply than human workers”) show that indeed many experts do not rule out continued and drastic progress within the next decade and century. Grace et al.  asked 352 machine learning experts (21% of the 1634 authors who were contacted at NIPS and ICML 2015) to estimate the probability of reaching HLMI in future years. The aggregated responses provided a widely reported [24, 25] 50% chance of reaching HLMI within 45 years (from 2016). Yet a closer look (Fig. 1 in ) shows strong disagreement between experts. Forecasts by individuals are widely dispersed, ranging from high confidence of attaining HLMI within a decade, to attributing only a 10% chance of HLMI by the end of the century. Predictions differ so widely that they suggest quite different futures and actions. Without knowledge of the sophistication of arguments underneath these predictions, we cannot discern whose estimate is likely to be more accurate. Quantitative surveys provide no insight into the substance behind predictions.
Previous studies also chose a narrow selection of experts. It is clear who is an expert on machine learning, but less clear who is qualified to answer questions on the nature of intelligence. While the sample in  was wide-ranging in academic discipline (but still weighted towards computer science),  only surveyed attendees of machine learning conferences. This excludes much of the available expertise on intelligence (artificial and biological) from neighbouring disciplines and subfields. And as Page [26, 27] and Landemore  argue, the diversity between prediction models in a sample can make a substantial difference to prediction outcomes. Lastly, these surveys rely on an expertise that is not verifiably held by the experts they surveyed. Experts who did not receive training in forecasting tend to have poor predictive judgement [29,30,31], even in their field of expertise and especially over long timescales. This calls into question the reliability of previous quantitative surveys on AI progress.
Our approach addresses the limitations of quantitative surveys in three ways: first, by refraining from demanding quantitative predictions of AI experts; second, by diversifying the expertise in our sample, expanding the notion of an AI expert to include expertise with relevance to AI from subfields of neuroscience, cognitive science, philosophy and mathematics; and third, by focusing on the extraction of reasons and arguments behind expert disagreement.
Our aim is neither to predict what deep learning will or will not be able to do, nor when HLMI will be achieved. Both are questions that will be answered by research, not forecasts. We use the concept of HLMI only to stimulate the discussion on deep learning limitations. Our aim is to identify those research projects that will most likely advance our understanding of intelligence. We do this by mapping the disagreement without resolving or picking a position within it. The hope is to show that we can clarify, foster and use debate for the overall advancement of progress in AI—a goal common to both sides.
We conducted 25 expert interviews, resulting in the identification of 40 limitations of the deep learning approach and 5 origins of expert disagreement. These origins are open scientific questions that partially explain different interpretations by experts and thereby elucidate central issues in AI research. They are: abstraction, generalisation, explanatory models, emergence of planning, and intervention. We explore both optimistic and pessimistic arguments that are related to each of the five key questions, as well as the common beliefs that underpin optimistic and pessimistic argumentation. Our data provide a basis upon which to construct a research agenda that addresses key deep learning limitations.
This paper makes several novel contributions. First, we systematically collect a list of significant limitations of deep learning from a diverse set of experts. Second, we make these expert estimates legible and transparent by revealing the reasons and arguments that underlie them. We thereby reduce the information asymmetries between insiders and outsiders to the debate. Third, we make use of our map of expert disagreement to identify central open research questions which can be used to build a strategic research agenda to advance progress on current limitations.
This paper outlines the methodology, followed by the results, which include a list of limitations and the analysis of arguments derived from interviews. The analysis contains a description of common beliefs associated with optimistic and pessimistic viewpoints and a description of five origins of disagreement with evidential excerpts from interviews. We conclude with a discussion section which notes the limitations of this investigation and outlines the subsequent use of expert disagreement in the construction of an AI research agenda.
We used the Consolidated Criteria for Reporting Qualitative Research (COREQ) checklist for reporting qualitative research. Our sample selection does not aim to be representative of the frequency distribution of expert opinion. It targets a diversity and variety of expert arguments, not a report on how frequently a view is held. Our aim is to display the arguments in the most objective way possible and to minimise any interference with how arguments are perceived. We want arguments to stand for themselves. To reduce biases and interference, we made four methodological choices. First, we stay in line with previous surveys and choose to present positions anonymously to reduce any biases which the recognition of names might introduce. However, we do make transparent what level and area of expertise is found in our sample of experts. Second, we did not supplement each argument or limitation with potentially relevant references from the literature. This ensures that the arguments are not weighted by more or less numerous references. Moreover, an unbiased, full literature review across all disciplines we surveyed is outside the scope of this single study. We do not aim to present only correct arguments but to report on all reasonable arguments brought forward in the academic community, so that the expert community as a whole, rather than one subjective view, can, with clarifications and analyses such as this one, converge on the correct arguments over time. Third, we supplement arguments with quotations from interviews that represent each viewpoint directly. Note these quotations are only a concise selection of the evidence found in the interview transcripts. Fourth, we do not give a verdict on the correctness of either viewpoint in the discussion section of this paper.
To ensure consistency between what experts were assessing, we provided the same prompt to all experts: the limitations of deep learning with respect to giving rise to HLMI. We avoided the ambiguous and often colloquially used term “AGI” and generated the following definition of HLMI: a general or specific algorithmic system that collectively performs like average adults on cognitive tests that evaluate the cognitive abilities required to perform economically relevant tasks. Our definition differs from that of Grace et al.  in that we ask for the cognitive skills needed to do human-level economic tasks and ignore the economic cost of performing the task. Note that cognitive tests of this kind, specifically for algorithms, do not yet exist. The fact that tests for HLMI, as well as HLMI itself, are currently an ill-defined fiction does not render our study less useful, but instead provides a degree of freedom for experts to choose what particular cognitive skills to assess deep learning on. This is reflected in the different levels of abstraction, ambition and relative available scientific understanding of the different limitations that were named (see, e.g. adversarial attacks vs. conscious perception).
We combined the non-probabilistic, purposive method and the stratified sampling method, as described by  in chapter six. Stratified sampling recognises distinct subcategories within the sample population and samples within subcategories. We sampled 25 experts from the following subcategories: disciplines (cognitive science, mathematics, philosophy and neuroscience), rank (degree attained) and sector (industry or academia). We selected those disciplines because their expertise is relevant to the study of intelligence. Cognitive and neural sciences examine biological intelligence, while subfields of mathematics and philosophy study formal notions of intelligence. We aimed to include experts across sectors because experts under different incentive structures (academia vs industry) might have different perspectives. We followed purposive sampling within subcategories: selecting experts with particular relevance (institution, expertise, research focus). We sampled researchers within subcategories by targeting those that had given a relevant talk at a conference, worked for an organisation aiming to engineer or study HLMI or had written related journal articles. Experts that were geographically close and/or familiar with the notion of HLMI (had publicly spoken about advanced AI or were recommended to us by other interviewees) were preferred. The majority of interviewees were familiar with the notion of HLMI and located in London, Oxford or Cambridge in England and San Francisco or Berkeley in the USA. Experts were approached via email or personally at conference venues and, if receptive, met at an office or conference venue. No participant dropped out of the study, but several female researchers (but no male researchers) declined the invitation to be interviewed. One out of all the interviewed experts was known to the interviewer in a personal capacity.
The sample covered eight researchers in machine learning with specialisation in, for example, natural language processing, interpretability, fairness, robotics, AI progress metrics and game play; seven researchers in computer science, all with specialisation in AI, specifically robustness and safety, progress metrics, natural language processing, machine learning, computational models of consciousness, symbolic AI and causal representation; two researchers in cognitive psychology with specialisation in developmental psychology and the cognition behind concepts and rationality; three philosophers with specialisation in comparative cognition between animal brains and machine learning, philosophy of mind, of AI and of causality; two mathematicians with specialisation in machine learning, optimisation, Bayesian inference and robustness in AI; two computational neuroscientists with specialisation in neural network applications to neural theory; and one engineer with specialisation in AI and computation.
Some interviewees had interdisciplinary backgrounds, such as having worked across philosophy, computer science and animal cognition. Interviews were conducted in 2019 to early 2020 with seven professors, nine postdoctoral or senior researchers in academia, six researchers (with at least Masters degrees and several years of research experience) in companies or institutes and three PhD candidates. Interviewees sometimes held positions in both academia and industry or had done so in the past. They held the listed position at the time of the interview. The sample is diverse in institutional prestige and expert seniority, with deliberate inclusion of junior researchers. Our sample has a bias towards white, male researchers (23/25), reflecting a prevalence of males in higher academic positions and in the discipline of computer science.
We conducted individual semi-structured interviews. Semi-structured interviews use an interview guide with core questions and themes to be explored in response to open-ended questions to allow interviewees to explain their position freely [33, 35]. Each participant was provided information about the purpose of the interview, signed a consent form and was given an extended version of the given definition of HLMI (see Appendix 1). Interviews lasted 30–60 min, were recorded and conducted in English. Notes were taken during and after the interviews, using the recordings. During interviews, only interviewee and interviewer were present. Interviews, note-taking and interview coding were all done by one person. No repeat interviews were carried out. The author devised a questionnaire (Appendix 1) as a guide, with questions like: What do you believe deep learning will never be able to do? Does image recognition show that neural networks understand concepts? Why can we not yet automate the writing of discussion sections of scientific papers? Do you see limitations of deep learning that others appear not to notice? In response to these and similar questions, all interviewees named their perceived limitations. Note that the questions were posed in recognition of the expertise of the interviewee. For example, a philosopher was not asked about the adversarial attacks in machine learning and a machine learning expert was not asked about cognitive development in children. We used this interview data to identify issues that play a central role in the disagreement between experts in our study. We identified these issues following guidelines for conventional, inductive content-analysis. A content-analysis is the “subjective interpretation of text data through the systematic classification process of coding and identifying themes or patterns” . 
This approach does not utilise preconceived theories and uses both an analysis of manifest (literally in the text) and latent (implied) content . All limitations named by interviewees are manifest interview content and were collated. The list of perceived limitations provided in the results section lists all limitations, with shortened, paraphrased explanations by the author. They are deliberately not ordered into categories and are anonymised. We conserved each expert’s preferred terminology, despite recognising that some limitations may turn out to refer to the same problem under different names.
We provide brief examples of how we implemented the traditional coding schemes. Some limitations of content-analysis approaches can be found in . We combined coding of verbal designations (e.g. does the interviewee use the word “abstraction”?), scaling (e.g. is an argument provided by the interviewee more pessimistic than optimistic?) and simulation of hypothesis testing (e.g. does the text support or refute the hypothesis that “abstraction” is an origin of the disagreement?). The variables in our study, as required by the methodological instructions, are the origins of disagreement (e.g. “abstraction”). Each variable is thus found in one of two instantiations of an issue (e.g. “abstraction in ANNs”): a pessimistic instantiation [e.g. “artificial neural networks (ANNs) do not abstract, thus have limits”] and an optimistic one (e.g. “ANNs do abstract, thus have potential”).
We categorised arguments into pessimistic and optimistic instantiations and highlighted recurring themes in accordance with the previous studies mentioned above. Themes occurred across interviews. A theme could be a theory, a justification, an open question, an experiment or a study. We identified themes that were used as justifications in both more optimistic and more pessimistic arguments to make two opposing points. These are our variables. If both pessimistic and optimistic arguments made use of the same variable to support at least two different positions, we highlighted this variable as an origin underpinning the disagreement. We demonstrate the methodology of identifying origins and propose their use in generating research projects that can address algorithmic limitations. This study was approved by the Ethics Committee for the School of the Humanities and Social Sciences at the University of Cambridge.
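The identification rule described above can be sketched as a simple procedure. This is a hypothetical illustration only; the function and variable names are ours and do not correspond to any tooling used in the study, where coding was done manually:

```python
from collections import defaultdict

def find_origins(coded_arguments):
    """Identify origins of disagreement: themes (variables) used as
    justification in both optimistic and pessimistic arguments.

    `coded_arguments` is a list of (theme, stance) pairs produced by
    coding interview transcripts, where stance is "optimistic" or
    "pessimistic"."""
    stances_by_theme = defaultdict(set)
    for theme, stance in coded_arguments:
        stances_by_theme[theme].add(stance)
    # A theme counts as an origin only if both instantiations occur.
    return sorted(theme for theme, stances in stances_by_theme.items()
                  if {"optimistic", "pessimistic"} <= stances)

# Illustrative coded data (not actual study data):
coded = [
    ("abstraction", "pessimistic"),   # "ANNs do not abstract, thus have limits"
    ("abstraction", "optimistic"),    # "ANNs do abstract, thus have potential"
    ("benchmarking", "pessimistic"),  # raised only in pessimistic arguments
]
print(find_origins(coded))  # → ['abstraction']
```

The subset check (`<=`) encodes the study's criterion directly: a theme raised by only one side, like "benchmarking" above, is a perceived limitation but not an origin of disagreement.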
3.1 Interview data
In Table 1 we present 40 limitations of deep learning as currently perceived by experts (at the time of the interviews). Experts differed on estimates of the number and severity of limitations. All interviewees held the view that deep learning is useful.
This set should be understood as a temporary best estimate of the true number and nature of limitations. Empirical progress will determine whether the number of limitations is actually larger (including currently unknown limitations) or much smaller (some limitations overlap and are not independent). Many limitations appear specific to deep learning. Others apply to AI research generally. Many perceived limitations already receive significant attention by researchers. Interviewees recognised this, but considered the problem insufficiently solved at the time of their interview.
3.2 Common beliefs: scale and insight
We observed common beliefs amongst both the optimistic and pessimistic viewpoints. The perceived limitations of deep learning related to the wider debate about the possibility and timeline of engineering intelligence. An expert’s position in the disagreement depended on how numerous and difficult they thought the limitations to be. Interviewees showed a nuanced consideration of the complexity of the question at hand. They rarely took a definitive stance and mostly differed in how much credence they attributed to a particular position being correct and in how much credence they assigned over plausible ways of achieving HLMI. One inclination, however, united interviewees across the optimism–pessimism spectrum: each believed their own view could more accurately reflect and incorporate the uncertainties inherent in the study of intelligence [Footnote 1]. Emblematic of that view is the following quote from one (more optimistic) interviewee:
“There’s a bunch of world models, some of which predict 50% success in engineering HLMI in 10 years. Someone predicting a 0.1% probability of success has insufficient model uncertainty. They are over 99% sure many world models are wrong, even though they usually strongly agree we should be highly uncertain about progress over long time frames.” (P8).
Optimists attributed higher probabilities than pessimists to reaching HLMI within shorter timespans. Optimists were often impressed by how well trained ANNs generalise to test data. They attributed the success and potential of ANNs to their ability to identify useful representations of the environment, without using preconceptions based on human domain knowledge. Optimists stressed how much progress resulted from augmenting data and computational resources and warned not to underestimate potential performance gains derived from scaling existing methods [Footnote 2]. They suggested qualitative differences, like perceptibly new skills such as reasoning and planning, might emerge from quantitative scaling [Footnote 3]. While pessimists interpreted limitations such as grammar or disentangled representations as missing insights, optimists either saw these as achievable within the deep learning framework or as not essential for intelligence.
Optimists believed a trial-and-error research approach could lead to rapid progress towards HLMI, even without substantial improvements in the theoretical understanding of deep learning [Footnote 4]. They found it plausible that all foundational insights have already been discovered and that incremental improvements of deep learning techniques could suffice to build HLMI [Footnote 5]. Deep learning, they posited, may have stumbled upon the core components of intelligence [Footnote 6].
“I think existing techniques would certainly work given evolutionary amounts of compute. [...] A model [of how to achieve HLMI could be]: it’ll require some ingenuity and good ideas but it will be business as usual and it will rely on much more compute.” (P8)
“[For instance] using convolutional attention mechanisms and applying it to graphs structures and training to learn how to represent code by training it on GitHub corpora…that kind of incremental progress would carry us to [..] superintelligence.” (P21).
Experts tending towards pessimism also shared beliefs. They considered deep learning useful but expected that essential insights are missing [Footnote 7] and that paradigmatic shifts in research methods may be required. Missing skills include, for example, generalisation, language and reasoning, and the use of abstract, disentangled representations. Pessimists often drew on their understanding of animal intelligence and stressed the difficulties of studying intelligence [Footnote 8]. Pessimists seldom suggested that deep learning captures central components of intelligence. They rarely believed that data and computing availability alone will lead to the emergence of new skills [Footnote 9]. Scaling of deep learning is no solution to them because they deemed it infeasible or inefficient. Pessimists believed new algorithmic innovations must overcome the problem that deep learning requires disproportionate additions of data for each new feature [Footnote 10].
Previous successes were seen as only a weak indicator of future performance. The low-hanging fruit of deep learning applications might soon have been harvested.
“Those people who say that’s going to continue are saying it as more of a form of religion. It’s blind faith unsupported by facts. But if you have studied cognition, if you have studied the properties of language… [...] you recognise that there are many things that deep learning [...] right now isn’t doing.” (P23).
“My hunch is that deep learning isn’t going anywhere. It has very good solutions for problems where you have large amounts of labelled data, and fairly well-defined tasks and lots of compute thrown at problems. This doesn’t describe many tasks we care about.” (P10).
Pessimists pointed out norms and practices in research communities (e.g. ineffective benchmarking and unpublished negative results) that could delay progress towards HLMI [Footnote 11]. They particularly noted the lack of a scientific understanding of deep learning techniques [Footnote 12] and thought that trial-and-error approaches have limited power in navigating researchers towards an ill-defined goal like HLMI [Footnote 13].
“If you think you can build the solution even if you don’t know what the problem is, you probably think you can do AI” (P2).
3.3 Origins of disagreement
We identify key origins and scientific questions that underpin expert disagreement about the potential of deep learning approaches to achieve HLMI. These are: abstraction, generalisation, explanatory models, emergence of planning, and intervention. They are scientific questions with incomplete evidence, about which experts propose different hypotheses or interpretations and thus end up disagreeing. For each origin, we show what arguments lead to pessimistic or optimistic views on deep learning. Origins of disagreement depend on perceived limitations: disagreement exists because limitations persist. For instance, ANNs are currently limited in representing higher-order concepts. This creates uncertainty about whether they are capable of doing so—an uncertainty within which the disagreement resides. Disagreement can be resolved as solutions are found and uncertainty is reduced.
We present a non-exhaustive list of origins that can be used to make progress towards expert agreement. They are a subset of the origins of disagreement and do not map the dispute or limitations exhaustively. We identified an origin by noting when both optimistic and pessimistic experts referred to the same underlying issue to support opposing positions. Each section states the open question that gives rise to the disagreement and proceeds by paraphrasing pessimistic and optimistic arguments that experts reported as reasons for their position. Note that arguments given by different experts may be mutually exclusive but still point in the same direction. Quotes from interviews are provided as evidence where they capture an argument succinctly.
3.3.1 Abstraction
Do current artificial neural networks (ANNs) form abstract representations effectively? (Table 2).
3.3.2 Generalisation
Should ANNs’ ability to generalise inspire optimism about deep learning? (Table 3).
3.3.3 Explanatory, causal models
3.3.4 Emergence of planning
Will sufficiently complex environments suffice to enable deep learning algorithms to develop the capacity for hierarchical, long-term reasoning and planning? (Table 5).
3.3.5 Intervention
Will deep learning support and require learning by intervening in a complex, real environment? (Table 6).
Scientific uncertainty generates expert disagreement. A lack of data leads experts to make reasonable but opposing interpretations of, and extrapolations from, existing data. Limitations contribute to these origins. Both the uncertainty about which skills are required for HLMI and the uncertainty about whether deep learning can support each prerequisite lead to disagreement. This disagreement can guide research efforts aimed at overcoming limitations.
Our study too has limitations. Our focus on deep learning as it is now means that one must be careful in using these data for estimates of AI progress beyond a few years. As our understanding of deep learning improves, its definition will change. Indeed, even now, interviewees disagreed significantly over the definition of deep learning. Deep learning might soon signify algorithms different from those discussed here, or, of course, AI progress could occur without deep learning.
This study cannot show how many of the listed limitations are true limitations or how fundamental they are. Several limitations appear to overlap (e.g. representation and variable-binding), partially due to the high level of abstraction on which participants named limitations. A full literature review is beyond the scope and non-interference goal of this paper.
We noticed that experts might be using different notions of key terms that factor in the discussion (e.g. abstraction, priors and generalisation), which might result in significant semantic disagreement. We do not resolve this here and encourage researchers to expand our analysis, to define terms or to suggest additions to this list of origins. Our list of origins is by nature subjective and would thereby improve as more interpretations and viewpoints are added. Similarly, even though a debate-based research agenda will reduce expert disagreement, it is unlikely to lead to total agreement. As we mention, some expert differences depended on different interpretations of available data, different views on the nature of intelligence and different views on what intelligence must and could be capable of. Some disagreement will likely remain. But our goal here is to point out how to use these differences, not to minimise them.
Because we conducted interviews, our sample size was constrained, capturing the argumentation of around 25 experts. It is necessarily not a full representation of all expert arguments. Future research could expand the list of arguments, sample even more sub-disciplines, increase the level of detail provided and investigate the validity of the arguments provided by experts. Finally, and most importantly, we hope researchers will utilise these origins to generate specific experiments, concise definitions, benchmarks and a collective research agenda to test the hypotheses that underlie the theories of intelligence presented here.
4.3 Progress in artificial intelligence
Expert disagreement offers a rich landscape upon which to construct a research agenda to overcome the current limitations of deep learning. A research agenda guided by expert disagreement can (a) define key terminology in origins of disagreement, (b) dissect origins into tractable, feasible research questions, (c) collect optimistic and pessimistic hypotheses on origins, (d) specify experiments that could falsify either of the hypotheses, (e) generate benchmarks and competitions that encourage coordinated experimentation on origins and (f) conduct experiments and falsify hypotheses.
We give some examples of the types of research questions which, if addressed, can contribute to progress on reducing uncertainty. For instance: developing tests to distinguish whether deep reinforcement learning agents learn heuristics or planning; identifying games which can only be solved using higher-order reasoning and planning, to test the emergence hypothesis; developing measurements for degrees of abstraction and concept formation in ANNs; testing the correlation between the levels of abstraction formed and computing resources; defining what prediction performed by an ANN would indicate that “model-building” has been achieved; defining desirable levels and types of generalisation that liken human generalisation; and investigating the causes and extent of unexpected, unspecified, emerging skills in predictive language models.
Coordinated experimentation will reduce the uncertainty that gives rise to origins of disagreement and advance agreement between experts. This research agenda should extend beyond computer science, as it will benefit from interdisciplinary efforts in, for example, psychology, philosophy or animal cognition, as demonstrated by [40,41,42]. Collaborations in which pessimists set the tasks that would make them more optimistic, and in which optimists try to solve those tasks using deep learning, could be fruitful. We suggest expert disagreement provides a road map for progress towards artificial intelligence.
“Partly, I’m just more uncertain about research progress: I won’t write off […] a field [that is an alternative to deep learning] just because right now it doesn’t seem to be progressing very quickly.” (Participant 10). “[The compute focused view] seems overconfident to me—I just have more uncertainty around it. I’m not necessarily willing to take a bet that compute is going to get us to general intelligence.” (P25).
“We underestimated how far this goes […] Just more of the same could work.” (P18) “I’m surprised when I’m not surprised anymore. I’m constantly underestimating the rate of progress.” (P5) “There could be one large network to approximate everything.” (P9) “We’re going to have a lot more compute, even in ten years.” (P14).
“If the environment is complex enough, you can’t use heuristics anymore and you will develop general reasoning abilities” (P9) “We’re seeing higher level strategies that no-one seems to have an explanation for. I don’t see any particular reason why it should not continue.” (P21).
“We know that one can get minds without understanding because evolution produced it.” (P21).
“No idea what it would be if not neural nets.” (P9) “I think it’s very unlikely neural nets will be totally replaced by a paradigm shift, but I can imagine something meaningful gets integrated with neural nets like external memory. That’s the largest kind of paradigm shift that appears plausible to me in the next 10 years.” (P8).
“What I think [ANNs] can learn in terms of representations and pattern are a superset of the things and patterns that I think brains can learn. Gradient descent is probably more powerful and a more fundamental method.” (P21) “It does seem plausible to me that all what intelligence is, is some incredibly complicated pattern recognition. [Deep learning] seems [to match] quite a lot of [the] things that we think intelligence to be.” (P25).
“If we had the right idea, probably current compute will be sufficient.” (P10).
“I believe that we can get sufficiently intelligent behaviour out of neurons that are significantly simpler than biological neurons. […] But exactly which details need to be preserved is an empirical question. Anybody who says they have the answer today is being religious rather than scientific.” (P23) “I think [the engineering view] is a gross underestimation of the problem. […] There’s a bunch of strange priors that you have, that maybe are really important. There’s strange stuff that happens in your peripheral vision for when you track the velocity of objects […] maybe it’s not important but maybe it is. […] Defining a task that tests the bit that we know about…I have no idea how big a slice of the problem it is, but I would make the assumption that it’s a small slice […]. I would assume that we don’t know.” (P2) “AI is really, really hard. Making an intuitive concept precise is really hard. How do you take a rough and intuitive and vague concept and make it precise without losing that which you tried to capture? Because if you knew the precise nature of that which you’re trying to capture it would not have been vague and intuitive!” (P3).
“I would not agree with [this] statement [that AGI is a serious possibility in five years]. With keeping what we’re doing we’re not going to get to AGI. Maybe on the sort of problems that OpenAI is tackling, where you can simulate things, but on the natural language side, things are lot more messy” (P7).
“You need increasingly large amount of data to get deep learning to do better. It’s not linear. […] Deep learning-based language tech uses trillions of words of training data and that to me is not scaling, it’s the opposite.” (P23).
“There’s one sense in which people are quite happy to get points amongst our research community for beating benchmarks performance and are not thinking about why do we care about this benchmark? […] Another problem is papers only report the best performance they got after a lot of fine-tuning.” (P10) “Because there’s so much trial and error we don’t really know which problems we’ve solved […] you don’t know the limits of how far this is going to go.” (P25).
“In most of machine learning, what works is not exactly what the theory is telling you. […] We don’t really understand ourselves how the deep models work. […] Can we really control it in a scientific manner and not just an engineering manner?” (P11).
“With trial and error over what is almost an infinite problem space? Seems hard.” (P2).
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015). https://doi.org/10.1038/nature14539
Jordan, M.I., Mitchell, T.M.: Machine learning: trends, perspectives, and prospects. Science 349, 255–260 (2015). https://doi.org/10.1126/science.aaa8415
McCarthy, J., Minsky, M.L., Rochester, N., Shannon, C.E.: A proposal for the Dartmouth summer research project on artificial intelligence, August 31, 1955. AI Mag. 27, 12–12 (2006). https://doi.org/10.1609/aimag.v27i4.1904
Sutskever, I.: Ilya Sutskever at AI Frontiers 2018: Recent advances in deep learning and AI from OpenAI. Progress towards the OpenAI mission (2018). https://sqlandsiva.blogspot.com/2019/01/day-202-ilya-sutskever-at-ai-frontiers.html. Accessed 5 Apr 2021
Ord, T.: The Precipice: Existential Risk and the Future of Humanity. Bloomsbury, London (2020)
Liyanage, H., Liaw, S.-T., Jonnagaddala, J., Schreiber, R., Kuziemsky, C., Terry, A.L., de Lusignan, S.: Artificial intelligence in primary health care: perceptions, issues, and challenges: primary health care informatics working group contribution to the yearbook of medical informatics 2019. Yearb. Med. Inf. 28, 041–046 (2019). https://doi.org/10.1055/s-0039-1677901
Frey, C.B., Osborne, M.A.: The future of employment: how susceptible are jobs to computerisation? Technol. Forecast. Soc. Chang. 114, 254–280 (2017). https://doi.org/10.1016/j.techfore.2016.08.019
Crawford, K., Dobbe, R., Dryer, T., Fried, G., Green, B., Kaziunas, E., Kak, A., Mathur, V., McElroy, E., Sánchez, A.N., Raji, D., Rankin, J.L., Richardson, R., Schultz, J., West, S.M., Whittaker, M.: AI Now Report 2019 (2019)
Brundage, M., Avin, S., Clark, J., Toner, H., Eckersley, P., Garfinkel, B., Dafoe, A.: The malicious use of artificial intelligence: forecasting, prevention, and mitigation. https://arxiv.org/ftp/arxiv/papers/1802/1802.07228.pdf (2018)
Russell, S.J.: Human Compatible: Artificial Intelligence and the Problem of Control. Penguin (2019) ISBN:978-0-525-55861-3
Cantwell Smith, B.: The Promise of Artificial Intelligence: Reckoning and Judgement. MIT Press, Cambridge (2019). ISBN:9780262043045
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. ArXiv200514165 Cs. (2020)
Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. ArXiv200108361 Cs Stat. (2020)
Hernandez, D., Brown, T.B.: Measuring the algorithmic efficiency of neural networks. ArXiv200504305 Cs Stat. (2020)
McCandlish, S., Kaplan, J., Amodei, D., Team, O.D.: An empirical model of large-batch training. ArXiv181206162 Cs Stat. (2018)
Amodei, D., Hernandez, D.: AI and compute. OpenAI Blog. https://openai.com/blog/ai-and-compute/ (2018). Accessed 17 Dec 2020
Hassabis, D., Kumaran, D., Summerfield, C., Botvinick, M.: Neuroscience-inspired artificial intelligence. Neuron 95(2), 245–258 (2017). https://doi.org/10.1016/j.neuron.2017.06.011
Banino, A., Barry, C., Uria, B., Blundell, C., Lillicrap, T., Mirowski, P., Pritzel, A., Chadwick, M.J., Degris, T., Modayil, J., Wayne, G., Soyer, H., Viola, F., Zhang, B., Goroshin, R., Rabinowitz, N., Pascanu, R., Beattie, C., Petersen, S., Sadik, A., Gaffney, S., King, H., Kavukcuoglu, K., Hassabis, D., Hadsell, R., Kumaran, D.: Vector-based navigation using grid-like representations in artificial agents. Nature 557, 429–433 (2018). https://doi.org/10.1038/s41586-018-0102-6
Banino, A., Koster, R., Hassabis, D., Kumaran, D.: Retrieval-based model accounts for striking profile of episodic memory and generalization. Sci. Rep. 6, 31330 (2016). https://doi.org/10.1038/srep31330
Garnelo, M., Shanahan, M.: Reconciling deep learning with symbolic artificial intelligence: representing objects and relations. Curr. Opin. Behav. Sci. 29, 17–23 (2019). https://doi.org/10.1016/j.cobeha.2018.12.010
Cranmer, M.D., Xu, R., Battaglia, P., Ho, S.: Learning symbolic physics with graph networks. NeurIPS 2019 ArXiv190905862 Astro-Ph Phys. Phys. Stat. (2019)
Grace, K., Salvatier, J., Dafoe, A., Zhang, B., Evans, O.: Viewpoint: when will AI exceed human performance? Evidence from ai experts. J. Artif. Intell. Res. 62, 729–754 (2018). https://doi.org/10.1613/jair.1.11222
Müller, V.C., Bostrom, N.: Future progress in artificial intelligence: a survey of expert opinion. In: Müller, V.C. (ed.) Fundamental Issues of Artificial Intelligence, pp. 555–572. Springer International Publishing, Cham (2016)
MIT Technology Review: Experts predict when artificial intelligence will exceed human performance. https://www.technologyreview.com/2017/05/31/151461/experts-predict-when-artificial-intelligence-will-exceed-human-performance/ (2017)
Gray, R.: How long will it take for your job to be automated? BBC (2017). https://www.bbc.com/worklife/article/20170619-how-long-will-it-take-for-your-job-to-be-automated
Hong, L., Page, S.E.: Groups of diverse problem solvers can outperform groups of high-ability problem solvers. Proc. Natl. Acad. Sci. 101(46), 16385–16389 (2004). https://doi.org/10.1073/pnas.0403723101
Page, S.E.: The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools and Societies. Princeton University Press, Princeton (2008)
Landemore, H.: Democratic Reason. Princeton University Press, Princeton (2017)
Tetlock, P.E., Gardner, D.: Superforecasting: the Art and Science of Prediction. Crown Publishers, New York (2015)
Armstrong, J.S.: The seer-sucker theory: the value of experts in forecasting. Technol. Rev. 82(7), 16–24 (1980). Postprint: http://repository.upenn.edu/marketing_papers/3
Chang, W., Chen, E., Mellers, B., Tetlock, P.: Developing expert political judgment: the impact of training and practice on judgmental accuracy in geopolitical forecasting tournaments. Judg. Decis. Mak. 11(5), http://journal.sjdm.org/16/16511/jdm16511.pdf (2016)
Tong, A., Sainsbury, P., Craig, J.: Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups. Int. J. Qual. Health Care. 19, 349–357 (2007). https://doi.org/10.1093/intqhc/mzm042
DeJonckheere, M., Vaughn, L.M.: Semistructured interviewing in primary care research: a balance of relationship and rigour. Fam. Med. Community Health. 7, e000057 (2019). https://doi.org/10.1136/fmch-2018-000057
Krippendorff, K.: Content Analysis: An Introduction to Its Methodology. SAGE, Thousand Oaks (2013)
Jamshed, S.: Qualitative research method-interviewing and observation. J. Basic Clin. Pharm. 5, 87–88 (2014). https://doi.org/10.4103/0976-0105.141942
Hsieh, H.-F., Shannon, S.E.: Three approaches to qualitative content analysis. Qual. Health Res. 15, 1277–1288 (2005). https://doi.org/10.1177/1049732305276687
Kondracki, N.L., Wellman, N.S., Amundson, D.R.: Content analysis: review of methods and their applications in nutrition education. J. Nutr. Educ. Behav. 34, 224–230 (2002). https://doi.org/10.1016/S1499-4046(06)60097-3
Busch, C., Lynn, T., Kellum, R.: Content Analysis. Colorado State University, Fort Collins (2012)
Lake, B.M., Ullman, T.D., Tenenbaum, J.B., Gershman, S.J.: Building machines that learn and think like people. Behav. Brain Sci. 40, 253 (2017). https://doi.org/10.1017/S0140525X16001837. (Epub 2016 Nov 24)
Buckner, C.: Empiricism without magic: transformational abstraction in deep convolutional neural networks. Synthese 12, 1–34 (2018). https://doi.org/10.1007/s11229-018-01949-1
Crosby, M., Beyret, B., Halina, M.: The Animal-AI Olympics. Nat. Mach. Intell. 1, 257–257 (2019). https://doi.org/10.1038/s42256-019-0050-3
Krishnan, M.: Against interpretability: a critical examination of the interpretability problem in machine learning. Philos. Technol. 33, 487–502 (2020). https://doi.org/10.1007/s13347-019-00372-9
We report no conflicts of interest. This work was financially supported by the Berkeley Existential Risk Initiative (BERI). The work was begun while in residency at the Leverhulme Centre for the Future of Intelligence and the Centre for the Study of Existential Risk and completed at the Future of Humanity Institute at the University of Oxford. Thanks to Seán Ó hÉigeartaigh and Shahar Avin for supervision on this project and to Luke Kemp, Jess Whittlestone and Alexa Hagerty for comments on drafts.
Appendix 1: Interview procedure and questionnaire
1.1 Introduction given to interviewee
You can take as much time as you like before you answer a question. We want to collect informed opinions. If you do not know the answer to a question, please say so and we will move on to another question. You can speak to me at the highest technical level and assume that I understand what you are talking about. If I do not understand a term or the details of a technique, I will ask. Please state your job title, workplace, field and expertise.
1.2 Definition of high-level machine intelligence
This interview is about exploring the potential and limitations of connectionism and in particular deep learning, for building high-level machine intelligence. We are interested in understanding deep learning in reference to high-level cognitive skills across domains of intelligent cognition. We are therefore not asking whether deep learning can, or will be, economically useful. We are not concerned with problems of consciousness, moral patienthood, etc. Instead, we are interested in the potential of machines to reach high-level intelligence on the basis of deep learning techniques.
According to Hutter and Legg, intelligence measures an agent’s ability to achieve goals in a wide range of environments. Here is a more detailed definition: high-level machine intelligence (HLMI) describes an algorithm that performs as well as an average human in many economically relevant tasks, including patient diagnosis, speech writing or cleaning. Such tasks are only achievable given cognitive skills. Psychologists have developed a variety of tests that characterise a human’s cognitive skills. We thus define HLMI as an algorithm that is able to achieve average performance on the vast majority of these cognitive skill tests. They include IQ tests and sociality assessments, which test skills such as memory recall, pattern and story completion, verbal comprehension, analogical, mathematical, analytical, fluid and spatial reasoning, social intent and one-shot learning, and can be extended to testing theory of mind, creative imagination and divergent thinking. To be considered an HLMI, an algorithm should perform as well as an average human in most of these standard tests.
We do not assume these algorithms to be economically viable, widely spread or readily usable by all actors. We do not make assumptions about the actors involved: they could be companies, academics or state actors. We do not specify whether the algorithm is general or specific. We do not assume that one agent must achieve all tasks. Each task could be solved by a different agent/algorithm, and we would still describe these collective capabilities as HLMI.
What is deep learning (DL)?
Why and when does deep learning work?
Can you think of a task/skill that you think might be hardest/most implausible to do with ANNs?
Is there anything you believe ANNs will never be able to do?
I will now describe several intelligence tests taken from the cognitive sciences. Please tell me what cognitive skills are necessary to score as highly as an average adult human on each test.
Intention: An artificial agent has a pencil lying in front of it on the table. The agent observes a human taking a piece of paper into his hands. The human then starts looking around, searching his pockets. Will the agent understand that the human is looking for a pen to write with?
Exclusion inference: “An artificial agent is given four objects. It is told to choose the blimb. It has not learned the concept of a blimb, but it knows that three of the four objects are not blimbs. Can it pick the blimb without trial and error?”.
What skills are involved?
When do you think we can do it? Can this skill be achieved using ANNs?
Emulation versus Imitation: “The artificial agent has never seen a person open a door. A person demonstrates opening the door for a single time, but happens to drop the key, before unlocking. The artificial agent repeats the exact actions after a single demonstration, but does not drop the key”.
World model: The “Alternative Use Task” is used to assess divergent thinking and creative potential in humans. A human is given an object such as a “Paperclip” and must come up with as many unusual use cases for the object as possible (e.g. “pinch a hole into paper, by bending the steel wire”).
More creative humans come up with more items that are more original and non-obvious. Responses are rated by other humans by originality (statistically uncommon), fluency (quantity), flexibility (how many different categories) and elaboration (detail).
Human subjects in divergent thinking tests are given the task: “Come up with as many problems as possible that could arise between you and your parents”.
The artificial agent is placed into a virtual office space that it has never seen in this exact configuration. You instruct: “Move to the white cupboard in the left corner and fetch the old-looking notebook out of the second drawer”.
Companies such as (https://scinote.net) are automating the writing of the introduction of scientific papers. Why can we not automate the writing of the discussion part?
If your company had to place a bet on a technique, with the aim of creating high-level machine intelligence (in at least one domain), would you invest all your money and time in research based on ANN architectures, or would you explore other methods? Why?
Are there any ingredients of intelligent behaviour, or any skills (or processes that enable intelligent behaviour), that cannot easily be described by a mapping from feature inputs to target outputs?
Researchers at OpenAI have stated publicly that short term AGI/HLMI is a serious possibility. “It is not possible to determine a lower bound to progress in the near term and maybe the current way of progress will actually lead us to AGI”.
Do you agree or disagree?
If you disagree, what do you think proponents of such claims are missing or getting wrong?
What do you think they would have to see or learn about, so that they would agree with you?
What must a chatbot be able to accomplish in the next two years so that you would maybe agree?
A lot of human behaviour and intelligent behaviour is mathematically and formally ill-defined. (Examples: creativity, social interaction and emotion.) Does that pose a problem for near term HLMI?
Which components of human cognitive skills do you think we do not need to build in order to engineer an intelligent agent? Which components might be redundant, side effects or irrelevant?
Expectation: “If no one has published XXX within the next 2 years, using a connectionist model, I will be really surprised.”
Please fill in XXX.
How would this change your mind?
What type of problem does the research community have to crack so that you start to believe HLMI could be achieved in 10 years, using connectionist architectures? You can name any domain you like.
What type of problem does the AI community have to crack in the next five years so that you start to believe HLMI could be achieved twice as quickly as you think it can be achieved now?
Do you have a hunch or intuition about hard barriers that deep learning will hit, that no one else seems to speak about?
What is your most plausible story for short-term HLMI on the basis of deep learning?
What has to happen so that you think we should not continue approaching a task (e.g. one-shot, compositionality, planning, long-term memory, transfer learning) using ANNs?
Why, for reasons related to the techniques themselves, could progress toward HLMI using current techniques slow down? Assume that funding, talent, compute, etc. continue to grow.
What is so fundamental to the success of deep learning so that it is here to stay?
What are the tasks that you confidently think we will solve using ANNs? (It does not matter when we will solve them.)
When would you stop trying to improve algorithmic performance (in a particular domain) using the connectionist approach? What would count as a “connectionist failure” to you?
What is the type of data you would have to see to believe that an HLMI algorithm (in a particular domain) cannot be built by scaling ANNs?
What result/published paper would you have to see in the next two years, in any domain of your choice, to think that we could reach HLMI 10 years earlier than you thought?
Which papers/techniques do you think look most promising for increasing the data efficiency of ANN training?
Is backpropagation here to stay?
Brains are able to do object recognition using much less energy than ANNs. What are the reasons for this difference?
Free generalisations of universals from little data—why are ANNs not doing this? Will ANNs ever be able to do this?
In biological and physical systems, we find inherent, unavoidable trade-offs. For example, in physics, the trade-off between temporal and spatial resolution (the Heisenberg uncertainty principle); in biology, the virulence trade-off hypothesis, i.e. the trade-off between the transmission rate of a virus and its virulence (damage to the host). What trade-offs are inherent to all ANNs?
Here are some of the points of critique that deep learning has received: data inefficiency, object capture, causal reasoning, transfer learning, one-shot learning, meta-learning, common sense, concept formation, etc. Many researchers believe that these are current problems of deep learning but that deep learning researchers will solve them eventually. Are there any current issues in deep learning that you don’t expect us to make great improvements on? Why? What indicates that, for any of the above current limitations, we are unlikely to make progress?
Do you hold an opinion about deep learning that you think most other people don’t hold?
Do you know of a valid criticism against DL, which you believe the community would not like to hear?
Is there a benefit or a potential of DL that you believe most researchers dismiss too easily?
Did AlphaGo/AlphaZero change your thinking about how long it might take to reach HLMI? Did it influence your estimates of what you think is difficult to achieve and what is easy to achieve?
Can every cognitive task be described with a utility function?
How many cost functions does the brain have? How do they interact?
How should we estimate the “computational power” of brains?
What ingredients to intelligence are, you think, easier than most cognitive science researchers assume?
What makes scalability in DL so possible? What features are responsible for its ability to scale or what features prevent it from scaling?
What can we not learn in a simulation?
What cannot be learned from observation, what needs interaction?
Theoretically, ANNs can approximate any function. This theoretical result presumes access to infinite time and memory. Why is this an important result for making practical engineering choices in the real world?
Do you think the ability of ANNs to generalise is surprisingly good or surprisingly bad?
Do you believe ANNs understand the concept of a dog? If yes, why do you believe this?
What ingredients to intelligence are, you think, harder than most machine learning researchers assume?
Supervised learning, which has worked very well thus far, is able to learn well when target features are discrete categories. Can all cognitive skills be translated into a discrete categorisation problem?
What is the role of abstraction in intelligence and do ANNs do this?
What, according to you, are the most fundamental hurdles that we face in deep learning?
What are the skills, which we have not achieved, that you think are fundamental to a lot of ingredients of intelligence?
Appendix 2: Participant information sheet
2.1 Participant information sheet
Title of study: Expert Elicitation on Limitations of Deep Learning
Contact:
Funded by: Berkeley Existential Risk Initiative
Conducted at: University of Cambridge.
Purpose: Collect arguments from experts in computer science, cognitive science and philosophy, about the fundamental limitations of deep learning, via interviews.
Participation: Involves answering a series of questions regarding the potential and limitations of deep learning. The questions will not require you to disclose sensitive information about yourself, your work or your employer.
No known risk is associated with participation.
Your participation is entirely voluntary and you can withdraw from the project at any time without prejudice, now or in the future.
You can indicate your preference regarding anonymity below.
Collected data will be the recorded interviews and notes from the interviews and will be stored securely, in compliance with University guidelines. The audio recordings will not be published.
Interviews will be processed manually. The transcripts/notes may be shared anonymously with other researchers for the purpose of research only.
Some parts of your responses may be included (anonymously if indicated on consent form) in a final publication that summarises the arguments collected over all interviews. The full interviews themselves will not be published.
An academic publication venue may require us to provide a list of the respondents. This list will not be published.
If you, at any point, have any further questions, please email: ….
2.2 Interview consent form
If you consent to being interviewed and to any data gathered being processed as outlined above, please print and sign your name, and date the form, in the spaces provided.
Please indicate, by ticking ONE of the boxes below, whether you are willing to be identified, and whether we may quote your words directly, in reports and publications arising from this research.
I, and/or my employer (cross out whichever does not apply), may be identified in reports made available outside the research and publication review teams, and in publications.
Neither I, nor my employer, may be identified in reports made available outside the research teams and publication review teams, nor in any publications. My words may be quoted provided that they are anonymised.
Neither I, nor my employer, may be identified in reports made available outside the research teams and publication review teams, nor in any publications. My words may not be quoted.
Please print your name:
Cremer, C.Z. Deep limitations? Examining expert disagreement over deep learning. Prog Artif Intell 10, 449–464 (2021). https://doi.org/10.1007/s13748-021-00239-1