figure a

Dear readers,

A journalist asked to comment on Richard Sutton’s 2019 essay “The Bitter Lesson”, published online at http://www.incompleteideas.net/IncIdeas/BitterLesson.html. It is worth reading, and I would like to share my thoughts about the essay with you in this editorial.

Richard S. Sutton is a true AI pioneer, known for his seminal work on Reinforcement Learning [e.g. 2, 11, 12]. Verena Heidrich-Meisner interviewed him for KI in 2009 [6]. In “The Bitter Lesson”, Sutton concludes:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

He gives search and learning as prime examples of such general methods. Referring to natural language and speech processing, AI for games, and computer vision, he argues that “researchers always tried to make systems that worked the way the researchers thought their own minds worked—they tried to put that knowledge in their systems—but it proved ultimately counterproductive”. He bases his claim on the undeniable success of deep learning systems exploiting huge amounts of training data and compute resources.

Not surprisingly, as a machine learning researcher I agree in principle with the general praise of search and learning, but let me add some thoughts on the role of prior knowledge in AI and upscaling as a general solution strategy.

Learning that is “leveraging computation” has to go hand in hand with the availability of huge training data sets. Ultimately, data availability limits the performance even in the case of otherwise infinite compute resources. The way I read the essay, it assumes that scaling of learning systems entails increasing data availability. Let us briefly look at this situation from the theoretical perspective. In general, when the amount of (i.i.d.) data increases, distributions of statistical estimates become more concentrated. There are many learning algorithms that converge to the best possible solution when the number of training data points goes to infinity [5, 9]. However, for estimates based on finite data we have no-free-lunch: For any algorithm there is a data distribution for which it works arbitrarily bad given a finite sample [3, 5]. If “there is no a priori restriction on the possible phenomena that are expected, it is impossible to generalize and there is thus no better algorithm (any algorithm would be beaten by another one on some phenomenon)” [3]. As Bousquet et al. nicely put it:

Generalization = Data + Knowledge

This knowledge about possible phenomena can be rather general (e.g., some weak notion of continuity that can be assumed at least partly because the problem at hand is rooted in the physical world) and may already be reflected by the choice of a rather general learning algorithm.

All systems underlying the success stories of learning and optimization mentioned in Sutton’s essay still encode information about the problem at hand in some way, but not necessary about the way humans solve this problem. State-of-the-art learning systems for images, speech, and natural language use prior knowledge about the spatial and temporal structure of the data. For example, convolutional and pooling operations are used because of a priori assumptions about the spatiotemporal structure of the input; recurrent neural networks are chosen to exploit the sequential nature of data. This use of knowledge about the problem is also appreciated in Sutton’s essay (“deep-learning neural networks use only the notions of convolution and certain kinds of invariances”). Data augmentation, which is a common technique to improve generalization when the data is limited, also generally relies on prior knowledge about invariance properties. Incorporation of known constraints is also helpful. For example, the recently published seminal deep learning system AlphaFold [8] for determining the 3D shapes of proteins incorporates information about the physical and geometric constraints that govern protein folding. In conclusion, for efficiently and effectively solving machine learning tasks we have to bring in knowledge about the problem, but not necessary about how humans solve it.

In my opinion, the essay lacks an acknowledgement that in practice scaling up may often not be an option and resource efficient solutions are needed. A difficult AI problem may require huge amounts of data and processing time. For some AI applications, data may be very limited, for example for privacy or economical reasons, and there may be not time to wait until this situation changes (if it can change at all). Incorporation of knowledge may compensate for the lack of data and computational resources. Last but not least, the energy consumption of large AI systems (the state-of-the-art natural language processing tool GPT-3 has 175 billion parameters [4]) is already problematic from an environmental point of view [7, 10] , which questions scaling up compute systems as a general approach. I would like to use this opportunity to promote Carbontracker [1], a tool for measuring the energy consumption of your deep neural networks and other machine learning models.

Best wishes and enjoy reading this issue of KI,

figure b

Christian Igel