Cognitive Sophistication and Deliberation Times

Differences in cognitive sophistication and effort are at the root of behavioral heterogeneity in economics. To explain this heterogeneity, behavioral models assume that certain choices indicate higher cognitive effort. A fundamental problem with this approach is that observing a choice does not reveal how the choice is made, and hence choice data is insufficient to establish the link between cognitive effort and behavior. We show that deliberation times provide the missing link, in the form of an individually-measurable correlate of cognitive effort. We present a model of heterogeneous cognitive depth, incorporating stylized facts from the psychophysical literature, which makes predictions on the relation between choices, cognitive effort, incentives, and deliberation times. We confirm the predicted relations experimentally in different kinds of games. However, we also show that imputing cognitive depth from choices alone can lead to erroneous conclusions when the features leading to iterative thinking are not salient.


Introduction
Economic agents form different expectations and react differently even when confronted with the same information, leading to substantial behavioral heterogeneity, which in turn has long been recognized as a fundamental aspect of economic interactions (e.g., Haltiwanger and Waldman, 1985; Kirman, 1992; Blundell and Stoker, 2005). A key source of heterogeneity is the fact that cognitive capacities differ among individuals, as does the motivation to exert cognitive effort. This observation has given rise to a rich theoretical literature on iterative or stepwise reasoning processes, including level-k models (Stahl, 1993; Nagel, 1995; Stahl and Wilson, 1995; Ho et al., 1998) and models of cognitive hierarchies (Camerer et al., 2004). Such models endow individuals with differing degrees of strategic sophistication or reasoning capabilities, and might hold the key to describing heterogeneity in observed behavior (for a recent survey, see Crawford et al., 2013). In particular, they have proven invaluable in explaining behavioral puzzles such as overbidding in auctions (Crawford and Iriberri, 2007), overcommunication in sender-receiver games (Cai and Wang, 2006), coordination in market-entry games (Camerer et al., 2004), and why communication sometimes improves coordination and sometimes hampers it (Ellingsen and Östling, 2010). More recently, a small but growing literature in macroeconomics has started to incorporate heterogeneity in cognitive depth and iterative thinking (Angeletos and Lian, 2017), leading to promising insights on the effects of monetary policy (Farhi and Werning, 2017) or low interest rates (García-Schmidt and Woodford, 2018).
Existing models of heterogeneity in cognitive depth, however, face a fundamental problem. Choices are classified into different cognitive categories assumed to require different levels of cognitive effort. So far, there is little direct evidence linking observed play to cognitive effort. Most of the experimental literature has used observed choices to infer an individual's depth of reasoning from the associated cognitive categories. Hence, the observation of a given choice is used to infer cognitive effort taking the underlying path of reasoning or thought processes that led to the classification of choices as given, creating an essentially circular argument. One problem with this approach is that the same choice is always attributed to the same level of cognitive effort, although it might very well be the result of completely different decision rules. For example, an agent choosing an alternative after a complex cognitive process and another agent choosing the same alternative because of some payoff-irrelevant salient features cannot be distinguished on the basis of those choices alone. As a consequence, the level of cognitive effort associated with a choice becomes a non-testable assumption, and the sources of heterogeneity remain in the dark. A case in point is the work of Goeree et al. (2016), who identified a game where imputing cognitive depth from choices alone leads to clearly unreasonable conclusions, in the form of abnormally high imputed cognitive levels.
To establish that the source of observed behavioral heterogeneity is actually heterogeneity in cognitive effort and capacities, what is needed are individually measurable correlates of cognitive effort beyond choice data. That is, instead of identifying particular choices with particular levels of cognitive effort, one needs a direct measure of effort that makes it possible to show independently that certain choices actually are the result of stronger cognitive effort. We argue that response times, or, more properly in our context, deliberation times, can be fruitfully used for this purpose.
In the present work, we provide a simple model linking cognitive sophistication to choices and deliberation times, taking into account stylized facts from the psychophysiological literature on response times. We build on Alaoui and Penta's (2016b, 2016c) model of endogenous depth of reasoning. Our model rests on two key assumptions. The first is that total deliberation time is the sum of the deliberation times required for each step of reasoning, as in Alaoui and Penta (2016c). The second, which is the key innovation of our model, is that the time required for each step is a decreasing function of the value of reasoning of conducting that additional step, so that harder steps (those with a lower value of reasoning) take longer. In contrast, Alaoui and Penta (2016c) assume that deliberation times do not vary with the value of reasoning.
This latter assumption is motivated by a well-known phenomenon in psychology and neuroscience (going back to, at least, Dashiell, 1937), according to which easier choice problems (where alternatives' evaluations show large differences) take less time to respond to than harder problems. Hence, deliberation times are longer for alternatives that are more similar, either in terms of preference or along a predefined scale. This so-called chronometric effect has also been shown to be present in various economic settings such as intertemporal choice (Chabris et al., 2009), risk (Alós-Ferrer and Garagnani, 2018), lottery choice, consumer choice (Krajbich et al., 2010; Krajbich and Rangel, 2011), as well as in dictator and ultimatum games (Krajbich et al., 2015). Alós-Ferrer et al. (2018b) examine the consequences of the chronometric effect for revealed preference, and Alós-Ferrer et al. (2016) show that it helps explain and understand preference reversals in decisions under risk. This effect follows naturally when the choice process is captured by a drift diffusion model (DDM) (Ratcliff, 1978), a class of models that has been applied extensively in cognitive psychology and neuroscience, and which is receiving increasing attention in economics (Chabris et al., 2009; Fudenberg et al., 2018; Webb, 2019). In accordance with this evidence, we assume that the deliberation time for a given step of thinking is larger the smaller the value of reasoning for that step.
Our model provides empirically testable predictions, first, on the measurable effects of cognitive sophistication on choices and deliberation times, and second, on the effects of economic incentives on both the revealed depth of reasoning, as inferred from choices, and the psychophysiological correlate of cognitive effort embodied in deliberation times.
We test these predictions experimentally employing two different games commonly used to study iterative thinking (and, in particular, level-k reasoning): the beauty contest game (or guessing game; Nagel, 1995), which is the workhorse in that literature, and several variants of the 11-20 money request game, recently introduced by Arad and Rubinstein (2012), in the graphical version of Goeree et al. (2016). These variants all share the same path of reasoning, usually assumed to result from iterated application of the best-reply operator, but the payoff structures are manipulated in order to change the salience of certain features which might encourage or discourage iterative thinking.
The reason why we use the latter game is that in some of the variants considered in Goeree et al. (2016), reconciling observed behavior with iterative thinking would require inordinately high levels of depth of reasoning, compared to those usually observed in the literature. This provides a natural setting where deliberation times can discriminate whether observed behavior actually corresponds to different levels of cognitive effort, and thus modeling behavior via iterative thinking is justified.
In the beauty contest game we find longer deliberation times for choices commonly associated with more steps of reasoning, confirming the basic prediction of our model that deliberation time is increasing in cognitive effort. That is, the beauty contest game, a game where there is little doubt that level-k reasoning is prevalent, serves as a basic validation of the relationship between cognitive effort and deliberation times.
In the 11-20 games, again we show that deliberation times are longer for higher-level choices in variants where the payoff structure makes iterative thinking salient. Thus, in those cases our data verify the assumed connection between observed level and cognitive effort in support of level-k reasoning. However, in the variant where level-k reasoning is less natural or where a conflict with alternative decision rules (e.g., based on the salience of high payoffs) is likely, this systematic relation between allegedly higher-level choices and deliberation times disappears. Rather, in this case we find overall longer deliberation times, suggesting a conflict between competing decision rules. This indicates that individuals are likely to approach this situation in ways other than iterative thinking. That is, features besides and beyond the best-reply structure matter. More importantly, deliberation times serve as a test of whether a given model of iterative thinking is appropriate to describe actual play in specific settings.
Our model also relates changes in incentives to choices and deliberation times. First, it predicts a higher observed level of reasoning if the incentives, as captured by the payoff differences in the underlying game, are systematically increased. However, this does not imply that higher incentives necessarily imply longer deliberation times, in contrast to Alaoui and Penta (2016c). On the contrary, the second prediction of the model is that deliberation times will be shorter for a given number of steps when the incentives are increased, because decisions farther away from indifference require less deliberation. As a consequence, the model can accommodate the observation of higher depth of reasoning and (simultaneously) shorter deliberation times. Turning to the data, by using implementations of the 11-20 game with different incentive levels, we find a systematic effect of incentives on the observed depth of reasoning as predicted by the model and in line with the results in Alaoui and Penta (2016b). In the basic treatment where iterative thinking is salient, we find both more higher-level choices and shorter deliberation times when incentives are increased, in line with the prediction that higher incentives should decrease the time required for each single step. More importantly, this demonstrates empirically that higher incentives might reduce deliberation times, even though observed depth of reasoning is increased. This latter finding strongly supports our additional assumption, that is, incorporating the fact that easier decisions are faster into models of iterative reasoning is crucial. Otherwise, if decision times per step were assumed to be independent of incentives (as in Alaoui and Penta, 2016c), higher incentives would go hand-in-hand with longer deliberation times, contradicting the data.
Overall, this paper makes two contributions. First, we are the first to show that heterogeneity in behavior can actually be traced back to heterogeneity in cognitive effort by using direct correlates of the latter rather than exogenously identifying choices with different levels of cognitive effort. Second, the very same correlates show the limits of level-k reasoning. More generally, our results show how deliberation times can be used as a tool to identify the domain of applicability for a given model of iterative thinking.
We show that, depending on the strategic situation, behavioral heterogeneity might be mistaken for heterogeneity in cognitive effort even though there are no actual differences in cognitive effort. Hence, one might be led to draw wrong conclusions if models of iterative thinking are applied without an external way of testing for heterogeneity in cognitive effort. We show that deliberation times can be used as a tool to decide for a given economic problem whether a specific model of iterative thinking is justified or whether extraneous or additional elements are at the source of that heterogeneity, which then require further analysis.
The paper is structured as follows. Section 2 briefly relates our work to the literature. Section 3 introduces the model and derives the predictions. Section 4 describes the experimental design. Sections 5 and 6 present the results of the experiment for the beauty contest and the 11-20 games, respectively. Section 7 presents results on the effect of incentives. Section 8 discusses and summarizes our findings. The Appendix contains a number of additional observations and analyses.

Related Literature
There is a small but growing literature employing sources of evidence beyond choice data which suggests that individuals follow step-wise reasoning processes in certain settings. Bhatt and Camerer (2005) and Coricelli and Nagel (2009) show that iterative reasoning in different games, including the beauty contest game, and especially "thinking about thinking," correlates with neural activity in areas of the brain associated with mentalizing (Theory of Mind network; see Alós-Ferrer, 2018a), building a notable bridge between social neuroscience and game theory. Brañas-Garza et al. (2012), Carpenter et al. (2013), and Gill and Prowse (2016) relate higher cognitive ability (as measured, e.g., by the Cognitive Reflection Test or the Raven test) to more steps of reasoning in the beauty contest game. Further, Fehr and Huck (2016) find that subjects whose cognitive ability is below a certain threshold lack strategic awareness, that is, they randomly choose numbers from the whole interval. Other works have relied on eye-tracking measurements or click patterns recorded via MouseLab to obtain information on search behavior, which is then used to make inferences regarding level-k reasoning (Costa-Gomes et al., 2001; Crawford and Costa-Gomes, 2006; Polonio et al., 2015).
Notably, Lindner and Sutter (2013) found that under time pressure behavior in the 11-20 game was closer to the Nash equilibrium, although the authors recommend caution in interpreting the result. In contrast,  find no evidence for Nash equilibrium play. Instead, subjects exhibit a shift to less complex decision rules (requiring fewer elementary operations to execute) under time pressure in various 3 × 3 games; this shift is primarily driven by a significant increase in the proportion of level-1 players. In a repeated p-beauty contest Gill and Prowse (2018) show that subjects who think for longer on average win more rounds and choose lower numbers closer to the equilibrium. Agranov et al. (2015) study empirically how strategic sophistication develops over time in the beauty contest game and find that sophisticated players show evidence of increased understanding as time passes.
Clearly, our work is also related to the growing literature employing response times in economics. Examples include the studies of risky decision making by Wilcox (1993, 1994), the web-based studies of Rubinstein (2007, 2013), and recent studies such as Achtziger and Alós-Ferrer (2014) and Alós-Ferrer et al. (2016). To date, however, only a few works in economics have explicitly incorporated response times in models of reasoning. Chabris et al. (2009) study the allocation of time across decision problems. Their model is similar in spirit to ours in that it is motivated by the chronometric "closeness-to-indifference" effect. In particular, they also model response time as a decreasing function of differences in expected utility. However, in contrast to our model they focus on binary intertemporal choices and do not consider iterative reasoning. They report empirical evidence that choices among options whose expected utilities are closer require more time, thus indicating an inverse relationship between response times and utility differences. They argue in favor of the view that decision making is a cognitively costly activity that allocates time according to cost-benefit principles. Achtziger and Alós-Ferrer (2014) and Alós-Ferrer (2018b) consider a dual-process model of response times in simple, binary decisions where different decision processes interact in order to arrive at a choice. The emphasis of the model, however, is on the effects of conflict and alignment among processes, that is, whether a particular decision process or heuristic supports a more rationalistic one or rather leads the decision maker astray. The predictions of the model help understand when errors, defined as deviations from a normative, rationalistic process, are faster or slower than correct responses.
Finally, our work sheds light on the recent literature exploring the limits of models of iterative thinking, as the experiment of Goeree et al. (2016) mentioned above. It has been pointed out that strategic sophistication, as captured by level-k models, might be heavily dependent on the situation at hand. Hargreaves Heap et al. (2014) suggest that even (allegedly-nonstrategic) level-0 behavior might depend on the strategic structure of the game. In a repeated beauty contest, Gill and Prowse (2018) found that the level of strategic reasoning also depends on the complexity of the situation in the previous round. Georganas et al. (2015) show that strategic sophistication can be largely persistent within a given class of games but not necessarily across different classes of games. That is, the congruence between level-k models and subjects' actual decision processes may depend on the context. Allred et al. (2016) complement this result showing that the implications of available cognitive resources on strategic behavior are not persistent across classes of games. These difficulties raise the question of whether models of iterative thinking can be actually understood as procedural, that is, as describing how decisions are actually arrived at, or rather as purely descriptive, outcome-based models. Further, if iterative thinking cannot be taken as a persistent mode of behavior (across individuals and across games), it becomes particularly important to identify what triggers its use and in which situations it conflicts with other decision rules. Again, choice data alone is not sufficient to answer these questions.

The Model
We model decision making as a process of iterative reasoning as put forward in the literature on iterative thinking (Stahl, 1993; Nagel, 1995; Stahl and Wilson, 1995; Ho et al., 1998). Our approach is based on Alaoui and Penta (2016b), who model stepwise reasoning procedures as the result of a cost-benefit analysis. This approach is given an axiomatic foundation in Alaoui and Penta (2016a), where the primitive properties required for a reasoning process to be represented by a cost-benefit analysis are characterized. A player's depth of reasoning is endogenously determined depending on both individual cognitive ability and the payoffs of the game. That is, each step of reasoning requires a certain understanding of the strategic situation, modeled by an incremental cognitive cost. On the other hand, the benefit of an additional step, the value of reasoning, is assumed to depend on the payoff structure of the game. Behavior then follows from a combination of depth of reasoning and beliefs about the reasoning process of the opponent. For a given game each additional step of reasoning requires cognitive effort, which, however, is usually not directly observable. We extend the model by linking depth of reasoning to deliberation times as proposed by Alaoui and Penta (2016b). The idea is to use deliberation times as a proxy for cognitive effort. Total deliberation time is assumed to be the sum of the step-wise deliberation times, thus representing total cognitive effort. We extend Alaoui and Penta (2016c) by adding a simple but crucial assumption, which is the key innovation of our model: the time required for each step of reasoning is a decreasing function of the value of reasoning for that step, which depends on the payoff structure of the game. Steps with a lower value of reasoning are harder and therefore take longer.
This model yields testable predictions linking deliberation times to choices and incentives for specific classes of strategic games. We present the model for a symmetric, two-player game $\Gamma = (S, \pi)$ with finite strategy space $S$ and payoff function $\pi : S \times S \to \mathbb{R}$. To economize on notation we focus on the two-player case, but the extension to the $N$-player case is straightforward. Following Alaoui and Penta (2016b), we define the path of reasoning in the following way. A path of reasoning for $\Gamma$ is a sequence of (possibly mixed) strategies $(s^*_k)_{k \in \mathbb{N}}$. Strategy $s^*_0$ is the starting point or anchor for the path of reasoning, representing the default strategy that a player not engaging in any deliberation would choose. As player $i$ performs the first round of introspection, he becomes aware that his opponent may choose $s^*_0$, and thus considers choosing the next-step strategy $s^*_1$. A process of iterative thinking can then be interpreted as a sequence of steps of reasoning along the induced path of reasoning $(s^*_k)_{k \in \mathbb{N}}$. For example, in step $k$ player $i$, who intends to play $s^*_{k-1}$ after $k-1$ rounds of introspection, realizes that $j$ may have reached the same conclusion, that is, to play $s^*_{k-1}$. Hence, in step $k$ player $i$ considers choosing $s^*_k$. The number of steps of reasoning a player is willing (or able) to perform is determined endogenously, and depends on the cognitive cost and the value of reasoning. The cognitive cost associated with the $k$th step of reasoning is given by $c_i(k)$ with $c_i : \mathbb{N} \to \mathbb{R}_+$. Player $i$'s cognitive costs represent his cognitive abilities, or in other words, how difficult it is for $i$ to reach the next level of understanding. Similarly, the value of conducting the $k$th step of reasoning is represented by a function $v_i : \mathbb{N} \to \mathbb{R}_+$. Cognitive costs represent heterogeneity in cognitive ability and are assumed to be constant for a given individual (for a given game).
The value of reasoning, on the other hand, will depend on the specific payoffs of the game. Thus player $i$'s depth of reasoning (or cognitive bound) is given by $K_i(\Gamma) = \min\{k \in \mathbb{N} \mid v_i(k+1) < c_i(k+1)\}$ if the set is nonempty, and $K_i(\Gamma) = \infty$ otherwise. That is, player $i$ stops the process of iterative reasoning as soon as the cost exceeds the value of an additional step of reasoning. The depth of reasoning of player $i$, $K_i(\Gamma)$, depends on both the cognitive cost and the specifics of the payoff structure of the underlying game determining the value of reasoning. Next, we discuss how systematic changes in the payoff structure affect the depth of reasoning. Consider two games $\Gamma = (S, \pi)$ and $\Gamma' = (S, \pi')$ with common strategy space $S$ and identical path of reasoning $(s^*_k)_{k \in \mathbb{N}}$. These games are equally difficult to reason about, or cognitively equivalent (Alaoui and Penta, 2016b). We assume that cognitively equivalent games induce the same cognitive costs. However, the value of reasoning may vary even among cognitively equivalent games, since it depends on the actual payoffs of the game. Denote by $v_i$ and $v'_i$ the value of reasoning induced by the payoff structure in $\Gamma$ and $\Gamma'$, respectively. We say that $\Gamma'$ has higher incentives to reason than $\Gamma$ at step $k$ if $v'_i(k) \geq v_i(k)$. If $\Gamma'$ has higher incentives to reason than $\Gamma$ for every step $k \leq K_i(\Gamma)$, then $K_i(\Gamma') \geq K_i(\Gamma)$. This yields the following prediction regarding the effect of varying incentives for a class of games that are equally difficult to reason about (Alaoui and Penta, 2016b).
Prediction 1. Suppose $\Gamma$ and $\Gamma'$ are cognitively equivalent. If $\Gamma'$ has higher incentives to reason than $\Gamma$ for all steps up to $k = K_i(\Gamma)$, then $\Gamma'$ induces weakly more steps of reasoning for player $i$ than $\Gamma$, that is, $K_i(\Gamma') \geq K_i(\Gamma)$.
Next, we link this simple model of iterative thinking to deliberation times. We assume that the deliberation time for conducting $k$ steps of thinking is the sum of the deliberation times required for each step, as in Alaoui and Penta (2016c). We then extend their model by adding an assumption on how changes in the value of reasoning affect the deliberation times per step.
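Returning to the cognitive bound and Prediction 1, the stopping rule can be sketched numerically. The following is a minimal illustration, not part of the original analysis: the linear cost function, the two constant value functions, and the truncation parameter `max_steps` (used to stand in for the case where the bound is infinite) are all hypothetical choices made only for this example.

```python
import math

def depth_of_reasoning(value, cost, max_steps=1000):
    """Return the smallest k >= 0 with value(k + 1) < cost(k + 1).

    `value` and `cost` map a step index to a non-negative number.
    If no such k exists below `max_steps` (a hypothetical truncation),
    the bound is reported as infinite.
    """
    for k in range(max_steps):
        if value(k + 1) < cost(k + 1):
            return k
    return math.inf

# Hypothetical primitives: the same cost in both games (cognitive equivalence),
# and a value of reasoning that is higher at every step in the second game.
cost = lambda k: 1.0 * k
value_low = lambda k: 2.5
value_high = lambda k: 4.5

K_low = depth_of_reasoning(value_low, cost)    # stops before step 3
K_high = depth_of_reasoning(value_high, cost)  # stops before step 5
print(K_low, K_high)
```

Under these assumed primitives the game with the higher value of reasoning induces weakly more steps, which is the comparative static stated in Prediction 1.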
For a given game $\Gamma$, let $\tau_i : \mathbb{N}_+ \to \mathbb{R}_+$ give the time $\tau_i(k)$ required by player $i$ to conduct the $k$th step of reasoning. The total deliberation time of player $i$ in game $\Gamma$ to perform $k = K_i(\Gamma) > 0$ steps of reasoning is then given by $T_i(s^*_k) = \sum_{j=1}^{k} \tau_i(j)$. We say that a strategy $s$ requires more steps of reasoning than $s'$ if $s = s^*_k$ and $s' = s^*_{k'}$ with $k > k'$. Then a straightforward consequence of viewing deliberation times as a sum of binary-choice decision times is the following prediction.
Prediction 2. For a given game $\Gamma$, deliberation time for a choice is longer if it requires more steps of reasoning, that is, $T_i(s) > T_i(s')$ if $s$ requires more steps of reasoning than $s'$.
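A short sketch of why Prediction 2 follows from additivity: if total deliberation time is the running sum of strictly positive per-step times, it must increase with the number of steps. The per-step times below are hypothetical numbers chosen only for illustration; they are deliberately non-monotonic to show that the ordering of totals does not depend on the ordering of the individual steps.

```python
from itertools import accumulate

def total_times(step_times):
    """Cumulative deliberation time after 1, 2, ..., len(step_times) steps."""
    return list(accumulate(step_times))

# Hypothetical per-step deliberation times (in seconds) for steps 1, 2, 3.
step_times = [4.0, 3.0, 5.0]
T = total_times(step_times)
print(T)  # [4.0, 7.0, 12.0]
```

A choice associated with more steps therefore has a strictly longer total time whenever every step takes a positive amount of time.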
Alaoui and Penta (2016c) assume that $\tau_i(v_i)$ is constant within a class of cognitively equivalent games; in particular, $\tau_i(v_i)$ does not vary with the value of reasoning $v_i$. This leads them to the following prediction regarding the effect of incentives on total deliberation times.
Prediction 3 (Alaoui and Penta, 2016c). If $\Gamma$ and $\Gamma'$ are cognitively equivalent, but $\Gamma'$ has a higher value of reasoning, that is, $v'_i(k) \geq v_i(k)$ for all $k$, then total deliberation time is weakly longer in $\Gamma'$ than in $\Gamma$.
The intuition is simple: a higher value of reasoning leads to more steps of reasoning ($K_i(\Gamma') \geq K_i(\Gamma)$, with a strict inequality if the increase in value is large enough), and since they assume that the per-step deliberation time does not vary with the value of reasoning, total deliberation time cannot decrease. In contrast, we will assume that the per-step deliberation time depends on the value of reasoning; as a consequence, Prediction 3 does not necessarily hold.
The key innovation of our model is motivated by a well-known phenomenon in neuroeconomics and psychology, according to which deliberation times are longer for alternatives that are more similar (Dashiell, 1937), and which follows naturally from sequential sampling models from cognitive psychology (Ratcliff, 1978; Fudenberg et al., 2018). As already mentioned in the introduction, this effect has also been established in various economic settings such as intertemporal choice (Chabris et al., 2009) and consumer choice (Krajbich and Rangel, 2011), as well as in dictator and ultimatum games (Krajbich et al., 2015). In accordance with this evidence, we assume that the deliberation time for a given step of thinking is larger the smaller the value of reasoning for that step.
Assumption 1. Difficult steps are slower, that is, the per-step deliberation time $\tau_i(v_i)$ is decreasing in the value of reasoning $v_i$.
The following prediction crucially hinges on the assumption that deliberation times per step are a decreasing function of incentives.
Prediction 4. Consider two cognitively equivalent games $\Gamma$ and $\Gamma'$. If $\Gamma'$ has a higher value of reasoning, then the deliberation time for a choice that requires $k$ steps of reasoning is shorter in $\Gamma'$ than in $\Gamma$.
For a fixed number of steps of thinking our model predicts shorter deliberation times for higher incentives, because a player requires less time for each step. In contrast with Prediction 3 (Alaoui and Penta, 2016c), however, this does not necessarily imply that one should observe shorter total deliberation times for larger incentives, because for larger incentives subjects may also conduct more steps of thinking (Prediction 1), which in turn increases overall deliberation time. Thus, in our model larger incentives have a two-fold effect, with (weakly) more steps of reasoning on the one hand and shorter deliberation times per step on the other hand.
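The two-fold effect can be made concrete with a small numerical sketch. Everything below is an assumption introduced for illustration only: a value of reasoning that is the same at every step, a linear cost `cost(k) = k`, and the functional form `tau(v) = 10 / v` for the per-step time, which is merely one decreasing function among many and is not taken from the model.

```python
def depth(value, cost, max_steps=1000):
    """Smallest k >= 0 with value < cost(k + 1), for a step-independent value."""
    for k in range(max_steps):
        if value < cost(k + 1):
            return k
    return max_steps  # hypothetical truncation for the unbounded case

def depth_and_total_time(value, cost, tau):
    """Return (number of steps, total deliberation time) for a given value."""
    k = depth(value, cost)
    return k, k * tau(value)

cost = lambda k: float(k)   # hypothetical cost of the k-th step
tau = lambda v: 10.0 / v    # hypothetical per-step time, decreasing in v

for v in (2.0, 2.9, 3.5):
    k, t = depth_and_total_time(v, cost, tau)
    print(f"value={v}: steps={k}, total time={t:.2f}")
```

In this example, raising the value from 2.0 to 2.9 leaves the number of steps unchanged and shortens total time (Prediction 4). Raising it from 2.9 to 3.5 adds a step and lengthens total time, while raising it from 2.0 to 3.5 adds a step and still shortens total time. The sign of the overall effect is therefore not pinned down, which is consistent with the discussion above.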
Last, we remark on the role of individual cognitive ability. In terms of our model, there are two conceivable ways in which individual differences in cognitive ability may affect choices and deliberation times. On the one hand, higher cognitive ability could translate into uniformly lower cognitive costs of reasoning, $c_i$. In that case, players with higher cognitive ability are likely to conduct more steps of reasoning, because a uniformly lower cost function weakly increases the cognitive bound $K_i(\Gamma)$. Since total deliberation time is the sum of one-step deliberation times, conducting more steps tends to increase total deliberation time. On the other hand, it is unclear how higher cognitive ability would translate into deliberation times per step. Both longer deliberation times, e.g., because higher cognitive ability leads to more thorough thinking, and shorter deliberation times, e.g., because performing a step of reasoning is easier for higher-ability individuals, are conceivable, so that the overall effect on deliberation times is indeterminate. Importantly, cognitive costs are assumed to be affected only by the strategic structure of the game, the path of reasoning, and potentially by individual cognitive ability. Thus, fixing individual cognitive ability, the effects of changes in the incentive structure described above remain unaffected as long as the path of reasoning (or more generally the difficulty) of the game is not altered.

Application to level-k reasoning
We now apply the model to a particular process of iterative thinking, level-k reasoning. This is also the model that we will test in our experiment.
Consider a symmetric, two-player game $\Gamma = (S, \pi)$ and fix player $i$ with $i \neq j$. Denote by $BR_i : \Delta \rightrightarrows S$ $i$'s best-response correspondence, where $\Delta$ is the set of mixed strategies over $S$. For simplicity, we assume that for any $s \in S$ there is a unique best reply, denoted by $BR(s)$, that is, $BR(s)$ is the unique maximizer of $\pi(\cdot, s)$. According to the standard level-k model, the path of reasoning $(s^*_k)_{k \in \mathbb{N}}$ is given by $s^*_k = BR(s^*_{k-1})$ for $k > 0$, and $s^*_0$ is the assumed level-0 strategy adopted by non-strategic players. Following Alaoui and Penta (2016b), we assume that an individual's cost of reasoning is the same for games that a player essentially approaches in the same way, that is, for games that are cognitively equivalent. The value of reasoning, however, depends on the specific payoff structure of the game, which may vary even for cognitively equivalent games. More precisely, we assume that the value of reasoning depends only on the payoffs of the game and that increasing all (relevant) payoff differences at a certain step increases the value of reasoning at that step. These two minimal conditions are all that is required for the comparative statics we will use later in the analysis of the experimental data. For concreteness, we will assume that the value of reasoning takes the following "maximum-gain representation": $v_i(k) = \max_{s \in S} \left[ \pi(BR(s), s) - \pi(s^*_{k-1}, s) \right]$. That is, the value of reasoning is the maximum gain the player could obtain by choosing the optimal strategy compared to his current strategy, at step $k$, for all possible actions of the other player. In a sense, the player is optimistic about the value of thinking more, considering the highest possible payoff improvement. Alaoui and Penta (2016a) provide axiomatic foundations for this and other representations of the value of reasoning.
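To make the level-k path and the maximum-gain idea concrete, the following sketch uses a hypothetical three-strategy request game invented for this illustration (a player requests 1, 2, or 3 and earns a bonus of 4 for requesting exactly one less than the opponent). It reads the verbal description above as the largest difference, over opponent actions, between the best-reply payoff and the payoff of the player's current strategy; that reading, the game, and all function names are assumptions of the example.

```python
S = [1, 2, 3]  # hypothetical strategy space (amounts requested)

def payoff(a, b, bonus=4):
    """Own request plus a bonus for undercutting the opponent by exactly one."""
    return a + (bonus if a == b - 1 else 0)

def best_reply(b):
    """Best reply to opponent strategy b (unique in this example)."""
    return max(S, key=lambda a: payoff(a, b))

def path_of_reasoning(s0, steps):
    """Level-k path: each strategy is the best reply to the previous one."""
    path = [s0]
    for _ in range(steps):
        path.append(best_reply(path[-1]))
    return path

def max_gain(current):
    """Largest gain from playing a best reply instead of `current`,
    taken over all opponent strategies."""
    return max(payoff(best_reply(b), b) - payoff(current, b) for b in S)

path = path_of_reasoning(3, 3)
values = [max_gain(path[k - 1]) for k in range(1, len(path))]
print(path)    # [3, 2, 1, 3]
print(values)  # [3, 3, 5]
```

Scaling the bonus up raises the payoff differences at each step without changing the best-reply path, which is the sense in which incentives can vary across cognitively equivalent games.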

Experimental Design
We use two games commonly employed to study cognitive sophistication, the classical beauty contest game (Nagel, 1995) and the 11-20 money request game, a more recent alternative that was explicitly designed to study level-k behavior (Arad and Rubinstein, 2012). We ask whether a higher level of reasoning (in the standard level-k sense) is reflected in higher cognitive effort, or in other words, whether there is a direct link between higher levels of reasoning and deliberation times. We use different versions of the 11-20 game that vary the incentives, that is, the value of reasoning, while leaving the underlying best-reply structure, and thus the path of reasoning, unaffected. This allows us to study how choices and deliberation times react to systematic changes in the payoff structure providing a direct test of the implications of the model presented in Section 3 applied to level-k reasoning.

The Beauty Contest Game
The standard workhorse for the study of cognitive sophistication is the guessing game, or p-beauty contest game (Nagel, 1995). We use a standard, one-shot, beauty contest game with p = 2/3 with discrete strategy space S = {0, 1, . . . , 99, 100}. In this game, a population of N players has to simultaneously guess an integer number between 0 and 100. The winner is the person whose guess is closest to p times the average of all chosen numbers. The winner receives a fixed prize P , which is split equally among all winners in case of a tie.
In this game it is usually assumed that non-strategic (level-0) players pick a number at random from the uniform distribution over $\{0, \dots, 100\}$. Hence, we assume that the starting point for the level-k path of reasoning, $s_0^*$, is the mixed strategy that assigns equal probability to all numbers. If a player thinks that all other players choose $s_0^*$, then (for $N$ large enough) the average of all numbers chosen is (close to) 50 and hence the best reply to $s_0^*$ is to choose $s_1^* = 33$, that is, the integer closest to 2/3 times 50. As a player performs the next step, he becomes aware that his opponents might choose 33 as well, and thus considers choosing a best reply to a profile where all other players choose 33. Hence, the level-2 strategy is $s_2^* = 22$, the integer closest to 2/3 times 33. 5 Iterating, this defines the best-reply structure of the beauty contest game $(s_k^*)_{k \in \mathbb{N}}$, where $s_k^*$ is the integer closest to $(2/3) s_{k-1}^*$ for $k > 0$. 6 If $N$ is large enough, this game has two Nash equilibria at 0 and 1 (Seel and Tsakas, 2017).
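The iteration described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (each step picks the integer closest to 2/3 times the previous step, starting from a level-0 average of 50); the function name `level_k_path` and its parameters are hypothetical and not part of the original study.

```python
def level_k_path(p_num=2, p_den=3, level0_mean=50, steps=10):
    """Return [s_1, ..., s_steps] for the p-beauty contest with p = p_num / p_den.

    Assumption: each step is the integer closest to p times the previous
    step, starting from the mean of the uniform level-0 distribution.
    """
    path = []
    s = level0_mean
    for _ in range(steps):
        # Nearest integer to (p_num * s) / p_den, computed with integer
        # arithmetic to avoid floating-point rounding issues.
        s = (2 * p_num * s + p_den) // (2 * p_den)
        path.append(s)
    return path


print(level_k_path())
```

Under these assumptions the path starts at 33 and 22, as in the text, and eventually stays at 1, which is one of the two equilibrium guesses mentioned above.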
Assuming that the value of reasoning has a maximum-gain representation, the value of reasoning at each step is the same and equals the prize $P$. To see this, note that switching from $s_{k-1}^*$ to $s_k^*$ yields a payoff improvement of $P$ for any strategy profile where all opponents choose some strategy $s \in (s_k^*, s_{k-1}^*)$. Since this is the maximum possible gain, it follows that $v(k) = P$.

The 11-20 Game
The second part of our experiment focuses on variants of the 11-20 money request game, which was introduced by Arad and Rubinstein (2012) as a two-player game particularly well suited to study level-k reasoning. A modified version of this game was also employed by Alaoui and Penta (2016b) to test their model of endogenous depth of reasoning. Goeree et al. (2016) introduced a graphical version of the 11-20 game that allows one to vary the payoff structure without affecting the underlying best-reply structure of the game. We now describe a generalized version of this graphical 11-20 game. In what follows, we will refer to this game (and variants thereof) simply as "11-20 game."

Each player chooses one of ten boxes, numbered from 0 (rightmost) to 9 (leftmost). A player receives the sure amount $A_b$ contained in his chosen box $b$ and, in addition, a bonus $R$ if his box is exactly one to the left of his opponent's box. That is, payoffs are given by

$$\pi(b_i, b_j) = \begin{cases} A_{b_i} + R & \text{if } b_i = b_j + 1, \\ A_{b_i} & \text{otherwise.} \end{cases}$$

A feature of this game is that choosing box 0 is the salient and obvious candidate for a non-strategic level-0 choice, because it awards the highest "sure payoff" of 20 that can be obtained without any strategic considerations. Thus, the rightmost box 0 is a natural anchor serving as the starting point for level-k reasoning. If the bonus $R$ is large enough, that is, $R > 20 - \min\{A_b \mid b = 1, \dots, 9\}$, then the path of reasoning for the level-k model is given by $s_0^* = 0$ and $s_k^* = k$ for $k = 1, \dots, 9$. 7 In other words, for a sufficiently large bonus the best reply is always to choose the box that is exactly one to the left of your opponent (if there is such a box). In particular, the path of reasoning is independent of the specific payoff structure, as long as the bonus is sufficiently large and the rightmost box is a salient anchor.

We use the three versions of the 11-20 game shown in Figure 2. 8 The sure payoffs given by the amounts $A_0, \dots, A_9$ differ across versions; however, they are chosen in such a way that the best-reply structure described above remains unchanged. In the baseline version (BASE) the amounts are increasing from the left box to the rightmost box, containing the highest amount of 20.
BASE corresponds to the original version of Arad and Rubinstein (2012) and to the baseline version of Goeree et al. (2016). In BASE there is a natural trade-off between the sure payoffs $A_1, \dots, A_9$ and the bonus: with each incremental step of reasoning, a player gives up one unit of sure payoff.

We further varied these three versions of the 11-20 game along two additional dimensions that systematically vary the incentives to reason without altering the path of reasoning. First, for each treatment there was an additional "large increment" version, where for BASE and EXTR the amounts $A_1, \dots, A_9$ range from 2 to 20 in increments of 2 instead of from 11 to 20 in increments of 1, and for FLAT all amounts other than 20 were set to 14 in the large increment version instead of 17 (see Figure 3). Depending on the treatment, the trade-off between bonus and sure payoff for an additional step of reasoning is decreased or increased for large increments. Second, we varied the incentives to reason by changing the size of the bonus $R$ for choosing the box exactly one to the left of the other player's. Specifically, in the additional high-bonus condition, subjects obtained $R = 40$ additional points for the "correct" box, while in the low-bonus condition they received $R = 20$ additional points. Given the path of reasoning $(s_k^*)_k$ with $s_k^* = k$ for $k = 1, \dots, 9$ induced by the standard level-k model, Table 1 gives the value of reasoning assuming a maximum-gain representation.

7 The best reply to box 9 is to choose box 0, hence for $k > 9$ the best-reply structure cycles repeatedly from 0 to 9. That is, theoretically a choice of box $k$ could also result from $k + 10$ (or generally $k + 10n$) steps. To solve this issue, Alaoui and Penta (2016b) propose a modified 11-20 game that breaks this best-reply cycle. The observed distribution of play, however, is very similar to the one in Rubinstein's original 11-20 game (exhibiting the best-reply cycle). This is not surprising, because existing evidence in the literature documents that 10 or more steps of reasoning are highly uncommon. Thus, focusing on the first 9 steps only is likely to be inconsequential.

8 For each of the three versions BASE, FLAT, and EXTR there is a unique mixed strategy Nash equilibrium. For the small-increment, low-bonus versions these are given by (0, 0, 0, 0, 1/4, 1/4, 1/5, 3/20, 1/10, 1/20), (0, 0, 0, 1/10, 3/20, 3/20, 3/20, 3/20, 3/20, 3/20), and (0, 0, 0, 0, 0, 0, 0, 3/20, 2/5, 9/20), respectively.
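To make the payoff structure and the resulting incentives concrete, the following Python sketch computes best replies and the value of reasoning for BASE and FLAT. It rests on several assumptions that should be read as such: the sure amounts are inferred from the verbal description (BASE: 20 down to 11 in steps of 1, or 20 down to 2 in steps of 2; FLAT: 20 in box 0 and 17 or 14 elsewhere), the value of reasoning is taken to be the maximum over opponent boxes of the gain from moving from box k-1 to box k, and all function and variable names are hypothetical. EXTR is omitted because its amounts are not spelled out in the text.

```python
def payoff(own, other, amounts, bonus):
    """Sure amount of the chosen box, plus the bonus if it lies exactly
    one box to the left of the opponent's box (0 = rightmost box)."""
    return amounts[own] + (bonus if own == other + 1 else 0)


def best_reply(other, amounts, bonus):
    """Box maximizing the payoff against a given opponent box."""
    return max(range(len(amounts)),
               key=lambda b: payoff(b, other, amounts, bonus))


def value_of_reasoning(k, amounts, bonus):
    """Assumed maximum-gain form: largest payoff improvement from
    switching from box k-1 to box k, over all opponent boxes."""
    return max(payoff(k, s, amounts, bonus) - payoff(k - 1, s, amounts, bonus)
               for s in range(len(amounts)))


# Sure amounts inferred from the description (assumptions, not the figures).
BASE_SMALL = [20 - b for b in range(10)]
BASE_LARGE = [20 - 2 * b for b in range(10)]
FLAT_SMALL = [20] + [17] * 9
FLAT_LARGE = [20] + [14] * 9

for name, amounts in [("BASE small", BASE_SMALL), ("BASE large", BASE_LARGE),
                      ("FLAT small", FLAT_SMALL), ("FLAT large", FLAT_LARGE)]:
    for bonus in (20, 40):
        values = [value_of_reasoning(k, amounts, bonus) for k in range(1, 10)]
        print(name, "R =", bonus, values)
```

Under these assumptions, a larger bonus raises the value of every step by exactly the bonus difference, large increments lower every step in BASE by one point, and in FLAT only the first step is affected by the increment size.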

Design and Procedures
A total of 128 subjects (79 female) participated in 4 experimental sessions with 32 subjects each. Participants were recruited from the student population of the University of Cologne using ORSEE (Greiner, 2015), excluding students of psychology, economics, and economics-related fields, as well as experienced subjects who already participated in more than 10 experiments. The experiment was conducted at the Cologne Laboratory for Economic Research (CLER) and was programmed in z-Tree (Fischbacher, 2007).
The experiment consisted of three parts during which subjects could earn points.
First, each subject played a series of different versions of the money request game.
Each treatment BASE, FLAT, and EXTR was played four times, once for each bonus-increment combination. Second, subjects participated in a single beauty contest game with p = 2/3. In the third part we collected several individual correlates intended to control for cognitive ability, social value orientation, aversion to strategic uncertainty, swiftness, and demographics. There was no feedback during the course of the experiment, that is, subjects did not learn the choices of their opponents nor did they get any information regarding their earnings until the very end of the experiment. All decisions were made independently and at a subject's individual pace. In particular, subjects never had to wait for the decisions of another subject except for the very end of the experiment (when all their decisions had already been collected). At that point they had to wait until everybody had completed the experiment so that outcomes and payoffs could be realized.
We now describe each part of the experiment in detail. For the 11-20 games, we randomly assigned the subjects within a session to one of four randomized sequences of the games to control for order effects. 9 Subjects were informed that for every game they would be randomly matched with a new opponent to determine their payoff for that round, hence preserving the one-shot character of the interaction. Each of the three variants BASE, FLAT, and EXTR was played exactly four times, once for each possible combination of increment (small/large) and bonus (low/high).
In the second part, subjects played a single beauty contest game with p = 2/3 among all 32 participants in the session. The winner, that is, the subject whose guess was closest to 2/3 times the average of all choices, received 500 points. In case of a tie, the rules specified to split this amount equally among all winners.
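The payment rule for the beauty contest can be sketched as follows. This is a minimal illustration of the rule as described (closest guess to 2/3 of the average wins 500 points, split equally among tied winners); the function name `beauty_contest_payouts` is hypothetical, and exact rational arithmetic is used only to avoid spurious ties or near-ties from floating-point rounding.

```python
from fractions import Fraction


def beauty_contest_payouts(guesses, p=Fraction(2, 3), prize=500):
    """Return the target number and a dict {player index: points won}."""
    target = p * Fraction(sum(guesses), len(guesses))
    best = min(abs(g - target) for g in guesses)
    winners = [i for i, g in enumerate(guesses) if abs(g - target) == best]
    share = Fraction(prize, len(winners))  # equal split in case of a tie
    return target, {i: share for i in winners}


target, payouts = beauty_contest_payouts([33, 22, 50, 22])
print(float(target), payouts)
```

In this made-up example the two players guessing 22 are closest to the target and share the prize equally.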
In the final part of the experiment, participants answered a series of questions. First, subjects completed an extended 7-item version of the CRT from Toplak et al. (2014), which includes the three classical items from Frederick (2005). 10 Subjects received 5 points for each correct answer. Next, we elicited aversion to strategic uncertainty using the method by Heinemann et al. (2009) with random groups of four. The task involves measuring certainty equivalents, similarly to Holt and Laury's (2002) multiple price list method, for a situation where payoffs depend on the decision of another subject, that is, strategic uncertainty. In ten situations subjects have to choose between different safe amounts (5 to 50 points) and an option in which they earn 50 points if at least two other members of their group have also chosen that option and zero points otherwise.
Subjects were randomly allocated into groups of four, and for each group one of the decision situations was randomly selected for payment. Finally, we collected a measure to control for differences in mechanical swiftness (Cappelen et al., 2013). To that end we recorded the time needed to complete four simple demographic questions on gender, age, field of study, and native language. This part was integrated into a larger questionnaire, which also comprised questions regarding subjects' understanding of the tasks, their perception of its complexity, number of university semesters, left-or right-handedness, average amount of money needed per month, and previous attendance of a lecture in game theory.
To determine a subject's earnings in the experiment, the payoffs from each part were added up and converted into euros at a rate of 0.25 € for each 10 points (around $0.28 at the time of the experiment). In addition, subjects received a show-up fee of 4 €, for an average total remuneration of 15.67 €. A session lasted on average 60 minutes including instructions and payment. 11

9 The sequences are provided in the supplementary material (see Online Appendix). Besides the three treatments discussed here, the sequences contained an additional treatment discussed in Appendix B.

10 Subjects also answered the two additional items proposed by Primi et al. (2016), but our results do not change if we use their extended CRT version or a combination of both instead. Other studies (Cappelen et al., 2013; Gill and Prowse, 2016) have also used the Raven test as a proxy for cognitive ability. Brañas-Garza et al. (2012) used the Raven test and the CRT by Frederick (2005) in a series of six one-shot p-beauty games and found that CRT predicts lower choices (i.e. higher level), while performance in the Raven test did not.

11 The original instructions were in German. A translation of the instructions into English can be found in the supplementary material (see Online Appendix).

12 Classification of levels: Level 1 (31-35), Level 2 (20-24), Level 3 (13-17), Level 4 (8-12), Level 5 (7), Level 0 (rest). There were no choices in the range 1-6. Two subjects with very fast choices of 0 were excluded from the analysis. Our results are robust when those choices are included and classified as level-0. Further, our results are unchanged for narrower classifications of levels, e.g. where only the level-k strategy ±1 are classified as level-k.

Notes: Standard errors in parentheses. * p < 0.1, ** p < 0.05, *** p < 0.01.

This observation is consistent with Prediction 2, that is, deliberation time is longer for choices that require more steps of reasoning. We now test this prediction using a series of three linear regressions with log-transformed deliberation times (log DT) as dependent variable 13 and controls for cognitive ability, individual differences in mechanical swiftness (Cappelen et al., 2013), and gender. The results of those regressions are presented in Table 2. The regressions show a significantly positive effect of higher-level choices on deliberation time. That is, in line with Prediction 2, deliberation time is increasing in the depth of reasoning.
This result remains robust when we control for cognitive ability (model 2), measured by the extended CRT, and when we add additional controls. 14 Further, cognitive ability in itself has no effect on deliberation times.
Recall that cognitive ability may decrease cognitive costs potentially leading to more steps of reasoning, while at the same time it may decrease per-step deliberation times. Thus within our model the overall effect of cognitive ability on deliberation time is indeterminate due to these two potentially countervailing effects.
Performance in the CRT was previously found to be correlated with level in the beauty contest (Brañas-Garza et al., 2012). Conducting an additional linear regression (not reported here) with level as dependent variable on CRT, we find a significant and positive coefficient for CRT (N = 126, β = 0.1483, p = 0.0017). This indicates that subjects with higher cognitive ability (as measured by their CRT score) tend to make higher-level guesses in the beauty contest game, which confirms previous results in the literature.
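The mapping from beauty contest guesses to levels used in this analysis can be written down directly. The sketch below encodes the intervals stated in the classification footnote (31-35, 20-24, 13-17, 8-12, and 7, with everything else as level 0); the names `LEVEL_RANGES` and `classify_guess` are hypothetical, and the exclusion of the two very fast choices of 0 is not modeled here.

```python
# Guess intervals per level, as stated in the classification footnote.
LEVEL_RANGES = {
    1: range(31, 36),
    2: range(20, 25),
    3: range(13, 18),
    4: range(8, 13),
    5: range(7, 8),
}


def classify_guess(guess):
    """Return the imputed level for an integer guess in 0..100."""
    for level, interval in LEVEL_RANGES.items():
        if guess in interval:
            return level
    return 0  # all remaining guesses are treated as level 0


print([classify_guess(g) for g in (50, 33, 22, 15, 10, 7)])
```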
13 Deliberation times usually feature a skewed distribution with rare extreme observations. We follow the standard approach in the literature and consider the logarithm of that variable instead.
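As an illustration of how the dependent variable and the swiftness control could be constructed, the sketch below applies the log transformation to raw deliberation times and normalizes the time spent on the demographic questions as 1 minus the ratio to the slowest subject, which is the definition given for the Swiftness control. The function names and the example numbers are made up for illustration.

```python
import math


def log_deliberation_times(times_sec):
    """Natural log of deliberation times (in seconds), which dampens
    the influence of rare, very long observations."""
    return [math.log(t) for t in times_sec]


def swiftness(swift_times):
    """Swiftness in [0, 1): 1 - T_i / max_j T_j, so the slowest
    subject gets 0 and faster subjects get values closer to 1."""
    slowest = max(swift_times)
    return [1 - t / slowest for t in swift_times]


print(log_deliberation_times([1.0, 9.9, 12.6]))
print(swiftness([10, 20, 40]))
```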
14 The control variables are defined as follows: CRTExtended (number of correct answers, 0-7), Swiftness (calculated as $1 - T_i^{\text{swift}} / \max_j T_j^{\text{swift}}$, where $T_i^{\text{swift}}$ is the time needed by subject $i$ to answer 3 demographic questions), and Female (dummy).

Notes: Standard errors in parentheses. * p < 0.1, ** p < 0.05, *** p < 0.01.

Choices in BASE closely resemble the behavioral patterns found in Arad and Rubinstein (2012) and Goeree et al. (2016), with most subjects selecting one of the three rightmost boxes corresponding to levels 0 to 3. Behavior in FLAT is similar to that in BASE, with most choices corresponding to not more than three steps of reasoning. Compared to BASE, however, there is a larger fraction of level-0 choices in FLAT, which is consistent with our assumptions, since the value of reasoning for that step is lower. In the EXTR variant behavior is comparable to that observed in Goeree et al. (2016), and vastly different from that observed in BASE and FLAT. A large fraction of subjects (between 38% and 62%) chose the rightmost box containing the salient amount of 20, but boxes 1 and 2 to its left were chosen very rarely compared to BASE and FLAT. Instead, between 25% and 33% of subjects chose one of the two leftmost boxes 8 and 9, which were almost never chosen in the other two variants. Interpreting behavior according to level-k reasoning, these choices correspond (implausibly) to eight or nine steps of reasoning. 15 As already pointed out by Goeree et al. (2016), it seems unlikely that these choices actually are the result of level-k reasoning. Using deliberation times as a measure for cognitive effort, however, will allow us to directly test this hypothesis.
We now turn to the analysis of deliberation times. For this purpose, we start with bird's-eye regressions on the full data set, that is, including all decisions on all variants of the 11-20 game. Table 3 shows GLS random-effects regressions with log DT as the dependent variable including as observations all 12 choices in BASE, FLAT, and EXTR.
In all models we control for mechanical swiftness, gender, and the position within the sequence of games (Period). 16 The regressions confirm that there is a significant and positive relation between deliberation times and depth of reasoning. That is, as predicted, choices associated with more steps of thinking require more deliberation. This relation is unaffected when we include the treatment dummies FLAT and EXTR (model 2), and when we control for cognitive ability as measured by the number of correct answers in the (extended) CRT (model 3). Also, the coefficient of the CRT is significant and positive, that is, subjects scoring higher on the CRT take longer to make their decisions.

15 When playing against the empirical distribution of choices, the best-performing strategies for BASE, FLAT, and EXTR would correspond to level 2, level 1, and level 1, respectively. Controlling for empirical payoffs does not affect our results.

16 Throughout the paper the standard variables for regressions are defined as follows: Level (0-9; a choice of box $k$ is classified as level $k$); FLAT and EXTR are treatment dummies; CRTExtended (0-7; number of correct answers); Swiftness ([0, 1]; calculated as $1 - T_i^{\text{swift}} / \max_j T_j^{\text{swift}}$, where $T_i^{\text{swift}}$ is the time needed by subject $i$ to answer 3 demographic questions); Female (dummy); Period (1-16; controls for position in the sequence of games).
Beyond this, we observe a significant positive coefficient of EXTR (which is robust to controlling for CRT performance), indicating that choices in EXTR generally required longer deliberation times. The average deliberation time in EXTR was 12.6 seconds, whereas the average deliberation time in BASE and FLAT was only 9.9 seconds. Pairwise Wilcoxon signed rank (WSR) tests directly confirm that the average deliberation time in EXTR was significantly higher compared to both BASE (N = 128, z = 4.678, p < 0.0001) and FLAT (N = 128, z = 4.375, p < 0.0001).
In the next step, we consider the three game variants BASE, FLAT, and EXTR separately. To this end, we run separate regressions considering only the four choices taken for each of the variants. Table 4 presents the results of these regressions, which are also illustrated in Figure 6. In those and all following regressions we include controls for cognitive ability, mechanical swiftness, gender, and the position within the sequence of games.
The results confirm our previous findings, showing a positive significant relation between deliberation times and higher-level choices in all three variants of the game.
This positive correlation can also be seen in Figure 6, where the solid regression lines have a positive slope for all three variants. There is no effect of cognitive ability on deliberation times in BASE, whereas we find a positive and significant effect in both FLAT and EXTR. In the next step, we investigate the robustness of the previous conclusion to specific  Table 5 and illustrated in Figure 6 (dashed lines).
After controlling for choices of the rightmost box, in BASE we still observe a clear positive relation between deliberation time and level. This is illustrated in Figure 6, where the slope of the regression line is still positive even when the level-0 choices are excluded (dashed line). In addition, the dummy itself is not significant. That is, our conclusions for BASE are robust to controlling for imputed level-0 choices.
In game variant FLAT, however, the picture is quite different. Choices of the rightmost box are significantly faster, and this difference actually explains most of the effect of level on deliberation times: level becomes insignificant when adding the dummy Right-most20. This can be seen graphically in Figure 6, where the regression line becomes almost flat when the level-0 choices are excluded (dashed line). In summary, we find generally longer deliberation times for higher-level choices, which is in line with Prediction 2, but this can only be seen as a full validation of level-k reasoning for game variant BASE. In this variant, decreasing amounts of sure payoff make the successive steps associated with level-k reasoning particularly salient, and indeed we observe the strongest link between deliberation times and level, which is

Effect of Incentives in the 11-20 Game
In this section we examine the effect of changes in the incentives, and thus the value of reasoning, on both choices and deliberation times in the 11-20 game. For this purpose, we make use of the fact that for each 11-20 game variant we also varied the payoff structure along two incentive dimensions, the size of the increment in sure payoff and the bonus that could be received. Conversely, large increments decrease the value of reasoning for all steps in BASE and for the first step in FLAT, hence should correspond to weakly lower levels in these variants. In EXTR, however, large increments sharply lower $v(1)$, but all other values $v(2), \dots, v(9)$ increase (slightly). Hence, the overall effect of large increments on level in EXTR is indeterminate. We do find lower average levels for large increments compared

17 We also ran additional regressions comparing the deliberation time of a given step with the deliberation time of the next step. These pairwise comparisons reveal that a further step requires additional deliberation time for steps 2 to 4, whereas the coefficient is insignificant for step 1.

Incentives and Choices
18 Recall that we focus only on the first nine steps, and for those the increase is exactly 20.

In summary, the changes in the average depth of reasoning resulting from our systematic changes in the payoff structure are in line with Prediction 1. To further examine this conclusion while controlling for individual differences, we turn to a regression analysis. Table 6 shows the results of three random-effects Tobit regressions with level as dependent variable, one for each game variant, using the size of the bonus and the size of the increment as regressors. In addition, we control for subjects' attitudes towards strategic uncertainty (Heinemann et al., 2009) and previous knowledge of game theory. 19 The regressions confirm that large increments led to fewer steps of reasoning in all game variants (significantly negative coefficients for the large increment dummies). Regarding bonus, in BASE there is a significant and positive effect of bonus, with more high-level choices for a high bonus, confirming again the observation above. Unsurprisingly, we find no effect of high bonus on level for EXTR. Contrary to the conclusion from the nonparametric test, in FLAT we also find no effect of high bonus on level. In this game variant, however, there is a high concentration of choices on levels 0 and 1 (over 60%), which may explain the absence of an effect of bonus on level. Hence, we ran an additional random-effects probit regression (not reported here) on a binary variable that takes the value 1 if level is greater than or equal to 1 and 0 otherwise. A positive effect of bonus on this binary variable would indicate that increasing the bonus leads to more choices corresponding to at least one step of reasoning. Indeed, we find a significant positive effect of bonus on this binary variable (N = 512, β = 0.4010, p = 0.0054).

Incentives and Deliberation Times
We now analyze the effect of a change in the incentives on deliberation times. Tables 7, 8, and 9 show the results of a series of random-effects GLS regressions of log DT on level for BASE, FLAT, and EXTR, respectively. The crucial variables are the dummies for the high bonus and large increment conditions, as well as the interactions of level with those. The regressions also control for cognitive ability, swiftness, gender, and period. Additionally, we also control for non-strategic choices by including a dummy for the rightmost box. The reason is that, as shown in Section 6, level-0 choices are significantly faster. Being non-strategic, these choices are unlikely to be affected by changes in incentives.
The results for bonus and increment size are also illustrated in Figure 7. Although the regressions examine the effects of bonus and increment simultaneously for each game type, for expositional clarity we discuss them separately in the following two subsections.

Effect of the bonus
Increasing the bonus has a twofold effect on deliberation times: First, it increases the potential gain from an additional step of reasoning by 20 and thus increases the value of reasoning for the first nine steps. Hence, according to Prediction 4 deliberation times per step should be shorter when the bonus is high. On the other hand, assuming that the cognitive cost is unaffected by a change in the bonus, subjects should conduct more steps of reasoning according to Prediction 1, which should increase overall deliberation time. As a consequence, the aggregate effect on deliberation times is indeterminate.
Controlling for the size of the bonus and the interaction of level with bonus allows us to dissect these two explanations.
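A toy calculation may help to see why the aggregate effect of the bonus on deliberation time is indeterminate. The sketch below is purely illustrative and rests on assumptions that are not taken from the paper: a player keeps reasoning as long as the value of the next step is at least its cost, the cost of step k is linear in k, and the time for a step is inversely proportional to its value. The values 19 and 39 correspond to the per-step value of reasoning in BASE with small increments under the low and high bonus, given the maximum-gain representation.

```python
def depth(values, cost):
    """Number of steps performed: continue while v(k) >= c(k)."""
    k = 0
    for step, v in enumerate(values, start=1):
        if v < cost(step):
            break
        k = step
    return k


def total_time(values, cost, scale=100.0):
    """Depth reached and total deliberation time, assuming the time
    for a step equals scale / v(k), i.e. decreasing in its value."""
    k = depth(values, cost)
    return k, sum(scale / v for v in values[:k])


low_bonus = [19] * 9   # value of each step with R = 20 (assumed BASE amounts)
high_bonus = [39] * 9  # value of each step with R = 40

for slope in (12, 9):  # two hypothetical linear cost schedules c(k) = slope * k
    cost = lambda k, c=slope: c * k
    print(slope, total_time(low_bonus, cost), total_time(high_bonus, cost))
```

With the steeper cost schedule, the high bonus raises depth from 1 to 3 steps and total time rises (roughly 5.3 versus 7.7 time units); with the flatter schedule, depth rises from 2 to 4 steps but total time falls slightly (roughly 10.5 versus 10.3). Which case applies depends on the unobserved cost and time functions, which is why the interaction terms in the regressions are needed.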
For the BASE variants (Table 7), we find shorter deliberation times when the bonus is high (model 1). This effect remains when we control for level (model 2), indicating that the direct effect (shorter deliberation times per step) dominates the indirect one (increased deliberation time through increased number of steps). To check whether

For the FLAT variants (Table 8), subjects overall deliberate longer in the high bonus condition (model 1). This effect remains when we control for level in model 2. Although this effect becomes non-significant when we additionally control for the interaction of level with bonus (model 3), we can see in the top-middle panel of Figure 7 that the line for high bonus is shifted upwards. The slopes of the two lines are very similar, and indeed, the interaction is not significant. Finally, for the EXTR variants (Table 9) we find no evidence that bonus has any systematic effect on deliberation times. This can also be seen from the top-right-hand panel in Figure 7, where both regression lines are flat, even slightly downward sloping.
Summarizing, we find that increasing the bonus decreases deliberation times in BASE (suggesting that level-k reasoning is salient), increases deliberation times in FLAT, and has no effect on deliberation times in EXTR (suggesting that level-k reasoning is not salient at all). The decrease in BASE is a result of shorter deliberation times per step, in line with Assumption 1, which explains why overall deliberation time decreases although observed levels are higher. We note that this result would be incompatible with any model where the deliberation time per step is independent of the value of reasoning of that step (as in Alaoui and Penta, 2016c).

Effect of the increment
The predicted effect of an increase in the increment depends on the specifics of the underlying payoff structure and hence differs across treatments. In BASE, large increments again have a twofold effect. First, the value of reasoning decreases by 1 for the first nine steps. Hence, according to Prediction 4 we would expect longer deliberation times per step for large increments. However, the decrease in incentives is very small compared to the one resulting from a change in the bonus, and hence this effect is likely to be small as well. On the other hand, because large increments imply a lower value of reasoning, subjects potentially conduct fewer steps of reasoning (again assuming that cognitive costs are unaffected), which in turn should decrease overall deliberation time. Hence, the overall effect is undetermined, and indeed the results for BASE (Table 7) show no effect on deliberation time.
In FLAT, only the value of reasoning for the first step is lower for large increments, while the value for the remaining steps is unaffected. Hence, we expect longer deliberation times for the first step. Again, a smaller value of reasoning for the first step might lead subjects to conduct fewer steps of reasoning, which in turn might decrease overall deliberation time. The results for this game variant (

The payoff structure in EXTR does not allow for a clear-cut prediction for the effect of large increments on deliberation times. The reason is that for large increments, the value of reasoning for the first step decreases sharply, but increases slightly for all further steps. As a consequence, we would expect longer deliberation times for the first step, and shorter deliberation times for all subsequent steps. It is unclear which of these countervailing effects should dominate. The results (Table 9) show significantly positive coefficients for large increments. However, as in the case of bonus we find no effect of level on deliberation times and thus, perhaps not surprisingly, there is also no interaction effect with increment. This can also be seen from the lower right-hand panel in Figure 7. The regression lines are flat, and the line for large increment is shifted upwards. This effect is in contrast to the negative effect of large increment on the depth of reasoning, but unlike for BASE and FLAT this cannot be explained by a change in deliberation times per step.
Summarizing, for the large increment condition we find overall longer deliberation times in FLAT and EXTR, but not in BASE. The increase in FLAT is a result of longer deliberation times per step, confirming Prediction 4. These results strongly support Assumption 1, which states that deliberation times per step are decreasing in the value of reasoning. That is, our model can explain why deliberation times in FLAT are increasing for large increments although observed choices correspond to fewer steps of reasoning.
Again, we want to stress that this effect would be incompatible with a model where the deliberation time per step is constant, since in this case fewer steps of reasoning can only decrease overall deliberation times, but never increase them.

Discussion
In this work, we have introduced a simple model linking depth of reasoning (as revealed by choices), incentives, and deliberation times. We model the total deliberation time of an observed choice as the sum of the deliberation times resulting from a sequence of steps of reasoning. As an immediate consequence we obtain the prediction that higher observed depth of reasoning implies longer deliberation times. The key assumption then builds on the well-established closeness-to-indifference effect, that is, steps of reasoning take longer if the value of reasoning for this step is small. We assume that deliberation time for a given step is a decreasing function of the value of reasoning of that step. This model provides empirically testable predictions regarding the relation of deliberation times, depth of reasoning as revealed by choices, and incentives.
We then test the predictions of our model using experimental data. In the beauty contest and the original version of the 11-20 money request game, choices attributed to more steps of reasoning lead to longer deliberation times. In this way, this work is the first to provide direct evidence on the link between heterogeneity in cognitive effort and behavioral heterogeneity (in the level-k sense). This link is strongest when the payoff structure of the underlying game is such that iterative thinking is salient.
However, for games without a salient iterative structure, there is no relation between deliberation times and alleged depth of reasoning as imputed from choices only. We conclude that, in these (presumably more typical) situations, cognitive depth should not be deduced exclusively from choices, and applying simple models of iterative thinking might be unwarranted. Our work hence also serves as a demonstration that deliberation times can serve as a tool to identify economic problems where features beyond the path of reasoning are crucial determinants of behavioral heterogeneity.
We also show that cognitive depth reacts to monetary incentives. Our model predicts that changes in the incentives that systematically vary the utility difference of a step of reasoning should be reflected in changes in cognitive depth. These effects are found in the data, with the caveat that the link between incentives, cognitive depth, and deliberation times is less than straightforward. In particular, the well-known effects of closeness to indifference imply that higher incentives to reason will be accompanied by shorter deliberation times for a given step of reasoning, resulting in the apparent paradox of higher incentives inducing more steps of reasoning which are implemented in a shorter total deliberation time.
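The apparent paradox can be illustrated with a small numerical sketch. Everything specific in it is an assumption made for illustration only: the per-step time 1/v, the rule that a step is taken as long as its value of reasoning exceeds a fixed cost, and the numerical values are hypothetical and are not estimates from our data.

```python
def step_time(value):
    """Hypothetical per-step deliberation time, decreasing in the value of reasoning."""
    return 1.0 / value


def deliberate(step_values, cost):
    """Take steps in order while the value of the next step exceeds the assumed cost.

    Returns (number of steps taken, total deliberation time).
    """
    steps, total = 0, 0.0
    for value in step_values:
        if value <= cost:
            break
        steps += 1
        total += step_time(value)
    return steps, total


base_values = [4.0, 2.0, 1.0, 0.5]  # illustrative values of reasoning per step
cost = 1.5                           # illustrative cost threshold (assumption)

low = deliberate(base_values, cost)                      # baseline incentives
high = deliberate([8.0 * v for v in base_values], cost)  # incentives scaled by 8

print(low)   # (2, 0.75)
print(high)  # (4, 0.46875)
```

In this example, higher incentives double the number of steps while the total deliberation time falls. The direction is not guaranteed: scaling the same values by 4 instead of 8 also yields four steps, but a total time of 0.9375, which is longer than the baseline.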
Our results also contribute to a related strand of literature that tries to better understand when iterative thinking describes actual decision processes and what cues trigger it. For example, Ivanov et al. (2009) show that level-k ceases to describe behavior well when the best-reply structure is complex and alternative plausible rules of thumb exist. Chong et al. (2016) show that incorporating a measure of saliency to derive level-0 behavior significantly improves model fit with respect to models where non-strategic agents randomize uniformly. Shapiro et al. (2014) show that the predictive power of the model can vary within a single game when different components of the payoff function are emphasized, with a better fit as the game becomes closer to a standard beauty contest and a worse fit as the pattern of levels of reasoning becomes less salient. This suggests that level-k reasoning is one of many possible decision processes players may employ, and which process ultimately determines the decision can depend on various features of the decision situation. Our results for the different variants of the 11-20 money request game confirm this view.
In conclusion, we provide the missing link between heterogeneity in observed economic choices and imputed differences in cognitive depth by relying on a direct measure of cognitive effort. At the same time, our research shows that this link might only be easily observable in situations where an iterative reasoning structure is salient enough.
We provide a tool to identify situations where it is warranted to account for heterogeneity in behavior through a direct application of iterative thinking models. This simple expansion of the economist's toolbox is a first step towards a more complete account of the determinants of behavioral heterogeneity.

Appendix A Other Level-0 Specifications in the 11-20 Game
Arad and Rubinstein (2012) argue that choosing 20 in the 11-20 game is a natural anchor for an iterative reasoning process. However, Hargreaves Heap et al. (2014) show that level-0 behavior might depend on the payoff structure of the game. This might be less problematic in our setting because a further appeal of the original 11-20 game, which essentially corresponds to BASE, is that it is fairly robust to the level-0 specification.
Specifically, choosing 19 in the original 11-20 game, or box 1 in BASE, is the level-1 strategy for a wide range of level-0 specifications. Still, this robustness depends on the particular payoff structure of the game and hence might be different across the various versions used in our experiment. In this appendix we explore the robustness of BASE, FLAT and EXTR to the level-0 specification.
Recall that box 0 always contains the salient amount of 20. We want to study the range of $\sigma_0$ such that choosing box 1 is still the unique level-1 strategy, that is, $BR(\sigma_0) = \{1\}$.
A necessary condition is that $p_0 > \underline{p}_0$, where $\underline{p}_0 = (20 - A_1)/R$, which is derived from the condition that the expected payoff of box 1 exceeds that of box 0, i.e. $A_1 + p_0 R > 20$. The condition $p_0 > \underline{p}_0$ is in general not sufficient. It is easy to show that, as long as box 1 contains the second-highest sure amount, that is, $A_1 \geq A_j$ for all $j \neq 0, 1$, and $p_0 > \underline{p}_0$, a sufficient condition is that no box $j \neq 0$ is assigned a probability larger than $p_0$. This holds in BASE as long as box 0 is most probable under $\sigma_0$ (note that this implies $p_0 > 10\%$, hence $p_0 > \underline{p}_0$). Hence, choosing box 1 is the unique level-1 strategy in BASE under fairly weak requirements, in particular even if $\sigma_0$ is assumed to be uniform randomization, as is usually assumed in games without a salient strategy (e.g. the beauty contest game). For FLAT, the sufficient condition holds if box 0 is most likely under $\sigma_0$ and $p_0 > \underline{p}_0$ (similarly to BASE, this latter condition is void for high bonus and small increment). This is a slightly stronger condition, because the lower bounds $\underline{p}_0$ are tighter. In particular, in the extreme case of uniform randomization the level-1 strategy remains to choose box 1 only for high bonus and small increments, while it prescribes to stay with box 0 for the other conditions. Overall, however, the requirement remains mild and amounts to assuming a small degree of salience for box 0.
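The best-reply computation behind these conditions can be sketched as follows. The sketch uses the payoffs of the original 11-20 game (sure amounts 20, 19, ..., 11 and a bonus of 20) purely as an illustration; the amounts and bonuses of the individual treatments differ. The function names are hypothetical, and boxes are indexed from right to left, so that box 0 is the rightmost box.

```python
from fractions import Fraction


def expected_payoffs(amounts, bonus, sigma0):
    """Expected payoff of each box against an opponent who plays sigma0.

    Box 0 is the rightmost box; box j earns the bonus when the opponent
    picks box j - 1, i.e. the box immediately to its right.
    """
    payoffs = [Fraction(amounts[0])]  # box 0 can never earn the bonus
    for j in range(1, len(amounts)):
        payoffs.append(amounts[j] + sigma0[j - 1] * bonus)
    return payoffs


def best_replies(amounts, bonus, sigma0):
    """Set of boxes that maximize the expected payoff against sigma0."""
    payoffs = expected_payoffs(amounts, bonus, sigma0)
    top = max(payoffs)
    return {j for j, value in enumerate(payoffs) if value == top}


amounts = list(range(20, 10, -1))  # 20, 19, ..., 11 (original 11-20 game)
bonus = 20
lower_bound = Fraction(20 - amounts[1], bonus)  # (20 - A_1) / R

uniform = [Fraction(1, 10)] * 10
print(lower_bound)                            # 1/20
print(best_replies(amounts, bonus, uniform))  # {1}
```

Under uniform randomization, box 1 yields 19 + 0.1 × 20 = 21, which exceeds the 20 obtained from box 0 and from box 2, so box 1 is the unique best reply in this illustration.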
In Section 6 we assumed that the starting point in the 11-20 game for our model of iterative thinking was to choose the rightmost box containing the salient amount of 20. As just illustrated, the best-reply structure in BASE and FLAT is robust for a wide range of alternative level-0 specifications. Thus, even if, contrary to our level-0 assumption, the starting point does not assign probability one to choosing the rightmost box, the best-reply structure and hence our results are unaffected as long as $p_0$ is not too small.
A different conclusion obtains for variant EXTR. The condition above is not sufficient for this variant because box 1 contains the lowest sure amount, hence the probability assigned to the rightmost box has to exceed the probability of any box $j$ by more than $(A_j - A_1)/R$. This condition together with $p_0 > \underline{p}_0$ is sufficient to make box 1 the unique best response in EXTR. This is a relatively demanding condition, as is the lower bound $p_0 > \underline{p}_0$ in this case. In particular, choosing the leftmost box that grants the second highest sure payoff is the level-1 strategy for a relatively wide range of specifications that include uniform randomization. Hence, the best-reply structure of EXTR is less robust to changes in the level-0 specification, and there is a clear alternative best-reply structure where the leftmost box is the level-1 strategy.
To check for robustness in the case of EXTR we consider an alternative best-reply structure by assuming that the level-0 specification is mixed and the best reply is to choose the leftmost box, which we then classify as the level-1 strategy. The best reply to that is to choose the rightmost box containing the salient amount of 20, now classified as level 2. From there the best-reply structure follows the familiar pattern from right to left. We repeated the complete analysis of EXTR in Sections 6 and 7 for this alternative classification, and found no qualitative difference with the previous analysis. 21 Hence, we conclude that our results, presented in the previous section, cannot be explained by differences in the robustness to the level-0 specification between treatments.

Appendix B A "Social Preference" Variant
The experiment included an additional treatment intended to test for an alternative explanation of the frequent "high-level" choices of the two leftmost boxes in EXTR, as previously observed by Goeree et al. (2016). By choosing the leftmost box in EXTR a subject could obtain the second highest sure amount, while at the same time granting her opponent the chance to receive the bonus. If a subject is motivated by some form of other-regarding preferences, choosing the leftmost box might be attractive because it grants somebody else the chance to get a bonus that is relatively large in comparison to the subject's own sacrifice in terms of sure payoff. We thus included a treatment, denoted SOCP, which was a variation of FLAT where the two rightmost boxes both contain the salient amount of 20. Table B.1 presents the value of reasoning for this variant for each step.

21 The alternative regressions are available upon request.
As a proxy for prosociality we measured the social value orientation (SVO) of each subject using a computerized version (Crosetto et al., 2012) of the scale developed by Murphy et al. (2011). We used a scaled version of their six primary items in which subjects were asked to choose among different allocations of points between themselves and a randomly selected partner. For the SVO task one of the six items was randomly selected and paid out using a ring matching procedure, that is, each subject received two payments of up to 25 points, one as a sender and one as a receiver. A higher SVO score indicates that a subject is more prosocial.
In SOCP, 36 out of 128 subjects chose the rightmost box at least once. However, we found no difference in SVO scores between subjects choosing the rightmost box at least once and those who never chose it (Mann-Whitney-Wilcoxon test, N = 128, z = −1.068, p = 0.2857), which speaks against the social-preference interpretation. Next, we consider the relative frequency of choosing the rightmost box across all four instances of SOCP per subject. We run a fractional logit regression for this relative frequency with the SVO score as an independent variable. The coefficient of SVO is positive but not significant. Summarizing, we find no evidence that the prosocial motive of granting the opponent the chance to obtain a bonus is a driver of behavior in the 11-20 game.

Carlos Alós-Ferrer and Johannes Buckenmaier
Online Appendix: Supplementary Material

Sequence of Games
To control for order effects we counterbalanced the order of the different 11-20 games using the following four randomized sequences. We denote the small-increment, low-bonus versions of BASE, FLAT, EXTR, and SOCP by B, F, E, and S, respectively.
Similarly, for X ∈ {B, F, E, S} we use the notation +X to indicate large increments, and X+ to indicate high bonus, e.g. +B+ denotes BASE with large increments and high bonus.
Pseudo-randomized sequences of the 11-20 games used in the experiment.
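As a reading aid for this notation, the following sketch decodes a label such as +B+ into its three components. The names `Game`, `TREATMENTS`, and `parse_code` are hypothetical and serve only as an illustration; they are not part of the experimental software.

```python
from typing import NamedTuple

TREATMENTS = {"B": "BASE", "F": "FLAT", "E": "EXTR", "S": "SOCP"}


class Game(NamedTuple):
    treatment: str
    large_increment: bool
    high_bonus: bool


def parse_code(code: str) -> Game:
    """Decode a label such as '+B+' into treatment, increment size, and bonus size."""
    large_increment = code.startswith("+")             # leading '+': large increments
    high_bonus = len(code) > 1 and code.endswith("+")  # trailing '+': high bonus
    letter = code.strip("+")
    if letter not in TREATMENTS:
        raise ValueError(f"unknown game code: {code!r}")
    return Game(TREATMENTS[letter], large_increment, high_bonus)


print(parse_code("+B+"))  # Game(treatment='BASE', large_increment=True, high_bonus=True)
print(parse_code("E"))    # Game(treatment='EXTR', large_increment=False, high_bonus=False)
```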

Translated Instructions
These are the instructions given to subjects during the experiment. Instructions for each part were presented separately on screen, at the beginning of each part. The original instructions were in German. Text in brackets [...] was not displayed to subjects.

General Instructions
Welcome to this economic experiment. Thank you for supporting our research.
Please note the following rules:
1. From now on until the end of the experiment, you are not allowed to communicate with each other.
2. If you have questions, please raise your hand and one of the instructors will answer your question individually.
3. Please refrain from using any features of the computer that are not part of the experiment.
The experiment consists of five parts and a questionnaire. The experiment involves a series of decisions which will affect your payoff at the end of the experiment. In this experiment you will earn points. At the end of the experiment the points you have earned in each part will be added up and the sum will be exchanged into Euros according to the following exchange rate: 10 points = 25 Eurocents.
Independently of your decisions, you will receive an additional 4 EUR for your participation in the experiment.

Part 1 [11-20 Game]
For each decision you will see 10 boxes in line on your screen. Each box contains a certain amount of points.
You have to choose one of the boxes.
Each participant will receive the amount in the box he/she selected. In addition, a participant may get a bonus if the selected box is exactly one to the left of the box that the other participant chooses.
The amount of points contained in each box may change from one round to another. Below you can see an example for such a decision. Note that the amount contained in each box as well as the size of the bonus in the experiment will differ from this example. The size of the bonus and the amount of points contained in each box will be displayed in the following way:

Part 2 [Beauty Contest]
In this part you and all other participants in this session will make one decision. You and all other participants each have to choose an integer between 0 and 100.
The participant who chose the number closest to the target number wins. All other participants do not win anything.
To determine the target number, the average of all chosen numbers will be computed and multiplied by 2/3 (in words: two thirds).
Target number = (2/3) * (Average of the numbers chosen by all participants)
The participant who chose the number closest to the target number wins and receives 500 points.
In case there is a tie between two or more participants (because all their numbers are equally close to the target number) the points are split equally among all winners.
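[The following sketch was not displayed to subjects. It illustrates the payoff rule of this part under the assumption that the prize is split exactly among tied winners; how fractional points were handled in the experiment is not specified here. The function name is hypothetical.]

```python
from fractions import Fraction


def beauty_contest_payoffs(choices, prize=500, factor=Fraction(2, 3)):
    """Return the points earned by each participant, in the order of `choices`."""
    target = factor * Fraction(sum(choices), len(choices))
    distances = [abs(c - target) for c in choices]
    best = min(distances)
    winners = {i for i, d in enumerate(distances) if d == best}
    share = Fraction(prize, len(winners))  # ties split the prize equally
    return [share if i in winners else Fraction(0) for i in range(len(choices))]


# Average 30, target 20: the two participants who chose 20 tie and split the prize.
print([float(p) for p in beauty_contest_payoffs([20, 20, 50])])  # [250.0, 250.0, 0.0]
```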

Part 3 [Cognitive Reflection Test]
In this part you are asked to answer a series of questions. For each question there is exactly one correct answer.
If you answer the questions correctly, you can earn additional points.
In total you have to answer 9 questions.
For each correct answer you will receive 5 points.

Part 4 [Social Value Orientation]
In this part you have to make a series of decisions about allocating points between you and another randomly selected participant.
Henceforth, we will refer to this randomly selected participant simply as the "other." In each of the following 6 decisions, you can choose how many points you would like to allocate to yourself and how many points you would like to allocate to the other.
Please select for each decision exactly one of the 9 available allocations.
All amounts are displayed in points.
Please take all decisions seriously, since each of the 6 decisions has the same probability of being selected.
You can receive additional points in case a decision of another participant is selected in which he/she has allocated points to you.

Part 5 [Strategic Uncertainty]
In this part you have to make 10 decisions for different decision situations. Each situation is independent of the others.
In each situation you can decide between A and B. The amount of points you will earn in this part depends on these decisions.
In this part, you and 3 other randomly selected participants will form a group.
There will be 10 decision situations displayed on your screen in a table. In each of the situations you can choose between option A and option B. At the end of this part, 1 out of the 10 situations will be chosen randomly. Your payment will be according to the situation picked and is determined as follows:
• If you choose option A, you will receive the sure payment given in the second column.
• If you choose option B, your payment will depend on how many members of your group (including yourself) chose B.
  - If 3 or more out of the 4 members of your group chose B, you will receive 50 points.
  - If 2 or fewer of the 4 members of your group chose B, you will receive 0 points.
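[The following sketch was not displayed to subjects. It restates the payment rule above for the randomly selected situation. The function name is hypothetical, and the sure payment of 20 points used in the example calls is a placeholder, since the actual values of the second column are not reproduced here.]

```python
def payment(own_choice, group_choices, sure_payment):
    """Points earned in the selected situation.

    group_choices lists the choices ('A' or 'B') of all 4 group members,
    including the participant's own choice.
    """
    if own_choice == "A":
        return sure_payment
    chose_b = sum(1 for choice in group_choices if choice == "B")
    return 50 if chose_b >= 3 else 0


print(payment("A", ["A", "B", "B", "B"], sure_payment=20))  # 20
print(payment("B", ["B", "B", "B", "A"], sure_payment=20))  # 50
print(payment("B", ["B", "B", "A", "A"], sure_payment=20))  # 0
```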