# Strategic experimentation with asymmetric players

- 267 Downloads
- 1 Citations

## Abstract

We examine a two-player game with two-armed exponential bandits à la (Keller et al. in Econometrica 73:39–68, 2005), where players operate different technologies for exploring the risky option. We characterise the set of Markov perfect equilibria and show that there always exists an equilibrium in which the player with the inferior technology uses a cut-off strategy. All Markov perfect equilibria imply the same *amount* of experimentation but differ with respect to the expected speed of the resolution of uncertainty. If and only if the degree of asymmetry between the players is high enough, there exists a Markov perfect equilibrium in which both players use cut-off strategies. Whenever this equilibrium exists, it welfare dominates all other equilibria. This contrasts with the case of symmetric players, where there never exists a Markov perfect equilibrium in cut-off strategies.

## Keywords

Two-armed bandit Heterogeneous agents Free riding Learning## JEL Classification

C73 D83 O31## 1 Introduction

In many instances, the information produced by one agent is interesting to other agents as well. Think, for example, of firms exploring neighbouring oil patches: if one firm strikes oil, chances are there will be oil in its neighbour’s patch as well. Such games of purely informational externalities have been analysed by the strategic bandit literature,^{1} which so far has only analysed the case of homogeneous agents. However, in many instances, one of the oil firms, for example, might be a big multinational firm that has access to a superior drilling technology. In this article, we aim to analyse the impact of asymmetries in players’ exploration technologies in a game of strategic experimentation with two-armed exponential bandits.

The seminal paper by Keller et al. (2005) analyses this problem with homogeneous players. In the current paper, we generalise the analysis by introducing asymmetric players, in the sense that their pay-off arrival rates from a good risky arm differ. This implies that, given the risky arm is good, the expected time needed to learn this differs between the players. As actions and outcomes are perfectly publicly observable, and players start out with a common prior, they will always have a common posterior belief. We characterise the set of Markov perfect equilibria with the players’ common posterior belief as the state variable for all ranges of asymmetry between the players. If the degree of asymmetry between the players is sufficiently high, there exists an *equilibrium in cut-off strategies*, i.e. where both players use a cut-off strategy. That is, either player uses the risky arm if and only if the likelihood he attributes to the option being good is greater than a certain threshold. This equilibrium is unique in the class of equilibria in cut-off strategies. Whenever only one of the players experiments and the other free rides in this equilibrium, it is always the player with the weaker technology who free rides. In the case of homogeneous players (Keller et al. 2005), by contrast, there never exists an equilibrium in cut-off strategies, and players swap the roles of pioneer and free rider at least once in any equilibrium. In our setting, aggregate pay-offs in the equilibrium in cut-off strategies are higher than in any other equilibrium. If the degree of asymmetry is low, at least one player uses a non-cut-off strategy in any equilibrium. In contrast to the homogeneous case (Keller et al. 2005), we furthermore show that more frequent switches of arms do not unambiguously improve the equilibrium welfare with asymmetric players.

### 1.1 Related literature

This paper contributes to the literature on strategic experimentation with bandits, a problem studied quite widely in economics, amongst others, by Bolton and Harris (1999), Keller et al. (2005), Keller and Rady (2010), Klein and Rady (2011), Klein (2013) and Thomas (2017). In all of these papers, players are homogeneous. Except in Thomas (2017) and Klein and Rady (2011), players’ bandits are of the same type and * free riding* is a common feature in all the above models except for Thomas (2017). Many variants of this problem have been studied in the literature^{2}. Rosenberg et al. (2013) and Murto and Välimäki (2011), for instance, assume that switches to the safe arm are irreversible and that experimentation outcomes are private information, while Bonatti and Hörner (2011) and Heidhues et al. (2015) investigate the case of private actions. In Dong (2018), actions and outcomes are public, but one of the players receives an initial private signal. Rosenberg et al. (2007) analyse the role of the observability of outcomes and the correlation between risky-arm types in a setting in which a switch to the safe arm is irreversible. Besanko and Wu (2013) use the Keller et al. (2005) framework to study how an R&D race is impacted by market structure. Das (2019) analyses an R&D race in a strategic bandit setting in which on the risky arm, players can learn both privately and publicly. Guo (2016) and Zambrano (2017) analyse the problem of a principal delegating the operation of a two-armed bandit to an agent; in Klein (2016), the bandit the agent operates has three arms. Banks et al. (1997) provide an experimental test of a single-agent two-armed bandit problem; Hoelzemann and Klein (2018) do so in a strategic setting. The paper closest to the present paper is Keller et al. (2005), who find that, with homogeneous players, there is never an equilibrium in cut-off strategies. By contrast, we show that, with heterogeneous players, an equilibrium in cut-off strategies may exist and that it is welfare maximising whenever it exists.

The rest of the paper is organised as follows. Section 2 sets out the model. Section 3 discusses the social planner’s solution. A detailed analysis of equilibria for different ranges of heterogeneity is undertaken in Sect. 4. Finally, Sect. 5 concludes. Payoff functions are shown in “Appendix A”, while some proofs are relegated to “Appendix B”.

## 2 Two-armed bandit model with heterogeneous players

There are two players (1 and 2), each of whom faces a two-armed bandit in continuous time. One of the arms is safe, in that a player who uses it gets a flow pay-off of \(s>0\). The risky arm can be either good or bad. Both players’ risky arms are of the same type. If the risky arm is good, then a player using it receives a lump sum, drawn from a time-invariant distribution with mean \(h>s\), at the jumping times of a Poisson process. The Poisson process governing player 1’s arrivals has intensity \(\lambda _1 = 1\), while player 2’s arrive according to a Poisson process with intensity \(\lambda _2\in (\frac{s}{h},1)\). Thus, a good risky arm gives player 1 (2) an expected pay-off flow of \(g_1=\lambda _1h = h\) (\(g_2=\lambda _2h\)), with \(g_1>g_2>s\). The parameters and the game are common knowledge.

The uncertainty in this model arises from the fact that players do not initially know whether their risky arms are good or bad. Players start with a common prior belief \(p_{0}\in (0,1)\) that their risky arms are good. Players have to decide in continuous time whether to choose the safe arm or the risky arm. At each instant, players can choose only one arm. We write \(k_{i,t}=1\) (\(k_{i,t}=0\)) if player \(i\in \{1,2\}\) uses his risky (safe) arm at instant \(t\ge 0\). Players’ actions and outcomes are publicly observable, and based on these, they update their beliefs. Players discount the future according to the common discount rate \(r>0\).

*i*’s (\(i\in \{1,2\}\)) actions \(\{k_{i,t}\}_{t\ge 0}\), which are required to be progressively measurable with respect to the available information and to satisfy \(k_i(t)\in \{0,1\}\) for all \(t\ge 0\), player

*i*’s expected pay-off is given by

As mentioned in Introduction, we will focus our analysis on Markov perfect equilibria with the players’ common posterior belief as the state variable. Formally, a Markov strategy of player *i* is any left-continuous function \(k_i : [0,1] \rightarrow \{0,1\}, p\mapsto k_i(p)\) (\(i = 1,2\)) that is also piecewise continuous, i.e. continuous at all but a finite number of points.

*t*:

## 3 Planner’s problem

*v*(

*p*) for the planner’s value function and, like Keller et al. (2005), define the myopic opportunity cost of having player

*i*play risky, \(c_i(p) = s - pg_i\), and the corresponding learning benefit

*i*,

*t*) is without loss in the planner’s problem. To state the following proposition, which describes the planner’s solution, we define \(g = g_1+g_2\), \(\lambda = 1+\lambda _2\), \(\mu = \frac{r}{\lambda }\), \(u_1(p): = (1-p)\left( \frac{1-p}{p}\right) ^r\), \(u_0(p): =(1-p)\left( \frac{1-p}{p}\right) ^{\mu }\).

### Proposition 1

### Proof

Proof is by a standard verification argument. Please see “Appendix B.1” for details. \(\square \)

By the above proposition, the belief at which player 1 switches to the safe arm in the planner’s solution is higher than it would be if both players’ Poisson arrival rates were equal to \(\lambda _1=1\). This is because, as player 2’s arrival rate \(\lambda _2\) decreases, the benefit from player 1’s experimentation decreases.

^{3}The planner’s value function is a smooth convex curve which lies in the range [2

*s*,

*g*]. At the belief \(p_2^{*}\)(\(p_1^{*}\)) , player 2 (1) switches to the safe arm.

## 4 Non-cooperative game

We will first analyse a player’s best responses to a given Markov strategy of the other player.

### 4.1 Best responses

*j*’s strategy \(k_j\) (\(j\in \{1,2\}\setminus \{i\}\)). If the pay-off function from player

*i*’s response satisfies the following Bellman equation, player

*i*is playing a best response:

^{4}

*i*due to his own experimentation, while \(\lambda _j b_i(p,v_i)\) is the learning benefit accruing to player

*i*from player

*j*’s experimentation. The myopic opportunity cost of experimentation continues to be \(c_i(p)=s-g_i p\).

*i*’s pay-off function satisfies the Bellman equation if and only if

*i*chooses the risky arm, safe arm or is indifferent between them depending on whether his value in the (

*p*,

*v*) plane lies above, below or on the line

*i*is given by

### Proposition 2

In any MPE, both players play safe at all beliefs in \([0,\bar{p}_1]\). There is thus no efficient MPE.

### Proof

Suppose to the contrary that \(p_l\), the infimum of the set of beliefs at which at least one player plays risky satisfies \(p_l < \bar{p}_1\). Clearly, \(v_i(p_l)=s\) for both \(i\in \{1,2\}\). We shall now distinguish two cases depending on whether or not there exists an \(\bar{\epsilon }>0\) such that, in any \(\epsilon \)-right neighbourhood of \(p_l\) with \(\epsilon \in (0,\bar{\epsilon })\), only one player *i* plays risky. If there does not exist such an \(\bar{\epsilon }>0\), *i* is not playing a best response, because \(p_l<\bar{p}_i<\frac{s}{g_i}\) implies that the point \((p_l,s)\) is below the diagonal \(D_i\). In the other case, player *i* faces the same trade-off as a single agent and does not play a best response either, because \(p_l<\bar{p}_i\). \(\square \)

In the next subsection, we will characterise the condition under which an equilibrium in cut-off strategies exists.

### 4.2 Equilibrium in cut-off strategies

As we have argued in the proof of Proposition 2, there is no experimentation below the belief \(\bar{p}_1\) in any equilibrium. We will now argue that, in any equilibrium, only player 1 will experiment in some right neighbourhood of \(\bar{p}_1\), implying that player 1 is the last player to experiment in any equilibrium.

By Proposition 2, we know that \(v_1(\bar{p}_1)=v_2(\bar{p}_1)=s\), and thus, by continuity, both players’ value functions must be below their respective diagonals \(D_i\) in some neighbourhood of \(\bar{p}_1\). Thus, in any equilibrium, at most one player can play risky in some right neighbourhood of \(\bar{p}_1\). Now, suppose that player 2 is the only player to experiment in some right neighbourhood of \(\bar{p}_1\). Then, the relevant ODE (Equation 13 in “Appendix A.2”) gives us that \(\lambda _2\bar{p}_1(1-\bar{p}_1)v_2'(\bar{p}_1+)=\bar{p}_1\lambda _2(g_2-s)-rc_2(\bar{p}_1)<0\), as \(\bar{p}_1<\bar{p}_2\). Thus, player 2’s value function drops below *s* immediately to the right of \(\bar{p}_1\), which contradicts his playing a best response. We can thus conclude that there exists some belief \(\hat{p}_1>\bar{p}_1\) such that, on \((\bar{p}_1,\hat{p}_1)\), player 2 plays safe. As either player can always guarantee himself his single-agent pay-off by ignoring the information he gets for free from the other player, his pay-off in any equilibrium is bounded below by his single-agent pay-off. Thus, in any equilibrium, \(v_1>s\) on \((\bar{p}_1,\hat{p}_1]\), and player 1 experiments, while player 2 free rides, in this range.

Since \(\bar{C}_1 > 0\) and \(\bar{C}_2 < 0\), \(\bar{v}_1\) is strictly convex and \(\bar{v}_2\) is strictly concave.^{5} The following lemma shows that the functions \(\bar{v}_i\) intersect the corresponding diagonals \(D_i\) at a unique belief.

### Lemma 1

There exists a unique \(p_1^{'} \in (\bar{p}_1, 1)\) such that \(\bar{v}_1(p_1^{'}) = D_1(p_1^{'})\), and a unique \(p_2^{'} \in \left( \bar{p}_2,\frac{s}{g_2}\right) \) such that \(\bar{v}_2(p_2^{'}) = D_2(p_2^{'})\).

### Proof

Please refer to “Appendix B.2”. \(\square \)

In the following proposition, we will show that there exists an equilibrium in cut-off strategies if and only if the degree of asymmetry between the players is high enough.

### Proposition 3

There exists a \(\lambda _2^{*} \in (\frac{s}{h}, 1)\) such that there exists an equilibrium in cut-off strategies if and only if \(\lambda _2 \in (\frac{s}{h}, \lambda _2^{*}]\). In this equilibrium, player 1 plays risky on \((\bar{p}_1,1]\) and safe otherwise, while Player 2 plays risky on \((p_2^{'},1]\) and safe otherwise.

### Proof

Please refer to “Appendix B.3”. \(\square \)

^{6}. In this equilibrium, both players’ pay-offs are equal to

*s*for \(p\le \bar{p}_1\). For \(p > \bar{p}_1\), the black curve represents \(v_1\) and the red curve represents \(v_2\). For \(p\in (\bar{p}_1, p_2^{'}]\), player

*i*’s (\(i=1,2\)) pay-off is \(\bar{v_i}(p)\). For \(p>p_2^{'}\), player

*i*’s pay-off is given by

^{7}Player 1’s equilibrium pay-off function is (strictly) convex (on \((\bar{p}_1,1)\)); it is smooth, except for a kink at \(p_2^{'}\). (For the particular parameter values used in Fig. 2, we have \(v_1^{'}(p_2^{'+}) = 1.477\) and \(v_1^{'}(p_2^{'-}) = 1.21\)). To depict this kink in the figure, we have magnified the area around \(p = p_2^{'}\). In the magnified part, the orange curve represents \(\bar{v_1}\) for \(p>p_2^{'}\). Player 2’s pay-off function is strictly concave on \((\bar{p}_1,p_2^{'})\) and strictly convex on \((p_2^{'},1)\); it has an inflection point at \(p_2^{'}\). It is smooth except for a kink at \(\bar{p}_1\).

Experimentation decisions are strategic substitutes. Therefore in any equilibrium, at the lowest belief where some experimentation takes place, one *pioneer* is indifferent between choosing the safe and the risky arm, given that the other player is free riding. The *free rider* can determine a threshold belief \(p_2^{'}\) where he is indifferent between choosing the safe arm and the risky arm, given that the pioneer is choosing the risky arm for all beliefs between the lowest cut-off and \(p_2^{'}\). This implies that for beliefs just above \(p_2^{'}\), the free rider finds it beneficial to experiment irrespectively of the action of the pioneer. When players are homogeneous, their free riding opportunities are the same. At \(p_2^{'}\), the pioneer’s pay-off is less than that of the free rider as experimentation is costly. Thus, for beliefs just above \(p_2^{'}\), the pioneer has an incentive to free ride, given that the free rider experiments. This explains [as shown in Keller et al. (2005)] why there does not exist an equilibrium where both players use cut-off strategies. However, the free riding opportunities are different for heterogeneous players. As explained above, in any equilibrium the pioneer is always the player with the higher productivity (player 1). The lower player 2’s productivity, the less player 1 has an incentive to free ride on player 2’s experimentation. If player 2’s productivity is very low, player 1 no longer has any incentive to free ride on 2’s experimentation for beliefs right above \(p_2^{'}\). This intuitively explains the result of Proposition 3.

Geometrically, the diagonals \(D_1\) and \(D_2\) in Fig. 2 do not coincide when players are asymmetric. As the proof of Proposition 3 shows, the condition for existence of an equilibrium in cut-off strategies is precisely that player 2 will enter the region in which risky is dominant at a more optimistic belief than player 1.^{8} This is possible if and only if the region in which risky is dominant for player 2 is relatively small enough compared to that of player 1, i.e. if and only if \(\lambda _2\) is small enough compared to \(\lambda _1=1\).

In Sect. 4.5, we show that if the players’ learning speeds are different while the expected flow pay-off from the good risky arm is the same, there again exists an equilibrium in cut-off strategies if and only if the difference in the learning speeds is high enough. The same qualitative result obtains for identical learning speeds but different expected pay-offs from the good risky arm. Indeed, either form of asymmetry creates differences in the players’ free riding incentives. Diagrammatically, this can be seen by a gap between the best response diagonals.

### 4.3 Equilibria in non-cut-off strategies

In the previous subsection, we have identified a necessary and sufficient condition for the existence of an equilibrium in cut-off strategies. In this subsection, we will analyse equilibria where at least one of the players uses a non-cut-off strategy. To begin with, we show that even for low degrees of asymmetry, there exists an equilibrium where player 2 uses a cut-off strategy.

### Proposition 4

There exists an equilibrium in which only player 2 uses a cut-off strategy if and only if \(\lambda _2>\lambda _2^*\). In this equilibrium, the cut-off for player 2’s strategy is \(p_2^{'}\). Player 1 plays risky on \((\bar{p}_1, p_2^{'}] \cup (p_s^1, 1]\) and safe otherwise, where \(p_s^1 > p_2^{'}\) is the belief at which player 1’s pay-off function and \(D_1\) intersect.

### Proof

Please refer to “Appendix B.5”. \(\square \)

The equilibrium where only player 2 uses a cut-off strategy is depicted in Fig. 3^{9}. The black and the orange curves depict the pay-offs to player 1 and 2, respectively. As the degree of asymmetry between the players is low, \(p_1^{'} > p_2^{'}\), and hence, an equilibrium where both players use cut-off strategies does not exist. In Fig. 3, we magnify the part around \(p = p_2^{'}\). We do not show \(p_1^{'}\) in the figure, but for the parameter values used in Fig. 3, we have \(p_s^1 = 0.4740 < 0.4754 = p_1^{'}\). At \(p = p_2^{'}\), both \(v_1\) and \(v_2\) have a kink.^{10} To the immediate right of \(p_2^{'}\), \(v_1\) becomes concave and \(v_2\) becomes convex. \(v_2\) remains convex for all \(p>p_2^{'}\), but has a kink^{11} at \(p = p_s^1\). \(v_1\) has an inflection point at \(p = p_s^1\) and smoothly becomes convex at this belief.

When the degree of asymmetry is low, there will exist a range of beliefs just above \(p_2^{'}\) where player 1 free rides. Thus, player 1 uses a non-cut-off strategy. This explains the result of Proposition 4. In the limit \(\lambda _2\downarrow \lambda _2^*\), the range above \(p_2^{'}\) where player 1 free rides vanishes, and hence, the equilibrium described in Proposition 4 coincides with the equilibrium in cut-off strategies.

Equilibria where at least one player uses a non-cut-off strategy always exist, as the following proposition shows. The following proposition, together with Proposition 3, fully characterises the set of all Markov perfect equilibria. To state the proposition, we let \(v_i\) be player *i*’s equilibrium pay-off. For both players \(n\in \{1,2\}\), we define \(p_S^n\) as the (unique) point of intersection of \(v_n\) and \(D_n\).^{12} Let \(p_S^i=\min \{p_S^1,p_S^2\}\) and \(p_S^j = \max \{p_S^1,p_S^2\}\).

### Proposition 5

For any \(\lambda _2\in (\frac{s}{h},1)\), there exists a continuum of Markov perfect equilibria in which at least one player uses a non-cut-off strategy. For each integer \(l>1\) and each sequence of threshold beliefs \((\tilde{p}_i)_{i=1}^l\) such that \(\bar{p}_1<\tilde{p}_1<\cdots <\tilde{p}_l=p_S^i\), there exists an equilibrium such that both players play safe at all beliefs \(p\le \bar{p}_1\); player 1 plays risky and player 2 plays safe in \((\bar{p}_1,\tilde{p}_1]\cup \bigcup _{i\in 2\mathbb {N}\wedge i< l}(\tilde{p}_i,\tilde{p}_{i+1}]\) , while player 1 plays safe and player 2 plays risky in \(\bigcup _{i\in 2\mathbb {N}\wedge i\le l}(\tilde{p}_{i-1},\tilde{p}_{i}]\); on \((p_S^i,p_S^j]\), player *i* plays risky and player *j* plays safe, while both players play risky on \((p_S^j,1]\). The same strategies with \(l=1\) also describe an equilibrium in which only player 2 uses a cut-off strategy if and only if \(p_2^{'}=p_S^2<p_S^1=\hat{p_S^1}\).

On \([0,\bar{p}_1]\), both players’ value function is *s*. For even \(i<l\), on \((\tilde{p}_i,\tilde{p}_{i+1}]\), player 1’s (2’s) value function is given by (14), (16), while on \((\tilde{p}_{i-1},\tilde{p}_{i}]\), player 2’s (1’s) value function is given by (14), (16); on \((p_S^i,p_S^j]\), player *i*’s (*j*’s) pay-off is given by (14), (16). On \((p_S^j,1]\), both players’ pay-offs are given by (12). The constants of integration are determined by value matching.

### Proof

That the proposed strategies are mutually best responses immediately follows from our discussion at the top of Sect. 4. That such equilibria always exist follows immediately from the continuity of players’ pay-off functions and the fact that \(D_i(\bar{p}_1)>s\) for both \(i\in \{1,2\}\). \(\square \)

When the degree of asymmetry is low, it is easy to observe that both players have incentives for free riding just below \(p_2^{'}\); i.e. safe and risky are mutually best responses in this region. Although an increase in the degree of asymmetry reduces the free riding incentives for player 1, they never vanish completely. Therefore, there will always be a range just above \(\bar{p}_1\) where safe and risky are mutually best responses. Hence, equilibrium allows players to take turns in experimenting at arbitrary beliefs in \(( \bar{p}_1, p_2^{'})\). This explains the result of Proposition (5).

As \(\bar{p}_1<\bar{p}_2\), the proposition implies that there exist equilibria in which player 2 experiments below his single-agent threshold \(\bar{p}_2\). Indeed, by being the last player to experiment on \((\bar{p}_1,\tilde{p}_1]\), player 1 provides an *encouragement effect* to player 2, as the latter is willing to play risky on \((\tilde{p}_1,\tilde{p}_2]\) only because he knows that, should his experimentation not be successful, he will get to free ride on player 1’s experimentation once the belief will have dropped to \(\tilde{p}_1\).

### 4.4 Welfare rankings of equilibria

As in Keller et al. (2005), there are two potential sources of inefficiency in our model: players might not produce enough information, and/or they might produce the information too slowly. In order to analyse these different effects, we define the *experimentation intensity* at time \(t \ge 0\) as \(K_t = \lambda _1 k_{1,t}+ \lambda _2 k_{2,t}\), and the integral \(\int _0^T K_t \, \mathrm{d}t\) as the *amount of experimentation* up to time *T*. Keller et al. (2005), by contrast, define the *experimentation intensity* at time \(t \ge 0\) as \(\hat{K}_t = k_{1,t}+ k_{2,t}\), and the *amount of experimentation* up to time *T* as \(\int _0^T \hat{K}_t \, \mathrm{d}t\). Thus, we measure the *output* of players’ experimentation efforts, with our measure taking into account that it matters for the information-production process which player invests time in the risky arm. The corresponding concepts in Keller et al. (2005), by contrast, measure the *input*, i.e. the overall resources spent on producing information. In the case of homogeneous players with productivities \(\lambda \), the input, as indicated by their measure, of course corresponds to \(1/\lambda \) times the output, as indicated by our measure. The following result mirrors the finding in Keller et al. (2005) (see their Lemma 3.1 in conjunction with their Propositions 5.1 and 6.1) that the amount of experimentation is the same in any Markov perfect equilibrium. This implies that the welfare ranking of equilibria is solely determined by the delay in information production.

### Lemma 2

Suppose there is no success on the risky arm. Then, the amount of experimentation is the same in any Markov perfect equilibrium.

### Proof

In the following proposition, we establish that in any equilibrium in which players swap the roles of pioneer and free rider at least once, player 1’s (2’s) pay-off will hit \(D_1\) (\(D_2\)) at a more pessimistic (optimistic) belief than in the equilibrium in cut-off strategies.

### Proposition 6

Consider any equilibrium described in Proposition 5. Suppose \(p_S^1 >\bar{p}_1\) is the belief at which the equilibrium pay-off of player 1 meets the line \(D_1\) and \(p_S^2 > \bar{p}_1\) is the belief at which the equilibrium pay-off of player 2 meets the line \(D_2\). Then, we have \(p_S^1 < p_1^{'}\). For \(l >1\) we have \(p_S^2 > p_2^{'}\) and for \(l=1\), \(p_S^2 = p_2^{'}\).

### Proof

Please refer to “Appendix B.6”. \(\square \)

In the equilibrium in cut-off strategies, player 2 free rides for all beliefs in \((\bar{p}_1, p_2^{'}]\). However, in any other equilibrium there exists some open subset of \( (\bar{p}_1, p_2^{'})\) where he experiments and player 1 free rides. Thus, for all \(p \in (\bar{p}_1, p_2^{'}]\), the equilibrium in cut-off strategies gives the highest pay-off to player 2, as he can free ride on the more productive player’s experimentation. This implies that, in the range \(p \in (\bar{p}_1, p_2^{'}]\), player 2’s pay-off function in any non-cut-off equilibrium lies below his pay-off in the cut-off equilibrium and will therefore intersect the diagonal \(D_2\) at a belief higher than \(p_2^{'}\). This explains why we have \(p_S^2 > p_2^{'}\). On the other hand, for all \(p \in (\bar{p}_1, p_1^{'}]\), player 1 experiments in the equilibrium in cut-off strategies and receives his single-agent pay-off. In any other equilibrium, however, there exists some open subset of \( (\bar{p}_1, p_1^{'})\) where his single-agent optimal action is not a best response, and his equilibrium pay-off is therefore higher. Thus, as player 1’s pay-off is lowest in the equilibrium in cut-off strategies, we have \(p_S^1 < p_1^{'}\).

Suppose \(\lambda _2 \in (\frac{s}{h},\lambda _2^{*}]\). This implies that the equilibrium in cut-off strategies exists. In the following proposition, we show that the equilibrium in cut-off strategies strictly welfare dominates all other equilibria.

### Proposition 7

Suppose \(\lambda _2\le \lambda _2^*\) and let \(v_\mathrm{agg}^\mathrm{c}\) be the aggregate equilibrium pay-off in the equilibrium in cut-off strategies and \(v_\mathrm{agg}^\mathrm{nc}\) be the aggregate equilibrium pay-off in an arbitrary equilibrium in non-cut-off strategies. Then, \(v_\mathrm{agg}^\mathrm{c} \ge v_\mathrm{agg}^\mathrm{nc}\), with the inequality strict on \((\tilde{p}_1,1)\).

### Proof

Please refer to “Appendix (B.7)”. \(\square \)

First, observe that in the equilibrium in cut-off strategies, both players experiment for beliefs greater than \(p_2^{'}\). Since \(p_S^2 > p_2^{'}\) (by Proposition 6), the range of beliefs where both players experiment is largest in the equilibrium in cut-off strategies. Next, in the equilibrium in cut-off strategies, whenever only one player experiments, it is the player with the higher pay-off arrival rate, player 1. In any other equilibrium, however, there is a range of beliefs where player 2 plays the role of the lonely pioneer. Since player 1 is more productive, in any equilibrium all experimentation ceases at \(\bar{p}_1\), information is most efficiently generated in the equilibrium in cut-off strategies. This intuitively explains the result of Proposition 7. One can observe that, since, at any belief, the intensity of experimentation is highest in the equilibrium in cut-off strategies, information generation is fastest. Thus, this equilibrium involves least *delay*. As experimentation amounts are the same in all equilibria (Lemma 2), this implies that the cut-off equilibrium welfare dominates all other equilibria.^{13}

^{14}Figure 4a, b depicts the actions of players in the equilibrium in cut-off strategies and the equilibrium in non-cut-off strategies, respectively. These equilibria correspond to the ones depicted in Fig. 4.

The thick purple^{15}curve (\(v_1\)) and the black curve (\(v_2\)) in Fig. 4 depict the pay-offs to player 1 and 2, respectively, in the equilibrium in cut-off strategies. In the equilibrium in non-cut-off strategies, pay-offs coincide for beliefs less than or equal to \(\tilde{p_1}\). At \(\tilde{p_1}\), players switch arms. The thin blue curve depicts the pay-off to player 1, and the thin yellow curve depicts the pay-off to player 2 in the equilibrium in non-cut-off strategies for \(p>\tilde{p_1}\). As argued, the blue curve meets the line \(D_1\) at a belief \(p_S^1\), which is strictly less than \(p_1^{'}\). In the region \((\tilde{p_1},p_S^1]\), player 2 experiments and player 1 free rides. At \(p_s^1\), player 1 switches to the risky arm and player 2 switches to the safe arm. When the red curve meets the line \(D_2\) at \(p_s^2>p_2^{'}\), player 2 switches to the risky arm again. Notice that in the equilibrium in non-cut-off strategies, player 2’s pay-off is negatively sloped at the right neighbourhood of \(p = \tilde{p_1}\). Indeed, in the current example, we have \(\tilde{p_1} = 0.39 < 0.4054 = \bar{p}_2\), where \(\bar{p}_2\) is the single person threshold for player 2. This means that, in the equilibrium in non-cut-off strategies, player 2 is forced to act as the lonely pioneer to the left of his single-agent cut-off, which makes his pay-off negatively sloped.^{16}

When \(\lambda _2 > \lambda _2^{*}\), the equilibrium in cut-off strategies does not exist. However, the argument in the proof of Proposition 7 allows us to show that, on \((\bar{p}_1,p_2^{'}]\), the equilibrium of Proposition 4, which is the only equilibrium in which player 1 is experimenting throughout this range, strictly welfare dominates all other equilibria. Indeed, with heterogeneous players, more frequent switches have the effect of replacing experimentation by the strong player with experimentation by the weak player in some open subset in \((\bar{p}_1,p_2^{'})\), thereby delaying information production in this range. Thus, even though more frequent switches can expand the range of beliefs where both experiment, there is always a welfare loss in the range \((\bar{p}_1, p_2^{'}]\). Hence, if players switch the role of pioneer and free rider more frequently, the equilibrium welfare is not unambiguously improved. This is in contrast to the case with homogeneous players (Keller et al. 2005), where the only effect of increasing the frequency of switches is to expand the range of beliefs where both players experiment, thus unambiguously speeding up information production and improving equilibrium welfare. Yet, we have not been able to establish that the equilibrium of Proposition 4 is *globally* welfare maximising.

### 4.5 Learning rates versus pay-offs

In our baseline model, we have considered asymmetric Poisson arrival rates only. However, since the expected lump-sum pay-off from the good risky arm was the same for both players, the asymmetry in learning rates implied that the expected flow pay-off from a good risky arm was also different across the players. In this subsection, we will analyse a model where learning rates differ, but the expected flow pay-off from a good risky arm is the same for both players.

Define \(\hat{g} = \lambda _1 h_1 \) where \(\lambda _1 = 1\) and \(h_1 > 0\). For any \(\lambda _2 \in (0,1)\), we choose a \(h_2 > 0\) such that \(\lambda _2 h_2 = \hat{g}\).

We will first analyse the social planner’s problem. Please refer to “Appendix (B.10)” for the explicit form of the Bellman equation for the planner’s value function *w*. The following proposition will show that the structure of the planner’s solution is the same as in Proposition 1.

### Proposition 8

### Proof

Proof is by a standard verification argument. Please see the “Appendix B.8” for details. \(\square \)

We will now analyse the non-cooperative game. Please refer to “Appendix (B.10)” for the explicit form of the Bellman equation player *i*’s (\(i = 1,2\)) value function \(w_i\) satisfies.

The single-agent thresholds are \(\hat{p_i} = \frac{rs}{rs+(r+\lambda _i)(\hat{g}-s)}\). It can be verified that \(\hat{p_1} < \hat{p_2}\). As in the baseline model, we can argue that in any equilibrium, \(\hat{p_1}\) is the lowest belief where some experimentation takes place and player 1 is the last one to experiment. This implies that, in any equilibrium, for beliefs right above \(\hat{p_1}\), pay-offs to player 1 and 2 are given by \(\bar{w_1}(p)\) and \(\bar{w_2}(p)\), respectively.^{17} It can be verified that \(\bar{w_1}\) is strictly convex and \(\bar{w_2}\) is strictly concave. By arguments similar to those in Lemma 1, we can infer that there exists a unique \(\bar{p}_1^{'} \in (\hat{p_1},1)\) such that \(\bar{w_1}(\bar{p}_1^{'}) = D_1(\bar{p}_1^{'})\) and a unique \(\bar{p}_2^{'} \in (\hat{p_2}, \frac{s}{\hat{g}})\) such that \(\bar{w_2}(\bar{p}_2^{'}) = D_2(\bar{p}_2^{'})\). In the following proposition, we establish that an equilibrium in cut-off strategies exists if and only if the degree of asymmetry is high enough.

### Proposition 9

There exists a \(\hat{\lambda }_2 \in (0,1)\) such that there exists an equilibrium in cut-off strategies if and only if \(\lambda _2 \in (0,\hat{\lambda }_2]\). In this equilibrium, player 1 plays risky on \((\hat{p_1},1]\) and safe otherwise, while player 2 plays risky on \((\bar{p}_2^{'},1]\) and safe otherwise.

### Proof

Please refer to “Appendix B.9” for details. \(\square \)

^{18}The black (red) curve depicts the pay-offs to player 1 (2). Since the flow pay-off obtained by each player from a good risky arm is fixed at \(\hat{g}\), the point of intersection between the best response line and the horizontal line \(w = s\) is the same for both players. As agents become more asymmetric, the best response lines diverge more from each other. Due to this, there emerges a range of beliefs where only player 2 can free ride. Hence, if the degree of asymmetry between the players is high enough, there exists an equilibrium in cut-off strategies.

*i*(\(i=1,2\)) is now given by \(\hat{D}_i : v= 2s - g_i p\). Beliefs \(\tilde{p_1^{'}}\) and \(\tilde{p_2^{'}}\) can be defined analogously to \(p_1^{'}\) and \(p_2^{'}\) above. Figure 6

^{19}shows an equilibrium in cut-off strategies in this framework. The black (red) curve depicts the pay-offs to player 1 (2). This equilibrium exists only when the players are highly asymmetric, and the best response diagonals are far apart from each other.

In both cases, if it exists, the equilibrium in cut-off strategies is welfare maximising. The argument is similar to above: Player 2 free rides the most in the equilibrium in cut-off strategies, so that the range of beliefs at which both players play risky is largest. In addition, for any equilibrium that is not in cut-off strategies, there is an open set of beliefs in which the roles of experimenting pioneer and free rider are reversed as compared to the equilibrium in cut-off strategies (where only player 2 ever free rides). In the case \(\lambda _1\not =\lambda _2\), both effects lead to greater delay in information production in the non-cut-off equilibrium. In the case \(\lambda _1=\lambda _2=\hat{\lambda }\), the first effect leads to greater delay, while the second effect leads to a higher opportunity cost of information production (\(s-g_1p < s-g_2p\)), in the non-cut-off equilibrium.

## 5 Conclusion

In this paper, we have characterised the set of Markov perfect equilibria in a two-armed bandit model with heterogeneous players. We have shown that there always exists an equilibrium in which the weaker player uses a cut-off strategy. If the heterogeneity is stark enough, there exists an equilibrium in cut-off strategies. If such an equilibrium exists, it is welfare optimal.

Thus, suppose there are two oil companies with vastly different drilling technologies, e.g. a big multinational firm and a small local enterprise. One could argue that the difference in technological capabilities between the two will be bigger in developing countries. On account of the big heterogeneity in capabilities, we should expect the equilibrium in cut-off strategies to exist. An empirically testable prediction of our model would thus be that there will be a higher frequency of instances in developing countries where the small local firm would free ride on the experimentation provided by the big multinational firm, and only enter the market after oil had been struck, even if the original level of uncertainty regarding the presence of oil was only moderate.

We have restricted players to using one arm only at any given instant *t*. By the linearity of the players’ Bellman equations, our equilibria would remain equilibria if we allowed players to select experimentation intensities \(k_{i,t}\in [0,1]\). There might, however, be more equilibria in this case.

Our analysis has relied heavily on the characterisation of players’ best responses via the diagonals \(D_i\) [see Eq. (4)], which was pioneered by Keller et al. (2005) for the homogeneous-player case. We expect that a similar approach could, *mutatis mutandis*, be used to study other kinds of asymmetries, e.g. pertaining to players’ safe-arm pay-offs \(s_i\). We should expect a similar result to our Proposition 3 to hold in these cases, namely, that there existed an equilibrium in cut-off strategies if and only if the heterogeneity was stark enough.

## Footnotes

- 1.
- 2.
See Hörner and Skrzypacz (2016) for an overview

- 3.
Parameter values for this figure: \(\lambda _1 = 1\); \(\lambda _2 = 0.4\); \(h = 3.5;s=1\) and \(r = 0.9\). \(p_1^{*} = 0.1488\) and \(p_2^{*} = 0.6686\).

- 4.
By standard results, on any open interval of beliefs in which player

*j*’s action choice is constant, player*i*’s value function \(v_i\) will be continuously differentiable. At those (finitely many) beliefs at which player*j*’s action changes, \(v'\) should be understood as the left derivative of*v*(since beliefs can only drift down). - 5.
- 6.
Parameter Values: \(\lambda _1=1;\lambda _2 = 0.9\); \(h = 2;s=1\) and \(r = 1.2\). \(p_1^{*} = 0.2857\) and \(p_2^{*} = 0.4126\). \(p_1^{'} = 0.4629\) and \(p_2^{'} = 0.4916\).

- 7.
These pay-offs are obtained from 12 by imposing the condition \(v_i^r(p_2^{'}) = \bar{v_i}(p_2^{'})\) (\(i = 1,2\)).

- 8.
All our figures correspond to parametric values such that the point of intersection of \(v_2\) and \(v_1\) lies to the right of \(p_2^{'}\). However, for very low values of \(\lambda _2\), this intersection will occur to the left of \(p_2^{'}\). All of our analysis goes through unchanged for this case as well.

- 9.
Parameter values: \(\lambda _1 = 1;\lambda _2 = 0.985;h = 2;s = 1\) and \(r = 1.9\). \(\bar{p}_1 = 0.3958;p_1^{'} = 0.4754;p_2^{'} = 0.4644\) and \(p_s^1 = 0.4740\).

- 10.
For the particular parameter values used in Fig. 3, we have \(v_1^{'}(p_2^{'-}) = 0.9694\); \(v_1^{'}(p_2^{'+}) = 1.5085\); \(v_2^{'}(p_2^{'-}) = 0.9894\); \(v_2^{'}(p_2^{'+}) = 0.3197\).

- 11.
For the particular parameter values used in Fig. 3, we have \(v_2^{'}(p_s^{1-}) = 0.4625\); \(v_2^{'}(p_s^{1+}) = 1.0721\).

- 12.
The uniqueness of \(p_S^n\in (\bar{p}_1,\frac{s}{g_n})\) follows from (11).

- 13.
Dong (2018) shows that if the players’ initial beliefs are asymmetric enough, equilibrium welfare improves.

- 14.
Proposition 6 implies that the qualitative characteristics of \(p_s^1\) and \(p_s^2\) are the same in any equilibrium in non-cut-off strategies. For simplicity, we consider an equilibrium in non-cut-off strategies where players swap roles only once in the figure.

- 15.
Parameter values: \(\lambda _1 = 1; \lambda _2 = 0.9;h;s=1\) and \(r = 1.2\). \(\bar{p}_1 = 0.3529;p_1^{'} = 0.4629;p_2^{'} = 0.4916;p_s^1 = 0.4499;p_s^2 = 0.5081\) and \(\tilde{p_1} = 0.39\).

- 16.
Mathematically, this can be seen as follows: consider a function \(v = g_2 p + C(1-p) (\frac{1-p}{p})^{\frac{r}{\lambda _2}}\), such that the integration constant is derived from \(v(\tilde{p_1}) = s\). Since \(\tilde{p_1}<\bar{p}_2\), direct computation shows that \(v^{'}(\tilde{p_1}) < 0\). In the equilibrium in non-cut-off strategies, to the immediate right of \(\tilde{p_1}\), 2’s pay-off is given by \(v_2 = g_2p +c_2(1-p) (\frac{1-p}{p})^{\frac{r}{\lambda _2}}\). The integration constant \(c_2\) is determined from \(v_2(\tilde{p_1}) = \bar{v_2}(\tilde{p_1})>s \Rightarrow c_2 > C\). Direct computation shows that this implies that 2’s pay-off will be negatively sloped in some right neighbourhood of \(\tilde{p_1}\).

- 17.
Please refer to “Appendix (B.10)” for explicit expressions for these functions.

- 18.
Parameter values: \(\lambda _1 = 1; \lambda _2 = 0.3;h_1 = 2;h_2 = \frac{20}{3};s=1\) and \(r = 1.2\). \(\hat{p_1} = 0.3529;\bar{p}_1^{'} = 0.4344;\bar{p}_2^{'} = 0.4779\).

- 19.
Parameter values: \(\lambda _1=\lambda _2 = 1\); \(h_1=2.1;h_2= 1.9\); \(s=1\) and \(r=1.2\). \(\tilde{p_1} = 0.3315;\tilde{p_1^{'}} = 0.4418\) and \(\tilde{p_2^{'}} = 0.4582\).

- 20.
We suppress arguments whenever this is convenient.

- 21.
In “Appendix A.1”, we display the ODEs that

*v*satisfies for each range of beliefs and the corresponding general form of*v*for that range. The specific value of*v*is obtained by value matching. - 22.
As in “Appendix A.1”, from 17 we can obtain the ODEs that

*w*satisfies for each range of beliefs and the corresponding general form of*w*for that range.

## Notes

### Acknowledgements

We are indebted to an anonymous associate editor and two anonymous referees, whose comments and suggestions greatly improved the paper, and to Sven Rady for his help and guidance. We also thank Siddhartha Bandyopadhyay, Kalyan Chatterjee, Martin Cripps and Johannes Hörner for their helpful comments. Thanks are also due to the participants at the World Congress of Game Theory (Masstricht 2016), EEA-ESEM (Geneva 2016) and SAET 2017 (Faro) where a subset of the results of the paper was presented. The remaining errors are our own.

## References

- Banks, J., Olson, M., Porter, D.: An experimental analysis of the bandit problem. Econ. Theory
**10**(1), 55–77 (1997). https://doi.org/10.1007/s001990050146 CrossRefGoogle Scholar - Besanko, D., Wu, J.: The impact of market structure and learning on the tradeoff between R & D competition and cooperation. J. Ind. Econ.
**61**, 166–201 (2013)CrossRefGoogle Scholar - Bolton, P., Harris, C.: Strategic experimentation. Econometrica
**67**, 349–374 (1999)CrossRefGoogle Scholar - Bonatti, A., Hörner, J.: Collaborating. Am. Econ. Rev.
**101**, 632–663 (2011)CrossRefGoogle Scholar - Das, K.: Too Much or Too Little? The Effect of Private Learning in a Model of Strategic Experimentation with Competition. Mimeo, University of Exeter, UK (2019)Google Scholar
- Dong, M.: Strategic Experimentation with Asymmetric Information. Mimeo, Penn State University, University Park, US (2018)Google Scholar
- Guo, Y.: Dynamic delegation of experimentation. Am. Econ. Rev.
**106**, 1969–2008 (2016)CrossRefGoogle Scholar - Heidhues, P., Rady, S., Strack, P.: Strategic experimentation with private payoffs. J. Econ. Theory
**159**, 531–551 (2015)CrossRefGoogle Scholar - Hoelzemann, J., Klein, N.: Bandits in the Lab. Mimeo, University of Toronto and Université de Montréal, Toronto, Canada and Montreal, Canada (2018)Google Scholar
- Hörner, J., Skrzypacz, A.: Learning, Experimentation and Information Design. Working paper, Stanford University (2016)Google Scholar
- Keller, G., Rady, S., Cripps, M.: Strategic experimentation with exponential bandits. Econometrica
**73**, 39–68 (2005)CrossRefGoogle Scholar - Keller, G., Rady, S.: Strategic experimentation with Poisson bandits. Theor. Econ.
**5**, 275–311 (2010)CrossRefGoogle Scholar - Klein, N.: Strategic learning in teams. Games Econ. Behav.
**82**, 632–657 (2013)CrossRefGoogle Scholar - Klein, N.: The importance of being honest. Theor. Econ.
**11**, 773–811 (2016)CrossRefGoogle Scholar - Klein, N., Rady, S.: Negatively correlated bandits. Rev. Econ. Stud.
**78**, 693–792 (2011)CrossRefGoogle Scholar - Murto, P., Välimäki, J.: Learning and information aggregation in an exit game. Rev. Econ. Stud.
**78**, 1426–1461 (2011)CrossRefGoogle Scholar - Rosenberg, D., Solan, E., Vielle, N.: Social learning in one-arm bandit problems. Econometrica
**75**, 1511–1611 (2007)CrossRefGoogle Scholar - Rosenberg, D., Salomon, A., Vieille, N.: On games of strategic experimentation. Games Econ. Behav.
**82**, 31–51 (2013)CrossRefGoogle Scholar - Thomas, C.: Experimentation with Congestion. University of Texas at Austin, Austin, US (2017)Google Scholar
- Zambrano, A.: Motivating informed decisions. Econ. Theory (2017). https://doi.org/10.1007/s00199-017-1087-3

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.