Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game
Granmo, O. & Glimsdal, S. Appl Intell (2013) 38: 479. doi:10.1007/s10489-012-0346-z
Abstract
The two-armed bandit problem is a classical optimization problem where a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus, one must balance between exploiting existing knowledge about the arms, and obtaining new information. Bandit problems are particularly fascinating because a large class of real world problems, including routing, Quality of Service (QoS) control, game playing, and resource allocation, can be solved in a decentralized manner when modeled as a system of interacting gambling machines.
Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel scheme for decentralized decision making based on the Goore Game in which each decision maker is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyper parameters of sibling conjugate priors, and on random sampling from these posteriors. We further report theoretical results on the variance of the random rewards experienced by each individual decision maker. Based on these theoretical results, each decision maker is able to accelerate its own learning by taking advantage of the increasingly more reliable feedback that is obtained as exploration gradually turns into exploitation in bandit problem based learning.
Extensive experiments, involving QoS control in simulated wireless sensor networks, demonstrate that the accelerated learning allows us to combine the benefits of conservative learning, which is high accuracy, with the benefits of hurried learning, which is fast convergence. In this manner, our scheme outperforms recently proposed Goore Game solution schemes, where one has to trade off accuracy with speed. As an additional benefit, performance also becomes more stable. We thus believe that our methodology opens avenues for improved performance in a number of applications of bandit based decentralized decision making.
Keywords
Bandit problemsGoore GameBayesian learningDecentralized decision makingQuality of service controlWireless sensor networks1 Introduction
The conflict between exploration and exploitation is a well-known problem in reinforcement learning, and other areas of artificial intelligence. The Two-Armed Bandit (TAB) problem captures the essence of this conflict. In brief, a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus one must balance between exploiting existing knowledge about the arms, and obtaining new information.
1.1 Thompson sampling
In [8] we reported a Bayesian technique for solving bandit-like problems, revisiting the Thompson Sampling [21] principle pioneered in 1933. This revisit led to novel schemes for handling multi-armed and dynamic (restless) bandit problems [9, 13, 14], and empirical results demonstrated the advantages of these techniques over established top performers. Furthermore, we provided theoretical results stating that the original technique is instantaneously self-correcting and that it converges to only pulling the optimal arm, with probability as close to unity as desired. In addition to the theoretical convergence results found in [8], May et al. recently reported an alternative proof strategy for establishing convergence properties for optimistic Bayesian sampling [16]. As a further testimony to the renewed importance of the Thompson Sampling principle, Wang et al. [24] combined so-called sparse sampling with Bayesian exploration, enabling efficient searching of the arm selection space using a sparse look-ahead tree. In [4], Dimitrakakis derived optimal decision thresholds for the multi-armed bandit problem, for both the infinite horizon discounted reward case and the finite horizon undiscounted case. Later on, a modern Bayesian look at the multi-armed bandit problem was also taken in [16, 20]. Promising recent application areas for Thompson Sampling include Bayesian click-through rate optimization for sponsored search advertising [7] and web site optimization [6, 20].
1.2 Decentralized decision making and multi-armed bandits
Multiple interacting bandits problems are particularly fascinating because they can be used to model, and efficiently solve a large class of real world decentralized decision making problems, such as QoS-control in wireless sensor networks [15], routing [19], game playing [5], combinatorial optimization [1, 10], and resource allocation [11, 12]. In decentralized decision making problems, however, a certain phenomenon renders current bandit problem based solutions sub-optimal. Specifically, multiple decentralized decision makers are simultaneously exploring a collection of interacting bandits. This means that the variances of the reward distributions of each bandit problem are governed by the current level of exploration being manifested in the system as a whole. In other words, the variance of the reward distributions will be fluctuating with the degree of exploration taking place. Thus, initially, when exploration typically is significant, each decision maker should be correspondingly more conservative or cautious when interpreting the received rewards. Otherwise, by being too reckless, the decision maker may be led astray early on, converging to a sub-optimal decision.
The traditional approach to dealing with the above described fluctuation of reward distribution variance is to make learning sufficiently conservative. The purpose is to minimize the chance of each decision maker converging prematurely. Obviously, the disadvantage of this approach is the corresponding loss in learning speed caused by remaining too conservative even after exploration calms down. A recent approach deals with this problem indirectly by incorporating a Kalman filter into the decision making [9], allowing each decision maker to track changing reward distributions. Thus, overly reckless learning initially is offset by the “forgetting” mechanism of the Kalman filter. This means that premature convergence is hindered. Yet, this tracking of changing reward distributions also means that exploration never stops. The decision makers will, as a result, never converge to a single optimal decision.
1.3 Paper contributions and organization
In this paper, we propose a novel scheme for solving one particular class of decentralized decision making problems, namely, the Goore Game (GG) [22]. The GG has applications within QoS control in wireless sensor networks, and we describe both the GG and recent applications in Sect. 2. We then proceed to introduce a scheme for Accelerated Decentralized Learning in Two-Armed Bandit Based Decision Making (ADL-TAB) in Sect. 3. The ADL-TAB scheme directly and specifically addresses fluctuating reward distribution variances. To achieve this, we derive theoretical results that characterize the variance of the random rewards experienced by each individual decision maker. Based on these theoretical results, each decision maker is able to accelerate its own learning as follows. When a decision maker chooses which arm to pull, it also submits a measurement of its degree of exploration, which we refer to as arm selection variance. In turn, along with the random reward it receives from the arm pull, it also receives a signal that reflects the current aggregated level of exploration being manifested in the system. Using this signal, each decision maker accelerates learning by taking advantage of the increasingly more reliable feedback that can be obtained when exploration gradually turns into exploitation. In Sect. 4, we demonstrate empirically that the accelerated learning allows us to combine the benefits of conservative learning, which is high accuracy, with the benefits of hurried learning, which is fast convergence. We also achieve significant performance benefits when applying ADL-TAB to QoS control in wireless sensor networks that are dynamically changing through a stochastic sensor birth-death process. In brief, our scheme clearly outperforms recently proposed Goore Game solution schemes, where one has to trade off accuracy with speed. Finally, in Sect. 5, we conclude and provide pointers to further research.
2 The Goore Game (GG)
Imagine a large room containing N cubicles and a raised platform. One person (voter) sits in each cubicle and a Referee stands on the platform. The Referee conducts a series of voting rounds as follows. On each round the voters vote “Yes” or “No” (the issue is unimportant) simultaneously and independently (they do not see each other) and the Referee counts the fraction, λ, of “Yes” votes. The Referee has a uni-modal performance criterion G(λ), which is optimized when the fraction of “Yes” votes is exactly λ^{∗}. The current voting round ends with the Referee awarding a dollar with probability G(λ) and assessing a dollar with probability 1−G(λ) to every voter independently. On the basis of their individual gains and losses, the voters then decide, again independently, how to cast their votes on the next round.
The GG has the following noteworthy properties:

1. The game is a non-trivial non-zero-sum game.
2. Unlike the games traditionally studied in the AI literature (like Chess, Checkers, Lights-Out, etc.), the game is essentially a distributed game.
3. The players of the game are ignorant of all of the parameters of the game. All they know is that they have to make a choice, for which they are either rewarded or penalized. They have no clue as to how many other players there are, how they are playing, or even of how/why they are rewarded/penalized.
4. The stochastic function used to reward or penalize the players can be completely arbitrary, as long as it is uni-modal.
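A single voting round of this game can be sketched in a few lines of Python. The Gaussian-shaped criterion G below is only an illustrative choice (the game merely requires G to be uni-modal), and the ±1 outcomes model the dollar awarded or assessed to each voter:

```python
import math
import random

def goore_round(votes, mu_g=0.375, sigma_g=0.2):
    """Play one Goore Game round: given a list of binary votes (1 = 'Yes'),
    return per-voter outcomes (+1 dollar awarded, -1 dollar assessed).
    The Gaussian-shaped G is an illustrative uni-modal criterion."""
    lam = sum(votes) / len(votes)                        # fraction of "Yes" votes
    g = math.exp(-0.5 * ((lam - mu_g) / sigma_g) ** 2)   # peaks at lam == mu_g
    # each voter independently wins a dollar w.p. G(lambda), loses one otherwise
    return [1 if random.random() < g else -1 for _ in votes]
```

With the vote fraction exactly at the optimum, G(λ) = 1 and every voter is rewarded; the farther λ drifts from the optimum, the more voters are penalized on average.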
The literature concerning the GG is sparse. It was initially studied in the general learning domain, and, as far as we know, was for a long time merely considered as an interesting pathological game. Recently, however, the GG has found important applications within two main areas, namely, Quality of Service (QoS) control in wireless sensor networks [3] and within cooperative mobile robotics, as summarized in [2].
The GG has found applications within the field of wireless sensor networks, as explained briefly here. Consider a base station that collects data from a sensor network. The sensors of the network are battery driven and have been dropped from the air, leaving some of them non-functioning. The functioning sensors can either be switched on or off, and since they are battery-driven, it is expedient that they should be turned off whenever possible. The base station, on the other hand, has been set to maintain a certain resolution (i.e., QoS), and therefore requires that Q sensors are switched on. Unfortunately, it does not know the number of functioning sensors, and it is only able to contact them by means of a broadcast, leaving it unable to address them individually. This leaves us with the following challenge: How can the base station turn on exactly Q sensors, only by means of its limited broadcast capability?
Iyer et al. [15] proposed a scheme where the base station provided broadcast QoS feedback to the sensors of the network. Using this model, the above problem was solved by modeling it as a GG [23]. From the GG perspective, a sensor is seen as a voter that chooses between transmitting data or remaining idle in order to preserve energy. Thus, in essence, each sensor takes the role of a GG player that either votes “On” or “Off”, and acts accordingly. The base station, on the other hand, is seen as the GG Referee with a uni-modal performance function G(⋅) whose maximum is found at Q normalized by the total number of sensors available. The “trick” is to let the base station (1) count the number of sensors that have turned on, and (2) use the broadcast mechanism to distribute, among the sensors, the corresponding reward based on the probability obtained from G(⋅). The application of the GG solution to the field of sensor networks is thus straightforward.
Furthermore, Tung and Kleinrock [23] have demonstrated how the GG can be used for coordinating groups of mobile robots (also called “mobots”) that have a restricted ability to communicate. The main example application described in [23] consists of a fixed number of mobots that can either (1) collect pieces of ore from a landscape, or (2) sort already collected ore pieces. The individual mobots vary with respect to how fast they collect and how fast they sort these pieces of ore. In this context, the GG is used to make sure that the mobots choose their action so as to maximize the throughput of the overall collection and sorting system.
Other possible cooperative robotics applications include controlling a moving platform and guarding a specified perimeter [2]. In all of these cases, the solution to the problem in question would essentially utilize the solution to the GG in a plug-and-play manner.
3 Accelerated decentralized learning in two-armed bandit based decision making (ADL-TAB)
This paper proposes a novel scheme for decentralized decision making in which each decision maker is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyper parameters of sibling conjugate priors, and on random sampling from these posteriors. Based on the sibling conjugate priors, we also measure the current degree of exploration and exploitation being manifested in the system as a whole. This allows each decision maker to accelerate its learning by taking advantage of the increasingly more reliable feedback that can be obtained when exploration gradually turns into exploitation.
3.1 Bayesian sampling for two-armed normal bandits (BS-TANB)
As seen from the above BS-TANB algorithm, t is a discrete time index and the parameters ϕ^{t}=〈(μ_{0}[t],σ_{0}[t]),(μ_{1}[t],σ_{1}[t])〉 form an infinite 4-dimensional continuous state space Φ, with each pair (μ_{i}[t],σ_{i}[t]) giving the prior distribution of the unknown reward r_{i} associated with Arm i. Within Φ, BS-TANB navigates by transforming each prior distribution into a posterior distribution, based on the reward \(\tilde{r}_{i}\) obtained from selecting Arm i, α[t]=i, as well as on the observation noise, \(\sigma_{ob}^{2}\), given as an input parameter to the algorithm. Essentially, the algorithm uses the observation noise \(\sigma_{ob}^{2}\) to determine how much emphasis to put on the reward \(\tilde{r}_{i}\), which is a crucial property that we will now take advantage of.
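A minimal sketch, consistent with the description above, of how BS-TANB can select arms by Thompson sampling and update its normal conjugate priors with known observation noise. The prior parameters `mu0` and `sigma0` are illustrative assumptions, not values prescribed by the paper:

```python
import random

class BSTANB:
    """Sketch of Bayesian Sampling for Two-Armed Normal Bandits.
    Each arm i carries a normal prior N(mu[i], var[i]) over its
    unknown expected reward; sigma_ob is the known observation noise."""

    def __init__(self, sigma_ob, mu0=0.0, sigma0=10.0):
        self.sigma_ob2 = sigma_ob ** 2
        self.mu = [mu0, mu0]
        self.var = [sigma0 ** 2, sigma0 ** 2]

    def select_arm(self):
        # Thompson sampling: draw one value from each arm's posterior
        # and pull the arm whose draw is largest
        x = [random.gauss(self.mu[i], self.var[i] ** 0.5) for i in (0, 1)]
        return 1 if x[1] > x[0] else 0

    def update(self, arm, reward):
        # conjugate normal-normal update with known noise sigma_ob^2:
        # precisions add, and the posterior mean is a precision-weighted
        # average of the prior mean and the observed reward
        prec = 1.0 / self.var[arm] + 1.0 / self.sigma_ob2
        self.mu[arm] = (self.mu[arm] / self.var[arm]
                        + reward / self.sigma_ob2) / prec
        self.var[arm] = 1.0 / prec
```

Note how a large `sigma_ob2` makes each reward nearly ignored, while a small one lets a single reward dominate the posterior; this is exactly the lever that ADL-TAB adjusts adaptively.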
3.2 BS-TANB based decentralized decision making
Let p_{q}[t]=P(α_{q}[t]=1) be the probability that decision maker V_{q} casts a “Yes” vote on round t. Then 1−p_{q}[t] is the probability that V_{q} casts a “No” vote, and each voting α_{q}[t] can be seen as a Bernoulli trial in which a “Yes” vote is a success and a “No” vote is a failure. Note that the concrete instantiation of the arm selection probability p_{q}[t] is governed by the learning scheme applied, which in our case is BS-TANB.
Definition 1 (Arm Selection Variance)

In a two-armed bandit problem where the current arm selection probability is p, we define Arm Selection Variance, σ^{2}, to be the variance, p(1−p), of the outcome of the corresponding Bernoulli trial.
As seen in Fig. 1, in addition to casting a vote α_{q}[t], each decision maker V_{q} also submits its present Arm Selection Variance, \(\sigma_{q}^{2}[t]\), in order to signal its level of exploration. Thus, as in the traditional Goore Game setup, a Referee calculates the fraction, λ[t], of “Yes” votes. In addition, it now also calculates the variance \(\sigma_{A}^{2}[t]\) of the total number of “Yes” votes, which is simply the sum of the variances of the independently cast votes (cf. the Bienaymé formula): \(\sigma_{A}^{2}[t] = \sum_{q=1}^{n} \sigma_{q}^{2}[t]\). Note that in practice, such as in QoS control in wireless sensor networks [15], this operation is conducted by the so-called base station of the network.
The Referee has a uni-modal normally distributed performance criterion G(λ[t];μ_{G},σ_{G}), where μ_{G} is the mean and \(\sigma_{G}^{2}\) is the variance, which is thus optimized when the fraction of “Yes” votes is exactly μ_{G}, λ[t]=μ_{G}. The current voting round ends with the Referee awarding a reward \(\tilde{r}_{i}\) to each voter, with the reward being of magnitude G(λ[t];μ_{G},σ_{G}). Additionally, white noise N(0,σ_{W}) is independently added to the reward received by each voter.
On the basis of their individual gains, the voters then decide, again independently, how to cast their votes on the next round.
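The Referee's side of one ADL-TAB round, as described above, can be sketched as follows. The unnormalized Gaussian form of G and all parameter names are illustrative assumptions:

```python
import math
import random

def referee_round(votes, variances, mu_g, sigma_g, sigma_w):
    """Referee side of one ADL-TAB round (a sketch). Each voter submits
    a binary vote and its arm selection variance p(1 - p); the referee
    returns per-voter noisy rewards plus the aggregate vote-count
    variance sigma_A^2 broadcast back to the voters."""
    n = len(votes)
    lam = sum(votes) / n                                  # fraction of "Yes" votes
    sigma_a2 = sum(variances)                             # Bienaymé: sum of vote variances
    g = math.exp(-0.5 * ((lam - mu_g) / sigma_g) ** 2)    # reward magnitude G(lambda)
    # each voter receives G(lambda) plus independent white noise N(0, sigma_W)
    rewards = [g + random.gauss(0.0, sigma_w) for _ in votes]
    return rewards, sigma_a2
```

A voter at p = 0.5 contributes the maximal 0.25 to σ_A², while a converged voter (p near 0 or 1) contributes almost nothing, so σ_A² directly measures how much exploration remains in the system.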
3.3 Measuring fluctuating observation noise in Goore Games
In order to develop a decentralized BS-TANB based scheme for solving the above problem, whose accuracy does not rely merely on conservative learning, it is crucial that we are able to determine the observation noise, \(\sigma_{ob}^{2}\), needed by BS-TANB for its Bayesian computations.
From the perspective of voter V_{q}, let Y_{q}=∑_{r≠q}α_{r}[t] be the total number of “Yes” votes found among the n−1 votes cast by the other voters (r≠q). According to our Bayesian bandit scheme, each voter V_{q}, at any given iteration t of the game, casts its vote according to a Bernoulli distribution with success probability \(p_{q}[t] = P(\alpha_{q}[t] = 1) = P(X_{1} > X_{0} |\phi_{q}^{t})\), the probability of voting “Yes”. Furthermore, initially, all voters vote “Yes” with probability p_{q}[1]=0.5, and, based on Bayesian computations, gradually shift their probability of voting “Yes” towards either 0 or 1 as learning proceeds. This leads us to design a solution for the case where Y_{q} is a sum of independent random variables of similar magnitude, in other words, where Y_{q} is approximately normally distributed for large n, \(Y_{q}\sim N(\mu_{F}^{q}, \sigma_{F}^{q})\). Since each term in the summation is Bernoulli distributed, the mean of the sum becomes \(\mu_{F}^{q} = \sum_{r\ne q} p_{r}[t]\), while the variance becomes \({\sigma_{F}^{q}}^{2} = \sum_{r\ne q} p_{r}[t] (1 - p_{r}[t])\). The above entails that each voter, V_{q}, essentially decides whether or not to add an additional “Yes” vote to a random sum of “Yes” votes, \(Y_{q} \sim N(\mu_{F}^{q}, \sigma_{F}^{q})\). That is, the reward that voter V_{q} receives when it votes either “Yes” (α_{q}[t]=1) or “No” (α_{q}[t]=0) becomes a function \(G(\frac{Y_{q} + \alpha_{q}[t]}{n})\) governed by the random variable \(Y_{q}\sim N(\mu_{F}^{q}, \sigma_{F}^{q})\) as well as the decision α_{q}[t] of voter V_{q}.
Thus \(E[G(\frac{Y_{q} + \alpha_{q}[t]}{n})]\) is the expected reward received by voter V_{q} when pulling arm α_{q}[t], and \(\operatorname {Var}[G(\frac{Y_{q} + \alpha_{q}[t]}{n})]\) is the variance of the reward, which we will refer to as the observation noise, \(\sigma_{ob}^{2}\).
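Under the normal approximation of Y_{q} derived above, these two quantities can be estimated numerically. The following Monte Carlo sketch uses an illustrative Gaussian-shaped G as a stand-in for the referee's uni-modal criterion:

```python
import math
import random

def observation_noise_mc(p_values, q, alpha, mu_g, sigma_g, samples=100000):
    """Monte Carlo estimate of the mean and variance of voter V_q's reward
    G((Y_q + alpha)/n), using the normal approximation Y_q ~ N(mu_F, sigma_F).
    p_values holds every voter's current 'Yes' probability; the Gaussian G
    is an illustrative choice."""
    n = len(p_values)
    others = [p for r, p in enumerate(p_values) if r != q]
    mu_f = sum(others)                                      # mean of Y_q
    sigma_f = math.sqrt(sum(p * (1 - p) for p in others))   # std of Y_q
    rewards = []
    for _ in range(samples):
        y = random.gauss(mu_f, sigma_f)                     # draw Y_q
        lam = (y + alpha) / n                               # resulting vote fraction
        rewards.append(math.exp(-0.5 * ((lam - mu_g) / sigma_g) ** 2))
    mean = sum(rewards) / samples
    var = sum((r - mean) ** 2 for r in rewards) / samples   # observation noise estimate
    return mean, var
```

When every other voter has converged (all p near 0 or 1), σ_F vanishes and the estimated observation noise collapses towards zero, which is precisely the effect the closed-form analysis below exploits.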
Lemma 1
Lemma 2
Theorem 1
Proof
A crucial consequence of the results presented in this section is that since σ_{F} in the above equation can be approximated based on the feedback σ_{A} from the Referee (see Fig. 1), we can find the worst-case observation noise based on Theorem 1. Thus, we have found a closed-form formula for the worst-case observation noise, \(\sigma_{ob}^{2}\), that each voter can apply adaptively in its Bayesian computations.
4 Empirical results
In this section we evaluate the ADL-TAB scheme by comparing it with the currently best performing algorithm—the family of Bayesian techniques reported in [8]. Based on our comparison with these “reference” algorithms, it should be quite straightforward to also relate the ADL-TAB performance results to the performance of other similar algorithms. We first use artificial data and then data from a simulated sensor network.
4.1 Artificial data
We have conducted numerous experiments using various reward distributions, including a wide range of G(λ)-functions and a wide range of voters, under varying degrees of observation noise. The full range of empirical results shows the same trend; we therefore report performance on a representative subset of the experiment configurations, involving the 3, 5, and 10 player Goore Game. Performance is measured in terms of Regret: the difference between the sum of rewards expected after N successive rounds of the GG, and what would have been obtained by always casting the optimal number of “Yes” votes.
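The regret measure can be sketched as follows, assuming each of the n players receives expected reward G(λ[t]) per round and that the optimum is taken over the discrete fractions k/n reachable with n votes; both the Gaussian-shaped G and this per-player accounting are illustrative assumptions:

```python
import math

def g(lam, mu_g, sigma_g):
    """Illustrative uni-modal Gaussian-shaped performance criterion."""
    return math.exp(-0.5 * ((lam - mu_g) / sigma_g) ** 2)

def regret(yes_fractions, n_players, mu_g, sigma_g):
    """Regret after N rounds: the expected reward under the best achievable
    'Yes' fraction minus the expected reward actually obtained, summed over
    all players and all rounds. yes_fractions[t] is lambda[t]."""
    # the best fraction reachable with n discrete votes
    best = max(g(k / n_players, mu_g, sigma_g) for k in range(n_players + 1))
    return sum(n_players * (best - g(f, mu_g, sigma_g)) for f in yes_fractions)
```

Regret is zero only when every round already casts the optimal number of “Yes” votes, and it grows linearly with the number of rounds spent away from that optimum, which is why the 10 000-iteration columns separate the schemes most clearly.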
Regret after 10, 100, 1000, and 10 000 iterations for 3, 5, and 10 players
Scheme | #Players | Function | 10 | 100 | 1000 | 10 000 |
---|---|---|---|---|---|---|
Accelerating | 3 | G∼N(0.125,0.1) | 11.56 | 26.72 | 30.96 | 33.17 |
Static | 3 | G∼N(0.125,0.1) | 11.63 | 27.27 | 34.88 | 47.20 |
Accelerating | 3 | G∼N(0.125,0.2) | 5.26 | 8.47 | 10.35 | 11.09 |
Static | 3 | G∼N(0.125,0.2) | 5.28 | 9.53 | 15.15 | 25.15 |
Accelerating | 3 | G∼N(0.375,0.2) | 6.62 | 10.86 | 11.99 | 12.63 |
Static | 3 | G∼N(0.375,0.2) | 6.73 | 12.15 | 14.36 | 17.72 |
Accelerating | 5 | G∼N(0.125,0.1) | 18.37 | 41.60 | 51.78 | 61.65 |
Static | 5 | G∼N(0.125,0.1) | 18.28 | 44.94 | 58.52 | 99.49 |
Accelerating | 5 | G∼N(0.125,0.2) | 6.92 | 12.94 | 22.99 | 60.80 |
Static | 5 | G∼N(0.125,0.2) | 7.01 | 15.39 | 32.94 | 69.86 |
Accelerating | 5 | G∼N(0.375,0.2) | 6.12 | 20.47 | 22.70 | 25.75 |
Static | 5 | G∼N(0.375,0.2) | 6.16 | 24.24 | 30.11 | 35.74 |
Accelerating | 10 | G∼N(0.125,0.1) | 32.81 | 93.82 | 133.65 | 443.7 |
Static | 10 | G∼N(0.125,0.1) | 32.84 | 99.27 | 143.67 | 549.8 |
Accelerating | 10 | G∼N(0.125,0.2) | 10.19 | 19.57 | 39.58 | 110.53 |
Static | 10 | G∼N(0.125,0.2) | 10.21 | 22.57 | 56.91 | 167.09 |
Accelerating | 10 | G∼N(0.375,0.2) | 4.40 | 31.20 | 113.42 | 116.65 |
Static | 10 | G∼N(0.375,0.2) | 4.41 | 32.03 | 163.40 | 197.31 |
Performance with σ_{F} augmented, \(\widehat{\sigma_{F}} =c \cdot \sigma_{F}\) (10 players, G∼N(0.1,0.1), σ_{W}=0.1)
Scheme/c | 1.0 | 1.25 | 1.5 | 1.75 |
---|---|---|---|---|
Accelerating | 1030.0 | 684.7 | 444.7 | 408.7 |
Static | 965.6 | 624.3 | 550.4 | 414.2 |
Performance with distorted \(\widehat{\sigma}_{G}\) given to ADL-TAB (10 players, G∼N(0.125,0.2), σ_{W}=0.1)
\(\widehat{\sigma}_{G}\) | 0.85⋅σ_{G} | 0.90⋅σ_{G} | 0.95⋅σ_{G} | 1.0⋅σ_{G} | 1.05⋅σ_{G} | 1.10⋅σ_{G} | 1.15⋅σ_{G} |
---|---|---|---|---|---|---|---|
Regret | 74.4 | 75.7 | 90.9 | 123.5 | 162.9 | 194.4 | 237.5 |
Performance with varying degrees of white noise N(0,σ_{W}) (10 players, G∼N(0.375,0.2))
Scheme/σ_{W} | 0.01 | 0.05 | 0.1 | 0.5 | 1.0 | 5.0 |
---|---|---|---|---|---|---|
Accelerating | 56.6 | 54.8 | 61.1 | 123.4 | 315.8 | 2012.0 |
Static | 120.5 | 121.4 | 121.6 | 184.4 | 371.7 | 2013.3 |
Thus, based on our empirical results on artificial data, we conclude that ADL-TAB is the superior choice for the GG, both when σ_{G} is known and when it is slightly distorted, providing significantly better performance in all experiment configurations.
4.2 QoS control in wireless sensor networks
As mentioned in the introduction, the GG can be used for QoS control in wireless sensor networks. A scenario of particular interest is randomly deployed networks, whose applications include environmental monitoring and battlefield surveillance and reconnaissance [3]. An additional complexity for QoS control under such settings is sensor breakdown. Due to the random deployment of sensors, and due to the nature of typical applications, it is often infeasible to track down and repair broken sensors. Instead, batches of new sensors are deployed to replace broken ones, e.g., by air drop. As a result, the population of sensors is dynamically changing over time, in a stochastic manner.
To stress the ADL-TAB scheme under particularly challenging conditions, we have thus simulated the latter kind of dynamic environments. In brief, the simulated environment starts out with ten randomly deployed sensors. As in [15], both the lifetime of each sensor and the rate of deployment are governed by the same exponential distribution. Therefore the total number of operative sensors will fluctuate, but remain constant on average.
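One plausible reading of this birth-death process can be sketched as follows. Treating the Birth/Death Rate as a mean time in hours, and using competing exponential clocks for deaths and replacements, are our assumptions for illustration, not details taken from the simulation in [15]:

```python
import random

def simulate_population(rate, n0=10, horizon=1000.0, seed=None):
    """Birth-death sensor population sketch: each of the n live sensors
    dies after an exponential lifetime with mean 'rate' hours, while
    replacements arrive at rate n0/rate, so the population fluctuates
    around n0 on average (an illustrative model)."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    sizes = []
    while t < horizon:
        # with n sensors alive, the next death arrives at total rate n/rate
        death = rng.expovariate(max(n, 1) / rate)
        birth = rng.expovariate(n0 / rate)
        if death < birth:
            t += death
            n = max(n - 1, 0)
        else:
            t += birth
            n += 1
        sizes.append(n)
    return sizes
```

Under these assumptions the stationary population is approximately Poisson with mean n0, so smaller Birth/Death Rate values (e.g., 12.5 hours) mean faster turnover of sensors, matching the table below where the gap between the schemes widens as turnover accelerates.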
Regret after 10, 100, 1000, and 10 000 hours
Scheme | Birth/Death Rate | 10 | 100 | 1000 | 10 000 |
---|---|---|---|---|---|
Accelerating | 100 | 4.47 | 31.50 | 180.43 | 1055.23 |
Static | 100 | 4.48 | 31.45 | 194.14 | 3003.87 |
Accelerating | 50 | 4.47 | 31.65 | 195.84 | 1268.32 |
Static | 50 | 4.45 | 31.58 | 203.19 | 4734.49 |
Accelerating | 12.5 | 4.49 | 32.71 | 219.34 | 1270.76 |
Static | 12.5 | 4.39 | 32.04 | 357.60 | 9589.84 |
From the table it is clear that ADL-TAB obtains significantly lower regret than the static BS-TANB scheme in these scenarios too. Indeed, the performance benefit increases with faster replacement of sensors. This can be explained by the ability of a newly placed sensor, governed by the ADL-TAB scheme, to perceive any learning stability already established among the operating sensors, which in turn allows the sensor to accelerate its transition from exploration to exploitation.
Standard deviation of regret after 10 000 hours
Scheme | Birth/Death Rate = 100 | Birth/Death Rate = 50 | Birth/Death Rate = 12.5 |
---|---|---|---|
Accelerating | σ_{100}=254.6 | σ_{50}=229.0 | σ_{12.5}=206.5 |
Static | σ_{100}=1561.5 | σ_{50}=1628.9 | σ_{12.5}=1393.7 |
5 Conclusion and further work
In this paper we proposed a novel scheme, ADL-TAB, for decentralized decision making based on the Goore Game. Theoretical results concerning the variance of the observations made by each individual decision maker enabled us to accelerate learning as exploration turns into exploitation. Indeed, our empirical results demonstrated that the accelerated learning improves both learning accuracy and speed, outperforming state-of-the-art Goore Game solution schemes, both when using artificial data and when using data from a wireless sensor network simulation.
As further work, we intend to study how the Kalman filter can be incorporated into ADL-TAB, so that non-stationary behavior can be modeled and addressed in a principled manner. We are also currently investigating how the present result can be extended to other classes of decentralized decision making problems. Finally, we believe this avenue of research can lead to enhancements in application areas such as decentralized task scheduling, processing pipeline optimization, and resource allocation.
Footnote: By this we mean that P is not a fixed function. Rather, it denotes the probability function for a random variable, given as an argument to P.