
Infomax Strategies for an Optimal Balance Between Exploration and Exploitation


Abstract

A proper balance between exploitation and exploration is what makes for good decisions that achieve high rewards, such as payoff or evolutionary fitness. The Infomax principle postulates that maximization of information directs the function of diverse systems, from living systems to artificial neural networks. While specific applications have proven successful, the validity of information as a proxy for reward remains unclear. Here, we consider the multi-armed bandit decision problem, which features arms (slot machines) with unknown probabilities of success and a player trying to maximize cumulative payoff by choosing the sequence of arms to play. We show that an Infomax strategy (Info-p), which optimally gathers information on the highest probability of success among the arms, saturates known optimal bounds and compares favorably to existing policies. Conversely, gathering information on the identity of the best arm in the bandit leads to a strategy that is vastly suboptimal in terms of payoff. The nature of the quantity selected for Infomax acquisition is thus crucial for effective tradeoffs between exploration and exploitation.


References

  1. Atick, J.J., Redlich, A.N.: What does the retina know about natural scenes? Neural Comput. 4, 196–210 (1992)
  2. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multi-armed bandit problem. Mach. Learn. 47, 235–256 (2002)
  3. Barlow, H.B.: Possible Principles Underlying the Transformation of Sensory Messages. MIT Press, Cambridge (1961)
  4. Barron, A., Cover, T.M.: A bound on the financial value of information. IEEE Trans. Inf. Theory 34, 1097–1100 (1988)
  5. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7, 1129–1159 (1995)
  6. Bergstrom, C.T., Lachmann, M.: Shannon information and biological fitness. In: Proceedings of the IEEE Workshop on Information Theory. IEEE, Berlin (2004)
  7. Berry, D.A., Fristedt, B.: Bandit Problems: Sequential Allocation of Experiments. Springer, Dordrecht (2001)
  8. Bialek, W.: Biophysics: Searching for Principles. Princeton University Press, Princeton (2012)
  9. Burnetas, A., Katehakis, M.: Optimal adaptive policies for Markov decision processes. Math. Oper. Res. 22, 222–255 (1997)
  10. Cappé, O., Garivier, A., Maillard, O., Munos, R., Stoltz, G.: Kullback-Leibler upper confidence bounds for optimal sequential allocation. Ann. Stat. 41(3), 1516–1541 (2013)
  11. Chang, F., Lai, T.L.: Optimal stopping and dynamic allocation. Adv. Appl. Probab. 19(4), 829–853 (1987)
  12. Cheong, R., Rhee, A., Wang, C.J., Nemenman, I., Levchenko, A.: Information transduction capacity of noisy biochemical signaling networks. Science 334, 354–358 (2011)
  13. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley, New York (2006)
  14. Dayan, P., Abbott, L.F.: Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, Cambridge (2001)
  15. Donaldson-Matasci, M.C., Bergstrom, C.T., Lachmann, M.: The fitness value of information. Oikos 119, 219–230 (2010)
  16. François, P., Siggia, E.D.: Predicting embryonic patterning using mutual entropy fitness and in silico evolution. Development 137, 2385–2395 (2010)
  17. Gallager, R.G.: Information Theory and Reliable Communication. Wiley, New York (1968)
  18. Gillespie, D.T.: Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81(25), 2340–2361 (1977)
  19. Gittins, J., Glazebrook, K., Weber, R.: Multi-Armed Bandit Allocation Indices, 2nd edn. Wiley, New York (2011)
  20. Gittins, J.C.: Bandit processes and dynamic allocation indices. J. R. Stat. Soc. B 6, 148–177 (1995)
  21. Honda, J., Takemura, A.: An asymptotically optimal bandit algorithm for bounded support models. In: Proceedings of the Annual Conference on Learning Theory (COLT) (2010)
  22. Howard, R.A.: Information value theory. IEEE Trans. Syst. Sci. Cybern. 2, 22–26 (1966)
  23. Kaufmann, E., Korda, N., Munos, R.: Thompson Sampling: An Asymptotically Optimal Finite Time Analysis. Lecture Notes in Computer Science, pp. 199–213. Springer, Berlin/Heidelberg (2012)
  24. Kelly, J.L.: A new interpretation of information rate. Bell Syst. Tech. J. 35, 917–926 (1956)
  25. Kussell, E., Leibler, S.: Phenotypic diversity, population growth, and information in fluctuating environments. Science 309, 2075–2078 (2005)
  26. Lai, T.L.: Adaptive treatment allocation and the multi-armed bandit problem. Ann. Stat. 15(3), 1091–1114 (1987)
  27. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6, 4–22 (1985)
  28. Laughlin, S.B.: The role of sensory adaptation in the retina. J. Exp. Biol. 146, 39–62 (1989)
  29. Linsker, R.: Self-organization in a perceptual network. IEEE Comput. 21(3), 105–117 (1988)
  30. MacKay, D.J.C.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
  31. Margolin, A.A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R.D.: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinf. 7(1), S7 (2006)
  32. Mézard, M., Montanari, A.: Information, Physics and Computation. Oxford University Press, Oxford (2009)
  33. Nemenman, I.: Information Theory and Adaptation. Chapman and Hall/CRC Mathematical and Computational Biology. CRC Press, Boca Raton (2012)
  34. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge (1992)
  35. Rieke, F., Warland, D., de Ruyter van Steveninck, R., Bialek, W.: Spikes: Exploring the Neural Code. Bradford Book, Cambridge (1999)
  36. Rivoire, O., Leibler, S.: The value of information for populations in varying environments. J. Stat. Phys. 142, 1124–1166 (2011)
  37. Shannon, C.E.: The mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948)
  38. Sharpee, T.O., Calhoun, A.J., Chalasani, S.H.: Information theory of adaptation in neurons, behavior, and mood. Curr. Opin. Neurobiol. 25, 47–53 (2014)
  39. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
  40. Tishby, N., Polani, D.: Information Theory of Decisions and Actions, pp. 601–636. Springer, New York (2011)
  41. Tkacik, G., Callan, C.G. Jr., Bialek, W.: Information flow and optimization in transcriptional control. Proc. Natl. Acad. Sci. USA 105, 12265–12270 (2008)
  42. Tkacik, G., Walczak, A.M.: Information transmission in genetic regulatory networks: a review. J. Phys.: Condens. Matter 23(15), 153102 (2011)
  43. Vergassola, M., Villermaux, E., Shraiman, B.: Infotaxis as a strategy for searching without gradients. Nature 445, 406–409 (2007)
  44. Whittle, P.: Optimization Over Time: Dynamic Programming and Stochastic Control. Wiley Series in Probability and Statistics. Wiley, New York (1982)
  45. Wyatt, J.: Exploration and inference in learning from reinforcement. Ph.D. thesis, University of Edinburgh (1997)
  46. van Erven, T., Harremoës, P.: Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 60(7), 3797–3820 (2014)


Acknowledgments

We are grateful to Boris Shraiman and Eric Siggia for illuminating discussions. MV acknowledges ICTP for hospitality and support. This work was supported by a Grant from the Simons Foundation (#340106, Massimo Vergassola).

Author information

Correspondence to Massimo Vergassola.

Appendices

Appendix: Fast Info-p Numerical Simulations

As the number of plays increases, the posterior distributions (3) peak more and more sharply around their mean. Resolving the distributions with appropriate accuracy requires a progressively finer discretization, which slows down the Info-p algorithm. To speed it up, we remark that the Lai-Robbins bound (2) suggests that a typical realization consists of long stretches of plays of the best arm, interspersed with occasional plays of suboptimal ones. It is then convenient to switch to a simulation scheme where we directly determine the minimal length of a stretch of consecutive plays of the best arm. This brings the advantage of avoiding the update of the posterior distribution and the calculation of its entropy at every time step. The minimal number of times the best arm will be played is obtained by assuming that every play in the stretch results in a loss. This provides the number of consecutive plays of the best arm that can surely be made before switching to subdominant arms.

Suppose then that the Info-p policy selects the best arm (the one with the largest sample mean) for play. Our binary search for the minimal stretch of plays of the best arm proceeds as follows. Set the stretch size to some initial guess and assume, for the reasons mentioned above, that losses occur throughout the entire stretch. If the Info-p policy chooses a suboptimal arm at the end of the stretch, the stretch size was too long: halve it until the best arm is chosen at the end of the stretch. Otherwise, if the Info-p policy still chooses the best arm at the end of the stretch, double the stretch size until a suboptimal arm is chosen. Finally, dissect the interval identified in this way dichotomically, as in standard binary search algorithms [34]. Note that the worst-case scenario of consecutive losses ensures that a lower bound on the stretch length is obtained, so the numerical technique is exact.

Once the length \(n_\mathrm{min}\) of consecutive plays is identified, we generate the actual number of wins during the stretch by sampling from the best arm \(n_\mathrm{min}\) times. The posterior distribution of the best arm is updated only once, at the end of the stretch, by using the explicit expression (3).
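
The scheme can be summarized by the following minimal Python sketch. The function choose_arm below is only a stand-in for the actual Info-p decision rule (which selects arms by maximizing information on the largest probability of success and is not reproduced here); it uses the asymptotic boundary \(\ln n \simeq n_iD(\hat{\pi }_i, \hat{\pi }_\mathrm{best})\) as a surrogate, so the surrogate rule, thresholds and initial counts are illustrative assumptions rather than the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_bern(q, p, eps=1e-12):
    """Kullback-Leibler divergence D(q, p) between Bernoulli distributions."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def choose_arm(wins, plays):
    """Stand-in for the Info-p choice (NOT the rule of the main text): play the
    empirically best arm unless the asymptotic boundary ln(n) > n_i D(pi_i, pi_best)
    is crossed by some suboptimal arm i."""
    pi = wins / plays
    best = int(np.argmax(pi))
    log_n = np.log(plays.sum())
    for i in range(len(pi)):
        if i != best and log_n > plays[i] * kl_bern(pi[i], pi[best]):
            return i
    return best

def minimal_stretch(wins, plays, best):
    """Smallest number k of consecutive all-loss plays of the best arm after which
    the policy switches to a suboptimal arm (doubling, then dichotomic dissection [34])."""
    def switches(k):
        n = plays.copy()
        n[best] += k                       # k extra plays, all assumed to be losses
        return choose_arm(wins, n) != best
    hi = 1
    while not switches(hi):                # doubling phase
        hi *= 2
    lo = hi // 2                           # switches(lo) is False, switches(hi) is True
    while hi - lo > 1:                     # bisection phase
        mid = (lo + hi) // 2
        lo, hi = (mid, hi) if not switches(mid) else (lo, mid)
    return hi

def fast_round(wins, plays, p):
    """One accelerated round: the guaranteed stretch of the best arm is played in one
    go and the posterior counts are updated only once, at the end of the stretch."""
    arm = choose_arm(wins, plays)
    best = int(np.argmax(wins / plays))
    if arm != best:                        # boundary already crossed: one ordinary play
        win = rng.random() < p[arm]
        wins[arm] += win
        plays[arm] += 1
        return
    k = minimal_stretch(wins, plays, best)
    wins[best] += rng.binomial(k, p[best]) # actual wins of the stretch, sampled at once
    plays[best] += k

# usage: two Bernoulli arms, a few initial plays of each (illustrative values)
p = np.array([0.7, 0.4])
wins, plays = np.array([1, 1]), np.array([2, 2])
for _ in range(200):
    fast_round(wins, plays, p)
```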

Proportional Betting

Kelly’s proportional betting [24] (also called Thompson sampling in the machine learning community) is a randomized policy that was recently shown to be asymptotically optimal [23]. At each step, the algorithm plays each arm with a probability equal to its posterior probability of being the best arm in the bandit.
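
As an illustration, here is a minimal sketch of one step of proportional betting for a Bernoulli bandit, assuming uniform Beta(1,1) priors (an assumption of this sketch): drawing one sample from each arm's posterior and playing the arm with the largest sample is equivalent to playing each arm with the probability that it is the best.

```python
import numpy as np

rng = np.random.default_rng(1)

def thompson_step(wins, losses, p):
    """One step of proportional betting (Thompson sampling) with Beta(1,1) priors."""
    theta = rng.beta(wins + 1.0, losses + 1.0)   # one sample from each arm's posterior
    arm = int(np.argmax(theta))                  # selected with the probability of being best
    reward = rng.random() < p[arm]
    wins[arm] += reward
    losses[arm] += 1 - reward
    return arm, reward

# usage with two arms of (unknown to the player) success probabilities 0.7 and 0.4
p = np.array([0.7, 0.4])
wins, losses = np.zeros(2), np.zeros(2)
for _ in range(1000):
    thompson_step(wins, losses, p)
```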

Our arguments for showing the optimality of Info-p (see main text) are easily adapted to confirm that proportional betting is indeed optimal. The probabilities for each arm to be the best are denoted \(q_1, q_2,\ldots \) [see (8) for the case of two arms]. For two arms, in the asymptotic limit \(n_1 \gg n_2\) and n large, we typically have \(\hat{\pi }_1 > \hat{\pi }_2\) and \(\frac{n_1}{n_2} \approx \frac{q_1}{q_2}\) with \(q_2 \simeq e^{-n_2D(\hat{\pi }_2, \hat{\pi }_1)}\) and \(q_1 \simeq 1\), which again (as for Info-p) leads to \(\ln n \simeq n_2D(\hat{\pi }_2, \hat{\pi }_1)\).

Since proportional betting is a randomized algorithm, the technique used for the Fast Info-p algorithm (see Appendix 1) does not carry over. In the asymptotic limit, the probability that one of the arms is the best is very close to unity. This probability, say \(q_1\), depends primarily on the number of plays of the subdominant arms and changes negligibly as the first arm is played. This observation suggests the following approximate algorithm: the best arm is played for a stretch whose size is randomly chosen from an exponential distribution of mean \(\frac{1}{1-q_1}\). Immediately after the stretch, one of the inferior arms is chosen with probabilities \(\frac{q_2}{1-q_1},\frac{q_3}{1-q_1},\dots \). The numerical method is analogous to the Gillespie algorithm used to simulate chemical kinetics [18].

The algorithm that we just described is exact under the assumption that \(q_1\) does not change during the stretch. In practice, we empirically found that results were very reliable once \(q_1>0.99\). Our proportional betting simulations are therefore run exactly for an initial phase that lasts until the condition \(q_1>0.99\) is met, and then switched to the approximate scheme described above.
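
A sketch of the approximate accelerated scheme follows, under the assumptions that the posteriors are Beta distributions (uniform prior) and that the probabilities \(q_i\) are estimated by Monte-Carlo sampling rather than from the explicit expression (8); the geometric stretch is the discrete analogue of the exponential waiting time mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)

def fast_betting_round(wins, losses, p, n_mc=10000):
    """Approximate accelerated round of proportional betting, valid once
    q_lead = Pr(leading arm is the best) is close to one: play the leading arm for
    a geometric stretch of mean 1/(1-q_lead) (discrete analogue of the exponential
    waiting time of the Gillespie algorithm [18]), then play one inferior arm with
    probabilities q_i/(1-q_lead)."""
    k = len(wins)
    # Monte-Carlo estimate of the probabilities q_i that arm i is the best,
    # assuming Beta posteriors with a uniform prior (assumption of this sketch)
    theta = rng.beta(wins[:, None] + 1.0, losses[:, None] + 1.0, size=(k, n_mc))
    q = np.bincount(np.argmax(theta, axis=0), minlength=k) / n_mc
    lead = int(np.argmax(q))
    # stretch of consecutive plays of the leading arm, sampled and resolved at once
    stretch = int(rng.geometric(max(1.0 - q[lead], 1.0 / n_mc)))
    s = rng.binomial(stretch, p[lead])
    wins[lead] += s
    losses[lead] += stretch - s
    # a single play of one of the inferior arms, chosen with the conditional weights
    others = np.array([i for i in range(k) if i != lead])
    w = q[others]
    w = w / w.sum() if w.sum() > 0 else np.full(len(others), 1.0 / len(others))
    arm = int(rng.choice(others, p=w))
    r = rng.random() < p[arm]
    wins[arm] += r
    losses[arm] += 1 - r

# usage (after an exact initial phase has brought q_lead above, e.g., 0.99)
p = np.array([0.7, 0.4, 0.3])
wins, losses = np.array([50.0, 10.0, 8.0]), np.array([20.0, 15.0, 12.0])
for _ in range(100):
    fast_betting_round(wins, losses, p)
```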

Theoretical Analysis of Information on the Identity of the Best Arm

The goal of this Section is to provide further details on the optimal acquisition of information about the identity of the best arm and on the related Info-id policy. As in the main text, we consider for simplicity the case of a two-armed bandit with probabilities of success \(p_1\) and \(p_2\) (\(p_1 > p_2\)). Generalizations to bandits with a finite number of arms are straightforward.

The estimated values of the probabilities of success in a given sample of plays are denoted by \(\pi _1\) and \(\pi _2\), respectively. Their posterior distributions are given by (3). The sample mean of \(\pi _i\) is indicated by \(\hat{\pi }_i\).

We denote by \(q_1 = \text {Pr}(\pi _1 > \pi _2)\) the estimated probability for the first arm to be the best. For a two-armed bandit, \(q_2 = 1-q_1\) is given by (8). The entropy of the unknown identity \(b_{\max }\) of the best arm is: \(H(b_{\text {max}}) = -q_1\ln q_1 -q_2\ln q_2\). In the asymptotic limit \(n_1, n_2 \gg 1\), when the arms have each been played many times, the sample means \(\hat{\pi }_i\) are typically close to their respective true values \(p_i\). Large deviation theory [13] states that the ith posterior and its cumulative distribution are both dominated by the exponential factor \(e^{-{n_i}D(\hat{\pi }_i, p)}\). The probability \(q_1\) is then close to unity and the entropy is well approximated by (9).
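
For concreteness, \(q_1\) and \(H(b_{\max })\) can be estimated numerically as in the following sketch, which assumes Beta posteriors arising from a uniform prior over the success probabilities (an assumption of the sketch; the precise form of the posteriors (3) is given in the main text).

```python
import numpy as np

rng = np.random.default_rng(2)

def identity_entropy(wins, losses, n_mc=100000):
    """Monte-Carlo estimate of q_1 = Pr(pi_1 > pi_2) and of H(b_max), assuming
    Beta posteriors (uniform prior) for the two success probabilities."""
    s1 = rng.beta(wins[0] + 1.0, losses[0] + 1.0, n_mc)
    s2 = rng.beta(wins[1] + 1.0, losses[1] + 1.0, n_mc)
    q1 = float(np.mean(s1 > s2))
    q = np.clip(np.array([q1, 1.0 - q1]), 1e-12, 1.0)
    return q1, float(-(q * np.log(q)).sum())   # H(b_max) = -q1 ln q1 - q2 ln q2

q1, H = identity_entropy(wins=[30, 10], losses=[10, 15])   # illustrative counts
```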

When \(n_1,n_2 \gg 1\), the integrals in (8) and (9) can be calculated by Laplace's method and receive three contributions:

(I) The region \(p\le \hat{\pi }_2\). There, we have \(\int _p^{1} P_2(q) dq\sim 1\) and \(P_1(p)\sim \exp \left[ -n_1D\left( \hat{\pi }_1,p\right) \right] \) by large deviations theory [13]. Integrating over p and using that the dominant contribution comes from \(p\simeq \hat{\pi }_2\), we obtain: \(\exp \left[ -n_1D\left( \hat{\pi }_1,\hat{\pi }_2\right) \right] \).

(II) The region of p's between \(\hat{\pi }_2\) and \(\hat{\pi }_1\). Its contribution is \(\int \exp \left[ -n_1D\left( \hat{\pi }_1,p\right) -n_2D\left( \hat{\pi }_2,p\right) \right] dp\) by large deviations theory. Equating to zero the derivative of \(n_1D(\hat{\pi }_1, p)+n_2D(\hat{\pi }_2, p)\) with respect to p and using the definition of the Kullback-Leibler divergence \(D(q,p) = q\ln \frac{q}{p} + (1-q)\ln \frac{1-q}{1-p}\), we obtain that the extremum is located at

    $$\begin{aligned} \pi _s = \frac{n_1\hat{\pi }_1 + n_2 \hat{\pi }_2}{n} , \end{aligned}$$
    (12)

    where \(n = n_1 + n_2\).

(III) Finally, the contribution from the rightmost region of p's is dominated by \(p\simeq \hat{\pi }_1\) and reads: \(\exp \left[ -n_2D\left( \hat{\pi }_2,\hat{\pi }_1\right) \right] \).

In summary, the asymptotic expression of the entropy is

$$\begin{aligned} H(b_{\max })&\sim A\exp \big [{-n_1D(\hat{\pi }_1, \hat{\pi }_2)}\big ] + B\exp \big [{-n_2D(\hat{\pi }_2, \hat{\pi }_1)}\big ] \nonumber \\&\quad +\, C\exp \big [{-n_1D(\hat{\pi }_1, \pi _s) -n_2D(\hat{\pi }_2, \pi _s)}\big ] , \end{aligned}$$
(13)

where A, B, C are subdominant prefactors.

The expression (13) still depends on \(n_1\) and \(n_2\), which are controlled by the policy of play. The fastest possible rate of acquisition of information is obtained by taking the extremum over \(n_1\) and \(n_2\) with the constraint \(n_1+n_2=n\). Suppose for now (as we shall demonstrate later) that the dominant contribution in (13) is the last one:

$$\begin{aligned} H(b_{\max }) \sim \exp \big [{-n_1D(\hat{\pi }_1, \pi _s) -n_2D(\hat{\pi }_2, \pi _s)}\big ] . \end{aligned}$$
(14)

The maximum possible rate of reduction of log-entropy is then calculated as follows. If we denote \(n_1/n = x\), \(n_2/n = 1-x\) and differentiate the exponent in (14) with respect to x, we obtain the relation

$$\begin{aligned} D(\hat{\pi }_1, \pi _s) = D(\hat{\pi }_2, \pi _s) , \end{aligned}$$
(15)

which defines the optimal value \(\pi _s=\pi _{s,o}\). Using the explicit expression of the Kullback-Leibler divergence D, we obtain:

$$\begin{aligned} \pi _{s,o} = \frac{1}{1 + e^{f(\hat{\pi }_1,\hat{\pi }_2)}}, \quad f(\hat{\pi }_1,\hat{\pi }_2) = \frac{H(\hat{\pi }_1) - H(\hat{\pi }_2)}{\hat{\pi }_1 - \hat{\pi }_2} . \end{aligned}$$
(16)

The optimal proportion of plays on the arms follows from (12):

$$\begin{aligned} x_o=\left( \frac{n_1}{n}\right) _o = \frac{\pi _{s,o} - \hat{\pi }_2}{\hat{\pi }_1 - \hat{\pi }_2} . \end{aligned}$$
(17)

The decay of the log-entropy averaged over the statistical realizations follows from (14) and (15):

$$\begin{aligned} \overline{\ln H(b_\mathrm{max})} =-nD(p_1, p_{s,o}) . \end{aligned}$$
(18)

where \(p_{s,o} = (1 + e^{f(p_1,p_2)})^{-1}\) and f is defined in (16). Note that the average of the log-entropy gives the typical behavior over the realizations, while the entropy itself or its higher powers are determined by large-deviation fluctuations. That leads to anomalous exponents as a function of the power considered. The appropriate statistic for the information gathered in a typical realization is \(e^{\langle \ln H \rangle }\).
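
The quantities \(\pi _{s,o}\), \(x_o\) and the optimal decay rate are straightforward to evaluate numerically; the following sketch implements Eqs. (15)-(18) directly (the probabilities 0.9 and 0.6 are just the values used in Fig. 6).

```python
import math

def D(q, p):
    """Bernoulli Kullback-Leibler divergence D(q, p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def Hb(p):
    """Binary entropy (in nats)."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def optimal_allocation(p1, p2):
    """pi_{s,o}, x_o and the optimal decay rate of the log-entropy, Eqs. (15)-(18)."""
    f = (Hb(p1) - Hb(p2)) / (p1 - p2)          # the function f of Eq. (16)
    pi_so = 1.0 / (1.0 + math.exp(f))          # Eq. (16)
    x_o = (pi_so - p2) / (p1 - p2)             # Eq. (17)
    return pi_so, x_o, D(p1, pi_so)            # the rate equals D(p2, pi_so) by (15)

pi_so, x_o, rate = optimal_allocation(0.9, 0.6)        # values used in Fig. 6
assert abs(D(0.9, pi_so) - D(0.6, pi_so)) < 1e-9       # balance condition (15)
```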

The final piece of our analysis is to check that the claimed maximum exponent \(-nD(\hat{\pi }_1, \pi _{s,o}) \) in (14) is indeed larger than the other two potential candidates \(-n x_oD(\hat{\pi }_1, \hat{\pi }_2)\) and \(-n\left( 1-x_o\right) D(\hat{\pi }_2, \hat{\pi }_1)\) in (13):

$$\begin{aligned} x_oD(\hat{\pi }_1, \hat{\pi }_2) \ge D(\hat{\pi }_1,\pi _{s,o}) ;\quad (1-x_o)D(\hat{\pi }_2, \hat{\pi }_1) \ge D(\hat{\pi }_2,\pi _{s,o}) . \end{aligned}$$
(19)

We concentrate on the first relation in (19); the second one follows by symmetry. The convexity of D in the second argument implies:

$$\begin{aligned} D(\hat{\pi }_1, \hat{\pi }_2) \ge D(\hat{\pi }_1, \pi _{s,o}) + (\hat{\pi }_2 - \pi _{s,o}) \times \frac{\pi _{s,o} - \hat{\pi }_1}{\pi _{s,o}(1-\pi _{s,o})} , \end{aligned}$$
(20)

where we used the explicit expression of the Kullback-Leibler divergence D to calculate the partial derivative at \(\pi _{s,o}\) with respect to the second argument. Multiplying both sides of (20) by \(x_o\) and using (17), it follows that

$$\begin{aligned} x_oD(\hat{\pi }_1, \hat{\pi }_2) \ge x_oD(\hat{\pi }_1, \pi _{s,o}) + \frac{\hat{\pi }_1 - \pi _{s,o}}{\hat{\pi }_1 - \hat{\pi }_2} \frac{(\hat{\pi }_2 - \pi _{s,o})^2}{\pi _{s,o}(1-\pi _{s,o})} \times \frac{D(\hat{\pi }_1,\pi _{s,o})}{D(\hat{\pi }_2,\pi _{s,o})} . \end{aligned}$$
(21)

The ratio \(\frac{D(\hat{\pi }_1,\pi _{s,o})}{D(\hat{\pi }_2,\pi _{s,o})} = 1\), due to (15), and \(\frac{\hat{\pi }_1 -\pi _{s,o}}{\hat{\pi }_1 - \hat{\pi }_2}=1-x_o\), due to (17). We conclude that:

$$\begin{aligned} x_oD(\hat{\pi }_1, \hat{\pi }_2) \ge D(\hat{\pi }_1, \pi _{s,o}) + (1-x_o)D(\hat{\pi }_1,\pi _{s,o})\bigg [ \frac{(\hat{\pi }_2 - \pi _{s,o})^2}{\pi _{s,o}(1-\pi _{s,o})D(\hat{\pi }_2,\pi _{s,o})} - 1\bigg ] . \end{aligned}$$
(22)

To prove (19), it only remains to show that

$$\begin{aligned} \frac{(\hat{\pi }_2 - \pi _{s,o})^2}{\pi _{s,o}(1-\pi _{s,o})}\ge D(\hat{\pi }_2,\pi _{s,o}) , \end{aligned}$$
(23)

which follows from the inequality between the Kullback-Leibler divergence and the \(\chi ^2\) distance of two distributions (see Eqs. 6, 7 in [46]). This completes the proof.

1.1 A Strategy that Maximizes Reduction in Entropy

Does the Info-id policy (which is greedy in its choice of the arm and one-step in time) attain the maximum rate (18)? The aim of this subsection is to give a positive answer to this question.

The Info-id policy selects the arm of the bandit which offers the largest expected reduction in log-entropy

$$\begin{aligned} \langle \Delta \ln H \rangle _i&= (1-\hat{\pi }_i) \times \Delta \ln H(b_{\max }|\text {0 observed})\nonumber \\&\quad +\, \hat{\pi }_i \times \Delta \ln H(b_{\max }|\text {1 observed}) , \end{aligned}$$
(24)

where 0 / 1 correspond to loss/win and \(\langle \bullet \rangle \) denotes the average with respect to the posterior probability distribution. To calculate \(\langle \Delta \ln H \rangle _i\), we use the transformations:

$$\begin{aligned} \text {0 is observed:}\left\{ \begin{aligned} n_i&\rightarrow n_i + 1,\\ \hat{\pi }_i&\rightarrow \hat{\pi }_i - \frac{\hat{\pi }_i}{n_i},\\ \pi _s&\rightarrow \pi _s - \frac{\pi _s}{n} ; \end{aligned} \right. \quad \text {1 is observed:} \left\{ \begin{aligned} n_i&\rightarrow n_i + 1,\\ \hat{\pi }_i&\rightarrow \hat{\pi }_i + \frac{1- \hat{\pi }_i}{n_i},\\ \pi _s&\rightarrow \pi _s + \frac{1-\pi _s}{n} . \end{aligned} \right. \end{aligned}$$
(25)

Let us calculate the expected variation (24) upon playing the first arm, \(i=1\):

$$\begin{aligned} \ln H(b_{\max }|\text {0 observed}) \simeq -(n_1 + 1)D\bigg (\hat{\pi }_1 - \frac{\hat{\pi }_1}{n_1}, \pi _s - \frac{\pi _s}{n}\bigg ) \nonumber \\ -\, n_2D\bigg (\hat{\pi }_2,\pi _s - \frac{\pi _s}{n}\bigg ) \end{aligned}$$
(26)
$$\begin{aligned} \simeq -(n_1+1)\bigg [D(\hat{\pi }_1 , \pi _s) - \frac{\hat{\pi }_1}{n_1}\ln \bigg ( \frac{\hat{\pi }_1}{1- \hat{\pi }_1} \frac{1-\pi _s}{\pi _s}\bigg ) - \frac{\pi _s}{n} \frac{\pi _s - \hat{\pi }_1}{\pi _s(1-\pi _s)}\bigg ] \nonumber \\ - \,n_2\bigg [D(\hat{\pi }_2,\pi _s) - \frac{\pi _s}{n}\frac{\pi _s - \hat{\pi }_2}{\pi _s(1-\pi _s)}\bigg ] \end{aligned}$$
(27)
$$\begin{aligned} \simeq -(n_1 + 1)D(\hat{\pi }_1 , \pi _s) - n_2D(\hat{\pi }_2,\pi _s) + \frac{n_1}{n}\frac{\pi _s - \hat{\pi }_1}{(1-\pi _s)} + \frac{n_2}{n}\frac{\pi _s - \hat{\pi }_2}{(1-\pi _s)} \nonumber \\ +\, \hat{\pi }_1\ln \bigg ( \frac{\hat{\pi }_1}{1- \hat{\pi }_1} \frac{1-\pi _s}{\pi _s}\bigg ). \end{aligned}$$
(28)

The first asymptotic equality (26) follows from (14) and (25). The second line (27) is obtained by expanding \(D(p,q)\) to first order in a Taylor series in both arguments, which is legitimate as \(n_1,n_2\gg 1\). Finally, for the third line (28) we ignore subdominant terms o(1). Notice that the sum of the third and the fourth terms in (28) vanishes due to (12).

We conclude that

$$\begin{aligned} \Delta \ln H(b_{\max }|\text {0 observed})\sim -D(\hat{\pi }_1 , \pi _s) + \hat{\pi }_1\ln \bigg ( \frac{\hat{\pi }_1}{1- \hat{\pi }_1} \frac{1-\pi _s}{\pi _s}\bigg ) . \end{aligned}$$
(29)

Similarly to (29), when the outcome of the play on the first arm is a win:

$$\begin{aligned}&\ln H(b_{\max }|\text {1 observed}) \sim -(n_1 + 1)D\bigg (\hat{\pi }_1 + \frac{1-\hat{\pi }_1}{n_1}, \pi _s + \frac{1-\pi _s}{n}\bigg ) - n_2D\bigg (\hat{\pi }_2,\pi _s + \frac{1-\pi _s}{n}\bigg ) \end{aligned}$$
(30)
$$\begin{aligned}&\simeq -(n_1+1)D(\hat{\pi }_1 , \pi _s) - n_2D(\hat{\pi }_2,\pi _s) - \left( 1-\hat{\pi }_1\right) \ln \bigg ( \frac{\hat{\pi }_1}{1- \hat{\pi }_1} \frac{1-\pi _s}{\pi _s}\bigg ) , \end{aligned}$$
(31)

where a cancellation similar to the one in (28) simplified the final expression (31). We are thereby left with

$$\begin{aligned} \Delta \ln H(b_{\max }|\text {1 observed})\approx -D(\hat{\pi }_1 , \pi _s) - (1-\hat{\pi }_1)\ln \bigg ( \frac{\hat{\pi }_1}{1- \hat{\pi }_1} \frac{1-\pi _s}{\pi _s}\bigg ) . \end{aligned}$$
(32)

Finally, combining (29) and (32), we obtain that

$$\begin{aligned} \langle \Delta \ln H \rangle _1 = -D(\hat{\pi }_1 , \pi _s) . \end{aligned}$$
(33)

By symmetry, \(\langle \Delta \ln H \rangle _2 = -D(\hat{\pi }_2 , \pi _s)\). We conclude that the decision boundary of Info-id matches the condition (15) and the policy indeed gathers information on the identity of the best arm at the maximum possible rate.
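
In its asymptotic form, the Info-id rule therefore reduces to comparing the divergences \(D(\hat{\pi }_i, \pi _s)\), as in the following sketch (the bookkeeping via win/play counts and the clipping constant are illustrative choices of the sketch, not part of the original algorithm).

```python
import math

def D(q, p, eps=1e-9):
    """Bernoulli Kullback-Leibler divergence with clipping for numerical safety."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def info_id_choice(wins, plays):
    """Asymptotic Info-id rule: play the arm with the largest expected reduction of
    ln H(b_max), i.e. the largest D(pi_i, pi_s), with pi_s the pooled sample mean (12)."""
    pi = [w / n for w, n in zip(wins, plays)]
    pi_s = sum(wins) / sum(plays)              # Eq. (12)
    gains = [D(pi_i, pi_s) for pi_i in pi]     # equals -<Delta ln H>_i, Eq. (33)
    return max(range(len(plays)), key=gains.__getitem__)

arm = info_id_choice([45, 12], [60, 30])       # index of the arm to play next
```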

1.2 Why Does Info-id Involve the Variation of Log-Entropy Rather than Entropy?

We have repeatedly stressed that Info-id is based on the expected variation of the log-entropy, as in (24), and not on the expected variation of the entropy. The reason is that the expected variation of the dominant term in (13) happens to vanish for the entropy. The choice of the arm to play is then based on subdominant terms, which yields a suboptimal rate as compared to (18). The purpose of this technical subsection is to present in detail the explicit calculation of those variations.

Let us consider the expected variation of the entropy upon playing the ith arm:

$$\begin{aligned}&\langle \Delta H \rangle _i = (1-\hat{\pi }_i) \times \Delta H(b_{\max }|\text {0 observed})+ \hat{\pi }_i \times \Delta H(b_{\max }|\text {1 observed}) , \end{aligned}$$
(34)

and consider first the third term of (13), given in (14), which is the one that yields the fastest possible decay (18). Using again the transformations (25), its expected variation upon playing the first arm is

$$\begin{aligned}&\langle \Delta \exp \big [{-n_1D(\hat{\pi }_1, \pi _s) -n_2D(\hat{\pi }_2, \pi _s)}\big ] \rangle _1 \nonumber \\&\quad = (1-\hat{\pi }_1)\exp \Big [-(n_1+1)D\Big (\hat{\pi }_1 - \frac{\hat{\pi }_1}{n_1}, \pi _s - \frac{\pi _s}{n}\Big ) - n_2D\Big (\hat{\pi }_2, \pi _s - \frac{\pi _s}{n}\Big )\Big ] \nonumber \\&\qquad +\, \hat{\pi }_1\exp \Big [-(n_1+1)D\Big (\hat{\pi }_1 + \frac{1-\hat{\pi }_1}{n_1}, \pi _s + \frac{1-\pi _s}{n} \Big ) - n_2D\Big (\hat{\pi }_2, \pi _s + \frac{1-\pi _s}{n}\Big )\Big ] \nonumber \\&\qquad -\, \exp \left[ -n_1D\left( \hat{\pi }_1, \pi _s\right) - n_2D\left( \hat{\pi }_2, \pi _s\right) \right] . \end{aligned}$$
(35)

Note that the exponents in the first two terms on the right-hand side of (35) are related to the objects that we calculated in the previous subsection. By using (29) and (32), it follows from (35) that

$$\begin{aligned}&\left\langle \Delta \exp \big [{-n_1D(\hat{\pi }_1, \pi _s) -n_2D(\hat{\pi }_2, \pi _s)}\big ] \right\rangle _1 \propto \nonumber \\&\left\{ (1-\hat{\pi }_1)\exp \bigg [-D(\hat{\pi }_1 , \pi _s) + \hat{\pi }_1\ln \bigg ( \frac{\hat{\pi }_1}{1- \hat{\pi }_1} \frac{1-\pi _s}{\pi _s}\bigg )\bigg ]\right. \nonumber \\&\left. + \,\hat{\pi }_1\exp \bigg [ -D(\hat{\pi }_1 , \pi _s) - (1-\hat{\pi }_1)\ln \bigg ( \frac{\hat{\pi }_1}{1- \hat{\pi }_1} \frac{1-\pi _s}{\pi _s}\bigg ) \bigg ] - 1\right\} . \end{aligned}$$
(36)

If the two terms (29) and (32) in the exponents of (36) were small, then one could Taylor expand the exponentials and conclude that the variations of the entropy and of the log-entropy are proportional. However, that is not the case because (29) and (32) are O(1). By inserting the explicit form of the Kullback-Leibler divergence \(D(p,q) = p\ln \frac{p}{q} + (1-p) \ln \frac{1-p}{1-q}\), the first and second terms on the right-hand side of (36) actually reduce to \(1-\pi _s\) and \(\pi _s\), respectively. Therefore, the expected variation of the dominant term of the entropy turns out to vanish.

To determine the policy that results from maximizing the expected decrease of the entropy, we then need to consider the subdominant terms in (13). Let us start with the first one:

$$\begin{aligned} \left\langle \Delta \exp \left[ -n_1D\left( \hat{\pi }_1, \hat{\pi }_2\right) \right] \right\rangle _1&= (1-\hat{\pi }_1)\exp \left[ -(n_1+1)D\left( \hat{\pi }_1 - \frac{\hat{\pi }_1}{n_1}, \hat{\pi }_2 \right) \right] \nonumber \\&\quad +\, \hat{\pi }_1\exp \left[ -(n_1+1)D\left( \hat{\pi }_1 + \frac{1-\hat{\pi }_1}{n_1}, \hat{\pi }_2 \right) \right] \nonumber \\&\quad - \, \exp \big [-n_1D(\hat{\pi }_1, \hat{\pi }_2)\big ] . \end{aligned}$$
(37)

By Taylor expanding the Kullback-Leibler divergence as we have done previously, one can check that the right-hand side in (37) is proportional to the right-hand side in (36) and the expected variation for this term vanishes as well.

The only non-vanishing contribution upon playing the first arm stems from the second term in (13):

$$\begin{aligned} \left\langle \Delta \exp \left[ -n_2D(\hat{\pi }_2, \hat{\pi }_1)\right] \right\rangle _1&= (1-\hat{\pi }_1)\exp \left[ -n_2D\left( \hat{\pi }_2, \hat{\pi }_1 - \frac{\hat{\pi }_1}{n_1} \right) \right] \nonumber \\&\quad +\, \hat{\pi }_1\exp \left[ -n_2D\left( \hat{\pi }_2, \hat{\pi }_1 + \frac{1-\hat{\pi }_1}{n_1}\right) \right] \nonumber \\&\quad - \, \exp \big [-n_2D(\hat{\pi }_2, \hat{\pi }_1)\big ] . \end{aligned}$$
(38)

Expanding again to first order in a Taylor series, we get

$$\begin{aligned} \big \langle \Delta \exp \big [-n_2D(\hat{\pi }_2, \hat{\pi }_1)\big ] \big \rangle _1&= \exp \big [-n_2D(\hat{\pi }_2, \hat{\pi }_1)\big ] \bigg \{(1-\hat{\pi }_1)\exp \bigg [ \frac{\hat{\pi }_1n_2}{n_1} \frac{\hat{\pi }_1 - \hat{\pi }_2}{\hat{\pi }_1(1-\hat{\pi }_1)} \bigg ] \end{aligned}$$
(39)
$$\begin{aligned}&\quad + \, \hat{\pi }_1 \exp \bigg [ -\frac{(1-\hat{\pi }_1)n_2}{n_1} \frac{\hat{\pi }_1 - \hat{\pi }_2}{\hat{\pi }_1(1-\hat{\pi }_1)} \bigg ] - 1 \bigg \} . \end{aligned}$$
(40)

The terms in the curly braces depend on \(n_1\) and \(n_2\) only through their ratio and tend to a non-vanishing constant in the asymptotic limit. The asymptotic behavior is therefore dominated by the exponential decay in \(n_2\). The expected variation upon playing the second arm of the bandit is obtained by interchanging indices. We conclude that

$$\begin{aligned} \langle \Delta H \rangle _1 \sim \exp \big [-n_2D(\hat{\pi }_2, \hat{\pi }_1)\big ] ;\quad \langle \Delta H \rangle _2 \sim \exp \big [-n_1D(\hat{\pi }_1, \hat{\pi }_2)\big ] . \end{aligned}$$
(41)

It follows from (41) that the behavior of the policy based on the maximization of the expected reduction of entropy depends on the balance between subdominant terms and that the decision boundary satisfies the relation

$$\begin{aligned} n_1D(\hat{\pi }_1, \hat{\pi }_2) = n_2D(\hat{\pi }_2, \hat{\pi }_1)\quad \Rightarrow \quad \tilde{x}\equiv \frac{n_1}{n}=\frac{D(\hat{\pi }_2, \hat{\pi }_1)}{D(\hat{\pi }_1, \hat{\pi }_2)+D(\hat{\pi }_2, \hat{\pi }_1) } . \end{aligned}$$
(42)

The relations (42) should be contrasted with (15) and (17).

It remains to show that the decay of the average log-entropy generated by the policy (42) is still given by the third term in (13) with the exponent evaluated at \(x=\tilde{x}\) [and not \(x_o\) as for the optimal policy (15)]. The inequality to be proved is:

$$\begin{aligned} \tilde{x}D(\hat{\pi }_1, \tilde{\pi }_s) + (1-\tilde{x})D(\hat{\pi }_2, \tilde{\pi }_s) \le \tilde{x}D(\hat{\pi }_1, \hat{\pi }_2)= (1-\tilde{x})D\left( \hat{\pi }_2, \hat{\pi }_1\right) , \end{aligned}$$
(43)

with \(\tilde{\pi }_s=\tilde{x}\hat{\pi }_1+\left( 1-\tilde{x}\right) \hat{\pi }_2\). The convexity in the second argument of the Kullback-Leibler divergence gives

$$\begin{aligned} \tilde{x}D(\hat{\pi }_1, \tilde{\pi }_s)&\le \tilde{x}(1-\tilde{x})D(\hat{\pi }_1, \hat{\pi }_2), \end{aligned}$$
(44)
$$\begin{aligned} (1-\tilde{x})D(\hat{\pi }_2,\tilde{\pi }_s)&\le (1-\tilde{x})\tilde{x}D(\hat{\pi }_2, \hat{\pi }_1) . \end{aligned}$$
(45)

Summing up the two inequalities above and using (42), the required relation is obtained.

In summary, the policy that maximizes the reduction of entropy (rather than the reduction of log-entropy) happens to be affected by the cancellation discussed above and yields the decay law

$$\begin{aligned} \overline{\ln H\left( b_{\max }\right) }&= -\frac{D\left( p_2,p_1\right) D(p_1,\tilde{p}_s)+D(p_1,p_2)D(p_2,\tilde{p}_s)}{D(p_1, p_2)+D(p_2,p_1) } ;\nonumber \\ \tilde{p}_s&= \frac{p_1D\left( p_2,p_1\right) +p_2D(p_1,p_2)}{D(p_1, p_2)+D(p_2,p_1)} , \end{aligned}$$
(46)

which is slower than the optimal decay (18), derived by extremizing over x to obtain the optimal value \(x_o\). In Fig. 6, we confirm the theoretical predictions and compare the regret and the entropy for the two policies.
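
The two decay rates are easy to compare numerically. The sketch below evaluates the optimal rate (18) and the rate attained by the entropy-greedy policy, with the allocation \(x=n_1/n\) fixed by the balance condition \(n_1D(\hat{\pi }_1,\hat{\pi }_2)=n_2D(\hat{\pi }_2,\hat{\pi }_1)\) of (42); the probabilities 0.9 and 0.6 are those of Fig. 6, and the final assertion simply restates that the entropy-greedy decay cannot exceed the optimum.

```python
import math

def D(q, p):
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def Hb(p):
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def decay_rates(p1, p2):
    """Asymptotic decay rates of ln H(b_max): the optimal (Info-id) rate of Eq. (18)
    and the rate of the entropy-greedy policy at the allocation set by Eq. (42)."""
    # optimal (Info-id) rate, Eqs. (15)-(18)
    f = (Hb(p1) - Hb(p2)) / (p1 - p2)
    p_so = 1.0 / (1.0 + math.exp(f))
    rate_opt = D(p1, p_so)
    # entropy-greedy rate: x solves x*D(p1,p2) = (1-x)*D(p2,p1)
    x = D(p2, p1) / (D(p1, p2) + D(p2, p1))
    p_s = x * p1 + (1 - x) * p2
    rate_ent = x * D(p1, p_s) + (1 - x) * D(p2, p_s)
    return rate_opt, rate_ent

r_opt, r_ent = decay_rates(0.9, 0.6)           # the parameters of Fig. 6
assert r_ent <= r_opt + 1e-12                  # the entropy-greedy decay is slower
```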

Fig. 6

The average log-entropy \(\overline{\ln H(b_{\max })}\) and the average regret \(R = \overline{n}_2(p_1-p_2)\) for the two policies that greedily maximize the expected reduction of the entropy (\(-\langle \Delta H(b_{\max }) \rangle \)) and the expected reduction of the log-entropy (\(-\langle \Delta \ln H(b_{\max }) \rangle \)). Numerical results from simulations are shown by green circles and red squares, respectively. Black lines with circular and square symbols are the corresponding theoretical predictions (46) and (18), respectively. The values of the two probabilities of success are \(p_1 = 0.9\) and \(p_2 = 0.6\), differing from the ones used in the other figures in order to enhance the difference in entropies of the two strategies. The entropy for the Info-id strategy decays faster, although the difference is small. The regret of Info-id is bigger, as expected from the proportionality between average regret and rate of decay of \(\ln H(b_{\max })\) discussed in the main text (Color figure online)

A Mixed Strategy of Play

A simple strategy worth analyzing is to first gather information about the best arm and then greedily play the arm deemed best for N successive trials (where N is assumed to be fixed and known to the player). Denote the number of plays during the first stage by n and the number of plays of the i-th bandit arm by \(n_i\), where \(\sum _i n_i = n\). We consider for simplicity a two-armed Bernoulli bandit with probabilities of success ordered as \(p_1 > p_2\).

The expected regret has two contributions. The first one arises when the worst arm is incorrectly deemed the best and played for N trials, whilst the second arises when the estimate of the best arm is correct. The corresponding regrets are \((N + n_2)(p_1 - p_2)\) and \(n_2(p_1 - p_2)\), respectively. The first event occurs with the probability that the second arm is incorrectly deemed to be the best, \(q_2 \sim e^{-n_2D(p_2,p_1)}\), whilst the second occurs with probability \(q_1 \sim 1\). The final expression for the expected regret R is

$$\begin{aligned} R = \left( (N+n_2)q_2 + n_2q_1\right) (p_1 - p_2) = (Nq_2 + n_2)(p_1 - p_2) . \end{aligned}$$
(47)

The minimum of the above expression with respect to \(n_2\) is achieved for \(n_2 \sim \frac{\ln N}{D(p_2,p_1)}\), which coincides with the Lai-Robbins expression (2). The switching point is \(\sim \log N\) as expected, but the proportionality factor needed to achieve optimality clearly requires knowledge of the success probabilities \(p_1\) and \(p_2\), which are of course unknown and can only be estimated from the trials. That gives an intuition as to the role of the estimation of the success probabilities, which are precisely the quantities considered by the Info-p strategy.
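
A small sketch of this computation, with arbitrary illustrative values of N, \(p_1\) and \(p_2\) (assumptions of the example, not taken from the figures): the stationary point of (47) is \(n_2=\ln (N D(p_2,p_1))/D(p_2,p_1)\), which approaches \(\ln N/D(p_2,p_1)\) for large N.

```python
import math

def D(q, p):
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def mixed_strategy_regret(N, p1, p2):
    """Expected regret (47) of 'explore then commit' as a function of the number n2
    of exploratory plays of the inferior arm, together with its minimizer."""
    d = D(p2, p1)
    regret = lambda n2: (N * math.exp(-n2 * d) + n2) * (p1 - p2)
    n2_star = math.log(N * d) / d              # stationary point of (47), valid for N*d > 1
    return regret(n2_star), n2_star, math.log(N) / d   # compare with the Lai-Robbins scale

R_min, n2_star, n2_LR = mixed_strategy_regret(N=10**6, p1=0.7, p2=0.4)
```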

Quantifying the Value of Information

The value of information is the reduction in the average regret obtained when some a priori information is available. In this Section, we provide details on the theoretical argument sketched in the main text. The initial entropy of the identity of the best arm is assumed to be \(H(b_{\max }) = H_0=\frac{\ln 2}{2^m}\).

The Info-id “pre-training” required to reach \(H(b_{\text {max}})=H_0=\ln 2/2^m\) lasts for \(n^{(pt)}\propto m\) steps [see (11)]. Since (14) implies that \(\ln H(b_{\max })=-nD(\hat{\pi }_1, \pi _{s,o})\) , the number of steps \(n^{(pt)}\) satisfies \(n^{(pt)}\simeq m\ln 2/D(\hat{\pi }_1, \pi _{s,o})\) with \(\pi _{s,o}\) given by (16). During the pre-training, the two arms are played \(n_1^{(pt)}\) and \(n_2^{(pt)}\) times. Their respective proportions are controlled by the expression (17). In particular, \(n_2^{(pt)}=n^{(pt)}(\hat{\pi }_1-\pi _{s,o})/(\hat{\pi }_1-\hat{\pi }_2)\). Note that \(n_2^{(pt)}\) scales linearly with \(n^{(pt)}\) and is therefore much bigger than for typical Info-p statistics, where it would scale logarithmically with \(n^{(pt)}\). In other words, since \(n_1\) and \(n_2\) are both \(\propto n\), the typical prior resulting from the pre-training is equivalent to the unlikely (for Info-p) situation of a comparable number of plays \(n_1^{(pt)}\) and \(n_2^{(pt)}\) for the two arms.

Since the suboptimal arm has been vastly overplayed in comparison with the typical Info-p statistics, once the algorithm switches to Info-p after the pre-training, a long stretch of plays of the best arm will ensue. The length \(\ell \) of the stretch is estimated by calculating the time taken to reach the Info-p decision boundary (7), i.e., \(\ln \ell \sim n_2^{(pt)}D(\hat{\pi }_2,\hat{\pi }_1)\). In the absence of any pre-training, a stretch of length \(\ell \) would lead to an average regret \(R=(p_1-p_2)\ln \ell /D(p_2,p_1)\) [see the Lai-Robbins bound (2) in the main text]. We conclude that the expected difference in regret \(\Delta R\) between the case with prior information and the case without is given by

$$\begin{aligned} \Delta R \simeq -\left( p_1-p_2\right) \overline{n_2^{(pt)}}\simeq -\ln 2\frac{p_1 -p_s}{D(p_1,p_s)}m , \end{aligned}$$
(48)

where we have used the expressions of \(n_2^{(pt)}\) and \(n^{(pt)}\) derived above, \(p_s=\left( 1+e^{f(p_1,p_2)}\right) ^{-1}\) and the function f is defined by (16). The agreement with numerical simulations is shown in Fig. 5. Small deviations are ascribed to finite-size effects, e.g., the Info-p decision boundary that we used to determine the length \(\ell \) of the initial stretch is only asymptotically valid, as evidenced in Fig. 2 (upper left panel).
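
For reference, a direct evaluation of (48), with \(p_s\) computed from the true success probabilities as in the text; the probabilities and the value of m passed to the function are illustrative choices.

```python
import math

def D(q, p):
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def Hb(p):
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def delta_regret(m, p1, p2):
    """Expected regret difference (48) for m bits of prior information on b_max."""
    f = (Hb(p1) - Hb(p2)) / (p1 - p2)
    p_s = 1.0 / (1.0 + math.exp(f))            # as in Eq. (16), at the true probabilities
    return -math.log(2) * (p1 - p_s) / D(p1, p_s) * m

dR = delta_regret(m=10, p1=0.7, p2=0.4)        # negative: the regret is reduced, linearly in m
```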


Cite this article

Reddy, G., Celani, A. & Vergassola, M. Infomax Strategies for an Optimal Balance Between Exploration and Exploitation. J Stat Phys 163, 1454–1476 (2016). https://doi.org/10.1007/s10955-016-1521-0
