Machine Learning, Volume 32, Issue 2, pp 151–178

# Tracking the Best Expert

• Mark Herbster
• Manfred K. Warmuth

## Abstract

We generalize the recent relative loss bounds for on-line algorithms where the additional loss of the algorithm on the whole sequence of examples over the loss of the best expert is bounded. The generalization allows the sequence to be partitioned into segments, and the goal is to bound the additional loss of the algorithm over the sum of the losses of the best experts for each segment. This is to model situations in which the examples change and different experts are best for certain segments of the sequence of examples. In the single segment case, the additional loss is proportional to log n, where n is the number of experts and the constant of proportionality depends on the loss function. Our algorithms do not produce the best partition; however, the loss bound shows that our predictions are close to those of the best partition. When the number of segments is k+1 and the sequence is of length ℓ, we can bound the additional loss of our algorithm over the best partition by $$O\left( k \log n + k \log(\ell/k) \right)$$. For the case when the loss per trial is bounded by one, we obtain an algorithm whose additional loss over the loss of the best partition is independent of the length of the sequence. The additional loss becomes $$O\left( k \log n + k \log(L/k) \right)$$, where L is the loss of the best partition with k+1 segments. Our algorithms for tracking the predictions of the best expert are simple adaptations of Vovk's original algorithm for the single best expert case. As in the original algorithms, we keep one weight per expert, and spend O(1) time per weight in each trial.
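The "one weight per expert, O(1) time per weight" scheme can be illustrated with a minimal sketch. The abstract does not spell out the update, so everything here is an assumption made for illustration: a multiplicative loss update followed by a fixed-share style redistribution step, square loss, a plain weighted-average prediction (Vovk's algorithm uses a loss-specific prediction function instead), and hypothetical parameter names `eta` (learning rate) and `alpha` (share rate) with arbitrary values.

```python
import math


def fixed_share_update(weights, losses, eta, alpha):
    """One trial: exponential loss update, then a sharing step (illustrative)."""
    n = len(weights)
    # Loss update: scale each weight by exp(-eta * loss of that expert).
    interim = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    if n == 1 or alpha == 0.0:
        return interim
    # Share update: each expert gives away a fraction alpha of its weight,
    # split equally among the other n - 1 experts. Computing the pooled
    # total once keeps the cost at O(1) per weight.
    pool = alpha * sum(interim)
    return [(1.0 - alpha) * w + (pool - alpha * w) / (n - 1) for w in interim]


def predict(weights, expert_predictions):
    """Weighted average of the experts' predictions (a simplification)."""
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, expert_predictions)) / total


def run(expert_predictions, outcomes, eta=2.0, alpha=0.05):
    """Run the sketch over a sequence; returns (total square loss, final weights)."""
    n = len(expert_predictions[0])
    weights = [1.0 / n] * n
    total_loss = 0.0
    for preds, y in zip(expert_predictions, outcomes):
        y_hat = predict(weights, preds)
        total_loss += (y_hat - y) ** 2
        losses = [(x - y) ** 2 for x in preds]
        weights = fixed_share_update(weights, losses, eta, alpha)
    return total_loss, weights


if __name__ == "__main__":
    # Two experts: one always predicts 0, the other always predicts 1.
    # The outcome is 0 for 50 trials and then 1 for 50 trials, so the
    # best expert changes once (two segments).
    preds = [[0.0, 1.0]] * 100
    outcomes = [0.0] * 50 + [1.0] * 50
    shared_loss, shared_weights = run(preds, outcomes, alpha=0.05)
    static_loss, _ = run(preds, outcomes, alpha=0.0)
    print("loss with sharing:", shared_loss)
    print("loss without sharing:", static_loss)
```

With `alpha = 0` the sketch reduces to a static multiplicative-weights scheme, which needs many trials to recover after the best expert changes. A small positive `alpha` keeps every weight from collapsing, so the weight of the newly best expert can grow back quickly.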

Keywords: on-line learning, amortized analysis, multiplicative updates, shifting experts

## References

1. Auer, P. & Warmuth, M. K. (1998). Tracking the best disjunction. Machine Learning, this issue.
2. Blum, A. & Burch, C. (1997). On-line learning and the metrical task system. In Proceedings of the 10th Annual Workshop on Computational Learning Theory. ACM Press, New York, NY.
3. Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., & Warmuth, M. K. (1997). How to use expert advice. Journal of the ACM, 44(3), 427-485.
4. Cover, T. & Thomas, J. (1991). Elements of Information Theory. Wiley.
5. Feder, M., Merhav, N., & Gutman, M. (1992). Universal prediction of individual sequences. IEEE Transactions on Information Theory, 38, 1258-1270.
6. Freund, Y. & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.
7. Freund, Y., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1997). Using and combining predictors that specialize. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing.
8. Haussler, D., Kivinen, J., & Warmuth, M. K. (1998). Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory. To appear.
9. Helmbold, D. P., Kivinen, J., & Warmuth, M. K. (1995). Worst-case loss bounds for sigmoided linear neurons. In Proceedings of the 1995 Neural Information Processing Conference, (pp. 309-315). MIT Press, Cambridge, MA.
10. Helmbold, D. P., Long, D. D. E., & Sherrod, B. (1996). A dynamic disk spin-down technique for mobile computing. In Proceedings of the Second Annual ACM International Conference on Mobile Computing and Networking. ACM/IEEE.
11. Herbster, M. (1997). Tracking the best expert II. Unpublished Manuscript.
12. Herbster, M. & Warmuth, M. K. (1995). Tracking the best expert. In Proceedings of the 12th International Conference on Machine Learning, (pp. 286-294). Morgan Kaufmann.
14. Littlestone, N. (1988). Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285-318.
15. Littlestone, N. (1989). Mistake Bounds and Logarithmic Linear-threshold Learning Algorithms. PhD thesis, Technical Report UCSC-CRL-89-11, University of California Santa Cruz.
16. Littlestone, N. & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108(2), 212-261.
17. Singer, Y. (1997). Towards realistic and competitive portfolio selection algorithms. Unpublished Manuscript.
18. Vovk, V. (1998). A game of prediction with expert advice. Journal of Computer and System Sciences. To appear.
19. Vovk, V. (1997). Derandomizing stochastic prediction strategies. In Proceedings of the 10th Annual Workshop on Computational Learning Theory. ACM Press, New York, NY.
20. Warmuth, M. K. (1997). Predicting with the dot-product in the experts framework. Unpublished Manuscript.