Abstract
We introduce a learning framework that combines elements of the well-known PAC and mistake-bound models. The KWIK ("knows what it knows") framework is designed for learning settings in which active exploration can affect the training examples the learner is exposed to, as in reinforcement-learning and active-learning problems. We catalog several KWIK-learnable classes as well as open problems, and demonstrate their application in experience-efficient reinforcement learning.
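The defining feature of the framework is that a KWIK learner must either make an accurate prediction or explicitly admit uncertainty, receiving a label only in the latter case, with a bound on how often it can plead ignorance. The sketch below illustrates this protocol with a simple memorization learner on a finite domain; the class and function names are illustrative assumptions, not code from the paper.

```python
class MemorizationKWIK:
    """Toy KWIK learner: memorizes a deterministic target on a finite domain.

    It makes at most |domain| "I don't know" responses, since each one
    reveals the label for a previously unseen input.
    """

    def __init__(self):
        self.memory = {}  # input -> observed label

    def predict(self, x):
        # Return the known label, or None to signal "I don't know".
        return self.memory.get(x)

    def observe(self, x, y):
        # Label is revealed only after the learner admitted uncertainty on x.
        self.memory[x] = y


def run_protocol(learner, target, inputs):
    """Run the KWIK protocol and count "I don't know" responses."""
    unknowns = 0
    for x in inputs:
        y_hat = learner.predict(x)
        if y_hat is None:
            unknowns += 1
            learner.observe(x, target(x))   # label seen only on uncertainty
        else:
            assert y_hat == target(x)       # predictions must be accurate
    return unknowns
```

On a stream that revisits inputs, the learner says "I don't know" exactly once per distinct input and predicts correctly thereafter, matching the KWIK requirement that accuracy holds on every confident prediction.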
Additional information
Editor: Roni Khardon.
Part of the work was done while L. Li, T. Walsh, and A. Strehl were at Rutgers University.
Li, L., Littman, M.L., Walsh, T.J. et al. Knows what it knows: a framework for self-aware learning. Mach Learn 82, 399–443 (2011). https://doi.org/10.1007/s10994-010-5225-4