Learning to control dynamic systems with automatic quantization
Reinforcement learning is often used to learn to control dynamic systems, which are described by quantitative state variables. Most previous work that learns qualitative (symbolic) control rules cannot construct the symbols itself. That is, a correct partition of the state variables, or a correct set of qualitative symbols, is given to the learning program.
We do not make this assumption in our work on learning to control dynamic systems. The learning task is divided into two phases. The first phase extracts symbols from quantitative inputs, a process commonly called quantization. The second phase evaluates the symbols obtained in the first phase and induces the best possible symbolic rules based on those symbols. The two phases interact with each other, which makes the whole learning task very difficult. We demonstrate that our new method, called STAQ (Set Training with Automatic Quantization), can aggressively partition the input variables to a finer resolution until the correct control rules based on these partitions (symbols) are learned. In particular, we use STAQ to solve the well-known cart-pole balancing problem.
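To make the notion of quantization concrete, the following minimal sketch shows how quantitative state variables might be mapped to symbols by a set of thresholds, and how a partition could be refined to a finer resolution. It is an illustration only, not the STAQ algorithm: the midpoint-splitting rule, the helper names (`quantize`, `symbolic_state`, `refine`), the variable ranges, and the sample cart-pole state are all assumptions introduced here.

```python
from bisect import bisect_right

def quantize(value, thresholds):
    """Return the index of the interval (the symbol) that value falls into.

    thresholds must be sorted in ascending order.
    """
    return bisect_right(thresholds, value)

def symbolic_state(state, partitions):
    """Map a tuple of quantitative variables to a tuple of symbols."""
    return tuple(quantize(v, t) for v, t in zip(state, partitions))

def refine(thresholds, low, high):
    """Split every interval of [low, high] at its midpoint.

    Midpoint splitting is an assumed refinement rule, used only to
    illustrate moving to a finer resolution.
    """
    edges = [low] + sorted(thresholds) + [high]
    midpoints = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
    return sorted(set(thresholds) | set(midpoints))

# Hypothetical cart-pole state:
# (cart position, cart velocity, pole angle, pole angular velocity)
state = (0.3, -0.1, 0.02, 0.5)

# Coarsest partition: one threshold at zero for each variable.
partitions = [[0.0], [0.0], [0.0], [0.0]]
coarse = symbolic_state(state, partitions)

# Refine only the pole-angle partition over an assumed range of +/-0.2.
partitions[2] = refine(partitions[2], -0.2, 0.2)
fine = symbolic_state(state, partitions)
```

Each distinct tuple of symbols plays the role of one qualitative state on which a control rule can be conditioned. Refining a partition increases the number of such states, which is why the choice of partition and the quality of the learned rules depend on each other.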