Proximal algorithms and temporal difference methods for solving fixed point problems


Abstract

In this paper we consider large fixed point problems and their solution with proximal algorithms. We show that for linear problems there is a close connection between proximal iterations, which are prominent in numerical analysis and optimization, and multistep methods of the temporal difference type, such as TD(\(\lambda \)), LSTD(\(\lambda \)), and LSPE(\(\lambda \)), which are central in simulation-based exact and approximate dynamic programming. One benefit of this connection is a new and simple way to accelerate the standard proximal algorithm by extrapolation towards a multistep iteration, which generically has a faster convergence rate. Another benefit is the potential for integration into the proximal algorithmic context of several new ideas that have emerged in the approximate dynamic programming context, including simulation-based implementations. Conversely, the analytical and algorithmic insights from proximal algorithms can be brought to bear on the analysis and enhancement of temporal difference methods. We also generalize our linear-case result to nonlinear problems that involve a contractive mapping, thus providing guaranteed and potentially substantial acceleration of the proximal and forward-backward splitting algorithms at no extra cost. Moreover, under certain monotonicity assumptions, we extend the connection with temporal difference methods to nonlinear problems through a linearization approach.
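As a minimal numerical sketch of the linear-case connection (our illustration, not code from the paper): for the fixed point problem \(x = Ax + b\) with a contractive \(A\), the script below compares the standard proximal iteration with its extrapolation by the factor \((c+1)/c\), which for linear problems coincides with the multistep mapping \(T^{(\lambda )}\), \(\lambda = c/(c+1)\). The proximal mapping is denoted \(P^{(c)}\), and all problem data are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 20, 10.0
M = rng.standard_normal((n, n))
A = 0.9 * M / np.linalg.norm(M, 2)          # contraction: spectral norm 0.9
b = rng.standard_normal(n)
x_star = np.linalg.solve(np.eye(n) - A, b)  # exact fixed point of x = A x + b

# Proximal mapping P^(c): y solves y + c((I - A) y - b) = x,
# i.e., ((1 + c) I - c A) y = x + c b.
P_mat = (1 + c) * np.eye(n) - c * A

def prox(x):
    return np.linalg.solve(P_mat, x + c * b)

x_p = np.zeros(n)   # standard proximal iterates
x_e = np.zeros(n)   # extrapolated iterates
for _ in range(20):
    x_p = prox(x_p)
    x_e = x_e + (c + 1) / c * (prox(x_e) - x_e)   # extrapolation toward T^(lambda)

print("proximal error:    ", np.linalg.norm(x_p - x_star))
print("extrapolated error:", np.linalg.norm(x_e - x_star))
```

For such data the extrapolated iterates should converge faster, roughly by the factor \(\rho (A)^k\) after \(k\) steps, in line with the generic rate advantage mentioned above.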


Notes

  1. It is possible to scale the eigenvalues of Q to lie in the interval (0, 2] without changing the problem, by multiplying Q and b by a suitable positive scalar. This, however, requires some prior knowledge about the location of the eigenvalues of Q (a hypothetical sketch of this rescaling follows these notes).

  2. In approximate DP it is common to replace a fixed point equation of the form \(x=F(x)\) with the equation \(x=\Pi \big (F(x)\big )\), where \(\Pi \) denotes projection onto an approximation subspace. This approach comes under the general framework of Galerkin approximation, which is widely used in a variety of numerical computation contexts (see, e.g., the books by Krasnoselskii [39] and Fletcher [34], and the DP-oriented discussion in the paper [68]). A distinguishing feature of approximate DP applications is that F is a linear mapping and the equation \(x=\Pi \big (F(x)\big )\) is typically solved by simulation-based methods (a small sketch of the projected equation follows these notes).

  3. The precise nature of the problem that TD(\(\lambda \)) aims to solve was unclear for a long time. The paper by Tsitsiklis and Van Roy [61] showed that it aims to find a fixed point of \(T^{(\lambda )}\) or \(\Pi T^{(\lambda )}\), and gave a convergence analysis (also replicated in the book [7]). The paper by Bertsekas and Yu [10] (Section 5.3) generalized TD(\(\lambda \)), LSTD(\(\lambda \)), and LSPE(\(\lambda \)) to the linear system context of this paper.

  4. It is well known that the proximal iteration can be extrapolated by a factor of as much as two while maintaining convergence. This was first shown for the special case of a convex optimization problem in [13], and then for the general case of finding a zero of a monotone operator in [32]; see also the more refined analysis in [14], Section 2.3.1, which quantifies the effects of extrapolation for the case of a quadratic programming problem. However, we are not aware of any earlier proposal of a simple and general scheme for choosing an extrapolation factor that maintains convergence and simultaneously guarantees acceleration. Moreover, this extrapolation factor, \((c+1)/c\), may be much larger than two (a scalar worked example follows these notes).
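The following is a hypothetical sketch of the rescaling described in note 1, assuming Q is symmetric with positive eigenvalues; the parameter `lam_max` stands in for the prior eigenvalue knowledge mentioned there (here it is simply computed exactly for illustration).

```python
import numpy as np

def rescale(Q, b, lam_max=None):
    """Scale Q and b by gamma = 2 / lam_max so that the eigenvalues of
    gamma * Q lie in (0, 2]; the solution set of Q x = b is unchanged.
    lam_max should be an upper bound on the largest eigenvalue of Q."""
    if lam_max is None:
        # Illustration only: in practice a bound would be known a priori.
        lam_max = np.linalg.eigvalsh(Q).max()
    gamma = 2.0 / lam_max
    return gamma * Q, gamma * b
```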
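Below is a minimal sketch of the projected (Galerkin) equation of note 2 for a linear mapping \(F(x) = Ax + b\), assuming Euclidean projection onto the subspace spanned by the columns of a basis matrix Phi; approximate DP would typically use a weighted norm and simulation in place of the direct solve shown here. All data are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 5
M = rng.standard_normal((n, n))
A = 0.9 * M / np.linalg.norm(M, 2)   # contractive linear mapping F(x) = A x + b
b = rng.standard_normal(n)
Phi = rng.standard_normal((n, m))    # basis of the approximation subspace S

# Projected equation Phi r = Pi(A Phi r + b): with Euclidean projection Pi,
# the orthogonality condition Phi^T (Phi r - A Phi r - b) = 0 gives an
# m x m system in r in place of the original n x n system.
r = np.linalg.solve(Phi.T @ (Phi - A @ Phi), Phi.T @ b)
x_hat = Phi @ r                      # Galerkin approximation of the fixed point

x_star = np.linalg.solve(np.eye(n) - A, b)
print("approximation error:", np.linalg.norm(x_hat - x_star))
```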
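As a scalar worked example of the extrapolation in note 4 (our computation, in the paper's notation): for \(T(x) = ax + b\) with \(0< a <1\), the proximal mapping is \(P^{(c)}(x) = (x + cb)/(1+c-ca)\), which contracts with modulus \(1/(1+c-ca)\). The extrapolated iterate \(x + \frac{c+1}{c}\big (P^{(c)}(x) - x\big )\) simplifies to \(\big (ax + (c+1)b\big )/(1+c-ca)\), which contracts with modulus \(a/(1+c-ca)\). Thus extrapolation improves the convergence rate of the proximal step by the factor \(a\), at essentially no added cost, and the extrapolation factor \((c+1)/c\) can far exceed two when \(c\) is small.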

References

  1. Busoniu, L., Babuska, R., De Schutter, B., Ernst, D.: Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, New York (2010)

  2. Bertsekas, D.P., Borkar, V.S., Nedić, A.: Improved temporal difference methods with linear function approximation. In: Si, J., Barto, A., Powell, W., Wunsch, D. (eds.) Learning and Approximate Dynamic Programming. IEEE Press, New York (2004)

  3. Boutsidis, C., Drineas, P., Magdon-Ismail, M.: Near-optimal column-based matrix reconstruction. SIAM J. Comput. 43, 687–717 (2014)

  4. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2011)

  5. Bertsekas, D.P., Ioffe, S.: Temporal differences-based policy iteration and applications in neuro-dynamic programming. Laboratory for Information and Decision Systems Report LIDS-P-2349, MIT (1996)

  6. Bertsekas, D.P., Tsitsiklis, J.N.: An analysis of stochastic shortest path problems. Math. Oper. Res. 16, 580–595 (1991)

  7. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont, MA (1996)

  8. Beck, A., Teboulle, M.: Gradient-based algorithms with applications to signal-recovery problems. In: Eldar, Y., Palomar, D. (eds.) Convex Optimization in Signal Processing and Communications, pp. 42–88. Cambridge University Press, Cambridge (2010)

  9. Bertsekas, D.P., Yu, H.: Solution of large systems of equations using approximate dynamic programming methods. Laboratory for Information and Decision Systems Report LIDS-P-2754, MIT (2007)

  10. Bertsekas, D.P., Yu, H.: Projected equation methods for approximate solution of large linear systems. J. Comput. Appl. Math. 227, 27–50 (2009)

  11. Bertsekas, D.P., Yu, H.: Asynchronous distributed policy iteration in dynamic programming. In: Proceedings of Allerton Conference on Communication, Control and Computing, Allerton Park, IL, pp. 1368–1374 (2010)

  12. Bertsekas, D.P., Yu, H.: Q-learning and enhanced policy iteration in discounted dynamic programming. Math. Oper. Res. 37, 66–94 (2012)

  13. Bertsekas, D.P.: On the method of multipliers for convex programming. IEEE Trans. Autom. Control 20, 385–388 (1975)

  14. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press, New York (1982). Republished by Athena Scientific, Belmont, MA (1997)

  15. Bertsekas, D.P.: Temporal difference methods for general projected equations. IEEE Trans. Autom. Control 56, 2128–2139 (2011)

  16. Bertsekas, D.P.: Approximate policy iteration: a survey and some new methods. J. Control Theory Appl. 9, 310–335 (2011)

  17. Bertsekas, D.P.: Dynamic Programming and Optimal Control: Approximate Dynamic Programming, vol. II, 4th edn. Athena Scientific, Belmont, MA (2012)

  18. Bertsekas, D.P.: \(\lambda \)-policy iteration: a review and a new implementation. In: Lewis, F., Liu, D. (eds.) Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. IEEE Press, New York (2012)

  19. Bertsekas, D.P.: Convex Optimization Algorithms. Athena Scientific, Belmont, MA (2015)

  20. Bertsekas, D.P.: Proximal algorithms and temporal differences for large linear systems: extrapolation, approximation, and simulation. Laboratory for Information and Decision Systems Report LIDS-P-3205, MIT (2016)

  21. Bertsekas, D.P.: Abstract Dynamic Programming, 2nd edn. Athena Scientific, Belmont, MA (2018). http://web.mit.edu/dimitrib/www/home.html

  22. Boyan, J.A.: Technical update: least-squares temporal difference learning. Mach. Learn. 49, 1–15 (2002)

  23. Bradtke, S.J., Barto, A.G.: Linear least-squares algorithms for temporal difference learning. Mach. Learn. 22, 33–57 (1996)

  24. Censor, Y., Herman, G.T., Jiang, M.: A note on the behavior of the randomized Kaczmarz algorithm of Strohmer and Vershynin. J. Fourier Anal. Appl. 15, 431–436 (2009)

  25. Curtiss, J.H.: A theoretical comparison of the efficiencies of two classical methods and a Monte Carlo method for computing one component of the solution of a set of linear algebraic equations. In: Proceedings of Symposium on Monte Carlo Methods, pp. 191–233 (1954)

  26. Curtiss, J.H.: Monte Carlo methods for the iteration of linear operators. Uspekhi Mat. Nauk 12, 149–174 (1957)

  27. Drineas, P., Kannan, R., Mahoney, M.W.: Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J. Comput. 35, 132–157 (2006)

  28. Drineas, P., Kannan, R., Mahoney, M.W.: Fast Monte Carlo algorithms for matrices II: computing a low-rank approximation to a matrix. SIAM J. Comput. 36, 158–183 (2006)

  29. Drineas, P., Mahoney, M.W., Muthukrishnan, S.: Sampling algorithms for L2 regression and applications. In: Proceedings of the 17th Annual SODA, pp. 1127–1136 (2006)

  30. Drineas, P., Mahoney, M.W., Muthukrishnan, S.: Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl. 30, 844–881 (2008)

  31. Drineas, P., Mahoney, M.W., Muthukrishnan, S., Sarlos, T.: Faster least squares approximation. Numer. Math. 117, 219–249 (2011)

  32. Eckstein, J., Bertsekas, D.P.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)

  33. Facchinei, F., Pang, J.-S.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer, New York (2003)

  34. Fletcher, C.A.J.: Computational Galerkin Methods. Springer, New York (1984)

  35. Forsythe, G.E., Leibler, R.A.: Matrix inversion by a Monte Carlo method. Mathematical Tables and Other Aids to Computation 4, 127–129 (1950)

  36. Gabillon, V., Ghavamzadeh, M., Scherrer, B.: Approximate dynamic programming finally performs well in the game of Tetris. In: Advances in Neural Information Processing Systems, pp. 1754–1762 (2013)

  37. Gabay, D.: Applications of the method of multipliers to variational inequalities. In: Fortin, M., Glowinski, R. (eds.) Augmented Lagrangian Methods: Applications to the Solution of Boundary-Value Problems. North-Holland, Amsterdam (1983)

  38. Halton, J.H.: A retrospective and prospective survey of the Monte Carlo method. SIAM Rev. 12, 1–63 (1970)

  39. Krasnoselskii, M.A., et al.: Approximate Solution of Operator Equations. Wolters-Noordhoff, Groningen (1972). Translated by D. Louvish

  40. Lagoudakis, M.G., Parr, R.: Least-squares policy iteration. J. Mach. Learn. Res. 4, 1107–1149 (2003)

  41. Leventhal, D., Lewis, A.S.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Oper. Res. 35, 641–654 (2010)

  42. Lewis, F.L., Liu, D. (eds.): Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, Hoboken, NJ (2013)

  43. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16, 964–979 (1979)

  44. Mnih, V., Kavukcuoglu, K., Silver, D., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)

  45. Martinet, B.: Régularisation d'inéquations variationnelles par approximations successives. Rev. Française Inf. Rech. Opér. 4, 154–158 (1970)

  46. Nedić, A., Bertsekas, D.P.: Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Syst. Theory Appl. 13, 79–110 (2003)

  47. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1, 123–231 (2013)

  48. Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley, New York (2007)

  49. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14, 877–898 (1976)

  50. Si, J., Barto, A., Powell, W., Wunsch, D. (eds.): Learning and Approximate Dynamic Programming. IEEE Press, New York (2004)

  51. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016)

  52. Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., Geist, M.: Approximate modified policy iteration and its application to the game of Tetris. J. Mach. Learn. Res. 16, 1629–1676 (2015)

  53. Samuel, A.L.: Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 3, 210–229 (1959)

  54. Samuel, A.L.: Some studies in machine learning using the game of checkers. II—Recent progress. IBM J. Res. Dev. 11, 601–617 (1967)

  55. Scherrer, B.: Performance bounds for \(\lambda \)-policy iteration and application to the game of Tetris. J. Mach. Learn. Res. 14, 1181–1227 (2013)

  56. Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15, 262–278 (2009)

  57. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)

  58. Sutton, R.S.: Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988)

  59. Szepesvari, C.: Algorithms for Reinforcement Learning. Morgan and Claypool Publishers, San Rafael (2010)

  60. Tesauro, G.J.: TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput. 6, 215–219 (1994)

  61. Tsitsiklis, J.N., Van Roy, B.: An analysis of temporal-difference learning with function approximation. IEEE Trans. Autom. Control 42, 674–690 (1997)

  62. Tseng, P.: Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM J. Control Optim. 29, 119–138 (1991)

  63. Vrabie, D., Vamvoudakis, K.G., Lewis, F.L.: Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. The Institution of Engineering and Technology, London (2013)

  64. Wang, M., Bertsekas, D.P.: Stabilization of stochastic iterative methods for singular and nearly singular linear systems. Math. Oper. Res. 39, 1–30 (2013)

  65. Wang, M., Bertsekas, D.P.: Convergence of iterative simulation-based methods for singular linear systems. Stoch. Syst. 3, 39–96 (2014)

  66. Wang, M., Bertsekas, D.P.: Incremental constraint projection methods for variational inequalities. Math. Program. 150, 321–363 (2015)

  67. Yu, H., Bertsekas, D.P.: Convergence results for some temporal difference methods based on least squares. IEEE Trans. Autom. Control 54, 1515–1531 (2009)

  68. Yu, H., Bertsekas, D.P.: Error bounds for approximations from projected linear equations. Math. Oper. Res. 35, 306–329 (2010)

  69. Yu, H., Bertsekas, D.P.: Weighted Bellman equations and their applications in dynamic programming. Laboratory for Information and Decision Systems Report LIDS-P-2876, MIT (2012)

  70. Yu, H., Bertsekas, D.P.: Q-learning and policy iteration algorithms for stochastic shortest path problems. Ann. Oper. Res. 208, 95–132 (2013)

  71. Yu, H.: Convergence of least squares temporal difference methods under general conditions. In: Proceedings of the 27th ICML, Haifa, Israel (2010)

  72. Yu, H.: Least squares temporal difference methods: an analysis under general conditions. SIAM J. Control Optim. 50, 3310–3343 (2012)


Author information


Correspondence to Dimitri P. Bertsekas.


Cite this article

Bertsekas, D.P. Proximal algorithms and temporal difference methods for solving fixed point problems. Comput Optim Appl 70, 709–736 (2018). https://doi.org/10.1007/s10589-018-9990-5
