Journal of Global Optimization, Volume 73, Issue 2, pp 279–310

A unified DC programming framework and efficient DCA based approaches for large scale batch reinforcement learning

  • Hoai An Le Thi
  • Vinh Thanh Ho
  • Tao Pham Dinh


We investigate a nonconvex optimization approach based on Difference of Convex functions (DC) programming and the DC Algorithm (DCA) for reinforcement learning, a general class of machine learning techniques that aim to estimate the optimal policy in a dynamic environment, typically formulated as a Markov decision process with an incomplete model. We cast the problem as finding a zero of the so-called optimal Bellman residual under linear value-function approximation, and propose two optimization models: minimizing the \(\ell_p\)-norm of a vector-valued convex function, and minimizing a concave function under linear constraints. They are all formulated as DC programs, for which DCA schemes are developed. Numerical experiments on instances of two Markov decision process benchmarks, Garnet and Gridworld, show the efficiency of our approaches in comparison with two existing DCA-based algorithms and two state-of-the-art reinforcement learning algorithms.
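To make the DCA idea concrete, here is a minimal sketch of the generic DCA iteration on a one-dimensional toy problem. It is not one of the schemes developed in the paper. The objective f(x) = x^4 - 2x^2, its decomposition into g(x) = x^4 and h(x) = 2x^2, and the helper names `dca`, `grad_h`, and `argmin_g_linear` are all assumptions chosen for illustration. At each step DCA linearizes the concave part -h at the current point and minimizes the resulting convex function.

```python
import numpy as np


def dca(x0, grad_h, argmin_g_linear, max_iter=100, tol=1e-10):
    """Generic DCA loop for f = g - h with g and h convex (toy, scalar case)."""
    x = float(x0)
    history = [x]
    for _ in range(max_iter):
        y = grad_h(x)                # y_k: (sub)gradient of h at x_k
        x_new = argmin_g_linear(y)   # x_{k+1} minimizes g(x) - y_k * x
        history.append(x_new)
        converged = abs(x_new - x) <= tol
        x = x_new
        if converged:
            break
    return x, history


# Hypothetical toy objective: f(x) = x^4 - 2 x^2, with minimizers at x = -1 and x = 1.
def f(x):
    return x**4 - 2.0 * x**2


def grad_h(x):
    # h(x) = 2 x^2, so h'(x) = 4 x
    return 4.0 * x


def argmin_g_linear(y):
    # g(x) = x^4; minimizing x^4 - y x gives 4 x^3 = y, i.e. x = cbrt(y / 4)
    return float(np.cbrt(y / 4.0))


x_star, history = dca(0.1, grad_h, argmin_g_linear)
values = [f(x) for x in history]
print(x_star, values[0], values[-1])
```

Started from x = 0.1, the iterates move toward the local minimizer x = 1, and the objective values along `history` do not increase, which is the descent property DCA is known for. In the models of the abstract the convex subproblem would not have a closed form like this one and would need a convex solver.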


Keywords

  • Batch reinforcement learning
  • Markov decision process
  • DC programming
  • DCA
  • Optimal Bellman residual



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Hoai An Le Thi (1, corresponding author)
  • Vinh Thanh Ho (1)
  • Tao Pham Dinh (2)

  1. Laboratory of Theoretical and Applied Computer Science EA 3097, University of Lorraine, Metz, France
  2. Laboratory of Mathematics, INSA - Rouen, University of Normandie, Saint-Etienne-du-Rouvray Cedex, France
