  • Review
  • Open Access
  • Published: 07 January 2023

A Survey on Recent Advances and Challenges in Reinforcement Learning Methods for Task-oriented Dialogue Policy Learning

  • Wai-Chung Kwan (ORCID: orcid.org/0000-0002-2942-4208),
  • Hong-Ru Wang (ORCID: orcid.org/0000-0001-5027-0138),
  • Hui-Min Wang (ORCID: orcid.org/0000-0002-6147-8310) &
  • Kam-Fai Wong (ORCID: orcid.org/0000-0002-9427-5659)

Machine Intelligence Research (2023)


Abstract

Dialogue policy learning (DPL) is a key component in a task-oriented dialogue (TOD) system. Its goal is to decide the next action of the dialogue system at each turn, given the dialogue state, based on a learned dialogue policy. Reinforcement learning (RL) is widely used to optimize this policy: in the learning process, the user is regarded as the environment and the system as the agent. In this paper, we present an overview of recent advances and challenges in dialogue policy from the perspective of RL. More specifically, we identify the problems in RL-based dialogue policy learning and summarize the corresponding solutions. In addition, we provide a comprehensive survey of applying RL to DPL by categorizing recent methods according to the five basic elements of RL. We believe this survey can shed light on future research in DPL.
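To make this RL formulation concrete: at each turn the agent observes the dialogue state, the policy selects a system action, and the (simulated) user's reaction yields a reward and the next state, so each dialogue unrolls as an episode of a Markov decision process. The sketch below is a minimal, self-contained illustration of that loop under our own assumptions, not an algorithm from the surveyed papers: a hypothetical single-slot user simulator serves as the environment, and tabular Q-learning stands in for the dialogue policy learner. All names here (UserSimulatorEnv, train_policy, the three dialogue acts) are illustrative.

```python
import random
from collections import defaultdict

class UserSimulatorEnv:
    """Toy simulated user (the RL environment). The dialogue succeeds once
    the system has requested the slot and then informed the result."""
    N_ACTIONS = 3  # hypothetical system acts: 0 = greet, 1 = request slot, 2 = inform

    def __init__(self, max_turns=8):
        self.max_turns = max_turns

    def reset(self):
        self.turn, self.slot_filled = 0, False
        return (self.turn, self.slot_filled)  # toy dialogue state

    def step(self, action):
        self.turn += 1
        if action == 1:                       # requesting the slot fills it
            self.slot_filled = True
        success = self.slot_filled and action == 2
        done = success or self.turn >= self.max_turns
        # Sparse task-success reward plus a small per-turn cost, a common DPL setup.
        reward = 10.0 if success else (-5.0 if done else -1.0)
        return (self.turn, self.slot_filled), reward, done

def train_policy(episodes=2000, eps=0.1, alpha=0.5, gamma=0.95):
    """Tabular epsilon-greedy Q-learning; the learned Q-table is the dialogue policy."""
    env = UserSimulatorEnv()
    n = UserSimulatorEnv.N_ACTIONS
    q = defaultdict(lambda: [0.0] * n)        # state -> action values
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < eps:         # explore
                action = random.randrange(n)
            else:                             # exploit
                action = max(range(n), key=lambda a: q[state][a])
            next_state, reward, done = env.step(action)
            # One-step temporal-difference update.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

if __name__ == "__main__":
    policy = train_policy()
    best = max(range(3), key=lambda a: policy[(0, False)][a])
    print("Learned first act:", ["greet", "request slot", "inform"][best])
```

Replacing the Q-table with a neural network and the toy simulator with a richer one (e.g., an agenda-based user simulator) recovers the deep RL setting on which most of the surveyed methods build.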




Acknowledgements

This research was supported by the Innovation and Technology Fund (ITF), Government of the Hong Kong Special Administrative Region (HKSAR), China (No. PRP-054-21FX).

Author information

Author notes
  1. These authors contributed equally to this work.

Authors and Affiliations

  1. Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong 999077, China

    Wai-Chung Kwan, Hong-Ru Wang, Hui-Min Wang & Kam-Fai Wong


Corresponding author

Correspondence to Wai-Chung Kwan.

Additional information

Wai-Chung Kwan received the B.Sc. degree in computer science from Hong Kong Baptist University, China in 2019. He is currently a Ph.D. candidate at the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, China.

His research interests include natural language processing, reinforcement learning and dialogue systems.

Hong-Ru Wang received the B.Sc. degree in computer science and technology from Communication University of China, China in 2019, and the M.Sc. degree in computer science from The Chinese University of Hong Kong, China in 2020. He is currently a Ph.D. candidate at the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, China.

His research interests include task-oriented dialogue systems, controllable natural language generation, and persona- and knowledge-enhanced dialogue systems.

Hui-Min Wang received the B.Eng. and M.Eng. degrees in automation from Tsinghua University, China in 2014 and 2017, respectively, and the Ph.D. degree in systems engineering and engineering management from the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, China in 2021.

Her research interests include reinforcement learning and natural language processing, especially dialogue systems.

Kam-Fai Wong received the Ph.D. degree in electrical engineering from Edinburgh University, UK in 1987. He was a postdoctoral researcher at Heriot-Watt University, UK, UniSys, UK, and ECRC, Germany. At present, he is a professor in the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong (CUHK), China. He serves as the Associate Dean (External Affairs) of Engineering, the Director of the Centre for Innovation and Technology (CINTEC), and Associate Director of the Centre for Entrepreneurship (CfE), CUHK. He served as President of the Asian Federation of Natural Language Processing (AFNLP, 2015–2016) and President of the Governing Board of the Chinese Language Computer Society (CLCS, 2015–2017). He has published over 250 technical papers in these areas in international journals, conference proceedings, and books. He is a Fellow of ACL (2020), a member of ACM, a senior member of IEEE, and a fellow of BCS (UK), IET (UK), and HKIE. He is the founding Editor-in-Chief of ACM Transactions on Asian Language Information Processing (TALIP) and serves as Associate Editor of the International Journal of Computational Linguistics and Chinese Language Processing. He was the Publication Chair of ACL 2021, General Chair of AACL-IJCNLP 2020, Organization Chair of EMNLP 2019, Conference Co-chair of NDBC 2016, BigComp 2016, NLPCC 2015, and IJCNLP 2011, the Finance Chair of SIGMOD 2007, and the PC Co-chair of IJCNLP 2006. He also serves as a programme committee member of many international conferences.

His research interests focus on Chinese computing, social media processing, and information retrieval.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Kwan, WC., Wang, HR., Wang, HM. et al. A Survey on Recent Advances and Challenges in Reinforcement Learning Methods for Task-oriented Dialogue Policy Learning. Mach. Intell. Res. (2023). https://doi.org/10.1007/s11633-022-1347-y


  • Received: 02 May 2022

  • Accepted: 06 June 2022

  • Published: 07 January 2023

  • DOI: https://doi.org/10.1007/s11633-022-1347-y


Keywords

  • Dialogue policy learning (DPL)
  • task-oriented dialogue system (TOD)
  • reinforcement learning (RL)
  • dialogue system
  • Markov decision process