  • Review
  • Open Access
  • Published: 07 January 2023

A Survey on Recent Advances and Challenges in Reinforcement Learning Methods for Task-oriented Dialogue Policy Learning

  • Wai-Chung Kwan (ORCID: orcid.org/0000-0002-2942-4208),
  • Hong-Ru Wang (ORCID: orcid.org/0000-0001-5027-0138),
  • Hui-Min Wang (ORCID: orcid.org/0000-0002-6147-8310) &
  • Kam-Fai Wong (ORCID: orcid.org/0000-0002-9427-5659)

Machine Intelligence Research (2023)


Abstract

Dialogue policy learning (DPL) is a key component in a task-oriented dialogue (TOD) system. Its goal is to decide the next action of the dialogue system at each turn, given the dialogue state, based on a learned dialogue policy. Reinforcement learning (RL) is widely used to optimize this policy: in the learning process, the user is regarded as the environment and the system as the agent. In this paper, we present an overview of recent advances and challenges in dialogue policy from the perspective of RL. More specifically, we identify the problems in RL-based dialogue policy learning and summarize the corresponding solutions. In addition, we provide a comprehensive survey of applying RL to DPL by categorizing recent methods according to the five basic elements of RL. We believe this survey can shed light on future research in DPL.
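To make this RL formulation concrete: at each turn the agent observes the dialogue state, the policy selects a system action, and the (simulated) user's reaction yields a reward and the next state, so each dialogue unrolls as an episode of a Markov decision process. The sketch below is a minimal, self-contained illustration of that loop under our own assumptions, not an algorithm from the surveyed papers: a hypothetical single-slot user simulator serves as the environment, and tabular Q-learning stands in for the dialogue policy learner. All names here (UserSimulatorEnv, train_policy, the three dialogue acts) are illustrative.

```python
import random
from collections import defaultdict

class UserSimulatorEnv:
    """Toy simulated user (the RL environment). The dialogue succeeds once
    the system has requested the slot and then informed the result."""
    N_ACTIONS = 3  # hypothetical system acts: 0 = greet, 1 = request slot, 2 = inform

    def __init__(self, max_turns=8):
        self.max_turns = max_turns

    def reset(self):
        self.turn, self.slot_filled = 0, False
        return (self.turn, self.slot_filled)  # toy dialogue state

    def step(self, action):
        self.turn += 1
        if action == 1:                       # requesting the slot fills it
            self.slot_filled = True
        success = self.slot_filled and action == 2
        done = success or self.turn >= self.max_turns
        # Sparse task-success reward plus a small per-turn cost, a common DPL setup.
        reward = 10.0 if success else (-5.0 if done else -1.0)
        return (self.turn, self.slot_filled), reward, done

def train_policy(episodes=2000, eps=0.1, alpha=0.5, gamma=0.95):
    """Tabular epsilon-greedy Q-learning; the learned Q-table is the dialogue policy."""
    env = UserSimulatorEnv()
    n = UserSimulatorEnv.N_ACTIONS
    q = defaultdict(lambda: [0.0] * n)        # state -> action values
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < eps:         # explore
                action = random.randrange(n)
            else:                             # exploit
                action = max(range(n), key=lambda a: q[state][a])
            next_state, reward, done = env.step(action)
            # One-step temporal-difference update.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

if __name__ == "__main__":
    policy = train_policy()
    best = max(range(3), key=lambda a: policy[(0, False)][a])
    print("Learned first act:", ["greet", "request slot", "inform"][best])
```

Replacing the Q-table with a neural network and the toy simulator with a richer one (e.g., an agenda-based user simulator) recovers the deep RL setting on which most of the surveyed methods build.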




Acknowledgements

This research was supported by the Innovation and Technology Fund (ITF), Government of the Hong Kong Special Administrative Region (HKSAR), China (No. PRP-054-21FX).

Author information

Author notes
  1. These authors contributed equally to this work.

Authors and Affiliations

  1. Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong 999077, China

    Wai-Chung Kwan, Hong-Ru Wang, Hui-Min Wang & Kam-Fai Wong


Corresponding author

Correspondence to Wai-Chung Kwan.

Additional information

Wai-Chung Kwan received the B.Sc. degree in computer science from Hong Kong Baptist University, China in 2019. He is currently a Ph.D. candidate at the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, China.

His research interests include natural language processing, reinforcement learning and dialogue systems.

Hong-Ru Wang received the B.Sc. degree in computer science and technology from Communication University of China, China in 2019, and the M.Sc. degree in computer science from The Chinese University of Hong Kong, China in 2020. He is currently a Ph.D. candidate at the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, China.

His research interests include task-oriented dialogue systems, controllable natural language generation, and persona- and knowledge-enhanced dialogue systems.

Hui-Min Wang received the B.Eng. and M.Eng. degrees in automation from Tsinghua University, China in 2014 and 2017, respectively, and the Ph.D. degree in systems engineering and engineering management from the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, China in 2021.

Her research interests include reinforcement learning and natural language processing, especially dialogue systems.

Kam-Fai Wong received the Ph.D. degree in electrical engineering from Edinburgh University, UK in 1987. He was a postdoctoral researcher at Heriot-Watt University, UK, UniSys, UK, and ECRC, Germany. At present, he is a professor in the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong (CUHK), China. He serves as the Associate Dean (External Affairs) of Engineering, the Director of the Centre for Innovation and Technology (CINTEC), and Associate Director of the Centre for Entrepreneurship (CfE), CUHK. He served as President of the Asian Federation of Natural Language Processing (AFNLP, 2015–2016) and President of the Governing Board of the Chinese Language Computer Society (CLCS, 2015–2017). He has published over 250 technical papers in these areas in international journals, conference proceedings, and books. He is a Fellow of ACL (2020), a member of ACM, a senior member of IEEE, and a fellow of BCS (UK), IET (UK), and HKIE. He is the founding Editor-in-Chief of ACM Transactions on Asian Language Information Processing (TALIP) and serves as Associate Editor of the International Journal of Computational Linguistics and Chinese Language Processing. He was the Publication Chair of ACL 2021, General Chair of AACL-IJCNLP 2020, Organization Chair of EMNLP 2019, Conference Co-chair of NDBC 2016, BigComp 2016, NLPCC 2015, and IJCNLP 2011, the Finance Chair of SIGMOD 2007, and the PC Co-chair of IJCNLP 2006. He also serves as a programme committee member of many international conferences.

His research interests focus on Chinese computing, social media processing, and information retrieval.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Kwan, WC., Wang, HR., Wang, HM. et al. A Survey on Recent Advances and Challenges in Reinforcement Learning Methods for Task-oriented Dialogue Policy Learning. Mach. Intell. Res. (2023). https://doi.org/10.1007/s11633-022-1347-y


  • Received: 02 May 2022

  • Accepted: 06 June 2022

  • Published: 07 January 2023

  • DOI: https://doi.org/10.1007/s11633-022-1347-y


Keywords

  • Dialogue policy learning (DPL)
  • task-oriented dialogue system (TOD)
  • reinforcement learning (RL)
  • dialogue system
  • Markov decision process