Crowd Knowledge Enhanced Multimodal Conversational Assistant in Travel Domain

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11961)

Abstract

We present a new solution towards building a crowd-knowledge-enhanced multimodal conversational system for travel. It aims to assist users in completing various travel-related tasks, such as searching for restaurants or things to do, in a multimodal conversational manner involving both text and images. To achieve this goal, we ground this research in a combination of multimodal understanding and recommendation techniques, exploring the possibility of a more convenient information-seeking paradigm. Specifically, we build the system in a modular manner, where the construction of each module is enriched with crowd knowledge from social sites. To the best of our knowledge, this is the first work that attempts to build an intelligent multimodal conversational system for travel, and it moves an important step towards developing human-like assistants for the completion of daily-life tasks. Several current challenges are also pointed out as future directions.


References

  1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)

  2. Bordes, A., Weston, J.: Learning end-to-end goal-oriented dialog. In: The 3rd International Conference on Learning Representations, pp. 1–14 (2016)

  3. Budzianowski, P., et al.: MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In: EMNLP, pp. 5016–5026 (2018)

  4. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: SIGMOD, pp. 313–324. ACM (2003)

  5. Chen, Y.N., Wang, W.Y., Rudnicky, A.I.: Leveraging frame semantics and distributional semantics for unsupervised semantic slot induction in spoken dialogue systems. In: 2014 IEEE Spoken Language Technology Workshop, pp. 584–589 (2014)

  6. Ester, M., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp. 226–231 (1996)

  7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)

  8. Li, R., Kahou, S.E., Schulz, H., Michalski, V., Charlin, L., Pal, C.: Towards deep conversational recommendations. In: NIPS, pp. 9748–9758 (2018)

  9. Liao, L., He, X., Ren, Z., Nie, L., Xu, H., Chua, T.S.: Representativeness-aware aspect analysis for brand monitoring in social media. In: IJCAI, pp. 310–316 (2017)

  10. Liao, L., Takanobu, R., Ma, Y., Yang, X., Huang, M., Chua, T.: Deep conversational recommender in travel. arxiv:1907.00710 (2019)

  11. Liu, B., Lane, I.: Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454 (2016)

  12. Madotto, A., Wu, C.S., Fung, P.: Mem2seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In: ACL, pp. 1468–1478 (2018)

  13. Rieser, V., Lemon, O.: Natural language generation as planning under uncertainty for spoken dialogue systems. In: Krahmer, E., Theune, M. (eds.) EACL/ENLG -2009. LNCS (LNAI), vol. 5790, pp. 105–120. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15573-4_6

  14. Sukhbaatar, S., et al.: End-to-end memory networks. In: NIPS, pp. 2440–2448 (2015)

  15. Sun, Y., Zhang, Y.: Conversational recommender system. In: SIGIR, pp. 235–244 (2018)

  16. Tur, G., Jeong, M., Wang, Y.Y., Hakkani-Tür, D., Heck, L.: Exploiting the semantic web for unsupervised natural language semantic parsing. In: Thirteenth Annual Conference of the International Speech Communication Association (2012)

  17. Wen, T.H., et al.: A network-based end-to-end trainable task-oriented dialogue system. In: EACL, pp. 438–449 (2017)

  18. Yan, Z., Duan, N., Chen, P., Zhou, M., Zhou, J., Li, Z.: Building task-oriented dialogue systems for online shopping. In: AAAI, pp. 4618–4625 (2017)

Author information


Corresponding author

Correspondence to Lizi Liao .

Appendices

Appendix A: State Tracking

State tracking refers to the maintenance of the dialogue state \(\mathcal {S}_t\), which is the representation of the conversation session up to time t. Based on the state \(\mathcal {S}_{t-1}\) from the previous time step and the multimodal understanding result \(\mathcal {U}_t\) for the utterance at time step t, the dialogue state is obtained as follows:

$$\begin{aligned} \mathcal {S}_t = \mathcal {G}(\mathcal {S}_{t-1},{<}\mathcal {M}_t,\mathcal {D}_t,\mathcal {C}_t,\mathcal {A}_t,\mathcal {I}_t{>}), \end{aligned}$$
(6)

where \(\mathcal {G}\) refers to a set of rules. We generally summarize the rules as below:

  1. if \(\mathcal {M}_t = Chitchat\), then \(\mathcal {S}_t = \mathcal {S}_{t-1}\);

  2. if the domain \(\mathcal {D}_t\) has changed, \(\mathcal {S}_t\) is rebuilt entirely from \(\mathcal {U}_t\);

  3. if the domain \(\mathcal {D}_t\) is unchanged and \(\mathcal {M}_t \ne Negation\), \(\mathcal {S}_t\) inherits the information stored in \(\mathcal {S}_{t-1}\);

  4. if the domain \(\mathcal {D}_t\) is unchanged and \(\mathcal {M}_t = Negation\), \(\mathcal {S}_t\) inherits the information stored in \(\mathcal {S}_{t-1}\) while updating the contradicted parts according to \(\mathcal {U}_t\);

  5. if the time interval between two consecutive utterances exceeds a pre-defined length at time t, \(\mathcal {S}_t\) is cleared.
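As an illustration, the rules above can be sketched as a single update function. This is a minimal sketch, not the system's actual implementation: the dict-based state, the field names (`intent_type`, `domain`), and the timeout value are all assumptions, and rules (3)/(4) are read here as "fill new slots without overwriting" versus "overwrite the negated slots".

```python
SESSION_TIMEOUT = 300  # pre-defined interval in seconds; the value is an assumption


def update_state(prev_state, understanding, now, last_turn_time):
    """Rule-based state tracking following rules (1)-(5).

    `prev_state` is a dict view of S_{t-1}; `understanding` is a dict
    view of the multimodal understanding result U_t.
    """
    # Rule (5): clear the state if the gap between turns is too long.
    if now - last_turn_time > SESSION_TIMEOUT:
        prev_state = {}

    modality = understanding.get("intent_type")  # M_t
    domain = understanding.get("domain")         # D_t

    # Rule (1): chitchat leaves the state untouched.
    if modality == "Chitchat":
        return dict(prev_state)

    # Rule (2): a domain switch rebuilds the state entirely from U_t.
    if domain is not None and domain != prev_state.get("domain"):
        return dict(understanding)

    state = dict(prev_state)
    if modality == "Negation":
        # Rule (4): inherit S_{t-1} but overwrite the contradicted slots.
        state.update({k: v for k, v in understanding.items() if v is not None})
    else:
        # Rule (3): inherit S_{t-1}, filling only slots it does not hold yet.
        for k, v in understanding.items():
            if v is not None:
                state.setdefault(k, v)
    return state
```

For example, a `Negation` turn in the same domain ("not Thai, Japanese food instead") would overwrite only the contradicted cuisine slot while keeping the rest of the state.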

Tracking dialogue states is key to a good user experience in multi-turn conversation. The main reason we do not follow previous works in learning statistical tracking models is the lack of dialogue data to train them. We leave leveraging session-level labeled dialogue data to improve state tracking as future work.

Appendix B: Action Decision

At each turn of conversation between the user and the agent, the dialogue management module takes the current state tracking result as input and outputs the corresponding actions. Due to the lack of large-scale dialogue training data, we again resort to a set of rules. The main action types considered and the conditions that trigger them are as follows:

  • Proactive Questioning. This action is triggered when (a) a recommendation intent is detected, (b) a domain is detected, and (c) not enough constraints or attributes are detected in \(\mathcal {S}_t\). It is used to obtain more constraints or attributes to narrow down the search space.

  • Candidate Listing. This action is triggered when recommendation results are obtained or the Show more intent is detected. As each venue in our dataset is associated with Foursquare photos, we implement candidate listing via a list of images, where each image corresponds to a venue. In the interface, the user can conveniently choose a venue by simply clicking its image.

  • Venue Recommendation. This action is triggered when the intent \(\mathcal {I}_t\) in \(\mathcal {S}_t\) is Recommendation, and it will retrieve results from the recommendation module.

  • Question Answering. It is triggered when a venue is selected and one of its slot names is detected in \(\mathcal {U}_t\) without an accompanying value. It returns the missing attribute value by looking up the venue database.

  • Review Summary. This action is triggered when a venue is selected and the intent \(\mathcal {I}_t\) in \(\mathcal {S}_t\) is Ask opinion. It summarizes the reviews of the target venue and presents them in an organized form.

  • API call. It is triggered when a venue is selected and the Map direction intent is present in \(\mathcal {I}_t\). The Google Map API is then called with the start position and destination. Currently, only the Map API is integrated, but other APIs, such as weather reports, could be integrated with proper modifications.

  • Chitchat. This action is triggered when no travel venue-seeking intent is detected. As pointed out by [18], nearly 80% of utterances to e-commerce bots are chitchat queries; if the system cannot reply to them, the conversation may not be able to continue. Thus, this action activates the chitchat response generation to obtain a reply.
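The triggering conditions above can be sketched as an ordered dispatch over the tracked state. This is a minimal illustration under assumed field names (`selected_venue`, `asked_slot`, `constraints`, etc.) and an assumed constraint threshold, not the system's actual schema:

```python
MIN_CONSTRAINTS = 2  # "enough" constraints to search; the threshold is an assumption


def decide_action(state):
    """Map a dict view of the tracked state S_t to one of the action types."""
    intent = state.get("intent")          # I_t
    venue = state.get("selected_venue")

    if venue is not None:
        # Question Answering: a slot name was mentioned without a value.
        if state.get("asked_slot") and state.get("asked_value") is None:
            return "question_answering"
        if intent == "Ask opinion":
            return "review_summary"
        if intent == "Map direction":
            return "api_call"             # e.g. the Google Map API

    if intent == "Recommendation":
        # Proactive Questioning: narrow the search space before recommending.
        if not state.get("domain") or len(state.get("constraints", {})) < MIN_CONSTRAINTS:
            return "proactive_questioning"
        return "venue_recommendation"

    if intent == "Show more" or state.get("results"):
        return "candidate_listing"        # one clickable image per venue

    return "chitchat"                     # fallback for non-venue-seeking queries
```

Ordering the venue-specific checks first mirrors the conditions in the list: actions like Question Answering and Review Summary only make sense once a venue has been selected, while Chitchat is the catch-all fallback.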


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Liao, L., Kennedy, L., Wilcox, L., Chua, TS. (2020). Crowd Knowledge Enhanced Multimodal Conversational Assistant in Travel Domain. In: Ro, Y., et al. MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol. 11961. Springer, Cham. https://doi.org/10.1007/978-3-030-37731-1_33

  • DOI: https://doi.org/10.1007/978-3-030-37731-1_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-37730-4

  • Online ISBN: 978-3-030-37731-1
