Extending Long Short-Term Memory for Multi-View Structured Learning

  • Shyam Sundar Rajagopalan
  • Louis-Philippe Morency
  • Tadas Baltrus̆aitis
  • Roland Goecke
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9911)


Long Short-Term Memory (LSTM) networks have been successfully applied to a number of sequence learning problems, but they lack the design flexibility to model interactions between multiple views, which limits their ability to exploit multi-view relationships. In this paper, we propose a Multi-View LSTM (MV-LSTM), which explicitly models the view-specific and cross-view interactions over time or structured outputs. We evaluate the MV-LSTM model on four publicly available datasets spanning two very different structured learning problems: multimodal behaviour recognition and image captioning. The experimental results show competitive performance on all four datasets when compared with state-of-the-art models.


Long Short-Term Memory, Multi-View Learning, Behaviour recognition, Image caption



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Shyam Sundar Rajagopalan (1, corresponding author)
  • Louis-Philippe Morency (2)
  • Tadas Baltrus̆aitis (2)
  • Roland Goecke (1)
  1. Vision and Sensing, Human-Centred Technology Research Centre, University of Canberra, Canberra, Australia
  2. Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
