Pedestrian Trajectory Prediction with Structured Memory Hierarchies

  • Tharindu Fernando
  • Simon Denman
  • Sridha Sridharan
  • Clinton Fookes
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11051)


This paper presents a novel framework for human trajectory prediction based on multimodal data (video and radar). Motivated by recent neuroscience discoveries, we propose incorporating a structured memory component in the human trajectory prediction pipeline to capture historical information and improve performance. We introduce structured LSTM cells for modelling the memory content hierarchically, preserving the spatiotemporal structure of the information and enabling us to capture both short-term and long-term context. We demonstrate how this architecture can be extended to integrate salient information from multiple modalities, automatically storing and retrieving important information for decision making without any supervision. We evaluate the effectiveness of the proposed models on a novel multimodal dataset that we introduce, consisting of 40,000 pedestrian trajectories acquired jointly from a radar system and a CCTV camera system installed in a public place. The performance is also evaluated on the publicly available New York Grand Central pedestrian database. In both settings, the proposed models demonstrate their capability to better anticipate future pedestrian motion compared to the existing state of the art. Data related to this paper are available at:


Keywords: Human trajectory prediction · Structured memory networks · Multimodal information fusion · Long-term planning
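The pipeline summarised in the abstract (an LSTM encoder augmented with an external memory, read via soft attention, regressing future positions) can be illustrated with a minimal toy sketch. This is not the authors' implementation: all class names, dimensions, and weights below are hypothetical and untrained, and the hierarchical/multimodal aspects are reduced to a single flat memory for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MemoryAugmentedLSTM:
    """Toy LSTM cell with an external memory read via soft attention.

    Illustrates the data flow only: encode the observed trajectory step by
    step, attend over stored memory slots, and regress the next 2-D position
    from the final hidden state and memory read.
    """

    def __init__(self, input_dim=2, hidden_dim=16, mem_slots=8):
        d = input_dim + hidden_dim + hidden_dim  # input + hidden + memory read
        self.W = rng.normal(0, 0.1, (4 * hidden_dim, d))  # gates: i, f, o, g
        self.b = np.zeros(4 * hidden_dim)
        self.memory = rng.normal(0, 0.1, (mem_slots, hidden_dim))
        self.W_out = rng.normal(0, 0.1, (2, 2 * hidden_dim))
        self.hidden_dim = hidden_dim

    def read_memory(self, h):
        # Soft attention over memory slots, keyed on the current hidden state.
        scores = self.memory @ h
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()
        return attn @ self.memory

    def step(self, x, h, c):
        r = self.read_memory(h)
        z = self.W @ np.concatenate([x, h, r]) + self.b
        H = self.hidden_dim
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
        g = np.tanh(z[3 * H:])
        c = f * c + i * g
        h = o * np.tanh(c)
        return h, c, r

    def predict_next(self, trajectory):
        # trajectory: (T, 2) array of observed (x, y) positions.
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        for x in trajectory:
            h, c, r = self.step(x, h, c)
        # Predict an offset from the last observed position.
        return trajectory[-1] + self.W_out @ np.concatenate([h, r])

model = MemoryAugmentedLSTM()
obs = np.stack([np.linspace(0, 1, 8), np.linspace(0, 0.5, 8)], axis=1)
pred = model.predict_next(obs)
print(pred.shape)  # (2,)
```

In the paper's full architecture the single memory matrix is replaced by a structured hierarchy, and a second encoder stream for the radar modality is fused with the video stream before the memory read; this sketch shows only the shared skeleton.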



This research was supported in part by the Defence Science and Technology (DST) Group under the Defence Science Partnership Program. The authors acknowledge the contribution to the paper by Dr. Jason Williams, Senior Research Scientist, National Security, Intelligence, Surveillance and Reconnaissance Division of DST.

Supplementary material

Supplementary material 1: 478880_1_En_15_MOESM1_ESM.pdf (5.1 MB)



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Tharindu Fernando (1) (email author)
  • Simon Denman (1)
  • Sridha Sridharan (1)
  • Clinton Fookes (1)

  1. Image and Video Research Laboratory, SAIVT Research Program, Queensland University of Technology, Brisbane, Australia
