Anticipating Next Goal for Robot Plan Prediction

  • Edoardo Alati
  • Lorenzo Mauro
  • Valsamis Ntouskos
  • Fiora Pirri
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1037)


Goal reasoning is a central concern in robot task execution. Here we propose a deep model that learns to infer the next goal while performing an activity. Because predicting the next goal state requires a robot language that is not comparable to sentences, we introduce a specific optimization metric tied to the representation the robot has of the scene. We tested the proposed method in a warehouse, with a humanoid robot performing tasks to assist a maintenance technician working at a production line.


Next-goal deep prediction; Deep learning; Robot perception; Robot planning



This research was funded by the H2020 Project Second Hands under grant agreement No. 643950.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Alcor LAB, Dipartimento di Ingegneria Informatica Automatica e Gestionale, University of Rome “Sapienza”, Rome, Italy
