Liquid Pouring Monitoring via Rich Sensory Inputs

  • Tz-Ying Wu
  • Juan-Ting Lin
  • Tsun-Hsuang Wang
  • Chan-Wei Hu
  • Juan Carlos Niebles
  • Min Sun
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11215)


Humans have the amazing ability to perform very subtle manipulation tasks using a closed-loop control system with imprecise mechanics (i.e., our body parts) but rich sensory information (e.g., vision and touch). In such a closed-loop system, the ability to monitor the state of the task via rich sensory information is important but often less studied. In this work, we take liquid pouring as a concrete example and aim to learn to continuously monitor whether liquid pouring is successful (e.g., no spilling) via rich sensory inputs. We mimic humans' rich senses using synchronized observations from a chest-mounted camera and a wrist-mounted IMU sensor. Given many success and failure demonstrations of liquid pouring, we train a hierarchical LSTM with late fusion for monitoring. To improve the robustness of the system, we propose two auxiliary tasks during training: (1) inferring the initial state of containers and (2) forecasting the one-step future 3D trajectory of the hand with an adversarial training procedure. These tasks encourage our method to learn representations sensitive to container states and to how objects are manipulated in 3D. With these novel components, our method achieves \(\sim\)8% and \(\sim\)11% better monitoring accuracy than the baseline method without auxiliary tasks on unseen containers and unseen users, respectively.
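
As a rough illustration of the late-fusion idea and the multi-task training objective described above, the following minimal sketch encodes each modality separately, concatenates the two feature vectors, and combines a monitoring loss with two auxiliary losses. It is not the authors' implementation: mean pooling stands in for the hierarchical LSTM encoders, the adversarial part of the forecasting task is not modeled, and every name and weight here (`mean_pool`, `late_fusion`, `w_init`, `w_forecast`) is a hypothetical choice made for this example.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Stand-in encoder (assumption): average per-step features over time.

    The paper uses LSTM encoders; mean pooling is used here only to keep
    the sketch dependency-free.
    """
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


def late_fusion(visual_frames: Sequence[Sequence[float]],
                imu_frames: Sequence[Sequence[float]]) -> Vector:
    """Encode each modality on its own, then concatenate the results."""
    return mean_pool(visual_frames) + mean_pool(imu_frames)


def success_probability(fused: Vector, weights: Vector, bias: float) -> float:
    """Logistic classifier on the fused features (hypothetical head)."""
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def total_loss(monitor_loss: float,
               init_state_loss: float,
               forecast_loss: float,
               w_init: float = 1.0,
               w_forecast: float = 1.0) -> float:
    """Monitoring loss plus weighted auxiliary losses.

    The weights are assumptions; the abstract does not state how the
    auxiliary terms are balanced.
    """
    return monitor_loss + w_init * init_state_loss + w_forecast * forecast_loss


# Toy inputs: two time steps of 2-D visual features and 1-D IMU features.
visual = [[1.0, 3.0], [3.0, 5.0]]
imu = [[0.0], [2.0]]
fused = late_fusion(visual, imu)
prob = success_probability(fused, [0.0, 0.0, 0.0], 0.0)
```

The design point of late fusion is that each sensor stream keeps its own encoder, so the camera and IMU features are only merged at the decision stage.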


Monitoring manipulation · Multimodal fusion · Auxiliary tasks



We thank Stanford University for the collaboration. We also thank MOST 107-2634-F-007-007, Panasonic, and MediaTek for their support.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Tz-Ying Wu (1)
  • Juan-Ting Lin (1)
  • Tsun-Hsuang Wang (1)
  • Chan-Wei Hu (1)
  • Juan Carlos Niebles (2)
  • Min Sun (1), corresponding author
  1. Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan
  2. Department of Computer Science, Stanford University, Stanford, USA
