
DLow: Diversifying Latent Flows for Diverse Human Motion Prediction

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12354)

Abstract

Deep generative models are often used for human motion prediction as they are able to model multi-modal data distributions and characterize diverse human behavior. While much care has been taken in designing and learning deep generative models, how to efficiently produce diverse samples from a deep generative model after it has been trained is still an under-explored problem. To obtain samples from a pretrained generative model, most existing generative human motion prediction methods draw a set of independent Gaussian latent codes and convert them to motion samples. This random sampling strategy is not guaranteed to produce diverse samples for two reasons: (1) independent sampling cannot force the samples to be diverse; (2) the sampling is based solely on likelihood, which may only produce samples that correspond to the major modes of the data distribution. To address these problems, we propose a novel sampling method, Diversifying Latent Flows (DLow), to produce a diverse set of samples from a pretrained deep generative model. Unlike random (independent) sampling, the proposed DLow sampling method samples a single random variable and then maps it with a set of learnable mapping functions to a set of correlated latent codes. The correlated latent codes are then decoded into a set of correlated samples. During training, DLow uses a diversity-promoting prior over samples as an objective to optimize the latent mappings to improve sample diversity. The design of the prior is highly flexible and can be customized to generate diverse motions with common features (e.g., similar leg motion but diverse upper-body motion). Our experiments demonstrate that DLow outperforms state-of-the-art baseline methods in terms of sample diversity and accuracy (Code: https://github.com/Khrylx/DLow. Video: https://youtu.be/64OEdSadb00).
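The core sampling idea described above can be sketched in a few lines: draw one Gaussian variable and push it through a set of learned affine mappings to obtain correlated latent codes, which a pretrained decoder would turn into motion samples, while a diversity-promoting loss spreads those samples apart. The sketch below is a minimal illustration, not the authors' implementation: `DLowSampler` and `diversity_loss` are hypothetical names, the mappings are simplified to diagonal affine transforms, the pretrained motion decoder is replaced by the identity, and an RBF energy over pairwise distances stands in for the paper's flexible diversity prior.

```python
import torch
import torch.nn as nn

class DLowSampler(nn.Module):
    """Map one Gaussian sample eps to K correlated latent codes
    z_k = A_k * eps + b_k via learnable per-sample affine mappings.
    (Illustrative sketch: A_k is restricted to a diagonal matrix here.)"""
    def __init__(self, nz: int, num_samples: int):
        super().__init__()
        self.nz = nz
        self.K = num_samples
        # one learnable scale and bias per latent code
        self.A = nn.Parameter(torch.ones(num_samples, nz))
        self.b = nn.Parameter(torch.zeros(num_samples, nz))

    def forward(self, eps: torch.Tensor) -> torch.Tensor:
        # eps: (batch, nz) -> correlated codes: (batch, K, nz)
        return self.A.unsqueeze(0) * eps.unsqueeze(1) + self.b.unsqueeze(0)

def diversity_loss(samples: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Stand-in diversity-promoting objective: an RBF energy over
    pairwise distances between samples (lower when samples spread out)."""
    B, K, _ = samples.shape
    diff = samples.unsqueeze(2) - samples.unsqueeze(1)   # (B, K, K, d)
    dist = diff.norm(dim=-1)                             # (B, K, K)
    off_diag = ~torch.eye(K, dtype=torch.bool)           # exclude self-pairs
    return torch.exp(-dist[:, off_diag].pow(2) / scale).mean()

torch.manual_seed(0)
sampler = DLowSampler(nz=8, num_samples=5)
eps = torch.randn(4, 8)               # one shared random draw per batch item
z = sampler(eps)                      # (4, 5, 8) correlated latent codes
# a pretrained CVAE decoder would map z to motions; identity stands in here
motions = z
loss = diversity_loss(motions)        # minimize alongside a prior-matching term
```

In the full method, this loss is combined with a term keeping each latent code close to the Gaussian prior, so the codes stay in a region the pretrained decoder handles well while still covering distinct modes.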

Keywords

Generative models · Diversity · Human motion forecasting

Supplementary material

Supplementary material 1 (mp4 88770 KB)

Supplementary material 2 (pdf 11250 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
