
Modeling Human Motion with Quaternion-Based Neural Networks

  • Published in: International Journal of Computer Vision

Abstract

Previous work on predicting or generating 3D human pose sequences regresses either joint rotations or joint positions. The former strategy is prone to error accumulation along the kinematic chain, as well as to discontinuities when Euler angles or exponential maps are used as parameterizations. The latter requires re-projection onto skeleton constraints to avoid bone stretching and invalid configurations. This work addresses both limitations: QuaterNet represents rotations with quaternions, and our loss function performs forward kinematics on a skeleton to penalize absolute position errors instead of angle errors. We investigate both recurrent and convolutional architectures and evaluate them on short-term prediction and long-term generation. For the latter, our approach is qualitatively judged to be as realistic as recent neural strategies from the graphics literature. Our experiments compare quaternions to Euler angles as well as exponential maps, and show that only a very short context is required to make reliable future predictions. Finally, we show that the standard evaluation protocol for Human3.6M produces high-variance results and we propose a simple solution.
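To make the abstract's central idea concrete, the sketch below shows how a positional loss can be computed from quaternion-parameterized joint rotations: predicted quaternions are normalized, propagated through forward kinematics along the kinematic chain, and the resulting joint positions are compared to the ground truth. This is a minimal NumPy sketch under assumed conventions, not the authors' implementation; the toy skeleton, bone offsets, and function names are illustrative.

```python
# Minimal sketch (assumption, not the paper's code): quaternion rotations,
# forward kinematics over a kinematic chain, and a loss on joint positions
# rather than on angles.
import numpy as np

def quat_normalize(q):
    """Project a 4-vector onto the unit sphere so it is a valid rotation."""
    return q / np.linalg.norm(q)

def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    u = np.array([x, y, z])
    # Expansion of q v q^{-1} for a unit quaternion.
    return v + 2.0 * np.cross(u, np.cross(u, v) + w * v)

def quat_multiply(a, b):
    """Hamilton product a * b, both in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ])

def forward_kinematics(rotations, offsets, parents):
    """Accumulate local rotations along the chain; return world joint positions.

    rotations: (J, 4) local joint rotations as unit quaternions
    offsets:   (J, 3) bone offsets expressed in the parent frame
    parents:   list of parent indices, -1 for the root
    """
    J = len(parents)
    positions = np.zeros((J, 3))
    world_rot = [None] * J
    for j in range(J):
        if parents[j] == -1:
            world_rot[j] = rotations[j]
            positions[j] = offsets[j]
        else:
            p = parents[j]
            world_rot[j] = quat_multiply(world_rot[p], rotations[j])
            positions[j] = positions[p] + quat_rotate(world_rot[p], offsets[j])
    return positions

def position_loss(pred_rotations, target_positions, offsets, parents):
    """Mean squared error on joint positions instead of an angle loss."""
    pred = forward_kinematics(
        np.array([quat_normalize(q) for q in pred_rotations]), offsets, parents)
    return np.mean((pred - target_positions) ** 2)

# Toy 3-joint chain (hip -> knee -> foot) with illustrative offsets.
parents = [-1, 0, 1]
offsets = np.array([[0.0, 0.0, 0.0], [0.0, -0.4, 0.0], [0.0, -0.4, 0.0]])
identity = np.array([1.0, 0.0, 0.0, 0.0])
target = forward_kinematics(np.stack([identity] * 3), offsets, parents)
print(position_loss(np.stack([identity] * 3), target, offsets, parents))  # 0.0
```

Penalizing positions in this way implicitly weights each joint's rotation error by how far it displaces the joints further down the chain, which is the error-accumulation effect the abstract describes for angle-based losses.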


Notes

  1. Reference implementation at https://github.com/asheshjain399/RNNexp/blob/srnn/structural_rnn/forecastTrajectories.py#L29.


Author information

Corresponding author

Correspondence to Michael Auli.

Additional information

Communicated by Ling Shao, Hubert P. H. Shum, Timothy Hospedales.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Dario Pavllo: Work done during an internship at Facebook AI Research. David Grangier: Work done while at Facebook AI Research.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 21805 KB)


About this article


Cite this article

Pavllo, D., Feichtenhofer, C., Auli, M. et al. Modeling Human Motion with Quaternion-Based Neural Networks. Int J Comput Vis 128, 855–872 (2020). https://doi.org/10.1007/s11263-019-01245-6

