Abstract
Previous work on predicting or generating 3D human pose sequences regresses either joint rotations or joint positions. The former strategy is prone to error accumulation along the kinematic chain, as well as discontinuities when using Euler angles or exponential maps as parameterizations. The latter requires re-projection onto skeleton constraints to avoid bone stretching and invalid configurations. This work addresses both limitations. QuaterNet represents rotations with quaternions and our loss function performs forward kinematics on a skeleton to penalize absolute position errors instead of angle errors. We investigate both recurrent and convolutional architectures and evaluate on short-term prediction and long-term generation. For the latter, our approach is qualitatively judged as realistic as recent neural strategies from the graphics literature. Our experiments compare quaternions to Euler angles as well as exponential maps and show that only a very short context is required to make reliable future predictions. Finally, we show that the standard evaluation protocol for Human3.6M produces high variance results and we propose a simple solution.
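The key idea in the abstract — parameterize joint rotations as quaternions, then run forward kinematics over the skeleton so the loss penalizes 3D position error rather than angle error — can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's implementation: the function names (`quat_mul`, `forward_kinematics`), the `(w, x, y, z)` quaternion convention, and the toy three-joint chain are all assumptions made here for clarity.

```python
import numpy as np

def quat_mul(q, r):
    # Hamilton product of two quaternions stored as (w, x, y, z).
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_rotate(q, v):
    # Rotate vector v by unit quaternion q via q * (0, v) * q^-1.
    qv = np.concatenate(([0.0], v))
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_mul(quat_mul(q, qv), q_conj)[1:]

def forward_kinematics(rotations, offsets, parents):
    """Global joint positions from per-joint local rotations.

    rotations: (J, 4) unit quaternions, one per joint.
    offsets:   (J, 3) constant bone offsets in the parent frame.
    parents:   parent index per joint, with parents[0] == -1 for the root.
    """
    positions = np.zeros((len(parents), 3))
    global_rot = [None] * len(parents)
    for j, p in enumerate(parents):
        if p == -1:
            global_rot[j] = rotations[j]
            positions[j] = offsets[j]
        else:
            # Accumulate rotation along the kinematic chain, then place
            # the joint at its parent's position plus the rotated bone.
            global_rot[j] = quat_mul(global_rot[p], rotations[j])
            positions[j] = positions[p] + quat_rotate(global_rot[p], offsets[j])
    return positions

def positional_loss(pred_rotations, target_positions, offsets, parents):
    # Mean Euclidean error on joint positions: because bone offsets are
    # fixed, predictions can never stretch bones, yet the loss still
    # measures error where it matters (in position space).
    pred = forward_kinematics(pred_rotations, offsets, parents)
    return np.mean(np.linalg.norm(pred - target_positions, axis=-1))
```

Because the bone offsets are constants of the skeleton, any quaternion output yields a valid pose by construction, which is why no re-projection step is needed; the forward-kinematics pass simply translates angle predictions into the position errors that the loss penalizes.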
Notes
Reference implementation at https://github.com/asheshjain399/RNNexp/blob/srnn/structural_rnn/forecastTrajectories.py#L29.
Additional information
Communicated by Ling Shao, Hubert P. H. Shum, Timothy Hospedales.
Dario Pavllo: Work done during an internship at Facebook AI Research. David Grangier: Work done while at Facebook AI Research.
Electronic supplementary material
Supplementary material 1 (mp4 21805 KB)
Cite this article
Pavllo, D., Feichtenhofer, C., Auli, M. et al. Modeling Human Motion with Quaternion-Based Neural Networks. Int J Comput Vis 128, 855–872 (2020). https://doi.org/10.1007/s11263-019-01245-6