Abstract
Previous work on predicting or generating 3D human pose sequences regresses either joint rotations or joint positions. The former strategy is prone to error accumulation along the kinematic chain, as well as discontinuities when using Euler angles or exponential maps as parameterizations. The latter requires re-projection onto skeleton constraints to avoid bone stretching and invalid configurations. This work addresses both limitations. QuaterNet represents rotations with quaternions and our loss function performs forward kinematics on a skeleton to penalize absolute position errors instead of angle errors. We investigate both recurrent and convolutional architectures and evaluate on short-term prediction and long-term generation. For the latter, our approach is qualitatively judged as realistic as recent neural strategies from the graphics literature. Our experiments compare quaternions to Euler angles as well as exponential maps and show that only a very short context is required to make reliable future predictions. Finally, we show that the standard evaluation protocol for Human3.6M produces high variance results and we propose a simple solution.
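The key idea in the abstract — parameterize joint rotations as quaternions, then run forward kinematics over the skeleton so the loss penalizes 3D position error rather than angle error — can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's implementation: the function names (`quat_mul`, `forward_kinematics`), the `(w, x, y, z)` quaternion convention, and the toy three-joint chain are all assumptions made here for clarity.

```python
import numpy as np

def quat_mul(q, r):
    # Hamilton product of two quaternions stored as (w, x, y, z).
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_rotate(q, v):
    # Rotate vector v by unit quaternion q via q * (0, v) * q^-1.
    qv = np.concatenate(([0.0], v))
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_mul(quat_mul(q, qv), q_conj)[1:]

def forward_kinematics(rotations, offsets, parents):
    """Global joint positions from per-joint local rotations.

    rotations: (J, 4) unit quaternions, one per joint.
    offsets:   (J, 3) constant bone offsets in the parent frame.
    parents:   parent index per joint, with parents[0] == -1 for the root.
    """
    positions = np.zeros((len(parents), 3))
    global_rot = [None] * len(parents)
    for j, p in enumerate(parents):
        if p == -1:
            global_rot[j] = rotations[j]
            positions[j] = offsets[j]
        else:
            # Accumulate rotation along the kinematic chain, then place
            # the joint at its parent's position plus the rotated bone.
            global_rot[j] = quat_mul(global_rot[p], rotations[j])
            positions[j] = positions[p] + quat_rotate(global_rot[p], offsets[j])
    return positions

def positional_loss(pred_rotations, target_positions, offsets, parents):
    # Mean Euclidean error on joint positions: because bone offsets are
    # fixed, predictions can never stretch bones, yet the loss still
    # measures error where it matters (in position space).
    pred = forward_kinematics(pred_rotations, offsets, parents)
    return np.mean(np.linalg.norm(pred - target_positions, axis=-1))
```

Because the bone offsets are constants of the skeleton, any quaternion output yields a valid pose by construction, which is why no re-projection step is needed; the forward-kinematics pass simply translates angle predictions into the position errors that the loss penalizes.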
Notes
Reference implementation at https://github.com/asheshjain399/RNNexp/blob/srnn/structural_rnn/forecastTrajectories.py#L29.
Additional information
Communicated by Ling Shao, Hubert P. H. Shum, Timothy Hospedales.
Dario Pavllo: Work done during an internship at Facebook AI Research. David Grangier: Work done while at Facebook AI Research.
Electronic supplementary material
Supplementary material 1 (mp4 21805 KB)
Cite this article
Pavllo, D., Feichtenhofer, C., Auli, M. et al. Modeling Human Motion with Quaternion-Based Neural Networks. Int J Comput Vis 128, 855–872 (2020). https://doi.org/10.1007/s11263-019-01245-6