International Journal of Computer Vision, Volume 113, Issue 1, pp 19–36

Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network

Abstract

We propose a heterogeneous multi-task learning framework for human pose estimation from monocular images using a deep convolutional neural network. In particular, we simultaneously learn a human pose regressor and sliding-window body-part and joint-point detectors in a deep network architecture. We show that including the detection tasks helps to regularize the network, directing it to converge to a good solution. We report competitive and state-of-the-art results on several datasets. We also empirically show that the learned neurons in the middle layer of our network are tuned to localized body parts.
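
To illustrate the idea of training a regressor and detectors together, the following is a minimal sketch of a combined objective. It is not the formulation used in the article: the abstract does not specify the loss functions, so the squared-error regression term, the binary cross-entropy detection term, the `det_weight` parameter, and all function names are assumptions chosen for illustration.

```python
import math


def regression_loss(pred_joints, true_joints):
    """Mean squared distance between predicted and true (x, y) joints.

    Assumed loss form; the abstract does not state the regression loss.
    """
    sq = [(px - tx) ** 2 + (py - ty) ** 2
          for (px, py), (tx, ty) in zip(pred_joints, true_joints)]
    return sum(sq) / len(sq)


def detection_loss(pred_probs, labels, eps=1e-12):
    """Mean binary cross-entropy over sliding-window detections.

    pred_probs: predicted probability that a window contains the part.
    labels:     1 if the window contains the part, else 0.
    Assumed loss form; the abstract does not state the detection loss.
    """
    terms = [-(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
             for p, y in zip(pred_probs, labels)]
    return sum(terms) / len(terms)


def multitask_loss(pred_joints, true_joints, pred_probs, labels,
                   det_weight=1.0):
    """Weighted sum of the two task losses (det_weight is hypothetical)."""
    return (regression_loss(pred_joints, true_joints)
            + det_weight * detection_loss(pred_probs, labels))


if __name__ == "__main__":
    joints_pred = [(0.0, 0.0), (1.0, 1.0)]
    joints_true = [(3.0, 4.0), (1.0, 1.0)]
    window_probs = [0.9, 0.2, 0.5]
    window_labels = [1, 0, 1]
    print(multitask_loss(joints_pred, joints_true,
                         window_probs, window_labels, det_weight=0.5))
```

In a sketch like this, both terms would be backpropagated through shared network layers, so the detection term acts as an extra training signal for the features the regressor relies on. Setting `det_weight` to zero recovers plain pose regression.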

Keywords

Human Pose Estimation, Deep Learning

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Department of Computer Science, City University of Hong Kong, Hong Kong, China
  2. School of Creative Media (SCM), City University of Hong Kong, Hong Kong, China
  3. Department of Computer Science, Multimedia software Engineering Research Centre (MERC), City University of Hong Kong, Hong Kong, China
  4. Multimedia software Engineering Research Centre (MERC), Shenzhen, China