Abstract
We consider the problem of parsing human poses and recognizing actions in static images with part-based models. Most previous work on part-based models considers only rigid parts (e.g., torso, head, half limbs) guided by human anatomy. We argue that this representation of parts is not necessarily appropriate. In this paper, we introduce hierarchical poselets, a new representation for modeling the pose configuration of human bodies. Hierarchical poselets can be rigid parts, but they can also be parts that cover large portions of the human body (e.g., torso + left arm). In the extreme case, they can cover the whole body. The poselets are organized hierarchically via a structured model. Human parsing can be achieved by inferring the optimal labeling of this hierarchical model. The pose information captured by this hierarchical model can also be used as an intermediate representation for other high-level tasks. We demonstrate this on action recognition from static images.
Editors: Isabelle Guyon and Vassilis Athitsos.
Notes
1. Both data sets can be downloaded from http://vision.cs.uiuc.edu/humanparse.
2. A small number of the images and annotations we obtained from the authors of Yang et al. (2010) were corrupted by a file-system failure. We have removed those images from the data set.
References
M. Andriluka, S. Roth, B. Schiele, Pictorial structures revisited: people detection and articulated pose estimation, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009
L. Bourdev, J. Malik, Poselets: body part detectors trained using 3d human pose annotations, in IEEE International Conference on Computer Vision, 2009
L. Bourdev, S. Maji, T. Brox, J. Malik, Detecting people using mutually consistent poselet activations, in European Conference on Computer Vision, 2010
C.K. Chow, C.N. Liu, Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), 462–467 (1968)
N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005
V. Delaitre, I. Laptev, J. Sivic, Recognizing human actions in still images: a study of bag-of-features and part-based representations, in British Machine Vision Conference, 2010
C. Desai, D. Ramanan, C. Fowlkes, Discriminative models for static human-object interactions, in Workshop on Structured Models in Computer Vision, 2010
P. Dollár, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in ICCV’05 Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005
A.A. Efros, A.C. Berg, G. Mori, J. Malik, Recognizing action at a distance, in IEEE International Conference on Computer Vision, 2003, pp. 726–733
M. Eichner, V. Ferrari, Better appearance models for pictorial structures, in British Machine Vision Conference, 2009
P.F. Felzenszwalb, D.P. Huttenlocher, Pictorial structures for object recognition. Int. J. Comput. Vis. 61(1), 55–79 (2005)
P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010)
V. Ferrari, M. Marín-Jiménez, A. Zisserman, Progressive search space reduction for human pose estimation, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008
V. Ferrari, M. Marín-Jiménez, A. Zisserman, Pose search: retrieving people using their pose, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009
D.A. Forsyth, O. Arikan, L. Ikemoto, J. O’Brien, D. Ramanan, Computational studies of human motion: part 1, tracking and motion synthesis. Found. Trends Comput. Gr. Vis. 1(2/3), 77–254 (2006)
A. Gupta, A. Kembhavi, L.S. Davis, Observing human-object interactions: using spatial and functional compatibility for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31(10), 1775–1789 (2009)
N. Ikizler, R. Gokberk Cinbis, S. Pehlivan, P. Duygulu, Recognizing actions from still images, in International Conference on Pattern Recognition, 2008
N. Ikizler-Cinbis, R. Gokberk Cinbis, S. Sclaroff, Learning actions from the web, in IEEE International Conference on Computer Vision, 2009
H. Jiang, D.R. Martin, Global pose estimation using non-tree models, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008
T. Joachims, T. Finley, C.-N. Yu, Cutting-plane training of structural SVMs, in Machine Learning, 2008
S. Johnson, M. Everingham, Combining discriminative appearance and segmentation cues for articulated human pose estimation, in International Workshop on Machine Learning for Vision-based Motion Analysis, 2009
S. Johnson, M. Everingham, Clustered pose and nonlinear appearance models for human pose estimation, in British Machine Vision Conference, 2010
S.X. Ju, M.J. Black, Y. Yacoob, Cardboard people: a parameterized model of articulated image motion, in International Conference on Automatic Face and Gesture Recognition, 1996, pp. 38–44
Y. Ke, R. Sukthankar, M. Hebert, Event detection in crowded videos, in IEEE International Conference on Computer Vision, 2007
M.P. Kumar, A. Zisserman, P.H.S. Torr, Efficient discriminative learning of parts-based models, in IEEE International Conference on Computer Vision, 2009
T. Lan, Y. Wang, W. Yang, G. Mori, Beyond actions: discriminative models for contextual group activities, in Advances in Neural Information Processing Systems (MIT Press, 2010)
X. Lan, D.P. Huttenlocher, Beyond trees: common-factor models for 2d human pose recovery. IEEE Int. Conf. Comput. Vis. 1, 470–477 (2005)
I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008
S. Maji, L. Bourdev, J. Malik, Action recognition from a distributed representation of pose and appearance, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011
D. Marr, A Computational Investigation into the Human Representation and Processing of Visual Information (W. H. Freeman, San Francisco, 1982)
G. Mori, Guiding model search using segmentation. IEEE Int. Conf. Comput. Vis. 2, 1417–1423 (2005)
G. Mori, J. Malik, Estimating human body configurations using shape context matching. Eur. Conf. Comput. Vis. 3, 666–680 (2002)
G. Mori, X. Ren, A. Efros, J. Malik, Recovering human body configuration: combining segmentation and recognition. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2, 326–333 (2004)
J.C. Niebles, L. Fei-Fei, A hierarchical model of shape and appearance for human action classification, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007
J.C. Niebles, H. Wang, L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, in British Machine Vision Conference, vol. 3, 2006, pp. 1249–1258
D. Ramanan, Learning to parse images of articulated bodies. Adv. Neural Inf. Process. Syst. 19, 1129–1136 (2006)
D. Ramanan, C. Sminchisescu, Training deformable models for localization. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1, 206–213 (2006)
D. Ramanan, D.A. Forsyth, A. Zisserman, Strike a pose: tracking people by finding stylized poses. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1, 271–278 (2005)
X. Ren, A. Berg, J. Malik, Recovering human body configurations using pairwise constraints between parts. IEEE Int. Conf. Comput. Vis. 1, 824–831 (2005)
B. Sapp, C. Jordan, B. Taskar, Adaptive pose priors for pictorial structures, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010a
B. Sapp, A. Toshev, B. Taskar, Cascaded models for articulated pose estimation, in European Conference on Computer Vision, 2010b
G. Shakhnarovich, P. Viola, T. Darrell, Fast pose estimation with parameter sensitive hashing. IEEE Int. Conf. Comput. Vis. 2, 750–757 (2003)
L. Sigal, M.J. Black, Measure locally, reason globally: occlusion-sensitive articulated pose estimation. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2, 2041–2048 (2006)
V.K. Singh, R. Nevatia, C. Huang, Efficient inference with multiple heterogeneous part detectors for human pose estimation, in European Conference on Computer Vision, 2010
P. Srinivasan, J. Shi, Bottom-up recognition and parsing of the human body, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007
J. Sullivan, S. Carlsson, Recognizing and tracking human action, in European Conference on Computer Vision LNCS 2352, vol. 1, 2002, pp. 629–644
M. Sun, S. Savarese, Articulated part-based model for joint object detection and pose estimation, in IEEE International Conference on Computer Vision, 2011
T.-P. Tian, S. Sclaroff, Fast globally optimal 2d human detection with loopy graph models, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010
K. Toyama, A. Blake, Probabilistic exemplar-based tracking in a metric space. IEEE Int. Conf. Comput. Vis. 2, 50–57 (2001)
D. Tran, D. Forsyth, Improved human parsing with a full relational model, in European Conference on Computer Vision, 2010
I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun, Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res. 6, 1453–1484 (2005)
Y. Wang, G. Mori, Multiple tree models for occlusion and spatial constraints in human pose estimation, in European Conference on Computer Vision, 2008
Y. Wang, H. Jiang, M.S. Drew, Z.-N. Li, G. Mori, Unsupervised discovery of action classes, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006
Y. Wang, D. Tran, Z. Liao, Learning hierarchical poselets for human parsing, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011
W. Yang, Y. Wang, G. Mori, Recognizing human actions from still images with latent poses, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010
Y. Yang, D. Ramanan, Articulated pose estimation with flexible mixtures-of-parts, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011
B. Yao, L. Fei-Fei, Modeling mutual context of object and human pose in human–object interaction activities, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010
L. Zhu, Y. Chen, Y. Lu, C. Lin, A. Yuille, Max margin AND/OR graph learning for parsing the human body, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008
Acknowledgements
This work was supported in part by NSF under IIS-0803603 and IIS-1029035, and by ONR under N00014-01-1-0890 and N00014-10-1-0934 as part of the MURI program. Yang Wang was also supported in part by an NSERC postdoc fellowship when the work was done. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of NSF, ONR, or NSERC.
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Wang, Y., Tran, D., Liao, Z., Forsyth, D. (2017). Discriminative Hierarchical Part-Based Models for Human Parsing and Action Recognition. In: Escalera, S., Guyon, I., Athitsos, V. (eds) Gesture Recognition. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-57021-1_9
DOI: https://doi.org/10.1007/978-3-319-57021-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57020-4
Online ISBN: 978-3-319-57021-1
eBook Packages: Computer Science (R0)