Abstract
In speech recognition, phonemes have proven effective for modeling the words of a language. While phonemes are well defined for spoken languages, their extension to human actions is not straightforward. In this paper, we study such an extension and propose an unsupervised framework to find phoneme-like units for actions, which we call actemes, using 3D data and without any prior assumptions. To this end, we build on a framework previously proposed in the speech literature to automatically find actemes in the training data. We show experimentally that actions defined in terms of actemes and actions modeled as whole units give similar recognition results. We then define actions outside the training set in terms of these actemes to test whether the actemes generalize to unseen actions. The results show that although the acteme definitions of the actions are not always semantically meaningful, they yield good recognition accuracy and constitute a promising direction of research for action modeling.
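To illustrate the kind of automatic unit discovery the abstract alludes to, the sketch below segments a feature sequence into contiguous pieces by minimum-variance dynamic programming, in the spirit of maximum-likelihood segmentation used for automatically derived units in speech. This is an illustrative stand-in, not the authors' actual algorithm: the function name `segment_sequence` and the variance-based segment cost are assumptions made for the example.

```python
import numpy as np

def segment_sequence(features, num_segments):
    """Split a (T, D) feature sequence into `num_segments` contiguous
    segments minimizing total within-segment variance, via dynamic
    programming. Returns the end index of each segment."""
    T, D = features.shape
    # Prefix sums of x and x^2 let us compute any segment's cost in O(D).
    prefix = np.vstack([np.zeros(D), np.cumsum(features, axis=0)])
    prefix2 = np.vstack([np.zeros(D), np.cumsum(features ** 2, axis=0)])

    def seg_cost(i, j):
        # Sum of squared deviations of frames i..j-1 from their mean.
        n = j - i
        s = prefix[j] - prefix[i]
        s2 = prefix2[j] - prefix2[i]
        return float(np.sum(s2 - s ** 2 / n))

    INF = float("inf")
    dp = np.full((num_segments + 1, T + 1), INF)   # dp[k][j]: best cost of
    back = np.zeros((num_segments + 1, T + 1), int)  # k segments over frames 0..j-1
    dp[0][0] = 0.0
    for k in range(1, num_segments + 1):
        for j in range(k, T + 1):
            for i in range(k - 1, j):
                c = dp[k - 1][i] + seg_cost(i, j)
                if c < dp[k][j]:
                    dp[k][j] = c
                    back[k][j] = i
    # Backtrack to recover segment boundaries.
    bounds, j = [], T
    for k in range(num_segments, 0, -1):
        bounds.append(j)
        j = back[k][j]
    return sorted(bounds)
```

In a full pipeline along these lines, segments pooled from all training sequences would then be clustered (for instance with k-means on per-segment statistics) so that each cluster plays the role of one acteme, and actions are rewritten as sequences of cluster labels.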
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Kulkarni, K., Boyer, E., Horaud, R., Kale, A. (2011). An Unsupervised Framework for Action Recognition Using Actemes. In: Kimmel, R., Klette, R., Sugimoto, A. (eds) Computer Vision – ACCV 2010. Lecture Notes in Computer Science, vol 6495. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19282-1_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19281-4
Online ISBN: 978-3-642-19282-1