Abstract
Recent progress in pose-estimation methods enables the extraction of sufficiently precise 3D human skeleton data from ordinary videos, which offers great opportunities for a wide range of applications. However, such spatio-temporal data are typically extracted as a continuous skeleton sequence without any semantic segmentation or annotation. To make the extracted data reusable for further processing, there is a need to access them based on their content. In this paper, we introduce a universal retrieval approach that compares any two skeleton sequences based on the temporal order and similarities of their underlying segments. The similarity of segments is determined by their content-preserving low-dimensional code representation, which is learned using the Variational AutoEncoder principle in an unsupervised way. The quality of the proposed representation is validated in retrieval and classification scenarios; our proposal outperforms the state-of-the-art approaches in effectiveness and reaches speed-ups of up to 64× on common skeleton sequence datasets.
J. Sedmidubsky and F. Carrara contributed equally to this work.
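The abstract describes comparing two skeleton sequences via the temporal order and similarities of their segments, with each segment represented by a low-dimensional code. The paper's exact alignment measure is not reproduced here; the sketch below shows one plausible realization, assuming segments have already been encoded into fixed-size code vectors (e.g., by a VAE encoder), and aligning the two code sequences with plain dynamic time warping over Euclidean distances. The function name `dtw_distance` and the 8-dimensional codes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dtw_distance(codes_a, codes_b):
    """Order-preserving distance between two sequences of segment
    code vectors (one code per row) via dynamic time warping."""
    n, m = len(codes_a), len(codes_b)
    # Pairwise Euclidean distances between all segment-code pairs.
    cost = np.linalg.norm(codes_a[:, None, :] - codes_b[None, :, :], axis=2)
    # Accumulated-cost matrix with a padded border of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # stretch: repeat a segment of A
                acc[i, j - 1],      # stretch: repeat a segment of B
                acc[i - 1, j - 1],  # match the two segments
            )
    return acc[n, m]

# Toy usage: 5 segments with hypothetical 8-dim codes.
rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 8))
print(dtw_distance(seq, seq))                        # identical sequences align at zero cost
print(dtw_distance(seq, rng.normal(size=(7, 8))))    # differing sequences yield a positive cost
```

Because DTW only allows monotone alignments, the measure respects the temporal order of segments while tolerating variable segment counts between the two sequences.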
Acknowledgements
This research was supported by ERDF “CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence” (No. CZ.02.1.01/0.0/0.0/16_019/0000822), by AI4Media - A European Excellence Centre for Media, Society, and Democracy (EC, H2020 n. 951911), and by SUN - Social and hUman ceNtered XR (EC, Horizon Europe n. 101092612).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sedmidubsky, J., Carrara, F., Amato, G. (2023). SegmentCodeList: Unsupervised Representation Learning for Human Skeleton Data Retrieval. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13981. Springer, Cham. https://doi.org/10.1007/978-3-031-28238-6_8
DOI: https://doi.org/10.1007/978-3-031-28238-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28237-9
Online ISBN: 978-3-031-28238-6
eBook Packages: Computer Science, Computer Science (R0)