SegmentCodeList: Unsupervised Representation Learning for Human Skeleton Data Retrieval

  • Conference paper
  • In: Advances in Information Retrieval (ECIR 2023)

Abstract

Recent progress in pose-estimation methods enables the extraction of sufficiently precise 3D human skeleton data from ordinary videos, which opens up a wide range of applications. However, such spatio-temporal data are typically extracted as a continuous skeleton sequence without any semantic segmentation or annotation. To make the extracted data reusable for further processing, they must be accessible based on their content. In this paper, we introduce a universal retrieval approach that compares any two skeleton sequences based on the temporal order and similarities of their underlying segments. The similarity of segments is determined by their content-preserving, low-dimensional code representation, learned in an unsupervised way using the Variational AutoEncoder principle. The quality of the proposed representation is validated in retrieval and classification scenarios; our proposal outperforms state-of-the-art approaches in effectiveness and achieves speed-ups of up to 64x on common skeleton-sequence datasets.

J. Sedmidubsky and F. Carrara contributed equally.
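The abstract's comparison idea, matching segments in temporal order by the similarity of their learned codes, can be illustrated with a small dynamic-time-warping sketch. Everything below (the function name, the code dimensionality, the random toy data) is illustrative only and not taken from the paper, which does not expose its implementation here.

```python
import math
import random

def dtw_distance(codes_a, codes_b):
    """Dynamic-time-warping distance between two ordered lists of
    low-dimensional segment codes: respects the temporal order of
    segments while allowing local stretching of either sequence."""
    n, m = len(codes_a), len(codes_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between two segment codes acts as
            # the local (segment-level) similarity measure.
            d = math.dist(codes_a[i - 1], codes_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a segment in A
                                 cost[i][j - 1],      # skip a segment in B
                                 cost[i - 1][j - 1])  # match segments
    return cost[n][m]

# Toy usage: random 8-dimensional vectors stand in for the learned
# VAE segment codes (sizes are illustrative only).
random.seed(0)
seq_a = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
seq_b = [[x + random.gauss(0, 0.01) for x in code] for code in seq_a]
assert dtw_distance(seq_a, seq_a) == 0.0   # identical sequences
assert dtw_distance(seq_a, seq_b) < 1.0    # a slight perturbation stays close
```

A plain DTW over code vectors is only one plausible way to realize "comparison by temporal order and segment similarity"; the paper's actual matching scheme may differ.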



Acknowledgements

This research was supported by ERDF “CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence” (No. CZ.02.1.01/0.0/0.0/16_019/0000822), by AI4Media - A European Excellence Centre for Media, Society, and Democracy (EC, H2020 n. 951911), and by SUN - Social and hUman ceNtered XR (EC, Horizon Europe n. 101092612).

Author information

Correspondence to Jan Sedmidubsky.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Sedmidubsky, J., Carrara, F., Amato, G. (2023). SegmentCodeList: Unsupervised Representation Learning for Human Skeleton Data Retrieval. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13981. Springer, Cham. https://doi.org/10.1007/978-3-031-28238-6_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28237-9

  • Online ISBN: 978-3-031-28238-6
