Abstract
Recent progress in pose-estimation methods enables the extraction of sufficiently precise 3D human skeleton data from ordinary videos, which offers great opportunities for a wide range of applications. However, such spatio-temporal data are typically extracted as a continuous skeleton sequence without any semantic segmentation or annotation. To make the extracted data reusable for further processing, there is a need to access them based on their content. In this paper, we introduce a universal retrieval approach that compares any two skeleton sequences based on the temporal order and similarities of their underlying segments. The similarity of segments is determined by their content-preserving low-dimensional code representation, which is learned using the Variational AutoEncoder principle in an unsupervised way. The quality of the proposed representation is validated in retrieval and classification scenarios; our proposal outperforms the state-of-the-art approaches in effectiveness and reaches speed-ups of up to 64× on common skeleton sequence datasets.
J. Sedmidubsky and F. Carrara contributed equally to this work.
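The abstract describes comparing two skeleton sequences via the temporal order and similarities of their segments, with each segment represented by a low-dimensional code. The paper's exact alignment measure is not reproduced here; the sketch below shows one plausible realization, assuming segments have already been encoded into fixed-size code vectors (e.g., by a VAE encoder), and aligning the two code sequences with plain dynamic time warping over Euclidean distances. The function name `dtw_distance` and the 8-dimensional codes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dtw_distance(codes_a, codes_b):
    """Order-preserving distance between two sequences of segment
    code vectors (one code per row) via dynamic time warping."""
    n, m = len(codes_a), len(codes_b)
    # Pairwise Euclidean distances between all segment-code pairs.
    cost = np.linalg.norm(codes_a[:, None, :] - codes_b[None, :, :], axis=2)
    # Accumulated-cost matrix with a padded border of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # stretch: repeat a segment of A
                acc[i, j - 1],      # stretch: repeat a segment of B
                acc[i - 1, j - 1],  # match the two segments
            )
    return acc[n, m]

# Toy usage: 5 segments with hypothetical 8-dim codes.
rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 8))
print(dtw_distance(seq, seq))                        # identical sequences align at zero cost
print(dtw_distance(seq, rng.normal(size=(7, 8))))    # differing sequences yield a positive cost
```

Because DTW only allows monotone alignments, the measure respects the temporal order of segments while tolerating variable segment counts between the two sequences.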
Acknowledgements
This research was supported by ERDF “CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence” (No. CZ.02.1.01/0.0/0.0/16_019/0000822), by AI4Media - A European Excellence Centre for Media, Society, and Democracy (EC, H2020 n. 951911), and by SUN - Social and hUman ceNtered XR (EC, Horizon Europe n. 101092612).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sedmidubsky, J., Carrara, F., Amato, G. (2023). SegmentCodeList: Unsupervised Representation Learning for Human Skeleton Data Retrieval. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13981. Springer, Cham. https://doi.org/10.1007/978-3-031-28238-6_8
DOI: https://doi.org/10.1007/978-3-031-28238-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28237-9
Online ISBN: 978-3-031-28238-6
eBook Packages: Computer Science, Computer Science (R0)