
Towards Efficient Human Action Retrieval Based on Triplet-Loss Metric Learning

Part of the Lecture Notes in Computer Science book series (LNCS, volume 13426)


Recent pose-estimation methods enable digitization of human motion by extracting 3D skeleton sequences from ordinary video recordings. Such spatio-temporal skeleton representation offers attractive possibilities for a wide range of applications but, at the same time, requires effective and efficient content-based access to make the extracted data reusable. In this paper, we focus on content-based retrieval of pre-segmented skeleton sequences of human actions to identify the most similar ones to a query action. We mainly deal with the extraction of content-preserving action features, which are learned using the triplet-loss approach in an unsupervised way. Such features are (1) effective as they achieve a similar retrieval quality as the features learned in a supervised way, and (2) of a fixed size which enables the application of indexing structures for efficient retrieval.


  • Human motion data
  • Skeleton sequences
  • Action similarity
  • Action retrieval
  • Triplet-loss learning
  • LSTM
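The core idea in the abstract — learning fixed-size action embeddings with a triplet loss so that similar actions end up close in feature space, then retrieving nearest neighbors — can be illustrated with a minimal sketch. This is a generic illustration, not the authors' implementation: the LSTM encoder is replaced by toy random vectors standing in for encoded skeleton sequences, and the names `triplet_loss` and `retrieve` are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss: pull the anchor toward the positive
    (same action) and push it away from the negative (different action)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

def retrieve(query, database, k=1):
    """Brute-force k-NN over fixed-size embeddings; because the features
    have a fixed dimensionality, an index structure could replace this
    linear scan for large collections."""
    dists = np.linalg.norm(database - query, axis=1)
    return np.argsort(dists)[:k]

# Toy 128-d embeddings standing in for LSTM-encoded skeleton sequences.
rng = np.random.default_rng(0)
a = rng.normal(size=128)                 # anchor action
p = a + 0.01 * rng.normal(size=128)      # slight variation of the same action
n = rng.normal(size=128)                 # unrelated action

loss_good = triplet_loss(a, p, n)        # correctly ordered triplet: zero loss
loss_bad = triplet_loss(a, n, p)         # swapped roles: large penalty

db = np.stack([p, n, a + 0.02 * rng.normal(size=128)])
top = retrieve(a, db, k=1)               # the unrelated action is not nearest
```

In training, the loss gradient would be backpropagated through the encoder so that genuinely similar skeleton sequences (not just perturbed copies, as here) map to nearby embeddings.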

Supported by ERDF “CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence” (No. CZ.02.1.01/0.0/0.0/16_019/0000822).




Author information
Corresponding author

Correspondence to Iris Kico .


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kico, I., Sedmidubsky, J., Zezula, P. (2022). Towards Efficient Human Action Retrieval Based on Triplet-Loss Metric Learning. In: Strauss, C., Cuzzocrea, A., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2022. Lecture Notes in Computer Science, vol 13426. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12422-8

  • Online ISBN: 978-3-031-12423-5

  • eBook Packages: Computer Science, Computer Science (R0)