
Towards Efficient Human Action Retrieval Based on Triplet-Loss Metric Learning

Part of the Lecture Notes in Computer Science book series (LNCS, volume 13426)


Recent pose-estimation methods enable digitization of human motion by extracting 3D skeleton sequences from ordinary video recordings. Such spatio-temporal skeleton representation offers attractive possibilities for a wide range of applications but, at the same time, requires effective and efficient content-based access to make the extracted data reusable. In this paper, we focus on content-based retrieval of pre-segmented skeleton sequences of human actions to identify the most similar ones to a query action. We mainly deal with the extraction of content-preserving action features, which are learned using the triplet-loss approach in an unsupervised way. Such features are (1) effective as they achieve a similar retrieval quality as the features learned in a supervised way, and (2) of a fixed size which enables the application of indexing structures for efficient retrieval.


  • Human motion data
  • Skeleton sequences
  • Action similarity
  • Action retrieval
  • Triplet-loss learning
  • LSTM
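The core idea in the abstract — learning fixed-size action embeddings with a triplet loss so that similar actions end up close in feature space, then retrieving nearest neighbors — can be illustrated with a minimal sketch. This is a generic illustration, not the authors' implementation: the LSTM encoder is replaced by toy random vectors standing in for encoded skeleton sequences, and the names `triplet_loss` and `retrieve` are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss: pull the anchor toward the positive
    (same action) and push it away from the negative (different action)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

def retrieve(query, database, k=1):
    """Brute-force k-NN over fixed-size embeddings; because the features
    have a fixed dimensionality, an index structure could replace this
    linear scan for large collections."""
    dists = np.linalg.norm(database - query, axis=1)
    return np.argsort(dists)[:k]

# Toy 128-d embeddings standing in for LSTM-encoded skeleton sequences.
rng = np.random.default_rng(0)
a = rng.normal(size=128)                 # anchor action
p = a + 0.01 * rng.normal(size=128)      # slight variation of the same action
n = rng.normal(size=128)                 # unrelated action

loss_good = triplet_loss(a, p, n)        # correctly ordered triplet: zero loss
loss_bad = triplet_loss(a, n, p)         # swapped roles: large penalty

db = np.stack([p, n, a + 0.02 * rng.normal(size=128)])
top = retrieve(a, db, k=1)               # the unrelated action is not nearest
```

In training, the loss gradient would be backpropagated through the encoder so that genuinely similar skeleton sequences (not just perturbed copies, as here) map to nearby embeddings.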

Supported by ERDF “CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence” (No. CZ.02.1.01/0.0/0.0/16_019/0000822).




Author information
Corresponding author

Correspondence to Iris Kico .


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kico, I., Sedmidubsky, J., Zezula, P. (2022). Towards Efficient Human Action Retrieval Based on Triplet-Loss Metric Learning. In: Strauss, C., Cuzzocrea, A., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2022. Lecture Notes in Computer Science, vol 13426. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12422-8

  • Online ISBN: 978-3-031-12423-5

  • eBook Packages: Computer Science, Computer Science (R0)