Learning shared embedding representation of motion and text using contrastive learning

Original Article · Artificial Life and Robotics

Abstract

Multimodal learning of motion and text aims to find correspondences between skeletal time-series data acquired by motion capture and the text that describes the motion. Good associations enable both motion-to-text and text-to-motion applications. However, previous methods failed to associate motion with text at the level of descriptive detail, for example, whether the left or the right arm moves. In this paper, we propose a motion-text contrastive learning method that establishes correspondences between motion and text in a shared embedding space. We show that our model outperforms previous studies on the action recognition task, and we qualitatively show that, by using a pre-trained text encoder, our model can perform motion retrieval with detailed correspondence between motion and text.
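The abstract describes a contrastive objective that pulls paired motion and text embeddings together in a shared space. The following is a minimal sketch of such a symmetric InfoNCE loss in PyTorch, in the style popularized by CLIP; the function name, the temperature value, and the assumption that both encoders project into a common embedding dimension are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def motion_text_contrastive_loss(motion_emb, text_emb, temperature=0.07):
        # motion_emb, text_emb: (batch, dim) outputs of a motion encoder and
        # a text encoder, projected into the shared embedding space.
        # L2-normalize so dot products become cosine similarities.
        motion_emb = F.normalize(motion_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # (batch, batch) similarity matrix; the diagonal holds the positive
        # (correctly paired) motion-text combinations.
        logits = motion_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)

        # Symmetric cross-entropy: motion-to-text and text-to-motion.
        loss_m2t = F.cross_entropy(logits, targets)
        loss_t2m = F.cross_entropy(logits.t(), targets)
        return (loss_m2t + loss_t2m) / 2

Under this kind of objective, text-to-motion retrieval at inference time reduces to ranking stored motion embeddings by cosine similarity to the query sentence's embedding and returning the nearest clips.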


Data Availability

The BABEL dataset used in the experiment is publicly available.


Author information

Corresponding author: Junpei Horie.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was submitted and accepted for the Journal Track of the joint symposium of the 28th International Symposium on Artificial Life and Robotics, the 8th International Symposium on BioComplexity, and the 6th International Symposium on Swarm Behavior and Bio-Inspired Robotics (Beppu, Oita, January 25–27, 2023).

About this article


Cite this article

Horie, J., Noguchi, W., Iizuka, H. et al. Learning shared embedding representation of motion and text using contrastive learning. Artif Life Robotics 28, 148–157 (2023). https://doi.org/10.1007/s10015-022-00840-0
