Realistic video generation for American Sign Language

Multimedia Tools and Applications

Abstract

There are many ways to generate sign language videos, but most are based on 3D character modeling. Such methods are time-consuming and labor-intensive, and their results are hard to compare with videos of real signers in terms of realism and naturalness. To address this, we propose a novel approach that uses the recently popular generative adversarial network to synthesize sentence-level videos from word-level videos. A pose transition estimation measures the distance between sign language clips and synthesizes the corresponding transition skeletons. In particular, we adopt an interpolation approach, as it is faster than a graphics-based approach and does not require additional datasets. In addition, we propose a stacked approach for the Vid2Vid model: two Vid2Vid models are stacked to generate videos in two stages. The first stage generates IUV images (three-channel images composed of the body-part index I and the UV texture coordinates) from skeleton images, and the second stage generates a realistic video from the skeleton images and the IUV images. We use the American Sign Language Lexicon Video Dataset (ASLLVD) in our experiments and find that when the skeletons are generated by our pose transition estimation method, the video quality is better than that of direct generation using only the skeletons. Finally, we develop a graphical user interface that allows users to drag and drop clips onto the video track and generate a realistic sign language video.
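To make the pose transition estimation concrete, below is a minimal sketch of the interpolation idea described above: transition skeletons are synthesized by linearly interpolating between the last pose of one word-level clip and the first pose of the next, with the number of transition frames tied to the joint-space distance between the two poses. This is an illustrative sketch, not the paper's exact implementation; the function names and the frames_per_unit parameter are assumptions.

    # Illustrative sketch only: hypothetical names and parameters,
    # not the paper's exact pose transition implementation.
    import numpy as np

    def estimate_transition_length(pose_a, pose_b, frames_per_unit=30.0,
                                   min_frames=2, max_frames=15):
        """Choose a transition length proportional to the mean joint-space
        distance between two poses (a larger gap gets more frames)."""
        distance = np.linalg.norm(pose_a - pose_b, axis=-1).mean()
        return int(np.clip(round(distance * frames_per_unit),
                           min_frames, max_frames))

    def interpolate_transition(pose_a, pose_b, n_frames):
        """Linearly interpolate n_frames skeletons strictly between pose_a and
        pose_b. Poses are (num_joints, 2) arrays of 2D keypoints, e.g. from
        OpenPose; returns an (n_frames, num_joints, 2) array."""
        ts = np.linspace(0.0, 1.0, n_frames + 2)[1:-1]  # interior steps only
        return np.stack([(1.0 - t) * pose_a + t * pose_b for t in ts])

    # Usage: stitch two word-level clips with synthesized transition skeletons.
    last_pose = np.random.rand(25, 2)   # stand-in for clip A's final skeleton
    first_pose = np.random.rand(25, 2)  # stand-in for clip B's opening skeleton
    n_frames = estimate_transition_length(last_pose, first_pose)
    transition = interpolate_transition(last_pose, first_pose, n_frames)

In the full pipeline, skeleton frames produced this way would then pass through the two Vid2Vid stages: the first renders them as IUV images, and the second combines the skeletons and IUV images into the final realistic video.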

References

  1. Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: 34th international conference on machine learning, ICML 2017

  2. Athitsos V, Neidle C, Sclaroff S, Nash J, Stefan A, Yuan Q, Thangali A (2008) The american sign language lexicon video dataset. In: 2008 IEEE computer society conference on computer vision and pattern recognition workshops, CVPR Workshops

  3. Borg M, Camilleri K P (2020) Phonologically-meaningful subunits for deep learning-based sign language recognition. In: ECCV 2020 workshop on sign language recognition, translation and production

  4. Cao Z, Hidalgo G, Simon T, Wei S E, Sheikh Y (2019) OpenPose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence

  5. Chen C, Zhang B, Hou Z, Jiang J, Liu M, Yang Y (2017) Action recognition from depth sequences using weighted fusion of 2d and 3d auto-correlation of gradients features. Multimedia Tools and Applications, 76

  6. Elliott R, Glauert JR, Kennaway JR, Marshall I (2000) The development of language processing support for the ViSiCAST project. In: Annual ACM conference on assistive technologies, proceedings

  7. Forster J, Schmidt C, Hoyoux T, Koller O, Zelle U, Piater J, Ney H (2012) Rwth-phoenix-weather: A large vocabulary sign language recognition and translation corpus. In: Proceedings of the 8th international conference on language resources and evaluation, LREC 2012

  8. Gokce C, Ozdemir O, Kindiroglu A A, Akarun L (2020) Score-level multi cue fusion for sign language recognition. In: ECCV 2020 workshop on sign language recognition, translation and production

  9. Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. Advances in Neural Information Processing Systems, 3

  10. Guler R A, Neverova N, Kokkinos I (2018) DensePose: Dense human pose estimation in the wild. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition

  11. Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A, Brox T (2017) FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017

  12. Isola P, Zhu J Y, Zhou T, Efros A A (2017) Image-to-image translation with conditional adversarial networks. In: Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017

  13. Koller O, Forster J, Ney H (2015) Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding, 141

  14. Krapez S, Solina F (1999) Synthesis of the sign language of the deaf from the sign video clips. Elektrotehniski Vestnik/Electrotechnical Review, 66

  15. Leng L, Zhang J, Xu J, Khan M K, Alghathbar K (2010) Dynamic weighted discrimination power analysis in dct domain for face and palmprint recognition. In: 2010 international conference on information and communication technology convergence (ICTC), pp 467–471

  16. Li Z, Aaron A (2016) Toward a practical perceptual video quality metric. https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652. Accessed 12 Dec 2020

  17. Liang X, Angelopoulou A, Kapetanios E, Woll B, Al-Batat R, Woolfe T (2020) A multi-modal machine learning approach and toolkit to automate recognition of early stages of dementia among british sign language users. In: ECCV 2020 workshop on sign language recognition, translation and production

  18. Lu P, Huenerfauth M (2014) Collecting and evaluating the cuny asl corpus for research on american sign language animation. Computer Speech and Language, 28

  19. Martinez A M, Wilbur R B, Shay R, Kak A C (2002) Purdue rvl-slll asl database for automatic recognition of american sign language. In: Proceedings - 4th IEEE international conference on multimodal interfaces, ICMI 2002

  20. Merkel D (2014) Docker: Lightweight linux containers for consistent development and deployment. Linux Journal 2014(239):2

  21. Min J, Chai J (2012) Motion graphs++: A compact generative model for semantic motion analysis and synthesis. ACM Transactions on Graphics, 31

  22. Mirza M, Osindero S (2014) Conditional generative adversarial nets. CoRR

  23. NVIDIA (2015) NVIDIA container toolkit. https://github.com/NVIDIA/nvidia-docker. Accessed 8 Oct 2020

  24. World Federation of the Deaf (2018) Our work. http://wfdeaf.org/our-work/. Accessed 8 Oct 2020

  25. World Health Organization (2020) Deafness and hearing loss. https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss. Accessed 8 Oct 2020

  26. Oszust M, Wysocki M (2013) Polish sign language words recognition with kinect. In: 2013 6th international conference on human system interactions, HSI 2013

  27. Papadogiorgaki M, Grammalidis N, Tzovaras D, Strintzis M G (2005) Text-to-sign language synthesis tool. In: 13th European signal processing conference, EUSIPCO 2005

  28. Parelli M, Papadimitriou K, Potamianos G, Pavlakos G, Maragos P (2020) Exploiting 3d hand pose estimation in deep learning-based sign language recognition from rgb videos. In: ECCV 2020 workshop on sign language recognition, translation and production

  29. Quiroga F (2020) Sign language recognition datasets. http://facundoq.github.io/guides/sign_language_datasets/slr. Accessed 26 Nov 2020

  30. Radford A, Metz L, Chintala S (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In: International conference on learning representations

  31. Sandler W, Lillo-Martin D (2006) Sign language and linguistic universals. Cambridge University Press

  32. Silva E P, Costa P D P, Kumada K M O, Martino J M D, Florentino G A (2020) Recognition of affective and grammatical facial expressions: a study for brazilian sign language. In: ECCV 2020 workshop on sign language recognition, translation and production

  33. Simon T, Joo H, Matthews I, Sheikh Y (2017) Hand keypoint detection in single images using multiview bootstrapping. In: Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017

  34. Stoll S, Camgoz N C, Hadfield S, Bowden R (2020) Text2sign: Towards sign language production using neural machine translation and generative adversarial networks. International Journal of Computer Vision, 128

  35. Tavakoli M, Batista R, Sgrigna L (2015) The UC Softhand: Light weight adaptive bionic hand with a compact twisted string actuation system. Actuators 5:1

  36. Tomar S (2006) Converting video formats with ffmpeg. Linux Journal 2006(146):10

  37. Wang T C, Liu M Y, Zhu J Y, Liu G, Tao A, Kautz J, Catanzaro B (2018) Video-to-video synthesis. Advances in Neural Information Processing Systems

  38. Wang Z, Simoncelli E P, Bovik A C (2003) Multiscale structural similarity for image quality assessment. In: The thirty-seventh Asilomar conference on signals, systems and computers, 2003, vol 2, pp 1398–1402

  39. Yulia (2019) Transition motion synthesis for video-based text to ASL. Master's thesis, National Taiwan University of Science and Technology

  40. Wang Z, Bovik A C, Sheikh H R, Simoncelli E P (2004) Image quality assessment: From error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612

  41. Zwitserlood I, Verlinden M, Ros J, Schoot S (2005) Synthetic signing for the deaf: Esign. https://core.ac.uk/display/101752491. Accessed 26 Sep 2020

Author information

Corresponding author

Correspondence to Chuan-Kai Yang.

Ethics declarations

This work was supported in part by the Ministry of Science and Technology of Taiwan under grants MOST 109-2221-E-011-133 and MOST 109-2228-E-011-007. Conflict of interest: both authors received the aforementioned funding support and declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xu, MC., Yang, CK. Realistic video generation for American Sign Language. Multimed Tools Appl 81, 38849–38886 (2022). https://doi.org/10.1007/s11042-022-12590-z
