
Bidirectional LSTM Networks Employing Stacked Bottleneck Features for Expressive Speech-Driven Head Motion Synthesis

  • Conference paper
Part of the book: Intelligent Virtual Agents (IVA 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10011)

Abstract

Previous work in speech-driven head motion synthesis has centred on Hidden Markov Model (HMM) based methods and on data that shows little variability of expressiveness in either speech or motion. When applied to expressive data, these systems often fail to produce satisfactory results. Recent studies have shown that deep neural networks (DNNs) synthesise head motion more accurately, in particular when bidirectional long short-term memory (BLSTM) networks are employed. We present a novel approach that uses DNNs with stacked bottleneck features combined with a BLSTM architecture to model context and expressive variability. The proposed architecture outperforms conventional feed-forward DNNs and plain BLSTM networks in an objective evaluation, and a subjective evaluation shows a significant improvement of the bottleneck architecture over feed-forward DNNs.
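
The combination described in the abstract can be illustrated with a minimal PyTorch sketch. This is an assumed architecture, not the authors' implementation: a feed-forward DNN with a narrow bottleneck layer extracts frame-level bottleneck features from the acoustic input, those features are stacked over a context window, and a BLSTM regresses per-frame head-motion parameters from the stacked features. All layer sizes, the context window width, and the three-dimensional head-rotation output are illustrative assumptions rather than values taken from the paper.

```python
# Minimal sketch (assumed architecture, not the authors' code) of a stacked
# bottleneck DNN followed by a bidirectional LSTM for head motion regression.
import torch
import torch.nn as nn


class BottleneckDNN(nn.Module):
    """Feed-forward DNN with a narrow bottleneck layer for feature extraction."""

    def __init__(self, in_dim=40, hidden_dim=512, bottleneck_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, bottleneck_dim), nn.Tanh(),  # bottleneck layer
        )
        # Output head used only when pre-training the DNN as a regressor.
        self.head = nn.Linear(bottleneck_dim, 3)

    def forward(self, x):
        # x: (batch, frames, in_dim) acoustic features
        bottleneck = self.encoder(x)
        return self.head(bottleneck), bottleneck


def stack_context(features, context=4):
    """Stack frame-level features over a +/- `context` window (zero-padded edges)."""
    batch, frames, dim = features.shape
    pad = features.new_zeros(batch, context, dim)
    padded = torch.cat([pad, features, pad], dim=1)
    windows = [padded[:, i:i + frames] for i in range(2 * context + 1)]
    return torch.cat(windows, dim=-1)  # (batch, frames, (2*context+1)*dim)


class StackedBottleneckBLSTM(nn.Module):
    """BLSTM regressor mapping stacked bottleneck features to head motion."""

    def __init__(self, stacked_dim, hidden=128, out_dim=3):
        super().__init__()
        self.blstm = nn.LSTM(stacked_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, out_dim)  # per-frame head rotations

    def forward(self, x):
        h, _ = self.blstm(x)
        return self.out(h)


if __name__ == "__main__":
    acoustic = torch.randn(2, 100, 40)    # 2 utterances, 100 frames, 40-dim features
    dnn = BottleneckDNN()
    _, bottleneck = dnn(acoustic)          # frame-level bottleneck features
    stacked = stack_context(bottleneck)    # (2, 100, 9 * 32)
    blstm = StackedBottleneckBLSTM(stacked.shape[-1])
    head_motion = blstm(stacked)           # (2, 100, 3) predicted rotations
    print(head_motion.shape)
```

The bottleneck layer forces the DNN to compress each acoustic frame into a low-dimensional representation, and stacking those representations over neighbouring frames gives the BLSTM a wider view of acoustic context than a single raw frame provides, which is the intuition behind combining the two components.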

Author information

Correspondence to Kathrin Haag or Hiroshi Shimodaira.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Haag, K., Shimodaira, H. (2016). Bidirectional LSTM Networks Employing Stacked Bottleneck Features for Expressive Speech-Driven Head Motion Synthesis. In: Traum, D., Swartout, W., Khooshabeh, P., Kopp, S., Scherer, S., Leuski, A. (eds.) Intelligent Virtual Agents. IVA 2016. Lecture Notes in Computer Science, vol. 10011. Springer, Cham. https://doi.org/10.1007/978-3-319-47665-0_18

  • DOI: https://doi.org/10.1007/978-3-319-47665-0_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47664-3

  • Online ISBN: 978-3-319-47665-0
