Bidirectional LSTM Networks Employing Stacked Bottleneck Features for Expressive Speech-Driven Head Motion Synthesis

Haag, Kathrin; Shimodaira, Hiroshi

doi:10.1007/978-3-319-47665-0_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10011)

Included in the following conference series:

International Conference on Intelligent Virtual Agents

Abstract

Previous work on speech-driven head motion synthesis has centred on Hidden Markov Model (HMM) based methods and on data that show little variability in expressiveness in either speech or motion. When applied to expressive data, these systems often fail to produce satisfactory results. Recent studies have shown that deep neural networks (DNNs) synthesise head motion better, in particular when they employ bidirectional long short-term memory (BLSTM). We present a novel approach that combines DNNs using stacked bottleneck features with a BLSTM architecture to model context and expressive variability. In an objective evaluation, our proposed DNN architecture outperforms conventional feed-forward DNNs and simple BLSTM networks. Results from a subjective evaluation show a significant improvement of the bottleneck architecture over feed-forward DNNs.
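As a rough illustration of what "stacked bottleneck features" could look like in practice, the sketch below extracts the activations of a narrow hidden layer from a small feed-forward network, stacks them over neighbouring frames, and concatenates the result with the acoustic features to form the input a BLSTM might consume. The abstract gives no layer sizes, context width, or feature types, so every dimension, the random (untrained) weights, the edge padding, and the helper names `bottleneck_features` and `stack_context` are assumptions made purely for illustration, not the authors' configuration.

```python
import numpy as np


def bottleneck_features(x, w_in, w_bn):
    """Forward pass of a feed-forward net up to its narrow (bottleneck) layer."""
    hidden = np.tanh(x @ w_in)       # wide hidden layer
    return np.tanh(hidden @ w_bn)    # narrow bottleneck layer


def stack_context(feats, context):
    """Concatenate each frame with its +/-context neighbours.

    Frames at the sequence edges reuse the first/last frame as padding
    (an assumption; other padding schemes are equally plausible).
    """
    n_frames = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + n_frames] for i in range(2 * context + 1)], axis=1
    )


# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
n_frames, n_acoustic, n_hidden, n_bottleneck, context = 100, 40, 64, 8, 2

speech = rng.standard_normal((n_frames, n_acoustic))        # stand-in acoustic frames
w_in = rng.standard_normal((n_acoustic, n_hidden)) * 0.1    # untrained weights
w_bn = rng.standard_normal((n_hidden, n_bottleneck)) * 0.1

bn = bottleneck_features(speech, w_in, w_bn)                # (frames, bottleneck)
stacked = stack_context(bn, context)                        # (frames, bottleneck * (2*context + 1))
blstm_input = np.concatenate([speech, stacked], axis=1)     # candidate input to a BLSTM
```

In a trained system the first network would be fitted to predict head motion before its bottleneck activations are reused; here the weights are random, so only the data flow and array shapes are meaningful.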

Author information

Authors and Affiliations

School of Informatics, Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK
Kathrin Haag & Hiroshi Shimodaira

Corresponding authors

Correspondence to Kathrin Haag or Hiroshi Shimodaira.

Editor information

Editors and Affiliations

University of Southern California, Los Angeles, California, USA
David Traum
University of Southern California, Los Angeles, California, USA
William Swartout
US Army Research Laboratory, Los Angeles, California, USA
Peter Khooshabeh
Universität Bielefeld, Bielefeld, Nordrhein-Westfalen, Germany
Stefan Kopp
University of Southern California, Los Angeles, California, USA
Stefan Scherer
University of Southern California, Los Angeles, California, USA
Anton Leuski

About this paper

Cite this paper

Haag, K., Shimodaira, H. (2016). Bidirectional LSTM Networks Employing Stacked Bottleneck Features for Expressive Speech-Driven Head Motion Synthesis. In: Traum, D., Swartout, W., Khooshabeh, P., Kopp, S., Scherer, S., Leuski, A. (eds) Intelligent Virtual Agents. IVA 2016. Lecture Notes in Computer Science (LNAI), vol 10011. Springer, Cham. https://doi.org/10.1007/978-3-319-47665-0_18

DOI: https://doi.org/10.1007/978-3-319-47665-0_18
Published: 19 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47664-3
Online ISBN: 978-3-319-47665-0
eBook Packages: Computer Science, Computer Science (R0)
