Abstract
Previous work in speech-driven head motion synthesis has centred on Hidden Markov Model (HMM) based methods and on data that shows little variability of expressiveness in either speech or motion. When applied to expressive data, these systems often fail to produce satisfactory results. Recent studies have shown that deep neural networks (DNNs) synthesise head motion better, in particular when they employ bidirectional long short-term memory (BLSTM) layers. We present a novel approach that combines DNNs using stacked bottleneck features with a BLSTM architecture to model context and expressive variability. In an objective evaluation, our proposed architecture outperforms conventional feed-forward DNNs and simple BLSTM networks. A subjective evaluation shows a significant improvement of the bottleneck architecture over feed-forward DNNs.
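The abstract does not spell out how the bottleneck features are stacked. One common reading is that a first network with a narrow hidden layer produces a low-dimensional activation vector per frame, and that the vectors of neighbouring frames are concatenated before being passed, together with the acoustic features, to the second network. The following is a minimal sketch of that stacking step only, under this assumption. The function name `stack_context`, the window size, and the edge-padding rule are illustrative choices and are not taken from the paper.

```python
from typing import List

Frame = List[float]


def stack_context(frames: List[Frame], context: int) -> List[Frame]:
    """Concatenate each frame with `context` neighbours on either side.

    Edge frames are padded by repeating the first or last frame
    (an assumption; zero padding would be an equally valid choice).
    """
    if context < 0:
        raise ValueError("context must be non-negative")
    if not frames:
        return []
    last = len(frames) - 1
    stacked: List[Frame] = []
    for t in range(len(frames)):
        window: Frame = []
        for offset in range(-context, context + 1):
            # Clamp the index so that edges reuse the nearest valid frame.
            idx = min(max(t + offset, 0), last)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Hypothetical 2-dimensional bottleneck activations for four frames.
bottleneck = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
stacked = stack_context(bottleneck, context=1)
print(len(stacked), len(stacked[0]))  # prints "4 6"
```

With a context of one frame, each output vector is three times as wide as a single bottleneck vector, so the second network sees a short window of compressed context at every time step in addition to whatever its recurrent layers carry.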
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Haag, K., Shimodaira, H. (2016). Bidirectional LSTM Networks Employing Stacked Bottleneck Features for Expressive Speech-Driven Head Motion Synthesis. In: Traum, D., Swartout, W., Khooshabeh, P., Kopp, S., Scherer, S., Leuski, A. (eds) Intelligent Virtual Agents. IVA 2016. Lecture Notes in Computer Science, vol 10011. Springer, Cham. https://doi.org/10.1007/978-3-319-47665-0_18
DOI: https://doi.org/10.1007/978-3-319-47665-0_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47664-3
Online ISBN: 978-3-319-47665-0
eBook Packages: Computer Science (R0)