
Speech-Driven Facial Animation Using Manifold Relevance Determination

  • Samia Dawood
  • Yulia Hicks
  • David Marshall
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9914)

Abstract

In this paper, a new approach to visual speech synthesis using a joint probabilistic model is introduced, namely the Gaussian process latent variable model with manifold relevance determination (MRD), which explicitly models coarticulation. A talking-head dataset (the LIPS dataset) is processed by extracting visual and audio features from its sequences. The model can capture the structure of data of extremely high dimensionality, and distinguishable visual features can be inferred directly from the trained model by sampling from the discovered latent points. The inferred visual features are evaluated statistically against ground-truth data and compared with the current state-of-the-art visual speech synthesis approach. The quantitative results demonstrate that the proposed approach outperforms the state-of-the-art technique.
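As a rough illustration of the pipeline the abstract describes (mapping sampled latent points to visual features, then scoring the inferences by average mean square error), the sketch below uses plain NumPy GP regression on synthetic data. The latent points, the RBF kernel parameters, and the feature mapping `W` are all illustrative assumptions, not the paper's trained MRD model, which learns the latent space and its relevance weights from the audio-visual data itself.

```python
import numpy as np

# Illustrative sketch only: the paper's actual model (a GPLVM with manifold
# relevance determination) learns the latent space; here the latent points,
# kernel parameters, and data are synthetic stand-ins.

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))                   # fixed synthetic latent-to-feature map
X_train = rng.normal(size=(50, 2))            # "trained" latent points (2-D latent space)
Y_train = np.sin(X_train) @ W                 # visual features (e.g. appearance params)

noise = 1e-3                                  # observation noise variance
K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
alpha = np.linalg.solve(K, Y_train)           # K^{-1} Y, reused for every test point

X_test = rng.normal(size=(10, 2))             # latent points sampled at synthesis time
Y_pred = rbf_kernel(X_test, X_train) @ alpha  # GP posterior mean prediction
Y_true = np.sin(X_test) @ W                   # ground-truth visual features

avg_mse = np.mean((Y_pred - Y_true) ** 2)     # average mean square error
print(f"average MSE: {avg_mse:.4f}")
```

The posterior-mean step is the standard GP regression predictive equation; in the actual system both the latent points and the kernel hyperparameters would come from the trained model rather than being fixed by hand.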

Keywords

Latent Space · Visual Feature · Canonical Correlation Analysis · Latent Variable Model · Average Mean Square Error
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Cardiff School of Engineering, Cardiff, Wales
  2. Cardiff School of Computer Science and Informatics, Cardiff, Wales
