Journal on Multimodal User Interfaces, Volume 2, Issue 2, pp 93–103

An audio-driven dancing avatar

  • Ferda Ofli
  • Yasemin Demir
  • Yücel Yemez
  • Engin Erzin
  • A. Murat Tekalp
  • Koray Balcı
  • İdil Kızoğlu
  • Lale Akarun
  • Cristian Canton-Ferrer
  • Joëlle Tilmanne
  • Elif Bozkurt
  • A. Tanju Erdem
Original Paper

Abstract

We present a framework for training and synthesis of an audio-driven dancing avatar. The avatar is trained for a given musical genre using multicamera video recordings of a dance performance. The video is analyzed to capture the time-varying posture of the dancer’s body, while the musical audio signal is processed to extract the beat information. We consider two different marker-based schemes for the motion capture problem: the first uses 3D joint positions to represent the body motion, whereas the second uses joint angles. The body movements of the dancer are characterized by a set of recurring semantic motion patterns, i.e., dance figures. Each dance figure is modeled in a supervised manner with a set of hidden Markov model (HMM) structures and the associated beat frequency. In the synthesis phase, an audio signal of unknown musical type is first classified, over a time interval, into one of the genres that have been learnt in the analysis phase, based on mel-frequency cepstral coefficients (MFCCs). The motion parameters of the corresponding dance figures are then synthesized via the trained HMM structures in synchrony with the audio signal, based on the estimated tempo information. Finally, the generated motion parameters, either the joint angles or the 3D joint positions of the body, are animated along with the musical audio using two different animation tools that we have developed. Experimental results demonstrate the effectiveness of the proposed framework.
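
For concreteness, the sketch below mirrors the described analysis/synthesis pipeline at a high level, assuming librosa for MFCC and tempo extraction and hmmlearn for the HMM structures. The genre labels, dance-figure labels, audio file path, and toy training data are hypothetical placeholders; this is an illustrative approximation of the approach, not the authors' implementation.

```python
# Illustrative sketch of the analysis/synthesis pipeline described above.
# Assumptions: librosa for MFCC and tempo extraction, hmmlearn for the
# HMMs; genre labels, dance figures, the audio path and the toy training
# data are hypothetical placeholders, not the authors' data or code.
import numpy as np
import librosa
from hmmlearn import hmm

rng = np.random.default_rng(0)

# Training phase (stand-in): one MFCC-likelihood HMM per musical genre
# and one motion HMM per dance figure, fit here on synthetic data.
genre_models, figure_models = {}, {}
for genre in ("salsa", "waltz"):                       # hypothetical genres
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag", random_state=0)
    m.fit(rng.normal(size=(500, 13)))                  # toy 13-dim MFCC frames
    genre_models[genre] = m
for figure in ("figure_a", "figure_b"):                # hypothetical dance figures
    m = hmm.GaussianHMM(n_components=4, covariance_type="diag", random_state=0)
    m.fit(rng.normal(size=(500, 30)))                  # toy motion parameters
    figure_models[figure] = m

# Synthesis phase: classify the genre from MFCCs, estimate the tempo,
# then sample motion parameters in synchrony with the beat.
y, sr = librosa.load("song.wav")                       # input audio (assumed path)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (frames, 13)
genre = max(genre_models, key=lambda g: genre_models[g].score(mfcc))

tempo, _beats = librosa.beat.beat_track(y=y, sr=sr)    # tempo in beats per minute
motion_fps = 30
frames_per_beat = int(round(motion_fps * 60.0 / float(tempo)))

# One 4-beat segment per scheduled dance figure; the sampled parameters
# stand in for either joint angles or 3D joint positions.
motion = np.vstack([figure_models[f].sample(4 * frames_per_beat)[0]
                    for f in ("figure_a", "figure_b")])
print(genre, float(tempo), motion.shape)
```

Classification by maximum log-likelihood over per-genre HMMs is one simple way to realize MFCC-based genre classification; tying the length of each sampled segment to the estimated beat period is what keeps the synthesized figures in synchrony with the music.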

Keywords

Multicamera motion capture · Audio-driven body motion synthesis · Dancing avatar animation

Copyright information

© OpenInterface Association 2008

Authors and Affiliations

  • Ferda Ofli (1)
  • Yasemin Demir (1)
  • Yücel Yemez (1)
  • Engin Erzin (1)
  • A. Murat Tekalp (1)
  • Koray Balcı (2)
  • İdil Kızoğlu (2)
  • Lale Akarun (2)
  • Cristian Canton-Ferrer (3)
  • Joëlle Tilmanne (4)
  • Elif Bozkurt (5)
  • A. Tanju Erdem (5)

  1. Multimedia, Vision and Graphics Laboratory, Koç University, İstanbul, Turkey
  2. Multimedia Group, Boğaziçi University, İstanbul, Turkey
  3. Image and Video Processing Group, Technical University of Catalonia, Barcelona, Spain
  4. TCTS Lab, Faculty of Engineering of Mons, Mons, Belgium
  5. Momentum Digital Media Technologies, İstanbul, Turkey
