Cross-Modal Predictive Coding for Talking Head Sequences

Chapter in Multimedia Communications and Video Coding

Abstract

Predictive coding of video has traditionally used information from previous video frames to construct an estimate of the current frame. The difference between the original and estimated signals can then be transmitted, allowing the receiver to fully reconstruct the original video frame. In this paper, we explore a new algorithm for coding the shape of a person’s lips in a head-and-shoulder video sequence. This algorithm uses the same predictive coding loop, but instead of forming an estimate of the lip image using motion compensation and previous video frames, it forms an estimate from the associated acoustic data. Since the acoustic data is also transmitted, the receiver is able to reconstruct the video with very little side information. We describe our predictive coding system and analyze methods for converting the acoustic data into visual estimates.
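
To make the coding loop concrete, the sketch below illustrates the idea in Python under simple assumptions: the lip shape in each frame is reduced to a small parameter vector, a stand-in acoustic-to-visual mapping (a placeholder linear map called estimate_lips_from_audio) forms the prediction, and only the quantized prediction residual is transmitted; the receiver, which already has the audio, repeats the same prediction and adds the residual back. All names, the feature dimensions, and the mapping itself are hypothetical illustrations, not the authors’ implementation.

```python
"""Minimal sketch of a cross-modal predictive coding loop for lip-shape
parameters. Everything here is illustrative: a real system would replace
estimate_lips_from_audio with a trained acoustic-to-visual mapping."""

import numpy as np


def estimate_lips_from_audio(audio_frame: np.ndarray) -> np.ndarray:
    """Hypothetical acoustic-to-visual mapping: map an acoustic feature
    vector to an estimate of the lip-shape parameters (e.g. mouth width
    and height)."""
    # Placeholder linear mapping for illustration only.
    W = np.array([[0.5, 0.1],
                  [0.2, 0.4]])
    return W @ audio_frame[:2]


def quantize(residual: np.ndarray, step: float = 0.05) -> np.ndarray:
    """Uniform quantizer for the prediction residual (the only video side
    information that would actually be transmitted)."""
    return np.round(residual / step) * step


def encode_frame(lip_params: np.ndarray, audio_frame: np.ndarray) -> np.ndarray:
    """Encoder: predict the lip parameters from the audio and send only the
    quantized prediction error."""
    prediction = estimate_lips_from_audio(audio_frame)
    return quantize(lip_params - prediction)


def decode_frame(residual_q: np.ndarray, audio_frame: np.ndarray) -> np.ndarray:
    """Decoder: the receiver already has the audio, so it forms the same
    prediction and adds back the transmitted residual."""
    prediction = estimate_lips_from_audio(audio_frame)
    return prediction + residual_q


if __name__ == "__main__":
    audio = np.array([0.8, 0.3])           # acoustic feature vector, one frame
    lips = np.array([0.52, 0.21])          # measured lip-shape parameters
    residual = encode_frame(lips, audio)   # transmitted side information
    recovered = decode_frame(residual, audio)
    print("residual:", residual, "recovered lips:", recovered)
```

Because both encoder and decoder derive the prediction from the shared audio stream, only the residual needs to be sent as video side information, which is what keeps the bit cost low.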

Copyright information

© 1996 Plenum Press, New York

About this chapter

Cite this chapter

Rao, R.R., Chen, T. (1996). Cross-Modal Predictive Coding for Talking Head Sequences. In: Wang, Y., Panwar, S., Kim, S.P., Bertoni, H.L. (eds) Multimedia Communications and Video Coding. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-0403-6_37

  • DOI: https://doi.org/10.1007/978-1-4613-0403-6_37

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-8036-8

  • Online ISBN: 978-1-4613-0403-6

  • eBook Packages: Springer Book Archive
