Cross-Modal Predictive Coding for Talking Head Sequences

Chapter in Multimedia Communications and Video Coding

Abstract

Predictive coding of video has traditionally used information from previous video frames to construct an estimate of the current frame. The difference between the original and estimated signals can then be transmitted, allowing the receiver to fully reconstruct the original video frame. In this paper, we explore a new algorithm for coding the shape of a person’s lips in a head-and-shoulder video sequence. This algorithm uses the same predictive coding loop, but instead of forming an estimate of the lip image using motion compensation and previous video frames, it forms an estimate from the associated acoustic data. Since the acoustic data is also transmitted, the receiver is able to reconstruct the video with very little side information. We describe our predictive coding system and analyze methods for converting the acoustic data into visual estimates.
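
To make the coding loop concrete, the sketch below illustrates the idea in Python under simple assumptions: the lip shape in each frame is reduced to a small parameter vector, a stand-in acoustic-to-visual mapping (a placeholder linear map called estimate_lips_from_audio) forms the prediction, and only the quantized prediction residual is transmitted; the receiver, which already has the audio, repeats the same prediction and adds the residual back. All names, the feature dimensions, and the mapping itself are hypothetical illustrations, not the authors’ implementation.

```python
"""Minimal sketch of a cross-modal predictive coding loop for lip-shape
parameters. Everything here is illustrative: a real system would replace
estimate_lips_from_audio with a trained acoustic-to-visual mapping."""

import numpy as np


def estimate_lips_from_audio(audio_frame: np.ndarray) -> np.ndarray:
    """Hypothetical acoustic-to-visual mapping: map an acoustic feature
    vector to an estimate of the lip-shape parameters (e.g. mouth width
    and height)."""
    # Placeholder linear mapping for illustration only.
    W = np.array([[0.5, 0.1],
                  [0.2, 0.4]])
    return W @ audio_frame[:2]


def quantize(residual: np.ndarray, step: float = 0.05) -> np.ndarray:
    """Uniform quantizer for the prediction residual (the only video side
    information that would actually be transmitted)."""
    return np.round(residual / step) * step


def encode_frame(lip_params: np.ndarray, audio_frame: np.ndarray) -> np.ndarray:
    """Encoder: predict the lip parameters from the audio and send only the
    quantized prediction error."""
    prediction = estimate_lips_from_audio(audio_frame)
    return quantize(lip_params - prediction)


def decode_frame(residual_q: np.ndarray, audio_frame: np.ndarray) -> np.ndarray:
    """Decoder: the receiver already has the audio, so it forms the same
    prediction and adds back the transmitted residual."""
    prediction = estimate_lips_from_audio(audio_frame)
    return prediction + residual_q


if __name__ == "__main__":
    audio = np.array([0.8, 0.3])           # acoustic feature vector, one frame
    lips = np.array([0.52, 0.21])          # measured lip-shape parameters
    residual = encode_frame(lips, audio)   # transmitted side information
    recovered = decode_frame(residual, audio)
    print("residual:", residual, "recovered lips:", recovered)
```

Because both encoder and decoder derive the prediction from the shared audio stream, only the residual needs to be sent as video side information, which is what keeps the bit cost low.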

Copyright information

© 1996 Plenum Press, New York

About this chapter

Cite this chapter

Rao, R.R., Chen, T. (1996). Cross-Modal Predictive Coding for Talking Head Sequences. In: Wang, Y., Panwar, S., Kim, S.P., Bertoni, H.L. (eds) Multimedia Communications and Video Coding. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-0403-6_37

  • DOI: https://doi.org/10.1007/978-1-4613-0403-6_37

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-8036-8

  • Online ISBN: 978-1-4613-0403-6

  • eBook Packages: Springer Book Archive
