Abstract

This paper summarizes some of the current research challenges arising from multi-channel sequence processing. Many real-life applications involve the simultaneous recording and analysis of multiple information sources, which may be asynchronous, have different frame rates, exhibit different stationarity properties, and carry complementary (or correlated) information. Some of these problems can already be tackled by one of the many statistical approaches to sequence modeling. However, several challenging research issues remain open, such as accounting for asynchrony and correlation between several feature streams, or handling the resulting growth in complexity. In this framework, we discuss two novel approaches that have recently been investigated with success in the context of large multimodal problems: the asynchronous HMM, which provides a principled approach to processing multiple feature streams, and the layered HMM, which provides a good formalism for decomposing large and complex (multi-stream) problems into layered architectures. As briefly reported here, combining these two approaches has yielded successful results on several multi-channel tasks, ranging from audio-visual speech recognition to automatic meeting analysis.

Keywords

Hidden Markov Model; Speech Recognition; Automatic Speech Recognition; Observation Sequence; Word Error Rate
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Samy Bengio (1)
  • Hervé Bourlard (1, 2)
  1. IDIAP Research Institute, Martigny, Switzerland
  2. Swiss Federal Institute of Technology at Lausanne (EPFL), Switzerland
