Interplay Between Visual and Audio Scene Analysis
We argue that joint audio-visual scene analysis is necessary to address the difficult problem of computational auditory scene analysis (CASA), and that CASA will benefit from computer audio-visual scene analysis (CAVSA). We also propose a generative probabilistic model on the correlogram, a video representation of the audio signal, to separate the audio sources.
Keywords: Audio Signal, Probabilistic Inference, Scene Analysis, Markov Blanket, Generative Probabilistic Model