Multiview Approaches to Event Detection and Scene Analysis

Abstract

This chapter addresses sound scene and event classification in multiview settings, that is, settings where the observations are obtained from multiple sensors, each sensor contributing a particular view of the data (e.g., audio microphones, video cameras). We briefly introduce techniques that can be exploited to effectively combine the data conveyed by the different views under analysis for a better interpretation. We first give a high-level presentation of generic methods that are particularly relevant to multiview and multimodal sound scene analysis, and then present a selection of techniques used for audiovisual event detection and microphone array-based scene analysis.

Keywords

  • Multimodal scene analysis
  • Multiview scene analysis
  • Multichannel audio
  • Joint audiovisual scene analysis
  • Representation learning
  • Data fusion
  • Matrix factorization
  • Tensor factorization
  • Audio source localization and tracking
  • Audio source separation
  • Beamforming
  • Multichannel Wiener filtering
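Among the data-fusion techniques listed above, decision-level (late) fusion is perhaps the simplest to illustrate. The sketch below is a minimal, hypothetical example: it takes weighted averages of the class posteriors produced by two per-modality classifiers; the probabilities and the weight are invented for the illustration and do not come from the chapter.

```python
import numpy as np

def late_fusion(prob_audio, prob_video, w_audio=0.5):
    """Weighted decision-level (late) fusion of per-modality class posteriors."""
    p = w_audio * np.asarray(prob_audio, dtype=float) \
        + (1.0 - w_audio) * np.asarray(prob_video, dtype=float)
    return p / p.sum()  # renormalize in case the inputs were not exact distributions

# Hypothetical posteriors over three event classes from an audio and a video classifier.
fused = late_fusion([0.7, 0.2, 0.1], [0.5, 0.4, 0.1], w_audio=0.6)
print(fused.argmax())  # → 0 (the class both modalities favor)
```

Weighted averaging is only one of many combination rules; product rules, Dempster-Shafer combination (e.g., [13, 54]), and trained meta-classifiers are common alternatives [86].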

DOI: 10.1007/978-3-319-63450-0_9 · Chapter length: 34 pages · eBook ISBN: 978-3-319-63450-0
Figs. 9.1–9.8

Notes

  1.

    The underlying assumption is that the (synchronized) features from both modalities are extracted at the same rate. In the case of audio and visual modalities, this is often achieved by downsampling the audio features, upsampling the video features, or using temporal integration techniques [80].
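    The rate-matching step mentioned above can be sketched as follows. This is a minimal numpy illustration (the frame counts and feature dimensions are hypothetical) that mean-pools a faster audio feature stream down to the video frame rate so that both modalities share a common time index:

```python
import numpy as np

def align_rates(audio_feats, video_feats):
    """Downsample the faster (audio) feature stream to the video frame rate
    by mean-pooling, so both modalities share one feature index.
    Shapes: audio (Ta, Da), video (Tv, Dv), with Ta >= Tv assumed."""
    Ta, Tv = len(audio_feats), len(video_feats)
    # Map each video frame to the span of audio frames it covers.
    edges = np.linspace(0, Ta, Tv + 1).round().astype(int)
    pooled = np.stack([audio_feats[a:b].mean(axis=0)
                       for a, b in zip(edges[:-1], edges[1:])])
    return pooled, video_feats

audio = np.arange(100, dtype=float).reshape(100, 1)  # e.g., 100 audio frames
video = np.zeros((25, 4))                            # e.g., 25 video frames
a, v = align_rates(audio, video)
print(a.shape)  # (25, 1)
```

    Upsampling the video features instead (e.g., by repetition or interpolation) is the symmetric alternative; temporal integration [80] generalizes the pooling step.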

  2.

    To simplify, we consider the case of two modalities, but the methods described here generalize straightforwardly to more than two data views by considering the relevant pairwise associations.
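    As an illustration of this pairwise strategy, the sketch below computes the first canonical correlation (classical CCA [67]) for every pair among three views. The view names and the random data are hypothetical, and the small ridge term is added only for numerical stability; this is not the chapter's own implementation.

```python
import numpy as np
from itertools import combinations

def cca_first_corr(X, Y, reg=1e-6):
    """First canonical correlation between views X (n, dx) and Y (n, dy)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])  # regularized auto-covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Squared canonical correlations are the eigenvalues of this matrix.
    M = np.linalg.inv(Cxx) @ Cxy @ np.linalg.inv(Cyy) @ Cxy.T
    return float(np.sqrt(np.max(np.linalg.eigvals(M).real)))

rng = np.random.default_rng(0)
views = {"audio": rng.normal(size=(200, 5)),
         "video": rng.normal(size=(200, 6)),
         "depth": rng.normal(size=(200, 4))}
for a, b in combinations(views, 2):
    print(a, b, round(cca_first_corr(views[a], views[b]), 3))
```

    Each pairwise correlation lies in [0, 1]; methods such as deep CCA [3] or kernel CCA [90] replace the linear projections but keep the same pairwise structure.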

  3.

    Matlab implementations are available online at http://plato.telecom-paristech.fr/publi/26108/.

  4.

    TREC Video Retrieval Evaluation: http://www-nlpir.nist.gov/projects/trecvid/.

  5.

    Here the term “concept classification” refers to generic categorization in terms of scene, event, object, or location [78].

References

  1. Adavanne, S., Parascandolo, G., Pertila, P., Heittola, T., Virtanen, T.: Sound event detection in multichannel audio using spatial and harmonic features. In: Proceedings of the IEEE AASP Chall Detect Classif Acoust Scenes Events (2016)

    Google Scholar 

  2. Amir, A., Berg, M., Chang, S.F., Hsu, W., Iyengar, G., Lin, C.Y., Naphade, M., Natsev, A., Neti, C., Nock, H., et al.: Ibm research trecvid-2003 video retrieval system. In: NIST TRECVID-2003 (2003)

    Google Scholar 

  3. Andrew, G., Arora, R., Bilmes, J.A., Livescu, K.: Deep canonical correlation analysis. In: Proceedings of the International Conference on Machine Learning (2013)

    Google Scholar 

  4. Antonacci, F., Lonoce, D., Motta, M., Sarti, A., Tubaro, S.: Efficient source localization and tracking in reverberant environments using microphone arrays. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4, pp. iv–1061. IEEE, New York (2005)

    Google Scholar 

  5. Antonacci, F., Matteucci, M., Migliore, D., Riva, D., Sarti, A., Tagliasacchi, M., Tubaro, S.: Tracking multiple acoustic sources in reverberant environments using regularized particle filter. In: Proceedings of the International Conference on Digital Signal Processing, pp. 99–102 (2007)

    Google Scholar 

  6. Arai, T., Hodoshima, H., Yasu, K.: Using steady-state suppression to improve speech intelligibility in reverberant environments for elderly listeners. IEEE Trans. Audio Speech Lang. Process. 18(7), 1775–1780 (2010)

    CrossRef  Google Scholar 

  7. Argones Rúa, E., Bredin, H.H., García Mateo, C., Chollet, G.G., González Jiménez, D.: Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden Markov models. Pattern Anal. Appl. 12(3), 271–284 (2008)

    MathSciNet  CrossRef  Google Scholar 

  8. Arulampalam, M., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. 50(2), 174–188 (2002)

    CrossRef  Google Scholar 

  9. Asoh, H., Asano, F., Yoshimura, T., Yamamoto, K., Motomura, Y., Ichimura, N., Hara, I., Ogata, J.: An application of a particle filter to Bayesian multiple sound source tracking with audio and video information fusion. In: Proceedings of the Fusion, pp. 805–812. Citeseer (2004)

    Google Scholar 

  10. Atrey, P.K., Hossain, M.A., El Saddik, A., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimed Syst 16(6), 345–379 (2010)

    CrossRef  Google Scholar 

  11. Barzelay, Z., Schechner, Y.Y.: Harmony in motion. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)

    Google Scholar 

  12. Beck, A., Stoica, P., Li, J.: Exact and approximate solutions of source localization problems. IEEE Trans. Signal Process. 56(5), 1770–1778 (2008)

    MathSciNet  CrossRef  Google Scholar 

  13. Benmokhtar, R., Huet, B.: Neural network combining classifier based on Dempster-Shafer theory for semantic indexing in video content. In: International MultiMedia Modeling Conference (MMM 2007), Singapore, 9–12 January 2007. LNCS, vol. 4352/2006, Part II. http://www.eurecom.fr/publication/2119

  14. Bertin, N., Badeau, R., Vincent, E.: Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription. IEEE Trans. Audio Speech Lang. Process. 18(3), 538–549 (2010)

    CrossRef  Google Scholar 

  15. Bießmann, F., Meinecke, F.C., Gretton, A., Rauch, A., Rainer, G., Logothetis, N.K., Müller, K.R.: Temporal kernel cca and its application in multimodal neuronal data analysis. Mach. Learn. 79(1–2), 5–27 (2010)

    MathSciNet  CrossRef  Google Scholar 

  16. Bitzer, J., Simmer, K.U.: Superdirective microphone arrays. In: Microphone Arrays, pp. 19–38. Springer, New York (2001)

    Google Scholar 

  17. Bitzer, J., Simmer, K.U., Kammeyer, K.D.: Theoretical noise reduction limits of the generalized sidelobe canceller (GSC) for speech enhancement. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2965–2968 (1999)

    Google Scholar 

  18. Blandin, C., Ozerov, A., Vincent, E.: Multi-source TDOA estimation in reverberant audio using angular spectra and clustering. Signal Process. 92(8), 1950–1960 (2012)

    CrossRef  Google Scholar 

  19. Bofill, P., Zibulevsky, M.: Underdetermined blind source separation using sparse representations. Signal Process. 81(11), 2353–2362 (2001)

    MATH  CrossRef  Google Scholar 

  20. Bousmalis, K., Morency, L.P.: Modeling hidden dynamics of multimodal cues for spontaneous agreement and disagreement recognition. In: International Conference on Automatic Face & Gesture Recognition, pp. 746–752 (2011)

    Google Scholar 

  21. Bredin, H., Chollet, G.: Measuring audio and visual speech synchrony: methods and applications. Proceedings of the IET International Conference on Visual Information Engineering, pp. 255–260 (2006)

    Google Scholar 

  22. Bregman, A.S.: Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, Cambridge (1994)

    Google Scholar 

  23. Brutti, A., Omologo, M., Svaizer, P.: Localization of multiple speakers based on a two step acoustic map analysis. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4349–4352 (2008)

    Google Scholar 

  24. Canclini, A., Antonacci, F., Sarti, A., Tubaro, S.: Acoustic source localization with distributed asynchronous microphone networks. IEEE Trans. Audio Speech Lang. Process. 21(2), 439–443 (2013)

    CrossRef  Google Scholar 

  25. Canclini, A., Bestagini, P., Antonacci, F., Compagnoni, M., Sarti, A., Tubaro, S.: A robust and low-complexity source localization algorithm for asynchronous distributed microphone networks. IEEE/ACM Trans. Audio Speech Lang. Process. 23(10), 1563–1575 (2015)

    CrossRef  Google Scholar 

  26. Capon, J.: High-resolution frequency-wavenumber spectrum analysis. Proc. IEEE 57(8), 1408–1418 (1969)

    CrossRef  Google Scholar 

  27. Carter, G.C.: Coherence and time delay estimation. Proc. IEEE 75(2), 236–255 (1987)

    CrossRef  Google Scholar 

  28. Casanovas, A., Monaci, G., Vandergheynst, P., Gribonval, R.: Blind audiovisual source separation based on sparse redundant representations. IEEE Trans. Multimed. 12(5), 358–371 (2010)

    CrossRef  Google Scholar 

  29. Casanovas, A.L., Vandergheynst, P.: Nonlinear video diffusion based on audio-video synchrony. IEEE Trans. Multimed., 2486–2489 (2010). doi:10.1109/ICASSP.2010.5494896

  30. Chang, S.F., Ellis, D., Jiang, W., Lee, K., Yanagawa, A., Loui, A.C., Luo, J.: Large-scale multimodal semantic concept detection for consumer video. In: Proceedings of the International Workshop on Multimedia Information Retrieval, MIR ’07, pp. 255–264. ACM, New York, NY (2007)

    Google Scholar 

  31. Chibelushi, C.C., Mason, J.S.D., Deravi, N.: Integrated person identification using voice and facial features. In: Proceedings of the IEE Colloquium on Image Processing for Security Application, pp. 4/1–4/5 (1997)

    Google Scholar 

  32. Choudhury, T., Rehg, J.M., Pavlovic, V., Pentland, A.: Boosting and structure learning in dynamic Bayesian networks for audio-visual speaker detection. In: Proceedings of the IEEE International Conference on Pattern Recognition, vol. 3, pp. 789–794 (2002)

    Google Scholar 

  33. Cichocki, A., Zdunek, R., Amari, S.: Nonnegative matrix and tensor factorization. IEEE Signal Process. Mag. 25(1), 142–145 (2008)

    MATH  CrossRef  Google Scholar 

  34. Compagnoni, M., Bestagini, P., Antonacci, F., Sarti, A., Tubaro, S.: Localization of acoustic sources through the fitting of propagation cones using multiple independent arrays. IEEE Trans. Audio Speech Lang. Process. 20(7), 1964–1975 (2012)

    MATH  CrossRef  Google Scholar 

  35. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, London (2006)

    MATH  Google Scholar 

  36. Cox, H., Zeskind, R., Kooij, T.: Practical supergain. IEEE Trans. Acoust. Speech Signal Process. 34(3), 393–398 (1986)

    CrossRef  Google Scholar 

  37. Cristani, M., Bicego, M., Murino, V.: Audio-visual event recognition in surveillance video sequences. IEEE Trans. Multimed. 9(2), 257–267 (2007)

    CrossRef  Google Scholar 

  38. Crocco, M., Bue, A.D., Murino, V.: A bilinear approach to the position self-calibration of multiple sensors. IEEE Trans. Signal Process. 60(2), 660–673 (2012)

    MathSciNet  CrossRef  Google Scholar 

  39. Cutler, R., Davis, L.: Look who’s talking: speaker detection using video and audio correlation. In: Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 3, pp. 1589–1592. IEEE, New York (2000)

    Google Scholar 

  40. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886–893. IEEE, New York (2005)

    Google Scholar 

  41. D’Arca, E., Robertson, N., Hopgood, J.: Look who’s talking: Detecting the dominant speaker in a cluttered scenario. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (2014)

    Google Scholar 

  42. DiBiase, J., Silverman, H., Brandstein, M.: Robust localization in reverberant rooms. In: Microphone Arrays, pp. 157–180. Springer, New York (2001)

    Google Scholar 

  43. Dmochowski, J., Benesty, J., Affes, S.: A generalized steered response power method for computationally viable source localization. IEEE Trans. Audio Speech Lang. Process. 15(8), 2510–2526 (2007)

    CrossRef  Google Scholar 

  44. Do, H., Silverman, H., Yu, Y.: A real-time SRP-PHAT source location implementation using stochastic region contraction (SRC) on a large-aperture microphone array. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. I121–I124. IEEE, New York (2007)

    Google Scholar 

  45. Doclo, S., Moonen, M.: GSVD-based optimal filtering for single and multimicrophone speech enhancement. IEEE Trans. Signal Process. 50(9), 2230–2244 (2002)

    CrossRef  Google Scholar 

  46. Duong, N.Q.K., Vincent, E., Gribonval, R.: Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18(7), 1830–1840 (2010)

    CrossRef  Google Scholar 

  47. Duong, N.Q.K., Vincent, E., Gribonval, R.: Spatial location priors for Gaussian model based reverberant audio source separation. EURASIP J. Adv. Signal Process. 2013(1), 1–11 (2013)

    CrossRef  Google Scholar 

  48. Elko, G.W.: Spatial coherence functions for differential microphones in isotropic noise fields. In: Microphone Arrays: Signal Processing Techniques and Applications, pp. 61–85. Springer, New York (2001)

    Google Scholar 

  49. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition (2016). arXiv preprint arXiv:1604.06573

    Google Scholar 

  50. Févotte, C., Cardoso, J.F.: Maximum likelihood approach for blind audio source separation using time-frequency Gaussian models. In: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 78–81 (2005)

    Google Scholar 

  51. Fisher, J., Darrell, T., Freeman, W.T., Viola, P., Fisher III, J.W.: Learning joint statistical models for audio-visual fusion and segregation. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 772–778 (2001)

    Google Scholar 

  52. FitzGerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factorisation models for musical sound source separation. Comput. Intell. Neurosci. 2008, 15 pp. (2008). Article ID 872425; doi:10.1155/2008/872425

  53. Fitzgerald, D., Cranitch, M., Coyle, E.: Using tensor factorisation models to separate drums from polyphonic music. In: Proceedings of the International Conference on Digital Audio Effects (2009)

    Google Scholar 

  54. Foucher, S., Lalibert, F., Boulianne, G., Gagnon, L.: A Dempster-Shafer based fusion approach for audio-visual speech recognition with application to large vocabulary French speech. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (2006)

    CrossRef  Google Scholar 

  55. Frost, O.L.: An algorithm for linearly constrained adaptive array processing. Proc. IEEE 60(8), 926–935 (1972)

    CrossRef  Google Scholar 

  56. Gandhi, A., Sharma, A., Biswas, A., Deshmukh, O.: Gethr-net: A generalized temporally hybrid recurrent neural network for multimodal information fusion (2016). arXiv preprint arXiv:1609.05281

    Google Scholar 

  57. Gehrig, T., Nickel, K., Ekenel, H., Klee, U., McDonough, J.: Kalman filters for audio-video source localization. In: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 118–121. IEEE, New York (2005)

    Google Scholar 

  58. Goecke, R., Millar, J.B.: Statistical analysis of the relationship between audio and video speech parameters for Australian English. In: Proceedings of the ISCA Tutor Res Workshop Audit-Vis Speech Process, pp. 133–138 (2003)

    Google Scholar 

  59. Gowdy, J.N., Subramanya, A., Bartels, C., Bilmes, J.A.: DBN based multi-stream models for audio-visual speech recognition. In: Proceedings of the IEEE International Conference of Acoustics, Speech and Signal Processing (2004)

    CrossRef  Google Scholar 

  60. Gravier, G., Potamianos, G., Neti, C.: Asynchrony modeling for audio-visual speech recognition. In: Proceedings of the International Conference on Human Language Technology Research, pp. 1–6. Morgan Kaufmann Publishers Inc., San Diego (2002)

    Google Scholar 

  61. Gribonval, R., Zibulevsky, M.: Sparse component analysis. In: Handbook of Blind Source Separation, Independent Component Analysis and Applications, pp. 367–420. Academic, New York (2010)

    Google Scholar 

  62. Griffiths, L., Jim, C.: An alternative approach to linearly constrained adaptive beamforming. IEEE Trans. Antennas Propag. 30(1), 27–34 (1982)

    CrossRef  Google Scholar 

  63. Gustafsson, T., Rao, B.D., Trivedi, M.: Source localization in reverberant environments: modeling and statistical analysis. IEEE Trans. Speech Audio Process. 11, 791–803 (2003)

    CrossRef  Google Scholar 

  64. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16(12), 2639–2664 (2004)

    MATH  CrossRef  Google Scholar 

  65. Haykin, S.: Adaptive Filter Theory, 5th edn. Pearson Education, Upper Saddle River (2014)

    MATH  Google Scholar 

  66. Haykin, S., Justice, J.H., Owsley, N.L., Yen, J., Kak, A.C.: Array Signal Processing. Prentice-Hall, Inc., Englewood Cliffs (1985)

    Google Scholar 

  67. Hotelling, H.: Relations between two sets of variates. Biometrika 28(3–4), 321–377 (1936)

    MATH  CrossRef  Google Scholar 

  68. Hu, D., Li, X., lu, X.: Temporal multimodal learning in audiovisual speech recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)

    Google Scholar 

  69. Huang, P.S., Zhuang, X., Hasegawa-Johnson, M.: Improving acoustic event detection using generalizable visual features and multi-modality modeling. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 349–352. IEEE, New York (2011)

    Google Scholar 

  70. Huang, Y., Benesty, J., Elko, G., Mersereati, R.: Real-time passive source localization: a practical linear-correction least-squares approach. IEEE Trans. Speech Audio Process. 9(8), 943–956 (2001)

    CrossRef  Google Scholar 

  71. Ivanov, Y., Serre, T., Bouvrie, J.: Error weighted classifier combination for multi-modal human identification. Tech. Rep. MIT-CSAIL-TR-2005–081, MIT (2005)

    Google Scholar 

  72. Izadinia, H., Saleemi, I., Shah, M.: Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE Trans. Multimed. 15(2), 378–390 (2013)

    CrossRef  Google Scholar 

  73. Izumi, Y., Ono, N., Sagayama, S.: Sparseness-based 2CH BSS using the EM algorithm in reverberant environment. In: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 147–150 (2007)

    Google Scholar 

  74. Jaureguiberry, X., Vincent, E., Richard, G.: Fusion methods for speech enhancement and audio source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 24(7), 1266–1279 (2016)

    CrossRef  Google Scholar 

  75. Jhuo, I.H., Ye, G., Gao, S., Liu, D., Jiang, Y.G., Lee, D., Chang, S.F.: Discovering joint audio–visual codewords for video event detection. Mach. Vis. Appl. 25(1), 33–47 (2014)

    CrossRef  Google Scholar 

  76. Jiang, W., Loui, A.C.: Audio-visual grouplet: temporal audio-visual interactions for general video concept classification. In: Proceedings of the ACM International Conference on Multimedia, Scottsdale, pp. 123–132. (2011)

    Google Scholar 

  77. Jiang, Y.G., Zeng, X., Ye, G.: Columbia-UCF TRECVID2010 multimedia event detection: combining multiple modalities, contextual concepts, and temporal matching. In: Proceedings of the NIST TRECVID-2003 (2003)

    Google Scholar 

  78. Jiang, W., Cotton, C., Chang, S.F., Ellis, D., Loui, A.: Short-term audiovisual atoms for generic video concept classification. In: Proceedings of the ACM International Conference on Multimedia, pp. 5–14. ACM, New York (2009)

    Google Scholar 

  79. Jiang, Y.G., Bhattacharya, S., Chang, S.F., Shah, M.: High-level event recognition in unconstrained videos. Int. J. Multimed. Inf. Retr. 2(2), 73–101 (2013)

    CrossRef  Google Scholar 

  80. Joder, C., Essid, S., Richard, G.: Temporal integration for audio classification with application to musical instrument classification. IEEE Trans. Audio Speech Lang. Process. 17(1), 174–186 (2009). doi:10.1109/TASL.2008.2007613

    CrossRef  Google Scholar 

  81. Jourjine, A., Rickard, S., Yılmaz, O.: Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2985–2988 (2000)

    Google Scholar 

  82. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)

    Google Scholar 

  83. Kay, J.: Feature discovery under contextual supervision using mutual information. In: Proceedings of the International Joint Conference on Neural Networks, vol. 4, pp. 79–84 (1992)

    Google Scholar 

  84. Kidron, E., Schechner, Y., Elad, M.: Pixels that sound. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 88–95 (2005)

    Google Scholar 

  85. Kijak, E., Gravier, G., Gros, P., Oisel, L., Bimbot, F.: HMM based structuring of tennis videos using visual and audio cues. In: Proceedings of the IEEE International Conference on Multimedia Expo, pp. 309–312. IEEE Computer Society, Washington (2003)

    Google Scholar 

  86. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 226–239 (1998)

    CrossRef  Google Scholar 

  87. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)

    MathSciNet  MATH  CrossRef  Google Scholar 

  88. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

    Google Scholar 

  89. Kuhn, G.F.: Model for the interaural time differences in the azimuthal plane. J. Acoust. Soc. Am. 62(1), 157–167 (1977)

    CrossRef  Google Scholar 

  90. Lai, P.L., Fyfe, C.: Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst. 10(5), 365–378 (2000)

    CrossRef  Google Scholar 

  91. Levy, A., Gannot, S., Habets, E.: Multiple-hypothesis extended particle filter for acoustic source localization in reverberant environments. IEEE Trans. Audio Speech Lang. Process 19(6), 1540–1555 (2011)

    CrossRef  Google Scholar 

  92. Li, D., Dimitrova, N., Li, M., Sethi, I.: Multimedia content processing through cross-modal association. In: Proceedings of the ACM International Conference on Multimedia, Berkeley, CA (2003)

    CrossRef  Google Scholar 

  93. Lim, A., Nakamura, K., Nakadai, K., Ogata, T., Okuno, H.G.: Audio-visual musical instrument recognition. In: Proceedings of the National Convention Audio-V Information Processing Society (2011)

    Google Scholar 

  94. Liu, Q., Wang, W., Jackson, P.J., Barnard, M., Kittler, J., Chambers, J.: Source separation of convolutive and noisy mixtures using audio-visual dictionary learning and probabilistic time-frequency masking. IEEE Trans. Signal Process. 61(22), 5520–5535 (2013)

    MathSciNet  CrossRef  Google Scholar 

  95. Liutkus, A., Durrieu, J.L., Daudet, L., Richard, G.: An overview of informed audio source separation. In: Proceedings of the International Workshop on Image Analysis for Multimedia Interactive Services, pp. 1–4. IEEE, New York (2013)

    Google Scholar 

  96. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)

    CrossRef  Google Scholar 

  97. Mahadevan, V., Li, W., Bhalodia, V., Vasconcelos, N.: Anomaly detection in crowded scenes. In: Proceedings of the IEEE Conference Computer Vision and Pattern Recognition, vol. 249, p. 250 (2010)

    Google Scholar 

  98. Makino, S., Lee, T.W., Sawada, H.: Blind Speech Separation. Springer, New York (2007)

    CrossRef  Google Scholar 

  99. Mandel, M., Ellis, D.: EM localization and separation using interaural level and phase cues. In: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 275–278 (2007)

    Google Scholar 

  100. Mandel, M., Bressler, S., Shinn-Cunningham, B., Ellis, D.: Evaluating source separation algorithms with reverberant speech. IEEE Trans. Audio Speech Lang. Process. 18(7), 1872–1883 (2010)

    CrossRef  Google Scholar 

  101. Maragos, P., Gros, P., Katsamanis, A., Papandreou, G.: Cross-modal integration for performance improving in multimedia: a review. In: Multimodal Processing and Interaction, pp. 1–46. Springer, New York (2008)

    Google Scholar 

  102. Marti, A., Cobos, M., Lopez, J., Escolano, J.: A steered response power iterative method for high-accuracy acoustic source localization. J. Acoust. Soc. Am. 134(4), 2627–2630 (2013)

    CrossRef  Google Scholar 

  103. Metallinou, A., Lee, S., Narayanan, S.: Decision level combination of multiple modalities for recognition and analysis of emotional expression. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2462–2465 (2010)

    Google Scholar 

  104. Milani, S., Fontani, M., Bestagini, P., Barni, M., Piva, A., Tagliasacchi, M., Tubaro, S.: An overview on video forensics. APSIPA Trans. Signal Inf. Process. 1, e2 (2012)

    CrossRef  Google Scholar 

  105. Monaci, G., Vandergheynst, P.: Audiovisual gestalts. In: Proceedings of the IEEE Conference Computer Vision and Pattern Recognition, pp. 200–200 (2006)

    Google Scholar 

  106. Monaci, G., Jost, P., Vandergheynst, P., Mailhé, B., Lesage, S., Gribonval, R.: Learning multimodal dictionaries. IEEE Trans. Image Process. 16(9), 2272–2283 (2007)

    MathSciNet  CrossRef  Google Scholar 

  107. Monaci, G., Vandergheynst, P., Sommer, F.T.: Learning bimodal structure in audio–visual data. IEEE Trans. Neural Netw. 20(12), 1898–1910 (2009)

    CrossRef  Google Scholar 

  108. Moore, B.C.J.: Introduction to the Psychology of Hearing. Macmillan, London (1977)

    Google Scholar 

  109. Murphy, K.P.: Dynamic Bayesian networks: representation, inference and learning. Ph.D. thesis, University of California, Berkeley (2002)

    Google Scholar 

  110. Naphade, M.R., Garg, A., Huang, T.S.: Audio-visual event detection using duration dependent input output markov models. In: Proceedings of the IEEE Workshop Content-Based Access Image and Video Libraries, pp. 39–43. IEEE, New York (2001)

    Google Scholar 

  111. Nefian, A.V., Liang, L., Pi, X., Xiaoxiang, L., Mao, C., Murphy, K.P.: A coupled HMM for audiovisual speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2. IEEE, New York (2002)

    Google Scholar 

  112. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: Proceedings of the International Conference on Machine Learning, pp. 689–696 (2011)

    Google Scholar 

  113. Nguyen, V.T., Nguyen, D.L., Tran, M.T., Le, D.D., Duong, D.A., Satoh, S.: Query-adaptive late fusion with neural network for instance search. In: Proceedings of the IEEE International Workshop on Multimedia Signal Processing, pp. 1–6. IEEE, New York (2015)

    Google Scholar 

  114. Nikunen, J., Virtanen, T.: Direction of arrival based spatial covariance model for blind sound source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 22(3), 727–739 (2014)

    CrossRef  Google Scholar 

  115. Omologo, M., Svaizer, P.: Acoustic event localization using a crosspower-spectrum phase based technique. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2. IEEE, New York (1994)

    Google Scholar 

  116. Otsuka, T., Ishiguro, K., Sawada, H., Okuno, H.G.: Bayesian nonparametrics for microphone array processing. IEEE/ACM Trans. Audio Speech Lang. Proc. 22(2), 493–504 (2014)

    CrossRef  Google Scholar 

  117. Ozerov, A., Févotte, C.: Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio Speech Lang. Process. 18(3), 550–563 (2010)

    CrossRef  Google Scholar 

  118. Ozerov, A., Févotte, C., Blouet, R., Durrieu, J.L.: Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Prague (2011)

    CrossRef  Google Scholar 

  119. Ozerov, A., Vincent, E., Bimbot, F.: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Process. 20(4), 1118–1133 (2012)

    CrossRef  Google Scholar 

  120. Parekh, S., Essid, S., Ozerov, A., Duong, N.Q.K., Pérez, P., Richard, G.: Motion informed audio source separation. In: IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), New Orleans (2017)

    Google Scholar 

  121. Parisi, R., Croene, P., Uncini, A.: Particle swarm localization of acoustic sources in the presence of reverberation. In: Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 4. IEEE, New York (2006)

  122. Parra, L., Spence, C.: Convolutive blind separation of non-stationary sources. IEEE Trans. Speech Audio Process. 8(3), 320–327 (2000)

  123. Pertilä, P., Mieskolainen, M., Hämäläinen, M.: Closed-form self-localization of asynchronous microphone arrays. In: Proceedings of the Joint Workshop on Hands-free Speech Communication and Microphone Arrays, pp. 139–144. IEEE, New York (2011)

  124. Rocha, A., Scheirer, W., Boult, T., Goldenstein, S.: Vision of the unseen: current trends and challenges in digital image and video forensics. ACM Comput. Surv. 43(4), 26 (2011)

  125. Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1), 1–39 (2010). doi:10.1007/s10462-009-9124-7

  126. Roy, R., Kailath, T.: ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process. 37(7), 984–995 (1989)

  127. Sadlier, D.A., O'Connor, N.E.: Event detection in field sports video using audio-visual features and a support vector machine. IEEE Trans. Circuits Syst. Video Technol. 15(10), 1225–1233 (2005)

  128. Sawada, H., Mukai, R., Araki, S., Makino, S.: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. Speech Audio Process. 12(5), 530–538 (2004)

  129. Schau, H., Robinson, A.: Passive source localization employing intersecting spherical surfaces from time-of-arrival differences. IEEE Trans. Acoust. Speech Signal Process. 35(8), 1223–1225 (1987)

  130. Scheuing, J., Yang, B.: Disambiguation of TDOA estimation for multiple sources in reverberant environments. IEEE Trans. Audio Speech Lang. Process. 16(8), 1479–1489 (2008)

  131. Schmidt, R.: Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 34(3), 276–280 (1986)

  132. Sedighin, F., Babaie-Zadeh, M., Rivet, B., Jutten, C.: Two multimodal approaches for single microphone source separation. In: Proceedings of the European Signal Processing Conference (2016)

  133. Seichepine, N., Essid, S., Févotte, C., Cappé, O.: Soft nonnegative matrix co-factorization with application to multimodal speaker diarization. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver (2013)

  134. Seichepine, N., Essid, S., Févotte, C., Cappé, O.: Soft nonnegative matrix co-factorization. IEEE Trans. Signal Process. PP(99) (2014)

  135. Serizel, R., Moonen, M., van Dijk, B., Wouters, J.: Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants. IEEE/ACM Trans. Audio Speech Lang. Process. 22(4), 785–799 (2014)

  136. Serizel, R., Bisot, V., Essid, S., Richard, G.: Machine listening techniques as a complement to video image analysis in forensics. In: Proceedings of the IEEE International Conference on Image Processing, pp. 948–952. IEEE, New York (2016)

  137. Showen, R., Calhoun, R., Dunham, J.: Acoustic location of gunshots using combined angle of arrival and time of arrival measurements (2009). US Patent 7,474,589

  138. Sigg, C., Fischer, B., Ommer, B., Roth, V., Buhmann, J.: Nonnegative CCA for audiovisual source separation. In: Proceedings of the IEEE Workshop on Machine Learning and Signal Processing, pp. 253–258. IEEE, New York (2007)

  139. Smaragdis, P., Casey, M.: Audio visual independent components. In: Proceedings of the International Symposium on Independent Component Analysis and Blind Signal Separation, pp. 709–714 (2003)

  140. Srivastava, N., Salakhutdinov, R.R.: Multimodal learning with deep Boltzmann machines. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 2222–2230 (2012)

  141. Stoica, P., Moses, R.: Spectral Analysis of Signals. Pearson Prentice Hall, Upper Saddle River, NJ (2005)

  142. Strobel, N., Spors, S., Rabenstein, R.: Joint audio-video object localization and tracking. IEEE Signal Process. Mag. 18(1), 22–31 (2001)

  143. Tian, Y., Chen, Z., Yin, F.: Distributed Kalman filter-based speaker tracking in microphone array networks. Appl. Acoust. 89, 71–77 (2015)

  144. Togami, M., Hori, K.: Multichannel semi-blind source separation via local Gaussian modeling for acoustic echo reduction. In: Proceedings of the European Signal Processing Conference (2011)

  145. Togami, M., Kawaguchi, Y.: Simultaneous optimization of acoustic echo reduction, speech dereverberation, and noise reduction against mutual interference. IEEE/ACM Trans. Audio Speech Lang. Process. 22(11), 1612–1623 (2014)

  146. Trifa, V., Koene, A., Moren, J., Cheng, G.: Real-time acoustic source localization in noisy environments for human-robot multimodal interaction. In: Proceedings of the IEEE International Symposium on Robots and Human Interactive Communication (2007)

  147. Valente, S., Tagliasacchi, M., Antonacci, F., Bestagini, P., Sarti, A., Tubaro, S.: Geometric calibration of distributed microphone arrays from acoustic source correspondences. In: Proceedings of the IEEE International Workshop on Multimedia Signal Processing, pp. 13–18 (2010)

  148. Valin, J., Michaud, F., Rouat, J.: Robust 3D localization and tracking of sound sources using beamforming and particle filtering. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4. IEEE, New York (2006)

  149. Velivelli, A., Ngo, C.W., Huang, T.S.: Detection of documentary scene changes by audio-visual fusion. In: Proceedings of the International Conference on Image and Video Retrieval, pp. 227–238. Springer, New York (2003)

  150. Vincent, E., Bertin, N., Gribonval, R., Bimbot, F.: From blind to guided audio source separation: how models and side information can improve the separation of sound. IEEE Signal Process. Mag. 31(3), 107–115 (2014)

  151. Vuegen, L., Van Den Broeck, B., Karsmakers, P., Van hamme, H., Vanrumste, B.: Automatic monitoring of activities of daily living based on real-life acoustic sensor data: a preliminary study. In: Proceedings of the International Workshop on Speech and Language Processing for Assistive Technologies, pp. 113–118 (2013)

  152. Wang, D.L.: Time-frequency masking for speech separation and its potential for hearing aid design. Trends Amplif. 12(4), 332–352 (2008)

  153. Wang, H., Chu, P.: Voice source localization for automatic camera pointing system in videoconferencing. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (1997)

  154. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. 103(1), 60–79 (2013)

  155. Ward, D.B., Lehmann, E.A., Williamson, R.C.: Particle filtering algorithms for tracking an acoustic source in a reverberant environment. IEEE Trans. Speech Audio Process. 11(6), 826–836 (2003)

  156. Wilkins, P., Adamek, T., Byrne, D., Jones, G., Lee, H., Keenan, G., Mcguinness, K., O'Connor, N.E., Smeaton, A.F., Amin, A., Obrenovic, Z., Benmokhtar, R., Galmar, E., Huet, B., Essid, S., Landais, R., Vallet, F., Papadopoulos, G.T., Vrochidis, S., Mezaris, V., Kompatsiaris, I., Spyrou, E., Avrithis, Y., Morzinger, R., Schallauer, P., Bailer, W., Piatrik, T., Chandramouli, K., Izquierdo, E., Haller, M., Goldmann, L., Samour, A., Cobet, A., Sikora, T., Praks, P.: K-space at TRECVid 2007. In: TRECVID 2007 (2007)

  157. Wu, Y., Lin, C.Y.Y., Chang, E.Y., Smith, J.R.: Multimodal information fusion for video concept detection. In: Proceedings of the IEEE International Conference on Image Processing, vol. 4, pp. 2391–2394. IEEE, Singapore (2004)

  158. Wu, Z., Jiang, Y.G., Wang, J., Pu, J., Xue, X.: Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In: Proceedings of the ACM International Conference on Multimedia, pp. 167–176. ACM, New York (2014)

  159. Yilmaz, K., Cemgil, A.T.: Probabilistic latent tensor factorisation. In: Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, pp. 346–353 (2010)

  160. Yokoya, N., Yairi, T., Iwasaki, A.: Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion. IEEE Trans. Geosci. Remote Sens. 50(2), 528–537 (2012)

  161. Yoo, J., Choi, S.: Matrix co-factorization on compressed sensing. In: Proceedings of the International Joint Conference on Artificial Intelligence (2011)

  162. Yost, W.A.: Discriminations of interaural phase differences. J. Acoust. Soc. Am. 55(6), 1299–1303 (1974)

  163. Yuhas, B.P., Goldstein, M.H., Sejnowski, T.J.: Integration of acoustic and visual speech signals using neural networks. IEEE Commun. Mag. 27(11), 65–71 (1989)

  164. Zhang, Q., Chen, Z., Yin, F.: Distributed marginalized auxiliary particle filter for speaker tracking in distributed microphone networks. IEEE/ACM Trans. Audio Speech Lang. Process. 24(11), 1921–1934 (2016)

  165. Zotkin, D.N., Duraiswami, R.: Accelerated speech source localization via a hierarchical search of steered response power. IEEE Trans. Speech Audio Process. 12(5), 499–508 (2004)


Author information

Correspondence to Slim Essid.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Essid, S. et al. (2018). Multiview Approaches to Event Detection and Scene Analysis. In: Virtanen, T., Plumbley, M., Ellis, D. (eds) Computational Analysis of Sound Scenes and Events. Springer, Cham. https://doi.org/10.1007/978-3-319-63450-0_9

  • DOI: https://doi.org/10.1007/978-3-319-63450-0_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63449-4

  • Online ISBN: 978-3-319-63450-0