Abstract
With the proliferation of video-based applications, there is high demand for automated systems that support video-based tasks without human intervention such as manual tagging. In this paper, we present a novel approach for detecting whether two videos overlap by exploiting their corresponding audio signals; this detection is a crucial preprocessing step for audio alignment and, subsequently, video alignment and synchronisation. Several existing approaches have limitations related to timestamps, overlapping regions, and the length of video clips. We target the challenging scenario of videos recorded simultaneously, in an unconstrained manner, by multiple users attending performance events. Our work is an attempt towards developing a robust framework that not only accounts for noisy components present in the audio but is also free from the limitations mentioned above. We compare our framework with several existing approaches, and it outperforms them by an average of 13.71% in terms of accuracy.
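To make the overlap-detection task concrete, the following is a minimal sketch of a classical baseline, not the deep learning and fingerprinting framework proposed in this paper. It assumes two mono audio signals at the same sampling rate and declares an overlap when the normalised cross-correlation at some relative lag exceeds a threshold. The function names, the `min_overlap` and `threshold` values, and the synthetic signals are all illustrative assumptions.

```python
import math
import random


def normalized_correlation(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    den_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if den_x == 0.0 or den_y == 0.0:
        return 0.0
    return num / (den_x * den_y)


def detect_overlap(sig_a, sig_b, min_overlap=64, threshold=0.6):
    """Return (has_overlap, best_lag, best_score).

    best_lag is the sample index in sig_a at which sig_b starts
    (negative if sig_b starts before sig_a). Both default values are
    illustrative assumptions, not parameters taken from the paper.
    """
    if len(sig_a) < min_overlap or len(sig_b) < min_overlap:
        return False, 0, 0.0
    best_lag, best_score = 0, -1.0
    # Try every relative shift that leaves at least min_overlap common samples.
    for lag in range(-(len(sig_b) - min_overlap), len(sig_a) - min_overlap + 1):
        start_a = max(lag, 0)
        start_b = max(-lag, 0)
        length = min(len(sig_a) - start_a, len(sig_b) - start_b)
        score = normalized_correlation(
            sig_a[start_a:start_a + length],
            sig_b[start_b:start_b + length],
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_score >= threshold, best_lag, best_score


# Synthetic stand-in for one event recorded by two users with independent noise.
random.seed(0)
event = [random.gauss(0, 1) for _ in range(600)]
rec_a = [s + random.gauss(0, 0.3) for s in event[0:400]]    # covers samples 0..399
rec_b = [s + random.gauss(0, 0.3) for s in event[250:600]]  # covers samples 250..599
rec_c = [random.gauss(0, 1) for _ in range(350)]            # unrelated recording

print(detect_overlap(rec_a, rec_b))
print(detect_overlap(rec_a, rec_c))
```

This brute-force search is quadratic in the signal length and sensitive to the kind of noise found in unconstrained recordings, which is the kind of limitation that motivates more robust, fingerprint-based approaches.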
Data Availability
The datasets generated during and/or analysed during the current study are available in the following repository: https://traces.cs.umass.edu/index.php/Mmsys/Mmsys
Funding
This work was supported by a grant from DST, Govt. of India, for the Technology Innovation Hub at IIT Ropar under the National Mission on Interdisciplinary Cyber-Physical Systems.
Ethics declarations
Conflicts of interests/Competing interests
The authors declare no conflict of interest.
About this article
Cite this article
Uikey, A., Bedi, A.K., Choudhary, P. et al. A highly robust deep learning technique for overlap detection using audio fingerprinting. Multimed Tools Appl 83, 29119–29137 (2024). https://doi.org/10.1007/s11042-023-16713-y
DOI: https://doi.org/10.1007/s11042-023-16713-y