Abstract
Shot boundary detection (SBD) is a critical pre-processing task for intelligent video analysis applications. In this study, we propose a novel multimodal approach to SBD that uses a Siamese network to learn a distance measure between audiovisual features. To extract relevant features from the power spectral density (PSD) of the audio stream, we combine tsfresh, a Python package for time series feature extraction, with principal component analysis (PCA), and use a GRU-attention network to learn sequential semantic representations and spatial location information. For the visual modality, we employ a pre-trained EfficientNet model to extract and learn visual features. The proposed network learns a similarity score from the image embeddings and the PSD-based audio features, and these scores are used to build a signal representing audiovisual change. A global threshold detects transitions, and an adaptive threshold differentiates between the detected transition types (abrupt or gradual). In experiments on standard datasets (TRECVID 2001 and TRECVID 2007), introducing audio features yielded a significant improvement in F1 score (92.43%) and in gradual transition detection (90.08%) compared to state-of-the-art models.
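The two-threshold step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the thresholding rules, so the function name `detect_transitions`, the parameters `global_thr`, `window`, and `k`, and the rule "local mean plus k standard deviations" are all assumptions chosen for illustration. The input is assumed to be a per-frame dissimilarity signal (for example, one minus the similarity score).

```python
from statistics import mean, pstdev


def detect_transitions(signal, global_thr=0.5, window=3, k=3.0):
    """Return (start, end, kind) tuples for runs above a global threshold.

    Hypothetical rule (an assumption, not taken from the paper):
    a run is 'abrupt' when its peak exceeds an adaptive threshold
    computed from neighbouring frames, and 'gradual' otherwise.
    """
    transitions = []
    i, n = 0, len(signal)
    while i < n:
        if signal[i] <= global_thr:
            i += 1
            continue
        # Global threshold: collect the run of consecutive frames above it.
        start = i
        while i < n and signal[i] > global_thr:
            i += 1
        end = i - 1
        # Locate the peak of the run.
        peak = max(range(start, end + 1), key=signal.__getitem__)
        # Adaptive threshold: statistics of the frames around the peak,
        # excluding the peak itself.
        context = signal[max(0, peak - window):peak] + signal[peak + 1:peak + 1 + window]
        if context:
            adaptive_thr = mean(context) + k * pstdev(context)
        else:
            adaptive_thr = global_thr
        # An isolated sharp peak stands far above its neighbours; a gradual
        # transition raises the neighbours too, so the peak stays below.
        kind = "abrupt" if signal[peak] > adaptive_thr else "gradual"
        transitions.append((start, end, kind))
    return transitions


# Toy dissimilarity signal: one sharp spike, then one slow ramp.
toy_signal = [0.05, 0.04, 0.06, 0.05, 0.9, 0.05, 0.04, 0.06, 0.05, 0.05,
              0.3, 0.55, 0.7, 0.6, 0.35, 0.05, 0.04, 0.05]
result = detect_transitions(toy_signal)
print(result)
```

On this toy signal, the spike at frame 4 is labelled abrupt and the ramp over frames 11 to 13 is labelled gradual. In practice the threshold values would have to be tuned on annotated data.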
About this article
Cite this article
Mohamed, B., Yassine, B.A. Enhanced video temporal segmentation using a Siamese network with multimodal features. SIViP 17, 4295–4303 (2023). https://doi.org/10.1007/s11760-023-02662-4
DOI: https://doi.org/10.1007/s11760-023-02662-4