Enhanced video temporal segmentation using a Siamese network with multimodal features

  • Original Paper
Signal, Image and Video Processing

Abstract

Shot boundary detection (SBD) is a critical pre-processing task for intelligent video analysis applications. In this study, we propose a novel multimodal approach to SBD that uses a Siamese network to learn a distance measure between audiovisual features. To extract relevant features from the power spectral density (PSD) of the audio stream, we combine tsfresh, a Python package for time series feature extraction, with principal component analysis (PCA), and use a GRU-Attention network to learn sequential semantic representations and spatial location information. For the visual modality, we employ a pre-trained EfficientNet model to extract and learn visual features. The proposed network learns a similarity score from the image embedding features and the PSD-based audio features; these scores are then used to build a signal representing audiovisual change. A global threshold detects transitions in this signal, and an adaptive threshold differentiates between the detected transition types (abrupt or gradual). In our experiments on standard datasets (TRECVID 2001 and TRECVID 2007), introducing audio features yielded a significant improvement over state-of-the-art models in terms of F1 score (92.43%) and gradual transition detection (90.08%).
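The two-stage thresholding described above can be illustrated with a minimal sketch. The abstract does not give the actual rules, so everything here is an assumption made for illustration: the input is taken to be a per-frame dissimilarity signal (for example, one minus the learned similarity score), the global threshold is set to the mean plus `k_global` standard deviations, and the adaptive threshold compares each detected peak with the largest value in a small neighbourhood around it. The function name `detect_transitions` and the parameters `k_global`, `window` and `ratio` are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def detect_transitions(dissimilarity, k_global=1.0, window=3, ratio=3.0):
    """Return (start, end, kind) tuples for runs above a global threshold.

    `dissimilarity` is a plain list of floats, one value per frame pair.
    All thresholding rules below are illustrative assumptions.
    """
    mu = mean(dissimilarity)
    # Assumed global threshold: mean plus k_global standard deviations.
    global_thr = mu + k_global * pstdev(dissimilarity)

    # Step 1: group consecutive samples above the global threshold into runs.
    runs, start = [], None
    for i, d in enumerate(dissimilarity):
        if d > global_thr and start is None:
            start = i
        elif d <= global_thr and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(dissimilarity) - 1))

    # Step 2: label each run using a threshold adapted to its neighbourhood.
    results = []
    for s, e in runs:
        peak = max(dissimilarity[s:e + 1])
        # Samples just before and just after the run (the run itself excluded).
        neighbours = (dissimilarity[max(0, s - window):s]
                      + dissimilarity[e + 1:e + 1 + window])
        local_level = max(neighbours) if neighbours else mu
        # Assumed rule: a peak far above its surroundings is an abrupt cut;
        # a peak that rises out of already elevated values is gradual.
        kind = "abrupt" if peak > ratio * local_level else "gradual"
        results.append((s, e, kind))
    return results


# Synthetic signal: one isolated spike, then a slow rise and fall.
signal = ([0.05] * 10 + [0.9] + [0.05] * 10
          + [0.25, 0.35, 0.5, 0.55, 0.5, 0.35, 0.25] + [0.05] * 10)
print(detect_transitions(signal))
# prints [(10, 10, 'abrupt'), (22, 26, 'gradual')]
```

On this synthetic signal the isolated spike is labelled abrupt because its neighbours stay near the baseline, while the ramp is labelled gradual because the values on either side of it are already elevated. Real dissimilarity signals are noisier, so the parameters would need tuning on annotated data.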

Author information

Corresponding author

Correspondence to Bouyahi Mohamed.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mohamed, B., Yassine, B.A. Enhanced video temporal segmentation using a Siamese network with multimodal features. SIViP 17, 4295–4303 (2023). https://doi.org/10.1007/s11760-023-02662-4

  • DOI: https://doi.org/10.1007/s11760-023-02662-4
