
Correlation based feature fusion for the temporal video scene segmentation task

  • Rodrigo Mitsuo Kishi
  • Tiago Henrique Trojahn
  • Rudinei Goularte
Article

Abstract

The available automatic temporal video scene segmentation methods still lack the efficacy required by most practical multimedia systems. The ones showing better results are multimodal and based on late fusion. Early fusion, on the other hand, has not been sufficiently investigated for this task because of the well-known barriers of the approach: correlation identification, temporal synchronization, and unique representation. This work presents a feature fusion method which deals with these difficulties and produces features that can enhance the efficacy of existing temporal video scene segmentation methods. The fusion is performed on single-modal Bag of Features vectors and is intended to enrich previously captured latent semantics by temporally clustering features, providing a unified representation of multiple temporally related features. The feature fusion process has been coupled with two off-the-shelf scene segmentation algorithms, yielding competitive results when compared with two state-of-the-art multimodal temporal scene segmentation methods. The results indicate that the proposed early fusion feature representation is a promising alternative for boosting video retrieval related tasks.
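The abstract describes early fusion performed on single-modal Bag of Features vectors, enriched by temporal clustering into a unified representation. The sketch below illustrates that general idea only; it is not the authors' pipeline, and the function name, shot counts, vocabulary sizes, and number of temporal clusters are assumptions chosen for illustration.

```python
# Illustrative sketch only: early fusion of single-modal Bag-of-Features (BoF)
# histograms via temporal clustering. All names and parameters are assumptions,
# not the authors' exact method.
import numpy as np
from sklearn.cluster import KMeans

def early_fuse(visual_bof, audio_bof, n_temporal_clusters=10):
    """Fuse per-shot visual and audio BoF histograms into one representation.

    visual_bof, audio_bof: arrays of shape (n_shots, vocab_size_*), one
    normalized histogram per shot, aligned by shot index.
    Returns an array of shape (n_shots, fused_dim) where each shot vector is
    the concatenated histograms enriched with a one-hot encoding of the
    temporal cluster the shot falls into.
    """
    # 1. Unique representation: concatenate the single-modal histograms.
    fused = np.hstack([visual_bof, audio_bof])

    # 2. Temporal clustering: append the (normalized) shot index so that
    #    temporally close shots with correlated features share a cluster.
    shot_idx = np.arange(len(fused)).reshape(-1, 1) / max(len(fused) - 1, 1)
    labels = KMeans(n_clusters=n_temporal_clusters, n_init=10,
                    random_state=0).fit_predict(np.hstack([fused, shot_idx]))

    # 3. Enrich each shot vector with its temporal-cluster membership.
    one_hot = np.eye(n_temporal_clusters)[labels]
    return np.hstack([fused, one_hot])

# Usage with random stand-in data: 200 shots, a 300-word visual vocabulary,
# and a 100-word audio vocabulary.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = rng.random((200, 300)); v /= v.sum(axis=1, keepdims=True)
    a = rng.random((200, 100)); a /= a.sum(axis=1, keepdims=True)
    print(early_fuse(v, a).shape)  # (200, 410)
```

The fused vectors could then be fed to any shot-based scene segmentation algorithm in place of the original single-modal features.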

Keywords

Multimedia · Video · Temporal scene segmentation · Early fusion

Acknowledgments

The authors would like to thank Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), Universidade Federal de Mato Grosso do Sul (UFMS), Universidade de São Paulo (USP) and Instituto Federal de São Paulo (IFSP) for financial support. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. The authors would also like to thank Dr. Lorenzo Baraldi for providing evaluation scripts. This research was developed using computational resources from the Centro de Ciências Matemáticas Aplicadas à Indústria (CeMEAI), financed by FAPESP.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. São Paulo University, São Carlos, Brazil
  2. Federal University of Mato Grosso do Sul, Três Lagoas, Brazil
  3. Federal Institute of São Paulo, São Carlos, Brazil