A Unified Framework for Shot Type Classification Based on Subject Centric Lens

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12356)

Abstract

Shots are key narrative elements of various videos, e.g., movies, TV series, and the user-generated videos thriving on the Internet. Shot types greatly influence how the underlying ideas, emotions, and messages are expressed. Analyzing shot types is therefore important for video understanding, a capability in increasing demand in real-world applications. Classifying shot type is challenging because it requires information beyond the raw video content, such as the spatial composition of a frame and the camera movement. To address these issues, we propose a learning framework, Subject Guidance Network (SGNet), for shot type recognition. SGNet separates the subject and background of a shot into two streams, which serve as separate guidance maps for scale and movement type classification, respectively. To facilitate shot type analysis and model evaluation, we build MovieShots, a large-scale dataset containing 46K shots from 7K movie trailers, annotated with their scale and movement types. Experiments show that our framework recognizes these two shot attributes accurately, outperforming all previous methods.
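To make the two-stream idea concrete, below is a minimal PyTorch sketch: the subject and background of a frame are separated (e.g., by a saliency or segmentation mask) and routed to two classification heads, one for scale and one for movement. All module names, feature sizes, and class counts here are illustrative assumptions for exposition, not the authors' released SGNet implementation; in particular, real movement classification needs temporal features across frames, which this single-frame sketch omits.

```python
# A minimal sketch of the subject/background two-stream design.
# Module names, feature sizes, and the 5-scale / 4-movement class counts
# are assumptions for illustration, not the authors' SGNet code.
import torch
import torch.nn as nn


class TwoStreamShotClassifier(nn.Module):
    """Predicts shot scale and movement from subject/background streams."""

    def __init__(self, feat_dim=512, num_scale=5, num_movement=4):
        super().__init__()
        # Shared frame encoder (placeholder for a real backbone, e.g. a ResNet).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.proj = nn.Linear(64 * 8 * 8, feat_dim)
        # Scale head reads the subject stream; movement head reads the background.
        self.scale_head = nn.Linear(feat_dim, num_scale)
        self.movement_head = nn.Linear(feat_dim, num_movement)

    def forward(self, frame, subject_mask):
        # frame: (B, 3, H, W); subject_mask: (B, 1, H, W) in [0, 1],
        # e.g. produced by an off-the-shelf saliency/segmentation model.
        subject = frame * subject_mask             # subject stream
        background = frame * (1.0 - subject_mask)  # background stream
        f_subj = self.proj(self.encoder(subject).flatten(1))
        f_bg = self.proj(self.encoder(background).flatten(1))
        return self.scale_head(f_subj), self.movement_head(f_bg)


# Usage: a random batch of 2 frames with matching masks.
model = TwoStreamShotClassifier()
frame = torch.rand(2, 3, 224, 224)
mask = torch.rand(2, 1, 224, 224)
scale_logits, movement_logits = model(frame, mask)
print(scale_logits.shape, movement_logits.shape)  # (2, 5) and (2, 4)
```

The design choice mirrored here is the one the abstract states: spatial composition of the subject is the cue for scale, while the background carries the camera-motion cues used for movement.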

Notes

Acknowledgement

This work is partially supported by the SenseTime Collaborative Grant on Large-scale Multi-modality Analysis (CUHK Agreement No. TS1610626 & No. TS1712093), the General Research Fund (GRF) of Hong Kong (No. 14203518 & No. 14205719), and Innovation and Technology Support Program (ITSP) Tier 2, ITS/431/18F.

Supplementary material

Supplementary material 1 (PDF, 486 KB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, Sha Tin, Hong Kong
  2. Communication University of China, Beijing, China
