Abstract
Machine learning is transforming the video editing industry. Recent advances in computer vision have elevated video editing tasks such as intelligent reframing, rotoscoping, color grading, and applying digital makeup. However, most solutions have focused on video manipulation and VFX. This work introduces the Anatomy of Video Editing, a dataset and benchmark, to foster research in AI-assisted video editing. Our benchmark suite focuses on video editing tasks beyond visual effects, such as automatic footage organization and assisted video assembly. To enable research on these fronts, we annotate more than 1.5M tags with concepts relevant to cinematography across 196,176 shots sampled from movie scenes. We establish competitive baseline methods and detailed analyses for each of the tasks. We hope our work sparks innovative research toward underexplored areas of AI-assisted video editing. Code is available at: https://github.com/dawitmureja/AVE.git.
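Shot-level tag annotations of this kind lend themselves to simple programmatic exploration. Below is a minimal, hypothetical sketch of tallying cinematography tags per attribute; the file name and column names (`shot_id`, `attribute`, `tag`) are illustrative assumptions, not the dataset's actual schema (see the linked repository for the real format).

```python
# Hypothetical exploration of AVE-style shot annotations. The file name,
# column names, and tag vocabulary are illustrative assumptions only.
import csv
from collections import Counter

def tag_histogram(path: str) -> Counter:
    """Count (attribute, tag) pairs, e.g. shot size or camera angle, across shots."""
    counts = Counter()
    with open(path, newline="") as f:
        # Assumed columns: shot_id, attribute, tag
        for row in csv.DictReader(f):
            counts[(row["attribute"], row["tag"])] += 1
    return counts

# Usage (hypothetical annotation file):
# histogram = tag_histogram("ave_annotations.csv")
# print(histogram.most_common(10))
```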
Notes
1. We crawled the movie scenes from the MovieClips YouTube channel.
2. We consider three intensify patterns: extreme-wide → wide → medium, wide → medium → close-up, and medium → close-up → extreme-close-up (illustrated in the sketch below).
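As a concrete illustration of Note 2, the sketch below (our own, not from the paper's codebase) checks whether a three-shot sequence of shot-scale labels matches one of the intensify patterns; the exact label strings are assumptions.

```python
# Minimal sketch of matching the three "intensify" shot-scale patterns
# from Note 2. The label names here are assumed, not the dataset's own.
from typing import List

INTENSIFY_PATTERNS = [
    ("extreme-wide", "wide", "medium"),
    ("wide", "medium", "close-up"),
    ("medium", "close-up", "extreme-close-up"),
]

def is_intensify(shot_scales: List[str]) -> bool:
    """Return True if a 3-shot sequence matches an intensify pattern."""
    return tuple(shot_scales) in INTENSIFY_PATTERNS

# Example: cutting from a wide establishing shot down to a close-up.
assert is_intensify(["wide", "medium", "close-up"])
assert not is_intensify(["close-up", "wide", "medium"])
```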
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Argaw, D.M., Heilbron, F.C., Lee, J.Y., Woodson, M., Kweon, I.S. (2022). The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13668. Springer, Cham. https://doi.org/10.1007/978-3-031-20074-8_12
DOI: https://doi.org/10.1007/978-3-031-20074-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20073-1
Online ISBN: 978-3-031-20074-8
eBook Packages: Computer Science (R0)