The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing

Argaw, Dawit Mureja; Heilbron, Fabian Caba; Lee, Joon-Young; Woodson, Markus; Kweon, In So

doi:10.1007/978-3-031-20074-8_12

Dawit Mureja Argaw¹²,¹³,
Fabian Caba Heilbron¹²,
Joon-Young Lee¹²,
Markus Woodson¹² &
In So Kweon¹³

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13668)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Machine learning is transforming the video editing industry. Recent advances in computer vision have leveled up video editing tasks such as intelligent reframing, rotoscoping, color grading, and applying digital makeup. However, most solutions have focused on video manipulation and VFX. This work introduces the Anatomy of Video Editing, a dataset and benchmark to foster research in AI-assisted video editing. Our benchmark suite focuses on video editing tasks beyond visual effects, such as automatic footage organization and assisted video assembling. To enable research on these fronts, we annotate more than 1.5M tags, covering concepts relevant to cinematography, for 196,176 shots sampled from movie scenes. We establish competitive baseline methods and provide detailed analyses for each of the tasks. We hope our work sparks innovative research in underexplored areas of AI-assisted video editing. Code is available at: https://github.com/dawitmureja/AVE.git.

Notes

1. We crawled the movie scenes from the MovieClips YouTube Channel.
2. We consider three intensify patterns: extreme-wide, wide, medium; wide, medium, close-up; and medium, close-up, extreme-close-up.
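The three intensify patterns in note 2 can be read as ordered triples of shot sizes. The following is a minimal sketch of how such patterns might be located in a sequence of per-shot size labels; the names `INTENSIFY_PATTERNS` and `find_intensify` and the label strings are assumptions made for illustration, not part of the authors' released code.

```python
# Hypothetical sketch: names and label spellings are illustrative assumptions.
# Each pattern is an ordered triple of shot sizes, as listed in note 2.
INTENSIFY_PATTERNS = {
    ("extreme-wide", "wide", "medium"),
    ("wide", "medium", "close-up"),
    ("medium", "close-up", "extreme-close-up"),
}


def find_intensify(shot_sizes):
    """Return the start index of every 3-shot window that matches a pattern."""
    return [
        i
        for i in range(len(shot_sizes) - 2)
        if tuple(shot_sizes[i:i + 3]) in INTENSIFY_PATTERNS
    ]


if __name__ == "__main__":
    scene = ["wide", "medium", "close-up", "extreme-close-up", "wide"]
    print(find_intensify(scene))
```

A sliding window over consecutive shots is only one possible reading; whether the patterns must be strictly consecutive is not stated in the note.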

Author information

Authors and Affiliations

Adobe Research, San Jose, USA
Dawit Mureja Argaw, Fabian Caba Heilbron, Joon-Young Lee & Markus Woodson
KAIST, Daejeon, South Korea
Dawit Mureja Argaw & In So Kweon

Corresponding author

Correspondence to Dawit Mureja Argaw.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 388 KB)

Cite this paper

Argaw, D.M., Heilbron, F.C., Lee, JY., Woodson, M., Kweon, I.S. (2022). The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13668. Springer, Cham. https://doi.org/10.1007/978-3-031-20074-8_12

DOI: https://doi.org/10.1007/978-3-031-20074-8_12
Published: 12 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20073-1
Online ISBN: 978-3-031-20074-8
eBook Packages: Computer Science, Computer Science (R0)
