Abstract
Video editing is an artistic process that involves at least two steps: selecting the most valuable footage in terms of visual quality and the importance of the action filmed, and cutting that footage into a brief, coherent visual story that is interesting to watch. In this chapter, the process is implemented in a purely data-driven manner. We describe a system that learns an editing style from samples of content created by professional editors, including motion picture masterpieces, and applies that style to cut non-professional videos, with the ability to mimic the individual style of selected reference samples. Visual semantic and aesthetic features are extracted by an ImageNet-trained convolutional neural network, and the editing controller can be trained by either an imitation learning algorithm or a reinforcement learning algorithm. At test time, the controller shows signs of observing basic cinematographic editing rules learned from the corpus of motion picture masterpieces. The loss function developed for the learning approaches can also be applied efficiently in a global optimisation setting of the automatic video editing problem using dynamic programming.
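The global optimisation setting mentioned at the end of the abstract can be illustrated with a minimal dynamic programming sketch. The abstract does not specify the loss function, so everything here is an assumption for illustration only: `scores` is a hypothetical per-shot value (for example, a quality estimate), `transition_cost` is a hypothetical penalty for cutting from one shot to another, and `select_shots` is an invented name. This is not the chapter's actual formulation.

```python
from typing import Callable, List, Sequence, Tuple


def select_shots(
    scores: Sequence[float],
    transition_cost: Callable[[int, int], float],
) -> Tuple[float, List[int]]:
    """Pick an ordered, non-empty subset of shots that maximises
    (sum of selected scores) - (sum of costs of consecutive cuts).

    Both `scores` and `transition_cost` are hypothetical stand-ins
    for a learned loss; they are not taken from the chapter.
    """
    n = len(scores)
    if n == 0:
        return 0.0, []

    best = [0.0] * n   # best[i]: best total for an edit ending with shot i
    prev = [-1] * n    # back-pointer to the previously selected shot

    for i in range(n):
        best[i] = scores[i]  # option: the edit starts at shot i
        for j in range(i):   # option: cut from an earlier shot j to shot i
            candidate = best[j] - transition_cost(j, i) + scores[i]
            if candidate > best[i]:
                best[i] = candidate
                prev[i] = j

    # Trace back from the best final shot to recover the selected sequence.
    end = max(range(n), key=lambda i: best[i])
    path: List[int] = []
    i = end
    while i != -1:
        path.append(i)
        i = prev[i]
    path.reverse()
    return best[end], path


if __name__ == "__main__":
    total, path = select_shots([3.0, -1.0, 2.0, 4.0], lambda j, i: 1.0)
    print(total, path)
```

The recurrence visits every pair of shots once, so the sketch runs in O(n²) time for n candidate shots; a real system would presumably replace the toy scores and constant cut cost with terms derived from the learned loss.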
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Podlesnyy, S.Y. (2021). Automatic Video Editing. In: Rychagov, M.N., Tolstaya, E.V., Sirotenko, M.Y. (eds) Smart Algorithms for Multimedia and Imaging. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-030-66741-2_6
DOI: https://doi.org/10.1007/978-3-030-66741-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66740-5
Online ISBN: 978-3-030-66741-2
eBook Packages: Engineering, Engineering (R0)