Coherent video generation for multiple hand-held cameras with dynamic foreground

Abstract

For many social events, such as public performances, multiple hand-held cameras may capture the same event. Such footage is often collected by amateur cinematographers who typically have little control over the scene and may not pay close attention to the camera. As a result, each individually captured video may fail to cover the entire duration of the event, or may lose track of interesting foreground content such as a performer. We introduce a new algorithm that synthesizes a single smooth video sequence of moving foreground objects captured by multiple hand-held cameras. This lets later viewers experience a cohesive narrative that transitions between different cameras, even though the input footage may be less than ideal. We first introduce a graph-based method for selecting a good transition route, which automatically selects good cut points in the hand-held videos so that smooth transitions can be created between the resulting video shots. We also propose a method to synthesize a smooth, photorealistic transition video between each pair of hand-held cameras that preserves dynamic foreground content during the transition. Our experiments demonstrate that our method outperforms previous state-of-the-art methods, which struggle to preserve dynamic foreground content.
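The graph-based route selection mentioned above can be illustrated with a small sketch. The paper's actual graph construction and cost terms are not given here; the dynamic program below uses invented per-window shot-quality and camera-transition penalties (`quality` and `trans_cost` are hypothetical names, not the authors' formulation) purely to show the flavor of choosing a minimum-cost camera route with automatic cut points:

```python
# Illustrative sketch only: a minimum-cost camera route over time windows,
# with invented cost terms (not the paper's actual energy).

def best_route(quality, trans_cost):
    # quality[c][t]: penalty for showing camera c during window t (lower = better)
    # trans_cost[p][c]: penalty for cutting from camera p to camera c
    n_cams, n_windows = len(quality), len(quality[0])
    INF = float("inf")
    dp = [[INF] * n_cams for _ in range(n_windows)]   # best cost of a route ending at (t, c)
    back = [[0] * n_cams for _ in range(n_windows)]   # backpointers to recover the route
    for c in range(n_cams):
        dp[0][c] = quality[c][0]
    for t in range(1, n_windows):
        for c in range(n_cams):
            for p in range(n_cams):
                cut = trans_cost[p][c] if p != c else 0.0
                cost = dp[t - 1][p] + quality[c][t] + cut
                if cost < dp[t][c]:
                    dp[t][c] = cost
                    back[t][c] = p
    # Backtrack from the cheapest final state to recover the camera per window.
    c = min(range(n_cams), key=lambda k: dp[n_windows - 1][k])
    route = [c]
    for t in range(n_windows - 1, 0, -1):
        c = back[t][c]
        route.append(c)
    return route[::-1]

# Two cameras, two windows: camera 0 covers the first window well and
# camera 1 the second, so the cheapest route cuts once.
print(best_route([[0.0, 5.0], [5.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]))  # [0, 1]
```

The transition penalty discourages gratuitous cuts, while the per-window quality term rewards switching to whichever camera best covers the foreground at that moment; the real method would additionally require that a photorealistic transition can be synthesized at each chosen cut point.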


Acknowledgements

This work was supported by a Research Establishment Grant of Victoria University of Wellington (Project No. 8-1620-216786-3744) and a Victoria Research Excellence Award.

Author information

Corresponding author

Correspondence to Fang-Lue Zhang.

Additional information

Fang-Lue Zhang is currently a lecturer at Victoria University of Wellington, New Zealand. He received his bachelor's degree from Zhejiang University, Hangzhou, China, in 2009, and his doctoral degree from Tsinghua University, Beijing, China, in 2015. His research interests include image and video editing, computer vision, and computer graphics. He is a member of IEEE and ACM. He received the Victoria Early-Career Research Excellence Award in 2019.

Connelly Barnes is a senior researcher at Adobe Research. Previously, he was an assistant professor at the University of Virginia. He received his Ph.D. degree from Princeton University in 2011. He develops techniques for efficiently manipulating visual data in computer graphics by using semantic information from computer vision, with applications in computational photography, image editing, art, and hiding visual information. Many computer graphics algorithms are more useful if they are interactive; therefore, he also focuses on efficiency and optimization, including some compiler technologies.

Hao-Tian Zhang is currently a Ph.D. student at Stanford University. He received his B.S. degree from Tsinghua University in 2017. His research interests include image and video editing, and physically-based simulation.

Junhong Zhao is a postdoctoral research fellow at the Computational Media Innovation Centre (CMIC) at Victoria University of Wellington. She completed her doctoral degree in 2015 at the Institute of Electronics, Chinese Academy of Sciences. She worked in the Human-Computer Speech Interaction Lab of Tsinghua University on computer-assisted language learning using speech recognition and computer graphics techniques (2011–2015). She then moved to the CAS Institute of Information Engineering, where her research focused on machine learning and its applications to image understanding and audio signal processing (2015–2017).

Gabriel Salas is a research assistant and undergraduate student at the School of Engineering and Computer Science, Victoria University of Wellington. His research focuses on computer graphics and image processing.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, FL., Barnes, C., Zhang, HT. et al. Coherent video generation for multiple hand-held cameras with dynamic foreground. Comp. Visual Media 6, 291–306 (2020). https://doi.org/10.1007/s41095-020-0187-3

Keywords

  • video editing
  • smooth temporal transitions
  • dynamic foreground
  • multiple cameras
  • hand-held cameras