MediaSync, pp. 167–190

Automated Video Mashups: Research and Challenges

  • Mukesh Kumar Saini
  • Wei Tsang Ooi


The proliferation of video cameras, such as those embedded in smartphones and wearable devices, has made it increasingly easy for users to film interesting events (such as public performances, family events, and vacation highlights) in their daily lives. Moreover, multiple cameras often capture the same event at the same time, from different views. Concatenating segments of the videos produced by these cameras along the event timeline forms a video mashup, which can depict the event in a less monotonous and more informative manner. It is, however, inefficient and costly to create a video mashup manually. This chapter introduces the problem of automated video mashup, surveys the state-of-the-art research in this area, and outlines the set of open challenges that remain to be solved. It provides a comprehensive introduction for practitioners, researchers, and graduate students who are interested in the research and challenges of automated video mashup.
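The core assembly step described above, concatenating segments from several synchronized cameras along the event timeline, can be sketched as a simple selection problem. The sketch below is an illustrative toy, not the method of any system surveyed in this chapter: it assumes each camera already has a quality score per one-second time slice of a common (pre-synchronized) timeline, picks the best camera for each slice, and merges consecutive slices from the same camera into an edit decision list.

```python
def naive_mashup(scores: dict[str, list[float]]) -> list[tuple[str, int, int]]:
    """Return an edit decision list of (camera, start_slice, end_slice) segments.

    `scores` maps a camera name to one quality score per time slice; all
    cameras are assumed to cover the same synchronized timeline. The names
    and per-slice scoring interface are hypothetical, for illustration only.
    """
    n = len(next(iter(scores.values())))
    # Pick the highest-scoring camera for each time slice.
    best = [max(scores, key=lambda cam: scores[cam][t]) for t in range(n)]
    # Merge runs of consecutive slices from the same camera into segments.
    edl, start = [], 0
    for t in range(1, n + 1):
        if t == n or best[t] != best[start]:
            edl.append((best[start], start, t))
            start = t
    return edl

scores = {
    "camA": [0.9, 0.8, 0.2, 0.1],
    "camB": [0.3, 0.4, 0.7, 0.8],
}
print(naive_mashup(scores))  # [('camA', 0, 2), ('camB', 2, 4)]
```

Real systems replace the per-slice score with models of video quality, cinematography rules (e.g. avoiding too-frequent cuts), and view diversity, which turns this greedy selection into an optimization over the whole timeline.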


Automated video mashup · Video clip synchronization · Video quality analysis · Cinematography rules · Mashup quality evaluation



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Indian Institute of Technology Ropar, Rupnagar, India
  2. National University of Singapore, Singapore
