International Journal of Computer Vision, Volume 93, Issue 1, pp 73–100

Joint Multi-Layer Segmentation and Reconstruction for Free-Viewpoint Video Applications

  • Jean-Yves Guillemaut
  • Adrian Hilton


Current state-of-the-art image-based scene reconstruction techniques are capable of generating high-fidelity 3D models when used under controlled capture conditions. However, they are often inadequate in more challenging environments such as sports scenes with moving cameras. Algorithms must be able to cope with relatively large calibration and segmentation errors, as well as input images separated by a wide baseline and possibly captured at different resolutions. In this paper, we propose a technique which, under these challenging conditions, is able to efficiently compute a high-quality scene representation via graph-cut optimisation of an energy function combining multiple image cues with strong priors. Robustness is achieved by jointly optimising scene segmentation and multiple view reconstruction in a view-dependent manner with respect to each input camera. Joint optimisation prevents the propagation of errors from segmentation to reconstruction that often occurs with sequential approaches. View-dependent processing increases tolerance to errors in through-the-lens calibration compared to global approaches. We evaluate our technique on challenging outdoor sports scenes captured with manually operated broadcast cameras, as well as several indoor scenes with natural backgrounds. A comprehensive experimental evaluation, including qualitative and quantitative results, demonstrates the accuracy of the technique for high-quality segmentation and reconstruction and its suitability for free-viewpoint video under these difficult conditions.


Keywords: Segmentation, Reconstruction, Free-viewpoint video, Graph-cuts
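
The graph-cut optimisation referred to in the abstract can be illustrated with a much simpler problem. The sketch below is not the paper's multi-layer, multi-view energy: it labels a one-dimensional row of pixels as background (0) or foreground (1) by minimising a sum of per-pixel data costs and a constant penalty for neighbouring pixels with different labels, using a small Edmonds-Karp max-flow routine. The function names, the cost values and the smoothness weight are all hypothetical and chosen only for illustration.

```python
from collections import deque


def max_flow_min_cut(capacity, source, sink):
    """Edmonds-Karp max-flow on a dense capacity matrix.

    Returns the flow value and the set of nodes still reachable from the
    source in the final residual graph (the source side of a minimum cut).
    """
    n = len(capacity)
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck flow along the path.
        v = sink
        while v != source:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
    source_side = {i for i in range(n) if parent[i] != -1}
    return flow, source_side


def segment_chain(cost_bg, cost_fg, smoothness):
    """Binary labelling of a 1D pixel chain by minimum cut (illustrative only).

    Energy = sum of data costs + smoothness * (number of label changes).
    """
    n = len(cost_bg)
    source, sink = n, n + 1
    capacity = [[0] * (n + 2) for _ in range(n + 2)]
    for p in range(n):
        capacity[source][p] = cost_bg[p]  # cut if p is labelled background
        capacity[p][sink] = cost_fg[p]    # cut if p is labelled foreground
    for p in range(n - 1):
        capacity[p][p + 1] = smoothness   # cut if neighbours disagree
        capacity[p + 1][p] = smoothness
    energy, source_side = max_flow_min_cut(capacity, source, sink)
    labels = [1 if p in source_side else 0 for p in range(n)]
    return labels, energy


# Made-up costs: pixel 3 weakly prefers background, but its neighbours
# strongly prefer foreground, so the smoothness term overrides it.
labels, energy = segment_chain([1, 1, 9, 2, 9, 1], [9, 9, 1, 3, 1, 9], 2)
print(labels, energy)  # -> [0, 0, 1, 1, 1, 0] 12
```

In this toy case the isolated weak preference at pixel 3 is overruled by the smoothness prior, which is the general mechanism by which a graph-cut energy trades image evidence against a prior. The technique described in the abstract additionally optimises over multiple layers and depth, which this binary sketch does not attempt.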





Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
