1 Introduction

With the advent of autonomous vehicles and rising demand for immersive content in augmented and virtual reality, understanding dynamic scenes with multiple interacting people has become increasingly important. Here, understanding refers to reconstructing, segmenting and temporally aligning the reconstructions over time. In this paper we propose a framework for 4D dynamic scene understanding of scenes with multiple people from multi-view videos to address this demand. By “4D scene understanding” we refer to a unified framework that provides 3D modelling, motion/flow estimation and semantic instance segmentation on a per-frame basis for an entire sequence. Recent advances in pose estimation (Cao et al. 2017; Tome et al. 2017) and recognition (He et al. 2017; Xie et al. 2016; Chen et al. 2016) using deep learning have achieved excellent performance on complex images. We exploit these advances to obtain 3D human-pose and an initial semantic instance segmentation from multiple view videos to bootstrap the detailed 4D understanding and modelling of complex dynamic scenes captured with multiple static or moving cameras (see Fig. 1). Joint 4D reconstruction allows us to understand how people move and interact, giving contextual information in general scenes.

Fig. 1
figure 1

Joint 4D semantic instance segmentation and reconstruction exploiting 3D human-pose of interacting people in dynamic scenes. Shades of pink in the segmentation represent instances of people. Colour assigned to the reconstruction of frame 80 is reliably propagated to frame 120 using the proposed temporal coherence (Color figure online)

Existing multi-task methods for scene understanding perform per frame joint reconstruction and semantic instance segmentation from a single image (Kendall et al. 2017), showing that joint estimation of both reconstruction and segmentation can improve the quality of each task. Other methods have fused semantic segmentation with reconstruction (Mustafa and Hilton 2017) or flow estimation (Sevilla-Lara et al. 2016) demonstrating significant improvement in both semantic segmentation and reconstruction/scene flow. Hence, we exploit the advantages of performing joint optimization in this paper to understand dynamic scenes with multiple interacting people by simultaneous reconstruction, flow and segmentation estimation from multiple view video.

The first category of methods in joint estimation for dynamic scenes generates segmentation and reconstruction from multi-view video (Mustafa et al. 2016) and monocular video (Floros and Leibe 2016; Larsen et al. 2007) without any output scene flow estimate. The second category of methods segments and estimates motion in 2D (Sevilla-Lara et al. 2016), or gives spatio-temporally aligned segmentation (Chiu and Fritz 2013; Luo et al. 2015; Djelouah et al. 2016) from multiple views without inferring the shape of the objects. The third category of methods in 4D temporally coherent reconstruction either aligns meshes using correspondence information between consecutive frames (Zanfir and Sminchisescu 2015) or extracts the scene flow by estimating the pairwise surface correspondence between reconstructions at successive frames (Wedel et al. 2011; Basha et al. 2010). However, methods in all three of these categories do not exploit semantic information of the scene, as seen in Table 1. The fourth category of joint estimation methods exploits semantic information by introducing joint semantic segmentation and reconstruction for general dynamic scenes (Hane et al. 2016; Xie et al. 2016; Kundu et al. 2014; Ulusoy et al. 2017; Mustafa and Hilton 2017) and street scenes (Engelmann et al. 2016; Vineet et al. 2015). However, these methods give per-frame semantic segmentation and reconstruction with no motion estimate. This leads to unaligned geometry and pixel-level incoherence in both segmentation and reconstruction for dynamic sequences, and does not work for complex scenes with multiple interacting people such as stadium sports. Other methods for semantic video segmentation classify objects exploiting spatio-temporal semantic information (Tsai et al. 2016; Luo et al. 2015; Chiu and Fritz 2013) but do not perform reconstruction. Hence, none of the existing methods in the literature gives 4D temporally coherent reconstruction and instance segmentation on complex dynamic scenes with multiple interacting people. We address this gap in the literature by proposing a novel framework for joint multi-view 4D temporally coherent reconstruction, semantic instance segmentation and flow estimation for general dynamic scenes, fully automatically without any manual intervention.

Table 1 Comparison of tasks state-of-the-art methods are solving against the proposed method

Methods in the literature have exploited human-pose information to improve results in semantic segmentation (Xia et al. 2017) and reconstruction (Huang et al. 2017). However, existing joint estimation/optimisation methods for dynamic scenes (with multiple people) do not exploit human-pose information, often detecting interacting people as a single object (Mustafa and Hilton 2017). Table 1 shows a comparison between the tasks performed by state-of-the-art methods. In addition to performing joint multi-person 4D temporally coherent reconstruction and semantic instance segmentation, we exploit advances in 3D human-pose estimation to propose the first approach for 4D (3D in time) human-pose based scene understanding of general dynamic scenes with multiple interacting dynamic objects (people) with complex non-rigid motion. 3D human-pose estimation makes full use of multi-view information and is used as a prior to constrain the shape, segmentation and motion in space and time in the joint scene understanding estimation, improving results on challenging scenes in the wild, including sports. Sports reconstruction presents a challenging problem with a small number (6-12) of independently, manually operated panning and zooming broadcast cameras, sparsely located around the stadium to cover a large area with multiple players. This results in multi-view wide-baseline capture at different resolutions with motion blur due to player and camera motion. The framework enables high-quality reconstruction and semantic instance segmentation under multi-player occlusions in sports captured from wide-baseline moving cameras, overcoming limitations of previous multi-view reconstruction algorithms. The contributions of the paper are:

  • 4D scene understanding for multiple interacting people in dynamic scenes from multi-view video.

  • Joint instance-level segmentation, temporally coherent reconstruction and scene flow with human-pose priors.

  • Robust 4D temporal coherence and per-pixel semantic coherence for dynamic scenes containing interactions.

  • An extensive performance evaluation against 15 state-of-the-art methods demonstrating improved semantic segmentation, reconstruction and motion estimation.

This paper is an extended version of ICCV 2019 paper (Mustafa et al. 2019), which includes detailed information about the method, ablation studies, performance evaluation on multi-person benchmarks and results on challenging sports datasets.

2 Related Work

Humans extract rich information from the world around them, and for autonomous machines (self-driving cars, robots) to navigate safely around people, machines must be able to perceive scenes as humans do. Scene understanding refers to the simultaneous extraction of 3D reconstruction, semantic information of objects and motion estimation, illustrated in Fig. 1. Scene understanding has become increasingly popular in the past five years, bringing machine perception of the real world one step closer to human-level understanding. This section provides a review of recent notable methods for scene understanding tasks (individually and jointly) for single-view (Table 2) and multi-view (Table 3) video.

2.1 Scene Understanding for Single-View Video-Table 2

Semantic segmentation Fully Convolutional Networks (Shelhamer et al. 2015) extract deep per-pixel CNN features followed by classification of each pixel in the image for semantic segmentation. Deeplabv3+ (Chen and Zhu 2018) improved results by using an encoder-decoder architecture with ResNet and atrous spatial pyramid pooling to classify each pixel. Mask RCNN (He et al. 2017) gives semantic instance segmentation on challenging scenes by using a Region Proposal Network that shares full-image convolutional features with the detection network, and by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. An improved mask is predicted in Li et al. (2020) by effectively combining instance-level information with lower-level, fine-grained semantic information. A flow alignment module is proposed in Chen et al. (2020) to learn semantic flow between feature maps of adjacent levels and broadcast high-level features to high-resolution features for improved semantic segmentation.

Depth estimation Fusion4D (Dou et al. 2016) introduced a method for real-time online reconstruction of a video sequence from RGB images, depth and high-quality segmentation as input, and is restricted to relatively simple indoor scenes. The proposed method only needs RGB images as input and works for crowded indoor and outdoor scenes with multiple people. A single multi-scale convolutional network architecture was proposed for depth prediction and semantic labelling in Eigen and Fergus (2015). Unsupervised monocular depth estimation was performed by learning depth from a stereo pair in Godard et al. (2017). Traditional SfM was used in a self-supervised method (Klodt and Vedaldi 2018) to learn and predict depth from monocular video. A recent method (Wang et al. 2020) decomposes a scene into semantic segments and then predicts a scale- and shift-invariant depth map for each semantic segment in a canonical space from a single image. An optimisation-based depth estimation method was proposed in Rossi et al. (2020) exploiting the underlying piece-wise planarity of scenes, and another method (Rodriguez and Mikolajczyk 2020) estimates depth from a single image, bridging the domain gap by leveraging semantic predictions and low-level edge features to provide guidance for the target domain.

Table 2 Methods for scene understanding tasks for single-view video
Table 3 Methods for scene understanding tasks for multi-view video

Motion estimation One of the first methods to construct CNNs capable of estimating optical flow as a supervised learning task was introduced in Dosovitskiy et al. (2015). A CNN was proposed to estimate human flow fields from pairs of images in Ranjan et al. (2018). Deep Epipolar Flow (Zhong et al. 2019) performed unsupervised flow estimation by introducing global epipolar constraints into network learning. A transformer encoder-decoder network was proposed along with a memory-based dictionary, which aims to preserve the global motion patterns in training data to improve flow estimation for humans (Cai et al. 2020).

Scene understanding Simultaneous semantic instance segmentation and depth estimation was proposed (Kendall et al. 2018) from single-view video exploiting uncertainties in a multi-task learning framework. Unsupervised methods for joint depth, flow and motion segmentation, and for joint depth and semantic segmentation, from monocular video were proposed in Ranjan et al. (2019) and Chen et al. (2019) respectively. A recent method (Zeng and Gevers 2020) reconstructs and semantically segments 3D indoor scenes from a single panorama image; however, this method only works for static scenes.

All of these methods either perform a single task (reconstruction, segmentation or flow estimation) or, in the case of joint scene understanding methods, work for single-view video only. In contrast, the proposed method solves multiple tasks together, giving a full scene understanding from multi-view videos by jointly estimating semantic instance segmentation, depth and motion while exploiting human-pose information.

2.2 Scene Understanding for Multi-View Videos-Table 3

Segmentation Co-temporal multi-view segmentation was proposed in Djelouah et al. (2016) with no semantic information. A multi-view semantic segmentation network was designed in Guerry et al. (2017) for the consistent labelling of static scenes. Semantic information across space and time was used in a joint framework (Mustafa and Hilton 2017) for multi-view semantic reconstruction of dynamic scenes. Abhijit et al. (2020) fuse features from multiple per-view predictions on 3D mesh vertices to predict mesh semantic segmentation labels for 3D semantic segmentation, but this method works only for static scenes.

Reconstruction Temporally coherent reconstruction was obtained in Mustafa et al. (2016) from multi-view videos. An end-to-end deep learning architecture was introduced in Yao et al. (2018) for depth map inference from multi-view images. Trager et al. (2019) defined a new characterization of multi-view geometry by proposing a coordinate-free description of Carlsson-Weinshall duality. A recent approach estimates high-fidelity 3D human pose and volumetric reconstruction from multiple camera views by using a dual loss in a generative adversarial network (Gilbert et al. 2020). Another deep learning approach (Bi et al. 2020) reconstructs scene appearance from unstructured images captured under collocated point lighting using reflectance volumes. However, both of these approaches give per-frame reconstructions which are unaligned in time and carry no semantic information.

Motion estimation Few methods have been proposed for multi-view motion estimation. The first method to estimate motion and stereo from multi-view images was proposed by Szeliski (1999). Scene flow was obtained from multiple light-field images in Mustafa et al. (2017) exploiting epipolar constraints. Recently, a network to jointly learn spatio-temporal correspondence for stereo matching and flow estimation was introduced (Lai et al. 2019).

Scene understanding Multi-view scene understanding for static scenes was introduced in Hane et al. (2016) through a joint formulation of depth and semantics. 3D semantic scene segmentation of indoor RGB-D environments was performed in Dai and Nießner (2018) using a joint 3D multi-view prediction network.

All of the above methods either focus on a single task (segmentation, reconstruction or flow estimation) exploiting multiple views, or work for static scenes giving per-frame reconstructions and segmentations unaligned in time. The proposed method performs joint semantic instance segmentation, 4D reconstruction and motion estimation of dynamic scenes with multiple interacting people, addressing the gap in the literature for full scene understanding from multi-view videos. Also, most of the methods described above use deep learning to solve reconstruction, flow estimation and segmentation, whereas the proposed method is an optimisation framework which requires neither ground-truth data for training nor manual intervention for 4D temporally coherent semantic reconstruction of dynamic scenes.

Fig. 2
figure 2

4D dynamic scene understanding framework for multiple interacting people in the scene from multi-view video

3 Joint 4D Dynamic Scene Understanding

Overview:

This section describes our approach to joint 4D scene understanding, with different stages shown in Fig. 2. The overview of the proposed method is as follows:

  • Input The input to the joint optimisation is multi-view video. The proposed algorithm requires synchronised cameras; it works for all datasets which are synchronised either through audio information (Hasler et al. 2009) or a time code generator. Slight errors introduced through audio synchronisation are handled well by the proposed method. More details on the datasets are given in the Experiments section.

  • Initial Semantic Instance Segmentation - Sect. 3.1:

    Initial semantic labels are estimated for each pixel in the image per-view using state-of-the-art semantic instance segmentation (He et al. 2017).

    An initial reconstruction is obtained for each object in the scene combining the initial semantic instance segmentation with the sparse reconstruction (Mustafa and Hilton 2017). Semantic information for each view is combined with sparse 3D feature correspondence between views to obtain an initial semantic 3D reconstruction. This initial reconstruction is inaccurate due to the errors in the per-view semantic information which is combined across views.

  • Key-frame Detection - Sect. 3.2: To achieve stable long-term 4D understanding, a set of unique key-frames is detected exploiting multi-view information. For the final temporally coherent 4D reconstruction, key-frames are detected for the entire sequence exploiting shape, 3D pose and semantic information.

  • 3D Human Pose Estimation and Estimation of Sparse Temporal Tracks - Section 3.3: 3D human pose is estimated for each person in the scene to constrain the joint per-view optimisation that estimates semantic instance segmentation, motion and 3D reconstruction. Sparse temporal feature tracks are obtained per view between key-frames to initialise the joint estimation. This allows robust 4D understanding in the presence of large non-rigid motion between frames.

  • Joint Estimation of Semantic Instance and Shape - Section 3.3: The initial reconstruction and semantic instance segmentation is refined for each object instance per-view through novel joint optimisation of segmentation, shape, and motion constrained by 3D human-pose. Key-frames are used to introduce robust temporal coherence in the joint estimation across long-sequences with large non-rigid deformation. Per-view information is merged into a single 3D model using Poisson surface reconstruction (Kazhdan et al. 2006).

  • 4D Scene Understanding - Section 3.4: The process is repeated for the entire sequence and is combined across views and in time to obtain temporally coherent 4D semantic reconstruction for dynamic scenes. Depth, motion and semantic instance segmentation is combined across views between frames for 4D temporally coherent reconstruction and dense per-pixel semantic coherence for final 4D understanding of scenes. Figure 2 shows segmentation, reconstruction and tracking of both static and dynamic objects in the scene.

3.1 Initial Semantic Instance Segmentation

Existing methods for semantic segmentation do not give instance-level segmentation of the scene. Previous approaches either segment the image followed by a per-segment object category classification (Mostajabi et al. 2015; Gupta et al. 2014), which can propagate errors from the segmentation; extract deep per-pixel CNN features followed by per-pixel classification (Farabet et al. 2013; Hariharan et al. 2015), leading to segmentations with fuzzy boundaries and spatially disjoint regions; or predict semantic segmentation from raw pixels (Shelhamer et al. 2015) followed by conditional random fields (Kundu et al. 2016; Zheng et al. 2015) to improve segmentation. However, none of these methods gives instance segmentation of the scene. A recent state-of-the-art method (He et al. 2017) gives a good estimate of initial semantic instance segmentation masks (probability estimates of various classes at each pixel) from complex single images. We employ this method (He et al. 2017) to predict initial semantic unary potentials for each view, using parameters pre-trained on MS-COCO (Lin et al. 2014) and PASCAL VOC12 (Everingham et al. 2012). However, this pre-segmentation can be replaced with any state-of-the-art method, as the framework refines the semantic labels and is not sensitive to errors in the initialisation; poor-quality initial semantics only increases the computation cost.
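As an illustration only, the following is a minimal sketch of how such an initial per-view instance segmentation could be obtained with a pre-trained Mask R-CNN from torchvision; the score threshold and the use of the COCO person class are assumptions and not part of the original pipeline.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Hypothetical sketch: a pre-trained Mask R-CNN used to bootstrap the
# per-view semantic instance segmentation (the paper uses He et al. 2017).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def initial_instance_masks(image_path, score_thresh=0.7):
    """Return per-instance soft masks and class labels for one view."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        pred = model([image])[0]          # dict with boxes, labels, scores, masks
    keep = pred["scores"] > score_thresh  # discard low-confidence detections
    masks = pred["masks"][keep, 0]        # (N, H, W) per-pixel probabilities
    labels = pred["labels"][keep]         # COCO class ids (1 = person)
    return masks, labels
```

The resulting soft masks can then be thresholded per view and combined with the sparse reconstruction as described above.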

3.2 Key-Frame Detection

Previous work (Newcombe et al. 2015; Mustafa et al. 2017) showed that sparse key-frames allow robust long-term correspondence for 4D reconstruction. In this work we introduce the additional use of pose in the detection, and sparse temporal feature correspondence across key-frames, to prevent the accumulation of errors in long sequences. Key-frame detection is used to improve the long-term temporal coherence of the proposed joint semantic instance segmentation and 4D reconstruction. The 3D meshes are aligned for frames between consecutive key-frames \(K_i\) and \(K_{i+1}\) and between the \(N_{K}\) key-frames to obtain a full 4D scene reconstruction for the sequence, where \(N_{K}\) is the total number of key-frames. 4D scene alignment between key-frames is explained in Sect. 3.4.

Key-frames are detected exploiting the similarity between frames, 3D pose and shape. The distance between two frames is also taken into account when detecting key-frames. All the metrics used to detect key-frames are defined below:

3.2.1 Sparse Correspondence Metric (\(M_{i,j}^{c}\))

This measures appearance similarity between frames for each object, defined as the ratio of the number of sparse temporal correspondences Q to the total number of features R. SFD features are detected for each temporal frame and brute force matching (Mustafa et al. 2019) is performed to estimate the correspondences. The term is defined below:

$$\begin{aligned} M_{i,j}^{c} = \frac{2Q_{i,j}^{c}}{R_{i}^{c} + R_{j}^{c}} \end{aligned}$$

where \(Q_{i,j}^{c}\) is the number of sparse temporal correspondences between frames i and j for view c, and \(R_{i}^{c}\) and \(R_{j}^{c}\) are the total numbers of features for frames i and j respectively in view c.

3.2.2 3D Pose Metric (\(P_{i,j}^{c}\))

3D human poses are estimated for each time frame (Tomè et al. 2018) and this metric measures the distance between the regularised human-poses:

$$\begin{aligned} P_{i,j}^{c} = \frac{\left\| P_{i} - P_{j} \right\| }{P_{max}^{c}} \end{aligned}$$

where \(j> i\) and \(P_{max}^{c}\) is the maximum change of pose between frames for view c. \(P_{max}^{c}\) is calculated by measuring the distance between regularised poses for 20 frames and choosing the maximum value. This term ensures that the distance of poses between key-frames is limited.

3.2.3 Semantic Metric (\(L_{i,j}^{c}\))

This term checks the semantic similarity between two frames by comparing the number of pixels with the same semantic labels. An affine warp (Evangelidis and Psarakis 2008) is used to align semantic regions to measure semantic similarity between two frames. The metric is defined as the ratio of the number of pixels with the same class label \(z_{i,j}^{c}\) to the pixels in the segmented region \(y_{i,j}^{c}\):

$$\begin{aligned} L_{i,j}^{c} = \frac{z_{i,j}^{c}}{y_{i,j}^{c}} \end{aligned}$$

3.2.4 Distance Metric (\(D_{i,j}^{c}\))

This metric measures the distance between frames and ensures that the distance between two key-frames does not become large, as this would introduce errors in the final reconstruction and segmentation. The term is defined as:

$$\begin{aligned} D_{i,j}^{c} = \frac{j-i}{D_{max}^{c}} \end{aligned}$$

where \(j> i\) and \(D_{max}^{c}\) is the maximum number of frames between key-frames for view c. This term ensures that the distance between two key-frames does not exceed \(D_{max}^{c}\). This is set to 100 throughout this work.

3.2.5 Shape Metric (\(I_{i,j}^{c}\))

This is defined as the ratio of the intersection area h of the aligned segmentations or silhouettes (Evangelidis and Psarakis 2008) to the union area a:

$$\begin{aligned} I_{i,j}^{c} = \frac{h_{i,j}^{c}}{a_{i,j}^{c}} \end{aligned}$$

This gives a measure of shape or silhouette overlap for an object between frames i and j for view c. The silhouettes are projections of the initial coarse 3D reconstruction in each view.

The metrics defined above are combined to detect key-frames using the key-frame similarity metric, defined as:

$$\begin{aligned} KS_{i,j} = 1 - \frac{1}{5N_{v}}\sum _{c=1}^{N_{v}} ( M_{i,j}^{c} + L_{i,j}^{c} + D_{i,j}^{c} + P_{i,j}^{c} + I_{i,j}^{c})\nonumber \\ \end{aligned}$$
(1)

Key-frame detection exploits sparse correspondence (\(M_{i,j}^{c}\)), pose (\(P_{i,j}^{c}\)), shape (\(I_{i,j}^{c}\)), semantic label (\(L_{i,j}^{c}\)) and distance (\(D_{i,j}^{c}\)) information across views \(N_{v}\) between frames i and j for each object in view c, to improve the long-term temporal coherence of the proposed method by using similar frames across the sequence, illustrated in Fig. 3. All frames with similarity \(KS_{i,j}>0.75\) in a sequence are selected as key-frames, defined as \(K = \{ {K^{1}, K^{2}, \ldots , K^{N_{K}}}\}\) where \(N_{K}\) is the number of key-frames. We also define \(N_{f}^{i}\), the number of frames between \(K_{i}\) and \(K_{i+1}\).
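As a minimal sketch, the key-frame similarity of Eq. (1) for a single view can be computed from the pre-computed quantities of Sects. 3.2.1-3.2.5 as follows; the function and argument names are hypothetical.

```python
import numpy as np

def keyframe_similarity(Q, R_i, R_j, P_i, P_j, P_max, z, y, h, a, i, j, D_max=100):
    """Key-frame similarity KS between frames i and j for a single view;
    Eq. (1) averages this quantity over all N_v views.
    Q: number of sparse temporal matches, R_i/R_j: feature counts per frame,
    P_i/P_j: regularised 3D poses (arrays), P_max: maximum pose change,
    z/y: pixels with the same label / segmented pixels (semantic metric),
    h/a: silhouette intersection / union areas (shape metric)."""
    M = 2.0 * Q / (R_i + R_j)              # sparse correspondence metric
    P = np.linalg.norm(P_i - P_j) / P_max  # 3D pose metric
    L = z / y                              # semantic metric
    D = (j - i) / D_max                    # distance metric
    I = h / a                              # shape (silhouette overlap) metric
    return 1.0 - (M + L + D + P + I) / 5.0

# Frames with KS > 0.75 (averaged over views) are selected as key-frames.
```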

Fig. 3
figure 3

An illustration of key-frame detection and matching across a short sequence for stable long-term temporal coherence

3.3 Joint Per-View Optimisation

Sparse reconstruction is obtained for each frame from multiple views using Colmap (Schönberger and Frahm 2016; Schönberger et al. 2016). The multi-view cameras are synchronised either directly or in post-processing using the audio information. The sparse point cloud is clustered in 3D (Rusu 2009), with each cluster representing a unique foreground object. The per-view semantic instance segmentation obtained in the previous step is combined across views with the sparse reconstruction to obtain an initial coarse reconstruction \(\mathscr {R_{i}}\) for each person in the frame, where i indexes the objects in each frame (Mustafa and Hilton 2017). This initial semantic coarse reconstruction \(\mathscr {R_{i}}\) is refined through a joint scene understanding optimisation. The optimisation is performed per-view to obtain depth, semantic segmentation and flow for each view.

3.3.1 Spatio-Temporal Coherence in the Optimisation

Constraints are applied on the spatial and temporal neighborhood to enforce consistency in the appearance, semantic label, 3D human pose and motion across views and time.

Spatial coherence Multi-view spatial coherence is enforced in the optimisation such that the motion, shape, appearance, 3D pose and class labels are consistent across views, using an 8-connected spatial neighbourhood \(\psi _S\) for each camera view such that each pixel pair (p, q) belongs to the same frame.

Temporal coherence Temporal coherence is enforced in the joint optimisation by enforcing coherence across key-frames (Sect. 3.2) to handle large non-rigid motion and to reduce errors in sequential alignment for long sequences in the 4D scene understanding. Sparse temporal feature correspondences are used for key-frame detection and robust initialisation of the joint optimisation. They measure the similarity between frames and unlike optical flow are robust to large motions and visual ambiguity. To achieve robust temporal coherence in the 4D scene understanding framework for large non-rigid motion, sparse temporal feature correspondences in 3D are obtained across the sequence.

The temporal neighbourhood is defined for each frame between its respective key-frames. Sparse temporal correspondence tracks define the temporal neighbourhood \(\psi _T = \left\{ \left( p,q \right) \mid q = p + e_{i,j} \right\} \); where \(j = \left\{ t-1, t+1 \right\} \), \(e_{i,j}\) is the displacement vector from image i to j, p and q are pixels in the image.
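For illustration, a minimal sketch of how the spatial and temporal neighbourhoods could be enumerated is given below; the helper names are hypothetical and boundary handling is simplified.

```python
def spatial_neighbourhood(height, width):
    """psi_S: yield 8-connected pixel pairs (p, q) within a single frame."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for y in range(height):
        for x in range(width):
            for dy, dx in offsets:
                qy, qx = y + dy, x + dx
                if 0 <= qy < height and 0 <= qx < width:
                    yield (y, x), (qy, qx)

def temporal_neighbourhood(pixels, displacements):
    """psi_T: pair each pixel p with q = p + e_{i,j} in the adjacent frame,
    where e_{i,j} comes from the sparse temporal correspondence tracks."""
    return [((y, x), (y + int(dy), x + int(dx)))
            for (y, x), (dy, dx) in zip(pixels, displacements)]
```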

3.3.2 Joint Optimisation

The goal of the joint estimation is to refine initial semantic instance segmentation and reconstruction by assigning a label from a set of classes obtained from initial semantic instance segmentation \({\mathscr {L}} = \left\{ l_{1},...,l_{\left| {\mathscr {L}} \right| } \right\} \) (\(\left| {\mathscr {L}} \right| \) is the total number of classes), a depth value from a set of depth values \({\mathscr {D}} = \left\{ d_{1},...,d_{\left| {\mathscr {D}} \right| -1}, {\mathscr {U}} \right\} \) (each depth value is sampled on the ray from camera and \({\mathscr {U}}\) is an unknown depth value to handle occlusions), and a motion flow field \({\mathscr {M}} = \left\{ m_{1},...,m_{\left| {\mathscr {M}} \right| } \right\} \) simultaneously for the region \({\mathscr {R}}\) of each object per view. \(\left| {\mathscr {M}} \right| \) is the set of pre-defined discrete flow-fields for pixel \(p = (x, y)\) in image I by \(m = (\delta x, \delta y)\) in time and for each view.

Joint semantic instance segmentation, reconstruction and motion estimation is achieved by global optimisation of a cost function over unary \(E_{unary}\) and pairwise \(E_{pair}\) terms, defined as:

$$\begin{aligned}&E(l,d,m) = E_{unary}(l,d,m) + E_{pair}(l,d,m)\nonumber \\&E_{unary} = \lambda _{d}E_{d}(d) + \lambda _{a}E_{a}(l) + \lambda _{sem}E_{sem}(l) + \lambda _{f}E_{f}(m)\nonumber \\&E_{pair} = \lambda _{s}E_{s}(l,d) + \lambda _{c}E_{c}(l) + \lambda _{r}E_{r}(l,m) + \lambda _{p}E_{p}(l,d,m)\nonumber \\ \end{aligned}$$
(2)

where d is the depth, l is the class label, and m is the motion at pixel p. Novel terms are introduced for flow \(E_{f}\), motion regularisation \(E_{r}\) and human-pose \(E_{p}\) costs, explained in Sects. 3.3.4 and 3.3.3 respectively. Results of the joint optimisation with and without pose (\(E_{p}\)) and motion (\(E_{f}\), \(E_{r}\)) information are presented in Fig. 4, showing the improvement in results. Ablation analysis of the individual costs in Sect. 4 demonstrates the improvement in performance from the novel introduction of motion and pose constraints in the joint optimisation. Standard unary terms are used for the depth (\(E_{d}\)), semantic (\(E_{sem}\)) and appearance (\(E_{a}\)) costs, explained in Sect. 3.3.6. The standard pairwise colour-contrast term (\(E_{c}\)) assists segmentation and the smoothness cost (\(E_{s}\)) ensures that depth varies smoothly in a neighbourhood.

Fig. 4
figure 4

Comparison of reconstruction without pose and motion in the optimisation framework, proposed result is best

3.3.3 Human-Pose Constraints \(E_{p}(l,d,m)\)

We use 3D human-pose to constrain the joint optimisation and improve the flow, reconstruction and instance segmentation, in both 2D and 3D, for dynamic scenes with multiple interacting people (see Fig. 1). 3D human-pose is used as it is consistent across multiple views, unlike 2D human-pose. A state-of-the-art method for 3D human-pose estimation from multiple cameras (Tomè et al. 2018) is used in the paper. Previous work on 3D pose estimation (Tome et al. 2017) iteratively builds a 3D model of human-pose consistent with 2D estimates of joint locations and prior knowledge of natural body pose. In Tomè et al. (2018), multiple cameras are used when estimating the 3D model; this then feeds back into new estimates of the 2D joint locations in each image. This approach allows us to take full advantage of 3D pose estimates that are consistent across all cameras when finding fine-grained 2D correspondences between images, leading to more lifelike human reconstructions.

Initial semantic reconstruction is updated if the 3D pose of the person lies outside the region \({\mathscr {R}}\) by dilating the boundary to include the missing joints. This allows for more robust and complete reconstruction and segmentation. We use a standard set of 17 joints (Tomè et al. 2018) defined as \({\mathscr {B}}\). A circle \({\mathscr {C}}_{i}\) is placed around the joint position in 2D and a sphere \({\mathscr {S}}_{i}\) is placed around the joint position in 3D, based on the confidence map, to identify the nearest-neighbour vertices for every joint \(b_i\).

$$\begin{aligned} E_{p}(l,d,m) = \sum _{b_{i} \in {\mathscr {B}}} \lambda _{2d} e_{2d}(l,m) + \lambda _{3d} e_{3d}(d) \end{aligned}$$
(3)
$$\begin{aligned} e_{2d}(l,m)= & {} e_{2d}^{L}(l) + e_{2d}^{S}(l) + e_{2d}^{M}(m) \\ e_{3d}(d)= & {} e_{3d}^{M}(d) + e_{3d}^{S}(d) \text {, if }d_{p}\ne {\mathscr {U}} \text { else } 0 \end{aligned}$$

where \(e_{2d}\) enforces pose constraints in the 2D domain, \(e_{3d}\) enforces pose constraints in the 3D domain, and \(\lambda _{2d}\) and \(\lambda _{3d}\) are weighting terms. \(e_{2d}\) comprises semantic \(e_{2d}^{S}\), motion \(e_{2d}^{M}\) and segmentation \(e_{2d}^{L}\) constraints, and \(e_{3d}\) comprises motion \(e_{3d}^{M}\) and shape \(e_{3d}^{S}\) constraints.

3D shape term This term constrains the reconstruction in 3D such that the neighbourhood points around the joints do not move far from the respective joints, and is defined as:

$$\begin{aligned} e_{3d}^{S}(d) = \exp ({ - \frac{1}{\left| \sigma _{S_{D}} \right| } \sum _{\Phi (p) \in {\mathscr {S}}_{i}} \left\| O \right\| _{F} ^{2} }) \end{aligned}$$

where \(\Phi (p)\) is the 3D projection of pixel p. The Frobenius norm \(\left\| O \right\| _{F} = \left\| \begin{bmatrix} \Phi (p)&b_{i} \end{bmatrix} \right\| _{F}\) is applied on the 3D points in all directions to obtain the ‘net’ motion at each pixel within \({\mathscr {S}}_{i}\) (sphere around the joint position in 3D) and \(\sigma _{S_{D}} = \left\langle \frac{\left\| O \right\| _{F}^{2}}{\vartheta _{\Phi (p),b_i}}\right\rangle \), with the operator \(\big \langle \big \rangle \) denoting the mean computed in \({\mathscr {S}}_{i}\).

3D motion term This enforces as rigid as possible (Sorkine and Alexa 2007) constraints on 3D points in the neighbourhood of each joint \(b_{i}\) in space and time. An optimal rotation matrix \(R_{i}\) is estimated for each \(b_{i}\) by minimising the energy defined as:

$$\begin{aligned} e_{3d}^{M}(d)&= \sum _{\Phi (p) \in {\mathscr {S}}_{i}} \bigg \Vert \left( b_{i}^{t+1} - \Phi (p)^{t+1} \right) \\&\quad - R_{i} \left( b_{i}^{t} - \Phi (p)^{t} \right) \bigg \Vert _{2}^{2} + \lambda _{3d}^{p} \left\| p - e_{3d}^{M} \right\| _{2}^{2} \end{aligned}$$

\(\lambda _{3d}^{p}\) is a weighting constant. This term ensures that each joint does not move too far away from its original position.
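A minimal sketch of estimating the optimal per-joint rotation \(R_{i}\) with the standard SVD (Kabsch) solution is shown below, assuming the neighbourhood points \(\Phi (p)\) are already paired across frames t and t+1; this illustrates the as-rigid-as-possible alignment step and is not the authors' implementation.

```python
import numpy as np

def optimal_joint_rotation(joint_t, points_t, joint_t1, points_t1):
    """Least-squares rotation aligning the offsets (b_i - Phi(p)) of the
    neighbourhood S_i from frame t to frame t+1 (Kabsch/SVD solution).
    joint_*: (3,) joint positions, points_*: (N, 3) paired neighbourhood points."""
    X = joint_t - points_t          # offsets at frame t
    Y = joint_t1 - points_t1        # offsets at frame t+1
    U, _, Vt = np.linalg.svd(X.T @ Y)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def arap_residual(joint_t, points_t, joint_t1, points_t1):
    """Rigidity residual of the 3D motion term for one joint (lambda penalty omitted)."""
    R = optimal_joint_rotation(joint_t, points_t, joint_t1, points_t1)
    diff = (joint_t1 - points_t1) - (joint_t - points_t) @ R.T
    return float((diff ** 2).sum())
```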

2D term 3D poses are back-projected in each view to constrain per view appearance (\(e_{2d}^{L}\)), semantic segmentation (\(e_{2d}^{S}\)) and motion estimation (\(e_{2d}^{M}\)) in 2D. If \(p \in {\mathscr {C}}_{i}\),

$$\begin{aligned} e_{2d}^{L}(l)= & {} \exp \left( -\sum _{p \in \psi _S} \sum _{p \in \psi _T} \frac{\left\| I(\Pi (b_i)) - I(p) \right\| ^{2}}{\left| \sigma _{S_{L}} \right| }\right) \\ e_{2d}^{S}(l)= & {} \exp \left( -\sum _{p \in \psi _S} \sum _{p \in \psi _T} \frac{\left\| \Pi (b_i) - p \right\| ^{2}}{\left| \sigma _{S_{S}} \right| }\right) \\ e_{2d}^{M}(m)= & {} \exp \left( -\sum _{p \in \psi _S} \sum _{k \in \psi _T} \frac{\left\| \vartheta _{p,\Pi (b^{k}_{i})} - \vartheta _{p+m_{p},\Pi (b^{k+1}_{i})} \right\| ^{2}}{\left| \sigma _{S_{M}} \right| } \right) \end{aligned}$$

where \(\Pi \) is the back-projection of 3D poses to 2D, \(N_{pose}\) is the number of nearest neighbours, \(\sigma _{S_{L}} = \left\langle \frac{\left\| \Pi (b_i) - q \right\| ^{2}}{\vartheta _{\Pi (b_i),q}}\right\rangle \) with \(\sigma _{S_{S}}\) and \(\sigma _{S_{M}}\) defined similarly, and \(\vartheta _{\Pi (b_i),q}\) is the Euclidean distance between pixels \(\Pi (b_i)\) and q; \(\vartheta _{p,\Pi (b^{k}_{i})}\) and \(\vartheta _{p+m_{p},\Pi (b^{k+1}_{i})}\) denote Euclidean distances between pixels analogously. \(e_{2d}^{L}(l)\) and \(e_{2d}^{S}(l)\) ensure that the pixels around the projected 3D pose \(\Pi (b_{i})\) have the same semantic label and appearance across views (\(\psi _S\)) and time (\(\psi _T\)), thereby ensuring spatio-temporal appearance and semantic consistency respectively.

3.3.4 Motion Constraints- \(E_{f}(m) \text { and } E_{r}(l,m)\)

Flow term This term is obtained by integrating the sum of three penalisers over the reference image domain, inspired by Tao et al. (2012), and is defined as:

$$\begin{aligned} E_{f}({p,m_p}) = e_{F}^{T}({p,m_p}) + e_{F}^{V}({p,m_p}) + e_{F}^{S}({p,m_p}) \end{aligned}$$

where, \(e_{F}^{T}({p,m_p}) =\sum _{i = 1}^{N_{v}} \Vert (I_{i}(p,t) - I_{i}(p+ m_p, t+1)) \Vert ^{2}\) penalises deviation from the brightness constancy assumption in a temporal neighbourhood for the same view;

\(e_{F}^{V}({p,m_p}) =\sum _{t \in \psi _T} \sum _{i = 2}^{N_{v}} \left\| (I_{1}(p,t) - I_{i}(p+ m_p,t)) \right\| ^{2}\) penalises deviation from the brightness constancy assumption between the reference view and the other views over the temporal neighbourhood; and \( e_{F}^{S}({p,m_p}) = 0 \text { if } p \in N \text { otherwise } \infty \), which forces the flow to be close to nearby sparse temporal correspondences. \(I_{i}(p,t)\) is the intensity at point p at time t in camera i. The flow vector m is located within a window around a sparse constraint at p, and this forces the flow to approximate the sparse 2D temporal correspondences.
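A minimal sketch of evaluating this flow unary for one pixel and one candidate flow is given below; the greyscale inputs, the clipping-based boundary handling and the single sparse constraint per pixel are simplifying assumptions.

```python
import numpy as np

def flow_data_term(frames_t, frames_t1, p, m, sparse_flow=None, window=3):
    """E_f at pixel p = (y, x) for candidate flow m = (dy, dx).
    frames_t / frames_t1: lists of greyscale images (one per view, same size)
    at times t and t+1, with view 0 the reference view."""
    y, x = p
    dy, dx = m
    h, w = frames_t[0].shape
    qy, qx = min(max(y + dy, 0), h - 1), min(max(x + dx, 0), w - 1)
    # e_F^T: brightness constancy over time within each view
    e_T = sum((float(I0[y, x]) - float(I1[qy, qx])) ** 2
              for I0, I1 in zip(frames_t, frames_t1))
    # e_F^V: brightness constancy between the reference view and the other views
    e_V = sum((float(frames_t[0][y, x]) - float(I[qy, qx])) ** 2
              for I in frames_t[1:])
    # e_F^S: flow must stay within a window of a nearby sparse correspondence
    if sparse_flow is not None and max(abs(dy - sparse_flow[0]),
                                       abs(dx - sparse_flow[1])) > window:
        return np.inf
    return e_T + e_V
```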

Motion regularisation term This penalises the absolute difference of the flow field to enforce motion smoothness and handle occlusions in areas with low confidence (Tao et al. 2012).

$$\begin{aligned} E_{r}({l,m})= & {} \sum _{p,q \in N_{p}} \left\| \Delta m \right\| ^{2} \lambda _{r}^{L} e_{r}^{L} (p, q, m_{p}, m_{q}, l_{p}, l_{q}) + \\&\lambda _{r}^{A} e_{r}^{A} (p, q, m_{p}, m_{q}, l_{p}, l_{q}) \end{aligned}$$

where \(\Delta m = m_{p} - m_{q}\) and;

$$\begin{aligned} e_{r}^{X} = {\left\{ \begin{array}{ll} \underset{q \in N_{p}}{\text {mean}} \; E_{X}({q,m_q}) - \underset{q \in N_{p}}{\min } \; E_{X}({q,m_q}),&{} \text {if } l_{p} = l_{q} \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

We compute \(e_{r}^{L}\) (semantic regularisation) and \(e_{r}^{A}\) (appearance regularisation) as the minimum subtracted from the mean energy within the neighbourhood search window \(N_{p}\) for each pixel p. \(\lambda _{r}^{L}\) and \(\lambda _{r}^{A}\) are constants, computed empirically.

The motion term in the proposed framework is not tailored to human motion. Results are shown for human motion because of its higher complexity; simpler motions, such as the linear motion of rigid objects (e.g. cars), are easily handled.

3.3.5 Long-term Temporal Coherence

Sparse temporal correspondences The sparse 3D points projected in all views are matched between frames \(N_{f}^{i}\) and key-frames across the sequence using nearest-neighbour matching (Mustafa et al. 2019), followed by a symmetry test which enforces forward and backward match consistency through two-way matching to remove inconsistent correspondences. This gives sparse temporal feature correspondence tracks per frame for each object: \(F^{c}_{i} = \{ {f^{c}_{1}, f^{c}_{2}, \ldots , f^{c}_{R_{i}^{c}}}\}\), where \(c={1 \text { to } N_{v}}\) and \(R_{i}^{c}\) is the number of 3D points visible at frame i in view c. Exhaustive matching is performed, such that each frame is matched to every other frame, to handle appearance, reappearance and disappearance of points between frames.
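A minimal sketch of the two-way (symmetric) matching step using OpenCV's brute-force matcher is shown below; the descriptor type and data layout are assumptions, and the SFD feature detection itself is omitted.

```python
import numpy as np
import cv2

def symmetric_matches(desc_frame, desc_keyframe):
    """Two-way brute-force matching: crossCheck=True keeps only matches that agree
    in both the forward and backward direction, mirroring the symmetry test used
    to reject inconsistent sparse temporal correspondences.
    desc_*: (N, D) float descriptor arrays for the two frames."""
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(desc_frame.astype(np.float32),
                            desc_keyframe.astype(np.float32))
    # Each match pairs feature queryIdx in the frame with trainIdx in the key-frame.
    return [(m.queryIdx, m.trainIdx)
            for m in sorted(matches, key=lambda m: m.distance)]
```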

Key-frame detection Features \(F^{c}_{i}\) at frame i of view c are matched to features at frames \(j = \{ {i+1, \ldots , N_{f}^{i}}\}\) of the same view to give correspondences between all frames \(N_{f}^{i}\) and key-frame \(K_{i}\). The corresponding joint locations from the 3D pose are back-projected in each view and added to the sparse temporal tracks between key-frames. Any new point tracks are added to the list of point tracks for key-frame \(K_{i}\). More details on key-frame detection are provided in Sect. 3.2.

3.3.6 Unary Terms - \(E_{unary}(l,d,m)\)

Depth term This gives a measure of photo-consistency between views \(E_{d}(d) = \sum _{p\in \psi _S} e_{d}(p, d_{p})\), defined as:

$$\begin{aligned} e_{d}(p, d_{p}) = {\left\{ \begin{array}{ll} M(p, q) = \sum _{i \in {\mathscr {O}}_{k}}m(p,q) ,&{} \text {if } d_{p}\ne {\mathscr {U}}\\ M_{{\mathscr {U}}}, &{} \text {if } d_{p} = {\mathscr {U}} \end{array}\right. } \end{aligned}$$

where \(M_{{\mathscr {U}}}\) is the fixed cost of labelling a pixel unknown and q denotes the projection of the hypothesised point P (the 3D point along the optical ray passing through pixel p located at a distance \(d_{p}\) from the camera) in an auxiliary camera. \({\mathscr {O}}_{k}\) is the set of the k most photo-consistent pairs with the reference camera, and m(p, q) is inspired by Mustafa et al. (2016).

Appearance term This term is computed using the negative log likelihood (Boykov and Kolmogorov 2004) of the colour models (GMMs with 10 components) learned from the initial semantic mask in the temporal neighbourhood \(\psi _T\) and the foreground markers obtained from the sparse 3D features for the dynamic objects. It is defined as:

$$\begin{aligned} E_{a}(l) = \sum _{p \in \psi _T} \sum _{p\in \psi _S} -\log P(I_{p}|l_{p}) \end{aligned}$$

where \(P(I_{p}|l_{p} = l_i)\) denotes the probability of pixel p belonging to layer \(l_i\).
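A minimal sketch of this appearance unary is given below, using scikit-learn's GaussianMixture as a stand-in for whichever GMM implementation is used; fitting only on the foreground pixels of the initial mask is an assumption of the sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def appearance_unary(image, mask, n_components=10):
    """Negative log-likelihood of each pixel under a colour GMM learned from the
    initial semantic mask. image: (H, W, 3) array, mask: (H, W) boolean array."""
    fg_colours = image[mask].astype(np.float64)          # (K, 3) foreground colours
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(fg_colours)
    log_lik = gmm.score_samples(image.reshape(-1, 3).astype(np.float64))
    return (-log_lik).reshape(image.shape[:2])           # per-pixel E_a contribution
```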

Semantic term This term is based on the probability of the class labels at each pixel based on Chen et al. (2016), defined as:

$$\begin{aligned} E_{sem}(l) = \sum _{p \in \psi _T} \sum _{p\in \psi _S} -\log P_{sem}(I_{p}|l_{p}) \end{aligned}$$

where \(P_{sem}(I_{p}|l_{p} = l_i)\) denotes the probability of pixel p being in layer \(l_i\) in the reference image obtained from initial semantic instance segmentation (He et al. 2017).

3.3.7 Pairwise Terms - \(E_{pair}(l,d,m)\)

There are two pairwise terms in the joint per-view optimization - smoothness and contrast. These terms are inspired from Guillemaut and Hilton (2010), which includes a proof as to how these pairwise terms satisfy the regularity condition required for graph-cut optimisation via alpha-expansion (Boykov and Kolmogorov 2004).

Smoothness term This term ensures that depth labels vary smoothly within a neighbourhood and is defined as:

$$\begin{aligned} E_{s}(l,d) = \lambda _{s}^{t} \sum _{p,q \in \psi _T} e_{s}(l_p,d_{p},l_q,d_{q},d_{max}^{t}) + \\ \lambda _{s}^{s} \sum _{p,q \in \psi _S} e_{s}(l_p,d_{p},l_q,d_{q},d_{max}^{s})\\ e_{s}(l_p,d_{p},l_q,d_{q},d_{max}) = \\ {\left\{ \begin{array}{ll} \min (\left| d_{p} - d_{q} \right| , d_{max}),&{} \text {if } l_{p} = l_{q} \text { and } d_{p},d_{q}\ne {\mathscr {U}}\\ 0, &{} \text {if } l_{p} = l_{q} \text { and } d_{p},d_{q} = {\mathscr {U}}\\ d_{max}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

where, \(d_{max}^{s}\) avoids over-penalising large discontinuities for spatial smoothness and is set to 50 times the size of the depth sampling step. \(d_{max}^{t}\) ensures smoothness in time over the temporal neighbourhood and is twice the value of \(d_{max}^{s}\) to allow large movement in the object.

Contrast term This term is defined as:

$$\begin{aligned} E_{c}(l)= & {} \sum _{p,q \in \psi _T} e_{c}(p,q,l_p,l_q,\sigma _{\alpha }^{t},\vartheta _{p,q}^{t},\sigma _{\beta }^{t}) + \\&\sum _{p,q \in \psi _S} e_{c}(p,q,l_p,l_q,\sigma _{\alpha }^{s},\vartheta _{p,q}^{s},\sigma _{\beta }^{s}) \\&e_{c}(p,q,l_p,l_q,\sigma _{\alpha },\vartheta ,\sigma _{\beta }) = \mu \left( l_p,l_q \right) \times \\&\left( \lambda _{ca} e^{- (\frac{\left\| B(p) - B(q) \right\| ^{2}}{2 \left( \sigma _{\alpha } \right) ^{2}\left( \vartheta \right) ^{2} } )} + \lambda _{cl} e^{- (\frac{\left\| L(p) - L(q) \right\| ^{2}}{2 \left( \sigma _{\beta } \right) ^{2}}) } \right) \end{aligned}$$

where \( \mu \left( l_p,l_q \right) = 1 \text { if } (l_{p} = l_{q}) \text { otherwise } 0\) and \(\vartheta _{p,q}\) is the Euclidean distance between p and q. The ‘bilateral’ kernel B forces pixels with similar colour and position to have similar labels and the Gaussian kernel L enforces spatial smoothness, with \(\sigma _{\alpha } = \left\langle \frac{\left\| B(p) - B(q)\right\| ^{2}}{\vartheta _{p,q}^{2}}\right\rangle \) and \(\sigma _{\beta }\) controlling the scale of these kernels, where the operator \( \big \langle \big \rangle \) denotes the mean computed across the neighbourhoods \(\psi _S\) and \(\psi _T\) for spatial and temporal contrast respectively.

The proposed joint optimisation is inspired by previous work (Guillemaut and Hilton 2010) which performs joint segmentation and reconstruction to achieve a globally consistent solution, performing the joint optimisation per-view and initialising the reconstruction with a reliable visual hull obtained from per-view segmentations taken as input. In the proposed method we obtain a globally consistent solution by performing joint per-view optimisation on a reliable initial coarse reconstruction, obtained by combining semantic instance segmentation with sparse reconstruction. Global optimisation of Equation 2 is performed per-view over all terms simultaneously, for each pixel p in the region \(\mathscr {R_{i}}\), using the \(\alpha \)-expansion algorithm (Boykov et al. 2001) by iterating through the set of labels in \({\mathscr {L}} \times {\mathscr {D}} \times {\mathscr {M}}\). Each label set \({\mathscr {L}}, {\mathscr {D}}, {\mathscr {M}}\) is initialised as follows: \({\mathscr {L}}\) is initialised using the initial semantic segmentation obtained in Sect. 3.1; \({\mathscr {D}}\) is initialised using the depth of the initial coarse reconstruction estimate \({\mathscr {R}}\), such that each \(d_{i}\) is obtained by sampling the optical ray from the camera within the region \({\mathscr {R}}\) (the ray is sampled by a factor of 50, as in Mustafa et al. (2016)); and \({\mathscr {M}}\) is initialised using discrete flow fields as in Tao et al. (2012) and Menze et al. (2015). Each iteration is solved by graph-cut using the min-cut/max-flow algorithm (Boykov and Kolmogorov 2004). Convergence is achieved in 7-8 iterations.
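As an illustration of the depth label initialisation, the sketch below samples candidate depths along the optical ray through a pixel within the depth range of the coarse reconstruction and appends the unknown label \({\mathscr {U}}\); interpreting "sampled by a factor of 50" as 50 uniform samples, and the pinhole back-projection, are assumptions.

```python
import numpy as np

UNKNOWN = np.inf   # stands in for the unknown depth label U (handles occlusions)

def depth_hypotheses(d_near, d_far, n_samples=50):
    """Candidate depth labels along the optical ray through a pixel, bounded by
    the depth range of the coarse reconstruction R; the unknown label is appended."""
    return np.append(np.linspace(d_near, d_far, n_samples), UNKNOWN)

def backproject(pixel, depth, K):
    """3D point at z-depth 'depth' along the ray through 'pixel' = (x, y) for
    intrinsic matrix K (camera coordinates)."""
    x, y = pixel
    ray = np.linalg.inv(K) @ np.array([x, y, 1.0])
    return depth * ray / ray[2]
```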

3.4 4D Scene Understanding

The final 4D scene model fuses the semantic instance segmentation, depth information and dense flow across views and in time between frames (\(N_{f}^{i}\)) and key-frames (\(K_{i}\)). The initial instance segmentation, human pose and motion information for each object is combined to obtain the final instance segmentation of the scene. The per-view depth maps obtained by optimising Equation 2 for each camera view are combined across views using Poisson surface reconstruction (Kazhdan et al. 2006) to obtain a mesh for each object in the scene. For sports sequences with large calibration errors (1-2 pixels), each view-dependent 2.5D foreground scene representation is converted into a regular mesh with vertices defined by image pixel locations. Vertex connectivity is decided based on the layer segmentation and on thresholding of the angle separating the line segment connecting 3D surface points, defined by pairs of neighbouring pixels, and the optical ray passing through the midpoint of the pixel pair (a threshold of \(80^{\circ }\) is used). This allows pixels belonging to different layers or located at a depth discontinuity to be correctly converted into separate mesh components.

The 3D meshes for each object per frame are combined with per-view motion estimates obtained by optimizing Equation 2 to get 4D temporally coherent meshes for each person in the scene. The most consistent motion information from all views for each 3D point is used to estimate correspondences between two frames. This is combined with spatial semantic instance information to give per-pixel semantic and temporal coherence. Appearing, disappearing, and reappearing regions are handled by using the sparse temporal tracks and their respective motion estimate. The dense flow and semantic instance segmentation together with 3D models of each object in the scene gives the final 4D understanding of the scenes. Examples are shown in Figs. 1 and 5 on two datasets, where objects are coloured in one key-frame and colours are propagated reliably between frames and key-frames across the sequence for robust 4D scene modelling.

Fig. 5
figure 5

Example of 4D scene reconstruction for one indoor and one outdoor dataset. The first frame is uniquely coloured for each dataset and the colours are propagated using proposed motion estimation

Fig. 6
figure 6

Proposed semantic instance segmentation for Juggler2 outdoor dataset with three people

Table 4 Properties of all datasets: \(N_{v}\) is the number of views, L is the sequence length, KF gives number of key-frames, and Tracks gives the number of sparse temporal correspondence tracks averaged over the entire sequence for each object (S stands for static cameras and M for moving cameras)

The proposed method handles multiple people appearing, disappearing and re-appearing in the scene. The method labels and tracks all static and dynamic objects in the scene. Multiple people and objects are identified using the initial semantic instance segmentation together with the clustering of the sparse reconstruction at each time frame. Object tracking and re-appearance is handled using the sparse temporal feature tracks and the proposed dense flow. Exhaustive matching between all frames enables object re-identification. The pose constraints are only used for the human class; for other classes \(E - E_{p}\) is minimised, allowing the method to work with different objects. An example is shown for the Juggler2 dataset in Fig. 6, with three people and an object entering the scene.

Table 5 Illustration of tasks performed by state-of-the-art methods which are evaluated in Sect. 4 for segmentation, reconstruction and motion estimation against the proposed method

4 Results and Evaluation

Joint semantic instance segmentation, reconstruction and flow estimation (Sect. 3) is evaluated quantitatively and qualitatively against 15 state-of-the-art methods on a variety of publicly available multi-view indoor and outdoor dynamic scene datasets, detailed in Table 4. The Juggler2 and Magician datasets are synchronised using audio information and the rest of the datasets are synchronised using a time code generator. A list of tasks performed by each state-of-the-art method is given in Table 5.

Algorithm parameters listed in Table 6 are the same for all outdoor datasets, and for indoor datasets the parameters depend on the number of cameras (\(N_{v}\)). Pairwise costs are constant, \(\lambda _{p} = 0.9\), \(\lambda _{c}=\lambda _{s}=\lambda _{r}=0.5\), for all datasets. The parameters defined in Table 6 cover all dataset configurations (indoor, outdoor, different numbers of views), and changes in the parameters do not drastically affect performance. We used the indoor parameters (row 2 of the table) for the outdoor dataset Juggler2; this reduces reconstruction performance by \(3\%\), segmentation by \(4\%\) and motion performance by \(2\%\).

Due to the low resolution of objects in the sports dataset (people are only 30-70 pixels in height) and the high calibration errors (1-2 pixels), the parameters above could not be used in the proposed framework. The pairwise costs are as follows: \(\lambda _{p} = 2\), \(\lambda _{c}=\lambda _{s}=\lambda _{r}=1.1\), and the unary costs are shown in the bottom row of Table 6.

Table 6 Parameters for all datasets. I is Indoor

4.1 Reconstruction Evaluation

The proposed approach is compared against state-of-the-art approaches for semantic co-segmentation and reconstruction (SCSR) Mustafa and Hilton (2017), piecewise scene flow (PRSM) Vogel et al. (2015), multi-view stereo (SMVS) Langguth et al. (2016), and deep learning based stereo (LocalStereo) Taniai et al. (2018). Since PRSM Vogel et al. (2015) and LocalStereo Taniai et al. (2018) only work for two views/stereo pairs of images, we divide the cameras into pairs and stereo is estimated for each pair.

The per-view depth maps for each camera view are combined across views using Poisson surface reconstruction (Kazhdan et al. 2006) to obtain a mesh for each object in the scene, in a similar way to the proposed method. Default parameters are used to run both of these methods. The other state-of-the-art methods, SMVS (Langguth et al. 2016) and SCSR (Mustafa and Hilton 2017), are multi-view approaches, for which code available online is used to estimate the per-frame reconstruction using default parameters. Qualitative comparison with the proposed method is shown in Fig. 7.

Table 7 Reconstruction evaluation: projection error across views against state-of-the-art methods; LS is LocalStereo. \(P_{P} = E - E_{p}\), \(P_{M} = E - E_{f} - E_{r}\), \(P_{PM} = E - E_{f} - E_{r} - E_{p}\), \(P_{S} = E - E_{sem}\) and \(P_{PS} = E - E_{sem} - E_{p}\), where E is defined in Equation 2
Fig. 7
figure 7

Reconstruction evaluation against existing methods. Two different views of the 3D model are shown for the proposed method

Pre-trained parameters were used for LocalStereo and per-view depth maps were fused using Poisson reconstruction. The quality of the surface obtained using the proposed method is improved compared to state-of-the-art methods. In contrast to previous approaches, the limbs of people are reliably reconstructed because of the exploitation of human-pose and temporal (motion) information in the joint optimisation.

For quantitative comparison to state-of-the-art methods, we project the reconstruction onto different views and compute the projection errors shown in Table 7. A significant improvement is obtained in projected surface completeness with the proposed approach. Further quantitative evaluation of the surface obtained using state-of-the-art methods is shown in Fig. 8. The reconstructions shown for Handstand are compared against the proposed method and the errors are colour coded, with red showing the maximum error.

Table 8 Segmentation comparison against state-of-the-art methods using the Intersection-over-Union metric

4.2 Segmentation Evaluation

Our approach is evaluated against a variety of state-of-the-art multi-view (SCV Tsai et al. (2016), SCSR Mustafa and Hilton (2017), and JSR Guillemaut and Hilton (2010)) and single-view (Dv3+ Chen and Zhu (2018), MRCNN He et al. (2017), PSP Zhao et al. (2017), CRF RNN Zheng et al. (2015), and Segnet Badrinarayanan et al. (2017)) segmentation methods, shown in Fig. 9. For fair evaluation against single-view semantic segmentation methods, multi-view consistency is applied to the segmentation estimated from each view to obtain multi-view consistent semantic segmentation using dense multi-view correspondence. Colours and visualisations in the results are kept as in the original papers, and default parameters are used for the state-of-the-art methods. Only MRCNN and the proposed approach give instance segmentation.

Quantitative evaluation against state-of-the-art methods is measured by Intersection-over-Union with ground-truth, shown in Table 8. Ground-truth is available online for most of the datasets and was obtained by manual labelling for the others. Pre-trained parameters were used for the semantic segmentation methods. The semantic instance segmentation results from the joint optimisation are significantly better than the state-of-the-art methods (\(\approx 20-40\%\)).

Semantic segmentation comparison results against CRF RNN (Zheng et al. 2015), Segnet (Badrinarayanan et al. 2017) and PSP (Zhao et al. 2017) are shown in Fig. 10 on four datasets. Ground-truth segmentation comparison is shown in Fig. 11 against JSR (Guillemaut and Hilton 2010) and SCSR (Mustafa and Hilton 2017). The red and green regions highlight the error: green regions are present in the segmentation but not the ground-truth, and red regions are present in the ground-truth but not the segmentation.

Fig. 8
figure 8

Comparison of reconstruction obtained using state-of-the-art methods against proposed method

Fig. 9
figure 9

Semantic segmentation comparison against state-of-the-art methods. In the proposed method shades of pink depict instances of humans and shades of yellow depict instances of cars (Color figure online)

4.3 Motion Evaluation

Flow from the joint estimation is evaluated against state-of-the-art methods: (a) dense flow algorithms DCflow (Xu et al. 2017) and Deepflow (Weinzaepfel et al. 2013); (b) the scene flow method PRSM (Vogel et al. 2015); and (c) non-sequential alignment of partial surfaces 4DMatch (Mustafa et al. 2016), which requires a prior 3D mesh of the object as input for 4D reconstruction.

Fig. 10
figure 10

Semantic segmentation comparison against state-of-the-art methods. In the proposed method shades of pink depict instances of humans and shades of yellow depict instances of cars (Color figure online)

The per-view motion estimate from the proposed method is compared with each of the state-of-the-art methods. All of the methods we compare against (DCflow, Deepflow, PRSM and 4DMatch) estimate flow for each camera, which makes for a fair comparison, and default parameters are used for all state-of-the-art methods.

The key-frames of each sequence are coloured using the unique colour scheme shown in Fig. 12 and the colour is propagated throughout the sequence using the motion estimate from the joint optimisation. With accurate motion estimates the colours should propagate reliably across the sequence. The red regions in the 2D dense flow in Fig. 12 are regions for which reliable correspondences are not found; this demonstrates the improved performance of the proposed method. In the 4D alignment in Fig. 13, DCFlow does not reliably propagate the colours for the limbs.
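A minimal sketch of this colour-propagation check, assuming a dense forward flow field from the key-frame to the target frame and an optional mask of unreliable correspondences, is given below; names and the red marking of unreliable pixels are illustrative, mirroring Fig. 12 rather than reproducing the paper's code.

```python
import numpy as np

def propagate_colours(keyframe_colours, flow, occlusion_mask=None):
    """Warp a key-frame colour coding to a later frame using dense flow.

    keyframe_colours : (H, W, 3) colour-coded key-frame image
    flow             : (H, W, 2) forward flow from the key-frame to the
                       target frame (u, v in pixels)
    occlusion_mask   : optional (H, W) bool mask of pixels without a
                       reliable correspondence; these are marked red
    """
    H, W, _ = keyframe_colours.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xt = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    yt = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)

    propagated = np.zeros_like(keyframe_colours)
    propagated[yt, xt] = keyframe_colours[ys, xs]   # forward splat of colours
    if occlusion_mask is not None:
        propagated[occlusion_mask] = (255, 0, 0)    # unreliable correspondences in red
    return propagated
```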

We also compare the silhouette overlap error (\(S_e\)) across frames, key-frames and views to evaluate long-term temporal coherence, reported in Table 9 for all datasets. It is defined as \(S_e = \frac{1}{N_{v} N_{k} N_{f}^{i}}\sum _{i = 1}^{N_{k}}\sum _{j = 1}^{N_{f}^{i}}\sum _{c = 1}^{N_{v}} \frac{\text {Area of intersection}}{\text {Area of semantic segmentation}}\). Dense flow in time is used to obtain the propagated mask for each image, and the propagated mask is overlapped with the semantic segmentation at each time instant to evaluate its accuracy; the lower the \(S_e\), the better. Our approach gives the lowest error, demonstrating higher accuracy than the state-of-the-art methods.
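A minimal sketch of the overlap term defined above, assuming binary propagated and semantic masks indexed by key-frame, frame and view, could look as follows; this is illustrative and not the paper's evaluation code.

```python
import numpy as np

def silhouette_overlap_term(propagated_masks, semantic_masks):
    """Average the per-mask overlap term across key-frames, frames and views.

    propagated_masks, semantic_masks : nested lists indexed as
        masks[k][f][v] -> (H, W) binary mask for key-frame k, frame f, view v,
    where the propagated mask is obtained by warping the key-frame mask
    forward in time with the dense flow.
    """
    total, count = 0.0, 0
    for k, frames in enumerate(propagated_masks):
        for f, views in enumerate(frames):
            for v, prop in enumerate(views):
                sem = semantic_masks[k][f][v] > 0
                prop = prop > 0
                denom = max(sem.sum(), 1)
                total += np.logical_and(prop, sem).sum() / denom
                count += 1
    return total / max(count, 1)   # average over key-frames, frames and views
```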

Fig. 11
figure 11

Ground-truth semantic segmentation comparison against state-of-the-art methods JSR and SCSR

Table 9 Silhouette overlap error (\(S_e\)) comparison against state-of-the-art methods across frames, key-frames and views for all datasets

We evaluate the temporal coherence across the Meetup sequence by measuring the variation in appearance of each scene point between frames, and between key-frames and frames, for the state-of-the-art methods. The metric is defined as \(\sqrt{\frac{\Delta r^{2} + \Delta g^{2} + \Delta b^{2}}{3}}\), where \(\Delta \) is the difference operator. The evaluation in Table 10 against state-of-the-art methods demonstrates the stability of long-term temporal tracking for the proposed method (the lower the error, the better).
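The per-point appearance metric above amounts to an RMS difference of the RGB channels between two observations of the same scene point; a minimal sketch, with illustrative names, is:

```python
import numpy as np

def rms_colour_difference(colour_a, colour_b):
    """Appearance stability metric between two RGB observations of the same
    scene point: sqrt((dr^2 + dg^2 + db^2) / 3)."""
    diff = colour_a.astype(np.float64) - colour_b.astype(np.float64)
    return np.sqrt((diff ** 2).sum(axis=-1) / 3.0)
```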

4.4 Ablation Study on Equation 2

We perform an ablation study on Equation 2 in which we remove the motion (\(E_{f}, E_{r}\)), pose (\(E_{p}\)) and semantic (\(E_{sem}\)) constraints, defining \(P_{M} = E - E_{f} - E_{r}\), \(P_{P} = E - E_{p}\), \(P_{PM} = E - E_{f} - E_{r} - E_{p}\), \(P_{S} = E - E_{sem}\) and \(P_{PS} = E - E_{sem} - E_{p}\). Reconstruction, flow and semantic segmentation are obtained with the removed constraints, and the results are shown in Tables 7, 9 and 8 respectively. The proposed approach gives the best performance with the joint pose, motion and semantic constraints.

Based on this ablation analysis, the contribution of each term is task dependent. Reconstruction: pose and semantic constraints play an equal role, and the motion term contributes less to this task. Semantics: pose constraints contribute more to performance than the motion term. Flow: pose constraints contribute more than the semantic constraints. To sum up, the motion term enables the proposed 4D flow but contributes least to overall performance, followed by the semantic, pose and then depth terms. The terms in Equation 2 were chosen based on a series of experiments to achieve optimal performance.
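To make the ablation variants explicit, the schematic sketch below shows how they can be formed by dropping terms from a sum of energy terms; the term functions and dictionary are placeholders standing in for the actual cost terms of Equation 2, not the paper's implementation.

```python
# Schematic ablation of a joint energy E = E_data + E_f + E_r + E_p + E_sem.
# The individual term functions are placeholders for the cost terms of
# Equation 2 (the real terms and weights are defined in the paper).

def total_energy(x, terms, disabled=()):
    """Sum the active energy terms, skipping any named in `disabled`."""
    return sum(fn(x) for name, fn in terms.items() if name not in disabled)

# Ablation variants used in Tables 7-9 (names follow the paper's notation).
ABLATIONS = {
    "P_M":  ("E_f", "E_r"),            # no motion terms
    "P_P":  ("E_p",),                  # no pose term
    "P_PM": ("E_f", "E_r", "E_p"),     # no motion and pose terms
    "P_S":  ("E_sem",),                # no semantic term
    "P_PS": ("E_sem", "E_p"),          # no semantic and pose terms
}
```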

Fig. 12
figure 12

Temporal coherence evaluation against existing methods

Fig. 13
figure 13

4D alignment evaluation against DCFlow (Xu et al. 2017)

Table 10 Temporal coherence evaluation for Meetup dataset against existing methods
Table 11 Silhouette overlap error for multi-view datasets for evaluation of long-term temporal coherence

The method requires pose as a prior for human reconstruction; however, Tables 7, 8 and 9 demonstrate results without pose constraints (\(P_{P}\)). The performance is reduced, but the method still performs better than other state-of-the-art approaches.

Fig. 14
figure 14

Semantic instance segmentation on two sports sequences with multiple people

Table 12 Comparison of computational efficiency for a few dynamic sequences against state-of-the-art methods (time in seconds)
Fig. 15
figure 15

4D Reconstruction demonstrated on the Wembley dataset. The top row shows different camera views for frame 105 and the bottom rows show the reconstruction from 3 randomly picked viewpoints

Fig. 16
figure 16

4D Reconstruction demonstrated on the Soccer dataset for frames 40, 50 and 60. Unique colours are assigned to the reconstruction of each player in the cropped image for frame 1 and the colours are reliably propagated across the sequence

4.5 Ablation Study Without Key-Frame Detection

The higher the number of key-frames, the better the quality of the alignment. However, if no key-frames are detected for a sequence, the performance of the 4D long-term scene flow degrades. To evaluate the effect of key-frame detection, we compare the performance of 4D scene flow for the proposed joint optimization with and without key-frames in Table 11. The results show an \(\approx 15\%\) improvement in scene flow with key-frame detection.

4.6 Computation Time Comparison

Computation times for the proposed approach vs other methods that perform joint estimation are presented in Table 12. The proposed approach to reconstruct temporally coherent 4D models is comparable in computation time to per-frame multiple view reconstruction and gives a \(\sim \)50% reduction in computation cost compared to previous joint segmentation and reconstruction approaches using a known background. This efficiency is achieved through improved per-frame initialisation based on temporal propagation and the introduction of the geodesic star constraint in joint optimisation.

4.7 Results on Sports Data in the Wild

Qualitative results using the proposed 4D scene understanding framework on three sports sequences with multiple people, Football, Wembley and Soccer, are shown in Figs. 14, 15 and 16. The Football sequence is synthetic; Wembley and Soccer are real datasets. Properties of the sports datasets are listed in Table 4.

Table 13 Quantitative comparison of segmentation, reconstruction and motion estimation obtained from proposed 4D scene understanding framework on sports datasets

The proposed method obtains robust semantic instance segmentation on the Football and Wembley sequences, demonstrating its generalization capability to in-the-wild datasets, as illustrated in Fig. 14. The 4D reconstruction obtained with the proposed 4D scene understanding framework is shown in Fig. 15 from different viewpoints, demonstrating the applicability of the automatic scene understanding framework to in-the-wild data. The proposed long-term temporal coherence is shown in Fig. 16: each player is uniquely coloured and the colours are propagated using the motion estimate from the 4D scene understanding framework. In spite of the low resolution (\(\approx 12\times 30\) px) of each player, the colours are reliably propagated across the sequence.

For comparative evaluation on the sports datasets, the majority of state-of-the-art methods are unable to obtain a reliable segmentation, mesh or flow for the players. Hence we evaluate against the subset of methods that produce usable results; a comparative evaluation of segmentation, reconstruction and motion estimation is shown in Table 13. The results demonstrate that the proposed method achieves a significant improvement over existing methods for all tasks.

4.8 Complexity Analysis on Synthetic Multi-View Data

To evaluate how the proposed method works in crowded scenes, we use a multi-view synthetic dataset with 2–10 people in the scene at each time instant (Caliskan et al. 2020). Qualitative results are shown in Fig. 17.

We also perform a complexity analysis of the proposed method through quantitative evaluation on the synthetic dataset. The number of people in the scene is increased, and reconstruction and segmentation accuracy is calculated for different numbers of people, as seen in Table 14. The reconstruction and segmentation accuracy decreases slightly as the number of people increases, due to increased occlusion and clutter in the scene.

Fig. 17
figure 17

Reconstruction Results on the Multi-view Synthetic Dataset

Table 14 Quantitative comparison of segmentation and reconstruction accuracy on multi-view synthetic multiple people dataset

4.9 Limitations

Gross errors in the initial semantic instance segmentation and 3D pose estimation lead to degradation in the quality of the results (e.g. the cars in Juggler2, Fig. 9).

Small errors in the semantic segmentation, initial coarse reconstruction and 3D pose are handled gracefully by the proposed method. This is evident from Fig. 9, where the initial semantic segmentation from MRCNN has small errors and the final result of the proposed method improves the semantic segmentation. However, large errors in the semantic segmentation over successive frames propagate through the optimization, which reduces the quality of the final segmentation and reconstruction, as seen in Fig. 16. The final reconstructions look incomplete because of large errors in the initial semantic segmentation.

Similarly, errors in the initial coarse reconstruction, for example a missing human limb, lead to an incomplete final reconstruction. Gross errors in 3D pose estimation for crowded scenes may lead to errors in the final reconstruction. Failure or errors in key-frame detection only slightly degrade the quality of the flow estimate from the proposed approach. Although 3D human pose helps in robust 4D reconstruction of interacting people in dynamic scenes, current 3D pose estimation is unreliable in highly crowded environments, resulting in degradation of the proposed approach.

5 Conclusions

This paper introduced the first automatic method for 4D dynamic scene understanding of multiple interacting people from multi-view video that does not need any ground-truth data for training or manual intervention. A novel joint flow, reconstruction and semantic instance segmentation estimation framework is introduced, exploiting 2D/3D human-pose, motion, semantic, shape and appearance information in space and time. An ablation study on the joint optimisation demonstrates the effectiveness of the proposed scene understanding framework for general scenes with multiple interacting people. The semantic, motion and depth information per view is fused spatially across views for 4D semantically and temporally coherent scene understanding. A fully automatic system is presented for multiple view semantic instance segmentation, 4D reconstruction and motion estimation from moving broadcast cameras to allow full 4D scene understanding of in-the-wild sports data such as soccer. Extensive evaluation against state-of-the-art methods on a variety of complex indoor and outdoor datasets with large non-rigid deformations demonstrates a significant improvement in the accuracy of semantic segmentation, reconstruction, motion estimation and 4D alignment.