Semantically Coherent 4D Scene Flow of Dynamic Scenes

We propose simultaneous, semantically coherent, object-based long-term 4D scene flow estimation, co-segmentation and reconstruction, exploiting the coherence of semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. In this paper we propose a framework for spatially and temporally coherent semantic 4D scene flow of general dynamic scenes from multiple-view videos captured with a network of static or moving cameras. Semantic coherence results in improved 4D scene flow estimation, segmentation and reconstruction for complex dynamic scenes. Semantic tracklets are introduced to robustly initialize the scene flow in the joint estimation and to enforce temporal coherence in 4D flow, semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of long-term flow, appearance and shape priors that are exploited in semantically coherent 4D scene flow estimation, co-segmentation and reconstruction. Comprehensive performance evaluation against state-of-the-art techniques on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in 4D scene flow, segmentation, temporally coherent semantic labelling, and reconstruction of dynamic scenes.



Introduction
Advances in visual scene understanding using deep learning, with convolutional neural network architectures and large annotated image collections (Chen et al. 2016; Xie et al. 2016; Luo et al. 2015), have achieved excellent performance in per-pixel labelling of semantic categories in real-world scenes from images.
These advances in semantic segmentation have been exploited to improve scene flow estimation between pairs of frames for dynamic scenes (Behl et al. 2017). However, semantic segmentation from a single view suffers from errors due to inherent visual ambiguity, which leads to errors in flow estimation at object boundaries and in regions of uniform appearance. Errors may also be introduced in scene flow estimated between pairs of frames due to large non-rigid motions and self-occlusions in dynamic sequences. In the case of multiple views, independent classification of different views and different time instants of the same scene may result in inconsistent per-pixel flow and semantic labelling for the same object.

Communicated by Karteek Alahari. Armin Mustafa (a.mustafa@surrey.ac.uk) and Adrian Hilton (a.hilton@surrey.ac.uk), CVSSP, University of Surrey, Guildford, UK.

This paper introduces a framework for semantically coherent long-term 4D scene flow (aligning entire dynamic sequences of > 150 frames), co-segmentation and reconstruction of dynamic scenes, as shown in Fig. 1 for the publicly available Juggler dataset (Ballan et al. 2010) captured with 6 hand-held unsynchronised moving cameras. Joint semantic co-segmentation (top row), flow estimation, and 4D reconstruction (bottom row) result in significant improvement in per-view 2D segmentation, 4D scene flow and reconstruction. The approach enforces semantic coherence both spatially, across different views of the scene, and temporally, across different observations of the same object, for robust long-term 4D flow estimation. Semantic tracklets are introduced to identify similar frames across a sequence, exploiting semantic, motion, shape and appearance information between different observations of a dynamic object over time. This improves temporal coherence, enabling long-term flow estimation along with consistent semantic co-segmentation of long sequences across multiple views. Joint semantic scene flow, co-segmentation and reconstruction enforces spatio-temporal semantic coherence in flow estimation, resulting in improved performance over previous approaches which did not exploit semantic and depth information in space and time.

Fig. 1 Example of input image from the Juggler dataset (Ballan et al. 2010) and the proposed framework, resulting in an accurately labelled segmentation, 4D reconstruction and scene flow (represented by colour mask propagation on the dynamic object of the scene) (Color figure online)
Previous research has demonstrated the advantages of joint semantic segmentation and flow estimation (Tokmakov et al. 2019; Behl et al. 2017; Sevilla-Lara et al. 2016a; Zhu et al. 2018), joint segmentation and reconstruction across multiple views (Yang et al. 2018; Hane et al. 2013, 2016; Engelmann et al. 2016; Kundu et al. 2014), co-segmentation of multiple view images (Khoreva et al. 2019; Chiu and Fritz 2013; Kolev et al. 2012; Djelouah et al. 2015, 2016) and temporal coherence in reconstruction (Li et al. 2018; Goldluecke and Magnor 2004; Floros and Leibe 2012; Larsen et al. 2007; Mustafa et al. 2016a). Our contribution is a framework for joint semantically coherent 4D scene flow with co-segmentation and reconstruction of complex dynamic scenes, yielding semantically coherent per-view long-term scene flow, 2D object segmentation and 4D scene reconstruction from wide-baseline camera views. Our approach to long-term scene flow, co-segmentation and 4D dynamic shape reconstruction leverages recent advances in single-view semantic segmentation and semantic flow estimation.
The input to the framework is multiple-view video. Per-view initial semantic segmentation is obtained using Mask RCNN (He et al. 2017) and FCN; in principle any semantic video segmentation approach could be used. The initial semantic segmentation is combined with sparse reconstruction to obtain an initial semantic reconstruction. A joint semantic flow, co-segmentation and reconstruction optimization is proposed to refine this initial segmentation and reconstruction. Semantic coherence is enforced using semantic tracklets, which link frames to enforce temporal coherence between widely spaced timeframes. Semantic coherence refers to spatial and temporal coherence of semantic labels across the sequence. The per-view semantic flow and reconstruction are combined across views for the entire dynamic sequence to obtain semantically coherent long-term dense 4D scene flow, co-segmentation and reconstruction.
The primary contribution is semantically coherent scene flow, semantic co-segmentation and 4D reconstruction across multiple views. An initial version of this work was published in CVPR, where we proposed a method for semantic segmentation and reconstruction of dynamic scenes. The contributions of this paper over our previous work are as follows: (a) semantically coherent long-term 4D scene flow estimation for dynamic scenes in addition to semantic segmentation and reconstruction; (b) a refined methodology enabling joint semantically coherent scene flow, co-segmentation and reconstruction by adding motion optimization to the energy defined in Eq. 6, with the resulting 2D flow projected to the 3D reconstruction to obtain the final 4D scene flow; (c) a refined methodology to estimate semantic tracklets by adding a motion constraint in Eq. 1; and (d) comprehensive performance evaluation of flow, segmentation and reconstruction on challenging datasets. To the best of our knowledge, this is the first method addressing the problem of semantically and temporally coherent long-term 4D scene flow, semantic co-segmentation and reconstruction for dynamic scenes. The contributions of the paper include:
- A method to estimate scene flow, 4D meshes and 2D semantic video segmentation for natural dynamic scenes from multi-view videos.
- Joint semantic scene flow, co-segmentation and reconstruction of dynamic objects in complex scenes exploiting spatial and temporal coherence.
- Semantic tracklets for long-term 4D reconstruction, enforcing spatial and temporal coherence in semantic labelling for improved scene flow across wide timeframes.
- Improved flow, segmentation and reconstruction of dynamic scenes from multiple moving cameras.

Related Work

Semantic Segmentation
Various methods have been proposed in the literature for semantic segmentation of images. In the first category the image is initially segmented, followed by a per-segment object category classification (Mostajabi et al. 2015; Gupta et al. 2014). However, errors in segmentation propagate to the semantic labelling. Several papers address these issues by proposing deep per-pixel CNN features followed by classification of each pixel in the image (Farabet et al. 2013; Hariharan et al. 2015). The per-pixel prediction leads to segmentations with fuzzy boundaries and spatially disjoint regions. Another group of methods, pioneered by Long et al. (2015) and He et al. (2017), predicts segmentations from the raw pixels. Methods were introduced to improve the spatial coherence of the semantic segmentation using conditional random fields (CRF) (Kundu et al. 2016; Zheng et al. 2015; Chen et al. 2014). End-to-end methods were proposed for semantic segmentation to overcome the limitations of methods using CRF (Zhang et al. 2018), improving the performance significantly.
Co-segmentation: This was first introduced by Rother et al. (2006) for simultaneous binary segmentation of object parts in an image pair and extended to simultaneous segmentation of multiple images (Batra et al. 2010). Multi-view co-segmentation in space and time was introduced in Djelouah et al. (2016): a common foreground is obtained from multiple views using information from appearance and motion cues. Semantic co-segmentation methods from a single video use spatio-temporal object proposals (Joulin et al. 2012; Luo et al. 2015), segments (Kolev et al. 2012), motion (Rother et al. 2006) and foreground propagation (Goldluecke and Magnor 2004). Recently, co-segmentation methods were introduced to segment common objects in a collection of videos for a single object (Maninis et al. 2018; Fu et al. 2014) or multiple objects (Tokmakov et al. 2019; Chiu and Fritz 2013; Zhong and Yang 2016). A CNN method for both single and multiple object segmentation was introduced in Khoreva et al. (2019), exploiting an intuitive training strategy that requires less data.

Semantic Flow Estimation
Methods have been proposed to exploit semantic information to improve monocular flow or per-frame motion estimation (Li et al. 2018; Behl et al. 2017; Sevilla-Lara et al. 2016a; Zhu et al. 2018). Semantic 2D detections were exploited to improve tracking for autonomous driving in Li et al. (2018). The advantages of segmentations, bounding boxes and object coordinates for flow estimation were reviewed in Behl et al. (2017) for dynamic road scenes. Sevilla-Lara et al. (2016a) exploit the advances in static semantic segmentation to segment the image into objects of different types, followed by modelling motion for each object depending on its type. However, for non-rigid dynamic objects such as people, defining a unique motion model for the entire object is not effective. A method to exploit flow information for video segmentation has also been proposed, reporting improved video segmentation. However, all of these methods are designed for street or static scenes and do not exploit stereo or multiple-view information.

Joint Estimation
General multi-view image segmentation methods use appearance and contrast information, which may not be sufficient for complex real-world scenes. To improve the results, joint optimisation of segmentation with 3D reconstruction has been proposed (Mustafa et al. 2016a) by including multiple-view photo-consistency. This concept was extended to semantic segmentation and reconstruction to obtain additional information from the scene (Jiao et al. 2018; Hane et al. 2016; Xie et al. 2016). Methods were introduced to utilize appearance-based pixel categories and stereo cues in a joint framework for street scenes from a monocular camera (Floros and Leibe 2012). These methods used CRFs to perform simultaneous dense reconstruction and segmentation of street scenes captured from a moving camera. A method to estimate the pose and shape of people was proposed in Zanfir et al. (2018), and another method estimates the pose and 3D shape of rigid objects in street scenes (Engelmann et al. 2016). An unsupervised method to jointly learn depth and flow using cross-task consistency was proposed for monocular video (Zou et al. 2018). Another method jointly estimates dense depth, optical flow and camera pose (Yin and Shi 2018). Recently a method was proposed for joint unsupervised learning of depth, camera motion, optical flow and motion segmentation (Ranjan et al. 2018). However, these methods cannot be directly applied to multi-view wide-baseline scenes. A method for joint estimation of 3D geometry and pose was proposed for rigid objects (Tulsiani et al. 2018). Dense semantic reconstruction of rigid objects was proposed by Bao et al. (2013). Joint semantic segmentation and reconstruction using multiple images was proposed for static scenes (Hane et al. 2013). However, these methods are limited to static scenes and rigid objects. Joint motion and reconstruction or segmentation methods (Roussos et al. 2012; Sevilla-Lara et al. 2016b) were proposed for dynamic scenes.
Techniques have been introduced to align dense meshes using correspondence information between consecutive frames (Zanfir and Sminchisescu 2015; Mustafa et al. 2016b), or to extract scene flow by estimating the pairwise surface or volume correspondence between reconstructions at successive frames (Wedel et al. 2011; Basha et al. 2010). State-of-the-art joint estimation methods give per-frame reconstruction and semantic segmentation of scenes (Chen et al. 2019; Kendall et al. 2017) exploiting a multi-task learning framework. However, these methods do not align meshes over the entire sequence, do not give semantically coherent segmentation, and do not work for wide-baseline scenes. Our previous work gives per-frame semantic segmentation and reconstruction of dynamic scenes, leading to unaligned meshes for dynamic sequences. The proposed method estimates 4D scene flow along with reconstruction and semantic co-segmentation, aligning meshes for the entire dynamic sequence to give long-term semantic 4D scene flow.

Fig. 2 Semantically coherent co-segmentation, reconstruction and flow estimation framework
This paper introduces joint semantic flow, co-segmentation and reconstruction, enforcing coherence in both the spatial and temporal domains for scenes with rigid and non-rigid dynamic objects captured with multiple wide-baseline moving cameras. A key contribution of our work is that we combine semantics, shape, motion and appearance information in space and time in a single optimization to generate results automatically. The per-view motion, depth and semantic segmentation are combined across views and time for the entire dynamic sequence to obtain 4D semantic flow. Evaluation demonstrates improved accuracy and completeness of flow, segmentation and reconstruction for complex dynamic scenes.

Semantic 4D Scene Flow and Segmentation
Overview: This section gives an overview of the proposed framework for semantic temporal coherence, illustrated in Fig. 2. It comprises the following stages:
- Input: Multi-view videos are input to the system.
- Initial Semantic Segmentation (Sect. 3.1): Initial semantic labels are estimated for each pixel in each view using state-of-the-art semantic segmentation (He et al. 2017).
- Initial Semantic Reconstruction (Sect. 3.1): Semantic information for each view is combined with sparse 3D feature correspondences between views to obtain an initial semantic 3D reconstruction. This combines semantic information across views but results in inconsistency due to inaccuracies in the initial per-view segmentation.
- Semantic Tracklets (Sect. 3.2): To enforce long-term temporal semantic coherence we propose semantic tracklets, which identify a set of similar frames for each dynamic object. Similarity between any pair of frames is estimated from the per-view semantic labels, appearance, shape and motion information. Semantic tracklets provide a prior for the joint space-time semantic co-segmentation and reconstruction to enforce temporal coherence.
- Joint Semantic Flow, Co-segmentation and Reconstruction (Sect. 3.3): The initial semantic segmentation and reconstruction are refined per view for each dynamic object through joint optimisation of flow, segmentation and shape across multiple views and over time using the semantic tracklets. Per-view information is merged into a single 3D model using Poisson surface reconstruction (Kazhdan et al. 2006).
- Semantic 4D Scene Flow and Segmentation (Sect. 3.3): The process is repeated for the entire sequence and combined across views and time to obtain semantically coherent long-term dense 4D scene flow, co-segmentation and reconstruction for the complete scene.
The following sections include a detailed explanation of the proposed approach and highlight the novel contributions.

Initial Segmentation and Reconstruction
Initial Semantic Segmentation: Mask RCNN is used for initial semantic segmentation because it is the state-of-the-art object detector that computes per-instance masks and per-instance class labels, adopting a two-stage procedure to predict semantic segmentation of images. The object masks from Mask RCNN (He et al. 2017) are combined with background segmentation to obtain a dense semantic segmentation mask. For each frame in the sequence we perform deep semantic segmentation, which estimates the probabilities of the various classes at each pixel in the image. The network is trained on the MS-COCO dataset with 81 classes and refined on the PASCAL VOC12 dataset (Everingham et al. 2012). Despite being the state-of-the-art method, the output masks still do not accurately align with object boundaries, as illustrated in Fig. 3b.

Initial Semantic Reconstruction:
Sparse feature-based reconstruction of the scene is performed using SFD features (Mustafa et al. 2019) and the SIFT descriptor (Lowe 2004), with the constraint that each 3D feature should be visible in 3 or more camera views for robustness (Hartley and Zisserman 2003). The resulting point cloud is clustered in 3D (Rusu 2009). Clusters are formed between points with the same class labels across multiple views, such that each cluster represents a semantically consistent object. Insufficient 3D features may occur on parts of an object due to lack of texture or visual ambiguity. To avoid incomplete reconstruction, the sparse 3D object clusters are combined with the initial semantic segmentation to obtain the initial semantic reconstruction. A mesh is obtained for the sparse 3D point clusters by triangulation, giving an initial coarse reconstruction for each object. The initial coarse reconstruction is back-projected in each view onto the initial semantic segmentation. If the back-projected mask is smaller than its respective semantic region in 2 or more views, the initial coarse reconstruction is dilated in volume (3D) by a factor v so that it encloses the object and matches the segmentation boundaries, where v is computed from N_h, the number of views with a smaller back-projected mask, B_s^i, the area of the semantic segmentation, and B_r^i, the area of the back-projected mask of the initial coarse reconstruction. This automatically initializes the reconstruction of each object in the scene without any strong initial prior.
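The dilation test above can be sketched in a few lines of Python. This is a minimal sketch: the function name `dilation_factor` and the averaged-deficit formula for v are assumptions for illustration, since the paper's exact expression for v is not reproduced in the text.

```python
def dilation_factor(seg_areas, proj_areas):
    """Decide whether to dilate the initial coarse reconstruction.

    seg_areas[i]  -- area B_s^i of the semantic segmentation in view i
    proj_areas[i] -- area B_r^i of the back-projected mask in view i
    Returns 0.0 when no dilation is needed; otherwise an illustrative
    factor v averaging the relative area deficit over the N_h views in
    which the back-projection is smaller than the semantic region.
    """
    deficits = [(bs - br) / bs
                for bs, br in zip(seg_areas, proj_areas) if br < bs]
    n_h = len(deficits)          # number of views with a smaller mask
    if n_h < 2:                  # dilate only if it occurs in >= 2 views
        return 0.0
    return sum(deficits) / n_h
```

With three views where the back-projection is smaller in two of them, the factor is the mean of the two relative deficits; with fewer than two such views no dilation is triggered.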

Semantic Tracklets
In the case of general dynamic scenes with non-rigid objects, independent per-frame scene flow estimation, segmentation and reconstruction leads to incoherent results, for example failure to predict flow and reconstruct thin structures such as limbs, and poorly localized object boundaries. Sequential methods for frame-to-frame temporal coherence are prone to errors due to drift and rapid motion (Beeler et al. 2011; Prada et al. 2016). Previous work by Zhong and Yang (2016) introduced semantic tracklets for object segmentation in single-view video based on co-segmentation across video collections. In this paper we achieve long-term scene flow, semantic co-segmentation and robust temporally coherent 4D reconstruction by introducing semantic tracklets which link instances of dynamic objects across wide timeframes. This provides a prior to constrain long-term flow, co-segmentation and reconstruction. In our work semantic tracklets are defined for multiple views of the same dynamic scene to ensure temporal and spatial coherence in semantic 4D flow and 2D labelling, whereas in Zhong and Yang (2016) tracklets segment objects in a single video and relate them to similar object instances in multiple videos.
Semantic tracklets for a dynamic object are defined as a set of frames which have similar motion across 3 or more views, semantic labels, appearance and 2D shape, as illustrated in Fig. 4. Tracklets are used for long-term learning of flow, semantic label, appearance and shape information for per-view joint semantic 4D scene flow, co-segmentation and reconstruction. Dynamic objects are identified in the scene using motion information from sparse temporal SFD feature correspondences with SIFT descriptors. The semantic, 2D shape, motion and appearance similarity of each dynamic object is evaluated for each frame against all previous frames to identify the set of similar frames which form a tracklet. The similarity metric S_{i,j} (Eq. 1) combines C(), the measure of appearance similarity, M(), the measure of motion similarity, J(), the measure of shape similarity, and L(), the measure of semantic similarity, across the N_v views at each frame. These similarities are combined across time and views, and all frames with similarity > 0.75 are selected as the N_S similar frames forming a semantic tracklet T_i for each dynamic object at the i-th frame. An example of the frame-to-frame similarities is illustrated in Fig. 6 for the Juggler sequence, depicting the differences in the various measures and the overall similarity matrix.

Fig. 5 Comparison of segmentation of the proposed multi-view optimization against optimization with no semantic and no tracklet information respectively for the Handshake and Odzemok datasets

Semantic Similarity:
The semantic region associated with the object at each frame is identified using sparse wide-timeframe SFD feature matches combined with SIFT descriptors. An affine warp (Evangelidis and Psarakis 2008) based on the feature correspondences and region boundary is employed to transfer the semantic region segmentation to the current frame. The semantic similarity metric L^c_{i,j} is defined as the ratio of the number of pixels with the same class label z^c_{i,j} to the total number of pixels in the segmented region y^c_{i,j} at frames i and j for view c: L^c_{i,j} = z^c_{i,j} / y^c_{i,j}.

Appearance Similarity: The appearance metric C^c_{i,j} between frames i and j for the semantic region segmentation in view c corresponding to a dynamic object is based on the ratio of the number of temporal feature correspondences which are consistent across three or more views q^c_{i,j} to the total number of feature correspondences in the segmented region u^c_{i,j} (Mustafa et al. 2016b): C^c_{i,j} = q^c_{i,j} / u^c_{i,j}.

International Journal of Computer Vision

Motion Similarity: The motion metric M^c_{i,j} between frames i and j for the semantic region segmentation in view c corresponding to a dynamic object is based on the ratio of the average motion of the object across three or more views s^c_{i,j} to the maximum motion between frames over the entire sequence (Mustafa et al. 2016b).

Shape Similarity: This metric gives a measure of the 2D region shape similarity between pairs of frames for each dynamic object. Semantic region segmentations are aligned using an affine warp (Evangelidis and Psarakis 2008). The metric is defined as the ratio of the intersection of the aligned segmentations h^c_{i,j} to the union of their areas a^c_{i,j}: J^c_{i,j} = h^c_{i,j} / a^c_{i,j}.

Importance of Semantic Tracklets: Semantic tracklets provide both temporal and multi-view priors for semantic 4D long-term flow estimation and co-segmentation, demonstrating the importance of semantics in obtaining improved scene flow, segmentation and 4D reconstruction.
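The per-view similarity measures and the tracklet selection rule can be sketched as follows. This is a minimal sketch under stated assumptions: the simple per-view average in `frame_similarity` stands in for the paper's combination in Eq. 1, and the function names are hypothetical; only the 0.75 threshold is taken from the text.

```python
def frame_similarity(C, M, J, L):
    """Combine per-view appearance C, motion M, shape J and semantic L
    similarities (each a list over the N_v views, values in [0, 1]) into
    one score.  The uniform average used here is an assumption; the
    paper combines the four measures across views in Eq. 1."""
    n_v = len(C)
    per_view = [(C[v] + M[v] + J[v] + L[v]) / 4.0 for v in range(n_v)]
    return sum(per_view) / n_v

def build_tracklet(i, similarities, threshold=0.75):
    """Select the frames j whose similarity to frame i exceeds the
    threshold (0.75 in the paper) to form tracklet T_i.
    `similarities` maps frame index j -> combined score S_{i,j}."""
    return [j for j, s in similarities.items() if j != i and s > threshold]
```

For example, a frame with perfect appearance and motion similarity but only 0.5 shape and semantic similarity in every view scores 0.75 and would fall just below the tracklet threshold.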
A comparison of optimization with and without semantic label and temporal tracklet information for multiple views is presented in Fig. 5. The proposed semantically coherent optimization, exploiting both semantic labels and tracklets, consistently gives more accurate and more robust 4D flow and segmentation. As demonstrated in Sect. 4, semantic tracklets also result in significant improvement in scene flow estimation, reconstruction and multi-view video segmentation compared to state-of-the-art methods.

Joint Semantic Scene Flow, Co-segmentation and Reconstruction
The goal of multi-view joint semantic flow estimation, co-segmentation and reconstruction is to refine the initial semantic reconstruction obtained in Sect. 3.1 for each dynamic object in the region R per view by optimizing the following variables: (a) a translation in time m_p = (δx_p, δy_p) for each pixel location p = (x_p, y_p) in image I, from a predefined set of flow vectors M; (b) a semantic label from the set of semantic classes obtained as initialization (Sect. 3.1), L = {l_1, ..., l_|L|}, for each pixel p in the initial semantic segmentation region S of each object, where |L| is the total number of classes in the network; and (c) a depth value for each pixel p from a set of depth values D = {d_1, ..., d_|D|-1, U}, where d_i is obtained by sampling the optical ray from the camera and U is an unknown depth value to handle occlusions. Long-term 4D flow and co-segmentation are achieved by propagating the semantic labels across views and over time using tracklets. The cost function for semantically coherent depth and motion estimation and co-segmentation is formulated on the following principles:
- Local spatio-temporal coherence: spatially and temporally neighbouring pixels are likely to have the same semantic labels if they have similar appearance.
- Multi-view coherence: the surface is photo-consistent and semantically consistent across multiple views.
- Depth variation: the depth at spatially neighbouring pixels within an object varies smoothly over most of the surface (except at internal depth discontinuities).
- Long-term temporal coherence: the semantic labels on each object remain consistent across long timeframes in a sequence.
The cost function enforces spatial and temporal constraints on semantics, appearance, motion and shape. Temporal semantic coherence is enforced using tracklets based on the dynamic object similarity S_{i,j} of Eq. 1. An example of multi-view semantic scene flow, segmentation and reconstruction is shown in Fig. 3c. Enforcing temporal coherence with semantic tracklets for multi-view video reduces noise in per-pixel labels. Errors in object segmentation remain due to the low spatial resolution of the initial semantic boundaries; visual ambiguity is addressed by combining information across multiple views. Joint optimisation of multiple-view scene flow, co-segmentation and reconstruction minimises:

E(d, m, l) = λ_d E_d(d) + λ_a E_a(l) + λ_c E_c(l) + λ_sm E_sm(l, d) + λ_s E_s(l, d) + λ_m E_m(l, m)    (6)

where d is the depth at each pixel, m is the motion and l is the semantic label. E_d() is the matching/depth cost, E_a() is the appearance/colour cost, E_c() is the contrast cost, E_sm() is the semantic labelling cost, E_s() is the smoothness cost, and E_m() is the motion/flow cost. The individual cost terms enforce spatial and temporal coherence for dynamic objects in semantic labels, appearance, region boundary contrast and motion. This is solved subject to a geodesic star-convexity constraint on the semantic labels l (Mustafa et al. 2016a), l ∈ S(C), where S(C) is the set of all shapes which are geodesic star-convex with respect to the features in C = {c_1, ..., c_n} within the initial semantic segmentation R, and E(l | x, C) is the geodesic star-convexity constraint enforced on the semantic labels l. α-expansion is used to iterate through the set of labels in L × D × M (Boykov et al. 2001), and a solution is obtained using graph cuts (Boykov and Kolmogorov 2004) across spatial and temporal neighbourhoods as shown in Fig. 4. The initially reconstructed surface R is updated by minimizing the energy in Eq. 6, estimating the depth, segmentation and motion at each pixel within the projection of region R in each view.
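The structure of the energy in Eq. 6 can be sketched as a weighted sum of cost terms evaluated for a candidate assignment (d, m, l). This is a minimal sketch: the λ weights, the brute-force candidate comparison and the function names are illustrative stand-ins; the paper optimises the energy with α-expansion and graph cuts, not by enumeration.

```python
def total_energy(terms, weights):
    """Weighted sum of the cost terms of Eq. 6 for one candidate
    assignment (d, m, l).  `terms` maps term names (E_d, E_a, E_c,
    E_sm, E_s, E_m) to their evaluated costs; the lambda weights
    are illustrative assumptions."""
    return sum(weights[k] * terms[k] for k in terms)

def best_assignment(candidates, weights):
    """Brute-force stand-in for the alpha-expansion / graph-cut
    solver: pick the candidate assignment with the lowest energy.
    Each candidate is a dict with a "terms" entry."""
    return min(candidates, key=lambda c: total_energy(c["terms"], weights))
```

In the real solver the label space L × D × M is far too large to enumerate, which is why α-expansion moves, each solvable by a single graph cut, are used instead.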

Spatial neighbourhood:
The spatial neighbourhood is defined as pairs of spatially close pixels in the image domain. A standard 8-connected spatial neighbourhood is used, denoted ψ_S: the set of pixel pairs (p, q) such that p and q belong to the same frame and are spatially connected.
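Enumerating ψ_S is straightforward; the sketch below (an illustrative helper, not from the paper) emits each unordered 8-connected pixel pair exactly once by using only four of the eight offsets.

```python
def spatial_neighbourhood(width, height):
    """Enumerate the 8-connected spatial neighbourhood psi_S: all
    unordered pairs of pixels (p, q) in the same frame that are
    horizontally, vertically or diagonally adjacent."""
    pairs = []
    offsets = [(1, 0), (0, 1), (1, 1), (1, -1)]  # each pair emitted once
    for y in range(height):
        for x in range(width):
            for dx, dy in offsets:
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    pairs.append(((x, y), (nx, ny)))
    return pairs
```

A 2x2 image yields 6 pairs (every pixel is adjacent to every other), and a 3x3 image yields 20.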

Temporal neighbourhood:
The temporal neighbourhood is defined from the set of tracklets T_i generated for each frame i. Optical flow is used to compute a dense flow field between tracklet frames, initialized from the sparse temporal SIFT feature correspondences. EpicFlow (Revaud et al. 2015) is used to preserve large displacements, as the tracklets are distributed widely in time, and forward-backward flow consistency is enforced. The optical flow vectors define the temporal neighbourhood ψ_T: the set of pixel pairs (p, q) such that q = p + d_{i,j}, where j indexes a frame in the tracklet T_i = {j = t_r} and d_{i,j} is the displacement vector from image i to image j.
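The forward-backward consistency check used when linking tracklet frames can be sketched as follows. This is a minimal sketch: flows are represented as dicts over integer pixel coordinates and the tolerance value is an assumption; a real implementation would use dense flow fields (e.g. from EpicFlow) with bilinear sampling.

```python
def fb_consistent(forward, backward, p, tol=1.0):
    """Forward-backward consistency: warp pixel p by the forward
    flow, warp the result back with the backward flow, and accept the
    correspondence only if it returns within `tol` pixels of p.
    `forward` and `backward` map pixel coords to (dx, dy) vectors."""
    dx, dy = forward[p]
    q = (p[0] + dx, p[1] + dy)           # correspondence in frame j
    if q not in backward:                # no backward flow -> reject
        return False
    bx, by = backward[q]
    rx, ry = q[0] + bx, q[1] + by        # round trip back to frame i
    return abs(rx - p[0]) <= tol and abs(ry - p[1]) <= tol
```

Pairs that fail the round trip are excluded from ψ_T, which suppresses spurious links across the widely spaced tracklet frames.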

Semantic cost E_sm(l, d):
This term enforces multi-view consistency on the semantic label of each pixel p. Inconsistent labels across views are penalised to ensure semantic coherence. The cost is computed from the probability of the class labels at each pixel in the initial semantic segmentation (Chen et al. 2016). Unlike previous approaches to semantic coherence, we enforce spatial and temporal consistency using tracklets across the neighbourhoods. A 3D point P(p, d_p) is assumed along the optical ray passing through pixel p at distance d_p from the reference camera; the projection of the hypothesized point P(p, d_p) in view c is r = φ_c(P), and N_K is the total number of views in which P(p, d_p) is visible. The cost penalises disagreement between l_p and the labels at the projections r across the N_K views, where l_r is the semantic label at pixel r in view c and P_sem(I_p | l_p = l_i) denotes the probability of semantic label l_i at pixel p in the classification image obtained from the initial semantic segmentation.
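A per-pixel sketch of this multi-view penalty is given below. This is an assumed Potts-style variant for illustration only: the exact form of the paper's E_sm term is not reproduced in the text, and the fraction-of-disagreeing-views weighting is my assumption.

```python
import math

def semantic_cost(l_p, prob_p, labels_r):
    """Assumed sketch of a per-pixel semantic consistency cost: the
    hypothesised 3D point P(p, d_p) projects to a pixel r in each of
    the N_K visible views; labels l_r disagreeing with l_p are
    penalised, weighted by the classifier confidence
    -log P_sem(I_p | l_p).  labels_r holds the N_K projected labels."""
    n_k = len(labels_r)
    disagreements = sum(1 for l_r in labels_r if l_r != l_p)
    return (disagreements / n_k) * -math.log(max(prob_p, 1e-12))
```

When all views agree the cost is zero; a confident label (probability near 1) that disagrees across views is penalised less than an uncertain one only through the -log weighting, which is one plausible choice among several.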

Contrast cost E_c(l):
The contrast cost (Chen et al. 2016) is modified to introduce spatial and temporal semantic coherence and to ensure that the region boundaries of dynamic objects have high contrast. Semantic region boundaries are propagated using the tracklets as a prior for the optimization. The cost sums, over pixel pairs in the neighbourhoods, the label indicator μ(l_p, l_q), which is 1 if l_p ≠ l_q and 0 otherwise, weighted by two Gaussian kernels, where ϑ_pq is the Euclidean distance between pixels p and q. The first Gaussian kernel is a bilateral kernel which depends on RGB colour (B() is the bilateral-filtered image) and pixel positions; the second kernel depends only on pixel positions. The parameters σ_α, σ_β and σ_γ control the scales of the Gaussian kernels. The first kernel forces pixels with similar colour and position to have similar labels, while the second kernel considers only spatial proximity when enforcing smoothness. The kernel terms are averaged (mean operator) across the neighbourhoods ψ_S and ψ_T for spatially and temporally coherent contrast respectively.
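The two-kernel pairwise weight can be sketched as in dense-CRF pairwise terms. This is a minimal sketch: the kernel scales and the mixing weights w1, w2 are illustrative assumptions, and the Potts indicator μ(l_p, l_q) is applied outside this function.

```python
import math

def contrast_weight(p, q, Ip, Iq, sigma_a=3.0, sigma_b=10.0,
                    sigma_g=3.0, w1=1.0, w2=1.0):
    """Pairwise contrast weight: a bilateral kernel on colour (ideally
    the bilateral-filtered image B()) and position, plus a spatial-only
    Gaussian.  The weight multiplies mu(l_p, l_q), so it contributes
    only where the labels of p and q differ."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2            # position
    c2 = sum((a - b) ** 2 for a, b in zip(Ip, Iq))          # colour
    bilateral = math.exp(-d2 / (2 * sigma_a ** 2) - c2 / (2 * sigma_b ** 2))
    spatial = math.exp(-d2 / (2 * sigma_g ** 2))
    return w1 * bilateral + w2 * spatial
```

The weight is maximal for identical, co-located pixels and decays with colour difference, so label changes are cheap at high-contrast boundaries and expensive inside uniform regions.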
Appearance cost E_a(l): This cost is computed as the negative log-likelihood (Boykov and Kolmogorov 2004) of colour models learned for the foreground object and the background. The foreground models are learnt from the sparse features of the dynamic object in the current frame and the foreground regions of the tracklets to improve the consistency of the results. Static background models are learnt from the sparse features outside the initial semantic segmentation of the dynamic object in the current frame and the regions outside the semantic segmentation in the tracklets. The appearance cost for a pixel is the negative log-likelihood of P(I_p | l_p = l_i), the probability of pixel p in the reference image belonging to label l_i. Colour models use GMMs with 10 components each for foreground and background.
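The negative log-likelihood under a GMM colour model can be sketched as below. This is a minimal stand-in under stated assumptions: components use isotropic variance rather than full covariances, and the function names are hypothetical; the paper uses 10-component GMMs.

```python
import math

def gmm_nll(x, components):
    """Negative log-likelihood of colour x under a GMM colour model.
    Each component is (weight, mean, var) with isotropic variance,
    a simplification of a full-covariance GMM."""
    dim = len(x)
    total = 0.0
    for w, mu, var in components:
        d2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
        norm = (2 * math.pi * var) ** (dim / 2)
        total += w * math.exp(-d2 / (2 * var)) / norm
    return -math.log(max(total, 1e-300))

def appearance_cost(x, fg_model, bg_model):
    """E_a sketch for one pixel: the cost of each label is the NLL of
    the pixel colour under that label's colour model."""
    return {"fg": gmm_nll(x, fg_model), "bg": gmm_nll(x, bg_model)}
```

A pixel whose colour sits near the foreground model's modes gets a low foreground cost and a high background cost, biasing the graph cut towards the foreground label there.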

Matching cost $E_d(d)$:
The photo-consistency matching cost across views follows Hu and Mordohai (2012): each pixel is either assigned a depth $d_p$, with a cost measuring photo-consistency between the reference view and the auxiliary views, or labelled unknown at a fixed cost $M_U$. Here $r$ denotes the projection of the hypothesised point $P$ in an auxiliary camera, where $P$ is a 3D point along the optical ray passing through pixel $p$ located at a distance $d_p$ from the reference camera. $O_k$ is the set of the $k$ most photo-consistent pairs with the reference camera across views, identified using the highest number of feature matches spatially across frames.
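The role of the fixed unknown cost $M_U$ and the best-$k$ view set $O_k$ can be sketched as follows; the NCC scores, the cost form `1 - NCC`, and the cap via `m_unknown` are illustrative assumptions, not the paper's exact formulation:

```python
def matching_cost(ncc_scores, m_unknown=0.5, k=2):
    """Photo-consistency cost for one pixel and depth hypothesis:
    average (1 - NCC) over the k most photo-consistent views (the set
    O_k), or the fixed cost M_U when the pixel is labelled unknown.
    ncc_scores are per-view NCC values in [-1, 1]; higher means more
    photo-consistent. Illustrative sketch only."""
    if not ncc_scores:
        return m_unknown                          # no support: label unknown
    best_k = sorted(ncc_scores, reverse=True)[:k]  # keep the set O_k
    cost = sum(1.0 - s for s in best_k) / len(best_k)
    return min(cost, m_unknown)                   # unknown label caps the cost

good = matching_cost([0.95, 0.9, 0.2])   # strong photo-consistency
bad = matching_cost([0.1, 0.0, -0.3])    # poor support: falls back to M_U
```

Capping the cost at $M_U$ means occluded or ambiguous pixels can opt out of depth assignment rather than corrupt the estimate with an unreliable match.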

Motion cost $E_m(l, m)$:
This adds the brightness-constancy assumption to the cost function, generalized to the spatial and temporal neighbourhoods. The cost combines two terms, each active only where the semantic label is preserved by the flow ($l_p = l_{p+m_p}$, and zero otherwise): $E_l(\cdot)$ penalizes deviation from brightness constancy in time for a single view, and $E_c(\cdot)$ penalizes deviation from brightness constancy between the reference view and each of the other views at other time instants. Here $N_v$ is the number of views at each time frame, $I_i(p, t)$ is the intensity at pixel $p$ at time instant $t$ in view $i$, and $\psi_S$ and $\psi_T$ are the spatial and temporal neighbourhoods. The term constrains the flow vector $m_p$ to lie within a window of a sparse constraint at $p$, forcing the flow to approximate the sparse 2D temporal correspondences.
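The single-view brightness-constancy term can be illustrated on a toy 1-D image strip (this simplified setting, with integer flows and absolute intensity differences, is an assumption for the sketch):

```python
def brightness_constancy_cost(intensity_t, intensity_t1, flow):
    """Penalize deviation from brightness constancy on a 1-D strip of
    pixels: accumulate |I(p, t) - I(p + m_p, t+1)| wherever the flow
    keeps the pixel inside the image. A toy stand-in for the per-view
    motion term."""
    cost = 0.0
    for p, m in enumerate(flow):
        q = p + m
        if 0 <= q < len(intensity_t1):
            cost += abs(intensity_t[p] - intensity_t1[q])
    return cost

# A strip that shifts right by one pixel: the true flow has zero cost.
I_t = [10, 20, 30, 40]
I_t1 = [0, 10, 20, 30]
zero_cost = brightness_constancy_cost(I_t, I_t1, [1, 1, 1, 1])
```

The correct flow carries each intensity onto itself and costs nothing, while the zero flow pays for every intensity mismatch; the full cost applies the same idea across views and across tracklet frames.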

Smoothness cost $E_s(l, d)$:
The surface smoothness cost introduced in Mustafa et al. (2016a) is extended to the spatial and temporal neighbourhoods. A truncation threshold $d_{max}$ is introduced to avoid over-penalising large discontinuities: $d^s_{max}$ ensures spatial smoothness, and $d^t_{max}$ ensures smoothness over time between the temporal neighbourhood of the tracklets; it is set to twice $d^s_{max}$ to allow large movement of the object between tracklet frames.
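The truncated penalty and the doubled temporal threshold can be sketched in a few lines (the numeric thresholds are illustrative):

```python
def smoothness_cost(d_p, d_q, d_max):
    """Truncated depth-difference penalty: linear up to d_max, then
    capped so large genuine discontinuities are not over-penalised."""
    return min(abs(d_p - d_q), d_max)

d_s_max = 2.0          # illustrative spatial truncation threshold
d_t_max = 2 * d_s_max  # temporal threshold doubled, allowing larger motion

spatial = smoothness_cost(1.0, 4.5, d_s_max)   # capped at the spatial limit
temporal = smoothness_cost(1.0, 4.5, d_t_max)  # same pair, under the temporal cap
```

The same depth difference that saturates the spatial penalty still varies under the temporal one, which is why widely spaced tracklet frames tolerate larger object motion.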

Semantically and Temporally Coherent Reconstruction
The estimated dense flow for each view is projected to the 3D visible surface to establish dense 3D correspondence (scene flow) between frames and between semantic tracklets $T_i$, giving 4D semantically and temporally coherent dynamic scene reconstruction, as illustrated in Fig. 1. Temporal correspondence is first obtained for the view with maximum visibility of 3D points. To increase surface coverage, correspondences are added in order of visibility of 3D points for the different views. Dense temporal correspondence is propagated to new surface regions as they appear using the dense flow estimated from the joint refinement. Temporal coherence is also estimated between semantic tracklets to overcome the limitations of sequential correspondence propagation by correcting any errors introduced in the semantically and temporally coherent reconstruction. As a result, along with segmentation and reconstruction of dynamic scenes, we obtain temporal and semantic per-pixel correspondence information in both 2D and 3D, as shown in Fig. 7. The 2D per-view depth maps are combined using Poisson surface reconstruction (Kazhdan et al. 2006), which leads to some loss of detail in the object mesh compared to the semantic segmentation.
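The visibility-driven ordering described above can be sketched simply; the camera names and visibility counts are hypothetical:

```python
def propagation_order(visibility):
    """Order views for correspondence propagation: start from the view
    seeing the most 3D points and proceed in decreasing visibility,
    mirroring the coverage-driven ordering described in the text."""
    return sorted(visibility, key=visibility.get, reverse=True)

# Hypothetical per-view counts of visible 3D surface points.
views = {"cam0": 1200, "cam1": 2500, "cam2": 800}
order = propagation_order(views)
```

Processing views in this order lets each new view only fill in surface regions not yet covered, so the densest view anchors the correspondence and the others extend coverage.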

Results and Evaluation
Joint semantic co-segmentation, reconstruction and scene flow estimation (Sect. 3.3) is evaluated on a variety of publicly available multi-view indoor and outdoor dynamic scene datasets, with details in Table 1.

4D Flow Evaluation
We evaluate the semantic and temporal coherence obtained using the proposed 4D semantic flow algorithm on all of the datasets. Stable long-term 4D correspondence propagation is illustrated using colour-coded results. The first frame of the sequence is colour coded and the colours are propagated between frames using the 2D-3D motion information obtained from the joint refinement explained in Sect. 3.3. Results of the proposed 4D temporal and semantic alignment, illustrated in Fig. 8, show that the colour of the points remains consistent between frames. The proposed approach is qualitatively shown to propagate the correspondences reliably for complex dynamic scenes with large non-rigid motion.
For comparative evaluation we use: (a) the state-of-the-art dense flow algorithm Deepflow (Weinzaepfel et al. 2013); (b) a recent algorithm for alignment of partial surfaces, 4DMatch (Mustafa et al. 2016b); and (c) Simpleflow (Tao et al. 2012). Qualitative results against 4DMatch, Deepflow and Simpleflow, shown in Fig. 9, indicate that the propagated colour map does not remain consistent across the sequence for large motion, in contrast to the proposed method (red regions indicate correspondence failure).
For quantitative evaluation we compare the silhouette overlap error (SOE). Dense correspondence over time is used to create a propagated mask for each image. The propagated mask is overlapped with the silhouette of the projected surface reconstruction at each frame to evaluate the accuracy of the dense propagation. The error is defined in terms of the ratio
$$\frac{\text{Area of intersection}}{\text{Area of back-projected mask}} \quad (8)$$
Evaluation against the different techniques is shown in Table 2 for all datasets. The silhouette overlap error is lowest for the proposed approach, showing relatively high accuracy. We evaluate temporal coherence across the Magician sequence by measuring the variation in appearance for each scene point between frames and between semantic tracklets for state-of-the-art methods, where $\Delta$ is the difference operator. Evaluation shown in Table 3 against state-of-the-art methods demonstrates the stability of long-term temporal tracking for the proposed joint semantic scene flow, co-segmentation and reconstruction.
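The overlap ratio underlying Eq. (8) can be computed directly from binary masks; this sketch uses flat (1-D) masks for brevity, and how the ratio is turned into the reported error follows the paper's definition:

```python
def silhouette_overlap(mask, silhouette):
    """Ratio of the intersection area to the back-projected mask area,
    computed on flat binary masks (1 = inside). A minimal sketch of the
    quantity in Eq. (8); real masks would be 2-D images."""
    inter = sum(1 for a, b in zip(mask, silhouette) if a and b)
    area = sum(mask)
    return inter / area if area else 0.0

# Propagated mask covers 4 pixels; the projected silhouette agrees on 2.
mask = [1, 1, 1, 1, 0, 0]
sil = [1, 1, 0, 0, 0, 0]
ratio = silhouette_overlap(mask, sil)
```

Accurate dense propagation keeps the propagated mask aligned with the projected reconstruction, driving this ratio toward 1 frame after frame.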

Segmentation Evaluation
Multi-view co-segmentation is evaluated against a variety of state-of-the-art methods, including Conditional random fields as recurrent neural networks (CRF-RNN) (Zheng et al. 2015). The proposed segmentation is also evaluated against the single-view segmentation methods MVC (Chiu and Fritz 2013) and ObMiC (Fu et al. 2014), applied independently on each view for comparison. Comparison against MVVS (Djelouah et al. 2016) is shown in Fig. 10, and evaluation against TcMVS (Mustafa et al. 2016a), SCV (Zhong and Yang 2016) and CRF-RNN (Zheng et al. 2015) is shown in Fig. 12 for the dynamic datasets. Ground-truth segmentation comparison with TcMVS (Mustafa et al. 2016a) is shown in Fig. 11. Quantitative evaluation against state-of-the-art methods is measured by intersection-over-union with ground-truth, shown in Table 4. Ground-truth is available online for most of the datasets and was obtained by manual labelling for the others. The proposed semantically coherent joint multi-view 4D flow, co-segmentation and reconstruction achieves the best segmentation performance against ground-truth for all datasets tested. Results presented in Fig. 12 indicate that the proposed approach accurately segments fine detail such as hands and feet where other approaches are unreliable. Comparison with the tracklet-based co-segmentation approach of Zhong and Yang (2016) is shown in Fig. 12 and Table 4; the proposed approach achieves a significant improvement for multi-view video segmentation (average 45% improvement in intersection-over-union of the segmentation vs. ground-truth).


Reconstruction Evaluation
The reconstruction results obtained from the proposed approach are compared against state-of-the-art approaches in joint segmentation and reconstruction (TcMVS, Mustafa et al. 2016a) and multi-view stereo (Colmap, Schönberger et al. 2016; MVE, Semerjian 2014; SMVS, Langguth et al. 2016). MVE, SMVS and Colmap are state-of-the-art multi-view stereo techniques which do not refine the segmentation. All methods are initialized with the same initial semantic reconstruction (Sect. 3.1) for fair comparison. The comparison of reconstructions in Fig. 13 demonstrates that the proposed method gives consistently more complete and accurate models. Figure 14 presents a comparison to a statistical model-based approach, MBR (Rhodin et al. 2016), which reconstructs a single human body shape from the whole sequence together with the pose at each frame. This provides a good estimate of the underlying body shape but does not take clothing into account, resulting in inaccurate silhouette overlap. Comparison of full scene reconstruction against MVE and SMVS is shown in Fig. 15, showing improved completeness and accuracy. Joint semantic 4D scene flow, co-segmentation and reconstruction results in a 3D model in which every surface point has consistent surface labelling across all views and over time. To illustrate the semantic wide-timeframe coherence achieved using the proposed approach, unique colours are assigned to human body parts in one frame and propagated using the estimated temporal coherence. The colour in different parts of the object remains consistent over time, as shown in Fig. 8. Parameters: Results are insensitive to the parameter settings for all indoor and outdoor scenes. Table 5 shows the parameters used, with constant contrast cost weights $\lambda_{ca} = \lambda_{cl} = 0.5$ and smoothness cost weights $\lambda^S_s = 0.4$, $\lambda^T_s = 0.6$. Limitations: The proposed approach depends on an initial semantic labelling of the scene for each view obtained using Mask-RCNN.
Gross errors or mislabelling may be propagated, resulting in incorrect semantic reconstruction, such as the soft toys labelled as people on the left-hand side of the Odzemok dataset (Fig. 2). Whilst enforcing semantic coherence is demonstrated to improve scene flow, segmentation and reconstruction for a wide variety of scenes, visual ambiguity in appearance and occlusion may degrade performance.

Conclusion
This paper proposes a novel approach to joint semantic 4D scene flow, multi-view co-segmentation and reconstruction of complex dynamic scenes. Temporal and semantic coherence is enforced over long time frames by semantic tracklets identifying similar frames using semantic label, appearance, shape and motion information. Tracklets are used for long-term learning to constrain the per-frame flow and co-segmentation optimization on general dynamic scenes. Joint optimization simultaneously improves the scene flow, semantic segmentation and reconstruction of the scene by enforcing semantic coherence both spatially across views and temporally across widely spaced similar frames. Comparative evaluation demonstrates that enforcing semantic coherence achieves significant improvement in scene flow and segmentation of general dynamic indoor and outdoor scenes captured with multiple hand-held cameras. The introduction of space-time semantic coherence in the proposed framework achieves better reconstruction and flow estimation than state-of-the-art methods.