Pedestrian segmentation based on a spatio-temporally consistent graph-cut with optimal transport
Abstract
We propose a method for segmenting pedestrians in a video in a spatio-temporally consistent way. Given a bounding box sequence of each pedestrian obtained by a conventional pedestrian detector and tracker, we construct a spatio-temporal graph on the video and segment each pedestrian within a well-established graph-cut segmentation framework. More specifically, the energy function for the graph-cut segmentation comprises three terms: (1) a data term, (2) a spatial pairwise term, and (3) a temporal pairwise term. To maintain better temporal consistency of segmentation even under relatively large motions, we introduce a transportation-minimization framework that provides temporal correspondence. Moreover, we introduce the edge-sticky superpixel to maintain the spatial consistency of object boundaries. In experiments, we demonstrate that the proposed method improves segmentation accuracy indices, such as the mean and weighted intersection over union, on the TUD datasets and the PETS2009 dataset at both the instance level and semantic level.
Keywords
Pedestrian segmentation · Edge-sticky superpixel · Optimal transport · Conditional random field
Abbreviations
- CRF: Conditional random field
- GMM: Gaussian mixture model
- OT: Optimal transport
- ESS: Edge-sticky superpixel
- SLIC: Simple linear iterative clustering
- SEEDS: Superpixels extracted via energy-driven sampling
- M. IoU: Mean intersection over union
- W. IoU: Weighted intersection over union
- P. IoU: Pedestrian intersection over union
1 Introduction
Silhouette extraction or human body segmentation is widely conducted as the first step in many high-level computer vision tasks of video surveillance systems, such as human tracking [1, 2, 3, 4], human action recognition [5, 6, 7, 8] and gait-based identification and recognition [9, 10, 11]. In human tracking, the extracted human silhouette is used for human full-body localization or human part localization [1, 2, 3, 4]. In human action recognition, studies [5, 7, 8] have directly extracted features from a silhouette sequence; Chaaraoui et al. [6] used contour points of the human silhouette for action representation. For gait-based identification and verification, Collins et al. [9] used the silhouette for shape matching; Chen et al. [2] extracted features from the spatio-temporal silhouette for gait recognition while Liu et al. [11] proposed the average silhouette as a feature for recognition.
Pedestrian silhouette extraction has long been studied. This research mainly falls into three categories: supervised methods, unsupervised methods, and semisupervised methods.
Supervised methods [12, 13] have performed well in recent years. A typical approach of supervised pedestrian silhouette extraction requires a manually annotated mask of the target in the first frame and propagates the mask frame by frame. An automatic surveillance system, however, cannot adopt manual annotation.
Unsupervised methods, including methods based on background subtraction (e.g., [14, 15]) and motion segmentation (e.g., [16, 17, 18, 19]), are the most popular approaches because they do not require manual annotation. Methods based on background subtraction model the background using statistical models (e.g., a Gaussian mixture model) and extract the silhouettes of moving targets as the foreground. However, methods based on background subtraction only classify the moving target and background and do not realize instance-level silhouette extraction. Multi-label motion segmentation assigns human labels to sparse points or pixels according to motion information (e.g., optical flow), allowing targets with different motion patterns to be discriminated. However, because of the lack of object detection information, motion segmentation still cannot discriminate pedestrians with the same motion pattern (e.g., pedestrians walking in the same direction side by side) and may sometimes assign different labels to human parts with different motion patterns. Motion segmentation therefore suffers from under-segmentation and over-segmentation.
Semisupervised methods that do not require a manually annotated silhouette in the first frame but only a bounding box trajectory are more suitable for pedestrian silhouette extraction by an automatic surveillance system, because the trajectory of the bounding box can be extracted automatically using recently advanced approaches of object detection [20, 21, 22] and multiple-object tracking [23, 24, 25]. To the best of our knowledge, existing semisupervised methods use optical flow to maintain temporal consistency (e.g., [26]). Because optical flow sometimes fails to handle large displacement, optical-flow-based semisupervised approaches often suffer from segmentation errors for human parts with large displacement (e.g., a pedestrian’s legs and arms). Moreover, a conditional random field (CRF) framework that uses a color-based Gaussian mixture model (GMM) for the background data term and simple linear iterative clustering (SLIC) superpixels [27] as nodes in the CRF has been adopted [26]. However, color information is not enough for modeling a nonhuman region (e.g., when a pedestrian and the background have similar colors), and the SLIC superpixel sometimes cannot preserve the object boundary well, which is vital for construction of the spatial pairwise term.
Optimal transport (OT)-based temporal consistency. In contrast to most related work, we adopt OT to maintain temporal consistency. A main drawback of optical flow is its limited capacity to handle large displacement. Although there are methods that improve the handling of large displacement (e.g., the pyramid strategy [28]), the motion of leg and arm parts still cannot be described correctly. Compared with conventional optical flows, the proposed method successfully handles large displacement between two frames thanks to the global optimal property of the OT framework. As far as we know, the OT framework is usually used to measure the difference between two discrete distributions (e.g., a dissimilarity measure between two color histograms), which is also known as the earth mover’s distance. The proposed method does not use the final outcome of the OT framework (i.e., a distance) but the "process" of the OT framework (i.e., the flow (or correspondence) between two frames), which is the primary novelty of the proposed method.
Combination of the edge-sticky superpixel (ESS) and OT. The time complexity of the OT increases as the dimension of the discrete distributions (e.g., the number of bins of histograms) increases, and direct application of the OT to a pixel-wise image representation is computationally prohibitive. We therefore need to appropriately transform the input image into a discrete distribution with a relatively low dimension. Superpixel segmentation is one such effective way to represent an image as a discrete distribution while keeping information, that is, compressing redundancy. More specifically, we regard an input image as a histogram, where the number of superpixels is the number of bins, the gravity center of a superpixel is the representative value of a bin, and the number of pixels (area) of a superpixel is the frequency (or vote) for a bin. Moreover, superpixel segmentation needs to preserve object boundaries well for our final goal, that is, pedestrian silhouette extraction. State-of-the-art superpixel segmentation methods (e.g., the SLIC superpixel [27] and superpixels extracted via energy-driven sampling (SEEDS) superpixel [29]) provide a balance between appearance and shape regularity, and usually perform well in computer vision tasks. However, this balance between appearance and shape regularity does not always guarantee that the object boundary is well preserved. Our ultimate target is to extract pedestrians’ silhouettes, and we thus need to adopt a superpixel segmentation method that better preserves object boundaries. We therefore adopt the ESS, which introduces edge detection information explicitly into the process of superpixel generation. As a result, the object boundary can be preserved well while balancing appearance and shape regularity.
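As an illustration of this image-as-histogram view, the following NumPy sketch builds the discrete distribution from a superpixel label map; the function and variable names are ours, not the paper's.

```python
import numpy as np

def superpixels_to_distribution(label_map, image):
    """Represent an image as a discrete distribution over superpixels.

    label_map : (H, W) int array, label_map[y, x] = superpixel index in 0..K-1.
    image     : (H, W, 3) float array, used for a per-superpixel mean color.

    Returns
    -------
    weights : (K,) probability vector (superpixel areas, normalized) -- the "votes".
    centers : (K, 2) gravity centers of the superpixels -- the bin representatives.
    colors  : (K, 3) mean colors, usable as a per-bin appearance descriptor.
    """
    K = label_map.max() + 1
    ys, xs = np.indices(label_map.shape)
    areas = np.bincount(label_map.ravel(), minlength=K).astype(float)
    areas_safe = np.maximum(areas, 1.0)  # guard against unused label indices
    centers = np.stack([
        np.bincount(label_map.ravel(), weights=ys.ravel(), minlength=K) / areas_safe,
        np.bincount(label_map.ravel(), weights=xs.ravel(), minlength=K) / areas_safe,
    ], axis=1)
    colors = np.stack([
        np.bincount(label_map.ravel(), weights=image[..., c].ravel(), minlength=K) / areas_safe
        for c in range(3)
    ], axis=1)
    weights = areas / areas.sum()  # normalize areas so they sum to one
    return weights, centers, colors
```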
Performance improvement on segmentation benchmarks. We demonstrate that the proposed method improves the performance of pedestrian silhouette extraction at both the instance level and semantic level on public datasets compared with state-of-the-art methods.
2 Related work
The silhouette extraction or human segmentation of multiple pedestrians has been addressed in the literature [12, 13, 16, 26, 30, 31, 32]. We categorize typical approaches as follows:
∙ Supervised methods. Supervised methods perform well in video segmentation. The most popular framework [12, 13] is to manually annotate the target’s mask in the first frame and propagate the target mask to other frames. In [13], a two-branch approach was proposed whereby the features from ResNet-101 [33] and FlowNet [34] were combined for joint object segmentation and optical flow estimation. In [12], frame-by-frame object segmentation was implemented by learning the appearance of the annotated object. However, because the mask annotation imposes a manual burden, it is difficult to apply supervised methods to pedestrian silhouette extraction in an automatic surveillance system.
∙ Unsupervised methods. Unsupervised methods require no manual annotation and hence can be applied directly to an automatic surveillance system. Most unsupervised methods are based on motion information. The temporal superpixel [35] incorporates optical flow into a superpixel segmentation framework to realize a temporally consistent superpixel. Ochs et al. [16] adopted a two-step approach: generate sparse segments by clustering long-term trajectories and then obtain dense segments according to color. However, the temporal superpixel is a superpixel segmentation and thus requires a manual annotator that specifies the pedestrian’s superpixels, which is again not possible for an automatic surveillance system. Ochs’s approach [16] is also prone to under-segmentation because multiple pedestrians walking in the same direction are likely to be merged into an identical segment.
∙ Semisupervised methods. Compared with supervised and unsupervised methods, semisupervised methods that only require a bounding box annotation are more suitable for silhouette extraction by a real-world surveillance system. Milan et al. [26] proposed a joint tracking and segmentation method that first applies superpixel segmentation and multiple-pedestrian tracking. A CRF is then constructed and all superpixels are assigned the labels of pedestrian trajectories. Because optical flow is used in the construction of the CRF, Milan’s approach sometimes fails for a pedestrian’s legs, for which there is large spatial displacement.
∙ Pedestrian segmentation methods for a single frame. In recent years, great strides have been made in convolutional neural network (CNN)-based image semantic segmentation and instance segmentation. In [31], a multipath refinement network was presented in which CNN features with multiple resolutions are fused so that semantic features can be refined using lower-level features. In [32], an object detection network [20] is extended with a fully convolutional network [36] so that object detection and instance-level segmentation are achieved jointly. Single-frame segmentation methods can therefore be easily extended to pedestrian silhouette extraction in video using bounding box trajectories.
3 Proposed method
3.1 Problem setting
This study presents a method for extracting silhouettes of multiple pedestrians from a video. We assume that the camera is static and that the bounding box trajectories are given by well-established detectors [20] and trackers [23].
3.2 Framework
Fig. 1 Framework of the proposed method. a Given input images. b Superpixel segmentation followed by the construction of a CRF consisting of c a data term, d spatial pairwise term, and e temporal pairwise term. Application of the graph-cut with α-expansion to get f the segmentation result
Superpixel segmentation. Given an input image sequence, superpixel segmentation is first applied frame by frame to reduce the computational cost. We adopt the ESS, which better preserves object boundaries.
Superpixel-wise labeling. Given the superpixel segmentation result and pedestrian trajectories (i.e., a bounding box sequence for a pedestrian), each superpixel is assigned with a trajectory label (i.e., a pedestrian label) in this step, resulting in instance-level segmentation as shown in Fig. 1f.
The label assignment problem has been well studied for decades, and recent progress has expanded its application to many computer vision tasks. As an example, Wu et al. [37] proposed an adaptive label assignment method to handle the “one-example person re-identification” problem, where only one labeled example is available for each human identity. The adaptive label assignment method both selects a set of candidates from the unlabeled data and assigns labels to the candidates using a nearest-neighbor (NN) classifier in the feature space extracted by a CNN model.
However, in the present work, we cannot generate a set of "labeled data" as in [37] owing to the different problem settings. Furthermore, spatio-temporal consistency is strongly required in the present work, and pairwise features that maintain spatio-temporal consistency (e.g., edge-based features) can only be extracted in a pairwise manner instead of using the independently extracted features. As a result, the approach in [37] cannot be applied directly in the superpixel-wise labeling step of the present work.
To better handle the features extracted in a pairwise manner, we adopt the well-established CRF for superpixel-wise labeling. The label assignment problem is then formulated as a CRF problem and solved using the graph-cut with α-expansion algorithm.
Details are discussed in the following subsections.
3.3 ESS
The superpixel is a popular technology used to reduce the redundancy of an image and is employed in many computer vision applications. We use the superpixel because not only does it reduce the computational complexity but also it preserves object boundaries.
Fig. 2 Framework of the ESS. a Each pixel in the input image (e.g., a 5×5 grayscale image) initialized as a superpixel, where a black number is the label of a superpixel. b Each pixel relabeled under an energy minimization framework. In each iteration, we scan and update the labels of all pixels. For each pixel (yellow), the label assignment costs of its four-connected neighbors (blue) are calculated as shown by red numbers, and each pixel’s label is updated with the lowest-cost neighbor’s label. The iteration continues until there is no change in each pixel’s label. Finally, the superpixel segmentation result is obtained as in c
where each pixel is assigned with the label of a superpixel (i.e., the index of a superpixel).
where α, β, and γ are hyperparameters. The location and appearance vector for the i-th pixel are denoted vloc(i) and vapp(i), while the mean location and appearance vector for the l-th superpixel are denoted μloc(l) and μapp(l). Moreover, cedge is the edge cost and Al is the size of the l-th superpixel.
The first and second terms of Eq. (2) maintain the spatial consistency of the superpixel, while the third term controls the size of the superpixel.
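The iterative relabeling described above (initialize each pixel as its own superpixel, then repeatedly move pixels to the lowest-cost label among their four-connected neighbors) can be sketched as follows. This is a schematic re-implementation under our own assumptions about the cost in Eq. (2), namely a weighted combination of location distance, appearance distance, edge cost, and superpixel size; it is not the authors' released code.

```python
import numpy as np

def relabel_pixels(labels, loc, app, edge_cost, alpha, beta, gamma, n_iters=10):
    """Schematic ESS-style relabeling: each pixel may take the label of one of
    its four-connected neighbors if that label has a lower assignment cost.

    labels    : (H, W) int array of current superpixel labels.
    loc, app  : (H, W, 2) and (H, W, C) per-pixel location / appearance vectors.
    edge_cost : callable (y, x, label) -> float, e.g., an Eq. (3)-style edge cost.
    """
    H, W = labels.shape
    for _ in range(n_iters):
        changed = False
        # per-superpixel means (mu_loc, mu_app) and sizes A_l, recomputed each sweep
        K = labels.max() + 1
        A = np.bincount(labels.ravel(), minlength=K).astype(float)
        A = np.maximum(A, 1.0)  # guard against superpixels that have become empty
        mu_loc = np.stack([np.bincount(labels.ravel(), weights=loc[..., d].ravel(),
                                       minlength=K) for d in range(2)], 1) / A[:, None]
        mu_app = np.stack([np.bincount(labels.ravel(), weights=app[..., d].ravel(),
                                       minlength=K) for d in range(app.shape[2])], 1) / A[:, None]

        def cost(y, x, l):
            # assumed form: weighted location + appearance distances, edge cost, size
            return (alpha * np.linalg.norm(loc[y, x] - mu_loc[l])
                    + beta * np.linalg.norm(app[y, x] - mu_app[l])
                    + edge_cost(y, x, l)
                    + gamma * A[l])

        for y in range(H):
            for x in range(W):
                candidates = {labels[y, x]}
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        candidates.add(labels[ny, nx])
                best = min(candidates, key=lambda l: cost(y, x, l))
                if best != labels[y, x]:
                    labels[y, x] = best
                    changed = True
        if not changed:
            break
    return labels
```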
Fig. 3 Framework of structured edge detection (SED). a Given input images and two samples of image patches (blue and green edged). b Binary edge masks obtained using a pre-trained random forest. c Edge detection result (i.e., edge probability map) obtained by aggregating all edge masks
Structured edge detection (SED) [38] first separates an input image into a set of image patches. A pre-trained random forest is then applied to the set of image patches to obtain a set of binary edge masks as shown in Fig. 3b. Finally, the set of edge masks is aggregated to generate the edge probability map (i.e., the edge detection result) as shown in Fig. 3c. We refer the reader to [38] for more details.
Fig. 4 Example of the edge cost function. a Input image of frame t. b Clipping around the i-th pixel. Edge probability \(p^{t}_{\text {edge}} = 0.9\) on the left side (as represented by red) and \(p^{t}_{\text {edge}} = 0.1\) in the middle and on the right side (as represented by blue). c Edge cost of assigning the label l1 to the i-th pixel cedge(i,l1)=−0.1 while cedge(i,l2)=−0.9; therefore, l2 is more likely to be assigned to the i-th pixel
Figure 4 shows that the i-th pixel’s four-connected neighbors are j1 (whose superpixel label is l1) and j2, j3, and j4 (whose superpixel label is l2). The edge probability is represented in pseudo-color, where the edge probability of a red pixel is 0.9 while that of a blue pixel is 0.1; i.e., there is an edge on the left side of the i-th pixel. According to Eq. (3), cedge(i,l1) = −0.1 and cedge(i,l2) = −0.9, so it is more difficult to assign the label l1 than the label l2 to the i-th pixel. As a result, the edge cost function helps preserve the object boundary.
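Equation (3) is not reproduced here; the function below is one form that is consistent with the worked example in Fig. 4 (the cost is −(1 − edge probability) averaged over the four-connected neighbors carrying the candidate label) and should be read as our assumption rather than the paper's exact definition.

```python
import numpy as np

def edge_cost(y, x, label, labels, p_edge):
    """Edge cost of assigning `label` to pixel (y, x).

    One form consistent with Fig. 4: average -(1 - p_edge) over the
    four-connected neighbors that currently carry `label`. Crossing a strong
    edge (high p_edge) yields a cost close to 0, which is unattractive compared
    with labels reachable without crossing an edge (cost close to -1).
    """
    H, W = labels.shape
    costs = []
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if 0 <= ny < H and 0 <= nx < W and labels[ny, nx] == label:
            costs.append(-(1.0 - p_edge[ny, nx]))
    return float(np.mean(costs)) if costs else 0.0
```

With the values of Fig. 4, the neighbor with label l1 has edge probability 0.9 and gives cost −0.1, while the neighbors with label l2 have edge probability 0.1 and give cost −0.9, matching the example.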
Fig. 5 Example of the ESS. a Input image. b Edge probability map (represented by pseudo-color). c ESS result. The pedestrian’s boundary is well preserved by the ESS
3.4 Superpixel-wise labeling
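The CRF energy minimized in this step is not reproduced here; a plausible general form, reconstructed from the terms and weights defined in this subsection (a sketch, not necessarily the authors' exact formula), is

```latex
E(X_{\mathrm{CRF}}) =
    \sum_{p} E_{\mathrm{Data}}\bigl(p, X_{\mathrm{CRF}}(p)\bigr)
  + \omega_{\mathrm{S}} \sum_{(p,q) \in \mathcal{N}_{\mathrm{S}}}
        E_{\mathrm{S}}\bigl(p, q, X_{\mathrm{CRF}}(p), X_{\mathrm{CRF}}(q)\bigr)
  + \omega_{\mathrm{T}} \sum_{(p,q) \in \mathcal{N}_{\mathrm{T}}}
        E_{\mathrm{T}}\bigl(p, q, X_{\mathrm{CRF}}(p), X_{\mathrm{CRF}}(q)\bigr)
```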
Here, the first term is the data term while the second and third terms are respectively spatial and temporal pairwise terms. ωS and ωT are respectively the weights of spatial and temporal pairwise terms. The definitions of \( \mathcal {N}_{\mathrm {S}}\), \(\mathcal {N}_{\mathrm {T}}\), EData, ES, and ET are explained in the following sections.
The multi-label CRF problem can then be solved using the graph-cut with α-expansion algorithm [39], which is widely used for CRF inference. The algorithm iterates each possible label (i.e., the label α in a given CRF), and in each iteration, the algorithm segments the α and the non- α components with the graph-cut. The energy function of the CRF in this work contains spatial and temporal pairwise terms, and the graph-cut with α-expansion algorithm is thus adopted in a spatio-temporally consistent way.
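For concreteness, the sketch below shows one expansion move implemented with the PyMaxflow library for a Potts-style pairwise cost (a per-edge weight paid whenever two connected nodes take different labels). It is a generic illustration of α-expansion under this simplification, not the paper's implementation; the node, edge, and weight names are ours.

```python
import numpy as np
import maxflow  # PyMaxflow (pip install PyMaxflow)

def expansion_move(labels, unary, edges, alpha):
    """One alpha-expansion move: every node either keeps its current label or
    switches to `alpha`, decided jointly by a single s-t min-cut.

    labels : (N,) int array, current label per node (superpixel).
    unary  : (N, L) array, unary[p, l] = data cost of label l at node p.
    edges  : iterable of (p, q, w); w is paid when p and q take different labels.
    """
    N = unary.shape[0]
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(N)
    # Binary variable x_p: 0 = keep labels[p] (source side), 1 = switch to alpha (sink side).
    for p in range(N):
        g.add_tedge(nodes[p], unary[p, alpha], unary[p, labels[p]])
    for p, q, w in edges:
        A = w * (labels[p] != labels[q])   # both keep their labels
        B = w * (labels[p] != alpha)       # p keeps, q switches to alpha
        C = w * (labels[q] != alpha)       # p switches to alpha, q keeps
        # Reparameterize E(x_p, x_q) = A + (C - A) x_p - C x_q + (B + C - A)(1 - x_p) x_q
        c_p, c_q, lam = C - A, -C, B + C - A
        g.add_tedge(nodes[p], max(c_p, 0.0), max(-c_p, 0.0))
        g.add_tedge(nodes[q], max(c_q, 0.0), max(-c_q, 0.0))
        g.add_edge(nodes[p], nodes[q], lam, 0.0)  # paid when x_p = 0 and x_q = 1
    g.maxflow()
    new_labels = labels.copy()
    for p in range(N):
        if g.get_segment(nodes[p]) == 1:   # sink side: node is expanded to alpha
            new_labels[p] = alpha
    return new_labels

def alpha_expansion(labels, unary, edges, n_sweeps=3):
    """Outer loop of alpha-expansion: iterate expansion moves over all labels."""
    for _ in range(n_sweeps):
        for alpha in range(unary.shape[1]):
            labels = expansion_move(labels, unary, edges, alpha)
    return labels
```

In the setting above, the edge list would combine the spatial links in \(\mathcal {N}_{\mathrm {S}}\) and the temporal links in \(\mathcal {N}_{\mathrm {T}}\), each with its respective weight.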
3.4.1 Data term
The data term EData contains two components, namely a pedestrian term \(E_{\text {Data}}\left (p,X_{\text {CRF}}(p)\not =l_{\text {BG}}^{\text {TR}}\right)\) and a background term \(E_{\text {Data}}\left (p, X_{\text {CRF}}(p)=l_{\text {BG}}^{\text {TR}}\right)\) for an arbitrary superpixel p.
Fig. 6 Example of the background term. a Input image. b Human score map from RefineNet. c Background term
We then sample superpixels and train a GMM for each pedestrian to define the pedestrian term. We denote the set of pixels belonging to the k-th superpixel as uk={i|XSP(i)=k} and the set of pixels inside the bounding box trajectory of the i-th pedestrian ti as \(\mathcal {U}_{i}\). If the k-th superpixel overlaps with the bounding box sequence of the i-th pedestrian ti (i.e., \(u_{k} \bigcap \mathcal {U}_{i}\not =\emptyset \)), it is sampled for the GMM training of the i-th pedestrian. A superpixel may sometimes overlap with multiple trajectories, and we thus adopt a winner-takes-all strategy by which the pedestrian closest to the camera (i.e., the pedestrian whose bounding box has the lowest bottom edge in the image) takes the superpixel.
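A sketch of this sampling and per-pedestrian GMM training is given below; it is our own simplification using scikit-learn's GaussianMixture, and the data structures, variable names, and number of mixture components are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_pedestrian_gmms(sp_colors, box_bottoms, sp_box_overlaps, n_components=5):
    """Fit one color GMM per pedestrian from superpixels overlapping its boxes.

    sp_colors       : (K, 3) mean color per superpixel.
    box_bottoms     : dict trajectory id -> bottom y-coordinate of its bounding box
                      (larger y = lower in the image = closer to the camera).
    sp_box_overlaps : list of length K; sp_box_overlaps[k] = set of trajectory ids
                      whose bounding boxes overlap superpixel k.
    """
    samples = {}  # trajectory id -> list of color samples
    for k, overlapping in enumerate(sp_box_overlaps):
        if not overlapping:
            continue
        # winner-takes-all: the pedestrian whose box bottom is lowest in the image
        winner = max(overlapping, key=lambda t: box_bottoms[t])
        samples.setdefault(winner, []).append(sp_colors[k])
    gmms = {}
    for t, cols in samples.items():
        cols = np.asarray(cols)
        n = min(n_components, len(cols))  # avoid more components than samples
        gmms[t] = GaussianMixture(n_components=n).fit(cols)
    return gmms

# The pedestrian term for superpixel k and pedestrian t can then be taken as a
# negative log-likelihood, e.g., -gmms[t].score_samples(sp_colors[k:k + 1])[0].
```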
Fig. 7 Example of the pedestrian term. a Input image and pedestrian’s bounding box. b Pedestrian term of the pedestrian inside the bounding box. Outside the bounding box, the pedestrian term is set as a sufficiently large constant
3.4.2 Spatial pairwise term
We then use the color and edge probability to formulate the spatial pairwise energy function ES.
Fig. 8 Example of color-based pairwise energy. a Input image. b Color-based pairwise energy. If the colors between pedestrians or between a pedestrian and the background are similar, the color-based pairwise energy fails to preserve the object’s boundary; e.g., the pedestrian’s boundary inside the white bounding box in b
The color-based pairwise energy function may sometimes fail to maintain spatial consistency when the colors of different pedestrians or a pedestrian and the background are similar as shown in the white bounding box in Fig. 8. We therefore further include the edge probability in the spatial pairwise energy function.
Fig. 9 Example of edge-based pairwise energy. a Input image. b Edge-based spatial pairwise energy. The pedestrian’s boundary in the bounding box in b is better preserved than the same region in Fig. 8
where ωe is a hyperparameter that controls the weight of edge-based spatial pairwise energy.
3.4.3 OT-based temporal pairwise term
where connT is the temporal connectivity function.
Different from spatial connectivity, which can be easily defined according to the pixel lattice structure, the temporal connectivity must involve object motion information. To the best of our knowledge, optical flow is the most popular motion information used to define temporal connectivity. However, optical flow usually fails to handle the large displacement that often occurs for the pedestrian leg and arm. We therefore introduce OT-based temporal connectivity for better motion estimation.
The OT distance, also known as the earth mover’s distance, is a useful distance with which to compare two probability distributions. The OT problem is described as follows.
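A standard form, written with the notation defined just below and with M = [m(i,j)] denoting the ground cost matrix (our reconstruction of the omitted formula), is

```latex
d_{M}(\boldsymbol{r}, \boldsymbol{c})
  = \min_{P \in \mathcal{P}(\boldsymbol{r}, \boldsymbol{c})} \langle P, M \rangle_{F}
```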
where 〈·,·〉F denotes the Frobenius dot product. \(\mathcal {P}(r, c)=\left \{ P\in \mathbb {R}_{+}^{m\times n} \vert P\boldsymbol {1}_{n}=\boldsymbol {r}, P^{T}\boldsymbol {1}_{m}=\boldsymbol {c}\right \}\), where 1m and 1n are m- and n-dimensional vectors of ones.
In this study, we formulate motion estimation as an OT problem. We denote the superpixel labels in frame t by \(\mathcal {L}_{\text {SP}}^{t} = \left \{l_{1}^{t},..., l_{|\mathcal {L}_{\text{SP}}^{t}|}^{t}\right \}\) and then define a superpixel size vector in frame t as \(\hat {\boldsymbol {A}}^{t} = \left [A^{t}_{1},..., A^{t}_{|\mathcal {L}_{\text{SP}}^{t}|}\right ]\), where \(A^{t}_{i}\) is the size of the \(l_{i}^{t}\)-th superpixel. The normalized size vector is then defined as \(\boldsymbol {A}^{t}=\hat {\boldsymbol {A}}^{t}/|\mathcal {L}_{\mathrm{P}}^{t}|\), where \(|\mathcal {L}_{\mathrm{P}}^{t}|\) is the total number of pixels in frame t. Because \(\|\boldsymbol{A}^{t}\|=1\) and \(\forall i\in \left \{1,..., |\mathcal {L}_{\text{SP}}^{t}|\right \}, \boldsymbol {A}^{t}(i)\geq 0\), At is a probability distribution. We therefore treat the normalized size vectors of two consecutive frames, At and At+1, as the inputs of an OT problem.
The first term of m(i,j) encourages transportation between spatially nearer superpixels, while the second term encourages transportation between superpixels with similar appearance. Furthermore, we include the third term to encourage transportation between superpixels in the pedestrian region.
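The sketch below illustrates this superpixel-level OT step. The exact form of m(i,j) and the weights (lam_loc, lam_app, lam_ped) are our assumptions, and the transport plan is computed with entropy-regularized Sinkhorn iterations as a stand-in for an exact solver.

```python
import numpy as np

def ot_cost_matrix(cent_t, cent_t1, app_t, app_t1, ped_t, ped_t1,
                   lam_loc=1.0, lam_app=1.0, lam_ped=1.0):
    """Pairwise transport cost between superpixels of frames t and t+1.

    cent_* : (K, 2) gravity centers; app_* : (K, C) appearance vectors;
    ped_*  : (K,) in {0, 1}, 1 if the superpixel lies in a pedestrian bounding box.
    The three terms mirror the description above: spatial proximity, appearance
    similarity, and a discount for transport that stays in the pedestrian region.
    """
    d_loc = np.linalg.norm(cent_t[:, None, :] - cent_t1[None, :, :], axis=2)
    d_app = np.linalg.norm(app_t[:, None, :] - app_t1[None, :, :], axis=2)
    ped_bonus = ped_t[:, None] * ped_t1[None, :]  # 1 when both superpixels are in pedestrian regions
    return lam_loc * d_loc + lam_app * d_app - lam_ped * ped_bonus

def sinkhorn_plan(a, b, M, reg=0.1, n_iters=200):
    """Entropy-regularized OT: returns a plan P with P @ 1 ~= a and P.T @ 1 ~= b."""
    K = np.exp(-M / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u + 1e-16)
        u = a / (K @ v + 1e-16)
    return u[:, None] * K * v[None, :]

# Temporal connectivity: superpixel pairs (i in frame t, j in frame t+1) whose
# transported mass P[i, j] exceeds a threshold (th_temp in the paper) are connected.
```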
Fig. 10 Example of OT-based temporal connectivity. a Manually selected pedestrian’s superpixel in frame t. b OT-based temporally connected superpixels in frame t+1
where the definition of λ is the same as in Eq. (16).
4 Experiments
4.1 Experimental setting
4.1.1 Datasets
We test our proposed method on four publicly available image sequences: TUD-Stadtmitte, TUD-Campus, TUD-Crossing and PETS2009 S2L1. Each sequence contains a long-term occlusion that makes segmentation highly challenging. Furthermore, TUD-Stadtmitte and TUD-Campus present the challenges of low contrast and similar clothing.
We use manually annotated pedestrian bounding box trajectories for each dataset when we test the proposed method as well as the other baseline methods. We also annotate ground-truth pedestrian silhouettes (instance segmentation) for the evaluation of pedestrian silhouette extraction.
4.1.2 Evaluation metrics
For the instance-level evaluation, we adopt the mean and weighted intersection over union (M.IoU and W.IoU). M.IoU is the instance-wise IoU of each ground-truth instance averaged over all instances and frames, while W.IoU further weights each instance by its segment size.
where nTR is the number of pedestrian trajectories.
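A minimal sketch of the two instance-level metrics, under our reading of the definitions above (per-instance IoU against the ground truth, averaged uniformly for M.IoU and weighted by ground-truth segment size for W.IoU; the exact frame-wise aggregation may differ from the paper's), is given below.

```python
import numpy as np

def instance_ious(pred, gt, n_tr):
    """IoU of each ground-truth instance in one frame.

    pred, gt : (H, W) integer maps; 0 = background, 1..n_tr = trajectory labels.
    Returns the IoUs and ground-truth sizes of the instances present in the frame.
    """
    ious, sizes = [], []
    for t in range(1, n_tr + 1):
        gt_mask, pred_mask = (gt == t), (pred == t)
        if not gt_mask.any():
            continue
        union = np.logical_or(gt_mask, pred_mask).sum()
        inter = np.logical_and(gt_mask, pred_mask).sum()
        ious.append(inter / union)
        sizes.append(gt_mask.sum())
    return np.array(ious), np.array(sizes, dtype=float)

def mean_and_weighted_iou(frames):
    """frames: list of (ious, sizes) tuples collected over the whole sequence."""
    all_ious = np.concatenate([i for i, _ in frames])
    all_sizes = np.concatenate([s for _, s in frames])
    m_iou = all_ious.mean()
    w_iou = (all_ious * all_sizes).sum() / all_sizes.sum()
    return m_iou, w_iou
```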
Finally, we adopt the computational time as an evaluation metric with which to quantitatively analyze the efficiency of the proposed method.
4.1.3 Baseline methods
For instance-level segmentation, we adopt the methods of Milan et al. [26], He et al. [32] and Ochs et al. [16] as baseline methods. For fair comparison, we modify the baseline methods as follows.
Milan’s method generates an overcomplete set of trajectory hypotheses and then assigns superpixels to trajectories. We substitute the trajectory hypothesis with the trajectory ground truth and eliminate the update of the trajectory.
He’s method and Ochs’s method have different pedestrian labeling schemes, and their outputs thus need to be relabeled. We use a greedy assignment method by which, from the largest ground-truth segment to the smallest, we assign the label i of trajectory tri to the predicted segment with the highest IoU with the ground-truth segment \(y_{i}^{*}\). Moreover, because He’s method generates multi-category instance-level segmentation, we apply the greedy assignment to both human segments and bag segments, because the ground truth of a pedestrian contains both human and bag regions.
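A sketch of this greedy relabeling, under our reading of the description above (we additionally assume each predicted segment is assigned at most once), is given below.

```python
import numpy as np

def greedy_relabel(gt_masks, pred_masks):
    """Match predicted segments to ground-truth trajectories, largest GT first.

    gt_masks   : dict trajectory id -> (H, W) bool ground-truth mask y_i*.
    pred_masks : list of (H, W) bool masks produced by the baseline method.
    Returns dict trajectory id -> index of the assigned predicted segment (or None).
    """
    def iou(a, b):
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union else 0.0

    assignment, used = {}, set()
    # iterate from the largest ground-truth segment to the smallest
    for t in sorted(gt_masks, key=lambda t: gt_masks[t].sum(), reverse=True):
        best, best_iou = None, 0.0
        for j, m in enumerate(pred_masks):
            if j in used:
                continue
            v = iou(gt_masks[t], m)
            if v > best_iou:
                best, best_iou = j, v
        assignment[t] = best
        if best is not None:
            used.add(best)
    return assignment
```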
We adopt Lin’s method (i.e., RefineNet [31]) as a baseline method for semantic-level segmentation. We use a model pre-trained on the Cityscapes dataset [40] whose output contains 20 labels. We focus only on the quality of the pedestrian silhouette and thus convert the original RefineNet output into a binary mask that contains only the "human" label and the “non-human” label. An example of the binary mask is shown in the second column of Fig. 15.
4.1.4 Implementation details
The pedestrian bounding box trajectories used in the experiment are manual annotations. For the ESS, we set α=0.7 and β=0.7, and to keep the average size of superpixels the same, we set γ=545 for TUD-Stadtmitte, γ=560 for TUD-Campus, γ=475 for TUD-Crossing and γ=300 for PETS2009; i.e., there are approximately 2000 superpixels per frame for TUD datasets and 2850 per frame for PETS2009.
Thresholds thHm and thtemp are set as 0.5. In the spatial pairwise term, ωe is set as 300 while for CRF, ωS is set as 8 and ωT is set as 12. Finally, to handle an arbitrary length of frames, we use a batch process that sets the batch length as 20 frames.
Both instance-level and semantic-level evaluations are conducted on a personal computer with an Intel i7 CPU, 64 GB of memory, and an NVIDIA GTX 1080Ti GPU. GPU usage for each method is as follows.
For Ochs’s method and Milan’s method, GPUs are not used in the computation because no GPU version of codes was provided. For He’s method, the experiments are conducted using GPUs. For the proposed method, we only use a GPU for the RefineNet-based background term and not other parts.
4.2 Component comparison
4.2.1 Superpixel
Table 1 Component comparison on TUD-Campus

| | M. IoU [%] | W. IoU [%] | M. IoU B [%] | Time [min] |
|---|---|---|---|---|
| SLIC + OT | 46.22 | 77.55 | 15.26 | 8.5 |
| ESS + optical flow | 44.41 | 77.09 | 14.79 | 11.4 |
| ESS + OT (Proposed) | 48.08 | 77.89 | 16.85 | 9.6 |
4.2.2 Temporal pairwise term
We run another component comparison experiment to demonstrate the merits of the OT-based temporal pairwise term compared with the optical-flow-based temporal pairwise term. We follow Liu’s work [42] for the optical-flow calculation. We then define an optical-flow-based connectivity function connflow(p,q) with which to substitute connT(p,q).
We then substitute the OT-based temporal pairwise term with the optical-flow-based term and run the component comparison experiment without changing other settings on the TUD-Campus dataset.
The experimental results are also given in Table 1. The OT-based temporal term performs better than the optical-flow-based temporal term.
4.3 Experimental results
4.3.1 Instance-level evaluation
Fig. 11 Instance-level mask-type result
Fig. 12 Instance-level edge-type result
Table 2 Instance-level results

TUD-Stadtmitte

| | M. IoU [%] | W. IoU [%] | M. IoU B [%] | Time [min] |
|---|---|---|---|---|
| Ochs’s | 13.19 | 25.96 | 3.39 | 1255 |
| Milan’s | 50.97 | 44.80 | 20.70 | 209 |
| He’s | 70.36 | 79.21 | 20.23 | 0.5 |
| Proposed | 57.48 | 81.12 | 20.44 | 17.7 |

TUD-Campus

| | M. IoU [%] | W. IoU [%] | M. IoU B [%] | Time [min] |
|---|---|---|---|---|
| Ochs’s | 15.08 | 34.87 | 2.85 | 500 |
| Milan’s | 51.38 | 49.29 | 14.27 | 83 |
| He’s | 63.92 | 71.54 | 13.04 | 0.2 |
| Proposed | 48.08 | 77.89 | 16.85 | 9.6 |

TUD-Crossing

| | M. IoU [%] | W. IoU [%] | M. IoU B [%] | Time [min] |
|---|---|---|---|---|
| Ochs’s | 6.5 | 26.65 | 1.50 | 1512 |
| Milan’s | 14.30 | 22.87 | 5.89 | 240 |
| He’s | 38.00 | 56.17 | 10.08 | 0.65 |
| Proposed | 30.83 | 64.18 | 13.54 | 20.5 |

PETS2009

| | M. IoU [%] | W. IoU [%] | M. IoU B [%] | Time [min] |
|---|---|---|---|---|
| Ochs’s | 14.41 | 29.54 | 5.40 | 4355 |
| Milan’s | 33.17 | 34.39 | 2.24 | 870 |
| He’s | 79.25 | 85.61 | 42.11 | 2.1 |
| Proposed | 68.20 | 80.61 | 37.75 | 118 |
The proposed method outperforms Ochs’s and Milan’s methods for all metrics. On the TUD datasets, the proposed method outperforms He’s method in terms of W.IoU and M.IoUB while underperforming He’s method in terms of M.IoU. Furthermore, on the PETS2009 dataset, the proposed method fails to outperform He’s method.
The performance of the proposed method compared with He’s method is explained below.
Fig. 13 Instance-level mask-type result for large pedestrians
Fig. 14 Instance-level mask-type result for small pedestrians
Fig. 15 Examples of failure cases
Another drawback of our proposed method is a lack of ability to handle occlusion. Figure 15 shows that the proposed method fails to segment the two pedestrians in frame t because of heavy occlusion. This relates to our adoption of a color-based GMM for pedestrian modeling, which may fail when the appearances of two pedestrians are similar.
4.3.2 Semantic-level evaluation
Fig. 16 Semantic-level mask-type result
Fig. 17 Semantic-level edge-type result
Table 3 Semantic-level results

TUD-Stadtmitte

| | P. IoU [%] | P. IoU B [%] | Time [min] |
|---|---|---|---|
| Lin’s (RefineNet) | 72.74 | 11.10 | 1.1 |
| Proposed | 79.12 | 30.00 | 17.7 |

TUD-Campus

| | P. IoU [%] | P. IoU B [%] | Time [min] |
|---|---|---|---|
| Lin’s (RefineNet) | 71.74 | 10.15 | 0.5 |
| Proposed | 80.45 | 24.05 | 9.6 |

TUD-Crossing

| | P. IoU [%] | P. IoU B [%] | Time [min] |
|---|---|---|---|
| Lin’s (RefineNet) | 73.75 | 12.26 | 1.3 |
| Proposed | 78.82 | 28.27 | 20.5 |

PETS2009

| | P. IoU [%] | P. IoU B [%] | Time [min] |
|---|---|---|---|
| Lin’s (RefineNet) | 57.68 | 11.62 | 6.7 |
| Proposed | 75.42 | 79.12 | 118 |
4.3.3 Sensitivity analysis
Table 4 Sensitivity analysis on TUD-Campus

| SP amount | M. IoU [%] | W. IoU [%] | M. IoU B [%] | Time [min] |
|---|---|---|---|---|
| 500 | 38.27 | 64.16 | 11.28 | 2.8 |
| 1000 | 43.36 | 73.25 | 14.48 | 5.3 |
| 2000 | 48.08 | 77.89 | 16.85 | 9.6 |
| 5000 | 52.08 | 79.67 | 17.21 | 30.1 |
5 Conclusion
We proposed a method of extracting multiple pedestrian silhouettes. The proposed method is formulated as a CRF inference problem that incorporates the ESS, semantic segmentation-based human score, and OT-based temporal pairwise term. In addition, we tested the proposed method on public datasets and achieved competitive performance.
A detector of human parts [43] and multiple-detector fusion for the tracking of multiple objects [44] have recently been developed, and a future avenue of research will apply the human-part detector to occlusion reasoning.
Footnotes
1. Code for the ESS is released at https://github.com/pdollar/edges.
Acknowledgments
We thank Glenn Pennycook, MSc, from Edanz Group (www.edanzediting.com/ac) for editing a draft of this manuscript.
Authors’ contributions
YY executed the experiments, analyzed results, and wrote the initial draft of the manuscript. MY managed the advisor position for the collection of data, designed the experiment, and reviewed the manuscript. YY supervised the design of the work and provided technical support and conceptual advice. All authors read and approved the final manuscript.
Funding
This work was supported by a JSPS Grant-in-Aid for Scientific Research (A) JP18H04115.
Competing interests
The authors declare that they have no competing interests.
References
- 1. Plaenkers R, Fua P (2002) Model-based silhouette extraction for accurate people tracking. In: European Conference on Computer Vision, 325–339. Springer, Berlin.
- 2. Chen X, He Z, Anderson D, Keller J, Skubic M (2006) Adaptive silhouette extraction and human tracking in complex and dynamic environments. In: Image Processing, 2006 IEEE International Conference On, 561–564. IEEE, New York.
- 3. Ahn J-H, Choi C, Kwak S, Kim K, Byun H (2009) Human tracking and silhouette extraction for human–robot interaction systems. Patt Anal Appl 12(2):167–177.
- 4. Howe NR (2004) Silhouette lookup for automatic pose tracking. In: Computer Vision and Pattern Recognition Workshop, 2004. CVPRW’04. Conference On, 15–22. IEEE, New York.
- 5. Wang L, Suter D (2007) Recognizing human activities from silhouettes: motion subspace and factorial discriminative graphical model. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference On, 1–8. IEEE, New York.
- 6. Chaaraoui AA, Climent-Pérez P, Flórez-Revuelta F (2013) Silhouette-based human action recognition using sequences of key poses. Patt Recogn Lett 34(15):1799–1807.
- 7. Wang L, Suter D (2007) Learning and matching of dynamic shape manifolds for human action recognition. IEEE Trans Image Process 16(6):1646–1661.
- 8. Ikizler N, Duygulu P (2009) Histogram of oriented rectangles: a new pose descriptor for human action recognition. Image Vision Comput 27(10):1515–1526.
- 9. Collins RT, Gross R, Shi J (2002) Silhouette-based human identification from body shape and gait. In: Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference On, 366–371. IEEE, New York.
- 10. Wang L, Tan T, Ning H, Hu W (2003) Silhouette analysis-based gait recognition for human identification. IEEE Trans Patt Anal Mach Intell 25(12):1505–1518.
- 11. Liu Z, Sarkar S (2004) Simplest representation yet for gait recognition: averaged silhouette. In: Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference On, 211–214. IEEE, New York.
- 12. Caelles S, Maninis K-K, Pont-Tuset J, Leal-Taixé L, Cremers D, Van Gool L (2017) One-shot video object segmentation. In: CVPR 2017. IEEE, New York.
- 13. Cheng J, Tsai Y-H, Wang S, Yang M-H (2017) SegFlow: joint learning for video object segmentation and optical flow. In: 2017 IEEE International Conference on Computer Vision (ICCV), 686–695. IEEE, New York.
- 14. Migdal J, Grimson WEL (2005) Background subtraction using Markov thresholds. In: Application of Computer Vision, 2005. WACV/MOTIONS’05 Volume 1. Seventh IEEE Workshops On, 58–65. IEEE, New York.
- 15. Zivkovic Z (2004) Improved adaptive Gaussian mixture model for background subtraction. In: Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference On, 28–31. IEEE, New York.
- 16. Ochs P, Malik J, Brox T (2014) Segmentation of moving objects by long term video analysis. IEEE Trans Patt Anal Mach Intell 36(6):1187–1200.
- 17. Narayana M, Hanson A, Learned-Miller E (2013) Coherent motion segmentation in moving camera videos using optical flow orientations. In: Computer Vision (ICCV), 2013 IEEE International Conference On, 1577–1584. IEEE, New York.
- 18. Unger M, Werlberger M, Pock T, Bischof H (2012) Joint motion estimation and segmentation of complex scenes with label costs and occlusion modeling. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference On, 1878–1885. IEEE, New York.
- 19. Chen Y-M, Bajic IV (2011) A joint approach to global motion estimation and motion segmentation from a coarsely sampled motion vector field. IEEE Trans Circ Syst Vid Technol 21(9):1316–1328.
- 20. Ren S, He K, Girshick RB, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Patt Anal Mach Intell 39:1137–1149.
- 21. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, New York.
- 22. Girshick R (2015) Fast R-CNN. In: Proceedings of the International Conference on Computer Vision (ICCV). IEEE, New York.
- 23. Kim C, Li F, Ciptadi A, Rehg JM (2015) Multiple hypothesis tracking revisited. In: Proceedings of the IEEE International Conference on Computer Vision, 4696–4704. IEEE, New York.
- 24. Choi W (2015) Near-online multi-target tracking with aggregated local flow descriptor. In: Proceedings of the IEEE International Conference on Computer Vision, 3029–3037. IEEE, New York.
- 25. Keuper M, Tang S, Zhongjie Y, Andres B, Brox T, Schiele B (2016) A multi-cut formulation for joint segmentation and tracking of multiple objects. Computing Research Repository (CoRR):1–14.
- 26. Milan A, Leal-Taixé L, Schindler K, Reid I (2015) Joint tracking and segmentation of multiple targets. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference On, 5397–5406. IEEE, New York.
- 27. Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Süsstrunk S (2012) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans Pattern Anal Mach Intell 34(11):2274–2282.
- 28. Brox T, Bruhn A, Papenberg N, Weickert J (2004) High accuracy optical flow estimation based on a theory for warping. In: European Conference on Computer Vision, 25–36. Springer, Berlin.
- 29. Van den Bergh M, Boix X, Roig G, Van Gool L (2015) SEEDS: superpixels extracted via energy-driven sampling. Int J Comput Vis 111(3):298–314.
- 30. Makihara Y, Tanoue T, Muramatsu D, Yagi Y, Mori S, Utsumi Y, Iwamura M, Kise K (2015) Individuality-preserving silhouette extraction for gait recognition. IPSJ Trans Comput Vis Appl 7:74–78.
- 31. Lin G, Milan A, Shen C, Reid I (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, New York.
- 32. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: Computer Vision (ICCV), 2017 IEEE International Conference On, 2980–2988. IEEE, New York.
- 33. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778. IEEE, New York.
- 34. Dosovitskiy A, Fischer P, Ilg E, Hausser P, Hazirbas C, Golkov V, van der Smagt P, Cremers D, Brox T (2015) FlowNet: learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, 2758–2766. IEEE, New York.
- 35. Chang J, Wei D, Fisher III JW (2013) A video representation using temporal superpixels. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference On, 2051–2058. IEEE, New York.
- 36. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440. IEEE, New York.
- 37. Wu Y, Lin Y, Dong X, Yan Y, Bian W, Yang Y (2019) Progressive learning for person re-identification with one example. IEEE Trans Image Process 28(6):2872–2881.
- 38. Dollár P, Zitnick CL (2013) Structured forests for fast edge detection. In: Computer Vision (ICCV), 2013 IEEE International Conference On, 1841–1848. IEEE, New York.
- 39. Boykov Y, Veksler O, Zabih R (2001) Fast approximate energy minimization via graph cuts. IEEE Trans Patt Anal Mach Intell 23(11):1222–1239.
- 40. Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3213–3223. IEEE, New York.
- 41. Rother C, Kolmogorov V, Blake A (2004) GrabCut: interactive foreground extraction using iterated graph cuts. In: ACM Transactions on Graphics (TOG), 309–314. ACM, New York.
- 42. Liu C, et al (2009) Beyond pixels: exploring new representations and applications for motion analysis. PhD thesis: 48–50.
- 43. Cao Z, Simon T, Wei S-E, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: CVPR. IEEE, New York.
- 44. Henschel R, Leal-Taixé L, Cremers D, Rosenhahn B (2017) Fusion of head and full-body detectors for multi-object tracking. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1509–150909.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.