Foreground Object Detection by Motion-based Grouping of Object Parts

Effective video-based detection methods are of great importance to intelligent transportation systems (ITS), and here we propose a method to localize and label objects. The method is able to detect pedestrians and bicycle riders in a complex scene. Our method is inspired by the common fate principle, a mechanism of human visual perception which states that tokens moving or functioning in a similar manner tend to be perceived as one unit. Our method embeds the principle in an Implicit Shape Model (ISM). In our method, keypoint-based object parts are first detected and then grouped by their motion patterns. Based on the grouping results, when the object parts vote for object centers and labels, each vote of an object part is assigned a weight according to its consistency with the votes of the other object parts in the same motion group. Afterwards, the peaks of the Hough image formed by summing up all weighted votes, which correspond to detection hypotheses, become easier to find. Thus our method performs better in both position and label estimation. Experiments show the effectiveness of our method in terms of detection accuracy.


Introduction
In ITS areas, detection methods using cameras can be used for navigation, safe driving, surveillance, and supporting results from other sensors. In traditional ITS applications, vehicles are the main targets. Currently pedestrians are also considered important subjects of ITS applications, and bicycles are becoming very popular for environmental and economic reasons. In Japan, the number of traffic accidents involving bicycles and pedestrians is very large. Thus we tackle the problem of detecting freely moving bicycle riders and pedestrians in data collected by a camera that observes them from above. Such situations can be observed in parks, university campuses, station squares, tourist spots, etc. Here we focus on computer vision techniques for detection under surveillance scenarios. The method also assumes that target objects captured by the surveillance cameras do not change much in scale.
Most state-of-the-art visual detection methods fall into two main categories: sliding-window methods and Hough transform based methods. The methods [10,27] based on a sliding window schema perform detection in a typically mechanical way. In these methods, the decision of whether a target object exists or not is made for some or all of the sub-images in a test image. Besides their attractive performance and the extensibility of combining various kernels, these methods are favorable because they consider each object as a whole during detection. However, they share few aspects with visual perception in human beings, and their efficiency relies heavily on the size of the test images.
The other methods [5,6,13,18] detect objects based on the generalized Hough transform [1]. Object parts are detected, and the object parts provide confidence that given locations are the centers of potential objects. Locations of objects are decided according to the converged confidence. These methods are favorable for their robustness to partial deformation and ease of training. To human beings, this kind of method seems more natural. In our work, we combine a mechanism of visual perception in humans with the ISM [13] to demonstrate this natural property.
A typical Hough transform based method contains two steps: training and detection. During training, a codebook of object parts is built from a set of well annotated images. Each code in the codebook contains information about the appearance of the object part, its offset to the object center, and the class label. Each object part's appearance is given in the form of keypoint descriptors [13], image patches [7,19], or image regions [8]. During the detection step, object parts are detected on each test image. Then every object part is matched against the codebook, and the several codes nearest in appearance are activated. The offset and class label encoded in each activated code act as a vote. All the votes from the object parts are added up to form a Hough image. The peaks of the Hough image are considered detection hypotheses, with the height of each peak as the confidence for the corresponding hypothesis.
Two challenging issues for detection methods are how to separate near objects and how to separate similar different-class objects. The target objects, in the case of ITS applications, are pedestrians, bicycle riders, and automobiles. In the sliding window schema, non-maximum suppression is usually needed for post-processing, and a mechanism in [10] works by excluding from the feature pool the features which belong to each successive detection response. In Hough transform based methods, a similar mechanism is employed in [2]; however, this effort is made after the Hough image has been formed. During the forming of a Hough image, two kinds of votes make detection challenging: (1) votes cast by object parts from near objects cause the peaks corresponding to different objects to be mixed up, and (2) votes cast by similar different-class object parts lead to difficult decisions on the class label of the peaks. See Fig. 2d. Before the forming of Hough images, problems also arise from the pollution of the codebook by the background of the training images. A very clean codebook can be built during training if the foreground is marked, which requires manual effort. Otherwise, a large number of training examples is needed for the codebook to be effective, and this decreases efficiency.
In videos, motion information is also available by simple tracking of object parts. Thus we propose a detection method which utilizes both appearance and motion information. The method is based on the common fate principle [23]. The principle is one of the visual perception principles theorized by gestalt psychologists, and it states that, for human beings, tokens moving coherently are perceptually grouped. This provides the intuition to group the object parts by their motion patterns, and to let them vote afterwards. In our work, the object parts are represented using keypoint descriptors, which are tracked to generate trajectories. The object parts are grouped by the pairwise similarities of their corresponding trajectories. Using the assumption that object parts in the same motion group probably belong to the same object, for each object part we assign higher weights to the votes which are more "agreeable" within the motion group. As a result, votes corresponding to true detection responses are more likely to be assigned higher weights. On a Hough image formed by summing up these weighted votes, the peaks are easier to find, as shown in Fig. 1d.
Due to the combination of motion analysis results and the Hough transform framework, and by assigning different weights to each object part's votes, the proposed method has several appealing properties:
• It is able to estimate the positions and labels of multiple objects from different classes. Three types of objects make this task challenging: near objects, similar different-class objects, and multi-pose same-class objects.
• It is able to use a codebook trained on images with cluttered backgrounds.
• The framework used to incorporate grouping results of object parts is very general, and thus can be easily extended.
The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 formalizes the common fate Hough transform. Section 4 describes inference on the formed Hough images. Section 5 gives experimental results, and Section 6 concludes.

Related Work
Our work is most closely related to object detection methods [2,12-16] based on the Hough transform framework. Such methods have made considerable progress recently. The ISM [13,14] is extended in [2] by taking into account the correspondences between the object parts and the hypotheses, for the detection of multiple near objects. In the methods of [7,15,19], the Hough transform is placed in a discriminative framework for object detection, in which the codes are assigned different weights according to the co-occurrence frequency of their appearance and offset to the object center. Two Hough transform methods consider the grouping of object parts [20,26]. The method in [20] deals with scale change. Instead of estimating the scale by local features trained from examples at different scales, the votes are treated as voting lines. Local features are first grouped by considering the difference between their voted centers, resulting in a more consistent vote for the object center. In [26], the grouping of object parts, the correspondence between object parts and objects, and the decisions on detection hypotheses are optimized in the same energy function. The problem with this method is that the grouping results do not have meaning or correspond to any real entities.
Our work is also related to object detection methods which use trajectories [3,4], methods using the weighting of features [25], methods dealing with codebook noise [17], and methods which integrate temporal information [24].

Common Fate Hough Transform
Probabilistic standpoints are very appealing because of inference ease. However, as pointed out in [11], placing an Implicit Shape Model (ISM) in a probabilistic framework is not satisfactory. In particular, describing the weights of the votes as priors does not make sense. A Hough transform can simply be considered as a transformation from a set of object parts, $\{e\}$, to a confidence space of object hypotheses, $C(x, l)$, where $x$ is the coordinate of the object center and $l$ is the label. Terms described as priors of the votes in the ISM are actually weights, and the likelihood terms are actually blurring functions that convert discrete votes into continuous space. This section describes how a Hough image for the estimation of object centers and labels is formed from object parts observed on an image.
Let $e$ denote an object part observed on the current image. The appearance of $e$ is matched against the codebook, and $e$ activates the $N$ best matched codes from the trained codebook. Each code contains the appearance, its offset to the object center, and the class label. According to the $N$ matched codes, $e$ casts $N$ votes. Each vote $V_e$ is about the object center that generates $e$. The position of the object center cast by a vote $V$ is denoted by $x_V$, and the class label by $l_V$.
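The voting step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `cast_votes`, the dictionary layout of a code, the Euclidean appearance distance, and the convention that an offset is added to the keypoint position to obtain the voted center are all assumptions made for the example.

```python
import math

def cast_votes(keypoint_xy, descriptor, codebook, n_best):
    """Return the n_best votes (center_x, center_y, label) of one object part."""
    # Rank codes by appearance distance. Euclidean distance is a stand-in
    # here; the paper matches region covariances instead.
    ranked = sorted(codebook, key=lambda code: math.dist(descriptor, code["appearance"]))
    votes = []
    for code in ranked[:n_best]:
        # Assumed convention: offset = object center - keypoint position.
        dx, dy = code["offset"]
        votes.append((keypoint_xy[0] + dx, keypoint_xy[1] + dy, code["label"]))
    return votes

# Hypothetical three-code codebook with 2-D toy descriptors.
codebook = [
    {"appearance": (0.0, 0.0), "offset": (5, -10), "label": "pedestrian"},
    {"appearance": (1.0, 0.0), "offset": (0, -20), "label": "bicycle"},
    {"appearance": (9.0, 9.0), "offset": (3, 3), "label": "pedestrian"},
]
votes = cast_votes((100, 200), (0.1, 0.0), codebook, n_best=2)
```

Each activated code contributes one vote, so an object part with N activated codes casts N votes, possibly with different labels.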
Based on the $N$ votes of $e$, the confidence that a position $x$ is the center of an object with class label $l$ is given by

$$C(x, l; e) = \sum_{i=1}^{N} B(x, l; V_e^i)\, w(V_e^i). \tag{1}$$

Here $B(x, l; V_e^i)$ is the blurring function, and $w(V_e^i)$ is the weight of $V_e^i$. The idea of the proposed method is that the weight term, $w(V_e^i)$, is defined by the motion grouping results of all the object parts.
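The blurring function is defined as

$$B(x, l; V) = \begin{cases} G(x; x_V, \sigma) & \text{if } l = l_V \text{ and } \lVert x - x_V \rVert \le d, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Here $G(x; x_V, \sigma)$ is a Gaussian function that evaluates the spatial gap between $x$ and $x_V$.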
Let $M$ be the total number of object parts on the image. By summing up over all the object parts, the confidence of $x$ being the center of an $l$-class object is given by

$$C(x, l) = \sum_{j=1}^{M} C(x, l; e_j)\, w(e_j). \tag{3}$$

A uniform weight is assumed for each object part, i.e., $w(e_j) = 1/M$. By considering $C(x, l)$ as the evaluation score of the Hough space $(x, l)$, the task of estimating object centers and labels converts to finding, and then validating, the local maxima of the Hough image.
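As a rough sketch of how the Hough image can be evaluated at a single point, the code below sums weighted, blurred votes per object part and averages over the M parts. The names `blur` and `confidence`, the unnormalized Gaussian, the gating of votes by label, and the cut-off radius `d` are assumptions made for illustration; only the parameters σ and d themselves are named in the text.

```python
import math

def blur(x, label, vote, sigma, d):
    """Blurred contribution of one vote at (x, label).

    The label gate and the cut-off radius d are assumptions of this sketch.
    """
    center, vote_label, _weight = vote
    dist = math.dist(x, center)
    if label != vote_label or dist > d:
        return 0.0
    return math.exp(-dist ** 2 / (2 * sigma ** 2))

def confidence(x, label, parts, sigma, d):
    """Confidence that x is the center of an object of class `label`.

    parts: one list of votes per object part; a vote is (center, label, weight).
    Each part receives the uniform weight 1/M.
    """
    total = 0.0
    for votes in parts:
        part_conf = sum(v[2] * blur(x, label, v, sigma, d) for v in votes)
        total += part_conf / len(parts)
    return total

# Two hypothetical object parts; both have a vote at (10, 10) for "pedestrian".
parts = [
    [((10.0, 10.0), "pedestrian", 0.5), ((40.0, 40.0), "bicycle", 0.5)],
    [((10.0, 10.0), "pedestrian", 1.0)],
]
peak = confidence((10.0, 10.0), "pedestrian", parts, sigma=5.0, d=10.0)
far = confidence((30.0, 10.0), "pedestrian", parts, sigma=5.0, d=10.0)
```

Evaluating `confidence` over a grid of positions and over all labels would give one Hough image per class, whose local maxima are the detection hypotheses.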

Common Fate Weights
To meet the challenges of separating near objects, separating similar different-class objects, and using a noisy codebook, different weights are assigned to the votes of each object part by considering the motion grouping results of the object parts. This subsection describes how given grouping results are incorporated into the Hough transform framework.
Let $\gamma = \{g\}$ denote the grouping results, where $g$ is a group of object parts. Assume $e_m \in g$ and $e_n \in g$. Those votes of $e_m$ which are more "agreeable" with the votes of the other object parts in $g$ are assigned larger weights.
Towards this end, the relationship between the votes of $e_m$ and the votes of $e_n$ needs to be given in advance. This relationship is named support. The support from $V_{e_n}$ to $V_{e_m}$ is defined based on $V_{e_n}$ and the confidence that $V_{e_m}$'s voted center is correct, as

$$S(V_{e_m}; V_{e_n}) = w(V_{e_n})\, B(x_{V_{e_m}}, l_{V_{e_m}}; V_{e_n}).$$

Here $B(x_{V_{e_m}}, l_{V_{e_m}}; V_{e_n})$ is defined in Eq. 2. This measures the coherence of the two votes from different object parts.
Then, the support from $e_n$ to $V_{e_m}$ is defined based on $e_n$ and the confidence that $V_{e_m}$'s voted center is correct, as

$$S(V_{e_m}; e_n) = \sum_{k=1}^{N} S(V_{e_m}; V_{e_n}^k).$$

The support from $g$ to $V_{e_m}$ is defined by the confidence that $V_{e_m}$'s voted center is correct based on the votes of all the object parts in $g$ other than the one it belongs to, as

$$S(V_{e_m}; g) = \sum_{e_j \in g,\; e_j \neq e_m} S(V_{e_m}; e_j).$$

By assuming that all object parts in the same motion group are from the same object, which means motion grouping gives good results, the estimates of center position and class label given by every object part should be consistent with those given by the motion group. Thus for a particular vote of $e_m$, i.e., $\tilde{V}_{e_m}$, a weight is assigned by considering its consistency with $g$ and the consistency of $e_m$'s other votes with $g$, as

$$w(\tilde{V}_{e_m}) = \frac{S(\tilde{V}_{e_m}; g) + \delta}{\sum_{i=1}^{N} \left( S(V_{e_m}^i; g) + \delta \right)}.$$

Here, $\delta$ is a small constant for preventing zeros. Notice that $w(\tilde{V}_{e_m})$ is defined using $w(V_{e_j}^k)$, the weights of the votes of the other object parts in $g$. In order to determine $w(\tilde{V}_{e_m})$, uniform weights are first assigned to the votes of each object part in $g$, i.e., $w(V_{e_j}^k) = 1/N$. Then new weights are calculated based on the uniformly assigned weights. The weights of votes used to form the Hough image are the iteratively converged weights.
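The iterative weighting can be sketched as below for a single motion group. This is one reading of the description, offered under stated assumptions: the support of a vote is taken to be the weighted, blurred agreement of the other parts' votes, the new weight is the vote's support normalized over all votes of the same part, all parts are updated simultaneously, and a fixed number of iterations stands in for a convergence test. The names `common_fate_weights`, `blur`, and `eps`, and the cut-off radius `d`, are hypothetical.

```python
import math

def blur(x, label, vote, sigma, d):
    """Agreement of `vote` with a hypothesis (x, label); assumed truncated Gaussian."""
    center, vote_label = vote
    dist = math.dist(x, center)
    if label != vote_label or dist > d:
        return 0.0
    return math.exp(-dist ** 2 / (2 * sigma ** 2))

def common_fate_weights(group, sigma=5.0, d=10.0, eps=1e-6, iterations=10):
    """group: one list of (center, label) votes per object part in a motion group."""
    # Start from uniform weights 1/N within each part.
    weights = [[1.0 / len(votes)] * len(votes) for votes in group]
    for _ in range(iterations):
        new_weights = []
        for m, votes_m in enumerate(group):
            supports = []
            for center, label in votes_m:
                # Support from every vote of every *other* part in the group.
                s = sum(
                    weights[j][k] * blur(center, label, vote, sigma, d)
                    for j, votes_j in enumerate(group) if j != m
                    for k, vote in enumerate(votes_j)
                )
                supports.append(s + eps)  # eps prevents an all-zero normalizer
            total = sum(supports)
            new_weights.append([s / total for s in supports])
        weights = new_weights
    return weights

# Three hypothetical parts: their first votes agree, their second votes scatter.
group = [
    [((50.0, 50.0), "pedestrian"), ((90.0, 10.0), "pedestrian")],
    [((51.0, 50.0), "pedestrian"), ((10.0, 80.0), "bicycle")],
    [((50.0, 49.0), "pedestrian"), ((70.0, 70.0), "bicycle")],
]
weights = common_fate_weights(group)
```

In this toy group the mutually consistent votes near (50, 50) absorb almost all of the weight, which is the intended effect: scattered or wrongly labeled votes are suppressed before the Hough image is formed.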
The grouping result $\gamma = \{g\}$ can be obtained from any kind of information; our method uses motion to group the voting elements. The manner of extending the Hough transform is very general, and the Hough transform extended with motion grouping results is called the common fate Hough transform. The votes given by the best matched codes and the votes with higher defined weights are shown in Fig. 2.

Motion Grouping
This subsection describes how the object parts are grouped by their motion patterns. The object parts are tracked through frames before and after the current frame to generate trajectories. Then the object parts are grouped by the pairwise motion similarities of their corresponding trajectories.
The object parts in this method are in the form of keypoint descriptors. Harris corners [9] are chosen as the keypoints for robustness, while the appearance of each object part is described by the region covariance [22] of the image patch around the keypoint. This image feature is chosen because of its flexibility in combining multiple channels of information, and for its capability of handling scale changes within a certain range.
For each object part, a trajectory is generated by tracking its corresponding Harris corner with the KLT tracker [21]. To group the trajectories, two pairwise similarities, $D_1$ and $D_2$, are defined. In these definitions, $i$ is the frame index, and $L$ is the number of frames in which both trajectories exist.
The second similarity, $D_2$, is defined on the directional vectors of the trajectories. Before grouping the trajectories, the static points are excluded. $D_1$ is calculated for all pairs of trajectories, and a minimum spanning tree is then built using the calculated similarities. The minimum spanning tree is split by cutting edges larger than a threshold, $D_1^{th}$, and this gives a grouping result of the trajectories. For each element in the clustering result, $D_2$ is used in the same procedure to generate even smaller clusters. This hierarchical procedure ensures that trajectories in the same group have both small $D_1$ and small $D_2$. The max operation is used in the definitions of both $D_1$ and $D_2$. This is helpful because very often two trajectories are of different lengths, and in such situations the max operation is more stable than other operations, e.g., the average, that consider only overlapping frames.
Each trajectory corresponds to an object part, and the grouping results of the trajectories correspond to grouping results of the object parts.
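The split of the minimum spanning tree can be sketched without building the tree explicitly: cutting every tree edge larger than the threshold yields the same groups as linking every pair whose dissimilarity is at most the threshold and taking connected components. The code below uses a union-find structure for that. The dissimilarity `max_velocity_gap` is a hypothetical stand-in used only to make the example runnable; it is not the paper's $D_1$ or $D_2$.

```python
import math

def max_velocity_gap(t1, t2):
    """Stand-in dissimilarity: largest per-frame velocity difference.

    Trajectories are lists of (x, y) points with at least two frames each.
    """
    gaps = []
    for i in range(min(len(t1), len(t2)) - 1):
        v1 = (t1[i + 1][0] - t1[i][0], t1[i + 1][1] - t1[i][1])
        v2 = (t2[i + 1][0] - t2[i][0], t2[i + 1][1] - t2[i][1])
        gaps.append(math.dist(v1, v2))
    return max(gaps)

def group_by_threshold(items, dissimilarity, threshold):
    """Connected components of the graph linking pairs with dissimilarity <= threshold."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if dissimilarity(items[i], items[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two keypoints moving right together, one keypoint moving up.
trajectories = [
    [(0, 0), (2, 0), (4, 0), (6, 0)],
    [(1, 5), (3, 5), (5, 5), (7, 5)],
    [(9, 9), (9, 12), (9, 15), (9, 18)],
]
groups = group_by_threshold(trajectories, max_velocity_gap, threshold=0.5)
```

The hierarchical step would call `group_by_threshold` again inside each resulting group with the second dissimilarity and its own threshold.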

Codebook
For training, Harris corners are extracted from the training images, for which the object center and the class label are annotated. In this method, region covariance is chosen to represent the appearance, which is defined as

$$r = \frac{1}{K-1} \sum_{i=1}^{K} (z_i - \mu)(z_i - \mu)^T.$$

Here, $K$ is the number of pixels in the region, $z_i$ is the 7-dimensional feature vector of the $i$th pixel at coordinate $(x, y)$, and $\mu$ is the mean of the $z_i$. The vector $z(x, y)$ contains the RGB color of the pixel and the intensity gradients at the pixel: $r(x, y)$, $g(x, y)$, $b(x, y)$, $\partial I(x, y)/\partial x$, $\partial I(x, y)/\partial y$, $\partial^2 I(x, y)/\partial x^2$, and $\partial^2 I(x, y)/\partial y^2$.

Fig. 8 Example Hough images. On the top are the original images. In the middle are the Hough images formed by votes with uniform priors. On the bottom are the Hough images formed by votes with the proposed weights. Red indicates leopards, and blue indicates tigers. Note that for the two leopards, there is no peak corresponding to the one on the right on the benchmark Hough image. For the three leopards, there is also no peak corresponding to the leopard behind on the benchmark Hough image

The appearance similarity between $r_m$ and $r_n$ is given by

$$\rho(r_m, r_n) = \sqrt{\sum_{i=1}^{7} \ln^2 \lambda_i(r_m, r_n)}.$$

Here, $\lambda_i$ is the generalized eigenvalue obtained by solving the generalized eigenvalue problem $\lambda_i r_m u_i = r_n u_i$, $u_i \neq 0$, with $u_i$ the eigenvector.
A square image patch around each keypoint is used to represent the appearance of an object part. Six region covariances are generated for each image patch by using the pixels of the top-left, the top-right, the bottom-left, the bottom-right, the central portion, and the entire image patch. Thus, besides the offset and the class label, a code contains six region covariances. All codes from all training images constitute the codebook. When an object part is matched against the codebook, the similarity between the image patch of the object part and a code is defined by the smallest of the similarities between the corresponding region covariances. This handles scale changes within a small range, since the six regions are not of the same scale. The method's ability to handle scale changes is nevertheless limited, so it can only be used in surveillance situations where the scales of target objects change within a limited range.

Detection
After forming the Hough image, the detection hypotheses are validated. Let $h = \{H\}$ be the points in the Hough space which are evaluated by $C(x_H, l_H)$ and have $C(x_H, l_H) > 0$. Inspired by [2], the hypotheses are validated by an optimization procedure. Let $O$ be the number of points in $h$. Let $u_i = 1$ or $0$ indicate whether $H_i$ is a true object center or not, and let $v_{ij} = 1$ or $0$ indicate whether $e_j$ belongs to $H_i$ or not. By assuming that each object part belongs to one and only one hypothesis, the problem becomes a maximization over $u_i$ and $v_{ij}$. Following [2], the optimal result for the problem is given by greedy maximization. As described in Algorithm 1, the largest of all the local maxima is chosen to be the center of a true object, and then the object parts belonging to the chosen object center are excluded from the object part set. A new Hough image, in which new objects are found, is formed using the remaining object parts. This procedure ends when the object part set is empty, or when the confidence of the chosen object is lower than a given threshold.

Algorithm 1 Greedy Maximization
Let ε be the set of object parts, C_th be the low confidence threshold to accept detection responses, and ĥ be the local maxima of h
while ε ≠ ∅ do
    Form h with ε
    Generate ĥ and select H_i ∈ ĥ with the largest
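The greedy procedure described in the Detection section can be sketched as follows. Several details are assumptions of this sketch, since the listing above is incomplete: the Hough image is evaluated only at the voted centers, the score is an unnormalized weighted sum of Gaussians, and an object part is taken to "belong" to a chosen center if one of its votes has the same label and lies within a hypothetical `radius` of it.

```python
import math

def score(x, label, parts, sigma):
    """Unnormalized confidence of hypothesis (x, label) given the remaining parts."""
    total = 0.0
    for votes in parts:
        for center, vote_label, weight in votes:
            if vote_label == label:
                total += weight * math.exp(-math.dist(x, center) ** 2 / (2 * sigma ** 2))
    return total

def greedy_detect(parts, sigma, radius, c_th):
    """parts: one list of (center, label, weight) votes per object part."""
    remaining = list(parts)
    detections = []
    while remaining:
        # Candidate hypotheses: every voted (center, label) of the remaining parts.
        candidates = {(c, l) for votes in remaining for c, l, _ in votes}
        best = max(candidates, key=lambda hyp: score(hyp[0], hyp[1], remaining, sigma))
        best_score = score(best[0], best[1], remaining, sigma)
        if best_score < c_th:
            break  # the strongest remaining hypothesis is too weak
        detections.append((best[0], best[1], best_score))
        # Exclude the parts that support the accepted hypothesis.
        remaining = [
            votes for votes in remaining
            if not any(l == best[1] and math.dist(c, best[0]) <= radius
                       for c, l, _ in votes)
        ]
    return detections

# Hypothetical scene: a pedestrian cluster, a bicycle cluster, and a weak stray vote.
parts = [
    [((50.0, 50.0), "pedestrian", 1.0)],
    [((51.0, 50.0), "pedestrian", 1.0)],
    [((50.0, 49.0), "pedestrian", 1.0)],
    [((120.0, 60.0), "bicycle", 1.0)],
    [((121.0, 60.0), "bicycle", 1.0)],
    [((120.0, 61.0), "bicycle", 1.0)],
    [((300.0, 300.0), "pedestrian", 0.2)],
]
detections = greedy_detect(parts, sigma=5.0, radius=10.0, c_th=1.0)
```

The loop always terminates, because the chosen hypothesis is itself a vote of some remaining part, so at least one part is removed in every accepted iteration.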

Experimental Results
In our experiments, the improvement brought by the method is verified in terms of detection accuracy. The method is tested on the P-campus dataset with [2] as benchmark, and then tested on a dataset of several animals.

Campus-scene Detection
Dataset The P-campus dataset contains two primary classes of foreground objects: pedestrians and bicycle riders. The frame size is 720×576. Among all the 401 continuous frames, 633 different-class ground truth bounding boxes are annotated on 79 frames. In this dataset, pedestrians and bicycle riders have the upper human body in common, and pedestrians appear in front, back, and side views.
Implementation Settings For training, 52 bicycle riders and 171 pedestrians are randomly selected from the marked ground truths. Harris corners are detected on these randomly selected training images; examples are given in Fig. 3a. For appearance, six region covariances are generated for each keypoint using the 9×9 image patch around it, as shown in Fig. 3b. The appearance, the offset to the image (object) center, and the label of the training image are encoded into a code, and the code is inserted into the codebook. The final codebook contains 5502 codes. Testing data is formed by the 79 frames on which the ground truth bounding boxes are marked. Harris corners are detected, and region covariances are generated, in the same manner as for the training images. For each Harris corner on a testing image, the corresponding region covariances are matched against the codebook to find the most similar codes. Some of the training examples appear in the test sequences. The emphasis of this experiment is to verify the proposed framework's ability to incorporate motion information. Both the proposed method and the benchmark method use the same training and testing images, so the comparison between them is fair.
For motion grouping, each keypoint is tracked through 10 frames before and 10 frames after the current frame. The similarity of two 21-point trajectories is defined using only the frames in which both trajectories exist. To set the two thresholds for motion grouping, $D_1$ and $D_2$ are measured for keypoint pairs from different objects. $D_1^{th}$ is set so that it is larger than only 10% of the measured $D_1$ values, and $D_2^{th}$ is set in the same way. By doing this, keypoints belonging to different objects are unlikely to be grouped together, so the keypoints in one motion group are very likely to belong to the same object, as shown in Fig. 4.
To form the Hough image, the 35 best matched codes are chosen from the codebook for each object part. In Eq. 3, $d$ and $\sigma$ need to be given. The precision-recall curves are obtained by varying $\sigma$, while $d$ is set to 10. Here $\sigma$ is the most important parameter.
Comparisons For comparison, detection is done on the Hough images formed with and without motion grouping results. The same codebook and the same parameter settings are used for forming and searching over both Hough images. The votes of each object part are assigned uniform weights in the benchmark method, while weights defined in Eq. 4 are assigned in the proposed method.
The precision-recall curves are shown in Fig. 5a. An object is considered correctly detected only if the distance from the ground truth to it is less than 10 pixels. In Fig. 5a, correctly positioned but wrongly labeled objects are counted as true positives, with the aim of verifying the positioning ability of the proposed method.
The confusion matrices are given in Fig. 5b. For clarity, the proposed method is compared with the benchmark method at operating points where the two methods have a nearly equal number of false alarms. To evaluate the labeling ability, a class "none" is manually added to represent missed detections and false alarms. For example, in Fig. 5b, 487 pedestrian instances are correctly positioned and labeled by the proposed method, 2 are wrongly labeled as bicycle riders, and 21 are missed. More results are shown in Fig. 6.
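The threshold rule above amounts to picking a low percentile of the dissimilarities measured between keypoints of different objects. A minimal sketch follows; the function name `threshold_below_fraction` and the simple index-based percentile are choices made for this example.

```python
def threshold_below_fraction(values, fraction=0.10):
    """Return a threshold that is larger than only `fraction` of the values."""
    ordered = sorted(values)
    count_below = int(len(ordered) * fraction)
    # The element at this index has exactly count_below smaller-or-equal
    # predecessors, so (for distinct values) that many values fall below it.
    return ordered[count_below]

# Hypothetical dissimilarities measured for keypoint pairs of different objects.
measured = list(range(1, 21))
threshold = threshold_below_fraction(measured)
below = sum(v < threshold for v in measured)
```

With such a threshold, roughly 90% of the cross-object pairs are more dissimilar than the cut, so they would not be linked directly during grouping.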
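The 10-pixel criterion can be turned into precision and recall as sketched below. The one-to-one greedy matching of detections to ground-truth centers is an assumption of this example (the text states only the distance criterion), and labels are ignored, mirroring the positioning-only evaluation of Fig. 5a.

```python
import math

def match_detections(detections, ground_truth, max_dist=10.0):
    """Count true positives, false positives, and misses for center positions."""
    unmatched = list(ground_truth)
    tp = 0
    for det in detections:
        near = [g for g in unmatched if math.dist(det, g) < max_dist]
        if near:
            # Each ground-truth center can be matched at most once.
            closest = min(near, key=lambda g: math.dist(det, g))
            unmatched.remove(closest)
            tp += 1
    fp = len(detections) - tp
    fn = len(unmatched)
    return tp, fp, fn

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical frame: two good detections, one false alarm, one missed object.
ground_truth = [(50.0, 50.0), (120.0, 60.0), (200.0, 200.0)]
detected = [(53.0, 54.0), (121.0, 60.0), (400.0, 400.0)]
tp, fp, fn = match_detections(detected, ground_truth)
precision, recall = precision_recall(tp, fp, fn)
```

Repeating this for each value of σ gives one point per setting on a precision-recall curve.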
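As a worked example of reading one row of the confusion matrix, the pedestrian counts quoted above give the fraction of pedestrian instances that are both positioned and labeled correctly. The dictionary layout and the function name `class_recall` are illustrative only.

```python
def class_recall(row, true_class):
    """Fraction of instances of true_class that received the correct label."""
    total = sum(row.values())
    return row[true_class] / total if total else 0.0

# Pedestrian row of the proposed method, as quoted in the text for Fig. 5b.
pedestrian_row = {"pedestrian": 487, "bicycle rider": 2, "none": 21}
recall = class_recall(pedestrian_row, "pedestrian")  # 487 / 510
```

Here 487 of 510 pedestrian instances, about 95.5%, are handled correctly.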

Wild-scene Detection
Dataset In order to show that our method can be used for general purposes, we test it on complicated scenes, in particular scenes with complicated backgrounds. Even in these cases our method works well, which shows its robustness. A mini dataset is built upon leopards and tigers of the family Felidae. Note especially that the image feature used by this method is a texture feature, and the textures at different positions on a leopard are almost the same. The dataset contains 6 video clips of 9 leopards and 4 tigers. The frame size is 640×480. Both kinds of animals are in the side view.
Implementation settings Most implementation settings are the same as the settings used for campus object detection. For training, 5 leopards and 2 tigers are used. The size of the image patch around each keypoint is 27×27.
Comparisons In Fig. 7, the motion grouping results, and how the voted centers are affected, are given. Since parts from different positions on a leopard are very similar, the true center of a leopard is difficult to find using the voted centers of the object parts. In Fig. 8, example Hough images are given to show the merit of the proposed weights in detecting leopards. In Fig. 9, the detection results are given. The proposed method successfully localizes and labels all the leopards and tigers, while the benchmark method misses three leopards.

Conclusion
The computational ability of human beings is limited, while their ability to detect is far beyond that of machines. Thus, it is very possible that this detection ability benefits from multiple perceptual mechanisms. By using one of these mechanisms, we propose a detection method. By embedding motion grouping results into the voting schema of Hough transforms, the method is able to distinguish the positions of near objects, distinguish the labels of similar objects, and maintain the detection rate with a noisy codebook. The success of our method further suggests the advantage of perceptual mechanisms in human beings, and it will help with detection methods in ITS areas.

Fig. 1 Merit of the proposed method. a Original image. b Motion grouping results. Some parts are enlarged to show details. c Original Hough image. d Hough image formed using our method. The grids in c and d correspond to the grids in a

Fig. 2 Effect of the proposed weight. a Motion groups; different colors mark different motion groups. b Voted centers given by the 7 best matched codes. c Voted centers with the highest defined weights. d Voted centers with weights higher than a threshold

Fig. 3 a Training images. Note that some keypoints fall on the background. b How a 9 × 9 image patch is used to generate six region covariances; red rectangles indicate the pixels used for each covariance

and b i = a i •b i b i •b i . Then the second similarity is defined as, D 2 (T e m , T e n ) = max i=1...L−3 (max(|a i − a i a i |, |b i − b i b i |)) .

Fig. 9 Results. Red crosses mark the centers for leopards and blue crosses mark the centers for tigers