1 Introduction and Related Work

Wertheimer [22, 23], Köhler [7], Koffka [6] and Metzger [9] were the pioneers of studying Gestalt psychology. Gestalt principles (also called Gestalt laws) aim to formulate the regularities according to which the perceptual input is organized into unitary forms, also referred to as wholes, groups, or Gestalten. The principles are much like heuristics, which are mental short-cuts for solving problems. Perceptual organization can be defined as the ability to impose structural organization on sensory data, so as to group sensory primitives arising from a common underlying cause [2]. In computer vision this is more often called perceptual grouping.

Perceptual grouping has a long tradition in computer vision, but many approaches, especially the earlier ones, were sensitive to scene complexity. Accordingly, scenes tended to be “clean”, or the methods required an unwieldy number of tunable parameters and heuristics to cope with scene complexity. A classificatory structure and a list of representative work for perceptual grouping methods in computer vision were introduced by Sarkar and Boyer [2, 18].

Generic object segmentation from 3D data or from RGB-D images has been less popular in the past, but a few methods exist [1, 3, 8, 19, 20]. Recently two methods have been published: the method of Mishra et al. [10], an attention-driven active segmentation algorithm designed to extract boundaries of simple (freestanding) objects, and the method of Ückermann et al. [21], an edge-based segmentation approach that uses pre-defined heuristics to arrive at object hypotheses.

Fig. 1 Hierarchical perceptual grouping over four levels of data abstraction and associated processing methods

This article summarizes the work of Richtsfeld et al. published in [12, 14–17]. In contrast to other image segmentation work, a hierarchical grouping process over several levels of data abstraction is proposed, following the structure of Sarkar and Boyer. Input data is organized in a bottom-up fashion, stratified by layers of abstraction: signal, primitive, structural and assembly level (see Fig. 1). At the signal level, raw sensor data is grouped into surface patches, before the primitive level produces parametric surfaces and associated boundaries. Perceptual grouping principles are learned at the structural and assembly levels to infer a value representing the connectivity between patches. Finally, a globally optimal segmentation is achieved using Graph-Cut on a graph consisting of the surface patches and the connectivity values between these patches.

Fig. 2 Original image, pixel clusters, parametric surface patches, segmented scene (more examples in [12, 16])

The main contribution of the work is the combination of perceptual grouping with SVM learning following a designated hierarchical structure. The learning approach of the framework enables segmentation for a wide variety of different objects in cluttered scenes, even if objects are partially occluded. Figure 2 shows segmentation of a complex scene, processed with the proposed framework.

2 Pre-Segmentation

3D cameras, such as Microsoft’s Kinect or Asus’ Xtion, provide RGB-D data, consisting of a color image and the associated depth information for each pixel. The task of the pre-segmentation module is twofold: first, the sensor characteristics are modelled and considered during normal calculation; second, neighbouring pixels are clustered into uniform patches without discontinuities using the estimated normals.

A classic way to calculate the normals of a point cloud is to locally fit planes to neighbouring 3D points. RGB-D sensors deliver organized point clouds, and a kernel is used to define the neighbourhood of a given point. Two parameters influence the normal calculation: the kernel radius \(k_r\) and the inlier distance \(d_{in}\), which accounts for high deviations that would distort the local plane. The former defines the number of points used and thus the smoothing of the normals; the latter defines the maximum Euclidean distance from the centre point at which a point within the window is still considered for the normal calculation.
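The role of the two parameters can be illustrated with a minimal sketch. It is not the implementation used in the framework: the function name `estimate_normal`, the grid layout `cloud[row][col] = (x, y, z)`, the least-squares fit of \(z = ax + by\) on centred coordinates, and the default values are all assumptions made for illustration, and the sensor noise model is omitted.

```python
import math

def estimate_normal(cloud, u, v, k_r=2, d_in=0.05):
    """Estimate a unit normal at cloud[v][u] of an organized point cloud.

    Hypothetical sketch: cloud is a 2D grid of (x, y, z) tuples, k_r is the
    kernel radius in pixels, d_in the inlier distance in the units of cloud.
    Returns None if too few inliers remain or the fit is degenerate.
    """
    centre = cloud[v][u]
    rows, cols = len(cloud), len(cloud[0])
    inliers = []
    for j in range(max(0, v - k_r), min(rows, v + k_r + 1)):
        for i in range(max(0, u - k_r), min(cols, u + k_r + 1)):
            p = cloud[j][i]
            # Points farther than d_in from the centre point are ignored.
            if math.dist(p, centre) <= d_in:
                inliers.append(p)
    if len(inliers) < 3:
        return None
    n = len(inliers)
    mx = sum(p[0] for p in inliers) / n
    my = sum(p[1] for p in inliers) / n
    mz = sum(p[2] for p in inliers) / n
    sxx = sxy = syy = sxz = syz = 0.0
    for x, y, z in inliers:
        dx, dy, dz = x - mx, y - my, z - mz
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
        sxz += dx * dz
        syz += dy * dz
    # Solve the 2x2 normal equations of z = a*x + b*y with Cramer's rule.
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return None
    a = (sxz * syy - syz * sxy) / det
    b = (syz * sxx - sxz * sxy) / det
    norm = math.sqrt(a * a + b * b + 1.0)
    # Negative z component: assumes a camera at the origin looking along +z.
    return (a / norm, b / norm, -1.0 / norm)
```

A larger \(k_r\) averages over more points and therefore smooths the normals, while a smaller \(d_{in}\) excludes points across depth discontinuities from the fit.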

Recursive clustering of normals is controlled by the maximum allowed angle between normals \(\gamma _{cl}\) and by the maximum allowed normal distance of points \(d_{cl}\) to a hypothesized plane, defined by the mean of the already clustered point normals and the mean position.
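The two clustering criteria can be sketched as follows. This is an illustrative region-growing variant under stated assumptions, not the original implementation: the names `accepts` and `grow_cluster`, the 4-neighbourhood, and the iterative stack in place of recursion are choices made here, and normals are assumed to be unit vectors.

```python
import math

def accepts(p, n, mean_p, sum_n, gamma_cl, d_cl):
    """Check point p with unit normal n against the current plane hypothesis."""
    length = math.sqrt(sum(c * c for c in sum_n))
    m = [c / length for c in sum_n]  # mean normal direction of the cluster
    cos_angle = max(-1.0, min(1.0, sum(a * b for a, b in zip(n, m))))
    angle = math.acos(cos_angle)
    # Distance of p to the plane through mean_p with normal m.
    plane_dist = abs(sum((a - b) * c for a, b, c in zip(p, mean_p, m)))
    return angle <= gamma_cl and plane_dist <= d_cl

def grow_cluster(points, normals, seed, gamma_cl, d_cl):
    """Grow one cluster from seed = (row, col); returns the set of pixels."""
    rows, cols = len(points), len(points[0])
    cluster = {seed}
    sum_p = list(points[seed[0]][seed[1]])
    sum_n = list(normals[seed[0]][seed[1]])
    stack = [seed]
    while stack:
        r, c = stack.pop()
        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= rr < rows and 0 <= cc < cols) or (rr, cc) in cluster:
                continue
            mean_p = [s / len(cluster) for s in sum_p]
            if accepts(points[rr][cc], normals[rr][cc], mean_p, sum_n,
                       gamma_cl, d_cl):
                cluster.add((rr, cc))
                # Update the running sums that define the plane hypothesis.
                sum_p = [s + x for s, x in zip(sum_p, points[rr][cc])]
                sum_n = [s + x for s, x in zip(sum_n, normals[rr][cc])]
                stack.append((rr, cc))
    return cluster
```

Because the plane hypothesis is updated with every accepted point, the result can depend on the seed and on the visiting order; the sketch makes no attempt to resolve this.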

3 Parametrization and Model Selection

In the last section, uniform patches without discontinuities were extracted from RGB-D data. Parametrization of these patches with suitable surface models reduces data size and leads to more meaningful abstractions. Two parametric models are chosen: a plane model to represent simple planar patches, and B-spline surfaces, which can model free-form surfaces and thus allow the representation of difficult surfaces. B-splines could also represent planes, making an explicit plane model superfluous, but B-splines are more expensive in terms of data size and computation, especially for further processing. More details of the B-spline fitting approach used can be found in [11].

To reduce the number of patches, neighbouring patches are merged after parametrization if a joint parametric model fits better than the two individual models. This decision is made by model selection with minimum description length (MDL) [8].
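The merge decision can be sketched as a comparison of description lengths. The exact cost function is not given in this article, so the following uses a common BIC-style approximation under the assumption of Gaussian fitting residuals; the function names, the `min_var` floor and the parameter counts are hypothetical.

```python
import math

def description_length(residuals, n_params, min_var=1e-8):
    """Approximate code length of a fitted model (assumed BIC-style form).

    residuals: fitting errors of all patch points against the model.
    n_params:  number of free model parameters (e.g. 3 for a plane).
    """
    n = len(residuals)
    var = max(sum(r * r for r in residuals) / n, min_var)
    data_cost = 0.5 * n * math.log(var)        # cost of encoding the residuals
    model_cost = 0.5 * n_params * math.log(n)  # cost of encoding the model
    return data_cost + model_cost

def should_merge(res_a, k_a, res_b, k_b, res_joint, k_joint):
    """Merge two patches if one joint model is cheaper than two separate ones."""
    separate = description_length(res_a, k_a) + description_length(res_b, k_b)
    return description_length(res_joint, k_joint) < separate
```

With this form, a joint model is preferred when it saves parameters without noticeably increasing the residuals, which is the case for two patches lying on the same plane.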

4 Parametric Surface Grouping

After the first two levels parametric surfaces are available for further processing in the structural and in the assembly level. A crucial task for surface grouping is to find relations between surface patches, indicating that they belong to the same object and to define them in a way that relations are valid for a wide variety of different objects.

Inspired by the already discussed Gestalt principles, the following relations between neighbouring surface patches are introduced, which will be used for the structural level:

  • \(r_{co}\) ... Similarity of patch colour

  • \(r_{rs}\) ... Similarity of patch size

  • \(r_{tr}\) ... Similarity of texture quantity

  • \(r_{ga}\) ... Similarity of texture: gabor filter

  • \(r_{fo}\) ... Similarity of texture: fourier filter

  • \(r_{co3}\) ... Similarity of color on patch border

  • \(r_{cu3}\) ... Mean curvature on patch border

  • \(r_{cv3}\) ... Curvature variance on patch border

  • \(r_{di2}\) ... Mean depth on patch border

  • \(r_{vd2}\) ... Variance of depth on patch border

  • \(r_{2d3}\) ... 3D-2D boundary ratio

The assembly level is the last level of grouping and is responsible for grouping spatially separated surface groupings. As at the structural level, relations between patches are introduced. The first five relations are the same as those used at the structural level, and the following are added:

  • \(r_{md}\) ... Minimum distance between patches

  • \(r_{nm}\) ... Similarity of mean of surface normals direction

  • \(r_{nv}\) ... Similarity of variance of surface normals direction

  • \(r_{ac}\) ... Diff. of normals direction of nearest contour points

  • \(r_{dn}\) ... Mean (normal) distance of nearest contour points

  • \(r_{cs}\) ... Collinearity continuity

  • \(r_{oc}\) ... Mean collinearity occlusion

  • \(r_{ls}\) ... Closure line support

  • \(r_{as}\) ... Closure area support

  • \(r_{lg}\) ... Closure lines to gaps

A feature vector \(r_{st}\) for the structural level is defined, containing all relations between neighbouring patches and a feature vector \(r_{as}\) for the assembly level, containing all relations between non-neighbouring patches.

Now one has to decide whether two surface patches belong together or not. This decision is based on the relations in the feature vector. Setting classification thresholds becomes more complex the more relations are used, and manual adjustment is no longer feasible. A solution to this problem is to learn the grouping rules with a learning method that maps each feature vector to a single decision value.

In this approach we use a support vector machine (SVM) to learn to classify the given feature vectors. SVMs are maximum margin classifiers, i.e. they try to find a separating hyperplane between different classes in the data with the maximum margin. SVMs support non-linear classification by using a kernel, mapping input data from a general set \(S\) into an inner product space \(V\), which is of higher dimension than the input space. This is done in the hope that the data will gain meaningful linear structure.

For the offline training and online testing phases, the freely available libsvm package [4] is used. After training, the SVM is capable of providing not only a binary decision \(same\) or \(notsame\) for each feature vector \(\mathbf {r}\), but also a probability value \(p(same \, | \, \mathbf {r})\) for each decision, based on the theory introduced by Wu et al. [24]. As solver we use C-support vector classification (C-SVC) with \(C=1\), \(\gamma =1/n\) and \(n=9\), and as kernel the radial basis function (RBF):

$$\begin{aligned} \mathbf {K}{(x_i,x_j)} = e^{-\gamma ||{x_i-x_j}||^2} \end{aligned}$$
(1)
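As a small numerical illustration, the standard RBF kernel \(e^{-\gamma ||x_i-x_j||^2}\), which is the form libsvm implements, can be evaluated directly. The function name `rbf_kernel` and the example vectors are hypothetical; only \(\gamma = 1/n\) with \(n = 9\) is taken from the text.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """Radial basis function kernel: exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

n = 9                # feature dimension stated in the text
gamma = 1.0 / n
x = [0.0] * n
y = [1.0] * n        # made-up vector at squared distance 9 from x

print(rbf_kernel(x, x, gamma))  # identical vectors give the maximum value 1.0
print(rbf_kernel(x, y, gamma))  # approximately exp(-1)
```

The kernel value decreases from 1 towards 0 as the feature vectors move apart, so it acts as a similarity measure between two relation vectors.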

Hand-annotated ground truth segmentation from a set of RGB-D images is used with the estimated feature vectors \(r_{st}\) and \(r_{as}\) to train an SVM for each level during an offline training phase. Feature vectors of patch pairs from the same object represent positive training examples, and vectors of pairs from different objects or from an object and the background represent negative examples. With this strategy, not only the affiliation of patches to the same object is learned, but also the disparity between object patches and other objects or the background.

After the learning phase the SVMs are able to classify the feature vectors, delivering a probability value for each patch pair. Using the estimated probabilities, groups of neighbouring surface patches could be formed by applying a threshold [e.g., \(p(same \, | \, \mathbf {r})=0.5\)]. With this strategy, a single wrong decision of the SVM (e.g., \(p=0.51\)) would probably lead to wrongly connected objects. Hence, optimal object hypotheses cannot be created by simply thresholding these values. Instead, a globally optimal solution can be found by building a graph and performing Graph-Cut segmentation.
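The weakness of plain thresholding can be made concrete with a toy example. The sketch below groups patches by connecting every pair whose probability exceeds the threshold; the function name `threshold_groups` and the probability values are made up for illustration.

```python
def threshold_groups(num_nodes, edges, threshold=0.5):
    """Group patches by thresholding pairwise probabilities.

    edges: list of (i, j, p_same) for patch pairs. Returns one label per patch.
    """
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, p in edges:
        if p > threshold:
            parent[find(i)] = find(j)
    return [find(x) for x in range(num_nodes)]

# Patches 0-2 belong to one object, patches 3-4 to another.
# The pair (2, 3) received a slightly wrong estimate of 0.51.
edges = [(0, 1, 0.95), (1, 2, 0.95), (3, 4, 0.95), (2, 3, 0.51)]
labels = threshold_groups(5, edges)
print(len(set(labels)))  # 1: both objects are merged into a single group
```

One marginal estimate is enough to join the two objects, because thresholding treats every edge independently of all others.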

5 Global Decision Making

After SVM classification at the structural and assembly levels, some probability estimates may contradict each other when trying to form object hypotheses. A globally optimal solution has to be found to overcome vague or wrong local predictions from the SVMs at both levels. To this end a graph is defined in which surface patches represent nodes and the edges carry the probability values of the SVMs. The graph-cut segmentation method introduced by Felzenszwalb et al. [5] is employed, using the probability values as the pairwise energy terms to find a global optimum for object segmentation.
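A minimal sketch of such a graph-based grouping is given below. It assumes that [5] refers to the efficient graph-based segmentation of Felzenszwalb and Huttenlocher and follows its merge criterion; the edge weight \(1 - p(same \, | \, \mathbf {r})\), the constant `k` and the function name `segment_graph` are assumptions, not details taken from this article.

```python
def segment_graph(num_nodes, edges, k=0.3):
    """Group patches with a Felzenszwalb-Huttenlocher style merge criterion.

    edges: list of (i, j, p_same). The edge weight is assumed to be 1 - p_same.
    Returns one component label per patch.
    """
    parent = list(range(num_nodes))
    size = [1] * num_nodes
    internal = [0.0] * num_nodes  # largest weight merged inside each component

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Process edges from most to least confident "same object" decision.
    for w, i, j in sorted((1.0 - p, i, j) for i, j, p in edges):
        a, b = find(i), find(j)
        if a == b:
            continue
        # Merge only if the edge is not much weaker than both components.
        if w <= min(internal[a] + k / size[a], internal[b] + k / size[b]):
            parent[b] = a
            size[a] += size[b]
            internal[a] = w  # weights are ascending, so w is the new maximum
    return [find(x) for x in range(num_nodes)]

edges = [(0, 1, 0.95), (1, 2, 0.95), (3, 4, 0.95), (2, 3, 0.51)]
labels = segment_graph(5, edges)
print(len(set(labels)))  # 2: the weak edge (2, 3) no longer joins the objects
```

In contrast to a fixed threshold, the decision for an edge depends on how strongly the two components are already connected internally, so a single marginal estimate between two well-supported groups is rejected.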

6 Evaluation

After all parts of the framework are introduced, evaluation of the proposed object segmentation method is shown. Because of the limited space only a part of the evaluation of [12] can be presented. The proposed object segmentation method is compared with the method of Mishra et al. [10] and the method by Ückermann et al. [21].

Evaluation is done on the object segmentation database (OSD) [13] as well as on the Willow Garage dataset.

Table 1 Precision and recall on the OSD and Willow Garage dataset for the approach by Mishra et al. [10], Ückermann et al. [21] and for the proposed approach, when using the SVM of the structural level \(SVM_{st}\) and when using both data abstraction levels \(SVM_{st+as}\)

Table 1 shows precision \(P\) and recall \(R\) of segmentation on the OSD for the algorithms of Mishra and Ückermann and for the proposed approach, both when using only the support vector machine of the structural level \(SVM_{st}\) and when additionally using the SVM of the assembly level \(SVM_{st+as}\) to build relations between estimated patches. The SVMs are trained with the four learning sets of the OSD for all experiments, even for the evaluation on the Willow Garage dataset. This demonstrates how well the presented approach generalizes to objects and scenes that were not seen during training.

The results in Table 1 show that the presented method works significantly better than the approach by Mishra for all sets of the OSD as well as for the Willow Garage dataset. In contrast, the results of Ückermann are close to those of the presented method. A closer look at the values shows that the method of Ückermann achieves a higher precision \(P\) but at the same time a lower recall \(R\). This indicates that their approach avoids wrong assignments of surfaces, but at the cost of sometimes over-segmenting the objects.
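The link between over-segmentation and the combination of high precision with low recall can be shown with a small sketch. It assumes that both measures are computed on pixel sets of a segment and its ground-truth object; the exact evaluation protocol is described in [12], and the function name and example sets are hypothetical.

```python
def precision_recall(segment, ground_truth):
    """Pixel-based precision and recall of one segment against one object.

    segment, ground_truth: sets of pixel indices.
    """
    overlap = len(segment & ground_truth)
    precision = overlap / len(segment) if segment else 0.0
    recall = overlap / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Over-segmentation: the segment covers only half of the object,
# but every pixel in it is correct.
obj = set(range(100))
part = set(range(50))
print(precision_recall(part, obj))    # (1.0, 0.5)

# Under-segmentation: the segment spills over into a neighbouring object.
merged = set(range(200))
print(precision_recall(merged, obj))  # (0.5, 1.0)
```

Over-segmentation therefore lowers recall while leaving precision untouched, whereas wrongly merged surfaces lower precision.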

The benefit of using the assembly level can be seen for the occluded objects set of the OSD. Recall is much higher when additionally using the assembly level, while precision remains almost constant on a high level. This demonstrates that occluded parts have been connected without wrongly assigning surface patches.

The method by Mishra performs better on the Willow Garage dataset than on the OSD because the scenes are less complex: objects in the dataset are mainly free-standing on a ground plane and there are no occluded objects. The proposed approach also performs well on such examples, but the benefit of using the assembly level is no longer evident in this case. Considering that the SVMs have been trained with data from the OSD, however, this can be interpreted as evidence that perceptual grouping rules act in a generic manner and are portable to different situations with different types of objects.

7 Conclusion

A framework was presented for segmenting unknown objects of reasonably compact shape in cluttered table-top scenes from RGB-D images. Raw input data is abstracted in a hierarchical framework. Instead of matching geometric object models, more general perceptual grouping rules are learned with support vector machines (SVMs) to group parametric surfaces. Experiments have shown that the learned rules are generic across different datasets with different objects, and a comparison with state-of-the-art methods shows the good performance of the approach.