
1 Introduction

Robot manipulators are increasingly used in households and small and medium-sized enterprises (SMEs) [1]. In such applications, the robot's environment may change over time. The robot uses sensors to detect and understand the scene, including which work pieces are present and where they are located. A common approach is to use a camera as the sensor and combine it with visual object recognition techniques. However, a single view of the scene is not always sufficient to identify all objects, e.g. due to occlusion or large scenes. The approach of generating new views is known as active vision [2]. Based on the scene recorded thus far, new views that gather additional information about the scene are determined and incorporated into the world representation. While the robot moves to the next view, a human co-worker may modify the scene, invalidating some information in the robot's world representation. Additionally, the sensor may capture an unwanted signal, e.g. the arm of a co-worker. In both cases it is not known a priori which part of the world representation is still valid or which part of the signal is useful. Therefore, it is necessary to develop and evaluate an approach that handles dynamic scenes as well as unwanted signals.

In this paper we present a novel approach to handling dynamic scenes using a 3D scene reconstruction. Based on the state of the art (Sect. 2) and our preliminary work (Sect. 3.1), we identify requirements, make the necessary assumptions, and present our overall concept (Sect. 3.2). We classify all possible cases of how changes in the environment can occur between two recordings (Sect. 3.3) and describe a method to detect changes for a specific kind of world representation (Sect. 3.4). As in our previous work [3, 4], we use boundary representation models (B-Reps). We present an approach to handle changes and to incorporate them into the world representation so that it remains valid after each view (Sect. 3.5). This approach is evaluated by a proof of concept as well as a comparison between our method and a ground-truth B-Rep (Sect. 4). Finally, we discuss our contribution and future work (Sect. 5).

2 State of the Art

The detection and handling of dynamic scenes is encountered in several fields of research, e.g. autonomous driving, computer vision, and robotics [5]. Across these applications, the problem and its solutions can be viewed from multiple angles. On the one hand, the type of internal scene representation is of interest. It ranges from raw LIDAR sensor data [6] over point clouds [7, 8] to bounding boxes [9] (e.g. from semantic segmentation). The representation of the scene affects the possibilities for detecting and handling changes. When using point clouds, either each point must be handled on its own, or a segmentation is necessary to group multiple points into clusters. In the case of semantic segmentation, these clusters are complete objects and can be used to identify changes between two frames.

On the other hand, changes can be handled in different ways. One method is to introduce a time component and aging [10, 11] to remove knowledge that has not been validated for a certain amount of time. The basic idea is to attach a certainty to each object instance and decrease it over time and with human proximity, since humans can only manipulate objects to which they are spatially close. Whenever an object is visible in the current view, its certainty is reset. Another approach relies solely on the given model, comparing multiple views and reasoning about which parts of the scene are still present. This can be done with background detection methods [12], where multiple frames are processed and moving objects are identified in the foreground. Alternatively, two given frames can be compared to detect possible changes. Overall, the goal is to obtain a valid representation of the scene at all times.
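To illustrate the general idea of such an aging scheme (this is a minimal sketch and not the specific implementations of [10, 11]), the following Python snippet decays a hypothetical per-object certainty over time and with human proximity and resets it when the object is observed. All names and parameter values are assumptions for illustration.

```python
import math


class TrackedObject:
    """Hypothetical object entry carrying a certainty value."""

    def __init__(self, object_id):
        self.object_id = object_id
        self.certainty = 1.0  # fully certain right after observation


def update_certainty(obj, dt, human_distance, visible,
                     decay_rate=0.05, proximity_scale=0.5):
    """Decay the certainty over the elapsed time dt, faster when a human is
    close (humans can only manipulate nearby objects); reset it whenever the
    object is visible in the current view. Parameter values are illustrative."""
    if visible:
        obj.certainty = 1.0
        return
    proximity_factor = 1.0 + proximity_scale / max(human_distance, 1e-3)
    obj.certainty *= math.exp(-decay_rate * proximity_factor * dt)
```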

Finally, it is of interest how the new sensor poses are obtained: e.g. a human operator uses a hand-held camera [10], or a robot system decides autonomously where to move [11]. Depending on the application, the requirements regarding the correctness of the world representation differ. In some cases it is acceptable if a small movement of an object goes undetected and is therefore not handled. In other applications it is preferable that every change is detected, even if this results in too many detected and handled changes.

A special case of how new poses are obtained is the explicit tracking of objects [13]. This approach requires the object of interest to be visible in the majority of views. This assumption is difficult to fulfill in some robotic applications, especially when using an eye-in-hand camera. Another special case is visual servoing [14], in which the sensor follows the movement of the object.

Based on the state of the art and our previous research (Sect. 3.1), we examine a model-based approach to dynamic scenes. In particular, the use of B-Reps for detecting and handling change is of interest, as a vision process that uses B-Reps as its world representation is not well explored [3, 4]. We therefore focus on a model-based approach in this paper. On the one hand, existing aging methods can be applied to model-based approaches as well. On the other hand, it is of interest how B-Reps can handle dynamic scenes based solely on their 3D geometry information. Furthermore, we use a robot-mounted sensor whose motion is controlled by an active vision process for object recognition. Therefore, we cannot move the camera to specific poses to detect or further investigate changes, and tracking methods are not applicable.

3 Our Approach

3.1 Basic Approach for Static Scenes

In our previous work we developed an approach for object recognition based on B-Reps [3]. There, we create an object database in which every object is stored as a B-Rep. To obtain a scene representation, we use a robot-mounted depth camera, and the resulting point cloud is transformed into a B-Rep [15, 16]. This B-Rep is the input for our object recognition approach. We determine multiple sets of hypotheses between the scene and objects from our database and select the best-fitting one. Some objects may not be recognized from the current view; therefore, we determine new views using our active vision approach [4]. At the new camera pose we record another point cloud, transform it into a B-Rep, and merge it into the scene representation. Afterwards we apply our object recognition method again. This procedure repeats until each object is correctly classified. The problem of dynamic scenes occurs between the capture of two point clouds from different views.
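This iterative loop can be summarized by the following sketch. The concrete components (point-cloud-to-B-Rep conversion, merging, recognition, and view planning) are passed in as callables; their names are placeholders of this sketch, not the original interfaces of [3, 4, 15, 16].

```python
def reconstruct_scene(sensor, initial_pose, database,
                      point_cloud_to_brep, merge_breps,
                      recognize_objects, plan_next_view):
    """Iterative reconstruction and recognition loop for static scenes.
    All callables are placeholders for the components described in [3, 4, 15, 16]."""
    pose = initial_pose
    world = None
    while True:
        cloud = sensor.capture(pose)                    # robot-mounted depth camera
        view = point_cloud_to_brep(cloud)               # conversion as in [15, 16]
        world = view if world is None else merge_breps(world, view)
        hypotheses, all_classified = recognize_objects(world, database)  # [3]
        if all_classified:
            return world, hypotheses
        pose = plan_next_view(world, hypotheses)        # active vision [4]
```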

3.2 Enhancement for Dynamic Scenes

In this section we present our overall approach. The reconstruction and object recognition form an iterative process, and we want to ensure a valid scene representation after each view. Therefore, every change has to be incorporated immediately. When handling dynamic scenes, we primarily focus on the faces represented in the B-Rep. Faces carry all information stored in a B-Rep: if the faces are correctly reconstructed, the vertices and edges contained within a face are correct as well. Furthermore, faces are a robust, high-quality representation of objects, as each face is computed by averaging over numerous points of a point cloud [15]. Therefore, our overall B-Rep is valid and correct if all faces are correct. Faces are also the most important feature for our object recognition approach as well as for the active vision method.
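For reference, such a face-centric B-Rep can be sketched with the following minimal data structures. The field selection is an assumption for illustration and omits details of the actual representation [15, 16].

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(eq=False)  # eq=False keeps instances hashable, so faces can live in sets
class Face:
    """Planar face described by its supporting plane and size; the boundary
    lists half-edge indices. The field selection is illustrative only."""
    face_id: int
    normal: Tuple[float, float, float]
    centroid: Tuple[float, float, float]
    area: float
    boundary: List[int] = field(default_factory=list)


@dataclass(eq=False)
class BRep:
    """Boundary representation as faces, half-edges, vertices, and boundaries,
    mirroring the tuple W = (F_W, E_W, V_W, B_W) used in Sect. 3.4."""
    faces: List[Face]
    half_edges: list
    vertices: list
    boundaries: list
```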

The first step in handling dynamic scenes is to categorize the possible changes regarding faces. Based on this classification, we have to detect these changes within our representation. We capture a point cloud from a new view, convert it into a B-Rep, and compare what is currently visible with what should be visible from the current position according to the current world model. For each category, we discuss how this comparison can be calculated and how the change can be detected within a scene. Finally, we handle the detected change in one of two ways: if object hypotheses are available, we utilize this information; if not, we only use the information from the B-Reps given directly by the detected change. As an additional assumption (similar to our previous work [3, 4]), the position and extent of our working surface (in our context a table) are known.

3.3 Classifying Possible Changes

Based on our previous work and the given assumptions, we classify the possible changes. To ensure a valid representation, we have to compare the world model reconstructed thus far with the current view in every step. As we do not use a certainty or an aging process, the existence of a face in a scene is binary: it either exists or it does not. We now discuss every possible case regarding the visibility of a face in the world model and the current view (a compact code sketch follows the list):

  1. The face is not stored in the world model and ...

     (a) is also visible in the current view: added face.

     (b) is also not visible in the current view: as this face does not exist at all, this case is not named and can be omitted.

  2. The face is already stored in the world model and ...

     (a) is also visible in the current view: validated face.

     (b) is not visible in the current view and ...

        (i) it should be visible from the current point of view: removed face.

        (ii) it cannot be seen from the current point of view due to occlusions or camera limitations: occluded face.
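The classification above can be stated compactly as follows. This sketch is a direct transcription of the listed cases, with a boolean "perceivable from the current pose" standing in for the occlusion test.

```python
from enum import Enum


class FaceChange(Enum):
    ADDED = "added"
    VALIDATED = "validated"
    REMOVED = "removed"
    OCCLUDED = "occluded"


def classify_face(in_world_model, in_current_view, perceivable_from_pose):
    """Return the change category of one face, or None for case 1(b),
    which is omitted because such a face does not exist at all."""
    if not in_world_model:
        return FaceChange.ADDED if in_current_view else None
    if in_current_view:
        return FaceChange.VALIDATED
    return FaceChange.REMOVED if perceivable_from_pose else FaceChange.OCCLUDED
```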

3.4 Detecting Changes

To detect every change between two B-Reps, we describe a method for each case mentioned in the previous section. The basic approach for every case is to compare what is currently visible with what should be visible based on the world model. If we find a difference, we know that something in the scene has changed, and we can categorize this change.

We project the faces of the world model reconstructed thus far onto the 2D image plane of our current pose \(T_C \in \mathbb{R}^{4 \times 4}\). The world model is given by the B-Rep \(W = (F_W, E_W, V_W, B_W)\) (with faces \(F_W\), half-edges \(E_W\), vertices \(V_W\), and boundaries \(B_W\)). Since the world model incorporates multiple views, the 2D projection may contain faces that are not recordable from our current view. Therefore, we have to take into account the view frustum of our depth sensor and further physical limitations (e.g. the angle of incidence). All these limitations are collected in a tuple \(L\). A projection can be described as \(\phi(B, T, L) \rightarrow \{1,\ldots,\Vert F_B \Vert\}^{n \times m}\), with a B-Rep \(B\) and a pose \(T \in \mathbb{R}^{4 \times 4}\). The result stores for each pixel the ID of the face that is closest to the camera. Thus, we obtain the projection of the B-Rep \(W\) as \(P_W = \phi(W, T_C, L)\). We repeat this procedure with the B-Rep \(C = (F_C, E_C, V_C, B_C)\) reconstructed from the current view, which yields the second 2D projection \(P_C = \phi(C, T_C, L)\). Note that we retain full knowledge of which face in the 3D representation belongs to which pixels in the 2D projection. We can now compare the two projections by looking at every projected face and searching for a correspondence in the other projection. Based on these projections, we define a predicate \(\varphi(f, B, T, L) \rightarrow \{0,1\}\) that states whether a face \(f\) is visible (1) or not (0) from a pose \(T\) when another B-Rep \(B\) is present, taking the physical limitations \(L\) into account.
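Assuming the projection \(\phi\) is already available as a 2D ID buffer (one face ID per pixel, with 0 used as background, which is an assumption of this sketch), the set of faces visible from a pose can be read off as follows.

```python
import numpy as np


def visible_face_ids(id_buffer, min_pixels=1):
    """Return the IDs of all faces that cover at least min_pixels pixels in a
    projection such as P_W = phi(W, T_C, L). The buffer is an (n, m) integer
    array; 0 marks background, which is an assumption of this sketch."""
    ids, counts = np.unique(id_buffer, return_counts=True)
    return {int(i) for i, c in zip(ids, counts) if i > 0 and c >= min_pixels}
```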

In addition, correspondences between faces (in the 3D representation) are calculated by a function \(\eta(f,g) \rightarrow \{0, 1\}\), based on their position, normal vector, and size. In our previous work [4], \(\eta\) corresponds to the explained-by function. This function returns for two faces \(f, g\) whether they correspond (1) or not (0).

If a face is visible in the current view but not in the world model, we conclude that it was added, resulting in \(A = \{g \in F_C \mid \not\exists f \in F_W: \eta(g,f) \}\). If a face of the world representation is also visible in the current view, there is no change but a validation. This set of validated faces is determined as \(V = \{f \in F_W \mid \exists g \in F_C: \eta(f,g) \}\). The case where a face is visible in the world model but not in the current view has to be subdivided into two cases, since we have to make sure to detect occlusions correctly. If no correspondence is found but the face should be perceivable from the current pose, we know that it was removed. This is denoted by \(R = \{f \in F_W \mid (\not\exists g \in F_C: \eta(f,g)) \wedge \varphi(f,C,T_C,L) \}\). If a face has no correspondence and is also not perceivable from the current pose, it is occluded: \(O = \{f \in F_W \mid (\not\exists g \in F_C: \eta(f,g)) \wedge \lnot\varphi(f,C,T_C,L) \}\).
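A minimal sketch of these set definitions follows, assuming the faces are available as Python sets, \(\eta\) as a boolean callable eta(f, g), and \(\varphi(f, C, T_C, L)\) as a boolean callable perceivable(f); it mirrors the definitions above rather than the original implementation.

```python
def detect_changes(world_faces, current_faces, eta, perceivable):
    """Compute the sets A (added), V (validated), R (removed), O (occluded)
    from Sect. 3.4. eta(f, g) tests face correspondence; perceivable(f) stands
    in for varphi(f, C, T_C, L)."""
    added = {g for g in current_faces
             if not any(eta(g, f) for f in world_faces)}
    validated = {f for f in world_faces
                 if any(eta(f, g) for g in current_faces)}
    unmatched = world_faces - validated
    removed = {f for f in unmatched if perceivable(f)}
    occluded = unmatched - removed
    return added, validated, removed, occluded
```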

3.5 Handling Changes

We now know for each face whether it should exist in the updated world representation. Our goal is to obtain a valid representation of the whole scene after each step. In our domain of complete objects, faces cannot exist on their own, as they always originate from an object. If a single face of an object is missing, the complete object should be removed.

The set \(V\) of validated faces can be handled directly by B-Rep merging [15]. The same applies to the added faces \(A\). As no decision can be made regarding the occluded faces \(O\), we decide that they remain within the world representation; if they have been removed, this will still be detected later in the active vision process.

This leaves the set of faces to remove, \(R\). As we know the object instances \(H = \{h_1,\ldots,h_o\}\) (with every \(h_i\) containing at least a B-Rep model representing the object), we can handle the detected changes in two separate ways. If the face that should be removed corresponds to an existing object hypothesis, we remove the complete hypothesis. This is done by determining and deleting all faces in the world representation that correspond to the hypothesis. First, we determine the set of faces to delete \(D_{H_0} = \bigcup_{\{h_i \in H \mid \exists r \in R: r \in f(h_i)\}} f(h_i)\), i.e. all faces of hypotheses that correspond to a face to remove. By \(f(h_i)\) we obtain all faces of the B-Rep \(W\) corresponding to the B-Rep model of hypothesis \(h_i\). Furthermore, we must remove all faces directly connected to the hypothesis to ensure a valid world representation (this requirement originates from B-Reps as the underlying data structure). Therefore, the final set of faces to delete is \(D_{H_1} = D_{H_0} \cup \{f \in F_W \mid \exists g \in D_{H_0}: \texttt{neighbor}(f,g)\}\), where \(\texttt{neighbor}\) denotes whether two faces \(f\) and \(g\) are adjacent with respect to their half-edges.
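A sketch of this two-step deletion is given below, with faces_of(h) standing in for \(f(h_i)\) and neighbor(f, g) for the half-edge adjacency test; both callables are assumptions of the sketch.

```python
def faces_to_delete_with_hypotheses(world_faces, removed, hypotheses,
                                    faces_of, neighbor):
    """Compute D_H0 (all faces of hypotheses hit by a removed face) and
    D_H1 (D_H0 plus its direct neighbors) as defined in Sect. 3.5."""
    d_h0 = set()
    for h in hypotheses:
        h_faces = set(faces_of(h))
        if any(r in h_faces for r in removed):
            d_h0 |= h_faces
    d_h1 = d_h0 | {f for f in world_faces
                   if any(neighbor(f, g) for g in d_h0)}
    return d_h0, d_h1
```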

If no hypothesis is available for a face, we remove all neighboring faces (i.e. faces sharing an edge). This process is repeated transitively, but the working surface is excluded beforehand, so the procedure stops there. First, we determine the set of faces without a hypothesis correspondence as \(D_{\bar{H}_0} = \{r \in R \mid \not\exists h_i \in H: r \in f(h_i)\} = R \setminus D_{H_0}\). Then we repeatedly add the neighboring faces via \(D_{\bar{H}_j} = D_{\bar{H}_{j-1}} \cup \{f \in F_W \mid \exists g \in D_{\bar{H}_{j-1}}: \texttt{neighbor}(f,g) \}\). On the one hand, we have to delete multiple faces, as we do not know which object may correspond to them; on the other hand, we have to ensure the validity of the B-Rep. As a consequence, too many faces may be removed. However, the faces remaining within the scene can be examined later by the underlying active vision approach.
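The transitive expansion can be sketched as a simple breadth-first growth over the neighbor relation, where the known working surface is excluded beforehand so the expansion stops at the table; is_working_surface is an assumed predicate of this sketch.

```python
def faces_to_delete_without_hypotheses(world_faces, removed, d_h0,
                                       neighbor, is_working_surface):
    """Start from the removed faces no hypothesis explains (R minus D_H0) and
    repeatedly add neighboring faces until the set no longer grows, i.e. the
    fixed point of the D_H-bar_j sequence from Sect. 3.5."""
    candidates = {f for f in world_faces if not is_working_surface(f)}
    deleted = (removed - d_h0) & candidates
    frontier = set(deleted)
    while frontier:
        frontier = {f for f in candidates - deleted
                    if any(neighbor(f, g) for g in frontier)}
        deleted |= frontier
    return deleted
```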

4 Evaluation

4.1 Setup

The evaluation is split into two parts. On the one hand, we validate our classification of the different types of change. To do this, we record one scene per change type, each containing exactly one change. Furthermore, we validate the usefulness regarding scene-unspecific signals, e.g. a recorded human. On the other hand, we evaluate our approach by comparing the reconstruction of a dynamic scene with that of a static one. To achieve this, we build a scene and record it with our active vision approach with the handling of dynamic scenes enabled. When the reconstruction is completed, we run only the active vision approach on the now static scene to obtain a ground truth. To compare both reconstructions, we use the following criteria: First, we count the number of faces. Second, we remove all faces that are contained in both reconstructions; to determine which faces these are, we use the definition of \(\eta\), which determines whether two faces correspond to each other. If the number of unexplained faces is low, the two reconstructed scenes are similar. Finally, we delete all faces that are explained by a manually validated hypothesis. This is necessary because some faces may be impossible to view in the static reconstruction due to occlusion, and because our goal is the correct recognition of all objects rather than a complete reconstruction of the scene. If any faces remain afterwards, an error occurred during the reconstruction or the handling of dynamic scenes and should be investigated further.
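These criteria can be summarized in a short sketch, assuming the two reconstructions are available as sets of faces, \(\eta\) as a boolean callable, and a predicate for faces explained by a manually validated hypothesis; all parameter names are assumptions.

```python
def compare_reconstructions(dynamic_faces, static_faces, eta,
                            explained_by_validated_hypothesis):
    """Evaluation criteria of Sect. 4.1 as a sketch: face counts, faces without
    a correspondence in the other reconstruction, and the faces that remain
    after removing those explained by a manually validated hypothesis."""
    unmatched_dynamic = {f for f in dynamic_faces
                         if not any(eta(f, g) for g in static_faces)}
    unmatched_static = {g for g in static_faces
                        if not any(eta(g, f) for f in dynamic_faces)}
    remaining = {f for f in unmatched_dynamic | unmatched_static
                 if not explained_by_validated_hypothesis(f)}
    return {
        "faces_dynamic": len(dynamic_faces),
        "faces_static": len(static_faces),
        "unmatched_dynamic": len(unmatched_dynamic),
        "unmatched_static": len(unmatched_static),
        "remaining_after_hypotheses": len(remaining),
    }
```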

Fig. 1 The reconstruction of a scene with an irrelevant signal (left, surrounded in green) and the resulting reconstruction after another image from the same pose was incorporated (right). The B-Rep is drawn in gray; the hypotheses are the red wire-frame models. The arrows indicate possible next views for the active vision process. The coordinate system indicates the base of the robot.

Fig. 2 Removing an object from the initial reconstruction (left) and the resulting representation (right). The projection of the camera frustum onto the table is drawn in black.

Fig. 3 Removing and adding multiple objects in one step. An additional view is necessary to delete the removed object due to occlusion.

As hardware setup we use a KUKA LWR 4 robot with a hand/eye-calibrated ENSENSO N10 depth camera. To ensure high-quality point clouds, we average over multiple point clouds from one view to reduce the impact of noise. We utilize an object database with 25 instances [3], consisting of objects of different domains and complexity levels with respect to the number of faces, symmetry, and convexity.

4.2 Results

In the validation, we start with the removal of scene-unspecific signals, as seen in Fig. 1. In the first scene, a human arm is reconstructed as multiple planar segments. With the arm removed and another B-Rep incorporated into the scene, the segments are identified as removed and deleted from the reconstruction. Only one patch remains, since it is too small. Furthermore, more objects are classified correctly, as the arm occluded some of them.

Regarding the validation of every possible change case, the removed case can be seen in Fig. 2. From an initial pose, two objects are reconstructed and identified correctly. The robot manipulator moves to a new pose to validate the hypotheses. One of the objects is removed in the meantime, and the corresponding faces are deleted in the resulting reconstruction. Additionally, the resulting gap in the table is closed. The next case is the occluded one, as shown in Fig. 3. First, one object is visible. Then two more objects are placed in front of it, occluding the first one. Furthermore, the now occluded object is removed. In the resulting representation, the new objects are added and the previous one is not deleted, due to the occlusion: as we cannot be sure whether this object is still there or not, it should remain in the representation. Finally, another image is taken from a different view (from which the first object should be visible), and the object is removed from the representation, as we can now be sure that it is no longer there. The remaining two cases, added and validated, were evaluated as well but are not shown in figures here, because they are handled with existing and already evaluated methods.

Fig. 4 One example scene of the evaluation. On the left, the final world representation and pose of the dynamic evaluation; on the right, those of the static one.

For our evaluation, we used five scenes with different objects and overall complexity. Each scene was modified by multiple changes (moving, removing, and adding multiple objects and generating unwanted signals). One scene is shown in Fig. 4. Accumulated over all scenes, we gathered 126 faces for the dynamic case and 120 for the static one. 97 faces from the dynamic reconstruction had a match in the static one, and 98 faces the other way around, meaning one face was explained by two others. This occurs, e.g., when the complete face was not captured by the sensor and two patches were reconstructed instead of one complete face. Furthermore, 29 faces had no correspondence in the static reconstruction (28 in the other direction). However, only one face was left after deleting all faces with hypothesis correspondences. The high number of faces without a correspondence originates primarily from occlusion during the static reconstruction. Furthermore, a few small faces are impossible to look at directly with active vision, due to collision prevention mechanics (e.g. if the face is close to the working area). Therefore, in some reconstructions a face may be present because it was captured together with a neighboring face (which may not be the case in another reconstruction). The one remaining face occurs because an old and a new face have properties that are too similar: an object was removed from the scene, and another object was placed there instead, and one face of the new object has the same properties, here face area and normal, with respect to the function \(\eta\). Therefore, the old face was not deleted. Depending on the face, it may still be removed later if the camera takes a direct look at it and the algorithm is able to differentiate between the old and the new one.

Based on these results, we conclude that our approach is useful. On the one hand, we can successfully tackle the problem of unwanted signals: if any scene-unrelated part is captured, it is investigated further by the active vision method and deleted as soon as the disrupting object is removed. On the other hand, we can handle changes that occur during the robot movement, as seen in our validation and evaluation.

5 Conclusion

We present a novel approach to detect and handle dynamic scenes for a special type of representation. Our approach uses a categorization of all possible types of change for B-Reps. To detect these changes, we compare the world model reconstructed thus far with the current view: we project both the world representation and the current view onto a 2D plane and compare what we should see. Afterwards, the detected change is handled, either by utilizing existing object hypotheses or by using only the geometric information from the scene. With our evaluation we conclude that our approach is useful for using B-Reps in object recognition.

Future work may include a time-based component to delete world model entries that were not validated within a certain period. Furthermore, different strategies for how much of the scene should be deleted when a face is missing can be implemented and evaluated. Finally, representations other than B-Reps that are easier to keep valid when deleting faces are of interest.