1 Introduction

The determination of the next best view consists in finding a new observation view, based on the information about the visual object available in the current view, so that the camera can obtain the maximal information about the unknown regions of the object.

Scholars have already made considerable progress on the next best view problem. Connolly [7], one of the earlier researchers on the topic, used a partial octree model to describe the visual object and labeled its nodes differently to determine the next best view. Roy et al. [32] used search tree nodes to determine the next best view. Low et al. [21] proposed a next best view method using an adaptive hierarchical algorithm. Blaer et al. [4] used a voxel-based occupancy method to plan the next best view. Combining GKLT feature tracking, Trummer et al. [34] explicitly used knowledge about the current 3D estimate of the tracked point to determine the next best view; building on that work, they also proposed an online next best view method [35]. Based on a model's covariance structure and appearance, Dunn et al. [12] determined the next best view by deploying a hierarchical, uncertainty-driven model refinement process. Jia et al. [17] determined the next best view using information from the image sequences and their relative 3D positions. Haner et al. [15] proposed a method based on covariance propagation. Based on the field-of-view constraint of stereo vision, Freundlich et al. [13] iteratively minimized the fused uncertainty to determine the next best view. Li et al. [20] extracted features of different views by unsupervised feature learning and then trained classifiers to evaluate each view's discrimination ability. Adler et al. [1] sorted the candidate views by achievable information gain. Mauro et al. [24] proposed a next best view method based on the concept of view importance. Yiakoumettis et al. [40] introduced a relevance-feedback online learning strategy that learns the user's preference to determine the next best view.

However, because none of these methods considers occlusion, their accuracy degrades as occlusion becomes more severe. Next best view methods that take occlusion into account have therefore been proposed. Based on the Positional Space algorithm, Pito [30] determined the next best view from a large set of candidate views. Banta et al. [3] proposed a method based on an overall observation strategy. Li et al. [19] proposed a viewpoint planning method that calculates information entropy and takes the view with maximal information entropy as the next best view. Vázquez et al. [36] proposed automatic view selection using viewpoint entropy. Combining layered ray tracing and an octree, Vasquez-Gomez et al. [37] constructed the object model and generated candidate views ranked by a utility function to determine the next best view. Wenhardt et al. [16] used a Kalman filter to obtain the best estimate of the object's geometry and determined the next best view by choosing a suitable optimization criterion. Potthast et al. [31] used a belief model of the unobserved space to estimate the expected information gain of each possible viewpoint. Kriegel et al. [18] proposed a surface-based next best view approach that creates a triangle surface and determines viewpoints in a way similar to human intuition. Maver et al. [25] approximated the occluded region by polygons and used the occluded-region information to determine the next best view. Wu et al. [38] determined the next best view using layered contour fitting (LCF) with a density-based clustering algorithm. Giorgi et al. [14] determined the next best view according to semantic criteria. Munkelt et al. [26] proposed a next best view method based on voxel space. Based on a retrainable neural network architecture, Papaoulakis et al. [29] proposed a next best view method for detecting athletes in large-scale Olympic events. Delannay et al. [9] selected the next best view based on contextual features. Chen et al. [5] extracted foreground likelihood and projected it onto a ground occupancy map to determine the next best view. Daniyal et al. [8] used a multivariate Gaussian distribution to determine the next best view. Chen et al. [6] used ray tracing to determine how much new information a given sensor perspective would reveal and chose the next best view accordingly. Although these methods consider occlusion, they are limited by time complexity [3, 19, 30, 36], specific equipment [16, 18, 31, 37], prior knowledge [14, 25, 26, 38], or the need for multiple cameras [5, 6, 8, 9, 29]; moreover, all of the objects studied in [1, 3–9, 12–21, 24–26, 29–32, 34–38, 40] are stationary. Furthermore, in many applications such as 3D reconstruction of moving objects, automatic tracking, moving object recognition, robot operation in hazardous regions, and spacecraft docking, the visual objects are moving and exhibit self-occlusion, and these tasks demand real-time performance. Because of the limitations above, the methods in [1, 3–9, 12–21, 24–26, 29–32, 34–38, 40] cannot handle such cases.

For a moving visual object, depth images of the object need to be matched in order to estimate its motion. The ORB (Oriented FAST and Rotated BRIEF) algorithm proposed by Rublee et al. [33] is fast and efficient and is widely applied in image matching. Makantasis et al. [22] used ORB for image filtering and outlier removal so as to perform 3D image retrieval in the wild. Based on the ORB algorithm, Mur-Artal et al. [27] proposed a feature-based monocular SLAM system that operates in real time in small and large, indoor and outdoor environments. Mason et al. [23] developed an approach to object perception based on the principle of object discovery using ORB. In this paper, an ORB-based method is proposed to pre-match two depth images and estimate the motion of the visual object.

The Kinect sensor shows promise in many computer vision applications, such as data acquisition and 3D modeling. Alexiadis et al. [2] described a novel system that automatically evaluates dance performances and provides visual feedback to the performer in a 3D virtual environment, where the performer's motion is acquired and modeled via Kinect-based human skeleton tracking. Dimitropoulos et al. [10] used the Kinect sensor to track the volume of a performer and produce skeletal data, so that intangible cultural treasures can be learned in an interactive 3D environment. Doulamis et al. [11] used the Kinect sensor to build Digital Heritage Libraries that protect tangible and intangible cultural heritage. Yang et al. [39] proposed a real-time synthetic aperture imaging algorithm based on the Kinect sensor. In the real experiments of this paper, a Kinect sensor is used to acquire depth images of moving objects.

To determine the next best view when the visual object is moving, this paper proposes a method that combines self-occlusion avoidance with 3D motion estimation, using the self-occlusion information contained in depth images. The proposed method differs from traditional next best view methods designed for reconstruction or recognition: its main purpose is to observe the occluded region, which contains much useful information about the visual object. If the information in the occluded region can be obtained, both reconstruction and recognition results improve considerably, so a vision system equipped with the proposed method can better perform tasks such as 3D reconstruction of moving objects, automatic tracking, moving object recognition, robot operation in hazardous regions, and spacecraft docking. Experimental results demonstrate that the proposed method is feasible and has relatively high real-time performance.

2 Problem formulation and method overview

The determination of the next best view based on the self-occlusion information in depth images of a moving object is defined as follows. The self-occlusion regions are taken as the unknown regions, and two depth images of the moving object are taken as the input. The self-occlusion avoidance result is computed from the self-occlusion information in the first depth image, and the 3D motion estimation result is computed from the two depth images. Finally, the next best view is determined by combining the self-occlusion avoidance and 3D motion estimation results, so that the camera can obtain the maximal information about the self-occlusion regions of the moving object.

Self-occlusion avoidance means that, when the visual object is stationary, the next best view is calculated from the self-occlusion information in the depth image so that the camera can obtain the maximal information about the self-occlusion regions. In this paper, however, the next best view problem refers specifically to the moving case: when the visual object is moving, the next best view is calculated by combining the self-occlusion avoidance result with the 3D motion estimation, so that the camera can obtain the maximal information about the self-occlusion regions of the moving object.

Fig. 1 shows the position relation between the depth camera and the moving object. Fig. 1a shows the relation in the initial view; the region ABEAACDA is the self-occlusion region. Fig. 1b shows the relation in the next best view calculated only by the self-occlusion avoidance method. Fig. 1c shows the relation in the next best view calculated by the proposed method for the moving object.

Fig. 1 The position relation between the depth camera and the moving object: (a) the position relation in the initial view; (b) the position relation in the next best view calculated only by the self-occlusion avoidance method; (c) the position relation in the next best view calculated by the proposed method for the moving object

As can be seen from Fig. 1, the camera cannot reach the next best view using self-occlusion avoidance alone, because the visual object is moving. In our work, the object's motion is estimated and compensated for, so the camera can reach the next best view. This illustrates that the proposed method can solve the next best view problem of a camera observing a moving object.

Based on the analysis above, the overall idea of the proposed method is as follows. Firstly, the first depth image of the moving object is acquired and self-occlusion detection is performed on it; the self-occlusion regions are then modeled by space quadrilateral subdivision, and the area, center and normal vector of each space quadrilateral are calculated. Secondly, the self-occlusion avoidance result for the current object is calculated, based on the idea of mean shift, from the space quadrilateral information. Thirdly, the second depth image of the moving object is acquired and the mean curvature of each pixel in the two acquired images is calculated as a local invariant feature; the features of the two images are pre-matched, wrong matches are removed with a rigid-invariance constraint to obtain accurate matching results, and the 3D motion is then estimated from the 3D coordinates of the accurately matched points. Finally, the next best view is determined by combining the self-occlusion avoidance and 3D motion estimation results.

3 Self-occlusion avoidance

3.1 Modeling the self-occlusion regions

3.1.1 Obtaining the self-occlusion information of visual object

In order to model the self-occlusion regions, the self-occlusion information is first obtained from the depth image of the visual object. Self-occlusion information refers to the self-occlusion boundaries and their corresponding nether adjacent boundaries obtained from the depth image in the current view; each self-occlusion boundary and its corresponding adjacent boundary compose a self-occlusion region in 3D space. The self-occlusion boundaries and their corresponding nether adjacent boundaries are obtained with the method in [41]; all the points on the self-occlusion boundaries compose the self-occlusion boundary set \( O \), and all the points on the nether adjacent boundaries compose the nether adjacent boundary set \( O^{\prime} \). Fig. 2 shows the depth image of Bunny and its self-occlusion boundaries and nether adjacent boundaries in the current view; in Fig. 2b, the red points are self-occlusion boundary points and the green points are nether adjacent boundary points.

Fig. 2 The depth image of Bunny and its self-occlusion boundaries and nether adjacent boundaries: (a) the depth image of Bunny; (b) the self-occlusion boundaries and nether adjacent boundaries in the depth image of Bunny

3.1.2 Modeling the self-occlusion regions based on the self-occlusion information

Based on the obtained self-occlusion information, the self-occlusion regions are modeled to provide the basis for self-occlusion avoidance. Because the internal information of a self-occlusion region is unknown, each self-occlusion region is subdivided as follows. Two adjacent self-occlusion points \( o_i \), \( o_{i+1} \) on the same self-occlusion boundary are taken from the self-occlusion boundary set \( O \), and their corresponding adjacent points \( {o}_i^{\prime } \), \( {o}_{i+1}^{\prime } \) are taken from the nether adjacent boundary set \( O^{\prime} \). A space quadrilateral \( {o}_i{o}_{i+1}{o}_{i+1}^{\prime }{o}_i^{\prime } \) is then formed by these four points in 3D space and denoted by \( \mathrm{patch}_i \), where \( i \) is an integer from 1 to \( N-1 \) and \( N \) is the number of points on the self-occlusion boundary. All self-occlusion regions are modeled by this space quadrilateral subdivision. Fig. 3 shows a sketch map of the self-occlusion region subdivision.
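As an illustration, the following minimal sketch builds the patch list from the two boundary point sets. The array layout, one ordered (N, 3) array per boundary with `O_prime[i]` corresponding to `O[i]`, is an assumption made only for this example.

```python
import numpy as np

def build_patches(O, O_prime):
    """Return the space quadrilaterals patch_i = (o_i, o_{i+1}, o'_{i+1}, o'_i).

    Indices are zero-based here; they correspond to i = 1 .. N-1 in the paper.
    """
    patches = []
    for i in range(len(O) - 1):
        patches.append((O[i], O[i + 1], O_prime[i + 1], O_prime[i]))
    return patches

# Toy example with a boundary of three points per set (N = 3 -> two patches).
O = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [2.0, 0.0, 1.0]])
O_prime = np.array([[0.0, -1.0, 2.0], [1.0, -1.0, 2.0], [2.0, -1.0, 2.0]])
patches = build_patches(O, O_prime)
```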

Fig. 3 The sketch map of self-occlusion region subdivision

3.1.3 Calculating the area, center and normal vector of each patch

After modeling the self-occlusion regions, the area, center and normal vector of each patch are calculated for use in solving the next best view problem.

Firstly, the area of each patch is calculated. To describe \( \mathrm{patch}_i \) as fully as possible, its area \( S_i \) is defined as half of the total area of the four triangles that compose \( \mathrm{patch}_i \), namely

$$ {S}_i=\frac{1}{2}\left({S}_{\Delta {o}_i{o}_i^{\prime }{o}_{i+1}^{\prime }}+{S}_{\Delta {o}_i{o}_{i+1}^{\prime }{o}_{i+1}}+{S}_{\Delta {o}_i{o}_i^{\prime }{o}_{i+1}}+{S}_{\Delta {o}_{i+1}{o}_i^{\prime }{o}_{i+1}^{\prime }}\right)\kern1em \mathrm{s}.\mathrm{t}.\kern1em 1\le i\le N-1 $$
(1)

where \( {S}_{\Delta {o}_i{o}_i^{\prime }{o}_{i+1}^{\prime }} \), \( {S}_{\Delta {o}_i{o}_{i+1}^{\prime }{o}_{i+1}} \), \( {S}_{\Delta {o}_i{o}_i^{\prime }{o}_{i+1}} \), \( {S}_{\Delta {o}_{i+1}{o}_i^{\prime }{o}_{i+1}^{\prime }} \) are the areas of the triangles \( {o}_i{o}_i^{\prime }{o}_{i+1}^{\prime } \), \( {o}_i{o}_{i+1}^{\prime }{o}_{i+1} \), \( {o}_i{o}_i^{\prime }{o}_{i+1} \) and \( {o}_{i+1}{o}_i^{\prime }{o}_{i+1}^{\prime } \), respectively.

Then, the center of each patch is calculated. The center \( C_i \) of \( \mathrm{patch}_i \) is defined as the average of the coordinates of the four space quadrilateral points that compose \( \mathrm{patch}_i \), namely

$$ {C}_i=\frac{1}{4}\left({o}_i+{o}_{i+1}+{o}_i^{\prime }+{o}_{i+1}^{\prime}\right) $$
(2)

At last, the normal vector of each patch is calculated. The normal vector of \( \mathrm{patch}_i \) is defined as the vector that starts from \( C_i \) and is parallel to the common perpendicular of \( {o}_i{o}_{i+1}^{\prime } \) and \( {o}_i^{\prime }{o}_{i+1} \), with its direction pointing toward the outside of the visual object. The concrete method is as follows. Taking \( o_i \) as the start point and \( {o}_{i+1}^{\prime } \) as the end point constructs the vector \( {\boldsymbol{\upmu}}_i \); taking \( o_{i+1} \) as the start point and \( {o}_i^{\prime } \) as the end point constructs the vector \( {\boldsymbol{\upgamma}}_i \). The normal vector \( {\mathbf{n}}_i \) of \( \mathrm{patch}_i \) is then defined as

$$ {\mathbf{n}}_i={\boldsymbol{\upmu}}_i\times {\boldsymbol{\upgamma}}_i\kern0.5em \mathrm{or}\kern0.5em {\mathbf{n}}_i={\boldsymbol{\upgamma}}_i\times {\boldsymbol{\upmu}}_i $$
(3)

Analysis of a self-occlusion boundary and its corresponding nether adjacent boundary shows that the depth values of the self-occlusion points are smaller than those of their corresponding nether adjacent points, so the direction of the normal vector can be determined as follows to ensure that it points toward the outside of the visual object. In the depth image, \( {\mathbf{o}}_i{\mathbf{o}}_{i+1} \) is the vector from \( o_i \) to \( o_{i+1} \), \( {\mathbf{o}}_i{\mathbf{o}}_i^{\prime } \) is the vector from \( o_i \) to \( {o}_i^{\prime } \), and \( {\mathbf{o}}_i{\mathbf{o}}_{i+1}^{\prime } \) is the vector from \( o_i \) to \( {o}_{i+1}^{\prime } \). Taking \( o_i \) as the rotation center, rotate \( {\mathbf{o}}_i{\mathbf{o}}_{i+1} \) clockwise. If the rotation angle from \( {\mathbf{o}}_i{\mathbf{o}}_{i+1} \) to \( {\mathbf{o}}_i{\mathbf{o}}_i^{\prime } \) is greater than 0° and at most 180°, and the rotation angle from \( {\mathbf{o}}_i{\mathbf{o}}_{i+1} \) to \( {\mathbf{o}}_i{\mathbf{o}}_{i+1}^{\prime } \) is at least 0° and less than 180°, the normal vector of \( \mathrm{patch}_i \) is defined as \( {\boldsymbol{\upmu}}_i\times {\boldsymbol{\upgamma}}_i \) so that it points toward the outside of the visual object, namely

$$ {\mathbf{n}}_i={\boldsymbol{\upmu}}_i\times {\boldsymbol{\upgamma}}_i $$
(4)

If the rotation angle from \( {\mathbf{o}}_i{\mathbf{o}}_{i+1} \) to \( {\mathbf{o}}_i{\mathbf{o}}_i^{\prime } \) is at least 180° and less than 360°, and the rotation angle from \( {\mathbf{o}}_i{\mathbf{o}}_{i+1} \) to \( {\mathbf{o}}_i{\mathbf{o}}_{i+1}^{\prime } \) is greater than 180° and at most 360°, the normal vector of \( \mathrm{patch}_i \) is defined as \( {\boldsymbol{\upgamma}}_i\times {\boldsymbol{\upmu}}_i \) so that it points toward the outside of the visual object, namely

$$ {\mathbf{n}}_i={\boldsymbol{\upgamma}}_i\times {\boldsymbol{\upmu}}_i $$
(5)
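To make Eqs. (1)-(5) concrete, the sketch below computes the area, center and normal of one patch with NumPy. The orientation step replaces the paper's clockwise-rotation test in the image plane with a simpler check against the object center, which is an assumption made only for this illustration.

```python
import numpy as np

def triangle_area(a, b, c):
    # Area of the 3D triangle with vertices a, b, c.
    return 0.5 * np.linalg.norm(np.cross(b - a, c - a))

def patch_area_center_normal(o_i, o_i1, op_i, op_i1, object_center):
    """area, center and normal of patch (o_i, o_{i+1}, o'_i, o'_{i+1})."""
    # Eq. (1): half of the summed areas of the four triangles.
    S = 0.5 * (triangle_area(o_i, op_i, op_i1) + triangle_area(o_i, op_i1, o_i1)
               + triangle_area(o_i, op_i, o_i1) + triangle_area(o_i1, op_i, op_i1))
    # Eq. (2): center is the mean of the four vertices.
    C = 0.25 * (o_i + o_i1 + op_i + op_i1)
    # Eq. (3): normal is the cross product of mu_i (o_i -> o'_{i+1})
    # and gamma_i (o_{i+1} -> o'_i).
    mu = op_i1 - o_i
    gamma = op_i - o_i1
    n = np.cross(mu, gamma)
    # Eqs. (4)/(5): orient the normal outward; here approximated by flipping
    # it away from the object center (assumption of this sketch).
    if np.dot(n, C - object_center) < 0:
        n = -n
    return S, C, n
```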

3.2 The method of self-occlusion avoidance

After modeling the self-occlusion regions, a self-occlusion avoidance method based on the idea of mean shift is proposed, using the area and normal vector information of each patch. The main process is as follows. Firstly, the best observation position of each patch is determined from its area and normal vector, and all the best observation positions form a set \( S_p \). Secondly, starting from the current camera position \( P_{begin} \) and following the idea of mean shift, the center of mass of all elements in \( S_p \) is calculated using the area and normal vector information, and the best observation position \( P_e \) of the self-occlusion avoidance result is then determined from the center of mass under the constraint of the camera observation distance. The best observation direction \( \mathbf{V}_e \) of the self-occlusion avoidance result is the direction from \( P_e \) to the midpoint of the centers of all patches visible when the camera is at \( P_e \). Finally, combining the best observation position and direction, the result of self-occlusion avoidance is \( \left({P}_e,{\mathbf{V}}_e\right) \). The concrete process is discussed below.

To calculate the best observation position of each patch, the normal vector of each patch is first processed. The length of each normal vector is normalized to the length of the vector from \( P_{begin} \) to the center of the visual object. The end point \( p_i \) of the normal vector \( {\mathbf{n}}_i \) starting from \( C_i \) is then defined as

$$ \left({x}_{p_i},{y}_{p_i},{z}_{p_i}\right)={\mathbf{n}}_i+\left({x}_{C_i},{y}_{C_i},{z}_{C_i}\right) $$
(6)

where \( \left({x}_{p_i},{y}_{p_i},{z}_{p_i}\right) \) are the coordinates of \( p_i \) and \( \left({x}_{C_i},{y}_{C_i},{z}_{C_i}\right) \) are the coordinates of \( C_i \).

\( p_i \) is then defined as the best observation position of \( \mathrm{patch}_i \), and all the best observation positions of the patches form the set \( S_p \).

After that, the mean shift vector \( F(P_k) \) at \( P_k \) is defined as:

$$ F\left({P}_k\right)=\frac{1}{k}\sum \limits_{p_i\in {S}_p}{g}_{P_k}\left({p}_i\right)\omega \left({p}_i\right)\left({p}_i-{P}_k\right) $$
(7)

where \( \omega(p_i) \) is the weight corresponding to the point \( p_i \), \( k \) is the number of elements in \( S_p \), and \( {g}_{P_k}\left({p}_i\right) \) is a sigmoid function that judges whether \( p_i \) affects the iteration when the camera is at \( P_k \).

The weight \( \omega(p_i) \) of point \( p_i \) in Eq. (7) is defined as the ratio of the area of \( \mathrm{patch}_i \) to the total area of all patches, namely

$$ \omega \left({p}_i\right)=\frac{S_i}{\sum \limits_{j=1}^{N-1}{S}_j} $$
(8)

where \( S_i \) is the area of \( \mathrm{patch}_i \).

The function \( {g}_{P_k}\left({p}_i\right) \) is defined as

$$ {g}_{P_k}\left({p}_i\right)=\frac{1}{1+{e}^{-\alpha \cos {\theta}_i}} $$
(9)

where \( \theta_i \in [0°, 180°] \) is the angle between the normal vector of \( \mathrm{patch}_i \) and the vector from the center of \( \mathrm{patch}_i \) to \( P_k \), and \( \alpha \) is a positive constant; the larger \( \alpha \) is, the more accurate the result. Balancing accuracy against computational cost, we set \( \alpha = 400 \) in this paper. Analyzing Eq. (9), if \( \theta_i \) is less than 90°, \( \cos\theta_i \) is positive, so \( {g}_{P_k}\left({p}_i\right) \) is approximately equal to 1 and \( p_i \) affects the iteration. If \( \theta_i \) is greater than 90°, \( \cos\theta_i \) is negative, so \( {g}_{P_k}\left({p}_i\right) \) is approximately equal to 0 and \( p_i \) has no effect on the iteration.

Afterwards, based on the mean shift vector and the constraint of the camera observation distance, the best observation position for the current visual object can be calculated by Eq. (10):

$$ {P}_e=\underset{P_k}{\mathrm{argmin}}\left\Vert F\left({P}_k\right)\right\Vert $$
(10)

where \( P_k \) is the \( k \)-th iterative position.

The constraint under which \( \left\Vert F\left({P}_k\right)\right\Vert \) is minimized is that the distance from the initial observation position \( P_0 \) to the center of the visual object equals the distance from the best observation position \( P_e \) to the center of the visual object.

In this paper, the initial iteration position is \( P_0 = P_{begin} \) and the allowable error is \( \varepsilon = 0.1 \). When \( \left\Vert F\left({P}_k\right)\right\Vert > \varepsilon \), set \( P_{k+1} = F\left({P}_k\right) + P_k \) and continue iterating according to Eq. (10); otherwise, the best observation position of the self-occlusion avoidance result is \( P_e = P_k \).
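A minimal NumPy sketch of this search is given below, assuming `patches` is a list of the \( (S_i, C_i, \mathbf{n}_i) \) triples computed earlier and `object_center` is the 3D center of the visual object. Enforcing the observation-distance constraint by projecting each iterate back onto the sphere around the object center, and capping the number of iterations, are our assumptions rather than details stated in the paper.

```python
import numpy as np

def self_occlusion_avoidance_position(patches, P_begin, object_center,
                                      alpha=400.0, eps=0.1, max_iter=100):
    S = np.array([p[0] for p in patches], dtype=float)
    C = np.array([p[1] for p in patches], dtype=float)
    n = np.array([p[2] for p in patches], dtype=float)
    n_unit = n / np.linalg.norm(n, axis=1, keepdims=True)
    object_center = np.asarray(object_center, dtype=float)
    P_k = np.asarray(P_begin, dtype=float)

    # Eq. (6): per-patch best observation position p_i, obtained by moving the
    # normal (normalized to the length |P_begin - object_center|) to C_i.
    L = np.linalg.norm(P_k - object_center)
    p = C + L * n_unit

    w = S / S.sum()                                   # Eq. (8): area weights
    for _ in range(max_iter):
        d = P_k - C                                   # patch centers -> camera
        cos_theta = np.einsum('ij,ij->i', n_unit, d) / np.linalg.norm(d, axis=1)
        g = 1.0 / (1.0 + np.exp(np.clip(-alpha * cos_theta, -500.0, 500.0)))  # Eq. (9)
        F = np.sum((g * w)[:, None] * (p - P_k), axis=0) / len(p)             # Eq. (7)
        if np.linalg.norm(F) <= eps:                  # Eq. (10): stop when the shift is small
            break
        P_k = P_k + F
        # Keep the observation distance to the object center constant (assumption).
        P_k = object_center + L * (P_k - object_center) / np.linalg.norm(P_k - object_center)
    return P_k                                        # P_e
```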

Then the best observation direction of the self-occlusion avoidance result is calculated. Firstly, the midpoint \( C_m \) of the centers of all patches visible when the camera is at \( P_e \) is calculated by

$$ {C}_m=\frac{\sum \limits_{p_i\in {S}_p}^{i\in \left[1,N-1\right]}{g}_{P_e}\left({p}_i\right){C}_i}{\sum \limits_{p_i\in {S}_p}{g}_{P_e}\left({p}_i\right)} $$
(11)

where \( C_i \) is the center of \( \mathrm{patch}_i \).

After calculating \( C_m \), the best observation direction \( \mathbf{V}_e \) of the self-occlusion avoidance result is defined as the direction from \( P_e \) to \( C_m \), namely

$$ {\mathbf{V}}_e={C}_m-{P}_e $$
(12)

Finally, the result of self-occlusion avoidance is \( \left({P}_e,{\mathbf{V}}_e\right) \).
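For illustration, the following sketch evaluates Eqs. (11) and (12) directly, reusing the patch triples and the sigmoid gate of Eq. (9); treating the gate values as soft visibility weights rather than a hard visible/invisible test is an assumption of this sketch.

```python
import numpy as np

def best_observation_direction(patches, P_e, alpha=400.0):
    C = np.array([p[1] for p in patches], dtype=float)
    n = np.array([p[2] for p in patches], dtype=float)
    n_unit = n / np.linalg.norm(n, axis=1, keepdims=True)
    P_e = np.asarray(P_e, dtype=float)

    d = P_e - C                                       # patch centers -> camera
    cos_theta = np.einsum('ij,ij->i', n_unit, d) / np.linalg.norm(d, axis=1)
    g = 1.0 / (1.0 + np.exp(np.clip(-alpha * cos_theta, -500.0, 500.0)))  # gate of Eq. (9)

    C_m = (g[:, None] * C).sum(axis=0) / g.sum()      # Eq. (11): weighted midpoint
    return C_m - P_e                                  # Eq. (12): V_e
```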

4 3D motion estimation

4.1 Matching two depth images by utilizing ORB algorithm

To estimate the 3D motion of the visual object, the two acquired depth images must first be matched. Because the ORB (Oriented FAST and Rotated BRIEF) algorithm of [33] is fast and efficient, it is used to pre-match the two depth images. The concrete process is as follows. Firstly, the mean curvature of each pixel in the two acquired depth images is calculated and used as the pixel's feature. Then the matching points are obtained by applying the ORB algorithm to the two depth images acquired before and after the visual object's motion. Fig. 4 shows the matching results for the two depth images of the visual object Bunny acquired before and after its motion; the blue points are feature points, and two feature points connected by a green line form a pair of matching points.
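A hedged OpenCV sketch of this pre-matching step is shown below. The paper uses the per-pixel mean curvature as the feature image; the conversion of that curvature map to an 8-bit image before running ORB, and the brute-force Hamming matcher, are assumptions made for this example.

```python
import cv2
import numpy as np

def orb_prematch(curv1, curv2, n_features=500):
    """Pre-match two mean-curvature maps (float arrays) with ORB."""
    def to_u8(img):
        # Assumption: rescale the curvature map to 8-bit so ORB can consume it.
        img = np.nan_to_num(img)
        return cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, des1 = orb.detectAndCompute(to_u8(curv1), None)
    kp2, des2 = orb.detectAndCompute(to_u8(curv2), None)

    # Brute-force Hamming matching with cross-checking.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    return pts1, pts2
```

The matched pixel coordinates would then be back-projected to 3D using the depth values before the filtering step of Section 4.2.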

Fig. 4 The matching results of the two depth images acquired before and after the motion of the visual object Bunny

4.2 Filtering matching results to get accurate matching results

Because errors may cause mismatches, a method is proposed to filter the matching results with a rigid-invariance constraint and obtain accurate matches. The idea is as follows. Under the rigid-invariance constraint, the relative position of each matching point on the visual object is invariant while the object moves. Therefore, the triangle constructed by any three accurately matched points in the first image and the triangle constructed by their corresponding points in the second image should be congruent; inaccurately matched points generally do not satisfy this condition, so they can be removed by exploiting this property. Based on this property, a triangle-based mismatch filtering algorithm is presented in this paper. Its main steps are as follows.

Firstly, all matching points in the first image form the set M1, and their corresponding points in the second image form the set M2. Secondly, a triangle t1 is constructed from three points in M1, a triangle t2 is constructed from their corresponding points in M2, and each edge length of t1 and t2 is calculated. Thirdly, by comparing the corresponding edge lengths of t1 and t2, the matching points are filtered by the following rules.

(1) If all corresponding edge lengths of t1 and t2 are equal, namely t1 ≅ t2, the relative positions of the three pairs of matching points are invariant, so the three pairs are accurate matches. The three points of t1 are removed from M1 and put into the set \( {M}_1^{\prime } \), and their corresponding points of t2 are removed from M2 and put into the set \( {M}_2^{\prime } \). Three new pairs of matching points are then taken from M1 and M2 to construct new triangles t1 and t2 for further judgment.

(2) If two pairs of corresponding edge lengths of t1 and t2 are equal and one pair is not, one or both of the points forming the unequal edge may be mismatches. To reduce the time complexity of the algorithm, both points forming the unequal edge are treated as mismatches: the two points forming the unequal edge of t1 are removed from M1, and the two points forming the unequal edge of t2 are removed from M2. Together with the remaining pair of matching points, two new pairs from M1 and M2 are taken to construct new triangles t1 and t2 for further judgment.

(3) If only one pair of corresponding edge lengths of t1 and t2 is equal and two pairs are not, the two points forming the equal edge are likely accurate matches while the remaining point of the triangle is a mismatch, so the common point of the two unequal edges of t1 is removed from M1 and the common point of the two unequal edges of t2 is removed from M2. Together with the remaining two pairs of matching points, one new pair from M1 and M2 is taken to construct new triangles t1 and t2 for further judgment.

(4) If all corresponding edge lengths of t1 and t2 are unequal, the three pairs of points may all be mismatches, so the three points of t1 are removed from M1 and the three points of t2 are removed from M2. Three new pairs of matching points are then taken from M1 and M2 to construct new triangles t1 and t2 for further judgment.

The filtering process is repeated until fewer than three points remain in both M1 and M2. The points in \( {M}_1^{\prime } \) and \( {M}_2^{\prime } \) are then the accurate matching points. Fig. 5 shows the accurate matching results obtained by filtering the matches in Fig. 4.
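A minimal sketch of this triangle-based filter follows. It assumes the matched points are available as 3D coordinates, and it introduces an edge-length tolerance `tol` because measured edge lengths are never exactly equal; both are assumptions of this illustration, not details fixed by the paper.

```python
import numpy as np

def _edges(a, b, c):
    # The three edge lengths [|ab|, |bc|, |ca|] of triangle abc.
    return np.array([np.linalg.norm(a - b), np.linalg.norm(b - c), np.linalg.norm(c - a)])

def filter_matches(M1, M2, tol=1e-3):
    """Return the accepted (m1, m2) pairs, i.e. the contents of M1' and M2'."""
    pool = list(zip(map(np.asarray, M1), map(np.asarray, M2)))
    kept, carry = [], []
    while len(carry) + len(pool) >= 3:
        while len(carry) < 3:
            carry.append(pool.pop())
        (a1, a2), (b1, b2), (c1, c2) = carry
        eq = np.abs(_edges(a1, b1, c1) - _edges(a2, b2, c2)) < tol
        if eq.all():                      # rule (1): all three pairs accurate
            kept.extend(carry)
            carry = []
        elif eq.sum() == 2:               # rule (2): drop the two points of the unequal edge
            bad = int(np.argmin(eq))      # edge 0 = ab, 1 = bc, 2 = ca
            carry = [carry[(bad + 2) % 3]]            # keep the vertex opposite that edge
        elif eq.sum() == 1:               # rule (3): drop the common point of the unequal edges
            good = int(np.argmax(eq))
            carry = [carry[good], carry[(good + 1) % 3]]  # keep the equal edge's endpoints
        else:                             # rule (4): all three pairs discarded
            carry = []
    return kept
```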

Fig. 5 The accurate matching results after filtering the matching results in Fig. 4

4.3 The 3D motion estimation

According to the accurate matching results, the 3D motion can be estimated. The relation between corresponding points in the depth images acquired before and after the motion of the visual object is

$$ {m}_{2_i}=\boldsymbol{R}{m}_{1_i}+\boldsymbol{T} $$
(13)

where \( {m}_{1_i} \) is a point in \( {M}_1^{\prime } \), \( {m}_{2_i} \) is the point in \( {M}_2^{\prime } \) corresponding to \( {m}_{1_i} \), R is the unit orthogonal rotation matrix, and T is the translation vector.

As can be seen from Eq. (13), the purpose of 3D motion estimation is to determine the R and T that make every pair \( {m}_{1_i} \), \( {m}_{2_i} \) satisfy Eq. (13). Because the point-to-plane ICP (Iterative Closest Point) algorithm of [28] is faster than the traditional ICP algorithm, R and T are calculated with the method in [28].

After solving R and T, the result of 3D motion estimation can be expressed as

$$ {d}_2=\boldsymbol{R}{d}_1+\boldsymbol{T} $$
(14)

where \( d_1 \) is the 3D point corresponding to a pixel in the depth image acquired in the current view and \( d_2 \) is the 3D point corresponding to that pixel in the depth image acquired in the next view.
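The paper computes R and T with the point-to-plane ICP of [28]. As a simpler, self-contained illustration of Eq. (13), the sketch below estimates the rigid transform in closed form from the filtered correspondences using the standard SVD (Kabsch) solution; it is not the method used in the paper, only an accessible stand-in.

```python
import numpy as np

def estimate_rigid_motion(M1p, M2p):
    """Least-squares R, T such that M2p_i ~= R @ M1p_i + T."""
    P = np.asarray(M1p, dtype=float)        # points before motion, shape (K, 3)
    Q = np.asarray(M2p, dtype=float)        # corresponding points after motion
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mu_p).T @ (Q - mu_q)           # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                # guard against a reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    T = mu_q - R @ mu_p
    return R, T
```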

5 The determination of next best view

The next best view is determined by combining the results of self-occlusion avoidance and 3D motion estimation. Because the position of the moving object changes constantly, the best view should change along with it. In this paper, the self-occlusion avoidance result \( \left({P}_e,{\mathbf{V}}_e\right) \) is the best view when the visual object is not moving. When the object is moving, the position relation between the best view and the object should remain constant, so \( \left({P}_e,{\mathbf{V}}_e\right) \) should be transformed according to Eq. (14). The self-occlusion avoidance result is calculated from the first depth image, and the object's motion is estimated from two adjacent depth images (the first and the second) to obtain its motion information. Since the purpose of acquiring the second depth image is to obtain the motion information, the effect of the proposed next best view method is verified with the third depth image. Moreover, because the camera is initially at the observation position of the first depth image, the next best view \( \left({P}_{nbv},{\mathbf{V}}_{nbv}\right) \) is calculated by applying Eq. (14) twice, namely

$$ \left\{\begin{array}{l}{P}_{nbv}={\boldsymbol{R}}^{\ast}\boldsymbol{R}{P}_e+\left(\boldsymbol{R}+\mathbf{I}\right)\boldsymbol{T}\\ {}{V}_{nbv}={\boldsymbol{R}}^{\ast}\boldsymbol{R}{V}_e+\left(\boldsymbol{R}+\mathbf{I}\right)\boldsymbol{T}\end{array}\right. $$
(15)

where R and T are the unit orthogonal rotation matrix and the translation vector calculated by the 3D motion estimation, I is the identity matrix, and \( \left({P}_e,{\mathbf{V}}_e\right) \) is the result of self-occlusion avoidance.
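For reference, the sketch below applies Eq. (15) directly with NumPy; following the equation as written, the translation term is applied to \( \mathbf{V}_e \) as well as \( P_e \).

```python
import numpy as np

def next_best_view(P_e, V_e, R, T):
    """Eq. (15): transport (P_e, V_e) by the estimated motion applied twice."""
    P_e, V_e, R, T = map(lambda a: np.asarray(a, dtype=float), (P_e, V_e, R, T))
    I = np.eye(3)
    P_nbv = R @ (R @ P_e) + (R + I) @ T
    V_nbv = R @ (R @ V_e) + (R + I) @ T
    return P_nbv, V_nbv
```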

6 Experiments and analysis

6.1 Experimental environment

To validate the effectiveness of the proposed method, experiments are conducted on 3D object models from the Stuttgart Range Image Database. The hardware environment is an Intel(R) Pentium(R) CPU G2020 @ 2.90 GHz with 4.00 GB of memory. The proposed method is implemented in C++ with OpenGL. In the simulation experiments, the OpenGL projection matrix parameters are (60, 1, 200, 600), the window size is 400 × 400, the initial observation position is (0, −1, 300) and the initial observation direction is (0, 1, −300). In the real experiments, depth images are acquired with a Kinect; the horizontal viewing angle is 57°, the distance from the camera to the center of the object is 1200 mm, and the window size is 640 × 480.

6.2 Experimental results and analysis

To validate the feasibility and real-time performance of the proposed method, Section 6.2.1 gives the experimental results and analysis of self-occlusion avoidance, Section 6.2.2 gives those of 3D motion estimation, and Section 6.2.3 gives those of the next best view method for moving objects.

6.2.1 Experiments of self-occlusion avoidance

Fig. 6 shows the experimental results of the self-occlusion avoidance method proposed in this paper. Fig. 6a gives the name of the visual object. Fig. 6b is the depth image acquired in the initial view. Fig. 6c shows the self-occlusion boundaries and nether adjacent boundaries, where the red lines are self-occlusion boundaries and the green lines are nether adjacent boundaries. Fig. 6d shows the normal vector of each patch. Fig. 6e shows the patches visible from the result of self-occlusion avoidance. Fig. 6f is the depth image acquired from the result of self-occlusion avoidance.

Fig. 6 The experimental results of self-occlusion avoidance: (a) visual object; (b) depth image acquired in the initial view; (c) self-occlusion boundaries and nether adjacent boundaries; (d) normal vector of each patch; (e) visible patches from the result of self-occlusion avoidance; (f) depth image acquired from the result of self-occlusion avoidance

As can be seen from Fig. 6, for the visual object Duck the self-occlusion phenomenon is not obvious, so fewer patches are visible from the self-occlusion avoidance result, i.e., the red region in Fig. 6e is smaller. For the visual objects Bunny, Mole, Rocker and Dragon the self-occlusion phenomenon is obvious, so more patches are visible, i.e., the red region in Fig. 6e is larger. Therefore, the more obvious the self-occlusion phenomenon is, the more effective the proposed method is. Comparing the depth images in Fig. 6b and Fig. 6f also shows that the self-occlusion avoidance results calculated by the proposed method align with the observing habits of human vision.

To better evaluate the proposed self-occlusion avoidance method, it is compared with the methods in [15, 17], which are both based on depth images and consider occlusion. Fig. 7 shows the experimental results of the different methods. Fig. 7a gives the name of the visual object. Fig. 7b is the depth image acquired in the initial view. Fig. 7c, d and e are the depth images acquired from the results calculated by the method in [17], the method in [15], and the proposed self-occlusion avoidance method, respectively.

Fig. 7 The experimental results of different methods: (a) visual object; (b) depth image acquired in the initial view; (c) depth image acquired from the result calculated by the method in [17]; (d) depth image acquired from the result calculated by the method in [15]; (e) depth image acquired from the result calculated by the proposed self-occlusion avoidance method

As Fig. 7 shows, the results calculated by the method in [17] focus on observing the back of the visual object, and the results calculated by the method in [15] focus on observing the unknown region adjoining the point of largest information gain in the initial view. In contrast, based on the self-occlusion information in the depth image acquired in the initial view, the results calculated by the proposed self-occlusion avoidance method focus on observing the self-occlusion regions, which aligns with the observing habits of human vision.

To further examine the effect of the proposed method, Table 1 gives a quantitative evaluation of the different methods. In Table 1, \( N_n \) is the number of surface points, \( N_o \) is the number of overlapping points, \( N_{new} = N_n - N_o \) is the number of newly added points, \( R_o \) is the overlap rate and \( R_{new} \) is the newly added rate.
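As an illustration of how such metrics can be computed, the sketch below counts a surface point of the new view as overlapping if it lies within a distance `radius` of a point already seen in the initial view, and takes \( R_o = N_o/N_n \) and \( R_{new} = N_{new}/N_n \); both the nearest-neighbor criterion and these rate definitions are assumptions, since the paper does not specify them.

```python
import numpy as np
from scipy.spatial import cKDTree

def view_gain_metrics(initial_points, new_points, radius=1.0):
    """Hypothetical overlap / new-point metrics for two surface point sets."""
    tree = cKDTree(np.asarray(initial_points, dtype=float))
    dists, _ = tree.query(np.asarray(new_points, dtype=float), k=1)
    N_n = len(new_points)                 # surface points in the new view
    N_o = int(np.sum(dists <= radius))    # points also covered by the initial view
    N_new = N_n - N_o                     # newly added points
    return {"N_n": N_n, "N_o": N_o, "N_new": N_new,
            "R_o": N_o / N_n, "R_new": N_new / N_n}
```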

Table 1 The quantitative evaluation of different methods

Analysis of Table 1 shows that, compared with the method in [17], for visual objects whose back region is larger than their self-occlusion region, such as Duck, Bunny and Mole, the number of newly added points in the depth images acquired from the result of the proposed method is relatively smaller; but for visual objects whose back region is smaller than their self-occlusion region, such as Rocker and Dragon, the number of newly added points from the proposed method is relatively larger (although the newly added rate is slightly lower). The reason is that the method in [17] focuses on the back of the visual object, so when the back region is smaller than the self-occlusion region it cannot achieve good results; the method in [17] therefore has a considerable limitation. Compared with the method in [15], for visual objects whose surface is not complex, such as Duck, Bunny, Mole and Rocker, the number of newly added points from the proposed method is relatively larger; but for visual objects with a complex surface, such as Dragon, the number of newly added points from the proposed method is slightly smaller, although the newly added rate is higher. The reason is that the method in [15] focuses on the unknown region adjoining the point of largest information gain, so when the surface is not complex it cannot achieve good results; the method in [15] therefore also has a considerable limitation. The experimental results show that the proposed self-occlusion avoidance method overcomes the limitations of the methods in [15, 17] and has better applicability to different visual objects.

Because the research object in this paper is moving, the real-time requirement is high. Table 2 compares the time consumption of the method in [17], the method in [15] and the proposed method.

Table 2 The comparison of time consumption between different methods

As can be seen from Table 2, the time consumption of the proposed method is far less than that of the methods in [15, 17]. The average time for obtaining the self-occlusion information with the method in [41] is 47.43 ms; even when this time is included, the average time consumption is 49.57 ms, which is still far less than that of the methods in [15, 17]. Therefore, the proposed self-occlusion avoidance method has relatively high real-time performance.

6.2.2 Experiments of 3D motion estimation

To validate the feasibility and real-time performance of the proposed 3D motion estimation method, several different methods are used to estimate various motions of Bunny, and the unit orthogonal rotation matrices and translation vectors calculated by the different methods are used to move the vector (0, −1, 300), which points from the origin of the world coordinate system to the initial observation position. The results and time consumption of the different methods are compared in Table 3. The ideal results are obtained by multiplying the modelview matrix extracted directly from OpenGL with the vector (0, −1, 300). Method 1 uses only the ICP algorithm of [28] to estimate the 3D motion. Method 2 combines the ORB algorithm and the ICP algorithm, but does not filter the matching results during motion estimation. The proposed method combines the ORB algorithm and the ICP algorithm and also filters the matching results. The motion modes include: translation along the vector [1,0,0]^T at 6 cm/s; rotation around the vector [4,1,2]^T at 60°/s; rotation around the vector [2,5,1]^T at 20°/s combined with translation along the vector [1,0,0]^T at 10 cm/s; and rotation around the vector [2,1,6]^T at 60°/s combined with translation along the vector [1,0,0]^T at 4 cm/s.

Table 3 Results and time consumption of different methods

As can be seen from Table 3, the time consumption of method 1 is 7 to 9 times that of method 2 and of the proposed method. The reason is that method 1 uses only the ICP algorithm to estimate the 3D motion, so all points in the two acquired depth images are iterated over, which limits its efficiency. Method 2 first matches points with the ORB algorithm and then iterates over the matching points with the ICP algorithm, greatly reducing the number of iterated points; compared with method 1, its time consumption is therefore greatly reduced. However, owing to matching errors and other factors, the results of method 2 differ considerably from the ideal results, so method 2 is limited in accuracy. Building on method 2, the proposed method uses the rigid-invariance constraint to filter the matching points, which reduces the number of mismatched and iterated points, so the proposed method is more accurate than method 2 and its time consumption is lower. Overall, by combining the ORB and ICP algorithms, the proposed method reduces the time consumption of the ICP algorithm, and the rigid-invariance constraint improves the accuracy of the 3D motion estimation. The proposed method thus overcomes the limitations of methods 1 and 2 and achieves higher real-time performance and accuracy than both.

6.2.3 Experiments of next best view

To validate the feasibility of the next best view method proposed in this paper, Fig. 8 shows the depth images acquired in the next best views calculated by the proposed method when the visual objects undergo different motions. The visual objects Duck, Bunny, Mole, Rocker and Dragon are 3D object models, and the visual objects Kettle and Printer are real objects. Fig. 8a gives the name of the visual object. Fig. 8b is the depth image acquired in the initial view. Fig. 8c is the depth image acquired from the result of self-occlusion avoidance. Fig. 8d is the depth image acquired in the next best view when the visual object translates along the vector [1,−1,−1]^T at \( 2\sqrt{3}\ \mathrm{cm}/\mathrm{s} \). Fig. 8e is the depth image acquired in the next best view when the visual object rotates around the vector [2,1,1]^T at 30°/s. Fig. 8f is the depth image acquired in the next best view when the visual object rotates around the vector [−4,1,2]^T at 20°/s and translates along the vector [2,0,1]^T at \( 2\sqrt{5}\ \mathrm{cm}/\mathrm{s} \).

Fig. 8 Depth images acquired in the next best view when the visual objects undergo different motions: (a) visual object; (b) depth image acquired in the initial view; (c) depth image acquired from the result of self-occlusion avoidance; (d) depth image acquired in the next best view when the visual object translates along the vector [1,−1,−1]^T at \( 2\sqrt{3}\ \mathrm{cm}/\mathrm{s} \); (e) depth image acquired in the next best view when the visual object rotates around the vector [2,1,1]^T at 30°/s; (f) depth image acquired in the next best view when the visual object rotates around the vector [−4,1,2]^T at 20°/s and translates along the vector [2,0,1]^T at \( 2\sqrt{5}\ \mathrm{cm}/\mathrm{s} \)

As can be seen from Fig. 8, the depth images of the 3D object models are ideal (low noise and smooth boundaries), whereas the depth images of the real objects acquired with the Kinect contain noise pixels and depth data loss. Because the next best view is determined from the self-occlusion information of the moving object, the depth images acquired in the next best views under different motions should be the same as the depth image acquired from the result of self-occlusion avoidance. Analyzing the depth images in Fig. 8c, d, e and f, for the ideal 3D models the depth images acquired in the next best views while the object is moving are almost the same as the depth image acquired from the self-occlusion avoidance result. For the real objects, the difference is slightly larger than for the ideal 3D object models. Analysis of the whole experimental process shows that the 3D motion estimation is the main cause of this difference: compared with the ideal 3D models, the noise pixels and depth data loss in the Kinect depth images complicate the pre-matching of the two depth images. The noise pixels corrupt the mean curvature feature value of each pixel, and the depth data loss reduces the number of matching points, which makes the 3D motion estimation less accurate. The difference for the real objects is therefore slightly larger, but it is not obvious. This shows that the proposed method is well suited to visual objects undergoing different motions.

To validate the effect of the next best view method in this paper, Table 4 gives a quantitative evaluation of the result of self-occlusion avoidance and of the next best views when the visual object undergoes different motions. In Table 4, \( N_n \) is the number of surface points, \( N_{new} \) is the number of newly added points and \( R_{new} \) is the newly added rate. Motion mode 1 is translation along the vector [1,−1,−1]^T at \( 2\sqrt{3}\ \mathrm{cm}/\mathrm{s} \); motion mode 2 is rotation around the vector [2,1,1]^T at 30°/s; motion mode 3 is rotation around the vector [−4,1,2]^T at 20°/s combined with translation along the vector [2,0,1]^T at \( 2\sqrt{5}\ \mathrm{cm}/\mathrm{s} \).

Table 4 The quantitative evaluation of the result of self-occlusion avoidance and different next best views

It can be seen that when the visual object undergoes different motions, the number of surface points \( N_n \), the number of newly added points \( N_{new} \) and the newly added rate \( R_{new} \) in the depth images acquired in the different next best views calculated by the proposed method are almost the same as those in the depth image acquired from the result of self-occlusion avoidance. Even taking into account the influence of the 3D motion estimation on the next best view result, the proposed next best view method performs well.

Table 4 also shows that, when the visual object undergoes different motions, the number of surface points \( N_n \), the number of newly added points \( N_{new} \) and the newly added rate \( R_{new} \) in the depth images acquired in the next best views calculated by the proposed method are almost the same. Moreover, there is no significant difference between the quantitative results of the ideal 3D object models and the real objects, which suggests that the proposed method is well suited to different motion modes.

Comprehensive analysis of Fig. 8 and Table 4 shows that for the 3D object models an ideal next best view can be determined by the proposed method. For the real objects, the noise pixels and depth data loss affect the 3D motion estimation and introduce some errors into the experimental results, but the difference from the ideal 3D object models is not obvious. Thus, although noise pixels and depth data loss do have a slight impact on the 3D motion estimation, a good next best view can still be determined, which shows that the proposed method is robust not only for ideal 3D object models but also for real objects containing noise pixels and depth data loss.

Table 5 shows the time consumption of the proposed method when the visual objects undergo different motions. Motion mode 1 is translation along the vector [1,−1,−1]^T at \( 2\sqrt{3}\ \mathrm{cm}/\mathrm{s} \); motion mode 2 is rotation around the vector [2,1,1]^T at 30°/s; motion mode 3 is rotation around the vector [−4,1,2]^T at 20°/s combined with translation along the vector [2,0,1]^T at \( 2\sqrt{5}\ \mathrm{cm}/\mathrm{s} \); motion mode 4 is rotation around the vector [3,3,6]^T at 40°/s combined with translation along the vector [0,2,1]^T at \( 2\sqrt{5}\ \mathrm{cm}/\mathrm{s} \); and motion mode 5 is rotation around the vector [1,−2,2]^T at 20°/s combined with translation along the vector [1,2,0]^T at \( 2\sqrt{5}\ \mathrm{cm}/\mathrm{s} \).

Table 5 The time consumption of the proposed method when visual objects are in different motions

Table 5 shows that the average time consumption of the proposed method is 100.51 ms, much less than the average time consumption of the methods in [15, 17], which do not consider motion. Therefore, the proposed method has relatively high real-time performance. Moreover, there is no significant difference between the time consumption for the ideal 3D object models and for the real objects, which shows that the noise pixels and depth data loss in the Kinect depth images have little impact on the time consumption and that the proposed method offers relatively high real-time performance and applicability for real visual objects.

7 Conclusions

In this paper, a next best view method for moving objects based on the self-occlusion information in depth images is proposed. With this method, the next best view of a moving object can be determined effectively and in real time. The proposed method is validated by both simulation and real experiments.

The major contribution of this paper is a next best view method for moving objects. The proposed method determines the next best view of a moving object by combining self-occlusion avoidance with 3D motion estimation, overcoming the limitation that traditional next best view methods apply only to static visual objects, and it provides a means of solving the problem that self-occlusion avoidance methods do not work for moving objects.

Another important contribution is a self-occlusion avoidance method based on the idea of mean shift. Based on the self-occlusion information, this method first models the self-occlusion regions by space quadrilateral subdivision; then, following the idea of mean shift, the self-occlusion avoidance result is calculated from the quadrilateral information. This method provides a new means of performing self-occlusion avoidance and significantly reduces the time consumption of traditional self-occlusion avoidance methods.

Finally, a 3D motion estimation method combining the ORB algorithm and the ICP algorithm is proposed, which significantly reduces the time consumption. In addition, a method for filtering the matching results based on the rigid-invariance constraint is proposed to improve the precision of the 3D motion estimation.

The method proposed in this paper offers a new way of determining the next best view. Future work may follow two directions. Because the existing next best view evaluation criteria all target static visual objects, we will develop an evaluation criterion for the next best view of moving objects. We also intend to determine the next best view of a moving object in a complex environment.