1 Introduction

With the advancement of Artificial Intelligence (AI) and easily available computing devices, robots are becoming usable for various complicated tasks in our day-to-day lives. Telepresence [7, 68] and teleoperation robots have come into the spotlight in recent years. These robots enable a human to operate a robot from a remote location and find significant applications in remote teaching, meetings, surveillance, etc. Given the COVID-19 pandemic, telepresence robots are in high demand in hospital setups to minimize physical contact between medical staff and COVID-19 patients. In parallel, several digital-twin applications are in high demand following the introduction of Industry 4.0, a holistic automation and smart manufacturing execution architecture. For example, images captured on a production line can be used directly to generate a digital replica of the environment. This replica can be passed to the product designer, who can then provide expert guidance on improving the production process. These applications require a good structural representation of the scene for remote assistance and guidance. XR-aided industry, i.e. virtual reality, augmented reality, and mixed reality technologies, also needs 3D reconstruction to create the base environment. The paper [21] thoroughly reviews the importance of 3D scene generation for telepresence, teleoperation and XR technologies.

As robotics applications demand, the reconstructed 3D environment should not be too sparse to understand occupancy, should not be too heavy to transfer over the network for teleoperation/telepresence applications, and at the same time should preserve the geometry of the scene. For high-fidelity scene reconstruction, LIDAR sensors [62, 63, 65, 73] are often used, but these solutions are costly. Due to their size, such sensors are often not flexible enough to mount on smaller robots/drones. Moreover, the output point-clouds contain redundant points and are not suitable for transfer over the network due to their large disk size. RGB image based reconstruction methods work well for constrained static environments but fail in texture-less or poorly textured environments; the results generated by these methods also suffer from depth inconsistency. Recently, the wide availability of cheaper and smaller depth sensors like Microsoft's Kinect [45], HoloLens [32], and Intel RealSense [46] has encouraged researchers to combine depth images with RGB [10, 17, 76, 77] for more consistent 3D reconstruction. In this work, we focus on designing a novel pipeline to model the world for recent robotic applications, given a set of input RGBD images of the environment.

In our proposed pipeline, we mostly exploit planar properties of the scene for pose estimation as well as planar mesh creation. Compared to point features or volumetric representations, the use of higher-level features like planes helps preserve the structural properties of the scene and thus increases the accuracy of the pipeline; a combination of point and planar properties can yield even better results [29, 39, 74, 78, 80]. We segment and model planes with irregular boundary shapes or holes. We further process the non-planar, arbitrary objects of the scene to generate a triangular mesh that keeps their structural properties intact, and we incrementally refine those meshes over time to form a high-quality mesh of the environment. We represent the world model by triangular meshes, as maintaining a point-cloud is costly in terms of both time and space. To generate a mesh from a set of map points, the majority of State-Of-the-Art (SOA) systems use Delaunay Triangulation [9], which suffers from the "curse of dimensionality" [3], i.e. the time complexity increases exponentially in higher dimensions or with a higher number of points. In this paper, we propose two different mesh generation algorithms which are simpler and, unlike Delaunay Triangulation, work well in higher dimensions too.

We design and develop a novel pipeline for generating a world model represented in terms of mesh for robotic applications. The major contributions of this paper are three-fold.

  • We exploit the point and planar properties of the scene for tracking and mesh generation of the planes. At the same time, we process the non-planar arbitrary objects for completeness of the scene.

  • For non-planar arbitrary objects, we propose a novel Self Organizing Map (SOM) [36] based mesh generation method, which is more accurate and robust than state-of-the-art (SOA) mesh creation methods and represents the object boundaries better.

  • We propose a simple grid fitting algorithm for planar surfaces, which is less complex compared to the SOA.

We present our pipeline along with the proposed surface fitting algorithms in Section 3. The results are presented and compared with SOA in Section 4. The conclusions and future work are stated briefly in Section 5. We place our contribution with respect to the related SOA in the next section.

2 Related works

3D reconstruction has been a well-researched topic for the last two decades and is popular in both the robotics and computer vision communities. Many research papers have been published on this topic, covering both geometric [35] and learning-based [33] approaches. However, due to complex layouts, occlusions, and complicated object interactions, indoor 3D reconstruction still poses challenges [34, 44]. Structure from Motion (SfM) [87] is one of the main approaches for geometric reconstruction. Existing methods in the SfM literature [5, 14, 23, 75] suffer from computationally intensive steps like feature extraction and matching, mainly due to their use of point features. Unlike those, we use planes as features and match planes between consecutive keyframes. A plane consists of a set of points, so the matching time is significantly reduced because the number of extracted features per frame is much smaller.

Over the last few years, research on SLAM (Simultaneous Localization And Mapping) [25, 28, 47, 51, 58, 59] has progressed a lot, but world model creation with high-fidelity objects still requires substantial research. Although many methods [12, 16, 19, 28, 30, 40, 43, 54, 66, 67, 69, 71] have been proposed in this area in recent times, most of them target offline reconstruction and are not suitable for our application. The world models generated by other papers [16, 30, 66, 71] either do not preserve the structural properties of the environment or are too sparse to understand the environment. Our method applies specialized approaches to non-planar objects and planar objects separately. This results in a reconstruction that combines dense (for non-planar objects) and sparse (for planar objects) meshes, good enough to preserve the structural properties of both. Figure 1 shows the mesh generated by our pipeline on a publicly available dataset [16].

Fig. 1
figure 1

3D Mesh generated by proposed pipeline on Copyroom [16] dataset. (a): original frame, (b): generated normal wire-frame, (c): vertex-colored 3D mesh

Several deep-learning based approaches in the SOA reconstruct the environment in near real-time. As an example, the paper [38] discusses an interesting method that uses a learning model to construct textured 3D shapes with precise geometry from 2D images. Generally, supervised learning models [52] require training on a huge labeled dataset of textured/colored meshes, which is not easy to obtain. To overcome these problems, we propose an unsupervised-learning based novel pipeline to generate a 3D mesh of the environment from an ego-centric RGBD sequence. Two interesting surveys on 3D object reconstruction under the umbrella of deep learning are presented in [33] and [26]. Recently, Mildenhall et al. proposed the Neural Radiance Field (NeRF) [55], a fully-connected neural network able to generate novel views of complex 3D scenes. NeRF takes 2D observations as input and represents the environment as an implicit surface. Many follow-up papers [6, 18, 31, 83, 85] incorporate depth with RGB and demonstrate very good reconstruction. The major drawback of these methods is their high training time due to the global (non-incremental) reconstruction strategy; thus, they are inappropriate for online robotic applications. Our method exploits the topological ordering capability of the unsupervised Self-Organizing-Map (SOM) neural network. Our proposed method does not need a training phase and approximates complex real objects in a few iterations.

Some SOA methods represent each non-planar object in the scene with a predefined model such as CAD. The problem with such methods is obtaining such a huge number of models, which is infeasible for real-life applications. Therefore, we need a more generalized method. Some methods [42, 48, 82] in the literature use SOM-based surface fitting for this purpose. We propose an SOM-based generalized method for generating the mesh by fitting the surface of the objects. Our method maintains the structure and represents the boundary better than the SOA. The next section presents the overall design of our proposed approach.

3 Our system: overview and design

Our approach exploits both the point and planar properties of the scene for pose correction as well as planar mesh creation. Structural reconstruction of the environment often demands higher-level features for better accuracy, while point features give more granular information to achieve high fidelity. So, we choose both points and planes, rather than dense volumetric representations, for feature matching. We use an RGBD sensor's data (from the robot's ego view) as input. We process the RGB data to segment the planes and further refine the result using depth information. First, we use the plane segmentation results to estimate the plane parameters in Hessian/Hesse Normal Form (HNF) [8]. We then use and compare two approaches for plane tracking and pose estimation. In the first approach, we take the plane parameters in HNF and convert them to the Plane Parameter Space (PPS) [72]. Based on the plane parameters in PPS, we match the planes between two consecutive frames and estimate the relative pose between the frames. The other approach exploits both the point and planar properties, as described in Planar Bundle Adjustment (PBA) [84]. We further separate the non-planar points and form a point-cloud by accumulating those points over consecutive frames using the estimated pose. To reduce redundancy and run the pipeline faster, we limit mesh creation to selected frames, also known as keyframes. For mesh creation, we adopt two different approaches. For newly segmented planes in every keyframe, we use a surface approximation-based mesh creation algorithm, whereas for non-planar areas we use the point-cloud to create a mesh with our novel SOM-based approach. The neural network (SOM) based approach provides a highly parallelizable method (described in Section 3.5) for fast output [60]. We update the created meshes incrementally using the estimated pose from the next keyframe onward. Figure 2 depicts the overall pipeline of our proposed method. Next, we describe the major components of our simple yet robust pipeline.

Fig. 2
figure 2

The block diagram of our proposed pipeline of robust 3D scene reconstruction from the robot’s ego view (RGBD sensor)

3.1 Plane segmentation

We adopt the concept of Plane RCNN [49], which uses Mask R-CNN [37] for plane detection and segmentation. We choose Plane RCNN as it is capable of handling any number of planes, whereas other SOA plane segmentation pipelines like PlaneNet [50] and PlaneRecover [81] need prior information about the maximum number of planes. Plane RCNN segments only two categories, either "planar" or "non-planar". We adopt the same concept and use the Mask R-CNN pre-trained on the ScanNet [15] dataset. We further refine the plane boundaries using the depth information. Once the plane segmentation is done, we estimate the plane parameters using RANSAC [24] in HNF, where a plane π is represented by [qπ,dπ]. Here qπ is the unit normal vector [qx,qy,qz] ∈ \(\mathbb{R}^{3}\) of the plane and dπ ∈ \(\mathbb{R}\) is the distance of π from the origin. As the depth information coming from the depth sensor is relatively clean, RANSAC works well with a small number of iterations. After plane segmentation, we need to track planes and estimate the pose.
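To make the plane-fitting step above concrete, the following is a minimal sketch of RANSAC plane fitting in HNF on the masked depth points. The inlier threshold, iteration count and the final SVD refinement are illustrative assumptions, not the exact settings of our implementation.

```python
import numpy as np

def ransac_plane_hnf(points, n_iters=50, inlier_thresh=0.01, seed=None):
    """Fit a plane in HNF [q, d] (q . p + d = 0) to an (N, 3) point array.

    A small number of iterations usually suffices because sensor depth on a
    segmented planar region is relatively clean.
    """
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = None, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        q = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(q)
        if norm < 1e-9:                       # degenerate (collinear) sample
            continue
        q = q / norm
        d = -np.dot(q, p0)
        dist = np.abs(points @ q + d)         # point-to-plane distances
        inliers = dist < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (q, d)
    # Refine on all inliers with a least-squares (SVD) fit.
    inlier_pts = points[best_inliers]
    centroid = inlier_pts.mean(axis=0)
    _, _, vt = np.linalg.svd(inlier_pts - centroid)
    q = vt[-1]
    d = -np.dot(q, centroid)
    return q, d, best_inliers
```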

3.2 Plane tracking and pose estimation

For plane tracking, we adopt the PPS [72] based plane representation for frame-to-frame plane matching proposed by Sun et al. [72]. A plane π = [qx,qy,qz,dπ] in Cartesian coordinates is transformed into PPS using (1).

$$ \left\{ \begin{array}{l} \theta_{\pi} = \arccos (q_{z})\\ \varphi_{\pi} = \operatorname{atan2} (q_{y},q_{x})\\ d_{\pi} = d_{\pi} \end{array}\right. $$
(1)

Every Cartesian plane is represented by a point in PPS; i.e. all the Cartesian points lying on the same plane are represented by that single point in PPS. We exploit this property to find plane correspondences between two consecutive frames. Based on the correspondences, this algorithm estimates the relative pose R,t between two consecutive frames. Further, we use a plane-based bundle adjustment to refine the poses, as described in the next sub-section.
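The PPS conversion of (1) and a simple nearest-neighbour association in PPS can be sketched as follows. The greedy matching and the distance threshold are illustrative simplifications (they ignore, for example, the wrap-around of the angle \(\varphi_{\pi}\)) and are not the exact matching strategy of [72].

```python
import numpy as np

def to_pps(plane):
    """Map a plane [qx, qy, qz, d] in HNF to its PPS point (theta, phi, d) as in (1)."""
    qx, qy, qz, d = plane
    return np.array([np.arccos(np.clip(qz, -1.0, 1.0)), np.arctan2(qy, qx), d])

def match_planes(planes_prev, planes_curr, max_dist=0.2):
    """Greedy nearest-neighbour association of planes between two consecutive frames in PPS."""
    pps_prev = np.array([to_pps(p) for p in planes_prev])
    matches = []
    for i, p in enumerate(planes_curr):
        dist = np.linalg.norm(pps_prev - to_pps(p), axis=1)   # distance in PPS
        j = int(np.argmin(dist))
        if dist[j] < max_dist:
            matches.append((j, i))            # (index in previous frame, index in current frame)
    return matches
```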

3.3 Pose refinement

Bundle adjustment [2] is a commonly used optimization technique for camera pose (position and orientation) estimation, but its accuracy highly depends on the number of feature points and the accuracy of the correspondences. On the other hand, planes are high-level geometric primitives of a scene and help reduce the geometric ambiguity caused by noisy feature points. We already have the plane parameters from the previous section; we exploit those parameters along with the plane points to estimate the camera pose. Zhou et al. [84] propose a Planar Bundle Adjustment (PBA) technique that formulates the optimization problem using a point-to-plane cost function for pose estimation. In PBA, the planes, the points, and their relationships are all considered for pose estimation. If the number of plane points is large, this can lead to a large-scale least-squares problem, which is efficiently addressed in [84] itself. Here, the squared distance between a plane and the plane points is used as the cost function.

Let us assume that Ri and ti are, respectively, the rotation and translation that transform the i-th frame from its local coordinate system to the global coordinate system. Also assume \(p_{ijk} \in \mathbb{P}_{j}\), where \(\mathbb{P}_{j}\) is the set of K points sampled from the i-th frame that lie on the j-th plane πj. Note that there may exist multiple planes in one frame. The signed distance between a point \(p_{ijk}\) and the corresponding plane πj can be written as (2).

$$ \delta_{ijk} = [q_{x}, q_{y}, q_{z}]_{j} \cdot (R_{i} p_{ijk} + t_{i}) + d_{\pi_{j}} $$
(2)

Here, '\(\cdot\)' denotes the dot product and \(\delta_{ijk}^{2}\) is the point-to-plane cost. We use PBA to perform a motion-only optimization to get a refined camera pose [Ri, ti] by minimizing the point-to-plane cost. In the next sub-section, we discuss the 3D mesh creation technique.
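As an illustration, the point-to-plane residual of (2) and a motion-only refinement over it for a single frame can be sketched as below. The axis-angle pose parameterization and the generic scipy least-squares solver are our illustrative choices, not the dedicated solver of [84].

```python
import numpy as np
from scipy.spatial.transform import Rotation
from scipy.optimize import least_squares

def point_to_plane_residuals(pose6, planes, points_per_plane):
    """Residuals delta_ijk of (2) for one frame.

    pose6            : [rx, ry, rz, tx, ty, tz], axis-angle rotation plus translation
    planes           : list of (q, d) with q the unit normal and d the offset
    points_per_plane : list of (K_j, 3) arrays of points sampled on plane j
    """
    R = Rotation.from_rotvec(pose6[:3]).as_matrix()
    t = pose6[3:]
    res = []
    for (q, d), pts in zip(planes, points_per_plane):
        res.append((pts @ R.T + t) @ q + d)   # delta_ijk for all K_j points on plane j
    return np.concatenate(res)

def refine_pose(pose6_init, planes, points_per_plane):
    """Motion-only refinement that minimizes the squared point-to-plane cost."""
    sol = least_squares(point_to_plane_residuals, pose6_init,
                        args=(planes, points_per_plane))
    return sol.x
```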

3.4 3D mesh propagation for planes

Delaunay triangulation [9] is a well-known technique for generating triangular meshes in \(\mathbb{R}^{2}\), but its complexity is high for a large number of points [3, 53]; thus, it is not suitable for finer mesh creation. So we avoid it and devise a Cartesian grid based mesh fitting algorithm of our own, described in Algorithm 1. By changing the grid dimension, represented by dgrid, we can control the generated mesh quality from coarse to fine. Figure 3 shows a sample input and output of our algorithm. While Algorithm 1 is suitable for fast meshing of planes, for non-planar objects we need a more general method. Next, we present our proposed method for generating meshes for non-planar objects.

Fig. 3
figure 3

Planar mesh generation using diagonal grid

Algorithm 1
figure a

Generate 3D mesh for plane π = [qπ,dπ].
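A minimal sketch of one plausible reading of this grid-fitting idea is given below: project the plane inliers into 2D plane coordinates, cover them with a Cartesian grid of cell size dgrid, and split every occupied cell into two triangles along its diagonal. The details of Algorithm 1 may differ; this sketch only illustrates the overall approach, and coarser or finer meshes follow simply from changing d_grid.

```python
import numpy as np

def planar_grid_mesh(points, q, d_grid=0.05):
    """Approximate a planar region with a diagonal-split grid mesh.

    points : (N, 3) inlier points of a plane with unit normal q
    Returns (vertices, triangles), with vertices in 3D.
    """
    # Build an orthonormal basis (u, v) on the plane.
    u = np.cross(q, [1.0, 0.0, 0.0])
    if np.linalg.norm(u) < 1e-6:
        u = np.cross(q, [0.0, 1.0, 0.0])
    u /= np.linalg.norm(u)
    v = np.cross(q, u)
    origin = points.mean(axis=0)
    uv = (points - origin) @ np.stack([u, v], axis=1)   # 2D plane coordinates

    # Mark grid cells that contain at least one inlier point.
    cells = set(map(tuple, np.floor(uv / d_grid).astype(int)))

    verts, vid, tris = [], {}, []
    def vertex(i, j):
        if (i, j) not in vid:
            vid[(i, j)] = len(verts)
            verts.append(origin + (i * d_grid) * u + (j * d_grid) * v)
        return vid[(i, j)]

    for (i, j) in cells:                       # two triangles per occupied cell
        a, b = vertex(i, j), vertex(i + 1, j)
        c, e = vertex(i + 1, j + 1), vertex(i, j + 1)
        tris += [(a, b, c), (a, c, e)]
    return np.array(verts), np.array(tris)
```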

3.5 3D mesh creation for non-planar objects

Given a non-planar object point-cloud extracted from a sequence of RGBD images, we need to approximate the object surface with a mesh. To obtain this object point-cloud, we do the following. Given the RGB image, we apply an object segmentation method such as Mask R-CNN [1], which provides the mask of the object. Using this mask, we segment the point-cloud of the object out of the RGBD image. Figure 4 illustrates this with an example.

Fig. 4
figure 4

Different steps of non-planar mesh approximation given an RGBD image. (a): Input RGB image of the scene, (b): object segmentation, (c): depth map, (d): point-cloud of the object extracted from the scene RGB of (a), segmentation mask of (b) and depth map of (c), (e): initialization of the mesh nodes and (f): approximated mesh
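A minimal sketch of this extraction step is given below, assuming a pinhole camera whose intrinsics fx, fy, cx, cy and depth scale come from the sensor calibration (these parameter names are our own, not from the paper).

```python
import numpy as np

def object_point_cloud(depth, mask, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project the masked depth pixels of an RGBD frame into a 3D point-cloud.

    depth : (H, W) raw depth image            mask : (H, W) boolean object mask
    fx, fy, cx, cy : pinhole intrinsics       depth_scale : metres per depth unit
    """
    v, u = np.nonzero(mask & (depth > 0))     # pixel rows (v) and columns (u)
    z = depth[v, u].astype(np.float64) * depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)        # (N, 3) points in the camera frame
```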

The mesh can be compared to a fishing net with a fixed number of knots connected by elastic edges. We cover the entire object (point-cloud) with the fishing net by stretching its nodes. The amount and direction of this stretch are governed by the distribution of the points of the object.

Given the topology-preserving universal approximation capability of Kohonen's Self Organizing Map (SOM) [36], we adopt a two-dimensional SOM lattice (see Fig. 5(a) for an illustration) as our base method for mesh approximation.

Fig. 5
figure 5

Showing boundary effect due to conventional SOM and the proposed SOM. (a): Schematic diagram depicting a 2D 5 × 5 lattice of mesh nodes (the numbers represent the node index), (b): condition of 10 × 10 mesh after initialization, (c): input cap point-cloud, (d): result of applying conventional SOM, (e): result of applying proposed Algorithm 2, (f): (e) overlayed on (d), (g): (d) and (e) overlayed on (c), the red curve shows the boundary of (d)

Let us assume that the point-cloud of the given object consists of l points represented by X = {xi ∈ \(\mathbb{R}^{3}\)}, i = 1,..,l, and that the 2D lattice consists of M = m × m nodes, where the set of nodes is represented by the set of weight vectors W = {wj ∈ \(\mathbb{R}^{3}\)}, j = 1,..,M. Each xi acts as an activation pattern. Given xi, the nodes wj compete among each other. The winning node i(x) learns the pattern of xi following a learning rate η(n) and cooperates with the neighboring nodes for learning. The neighborhood is defined by \(h_{j,i(\textbf{x})}(n)=\exp{(-d^{2}/2\sigma^{2}(n))}\), where d is the Euclidean distance between node i(x) and the j-th node on the lattice, σ is the spread of the neighborhood function, and n is the epoch number. After N epochs, the nodes of the 2D lattice are arranged such that they cover the surface of the object. The base SOM algorithm is given in [36].
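For reference, a minimal sketch of this base SOM fit with the conventional decay schedules of (3) and (4) is shown below. The initial learning rate, the lattice initialization around the point-cloud centroid and the use of lattice coordinates for the neighborhood distance follow common SOM practice and are illustrative assumptions, not the proposed Algorithm 2.

```python
import numpy as np

def som_mesh(X, m=10, n_epochs=100, eta0=0.5, seed=0):
    """Fit an m x m SOM lattice W to a point-cloud X of shape (l, 3) with the standard rule."""
    rng = np.random.default_rng(seed)
    sigma0 = m / 2.0
    tau1, tau2 = n_epochs, n_epochs / np.log(sigma0)
    # Lattice coordinates (for neighbourhood distances) and initial 3D weights.
    grid = np.array([(i, j) for i in range(m) for j in range(m)], dtype=float)
    W = X.mean(0) + 0.01 * rng.standard_normal((m * m, 3))
    for n in range(1, n_epochs + 1):
        eta = eta0 * np.exp(-n / tau1)                        # Eq. (3)
        sigma = sigma0 * np.exp(-n / tau2)                    # Eq. (4)
        for x in X[rng.permutation(len(X))]:
            win = np.argmin(np.linalg.norm(W - x, axis=1))    # competition
            d2 = np.sum((grid - grid[win]) ** 2, axis=1)      # lattice distances to the winner
            h = np.exp(-d2 / (2.0 * sigma ** 2))              # cooperation (neighbourhood)
            W += eta * h[:, None] * (x - W)                   # adaptation of all nodes
    return W.reshape(m, m, 3)
```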

Since SOM [36] is an unsupervised clustering algorithm, a set of input points (xi) is represented by a central mesh node (wj). Thus, the general SOM algorithm does not extend the mesh to the exact boundary of the input point-cloud and does not represent the boundary well (see Fig. 5). Therefore, we modify the conventional SOM algorithm [36] such that the mesh nodes near the boundary have a greater learning rate and less elasticity. As a result, the boundary nodes move freely, almost independently of their neighboring inner nodes, towards the input points on the boundary. Thus, the resulting mesh covers the boundary of the object well.

The existing literature [36] partitions the SOM iterations into two phases: (1) ordering and (2) convergence. We view the SOM iterations as consisting of three phases: (1) an unfurling phase, where from a state of complete disorganization (Fig. 5(b)) the SOM mesh unfurls itself, (2) an ordering phase, where the unfurled mesh approximates the point-cloud surface, and (3) a convergence phase, where the ordered mesh reaches out to the boundary. For an accelerated result, in the unfurling phase both the learning rate η and the neighborhood spread σ should initially be very high and should then decrease very quickly, so that the unfurling achieved in the first iterations does not get undone in subsequent ones. For a similar reason, in the convergence phase the learning rate should decrease very fast. The σ value should also become very small by the end of the ordering phase, so that the boundary nodes of the lattice can approach the boundary points of the object without affecting the already ordered inner nodes of the lattice. The decay functions of η and σ in conventional SOM [36] do not satisfy these requirements. Equations (3) and (4) present the conventional decay functions, and (5) and (6) present the modified decay functions that satisfy our needs.

$$ \eta(n)=\eta_{0}\exp(-n/\tau_{1}), \quad n=1,2,\ldots,N, \quad \tau_{1}=N $$
(3)
$$ \sigma(n)=\sigma_{0}\exp(-n/\tau_{2}), \quad n=1,2,\ldots,N, \quad \tau_{2}=N/\log \sigma_{0} $$
(4)
$$ \eta(n)=n^{-0.2}\,\xi(n), \quad n=1,2,\ldots,N $$
(5)
$$ \sigma(n)=\sigma_{N}+\xi(n)\left(\sigma_{0} \times 0.05^{n/N}-\sigma_{N}\right)n^{-0.25}, \quad n=1,2,\ldots,N $$
(6)

Here, \(\xi(n)=1-\exp(5(n-N)/N)\), and σ0 and σN represent the desired spread of the neighborhood function at iteration 0 and iteration N, respectively. At the start of the iterations, we want almost all the nodes to be affected by the update; since we are considering a 10 × 10 lattice, we set σ0 = 5. At the end of the iterations, we want the winning node not to affect its neighborhood, so we set σN = 0.5. The two constants 2.5 and 2.0 in steps 9, 10, 12 and 13 of Algorithm 2 are set empirically. Figure 6 compares the modified vs. the conventional decay functions for η and σ.
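The two modified schedules of (5) and (6), with σ0 = 5 and σN = 0.5 as stated above, can be implemented directly; the sketch below only evaluates the schedules (N = 100 is an arbitrary example).

```python
import numpy as np

def xi(n, N):
    """xi(n) = 1 - exp(5(n - N)/N)."""
    return 1.0 - np.exp(5.0 * (n - N) / N)

def eta_modified(n, N):
    """Modified learning-rate schedule of Eq. (5)."""
    return n ** -0.2 * xi(n, N)

def sigma_modified(n, N, sigma0=5.0, sigmaN=0.5):
    """Modified neighbourhood-spread schedule of Eq. (6)."""
    return sigmaN + xi(n, N) * (sigma0 * 0.05 ** (n / N) - sigmaN) * n ** -0.25

# Example: evaluate both schedules over N = 100 epochs (as plotted in Fig. 6).
N = 100
n = np.arange(1, N + 1)
eta, sigma = eta_modified(n, N), sigma_modified(n, N)
```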

Algorithm 2
figure b

Algorithm to generate a mesh addressing the boundary condition, given a point-cloud.

Fig. 6
figure 6

Comparing the modified vs. the conventional decay functions for σ (left) and η (right). Notice that during the initial iterations and towards the end of the iterations the modified σ decreases faster. The value of the modified η is also lower in the ordering phase compared to the conventional one

By the end of the ordering phase, both η (the learning rate) and σ become low. In the convergence phase, we want the boundary nodes to have a higher η so that they can move towards the input boundary points, and a smaller σ (less elasticity) so that they do not make large alterations to the already arranged inner nodes. We propose Algorithm 2 as our novel contribution that satisfies the above-mentioned requirements and addresses the boundary condition. Note that the main update at steps 11 and 15 of Algorithm 2 can be implemented in a parallel fashion for all the nodes in the lattice, which helps accelerate the execution of the algorithm.

For further refining the mesh shape in local areas with a sudden change in normal, we do the following. After running Algorithm 2, for each node w we find the set \(X^{\prime}=\{\textbf{x}_{i}, i=1,...,r: \textbf{x}_{i}\in X, \textbf{w}=\arg \min\limits_{j} ||\textbf{x}_{i}-\textbf{w}_{j}||, j=1,...,M\}\). Then we compute the average \(\textbf{x}^{\prime}=\frac{1}{r}{\sum}_{i=1}^{r}{\textbf{x}_{i}}\). If the Euclidean distance between w and \(\textbf{x}^{\prime}\) is greater than a threshold, we apply an ARAP [41] deformation with w as the source and \(\textbf{x}^{\prime}\) as the destination. One example is shown in Fig. 1 of the supplementary material. Next, we evaluate the efficacy of our approach.
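For illustration, the refinement check above (nearest-node assignment, centroid computation and thresholding) can be sketched as follows. The threshold value is an illustrative assumption, and the ARAP deformation itself is assumed to be provided by an existing implementation such as [41].

```python
import numpy as np

def refinement_targets(X, W, thresh=0.02):
    """For each mesh node w_j, compute the centroid x' of its nearest input points.

    X : (l, 3) object point-cloud, W : (M, 3) mesh nodes (flattened lattice).
    Returns a list of (node_index, target_point) pairs for which the node should
    be pulled towards x' (e.g. via an ARAP deformation with the node as source).
    """
    # Assign every input point to its nearest node (this defines the set X' per node).
    nearest = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
    targets = []
    for j in range(len(W)):
        pts = X[nearest == j]
        if len(pts) == 0:
            continue
        x_prime = pts.mean(axis=0)                    # centroid of X' for node j
        if np.linalg.norm(W[j] - x_prime) > thresh:   # flag nodes with a large offset
            targets.append((j, x_prime))
    return targets
```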

4 Experimental results and evaluation

4.1 Datasets and setups

We use several known datasets to evaluate our pipeline and its individual components. The ScanNet [15] dataset is used for the plane segmentation part. For non-planar objects, we also use ShapeNet [11]. Finally, for the evaluation of the entire pipeline we use the Apt0 [4], Copyroom [13] and Office3 [61] datasets. All the experiments are done w.r.t. a Double2 [20] robot mounted with a RealSense [64] camera for RGBD data. The pipeline runs on an edge device connected to our robot with a single-core CPU and 6 GB RAM.

4.2 Results on 3D mesh creation for non-planar objects

To test the efficacy of the proposed Algorithm 2, we have done extensive experiments, both qualitative and quantitative, on two types of data. The first type consists of object point-clouds extracted from real RGBD images (the Copyroom and Office3 datasets). The second type consists of object point-clouds taken from CAD models (the ShapeNet dataset [11, 22]). Whereas the point-clouds from CAD models are expected to be ideal, with no noise or holes, the point-clouds extracted from RGBD images are expected to contain noise, holes and uneven point density. Our algorithm smooths noise, fills holes and works well under uneven density. The run time of our proposed algorithms is proportional to the size (l) of the input point-cloud. Therefore, to reduce the size while preserving the shape and topology of the input point-cloud, we apply a voxel grid filter; this filtering also helps us tackle the uneven density of input point-clouds. The resolution of the generated mesh is determined by the number M = m × m of nodes in the approximating 2D lattice: the greater the value of M, the greater the resolution of the generated mesh. The upper limit of this resolution is restricted by the resolution of the input object point-cloud. We select M such that the generated mesh is dense enough to represent the structure of the object and sparse enough for low-cost storage. Tables 1 and 2 show some results on real and CAD objects. We also compare our results with those of the SOA [70], which also utilizes SOM for approximating object surfaces. Comparing the third and fourth columns of Tables 1 and 2, we observe that our method better approximates the surface shapes of the input objects. Column five overlays the result of [70] on ours; these overlays show that our method stretches the mesh boundary to the object boundary better than [70]. The first row of Tables 1 and 2 shows one example where our method can be seen to fill a hole.

Table 1 Qualitative comparison of our method with [70] for object point-clouds extracted from real RGBD images
Table 2 Qualitative comparison of our method with [70] for object point-clouds from CAD models

In Tables 3 and 4 the two quantitative metrics (as in [27]) are defined as follows. Accuracy: the distance d such that a given percentage of the reconstruction is within d of the ground truth model. Completeness: the percentage of the ground truth model that is within a given distance of the reconstruction. For accuracy, we set the given percentage to 90%. Different point-clouds can be at different scales; therefore, for completeness, we set the given distance to one tenth of the maximum extent of the model along the x, y and z axes. Note that a smaller accuracy value is better, while a larger completeness value is better. From Tables 3 and 4 we can say that our method generates better results than the SOA [70], both qualitatively and quantitatively.
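These two metrics can be computed with a brute-force nearest-neighbour search as sketched below; the 90% percentile and the one-tenth-of-extent threshold follow the definitions above, and measuring the extent on the ground-truth cloud is our assumption.

```python
import numpy as np

def _nn_dist(src, dst):
    """Nearest-neighbour distance from each point in src to the point set dst (brute force)."""
    return np.min(np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=2), axis=1)

def accuracy(recon, gt, percentage=90.0):
    """Distance d such that `percentage` of the reconstruction lies within d of the ground truth."""
    return np.percentile(_nn_dist(recon, gt), percentage)

def completeness(gt, recon, fraction=0.1):
    """Percentage of ground-truth points within `fraction` of the model extent from the reconstruction."""
    extent = np.max(gt.max(axis=0) - gt.min(axis=0))   # largest extent along x, y or z
    return np.mean(_nn_dist(gt, recon) < fraction * extent) * 100.0
```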

Table 3 Quantitative comparison of our method with [70] for object point-clouds extracted from real RGBD images
Table 4 Quantitative comparison of our method with [70] for object point-clouds from CAD models

For the sake of completeness, and to get an idea of the overall performance of our approach on whole datasets, we have done two sets of experiments. In the first set, we have taken the noise-free point-clouds of 30 objects from three different sources [56, 57, 86]. In the second set, we have extracted (segmented) the point-clouds of the objects from the indoor scenes of the Copyroom dataset as explained in Section 3.5 and illustrated in Fig. 4. Since the object point-clouds in the second set are processed from RGBD images of indoor scenes, they contain noise. In these two sets of experiments, we have run our Algorithm 2 and the approach of Steffen et al. [70] on all the objects of the two datasets. The results are compared in terms of the two quantitative measures, accuracy and completeness, defined above. The results are presented in Table 5. Since the volumes of different objects are different, instead of reporting the absolute distance d for which 90% of the reconstructed points are within d, we report accuracy as d × 100/(maximum distance between any two points in that object). Table 5 shows that our method is better in terms of both accuracy and completeness. The modifications in Algorithm 2 for better boundary approximation, especially the two decay functions (5) and (6), are the reason behind this improvement with respect to the SOA.

Table 5 Quantitative comparison of our method with [70] for point-clouds of synthetic objects from [56, 57, 86] and noisy point-clouds extracted from the Copyroom indoor scene dataset [13]

4.3 Final mesh results

The world model containing planar and non-planar object meshes generated by our novel pipeline preserves the structural properties of the environment and is thus well suited for remote virtual navigation and manipulation. Figure 1(c) shows the reconstructed environment mesh. The mesh normals in Fig. 1(b) clearly show a smooth output, as desired. Figure 7 shows the complete mesh generated by our proposed pipeline on three different publicly available datasets. Some other results are presented in the supplementary material.

Fig. 7
figure 7

Vertex colored 3D mesh generated by our pipeline on (a): Office3, (b): Apt0, (c): Copyroom

4.4 Execution time

Plane extraction, plane matching, planar mesh generation and non-planar mesh generation are the four computational building blocks of our pipeline. As we design our system to process only keyframes and the Double2 robot's movement speed is moderate, we need to process 2-3 keyframes per second. For plane extraction, we use Plane RCNN, a learning-based plane segmentation algorithm that generates output in near real-time [79] in our experimental setup. For plane matching, we exploit the PPS-based plane representation and are able to track the planes in real-time. For planar mesh generation, instead of using the computationally expensive Delaunay triangulation, we use a much faster grid fitting algorithm. Any indoor environment mostly consists of planar objects like walls, floors, and ceilings, so the use of a less expensive mesh generation algorithm for most of the environment helps minimize the run-time of the pipeline. For non-planar regions, we use our modified SOM-based clustering algorithm, which is highly parallelizable: the main updates in steps 11 and 15 of Algorithm 2 can be done in parallel for all M nodes of the lattice and thus can easily be implemented on a multi-threaded CPU or a GPU. We have implemented the customized SOM to run in several threads on a CPU-based edge device, where our whole pipeline runs at 3 frames per second, which is sufficient for a keyframe-based pipeline (3 keyframes per second) in a constrained indoor environment. Utilization of a GPU would increase the speed further.

5 Conclusion and future works

We have proposed a complete pipeline for representing a 3D world model in terms of triangular meshes, given a continuous sequence of RGBD images of the scene captured from the robot's ego view. Our method, being simple, can easily be integrated into any robotic platform. The method does not have a training phase and does not require a huge training set as supervised learning based methods do; thus, it generalizes to any indoor environment. Our mesh approximation of the objects in the scene represents the shape and size of the objects better than the SOA. Both qualitative and quantitative results on multiple datasets establish the efficacy of our method compared to the SOA.

The main limitation of this work is the lack of proper texture information in the generated 3D scene and its objects. Such a texturing module can easily be integrated into the existing pipeline in the future.