1 Introduction

Augmented reality (AR) combines virtual and real worlds, allowing users to experience virtual objects in real space. AR differs from virtual reality, which immerses users in a new space by building an entirely artificial world. AR complements the real world and supports users’ understanding by overlaying virtual interfaces on the real world. Hence, it has recently been used in various industries such as healthcare, education, gaming, and entertainment [1, 2].

As the usability of augmented reality technologies improves, the interface presentation methods and the interaction functions driven by user input must also evolve. Display, tracking, and video processing technologies are used in augmented reality. Among them, tracking is crucial to the performance and functionality of applications because it identifies the camera’s position at every instant and maintains a consistent state through recognition. Therefore, many studies have been conducted to bring it to a high level of maturity.

Tracking techniques in augmented reality are generally based on markers, as shown in Fig. 1. When a marker of a given type is recognized, virtual visual information (an interface) is augmented at its location [3, 4]. Traditional tracking techniques visualize information only around markers, which restricts the working space. To overcome this limitation, it is necessary to recognize objects that exist in the surrounding space itself; this is what spatial markers are used for. However, when using spatial markers, the unit of control is not an individual object; thus, it is not possible to interact with or control individual objects in the working space. In other words, depending on the type of marker, the challenges are to overcome the constraints of space and to gain control of each object. Therefore, this study proposes a system that tracks the camera in a real-world space, recognizes objects in real time, and visualizes virtual visual information through location estimation.

Fig. 1
figure 1

Augmented reality using markers (left: binary marker, right: image marker)

To realize the proposed system, point cloud data are generated from the color and depth images obtained by the camera. The generated point cloud data are used to construct the space through tracking. Using SLAM [5], the locations of objects in the space are estimated based on the point cloud data. We propose a method to eliminate the gaps that arise between the feature points; the method also recognizes objects and estimates their three-dimensional (3D) poses in the video stream during AR execution. Specifically, we propose a 3D pose estimation method that recognizes two-dimensional (2D) objects using deep learning and utilizes the recognized 2D positional information to minimize errors.

The contributions of our research are as follows. First, the space can be detected through real-world camera tracking, which allows a spatial map to be created with an RGB-D camera and no special equipment. Second, objects can be recognized based on the generated space and the estimated camera positions; real-time operation is ensured by applying multithreading and a fast deep learning algorithm. Third, the camera point of view is corrected by extending the recognized 2D object model to three dimensions. Finally, the 3D-extended object information is augmented via AR, allowing virtual visual information to be visualized at the actual object location.

2 Related work

2.1 Object recognition and location estimation techniques

Studies that focus on object recognition use various techniques for estimating locations [6,7,8]. Methods for recognizing objects in an input video or image are largely divided into computer vision-based and deep learning-based methods. Computer vision-based recognition detects features in an image and determines the object. Early object recognition studies mainly used computer vision methods; scale-invariant feature transform (SIFT) [9], speeded-up robust features (SURF) [10], and oriented FAST and rotated BRIEF (ORB) [11] are the commonly used algorithms. Although feature detection algorithms differ in their underlying techniques, detection is performed on the geometrically distinctive parts of the target object. Because detection is performed pixel by pixel, the matched feature points can also be used to estimate location information.

In deep learning-based recognition, object recognition is performed through neural network learning. Early neural networks were used only for object recognition, followed by R-CNN [12], spatial pyramid pooling in deep convolutional networks (SPPnet) [13], and the single-shot multibox detector (SSD) [14], which perform both object recognition and 2D localization. Because the location of an object in real space is three dimensional, studies have also been conducted to obtain vision-based 3D location information [15,16,17] or deep learning-based 3D location information [18, 19]. Vision-based studies have the disadvantage of either failing to recognize objects or lacking sufficient positioning accuracy. Deep learning-based studies have yielded high accuracy and reliable object recognition; however, they either require large datasets for training or are difficult to operate in large spaces. Our study proposes a system that can operate in a large real space, estimate 3D locations, and recognize objects with high accuracy even with small datasets.

2.2 Implementing augmented reality

Augmented reality was first reported in 1992 [20] and has since been studied together with computer vision technology. Subsequent studies have mainly focused on using markers to achieve accurate positioning and tracking. Marker forms include binary patterns, images, objects, and spaces. Initially, binary markers were the main form used because multiple shapes were easy to construct and feature points were easy to extract [21]. However, when binary markers are placed in the real world, they do not blend in naturally and cover other objects. Image and object markers use objects that already exist in the real world; therefore, there is little visual alienation or interference. However, the feature extraction and processing required to recognize image and object markers are considerably greater than for binary markers. Consequently, they were not applied in early tracking technologies. One study showed that lighter systems with high processing speeds can operate in real time, but at the cost of high resource usage [22]. AR using binary, image, or object markers is not suitable for applications in large spaces because augmentation is possible only within the limited space where the camera recognizes the marker.

In augmented reality, space itself can be used as a marker, as in GPS-based approaches [23] and the Pokémon GO game. However, a natural synthesis of virtual objects with the real world is difficult to achieve in this way because detailed location calculations are not possible indoors. Therefore, the use of SLAM technology for augmented reality has been proposed [24]. However, when space alone is used as a marker, the objects in the real world cannot be recognized or located. Recently, studies have been conducted to recognize space and objects simultaneously [25, 26]. Our system reconstructs real-world spaces and objects in three dimensions and uses the reconstructed space as a marker to implement AR.

3 Proposed method

3.1 System overview

In this study, we propose a method for visualizing virtual information at a real-world location by recognizing objects and estimating their locations in a 3D coordinate system while simultaneously tracking the camera. The system operates in the following stages: image input, tracking, 2D object detection, 3D pose estimation, and visualization. The system receives color and depth images from the camera and uses them to generate point cloud data. The SLAM-based tracking system estimates the camera’s real-world location in real time and organizes the input data into a map. Before performing deep learning-based object detection, we built the datasets and trained the neural network. In the last step, the map created from the point cloud data, the 2D location obtained through object detection, and the object point cloud are input into the ICP algorithm to estimate the 3D position and orientation of objects in the real world. The estimated pose is then augmented with virtual visual information through graphics technology. Figure 2 shows the conceptual flow of the system proposed in this study.

Fig. 2
figure 2

System overview (First, we retrieve color and depth image data using the RGB-D camera. Second, a dense point cloud map is generated using the feature map and sparse point cloud. Using the dense point cloud map, 2D objects are detected; the detection is performed by the YOLO algorithm. Finally, we estimate the pose of an object for augmented reality visualization)

3.2 Data collection

A definition of space is needed to estimate the location of users in the real world and track them. One approach, which requires no definition of the space, is to locate users from satellite signals using GPS sensors. However, this method cannot be used indoors, and errors of several meters occur. Additional equipment, such as a dead reckoning module, a geographic information system, inertial measurement equipment, and cameras, is needed to compensate for this. For spatial definitions, the characteristics of the space must be identifiable. The main source of such features is visual information, and images are used as the medium. In this study, location estimation and tracking are performed indoors and outdoors using only an RGB-D camera as the input device, without complex equipment. When color and depth video are input through the RGB-D camera, a point cloud is created. We conducted the study assuming that objects were separated from each other in the scanned space, because estimating the pose is problematic when objects overlap; this limitation is discussed at the end of the paper. Figure 3 shows an input image from the camera and an example of a generated point cloud.

Fig. 3
figure 3

Example of input data and point cloud (left: RGB image from the real world, middle: depth image from the real world, right: generated point cloud)
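The paper does not tie this step to a specific library; the following minimal sketch, assuming a standard pinhole camera model and the Open3D library, shows how a registered color/depth pair can be back-projected into a colored point cloud. The intrinsic parameters and file names are placeholders and must be replaced with the actual camera calibration.

```python
import open3d as o3d

# Placeholder pinhole intrinsics; real values come from the RGB-D camera's calibration.
WIDTH, HEIGHT = 640, 480
FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5
DEPTH_SCALE = 1000.0  # depth image stores millimetres

def rgbd_to_point_cloud(color_path: str, depth_path: str) -> o3d.geometry.PointCloud:
    """Back-project a registered color/depth pair into a colored point cloud."""
    color = o3d.io.read_image(color_path)
    depth = o3d.io.read_image(depth_path)
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth,
        depth_scale=DEPTH_SCALE,        # millimetres -> metres
        depth_trunc=4.0,                # ignore points farther than 4 m
        convert_rgb_to_intensity=False)
    intrinsics = o3d.camera.PinholeCameraIntrinsic(WIDTH, HEIGHT, FX, FY, CX, CY)
    return o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsics)

if __name__ == "__main__":
    cloud = rgbd_to_point_cloud("frame_color.png", "frame_depth.png")
    o3d.visualization.draw_geometries([cloud])
```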

3.3 Tracking

The defined space is then used to estimate the location of users in the real world and to track them. The algorithm selected to construct a suitable AR system must satisfy the following conditions:

  • Real-time operation in a lightweight system environment

  • Definition of spatial markers both indoors and outdoors

  • No performance degradation as the map grows

  • Fast recovery when location estimation and tracking are interrupted

Tracking was carried out using ORB-SLAM2 [5], which satisfies the above conditions. Although various SLAM techniques exist [5, 27,28,29], we used ORB-SLAM2 in our system because it best meets these requirements. ORB-SLAM2 is based on the ORB feature extraction algorithm; it establishes a lightweight system environment and enables real-time operation. Stable performance is ensured through the deletion of redundant points and loop closing in indoor and outdoor environments. Using the bag-of-words technique, it can also quickly resume positioning after tracking is interrupted. Figure 4 shows the results of map creation and tracking based on the information input through the RGB-D camera.

Fig. 4
figure 4

Result of tracking based on the RGB-D camera (the sparse point cloud extracted using ORB-SLAM2)

Although the distribution of the point cloud in the map shown in Fig. 4 is very sparse, this causes no problems for tracking. However, such a map is insufficient when the pose of an object is estimated later, as can be seen in Fig. 5. The object coordinates on the left side of Fig. 5 differ by 45\(^\circ\) from the map coordinates. After pose estimation, the positional movement is performed correctly, as shown on the right side of Fig. 5, but the rotation still shows a difference of approximately 90\(^\circ\).

Fig. 5
figure 5

Result of pose estimation with the sparse point cloud (left: before pose estimation, right: after pose estimation)

To solve this problem, we used the point cloud library to create a dense point cloud. Figure 6a shows the result for a single frame. This per-frame information is used to generate the map via SLAM; at this point, each frame must be located and merged. In general, point matching algorithms such as ICP are used to merge different point cloud data. However, their long execution time makes real-time operation difficult, and duplicate points must also be optimized. Therefore, this study relies on ORB-SLAM2 so that frame positioning and redundant-point processing can be omitted. The merging is performed through the feature map and the dense point cloud matching method, and it can be completed in real time. Figure 6b shows the process and result of creating a dense point cloud map.

Fig. 6
figure 6

Process and result of the dense point cloud (a result of the dense point cloud for a single frame, b process of generating the dense point cloud map)
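The paper does not give implementation details for this merging step; the sketch below, a minimal illustration using Open3D, assumes that the per-frame dense clouds and the corresponding camera-to-world poses estimated by the tracker are already available, and merges them with voxel down-sampling standing in for explicit duplicate-point handling.

```python
import open3d as o3d

VOXEL_SIZE = 0.01  # metres; controls the density of the merged map

def build_dense_map(frame_clouds, camera_poses):
    """Merge per-frame dense clouds into one map using poses estimated by SLAM.

    frame_clouds : list of o3d.geometry.PointCloud in camera coordinates
    camera_poses : list of 4x4 numpy arrays (camera-to-world transforms)
    """
    dense_map = o3d.geometry.PointCloud()
    for cloud, pose in zip(frame_clouds, camera_poses):
        # Copy the frame cloud and transform it into world (map) coordinates.
        transformed = o3d.geometry.PointCloud(cloud)
        transformed.transform(pose)
        dense_map += transformed
        # Down-sample to merge duplicate points and bound memory usage.
        dense_map = dense_map.voxel_down_sample(VOXEL_SIZE)
    return dense_map
```

Because the poses come from the tracker rather than from frame-to-frame ICP, the per-frame cost stays low enough for real-time operation.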

3.4 2D object detection

Based on ORB-SLAM2, feature points are extracted from the images and the map is generated. Because the space is defined, tracking and camera positioning are possible. However, it is still impossible to recognize objects and estimate their locations in three dimensions, because objects are contained in the space but the computer system understands only its geometric form. Therefore, an additional system is needed to define the characteristics of the objects and to recognize and locate them. This study detects objects (recognition and location estimation) based on deep learning. The data required for neural network training are constructed through semi-auto labeling and data inflation. The detection results are passed to the tracking system, and the object class together with the 3D position estimation result is visualized for the user. Of the various neural networks that can recognize objects, we used the YOLOv3 model [30] because its recognition is sufficiently accurate while remaining lightweight and fast in terms of system usage. As Table 1 shows, its object recognition accuracy is not significantly different from that of other neural networks, yet its execution is much faster.

Table 1 Comparison of throughput and mean average precision (mAP) values by neural network
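The paper does not state how the trained YOLOv3 model is invoked at run time; the following minimal sketch, assuming a Darknet-format model loaded through OpenCV's dnn module, illustrates 2D detection for one frame. The configuration and weight file names, input size, and thresholds are placeholders; the three class names follow Fig. 8.

```python
import cv2
import numpy as np

CLASSES = ["cooker", "box", "refrigerator"]   # the three classes trained in this work
CONF_THRESHOLD, NMS_THRESHOLD = 0.5, 0.4

# Hypothetical file names for the trained Darknet model.
net = cv2.dnn.readNetFromDarknet("yolov3_custom.cfg", "yolov3_custom.weights")

def detect_2d(frame):
    """Return (class_id, confidence, [x, y, w, h]) detections for one BGR frame."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    h, w = frame.shape[:2]
    boxes, confidences, class_ids = [], [], []
    for out in outputs:
        for det in out:                      # det = [cx, cy, bw, bh, obj, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            conf = float(scores[class_id])
            if conf < CONF_THRESHOLD:
                continue
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(conf)
            class_ids.append(class_id)

    keep = cv2.dnn.NMSBoxes(boxes, confidences, CONF_THRESHOLD, NMS_THRESHOLD)
    return [(class_ids[i], confidences[i], boxes[i]) for i in np.array(keep).flatten()]
```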

To reduce the dataset construction time, the initial data are acquired through frame segmentation of the video, and a larger dataset is constructed via data inflation through growing, color change, rotation, synthesis, and partial deletion. A data label is assigned to each generated image. Each label consists of the x- and y-coordinates, width, and height of the bounding box, and these values normally have to be entered by the user. Therefore, to speed up data labeling, we developed a semi-auto labeling system based on SURF feature point matching. Figure 7 shows the structure of the semi-auto labeling system.

Fig. 7
figure 7

System of semi-auto labeling

The bounding box entered by the user is automatically converted into label coordinates. Subsequently, the bounding box for each following frame is determined automatically by matching the source image against the next frame image, and this process is repeated until the last frame (a sketch of this propagation step is given after Fig. 8). Training was conducted on the YOLO neural network using the datasets built through data inflation and semi-auto labeling. Three object classes were trained with a batch size of 64 and 2700 training iterations. Figure 8 shows the accuracy of object detection with the trained network.

Fig. 8
figure 8

Result of object recognition (cooker: 98%, box: 99%, refrigerator: 97%)
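As an illustration of the semi-auto labeling step described above, the following sketch propagates a user-drawn bounding box from one frame to the next using SURF matching and a RANSAC homography, and then writes the result in YOLO-style normalized label format. It assumes an opencv-contrib build (SURF is not included in the default OpenCV package); the thresholds and helper names are illustrative, not the exact implementation used in the paper.

```python
import cv2
import numpy as np

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)   # requires opencv-contrib
matcher = cv2.BFMatcher(cv2.NORM_L2)

def propagate_box(src_img, dst_img, box):
    """Map a labeled box (x, y, w, h) from src_img to dst_img via SURF matching."""
    kp1, des1 = surf.detectAndCompute(src_img, None)
    kp2, des2 = surf.detectAndCompute(dst_img, None)
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]  # Lowe ratio test
    if len(good) < 4:
        return None                              # not enough matches to estimate H
    src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
    x, y, w, h = box
    corners = np.float32([[x, y], [x + w, y], [x + w, y + h], [x, y + h]]).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
    x0, y0 = warped.min(axis=0)
    x1, y1 = warped.max(axis=0)
    return int(x0), int(y0), int(x1 - x0), int(y1 - y0)

def to_yolo_label(class_id, box, img_w, img_h):
    """Convert (x, y, w, h) in pixels to a normalized YOLO label line."""
    x, y, w, h = box
    return (f"{class_id} {(x + w / 2) / img_w:.6f} {(y + h / 2) / img_h:.6f} "
            f"{w / img_w:.6f} {h / img_h:.6f}")
```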

3.5 3D pose estimation

In this study, the proposed method estimates the position of the camera and detects objects simultaneously. Estimating an object using only the results of the steps described in Sects. 3.3 and 3.4 causes the problem of augmenting virtual information inconsistently with the object, as shown in Fig. 9. This is because the estimated location information is two dimensional; it cannot represent the object stereoscopically or express its rotation direction. Therefore, the ICP algorithm [31] is used to calibrate the 3D pose.

Fig. 9
figure 9

Result of pose estimation and augmentation using 2D information

The ICP algorithm outputs, as a matrix, the difference in position and angle between two models through repeated matching of nearby points. Because matching is repeated between points, the input data must be in point cloud format; here, we use the dense point cloud map together with object point clouds created through model reconstruction. Figure 10 shows the point clouds of objects reconstructed using the Meshroom program.

Fig. 10
figure 10

Result of generating point cloud of objects
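The paper does not write the ICP objective out explicitly; in its standard point-to-point form, each iteration pairs every object point with its closest map point and then solves for the rigid transform that minimizes the summed squared distances:

\[
(R^{*}, t^{*}) = \arg\min_{R,\,t} \sum_{i=1}^{N} \bigl\lVert R\,p_i + t - q_{c(i)} \bigr\rVert^{2},
\qquad
c(i) = \arg\min_{j} \bigl\lVert R\,p_i + t - q_j \bigr\rVert,
\]

where \(p_i\) are points of the reconstructed object cloud, \(q_j\) are points of the map, and the resulting \(R\) and \(t\) form the matrix that the algorithm outputs; the correspondences \(c(i)\) and the transform are re-estimated alternately until convergence.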

Because ICP yields both position and rotation information, it could in principle be applied directly without the two-dimensional information. However, using ICP alone causes errors in estimating an object’s pose, because ICP is performed between the map point cloud and the object point cloud, which are not identical data. Figure 11 illustrates the problems that arise when the two-dimensional object detection step is omitted from pose estimation.

Fig. 11
figure 11

Problems with ICP algorithm application after omitting 2D object detection steps

The left image in Fig. 11 illustrates the setup of the ICP algorithm: the real world is represented by the point cloud map data, the actual object location is shown in yellow, and the red point cloud represents the reconstructed object. The right side of Fig. 11 shows the matching result. Note the positional error, which arises because the algorithm matches only adjacent points. Figure 12 shows the result of performing ICP after limiting the region of the point cloud map using the two-dimensional location detection. Comparing this result with Fig. 9, we can see that the virtual information is now augmented to match the real object.

Fig. 12
figure 12

Results of 3D pose estimation
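The paper does not detail how the 2D detection limits the ICP search region; the sketch below, a minimal Open3D illustration, assumes the detection has already been converted into a rough axis-aligned region of interest in map coordinates and then runs point-to-point ICP between the reconstructed object cloud and the cropped map. Function and parameter names are illustrative.

```python
import numpy as np
import open3d as o3d

def estimate_object_pose(map_cloud, object_cloud, roi_min, roi_max,
                         max_corr_dist=0.05):
    """Estimate the object's 4x4 pose in the map by region-limited ICP.

    roi_min, roi_max : 3-vectors bounding the region suggested by the 2D detection.
    """
    # 1) Limit the map to the region indicated by the 2D detection result.
    roi = o3d.geometry.AxisAlignedBoundingBox(np.asarray(roi_min, dtype=float),
                                              np.asarray(roi_max, dtype=float))
    map_region = map_cloud.crop(roi)

    # 2) Run point-to-point ICP: object cloud (source) onto the cropped map (target).
    init = np.eye(4)   # could instead be seeded from the 2D detection centre
    result = o3d.pipelines.registration.registration_icp(
        object_cloud, map_region, max_corr_dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation    # rotation + translation of the object in the map
```

Restricting the target cloud to the detected region is what prevents ICP from converging onto a nearby but incorrect surface, which is the failure mode illustrated in Fig. 11.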

4 Experimental result

This study proposes a system that enables tracking of cameras inside a real-world space and visualizes virtual visual information through recognition and location estimation of objects. The system was implemented in the environment described in Table 2.

Table 2 System environment

To fulfill the real-time requirements, the tracking, two-dimensional object detection, and three-dimensional pose estimation steps are handled in separate threads. Figure 13 briefly illustrates the thread structure. The three threads shown in Fig. 13 operate simultaneously: the tracking thread runs as long as the overall system is active, whereas the two-dimensional object detection and three-dimensional pose estimation threads change state upon user request.

Fig. 13
figure 13

Structure of thread operation
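As a rough illustration of this structure, the sketch below spawns the three stages as Python threads, with queues standing in for the data handed between them; the loop bodies are placeholders, since the actual implementation (and language) of the system is not specified in the paper.

```python
import queue
import threading
import time

frame_q = queue.Queue(maxsize=1)    # newest RGB-D frame handed to detection
detect_q = queue.Queue(maxsize=1)   # newest 2D detections handed to pose estimation
stop_event = threading.Event()

def tracking_thread():
    """Always running while the system is active: SLAM tracking and map update."""
    while not stop_event.is_set():
        # grab RGB-D frame, run ORB-SLAM2 tracking, update the dense map ...
        time.sleep(0.01)            # placeholder for per-frame tracking work

def detection_thread():
    """Activated on user request: 2D object detection on the newest frame."""
    while not stop_event.is_set():
        # frame = frame_q.get(); detect_q.put(detect_2d(frame)) ...
        time.sleep(0.05)            # placeholder for per-frame detection work

def pose_thread():
    """Activated on user request: region-limited ICP and AR rendering."""
    while not stop_event.is_set():
        # detections = detect_q.get(); estimate poses; render augmentation ...
        time.sleep(0.05)            # placeholder for pose estimation and rendering

threads = [threading.Thread(target=f, daemon=True)
           for f in (tracking_thread, detection_thread, pose_thread)]
for t in threads:
    t.start()
time.sleep(1.0)                     # let the threads run briefly in this demo
stop_event.set()
```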

We conducted camera tracking, object recognition, and positioning in a 3 m × 3 m area. Based on a spatial map composed of 500,000 dense points, each object was detected and its location was estimated for augmentation.

Figure 14 shows the results of synthesizing the real world with virtual visual information. Each object is recognized, and virtual visual information in a different color is synthesized into the real world for each recognition result. When the three-dimensional bounding box is checked for each result, it accurately surrounds the object according to its rotation and position. The direction and position of the teapot show that accurate tracking is performed even when the camera moves. Mapping is performed correctly, and tracking of the virtual visual information works normally even in localization mode. In addition, virtual visual information can be augmented on objects and tracked in a large area, as shown in Fig. 15.

Fig. 14
figure 14

Result of augmented visual information based on various objects (a cooker, b refrigerator, c box. As the 3D bounding box can be confirmed for each result, it accurately surrounds the object according to its rotation and position. Note that the direction and position of the kettle are accurately tracked and augmented even when the camera moves. The system operates normally even in localization mode)

Fig. 15
figure 15

Result of tracking in a large area (Note that the augmentation is maintained in a large room rather than a limited space. Once an object has been augmented and tracked, tracking continues even if the view is temporarily interrupted by partitions or stairs)

Figure 16 (end of paper) compares our system with commercial SDKs. Figure 16a and b depicts marker-based augmented reality, where tracking stops when the marker becomes invisible; in our system, the virtual visual information is maintained because spatial tracking is possible. Figure 16c and d compares map construction based on SLAM and the augmentation of virtual visual information on the two systems. When objects are added to the space, the traditional SDK (MAXST SLAM) requires a new map to be constructed before object recognition can be performed and user-supplied virtual visual information can be positioned. Because this input is performed directly in the development engine, accurate placement is difficult. Our system, in contrast, performs object recognition and location estimation even when an object is added to the same map data. In addition, when creating maps with MAXST, normal tracking was difficult in environments that lacked distinctive features.

The proposed system can synthesize virtual visual information at the exact object location in real time and perform normal tracking according to the camera location in the real world. However, owing to its high dependence on visual information, it is difficult to operate properly with many objects or with occlusions. In particular, ICP, which operates on point cloud data, is heavily affected; it is responsible for aligning the virtual visual information with the object location. Therefore, it is difficult to synthesize information at the correct position when objects are obscured by or attached to other objects. Figure 17 shows a location estimation failure caused by occlusion.

Fig. 16
figure 16

Comparison with other SDKs (a, b comparison of object augmentation under camera movement: the Vuforia SDK loses the augmented blue box when the camera rotates and the marker is no longer recognized, whereas our system maintains the augmented white box because it is based on spatial markers and therefore augments objects without explicit markers. c, d comparison of augmentation based on map configuration: if an object is added to the same space, the other SDK requires a new map to be constructed and the content repositioned, whereas our system does not)

Fig. 17
figure 17

Failure to estimate location due to occlusion (This is an error that occurs in the point-to-point matching of ICP. When two objects overlap, attempting to match the object in front yields an unexpected augmentation result)

5 Conclusion

In this study, we propose a system that enables camera tracking in the real world and visualizes virtual visual information through object recognition and location estimation. Our system augments a space through camera tracking, two-dimensional object recognition, and three-dimensional pose estimation based on RGB-D camera information, while recovering three-dimensional positional information and recognizing objects. A SLAM-based camera tracking technology was used because the two coordinate systems need to be shared for interaction between the real and virtual spaces. The minimum unit of control in the space is the object; however, a spatial map created via SLAM alone cannot recognize objects or estimate their locations. This issue was resolved using deep learning-based object recognition. Finally, the ICP algorithm augments the virtual visual information to match the position and rotation of the real object.

The contributions of our study are as follows: First, a given space can be scanned using simple camera equipment without special equipment. Usually, to reconstruct a three-dimensional space, special equipment or specialized tools must be used. However, in our study, a three-dimensional space was configured by photographing video images using a simple RGB-D camera. Second, object recognition can be performed using the created point cloud. Based on the point cloud data collected from the video image, not only space but also objects existing in it can be reconstructed. Moreover, objects can be recognized in a short time. Third, it is possible to determine the 3D poses of the localized objects and detect the state in which they are placed. This helps to augment other virtual objects.

The proposed system can become the underlying technology that enables real-time interaction between the real world and objects present in virtual space. If the remaining limitations of our research, such as location estimation for dynamic objects and occlusion, are resolved, AR is expected to converge with the Internet of Things (IoT). AR technology enables users to control and manage objects through virtual visual information in the real world, and its intuitive interface is expected to provide users with a sense of immersion and realism. Such sensory benefits and advances in tracking technology increase the likelihood of application in other fields, such as artificial intelligence, healthcare, education, gaming, military, and entertainment. As a simple example, datasets needed for learning 3D objects in artificial intelligence could be established using AR tracking technology.