Binocular mobile augmented reality based on stereo camera tracking
Mobile augmented reality (AR) applications have become feasible with the evolution of mobile hardware. For example, the advent of the smartphone made real-time mobile AR possible, which triggered the release of various applications. Recently, the rapid development of display technology, especially for stereoscopic displays, has encouraged researchers to implement more immersive and realistic AR. In this paper, we present a framework for binocular augmented reality based on stereo camera tracking. Our framework was implemented on a smartphone and supports an autostereoscopic display and a video see-through display in which a smartphone can be docked. We modified edge-based 3-D object tracking in order to estimate the poses of the left and right cameras jointly; this guarantees consistent registration across the left and right views. Then, virtual contents were overlaid onto the camera images using the estimated poses, and the augmented stereo images were distorted to be shown through a video see-through display. The feasibility of the proposed framework is shown by experiments and demonstrations.
Keywords: Mobile augmented reality, Stereoscopic display, Video see-through display, Stereo camera tracking, Model-based tracking
Augmented reality (AR) provides immersive experiences for users by overlaying a camera preview with virtual contents. In order to make AR realistic, the real world and virtual world should be aligned precisely; this alignment can be achieved by using physical sensors or camera images.
Recently, the use of smartphones equipped with cameras and sensors such as global positioning system (GPS), digital compass, and gyroscope has boosted various mobile AR applications in fields such as education, tourism, and advertising. There is also a growing sense of anticipation of the possibility that mobile AR can be more immersive and realistic with the help of advanced display technology.
There are several types of stereo displays that allow people to perceive depth, for example, stereoscopic displays and head-mounted displays (HMDs). With the development of these displays, realistic 3-D contents in contexts such as games, movies, virtual reality (VR), and AR have also been extensively explored. Technologies concerned with these contents on a stereoscopic display have usually been studied in order to create and provide 3-D movies and broadcasts; people have usually watched them in static environments such as theaters and living rooms. However, as stereoscopic displays have been miniaturized, they have become available on portable game consoles and smartphones, and this has allowed people to enjoy realistic 3-D content anywhere. Stereo cameras placed on those devices have also enabled interactive applications on stereoscopic displays, including mobile AR.
Meanwhile, the recent evolution of HMD technology has brought with it an increasing interest in virtual and augmented reality technologies. Video see-through HMDs, especially their high-end versions, provide immersive VR experiences based on their rendering quality, wearability, wide field of view, sensors, and so on. However, as AR devices, they have some problems, for example, the need for additional computing devices, the absence of a camera, limited mobility, and a high price. On the other hand, there are HMDs in which smartphones can be docked. Some of them, for example Google Cardboard, have quite a simple composition, because they use the smartphone's display and sensors; this can make their price quite low. By using them, we can get a VR system for only a few dollars; immersive mobile AR applications also become possible if the smartphone has a stereo camera.
Several things should be considered for the implementation of mobile AR using a stereo display; these include precise registration between the real and virtual worlds, real-time processing, and the handling of intrinsic hardware problems. In early stages, physical sensors were used for registration in AR. However, sensors are sensitive to noise, and their accuracy is not sufficient for robust and precise registration; for precise registration, a vision-based tracking technique is necessary. Vision-based tracking is one of the computer vision techniques that have been actively researched, with many methods that can be run in real time even on mobile devices. A marker-based approach provides precise and robust registration despite the simplicity of the algorithm. Furthermore, it is fast enough to perform stereo camera tracking in real time. Therefore, the marker-based approach was adopted for stereo AR earlier than other vision-based tracking approaches.
However, stereo camera tracking in natural scenes incurs a high computation cost, which often makes the implementation of a real-time application difficult. For example, one mobile AR system used stereo camera tracking based on depth sensing algorithms that recovered the geometric structure captured in stereoscopic images. Although this approach was implemented on a smartphone with a 3-D display, it was still not suitable for processing in real time, because of its expensive computational costs. Recently, with the development of hardware and software, it has become possible to run natural feature-based planar object tracking as well as 3-D object tracking in real time even on a mobile platform [5, 6]. Nevertheless, extending those methods to stereo camera tracking requires careful algorithm design and optimization, because of the additional computation required for consistent registration across cameras.
In this paper, we present a mobile binocular AR framework based on stereo camera tracking. Our approach aims to provide AR on mobile stereoscopic displays; the proposed framework supports an autostereoscopic display and a Google Cardboard-like video see-through HMD. We implemented the proposed framework on a stereo camera-equipped smartphone. Stereo images were converted from the YUV to the RGB color space and were warped for stereo image rectification. Camera poses for registration between the real and virtual worlds were estimated by stereo camera tracking; at this point, the camera poses were jointly estimated using the geometric relationship between the cameras in order to guarantee consistent registration across the left and right views. Then, the virtual contents were augmented onto the rectified images. When a video see-through display is used, the augmented image is predistorted in order to correct lens distortion. For real-time performance, we separated the tracking and rendering processes into different threads, and color space conversion and stereo image rectification, which were run in the rendering thread, were implemented on the GPU. The feasibility of the proposed framework was shown by experiments and demonstrations.
2 The framework
Figure 1 shows the proposed mobile binocular AR framework based on stereo camera tracking. The framework consists of two threads. One of them performs stereo camera tracking, and the other performs the rendering of images from the stereo camera and virtual contents. They are here called the tracking thread and the rendering thread, respectively.
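The two-thread split can be illustrated with a minimal sketch. It is not the authors' implementation: the `SharedPose` class, the loop functions, and the placeholder pose values are hypothetical names introduced only for illustration. The idea shown is that the tracking thread publishes the most recent pose and the rendering thread reads whatever pose is currently available without waiting for tracking to finish.

```python
import threading


class SharedPose:
    """Lock-protected slot holding the most recent tracking result (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._frame_id = -1
        self._pose = None

    def publish(self, frame_id, pose):
        with self._lock:
            self._frame_id, self._pose = frame_id, pose

    def latest(self):
        with self._lock:
            return self._frame_id, self._pose


def tracking_loop(shared, n_frames):
    # Stand-in for stereo camera tracking: one pose per camera frame.
    for frame_id in range(n_frames):
        pose = (float(frame_id), 0.0, 0.0)  # placeholder pose, not a real estimate
        shared.publish(frame_id, pose)


def rendering_loop(shared, n_draws, log):
    # Stand-in for rendering: draw with whatever pose is newest at draw time.
    for _ in range(n_draws):
        log.append(shared.latest())


def run_demo(n_frames=50, n_draws=20):
    shared, log = SharedPose(), []
    tracker = threading.Thread(target=tracking_loop, args=(shared, n_frames))
    renderer = threading.Thread(target=rendering_loop, args=(shared, n_draws, log))
    tracker.start()
    renderer.start()
    tracker.join()
    renderer.join()
    return shared.latest(), log
```

With this arrangement the rendering rate is decoupled from the tracking rate: a slow tracking step delays only the pose update, not the camera preview.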
2.1 Stereo camera tracking
We extended the edge-based 3-D object tracking method proposed by Seo et al. to stereo camera tracking and applied it to our framework. Edge-based 3-D object tracking uses the silhouette of a 3-D mesh model of the target object and the image edges as visual cues. Figure 2 shows the process of correspondence searching in edge-based 3-D object tracking. In order to extract the silhouette of the 3-D model, back-faces that are invisible from the camera pose of the previous frame are removed by back-face culling, and edges that are shared by two of the remaining faces are removed. However, edges inside the silhouette cannot be removed completely in this way, and they can interfere with robust tracking. Therefore, another round of filtering is performed in order to remove the remaining edges inside the silhouette. An object region mask is created by rendering all faces of the 3-D model onto the image plane, and a silhouette is extracted from the object mask by contour detection. Then, the remaining edges are projected, and any edge whose distance from the silhouette exceeds a certain threshold is filtered out.
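The first filtering stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of Seo et al.: it assumes a triangle mesh with counter-clockwise winding (outward normals), a known camera position in model coordinates, and it keeps an edge as a silhouette candidate when it belongs to exactly one front-facing triangle. The function name `silhouette_candidate_edges` is hypothetical. The second stage (mask rendering, contour detection, distance threshold) is not shown.

```python
from collections import defaultdict

import numpy as np


def silhouette_candidate_edges(vertices, faces, cam_pos):
    """Return sorted (i, j) vertex-index pairs of edges used by exactly one
    front-facing triangle. Assumes counter-clockwise winding (assumption)."""
    v = np.asarray(vertices, dtype=float)
    cam = np.asarray(cam_pos, dtype=float)
    edge_count = defaultdict(int)
    for f in faces:
        a, b, c = v[f[0]], v[f[1]], v[f[2]]
        normal = np.cross(b - a, c - a)       # outward normal for CCW triangles
        if np.dot(normal, cam - a) <= 0.0:    # back-face culling
            continue
        for i, j in ((f[0], f[1]), (f[1], f[2]), (f[2], f[0])):
            edge_count[(min(i, j), max(i, j))] += 1
    # Edges shared by two front faces lie inside the visible surface; drop them.
    return sorted(e for e, n in edge_count.items() if n == 1)
```

For a non-convex object, some of the returned edges can still lie inside the silhouette, which is why a second filtering stage against the rendered object mask is needed.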
2.2 Rendering stereo images
In the proposed framework, three kinds of image processing are performed to render augmented images: color space conversion, stereo image rectification, and predistortion. These processes are suitable for performing parallel processing, because the same operation is performed independently for each pixel. Therefore, we implemented these processes on the GPU using a programmable shader in order to enhance performance. Fortunately, no bottlenecks due to data transfer from the GPU to CPU existed, because the processing results were displayed immediately and asynchronously.
Although Android smartphones usually support YUV camera input, the OpenGL ES API for image rendering does not support YUV textures. Thus, camera images should be converted from YUV format to RGB format for camera preview rendering. This conversion can be computed simply by a matrix multiplication for each pixel value and can be faster on a GPU.
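The per-pixel matrix multiplication can be sketched on the CPU with NumPy; a fragment shader would apply the same arithmetic per texel. The coefficients below are the common full-range BT.601 values, which is an assumption: the source does not state which YUV variant or memory layout the device camera delivers, and the sketch assumes an already unpacked (H, W, 3) array.

```python
import numpy as np

# Full-range BT.601 YUV -> RGB matrix (assumption; the source does not name the variant).
YUV_TO_RGB = np.array([
    [1.0,  0.0,       1.402],
    [1.0, -0.344136, -0.714136],
    [1.0,  1.772,     0.0],
])


def yuv_to_rgb(yuv):
    """Convert an (H, W, 3) uint8 array of Y, U, V samples to uint8 RGB."""
    x = yuv.astype(np.float32)
    x[..., 1:] -= 128.0                    # center the chroma channels
    rgb = x @ YUV_TO_RGB.T                 # one 3x3 multiply per pixel
    return np.clip(np.rint(rgb), 0, 255).astype(np.uint8)
```

Because every pixel is converted independently of its neighbors, the operation maps naturally onto a programmable shader.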
The converted RGB image needs to be rectified because a misalignment of cameras and lens distortions can be caused by the manufacturing process, which can result in an interference in the 3-D effect. Stereo image rectification allows the alignment of cameras to be coplanar and resolves the problem. Since stereo image rectification including the undistortion of the image is a nonlinear transformation, each pixel is warped inversely using the transformation map. The transformation map contains the coordinates to which each pixel moves and can be computed in the offline stage. Although this process is quite simple, the use of the transformation map and bilinear interpolation requires much memory access and floating point instructions; these requirements can be a burden on a mobile phone with low computing power and a small cache when those operations are sequentially performed.
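Inverse warping with a precomputed transformation map and bilinear interpolation can be sketched as below. This is a NumPy illustration of the general technique, not the authors' shader; the names `remap_bilinear`, `map_x`, and `map_y` are hypothetical, and the maps are assumed to hold, for each output pixel, the source coordinates to sample.

```python
import numpy as np


def remap_bilinear(image, map_x, map_y):
    """Sample `image` at (map_y[r, c], map_x[r, c]) for every output pixel (r, c).
    Coordinates outside the image are clamped to the border."""
    h, w = image.shape[:2]
    x = np.clip(map_x, 0, w - 1)
    y = np.clip(map_y, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0                # fractional offsets in [0, 1)
    if image.ndim == 3:                    # broadcast weights over color channels
        fx, fy = fx[..., None], fy[..., None]
    img = image.astype(np.float32)
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bottom = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bottom * fy
```

Each output pixel needs four source reads and several multiply-adds, which illustrates why the memory traffic becomes a burden when the loop runs sequentially on a mobile CPU.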
In our framework, a predistortion method based on texture mapping is used. After the augmented images are rendered to an offscreen buffer, each pixel in the offscreen buffer is warped using a distortion map computed in advance; this warping is similar to that for stereo image rectification. The distortion map was obtained by applying a camera calibration algorithm to a chessboard pattern image distorted into a barrel shape.
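As an illustration of what a precomputed distortion map contains, the sketch below builds one from a radial polynomial model. This swaps in a different technique from the paper, which derived its map from calibration on a barrel-distorted chessboard image; the function name and the coefficients `k1` and `k2` are hypothetical values chosen only for demonstration.

```python
import numpy as np


def barrel_predistortion_map(width, height, k1=0.2, k2=0.1):
    """Build inverse-warp maps (map_x, map_y) that give the output a barrel shape.
    k1 and k2 are hypothetical radial coefficients, not values from the paper."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    norm = max(cx, cy, 1.0)                           # normalize radius to about 1
    xs, ys = np.meshgrid(np.arange(width, dtype=np.float32),
                         np.arange(height, dtype=np.float32))
    nx, ny = (xs - cx) / norm, (ys - cy) / norm
    r2 = nx * nx + ny * ny
    scale = 1.0 + k1 * r2 + k2 * r2 * r2              # grows with distance from center
    # Each output pixel samples the source farther from the center, so the
    # image content is compressed toward the center (barrel distortion).
    map_x = cx + nx * scale * norm
    map_y = cy + ny * scale * norm
    return map_x, map_y
```

The resulting maps can be applied with any inverse-warping routine that uses bilinear sampling; source coordinates that fall outside the image would typically be rendered black.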
3 Experimental results
The distortion caused by the lenses of a video see-through HMD can be easily corrected by using software development kits that provide predistortion functionality (e.g., Cardboard SDK for Android, Oculus SDK, Mali VR SDK); unfortunately, our target device was not supported by any of these SDKs, because of its old Android OS version and its limited GPU capability. Therefore, we implemented predistortion ourselves as described in Sect. 2.2. The distorted pattern image for computing the predistortion map was created by applying a distortion filter in Photoshop and fitting the result to a screenshot captured from another device on which the Google Cardboard application was running.
We calibrated the stereo camera in advance in order to obtain the cameras' intrinsic parameters, the distortion coefficients, and the geometric relationship between the left and right cameras; information for stereo image rectification, such as the rectification transformation matrix and the lookup tables, was also computed after stereo calibration. At this point, we fixed the focal length of the cameras and disabled the auto-convergence mode, which adjusts disparity according to the depth of a subject by a disparity remapping-like algorithm, since this would invalidate the stereo camera calibration. Figure 6 illustrates the cameras and the calibration patterns with respect to the left camera. The distance along the x-axis between the cameras was estimated at 23.29 mm, which suggests that the cameras were correctly calibrated, because this is close to the distance given in the specifications of the device (24 mm).
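The agreement between the estimated and specified baselines can be quantified with a short calculation. The helper name `relative_error` is hypothetical; the two distances are the ones reported above.

```python
def relative_error(estimated, reference):
    """Relative deviation of an estimate from a reference value."""
    return abs(estimated - reference) / reference


estimated_baseline_mm = 23.29   # from stereo calibration
specified_baseline_mm = 24.0    # from the device specifications
error = relative_error(estimated_baseline_mm, specified_baseline_mm)
print(f"{error:.2%}")           # prints 2.96%
```

A deviation of roughly 3% (0.71 mm) is small relative to the baseline, which is the basis for treating the calibration as plausible.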
Table: Average processing time in each thread (columns: tracking thread (ms), rendering thread (ms))
Table: Processing times for color space conversion and stereo image rectification in the rendering thread on CPU and GPU (columns: CPU implementation (ms), GPU implementation (ms))
In order to verify that joint camera pose estimation is superior to independent estimation of each camera pose in terms of accuracy and robustness, we compared jointly estimated poses and independently estimated poses with poses from ARToolKit, which was regarded as the ground truth because of its sufficient robustness and accuracy. We used a marker-attached box (Fig. 11); this allowed both edge-based 3-D object tracking and marker-based tracking to be applied to the same image sequence. The image sequence has 500 frames and was captured from the stereo camera on the Optimus 3-D smartphone. Figure 12 shows the trajectories of the cameras, which were estimated jointly and independently using the box model, and those from ARToolKit. The differences were mostly small; however, the error of the independently estimated position of the left camera (orange line) increased in the interval from frame 418 to frame 449. Such an error results in inconsistent augmentation on a stereo display, which can disrupt immersion and cause visual discomfort. In contrast, joint camera pose estimation worked well, maintaining the geometric relationship between the left and right cameras at every frame, and thus avoids this problem.
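The geometric constraint behind joint estimation can be sketched with homogeneous transforms. This is an illustration, not the authors' optimization: it assumes each pose is a 4x4 matrix mapping object coordinates to camera coordinates, that the calibrated stereo transform `T_rl` maps left-camera coordinates to right-camera coordinates, and that a 24 mm baseline along x is used as a hypothetical value. All function names are introduced for illustration.

```python
import numpy as np


def make_pose(rotation, translation):
    """Assemble a 4x4 rigid transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def rot_y(angle_rad):
    """Rotation about the y-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def right_from_left(T_left, T_rl):
    """Right camera pose implied by the left pose and the fixed stereo transform."""
    return T_rl @ T_left


def stereo_consistency_error(T_left, T_right, T_rl):
    """Distance (in the pose units) between the relative translation implied by
    two separately estimated poses and the calibrated one."""
    T_rel = T_right @ np.linalg.inv(T_left)
    return float(np.linalg.norm(T_rel[:3, 3] - T_rl[:3, 3]))


# Hypothetical calibration: right camera 24 mm to the right of the left camera.
T_rl = make_pose(np.eye(3), [-24.0, 0.0, 0.0])
T_left = make_pose(rot_y(0.3), [10.0, 20.0, 500.0])
T_right_joint = right_from_left(T_left, T_rl)
```

When the right pose is derived from the left one, the consistency error is zero by construction; two independently estimated poses carry no such guarantee, which is the kind of drift visible in the independently estimated trajectory.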
However, the stereo camera tracking we used can fail if the camera motion between adjacent frames is too large, because the correspondence search range cannot cover the motion and edges cannot be detected under motion blur. Because this is an intrinsic problem of edge-based tracking, it is difficult to avoid tracking failure due to large motion. In order to handle tracking failure, a recovery method should be adopted. Unfortunately, the device we used does not have sufficient computing power to perform such a recovery method; recovery methods could be adopted on newer devices that combine more capable hardware with a stereo camera.
This paper presented an augmented reality framework for mobile stereo displays. Our framework supports an autostereoscopic display as well as a video see-through display such as Google Cardboard. Joint camera pose estimation in stereo camera tracking allows precise registration of the real and virtual worlds and consistent augmentation across both views. Utilizing the GPU and multi-threading enabled the framework to run at an interactive rate, despite the many computations required for stereo camera tracking and image warping. The experiments and demonstration showed the feasibility of the framework.
Nevertheless, there is considerable room for improvement in our framework. One issue is that the distortion mapping function for predistortion was not estimated to fit the characteristics of the display exactly but was only approximated. Another is that the use of an outdated device restricted the potential speed-up of the tracking by optimization. For example, recent devices offer more possibilities for optimization by supporting OpenCL and memory mapping between the CPU and GPU. We are currently working on further optimizations of the tracking for additional speed-up and robustness.
This research was supported by the Ministry of Culture, Sports and Tourism (MCST) and the Korea Creative Content Agency (KOCCA) in the Culture Technology (CT) Research and Development Program 2015.
- 1. Feiner, S., MacIntyre, B., Hollerer, T., Webster, A.: A touring machine: prototyping 3D mobile augmented reality systems for exploring the urban environment. In: 1997. Digest of Papers, First International Symposium on Wearable Computers, pp. 74–81 (Oct 1997)
- 2. Kato, H., Billinghurst, M.: Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In: 1999 (IWAR’99) Proceedings. 2nd IEEE and ACM International Workshop on Augmented Reality, pp. 85–94. IEEE (1999)
- 7. Berning, M., Kleinert, D., Riedel, T., Beigl, M.: A study of depth perception in hand-held augmented reality using autostereoscopic displays. In: 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 93–98 (Sept 2014)
- 8. Kerber, F., Lessel, P., Mauderer, M., Daiber, F., Oulasvirta, A., Krüger, A.: Is autostereoscopy useful for handheld AR? In: Proceedings of the 12th International Conference on Mobile and Ubiquitous Multimedia, pp. 4:1–4:4. ACM MUM ’13, New York (2013)
- 9. Watson, B., Hodges, L.: Using texture maps to correct for optical distortion in head-mounted displays. In: 1995. Proceedings Virtual Reality Annual International Symposium, pp. 172–178 (Mar 1995)
- 10. Pohl, D., Johnson, G.S., Bolkart, T.: Improved pre-warping for wide angle, head mounted displays. In: Proceedings of the 19th ACM Symposium on Virtual Reality Software and Technology, pp. 259–262. ACM (2013)
- 12. Mangiat, S., Gibson, J.: Disparity remapping for handheld 3D video communications. In: 2012 IEEE International Conference on Emerging Signal Processing Applications (ESPA), pp. 147–150 (Jan 2012)
- 13. Hinterstoisser, S., Lepetit, V., Ilic, S., Fua, P., Navab, N.: Dominant orientation templates for real-time detection of texture-less objects. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2010)
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.