1 Introduction

Augmented reality (AR) provides immersive experiences for users by overlaying virtual content on a camera preview. To make AR realistic, the real and virtual worlds must be aligned precisely; this alignment can be achieved using physical sensors or camera images.

Recently, the use of smartphones equipped with cameras and sensors such as the global positioning system (GPS), digital compass, and gyroscope has boosted various mobile AR applications in fields such as education, tourism, and advertising. There is also growing anticipation that mobile AR can become more immersive and realistic with the help of advanced display technology.

There are several types of stereo displays that give viewers a sense of depth, for example, stereoscopic displays and head-mounted displays (HMDs). With the development of these displays, realistic 3-D content in contexts such as games, movies, virtual reality (VR), and AR has also been extensively explored. Technologies for such content on stereoscopic displays have usually been studied in order to create and deliver 3-D movies and broadcasts, which people have typically watched in static environments such as theaters and living rooms. However, as stereoscopic displays have been miniaturized, they have become available on portable game consoles and smartphones, allowing people to enjoy realistic 3-D content anywhere. Stereo cameras mounted on those devices have also enabled interactive applications on stereoscopic displays, including mobile AR.

Meanwhile, the recent evolution of HMD technology has brought increasing interest in virtual and augmented reality technologies. Video see-through HMDs, especially high-end versions, provide immersive VR experiences thanks to their rendering quality, wearability, wide field of view, built-in sensors, and so on. However, as AR devices they have some problems, for example, the need for additional computing devices, the absence of a camera, limited mobility, and a high price. On the other hand, there are HMDs into which smartphones can be docked. Some of them, for example Google Cardboard, have quite a simple construction because they use the smartphone's display and sensors, which keeps their price low. With them, a VR system can be obtained for only a few dollars; immersive mobile AR applications also become possible if the smartphone has a stereo camera.

Several issues must be considered when implementing mobile AR on a stereo display: precise registration between the real and virtual worlds, real-time processing, and handling of the display's intrinsic hardware problems. In early work, physical sensors were used for registration in AR [1]. However, sensors are sensitive to noise, and their accuracy is insufficient for robust and precise registration; for precise registration, a vision-based tracking technique is necessary. Vision-based tracking has been actively researched in computer vision, and many methods can run in real time even on mobile devices. A marker-based approach provides precise and robust registration despite the simplicity of the algorithm [2]. Furthermore, it is fast enough to perform stereo camera tracking in real time. Therefore, marker-based approaches were widely adopted for stereo AR earlier than other vision-based tracking approaches.

However, stereo camera tracking in natural scenes incurs a high computational cost, which often makes the implementation of a real-time application difficult. For example, a mobile AR system used stereo camera tracking based on depth-sensing algorithms that recovered the geometric structure captured in stereoscopic images [3]. Although this approach was well implemented on a smartphone with a 3-D display, it was still not suitable for real-time processing because of its high computational cost. Recently, it has become possible to run natural feature-based planar object tracking [4] as well as 3-D object tracking in real time even on a mobile platform [5, 6] with the development of hardware and software. Nevertheless, extending those methods to stereo camera tracking requires careful algorithm design and optimization, because of the additional computation required for consistent registration across cameras.

On the other hand, stereo displays still have problems as AR devices because of their hardware and operating principles. In the case of autostereoscopic displays, doubt has been raised about the effectiveness of depth perception in AR. Recent studies of this problem found that a stereo display can yield significantly better depth perception than a mono display, depending on whether the scene provides other depth cues more dominant than stereopsis [7, 8]. A video see-through display suffers from image distortion and chromatic aberration due to the lenses inside the display. Although geometric distortion can be handled by predistorting the images [9], more effective and efficient ways to generate fine-quality images are still being studied. For example, Pohl et al. [10] proposed an improved predistortion method that corrects chromatic aberration. The method distorts each color channel separately, considering a different index of refraction for each wavelength.

Fig. 1 Flowchart of the proposed framework

In this paper, we present a mobile binocular AR framework based on stereo camera tracking. Our approach aims to provide AR on mobile stereoscopic displays; the proposed framework supports an autostereoscopic display and a Google Cardboard-like video see-through HMD. We implemented the proposed framework on a smartphone equipped with a stereo camera. Stereo images were converted from the YUV to the RGB color space and were warped for stereo image rectification. Camera poses for registration between the real and virtual worlds were estimated by stereo camera tracking; at this point, the poses were jointly estimated using the geometric relationship between the cameras in order to guarantee consistent registration across the left and right views. The virtual contents were then augmented onto the rectified images. When a video see-through display is used, the augmented image is predistorted in order to correct lens distortion. For real-time performance, we separated the tracking and rendering processes into different threads, and the color space conversion and stereo image rectification, which run in the rendering thread, were implemented on the GPU. The feasibility of the proposed framework was shown by experiments and demonstrations.

2 The framework

Figure 1 shows the proposed mobile binocular AR framework based on stereo camera tracking. The framework consists of two threads. One of them performs stereo camera tracking, and the other performs the rendering of images from the stereo camera and virtual contents. They are here called the tracking thread and the rendering thread, respectively.

In the tracking thread, camera poses are estimated using edge-based 3-D object tracking. Correspondences between the edges of the mesh model of a target object and the edges detected from the Y-channel image are established for each view of the stereo pair. Then, camera poses are jointly estimated using the correspondences from both views and are sent to the rendering thread. In the rendering thread, the YUV images captured by the stereo camera are converted into the RGB color space and are rectified. The color space conversion and stereo image rectification are performed on the GPU for real-time performance. The rectified images are rendered, and the virtual contents are augmented onto them using the camera poses. The geometric transformation between the cameras, the intrinsic camera parameters, and the mapping information for stereo image rectification are computed at the offline stage. In addition, the predistortion map for the video see-through HMD is also computed at the offline stage.
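
The paper does not include source code; as a rough illustration of this two-thread design, the following C++ sketch shows how a tracking loop and a rendering loop could share the latest stereo pose. The `PoseBuffer` structure and all names are ours, not the authors' implementation.

```cpp
// Minimal sketch of the two-thread split (hypothetical names, not the authors' code).
// The tracking thread estimates stereo poses; the rendering thread consumes the
// latest pose while converting, rectifying, and drawing the camera images.
#include <atomic>
#include <mutex>
#include <thread>
#include <opencv2/core.hpp>

struct PoseBuffer {                        // shared state between the two threads
    std::mutex mtx;
    cv::Matx44d E_left  = cv::Matx44d::eye();
    cv::Matx44d E_right = cv::Matx44d::eye();
};

std::atomic<bool> running{true};

void trackingThread(PoseBuffer& poses) {
    while (running) {
        // 1. grab Y-channel stereo pair, 2. search correspondences,
        // 3. jointly estimate the stereo pose (Sect. 2.1) ...
        cv::Matx44d E_L = cv::Matx44d::eye(), E_R = cv::Matx44d::eye();
        std::lock_guard<std::mutex> lock(poses.mtx);
        poses.E_left  = E_L;
        poses.E_right = E_R;
    }
}

void renderingThread(PoseBuffer& poses) {
    while (running) {
        cv::Matx44d E_L, E_R;
        {   std::lock_guard<std::mutex> lock(poses.mtx);
            E_L = poses.E_left; E_R = poses.E_right; }
        // YUV->RGB conversion, rectification (GPU), then augment with E_L, E_R ...
    }
}

int main() {
    PoseBuffer poses;
    std::thread t1(trackingThread, std::ref(poses));
    std::thread t2(renderingThread, std::ref(poses));
    // ... run until the application stops ...
    running = false;
    t1.join(); t2.join();
}
```

Because the rendering thread only reads the most recent pose, a slow tracking frame simply means the previous pose is reused for one more rendered frame, which matches the decoupling described above.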

Fig. 2 Correspondence searching of the edge-based 3-D object tracking

2.1 Stereo camera tracking

We extended the edge-based 3-D object tracking method proposed by Seo et al. [5] to stereo camera tracking and applied it to our framework. Edge-based 3-D object tracking uses the silhouette of a 3-D mesh model of the target object and the image edges as visual cues. Figure 2 shows the process of correspondence searching in edge-based 3-D object tracking. In order to extract the silhouette of the 3-D model, faces invisible from the camera pose of the previous frame are removed by back-face culling, and edges that are shared by two of the remaining faces are removed. However, edges inside the silhouette cannot be removed completely, and they can interfere with robust tracking. Therefore, another round of filtering is performed in order to remove the remaining edges inside the silhouette. An object region mask is created by rendering all faces of the 3-D model onto the image plane, and the silhouette is extracted from the object mask by contour detection. Then, the remaining edges are projected, and edges whose distance from the silhouette exceeds a certain threshold are filtered out.
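
As an illustration only, the following sketch shows how this mask-based filtering step could be realized with OpenCV, assuming the model has already been rendered into a binary object mask at the previous pose; the function name and distance threshold are hypothetical.

```cpp
// Sketch of the silhouette filtering step (illustrative, not the authors' code).
#include <opencv2/imgproc.hpp>
#include <cmath>
#include <vector>

std::vector<cv::Point2f> filterSilhouetteEdges(
    const cv::Mat& objectMask,                       // 8-bit mask of the rendered model
    const std::vector<cv::Point2f>& projectedEdges,  // projected model edge points
    double distThreshold = 3.0)
{
    // Extract the outer contour of the object mask (the silhouette);
    // we assume a single object region here.
    cv::Mat mask = objectMask.clone();
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_NONE);
    if (contours.empty()) return {};

    // Keep only projected edge points lying close to the silhouette contour;
    // interior edges farther than the threshold are discarded.
    std::vector<cv::Point2f> kept;
    for (const auto& p : projectedEdges) {
        double d = std::abs(cv::pointPolygonTest(contours[0], p, /*measureDist=*/true));
        if (d <= distThreshold) kept.push_back(p);
    }
    return kept;
}
```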

Edges belonging to the extracted silhouette are regularly sampled and projected onto the image plane. Then, the closest image edge lying along the normal direction of each model edge sample is set into correspondence with that sample. The image edges could be detected by a Canny operator [11]; however, it is inefficient to detect edges over the whole image, since the search ranges of the model edge samples may not cover the entire image plane. Therefore, for computational efficiency, local edge components are detected by applying a Sobel filter only to the pixels within the search ranges of the model edge samples (Fig. 3). These processes, from silhouette extraction to correspondence searching, are performed independently on each view.
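
A minimal sketch of the 1-D search along a sample's normal is given below. It assumes a grayscale Y-channel image and uses a central-difference gradient along the normal as a stand-in for the Sobel response described above; all names and thresholds are illustrative.

```cpp
// Illustrative 1-D correspondence search along the normal of a projected edge sample.
#include <opencv2/core.hpp>
#include <cmath>

// Returns the image point with the strongest gradient along the sample normal,
// or the sample itself if no gradient exceeds the threshold.
cv::Point2f searchAlongNormal(const cv::Mat& gray,   // CV_8UC1 Y-channel image
                              cv::Point2f sample,     // projected model edge sample
                              cv::Point2f normal,     // unit normal direction
                              int searchRange = 15,
                              float gradThreshold = 30.f)
{
    cv::Point2f best = sample;
    float bestGrad = gradThreshold;
    const cv::Rect bounds(2, 2, gray.cols - 4, gray.rows - 4);
    for (int s = -searchRange; s <= searchRange; ++s) {
        cv::Point2f p = sample + normal * static_cast<float>(s);
        if (!bounds.contains(cv::Point(cvRound(p.x), cvRound(p.y)))) continue;
        cv::Point2f p0 = p - normal, p1 = p + normal;
        // Central difference along the normal approximates the directional gradient.
        float g = std::abs(
            static_cast<float>(gray.at<unsigned char>(cvRound(p1.y), cvRound(p1.x))) -
            static_cast<float>(gray.at<unsigned char>(cvRound(p0.y), cvRound(p0.x))));
        if (g > bestGrad) { bestGrad = g; best = p; }
    }
    return best;
}
```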

Fig. 3 Correspondence searching using local edge component detection

We can estimate the camera pose using the established correspondences. The camera pose of a monocular camera, \({\mathbf{E }}_t\), is updated from the previous pose \({\mathbf{E }}_{t-1}\) by the camera motion \(\Delta {\mathbf{E }}\):

$$\begin{aligned} {\mathbf {E}}_{t}={\mathbf {E}}_{t-1}\cdot \Delta {\mathbf {E}} \end{aligned}$$
(1)

The camera motion \(\Delta \mathbf{E }\) consists of the rotation transformation R and the translation transformation t between the previous and current frames. It can be parameterized by an exponential map as follows,

$$\begin{aligned} \Delta {\mathbf {E}}=\begin{bmatrix} {\mathbf {R}}& {\mathbf {t}} \\ {\mathbf {0}}_{1\times 3}& 1 \end{bmatrix}=\exp \left( \sum _{i=0}^{5}\alpha _{i}{\mathbf {G}}_{i}\right) \approx {\mathbf {I}}+\sum _{i=0}^{5}\alpha _{i}{\mathbf {G}}_{i}, \end{aligned}$$
(2)

where \({\mathbf{G }}_i\) are the basis matrices of the rotation and translation transformations along the x, y, z axes in homogeneous coordinates, and \(\alpha _i\) are infinitesimal changes along their respective bases. The distance between a projected edge sample of the 3-D model and its correspondence should be explained by the displacement of the projected point \(\hat{\mathbf{m }}=\begin{bmatrix}u&\quad v&\quad w\end{bmatrix}^{\top }\) of the 3-D point \({\mathbf{M} }\) under the basis \({\mathbf{G }}\). The \({\alpha }=\begin{bmatrix}\alpha _{0}&\quad \alpha _{1}&\quad \alpha _{2}&\quad \alpha _{3}&\quad \alpha _{4}&\quad \alpha _{5}\end{bmatrix}^{\top }\) satisfying this condition can be estimated by

$$\begin{aligned} \hat{{\alpha }}=\underset{{\alpha }}{\hbox {argmin}}\left\{ \sum _{j}\left\| d^j-\sum _{i=0}^{5}\alpha _{i}\left( {\mathbf {n}}_{j}\cdot {\mathbf {l}}_{i,j} \right) \right\| ^2 \right\} , \end{aligned}$$
(3)

where \(d^j\) is the distance between the jth projected edge sample and its corresponding image edge, \({\mathbf{n} }_j\) is the normal vector of the jth sample, and \({\mathbf{l}}_{i,j}\) is its motion vector. The motion vector \({\mathbf{l}}_{i,j}\) can be computed using the camera intrinsic matrix \({\mathbf{K}}\) and the partial derivatives of \(\hat{{\mathbf{m}}}_{i,j}\):

$$\begin{aligned} \Delta \hat{{\mathbf {m}}}_{i,j}= \begin{bmatrix} {u}^{\prime }_j&{v}^{\prime }_j&{w}^{\prime }_j \end{bmatrix} ^{\top }={\mathbf {KE}}_{t-1}{\mathbf {G}}_{i}{\mathbf {M}}_{j} \end{aligned}$$
(4)
$$\begin{aligned} {\mathbf {l}}_{i,j}= \begin{bmatrix} \left( \frac{u}{w} \right) ^{\prime }& \left( \frac{v}{w} \right) ^{\prime } \end{bmatrix} \end{aligned}$$
(5)

After the motion parameter \(\alpha\) is estimated by least squares, the camera pose \({\mathbf{E}}_{t}\) can be updated using the camera motion \(\Delta {\mathbf{E}}\).
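
The following sketch illustrates how Eqs. (2) and (3) could be turned into a least-squares update, assuming the per-correspondence terms \(d^j\) and \({\mathbf{n}}_j\cdot {\mathbf{l}}_{i,j}\) have already been computed. The structure and names are ours, not the authors' implementation.

```cpp
// Minimal least-squares motion update for Eq. (3), using the linearized form
// DeltaE ~ I + sum_i alpha_i * G_i from Eq. (2). Illustrative only.
#include <opencv2/core.hpp>
#include <array>
#include <vector>

struct EdgeResidual {
    std::array<double, 6> j;   // n_j . l_{i,j} for i = 0..5 (one Jacobian row)
    double d;                  // distance to the matched image edge
};

cv::Matx44d estimateMotion(const std::vector<EdgeResidual>& residuals,
                           const std::array<cv::Matx44d, 6>& G)   // generators G_i
{
    cv::Mat J(static_cast<int>(residuals.size()), 6, CV_64F);
    cv::Mat d(static_cast<int>(residuals.size()), 1, CV_64F);
    for (int r = 0; r < J.rows; ++r) {
        for (int c = 0; c < 6; ++c) J.at<double>(r, c) = residuals[r].j[c];
        d.at<double>(r, 0) = residuals[r].d;
    }

    cv::Mat alpha;                            // 6x1 motion parameters
    cv::solve(J, d, alpha, cv::DECOMP_SVD);   // least-squares solution of Eq. (3)

    cv::Matx44d dE = cv::Matx44d::eye();      // DeltaE ~ I + sum_i alpha_i * G_i
    for (int i = 0; i < 6; ++i)
        dE += G[i] * alpha.at<double>(i, 0);
    return dE;                                // E_t = E_{t-1} * dE, as in Eq. (1)
}
```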

In the case of a stereo camera, individually estimating the camera pose of each view can cause inconsistent registration. Therefore, the camera poses are jointly estimated using the geometric relationship between the cameras. From Eq. 1, the camera pose of each view can be represented as

$$\begin{aligned} {\mathbf {E}}_{L,t}&= {\mathbf {E}}_{L,t-1}{{\varvec{\Delta }}}{\mathbf {E}}_{L} \nonumber \\ {\mathbf {E}}_{R,t}& = {\mathbf {E}}_{R,t-1}{{\varvec{\Delta }}}{\mathbf {E}}_{R}, \end{aligned}$$
(6)

and the camera motion of each view can be estimated as

$$\begin{aligned} {{{\varvec{\Delta }}}}{\mathbf {E}}_{L,t}^{*}= & \underset{{{\varvec{\Delta }}}{\mathbf {E}}_{L}}{{\mathrm {argmin}}}\left\{ \sum _{i}\left\| {\mathbf {K}}_{L}{\mathbf {E}}_{L}{{{\varvec{\Delta }}}}{\mathbf {E}}_{L}{\mathbf {X}}_{i}-{\mathbf {x}}_{i} \right\| ^2 \right\} \nonumber \\ {{{\varvec{\Delta }}}}{\mathbf {E}}_{R,t}^{*}= & \underset{{{\varvec{\Delta }}}{\mathbf {E}}_{R}}{{\mathrm {argmin}}}\left\{ \sum _{j}\left\| {\mathbf {K}}_{R}{\mathbf {E}}_{R}{{{\varvec{\Delta }}}}{\mathbf {E}}_{R}{\mathbf {X}}_{j}-{\mathbf {x}}_{j} \right\| ^2 \right\} . \end{aligned}$$
(7)

Here, the subscripts L and R indicate the left and right cameras, respectively; \({\mathbf{X}}\) is an edge sample of the 3-D model, and \({\mathbf{x}}\) is its corresponding image edge point. In order to estimate the motion of both cameras jointly, the error functions (Eq. 7) should be combined:

$$\begin{aligned} {{{\varvec{\Delta }}}}{\mathbf {E}}_{L,t}^{*}, {{\varvec{\Delta }}}{\mathbf {E}}_{R,t}^{*}&= \underset{{\varvec{\Delta }}{\mathbf {E}}_{L}, {{\varvec{\Delta }}}{\mathbf {E}}_{R}}{{\mathrm {argmin}}}\left\{ \sum _{i}\left\| {\mathbf {K}}_{L}{\mathbf {E}}_{L}{{\varvec{\Delta }}}{\mathbf {E}}_{L}{\mathbf {X}}_{i}-{\mathbf {x}}_{i} \right\| ^2 \right. \nonumber \\&\quad \left. +\,\sum _{j}\left\| {\mathbf {K}}_{R}{\mathbf {E}}_{R}{{\varvec{\Delta }}}{\mathbf {E}}_{R}{\mathbf {X}}_{j}-{\mathbf {x}}_{j} \right\| ^2 \right\} . \end{aligned}$$
(8)

By assuming that the motions of both cameras are the same and that the stereo camera has already been calibrated, Eq. 8 can be rewritten as

$$\begin{aligned} {{\varvec{\Delta }}}{\mathbf {E}}_{L,t}^{*}&= \underset{{\varvec{\Delta }}{\mathbf {E}}_{L}}{{\mathrm {argmin}}}\left\{ \sum _{i}\left\| {\mathbf {K}}_{L}{\mathbf {E}}_{L}{{\varvec{\Delta }}}{\mathbf {E}}_{L}{\mathbf {X}}_{i}-{\mathbf {x}}_{i} \right\| ^2 \right. \nonumber \\&\quad \left. +\,\sum _{j}\left\| {\mathbf {K}}_{R}{\mathbf {E}}_{L\rightarrow R}{\mathbf {E}}_{L}{{\varvec{\Delta }}}{\mathbf {E}}_{L}{\mathbf {X}}_{j}-{\mathbf {x}}_{j} \right\| ^2 \right\} . \end{aligned}$$
(9)

Now, we can estimate \(\Delta {\mathbf{E}}_{L}\) using the correspondences obtained from both views without regard to \(\Delta {\mathbf{E}}_{R}\). After the estimation of \(\Delta {\mathbf{E}}_{L}\), the right camera pose \({\mathbf{E}}_{R}\) can be computed by multiplying the left camera pose \({\mathbf{E}}_{L}\) by the transformation between the cameras, \({\mathbf{E}}_{L\rightarrow R}\). The camera poses are further refined by performing these processes iteratively.
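
Building on the previous sketch, the joint update of Eq. (9) could be organized as below: correspondences from both views are stacked into a single system that is solved only for the left-camera motion, and the right pose follows from the fixed stereo extrinsics. `buildResiduals()` is a hypothetical helper that fills the Jacobian rows for a given projection chain and is left unimplemented here.

```cpp
// Illustrative joint pose update for Eq. (9); not the authors' implementation.
#include <opencv2/core.hpp>
#include <array>
#include <vector>

struct EdgeResidual { std::array<double, 6> j; double d; };   // as in the previous sketch

// Hypothetical helper: fills Jacobian rows and distances for one view,
// given the projection chain K * basePose * DeltaE_L (definition not shown).
void buildResiduals(std::vector<EdgeResidual>& out, int view, const cv::Matx44d& basePose);
cv::Matx44d estimateMotion(const std::vector<EdgeResidual>& residuals,
                           const std::array<cv::Matx44d, 6>& G);

void jointPoseUpdate(cv::Matx44d& E_L, cv::Matx44d& E_R,
                     const cv::Matx44d& E_LtoR,                 // fixed stereo extrinsics
                     const std::array<cv::Matx44d, 6>& G,
                     int iterations = 5)
{
    for (int it = 0; it < iterations; ++it) {
        std::vector<EdgeResidual> residuals;
        // Left-view rows use K_L * E_L * DeltaE_L; right-view rows use
        // K_R * E_{L->R} * E_L * DeltaE_L. Both depend on the same DeltaE_L.
        buildResiduals(residuals, /*view=*/0, E_L);
        buildResiduals(residuals, /*view=*/1, E_LtoR * E_L);

        cv::Matx44d dE = estimateMotion(residuals, G);
        E_L = E_L * dE;          // Eq. (6) for the left camera
        E_R = E_LtoR * E_L;      // right pose from the stereo extrinsics
    }
}
```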

2.2 Rendering stereo images

In the proposed framework, three kinds of image processing are performed to render the augmented images: color space conversion, stereo image rectification, and predistortion. These processes are well suited to parallel processing, because the same operation is performed independently for each pixel. Therefore, we implemented them on the GPU using programmable shaders in order to enhance performance. Fortunately, there was no bottleneck due to data transfer from the GPU to the CPU, because the processing results are displayed immediately and asynchronously.

Although Android smartphones usually provide camera input in YUV format, the OpenGL ES API used for image rendering does not support YUV textures. Thus, camera images must be converted from YUV to RGB for camera preview rendering. This conversion is simply a matrix multiplication for each pixel value and can be performed faster on a GPU.
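
The paper does not list its shader; the following GLSL ES fragment shader (embedded as a C++ string) is an illustrative sketch of the per-pixel conversion, assuming the Y plane and the interleaved chroma plane are uploaded as two textures and using the common BT.601 coefficients. The exact chroma channel order depends on the camera's YUV format.

```cpp
// Illustrative YUV-to-RGB fragment shader for the camera preview (not the authors' code).
// Assumes the Y plane is a single-channel texture and the chroma plane is a
// half-resolution GL_LUMINANCE_ALPHA texture (U in .r, V in .a).
static const char* kYuvToRgbFragmentShader = R"(
precision mediump float;
varying vec2 vTexCoord;
uniform sampler2D uTexY;    // luminance plane
uniform sampler2D uTexUV;   // interleaved chroma plane

void main() {
    float y  = texture2D(uTexY,  vTexCoord).r;
    vec2  uv = texture2D(uTexUV, vTexCoord).ra - vec2(0.5);   // U, V in [-0.5, 0.5]
    // BT.601 conversion as a single matrix multiply per pixel (columns listed below).
    vec3 rgb = mat3(1.0,    1.0,    1.0,      // column 0: Y weights
                    0.0,   -0.344,  1.772,    // column 1: U weights
                    1.402, -0.714,  0.0)      // column 2: V weights
               * vec3(y, uv);
    gl_FragColor = vec4(rgb, 1.0);
}
)";
```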

The converted RGB image needs to be rectified because camera misalignment and lens distortion introduced during manufacturing can interfere with the 3-D effect. Stereo image rectification virtually aligns the camera image planes to be coplanar and resolves this problem. Since stereo image rectification, including image undistortion, is a nonlinear transformation, each pixel is warped inversely using a transformation map. The transformation map contains the coordinates from which each pixel is sampled and can be computed at the offline stage. Although this process is quite simple, the use of the transformation map and bilinear interpolation requires a large amount of memory access and many floating point instructions; when these operations are performed sequentially, they can be a burden on a mobile phone with low computing power and a small cache.
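For reference, the same inverse warping can be expressed on the CPU with OpenCV as below, assuming the per-pixel maps were computed offline; in the framework itself this lookup and bilinear interpolation run in a fragment shader, with the maps stored in a texture.

```cpp
// CPU reference for the inverse warping with a precomputed rectification map.
#include <opencv2/imgproc.hpp>

cv::Mat rectifyImage(const cv::Mat& rgb,
                     const cv::Mat& mapX,   // CV_32FC1: source x for each destination pixel
                     const cv::Mat& mapY)   // CV_32FC1: source y for each destination pixel
{
    cv::Mat rectified;
    // Each destination pixel is sampled at (mapX, mapY) with bilinear interpolation;
    // this memory-heavy lookup is what motivates the GPU implementation.
    cv::remap(rgb, rectified, mapX, mapY, cv::INTER_LINEAR);
    return rectified;
}
```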

After stereo image rectification, virtual contents are rendered onto the rectified stereo image using the camera poses, and the augmented image can be displayed. However, when the framework runs with a video see-through HMD docked in the smartphone, the lenses inside the HMD, which provide a wide field of view, cause geometric distortion and chromatic aberration, as shown in Fig. 4a. In our framework, only the geometric distortion is handled, because it may not be possible to correct chromatic aberration in real time. The geometric distortion can be corrected by applying predistortion to the image to be displayed, as shown in Fig. 4b.

Fig. 4 a Pincushion distortion due to the lens in a video see-through HMD. b Lens distortion correction by rendering a predistorted image

In our framework, a predistortion method based on texture mapping [9] is used. After the augmented images are rendered to an offscreen buffer, each pixel in the buffer is warped using a distortion map computed in advance; this warping is similar to that used for stereo image rectification. The distortion map was obtained by a camera calibration algorithm using a chessboard pattern image distorted into a barrel shape.
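
As a rough illustration, the sketch below builds a radial predistortion map and applies it to a rendered frame. The actual map in the paper was derived from a calibration of a barrel-distorted chessboard image, so the simple radial model and the coefficients here are only stand-ins.

```cpp
// Illustrative predistortion map (radial model with hypothetical coefficients k1, k2);
// the rendered frame is warped with this map just before display.
#include <opencv2/imgproc.hpp>

void buildPredistortionMap(cv::Size size, float k1, float k2,
                           cv::Mat& mapX, cv::Mat& mapY)
{
    mapX.create(size, CV_32FC1);
    mapY.create(size, CV_32FC1);
    const float cx = size.width * 0.5f, cy = size.height * 0.5f;
    for (int y = 0; y < size.height; ++y)
        for (int x = 0; x < size.width; ++x) {
            // Normalized offset from the lens center of this (per-eye) view.
            float nx = (x - cx) / cx, ny = (y - cy) / cy;
            float r2 = nx * nx + ny * ny;
            float scale = 1.f + k1 * r2 + k2 * r2 * r2;   // radial distortion factor
            mapX.at<float>(y, x) = cx + nx * scale * cx;  // source pixel to sample
            mapY.at<float>(y, x) = cy + ny * scale * cy;  // (barrel-shaped output)
        }
}

// Usage sketch: warp the augmented frame before display.
//   buildPredistortionMap(frame.size(), 0.22f, 0.24f, mapX, mapY);
//   cv::remap(frame, predistorted, mapX, mapY, cv::INTER_LINEAR);
```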

3 Experimental results

We implemented the proposed framework on the Android platform; the device used for the experiments was an LG Optimus 3-D smartphone with a stereo camera and parallax barrier display, a dual-core 1.0 GHz CPU, and 512 MB RAM (Fig. 5).

Fig. 5 Optimus 3-D smartphone

The distortion caused by a video see-through HMD can easily be corrected by using software development kits that provide predistortion functionality (e.g., the Cardboard SDK for Android, the Oculus SDK, and the Mali VR SDK); unfortunately, however, our target device was not supported by any of these SDKs because of its old Android OS version and limited GPU capability. Therefore, we implemented predistortion ourselves as described in Sect. 2.2. The distorted pattern image for computing the predistortion map was created by applying a distortion filter in Photoshop and fitting the result to a screenshot captured from another device running the Google Cardboard application.

We calibrated the stereo camera in advance in order to obtain the cameras' intrinsic parameters, the distortion coefficients, and the geometric relationship between the left and right cameras; information for stereo image rectification, such as the rectification transformation matrices and the lookup tables, was also computed after stereo calibration. At this point, we fixed the focal length of the cameras and disabled the auto-convergence mode, which adjusts disparity according to the depth of a subject by a disparity remapping-like algorithm [12], since this mode would invalidate the stereo camera calibration. Figure 6 illustrates the cameras and the calibration patterns with respect to the left camera. The distance between the cameras along the x-axis was estimated at 23.29 mm, which indicates that the cameras were correctly calibrated, because this value is close to the baseline given in the device specifications (24 mm).
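
The offline stage could be organized as in the following OpenCV sketch, assuming chessboard corners have already been collected for both cameras; the translation vector returned by stereo calibration contains the baseline that is compared against the device specification above. Flag choices and names are illustrative, not a record of our exact procedure.

```cpp
// Illustrative offline stereo calibration and rectification-map computation.
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

void calibrateStereoRig(const std::vector<std::vector<cv::Point3f>>& objPts,
                        const std::vector<std::vector<cv::Point2f>>& leftPts,
                        const std::vector<std::vector<cv::Point2f>>& rightPts,
                        cv::Size imageSize,
                        cv::Mat& mapLx, cv::Mat& mapLy,
                        cv::Mat& mapRx, cv::Mat& mapRy)
{
    // Intrinsics and distortion of each camera from the same chessboard views.
    cv::Mat K_L, D_L, K_R, D_R;
    std::vector<cv::Mat> rv, tv;
    cv::calibrateCamera(objPts, leftPts,  imageSize, K_L, D_L, rv, tv);
    cv::calibrateCamera(objPts, rightPts, imageSize, K_R, D_R, rv, tv);

    // Left-to-right rotation R and translation T (T holds the baseline).
    cv::Mat R, T, E, F;
    cv::stereoCalibrate(objPts, leftPts, rightPts, K_L, D_L, K_R, D_R,
                        imageSize, R, T, E, F, cv::CALIB_FIX_INTRINSIC);

    // Rectification transforms and projection matrices for both views.
    cv::Mat R_L, R_R, P_L, P_R, Q;
    cv::stereoRectify(K_L, D_L, K_R, D_R, imageSize, R, T, R_L, R_R, P_L, P_R, Q);

    // Per-pixel lookup maps; at runtime these are uploaded to the GPU as textures.
    cv::initUndistortRectifyMap(K_L, D_L, R_L, P_L, imageSize, CV_32FC1, mapLx, mapLy);
    cv::initUndistortRectifyMap(K_R, D_R, R_R, P_R, imageSize, CV_32FC1, mapRx, mapRy);
}
```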

Figures 7 and 8 show the target objects to be tracked and their 3-D models. We used three objects and their models: a cube whose model had 8 vertices and 18 edges; a piece of pottery whose model had 673 vertices and 2000 edges; and a cat figure whose model had 1252 vertices and 3750 edges. We chose objects whose models differed in shape complexity in order to explore how the performance of our framework varied with model complexity.

Fig. 6 Stereo camera calibration result of the Optimus 3-D smartphone

Fig. 7 a Cube and pottery and b their 3-D mesh models for tracking

Fig. 8 a Cat figure and b its 3-D mesh model for tracking

Figure 9 shows the results of tracking the pottery object and the augmentation using the proposed framework. In order to visualize the 3-D images on paper, they were rendered in anaglyph form instead of on the parallax barrier display. The 3-D model of the target object, shown in green, was successfully and correctly overlaid on the object. Figure 10 demonstrates the augmentation on the autostereoscopic display and the predistorted image for the video see-through HMD; tracking can be performed even with a complex model.

Fig. 9 Tracking and augmentation results (pottery). Top: side-by-side images. Bottom: anaglyph images

Fig. 10 Tracking and augmentation results: a augmented side-by-side image, b the side-by-side image (a) displayed on the autostereoscopic display, and c the side-by-side image (a) rendered as a predistorted image for the video see-through HMD

Table 1 Average processing time in each thread
Table 2 Processing times for color space conversion and stereo image rectification in rendering thread on CPU and GPU
Fig. 11 Target object for comparison of jointly estimated and independently estimated poses: a a box with a marker and b its 3-D model

Table 1 shows the average processing time of each thread over 500 frames. We measured the elapsed time only for color space conversion and stereo image rectification in the rendering thread. The processing time in the rendering thread was stable, because it depended not on the complexity of the model but on the image resolution. The processing time in the tracking thread was longer than that of the rendering thread even when the simplest model was used. Since the device had a dual-core CPU, the two threads could run in parallel; therefore, the overall speed was bounded by the tracking process. In this experiment, the minimum speed of the framework was about 14 fps, which meant that the framework could run at an interactive rate. On the other hand, the use of the GPU allowed faster processing than the CPU. Table 2 shows the average processing times of color conversion and stereo image rectification on the CPU and GPU. We achieved about an eightfold speed-up by implementing these processes on the GPU.

Fig. 12 Comparison of estimated trajectories from ARToolKit (blue line) and edge-based 3-D object tracking with independent pose estimation (orange line) and joint pose estimation (gray line) on a stereo image sequence. Top: left camera. Bottom: right camera

In order to verify that joint camera pose estimation is superior to independent estimation of each camera pose in terms of accuracy and robustness, we compared jointly estimated poses and independently estimated poses against ARToolKit [2], which was regarded as ground truth because of its robustness and accuracy. We used a box with an attached marker (Fig. 11); this allowed both edge-based 3-D object tracking and marker-based tracking to be applied to the same image sequence. The image sequence had 500 frames and was captured with the stereo camera of the Optimus 3-D smartphone. Figure 12 shows the camera trajectories estimated jointly and independently using the box model, as well as those from ARToolKit. The differences were mostly insignificant; however, the error of the independently estimated position of the left camera (orange line) increased between frames 418 and 449. Such an error results in inconsistent augmentation on a stereo display, which can disrupt immersion and cause visual discomfort; this problem can be avoided by adopting joint camera pose estimation. In contrast, joint camera pose estimation maintained the geometric relationship between the left and right cameras at every frame.

However, the stereo camera tracking we used can fail if the camera motion between adjacent frames is too large, because the correspondence search range cannot cover the motion and edges cannot be detected due to motion blur. Since this is an intrinsic problem of edge-based tracking, it is difficult to avoid tracking failure under large motion. In order to handle tracking failure, a recovery method such as [13] should be adopted. Unfortunately, the device we used did not have sufficient computing power to perform such a recovery method; if newer devices with up-to-date hardware and a stereo camera are released, recovery methods can be adopted.

4 Conclusion

This paper presented an augmented reality framework for mobile stereo displays. Our framework supports an autostereoscopic display as well as a video see-through display such as Google Cardboard. Joint camera pose estimation in stereo camera tracking allows precise registration of the real and virtual worlds and consistent augmentation across both views. Utilizing the GPU and multi-threading enabled the framework to run at an interactive rate, despite the many computations required for stereo camera tracking and image warping. The experiments and demonstrations showed the feasibility of the framework.

Nevertheless, there is considerable room for improvement of our framework. One issue is that the distortion mapping function for predistortion was only approximated rather than estimated to precisely match the characteristics of the display. Another is that the use of an outdated device restricted the potential speed-up of the tracking through optimization. For example, recent devices offer many opportunities for optimization by supporting OpenCL and memory mapping between the CPU and GPU. We are currently conducting further optimization of the tracking for additional speed-up and robustness.