KI - Künstliche Intelligenz, Volume 28, Issue 4, pp 321–324

Modeling Reality for Camera Registration in Augmented Reality Applications

Doctoral and Postdoctoral Dissertations

1 Introduction

One of the central problems of Augmented Reality is to make reality and virtual objects coincide in some way. Any technique aiming at solving this problem requires an internal way of representing the real objects of interest, i.e. the objects that the system expects to see. We name these representations Reality Models and, in this work, take a closer look at the various ways of representing reality in the context of AR. We propose a classification of AR applications based on their requirements on Reality Models. AR applications can first be classified into applications where the full 6DOF camera pose is recovered and applications where a 2D object localization in image space is sufficient. We refine the classification by providing examples of planar fiducials and textured objects in the first case, and of object detection and local pose estimation in the second. In all provided examples we extend the state of the art by providing new Reality Models or better ways to construct and use existing ones.

Augmented Reality is a powerful and intuitive technology that can be used in various situations, ranging from small smartphone games to large-scale outdoor augmentations for marketing and tourism. However, these diverse possibilities share one common feature: in order to augment reality, at least some parts of the real environment need to be known in advance and modeled in an appropriate way. This prior knowledge is necessary to allow an AR system to build the “link” between the real world and its virtual counterpart. In this context, the challenge is that the mathematical models used for real objects have to meet the requirements of an AR application: the representation should be sparse to ensure a low memory footprint, robust to the different kinds of alteration that arise in the context of optical sensors (illumination changes, partial occlusions, shape deformations), and at the same time complete enough to allow for recovering 3D information (object position or complete camera pose) in real time.
Fig. 1

Proposed classification of AR applications

A look at the state of the art shows that this aspect of Augmented Reality has never been analyzed thoroughly. In this work, our goal is to investigate how real objects and scenes can be efficiently modeled for AR applications and to develop solutions for reality modeling and pose estimation for different types of AR scenarios. We therefore propose a classification of Reality Models based on the type of real objects used, the nature of the virtual augmentation and the type of registration they permit. In this context, we present novel representations and provide a detailed analysis of their use in AR. Depending on the type of information available on the real objects present in the scene, and on the type of AR application being built, two different approaches for registration can be developed: a full 3D registration of the camera or a 2D object labeling approach (see Fig. 1). A full 3D registration means that all the parameters of the camera are known. As a consequence, virtual objects can be inserted in the real scene at an exact position and appear as if they were completely integrated in the environment. Reality Models for this approach can be further divided into marker-based models and models based on textured objects. 2D object labeling is used when the focus of the application is not the integration of virtual objects in a real scene, but rather the automatic identification of objects or scene parts in order to provide contextual information to the user. For this approach, the Reality Models can use object detection or extended object detection (Fig. 1).

In the remainder of the article, we review the different types of Reality Models and detail the proposed improvements over the state of the art. Further details can be found in the PhD thesis [1].

2 Marker-based 3D registration

Historically, the first way of storing Reality Models for AR was to use planar markers. Most of the markers found in the state of the art are square, black markers containing a code that can be used for identifying the marker. One major drawback of square markers is that the computation of the camera pose relies on the precise determination of the marker's four corners, which can be difficult in the case of occlusion. In our work, we instead advocate the use of circular markers, as we believe that they are easier to detect and provide a pose estimate that is more robust to noise. This robustness is due to the equal role played by all the contour's points in the ellipse fitting algorithm and the subsequent pose estimation algorithm. Unlike existing systems using circular markers, our method computes the exact pose from one single circular marker and does not require specific points to be explicitly shown on the marker (such as the center or axis orientations). Indeed, the center and orientation are encoded directly in the marker's code. We can thus use the entire marker surface for the code design, which we use to encode 16 bits over two rings. After solving the back projection problem for one conic correspondence, we end up with two possible poses. The marker's code, its rotation and the final pose can be computed in one single step, using a pyramidal cross-correlation optimizer [3]. Figure 2 shows one of our markers, augmented by a virtual cow.
Fig. 2

The new circular marker with a virtual augmentation
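To illustrate why the equal role of all contour points makes the fit robust, here is a minimal sketch of an algebraic least-squares fit. For brevity it fits a circle rather than the general ellipse (conic) that a projected circular marker produces; the function names are illustrative and not taken from the thesis.

```python
import math

def fit_circle(points):
    # Algebraic least-squares circle fit: x^2 + y^2 + D*x + E*y + F = 0.
    # Every contour point contributes one equation with equal weight,
    # which is what makes contour-based fitting tolerant to local noise.
    # Build the normal equations A^T A p = A^T b for p = (D, E, F).
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for x, y in points:
        row = (x, y, 1.0)
        rhs = -(x * x + y * y)
        for i in range(3):
            for j in range(3):
                ata[i][j] += row[i] * row[j]
            atb[i] += row[i] * rhs
    # Solve the 3x3 system by Gauss-Jordan elimination with pivoting.
    m = [ata[i] + [atb[i]] for i in range(3)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(3):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [a - f * b for a, b in zip(m[r], m[c])]
    D, E, F = (m[i][3] / m[i][i] for i in range(3))
    cx, cy = -D / 2.0, -E / 2.0
    radius = math.sqrt(cx * cx + cy * cy - F)
    return (cx, cy), radius
```

The real pipeline fits a full conic and then solves the back projection problem for it, but the least-squares structure, with every contour point entering symmetrically, is the same.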

3 Texture and 3D reconstruction

In some situations, standard markers cannot be used directly. This happens for example when the environment should look natural to the user, as in augmented advertisements in magazines. In this case, it is more convenient to use known images instead of black-and-white markers as fiducials. When these images are printed on paper, the user can place them anywhere in the scene, or use them as the support for the augmentation. To this aim, salient points on the surface of the poster are recognized in a matching process, and the pose of the camera is computed by solving the Perspective-n-Point (PnP) problem.
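As a simplified stand-in for the planar-fiducial case, the sketch below estimates the homography between a known planar image and its projection from four point correspondences via a Direct Linear Transform. The actual system matches many salient points and solves full PnP with a calibrated camera; all names here are illustrative.

```python
def gauss_solve(A, b):
    # Solve A x = b by Gauss-Jordan elimination with partial pivoting.
    n = len(b)
    m = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def homography_from_points(src, dst):
    # Direct Linear Transform with h33 fixed to 1: four correspondences
    # between the known poster plane and its image give an 8x8 system.
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = gauss_solve(A, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]

def project(H, x, y):
    # Apply the homography to a point on the poster plane.
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

With known camera intrinsics, such a plane-to-image homography can be decomposed into the camera pose, which is what the augmentation needs.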

This concept can be extended to any textured object, as long as its geometry is known. When the complete environment can be modeled, the full model can be used for 3D registration of a camera. However, in the case of texture-based camera registration, the geometry of the underlying object has to be known. The thesis therefore investigates 3D reconstruction from photographs. State-of-the-art reconstruction methods use standard perspective images and recover the geometry by first retrieving the positions of the cameras (Structure from Motion, SfM) and then the dense geometry of the object (Multi-View Stereo, MVS). For complex scenes, we show that spherical cameras, which capture the complete surroundings from one single point in space, are far better suited [2]. We therefore develop the concepts of SfM and MVS for spherical, high-resolution cameras. As an example, Fig. 3 shows reconstructions of the Fritz Walter Stadium in Kaiserslautern.
Fig. 3

3D Reconstruction of the Fritz Walter Stadium with spherical cameras
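A small sketch of the spherical camera model, assuming an equirectangular image layout and a y-up coordinate convention (both conventions are assumptions for illustration, not taken from the thesis): every pixel maps to a unit viewing ray, so one image covers the full sphere of directions.

```python
import math

def pixel_to_ray(u, v, width, height):
    # Map an equirectangular pixel to a unit viewing ray. Horizontal
    # position encodes longitude over the full 360 degrees, vertical
    # position encodes latitude from +90 (top) to -90 (bottom) degrees.
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    return (math.cos(lat) * math.sin(lon),   # x: right
            math.sin(lat),                   # y: up
            math.cos(lat) * math.cos(lon))   # z: forward
```

Because the SfM and MVS equations are formulated on such rays instead of perspective image coordinates, a single spherical image constrains geometry in all directions at once, which is the advantage exploited for complex scenes.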

4 Object detection for AR labeling

Fig. 4

AR Labeling in a working environment

The idea of AR labeling is to provide localized information to the user in the form of textual or iconographic information in the vicinity of the 2D location of an object. This information can, for example, be an indicator drawing the user's attention to a specific object in the scene (see Fig. 4). The use of 2D labels for Augmented Reality applications is quite recent. The first goal of computer vision techniques applied to AR was to recover the exact camera position and orientation (pose estimation) as well as the camera's internal parameters (camera calibration), in order to seamlessly integrate virtual objects in a real scene. AR labeling originally emerged from multi-sensor technologies integrating a camera, a GPS sensor, a compass and sometimes inertial measurement units in one single, portable device. Thus, the first AR labeling applications used the camera merely as a visual support for displaying the information, meaning that no image processing was involved in the labeling process. In general, using image information in addition to the other sensors can refine the first position estimate and help make AR labeling applications more robust. We therefore investigated vision-based methods for object recognition and detection.

In the context of Augmented Reality labeling, our aim is to recognize a specific object among similar ones. We therefore need a descriptor for one specific instance of an object. Thus, the descriptors for object recognition are tuned to be highly distinctive for a particular object, and the best descriptors cannot be automatically computed from a large database of similar objects. We are therefore interested in a representation of objects based on their unique features. In addition, the object of interest should be localized in the picture in real time, without prior information about its position. As the object can be seen at various scales and viewing angles, the recognition algorithm has to test many possible locations in the image. We therefore focused on object representations that permit a fast exhaustive search, and show how the computation of intermediate image representations can speed up the process.
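The exhaustive search over positions and scales described above can be sketched as a generic sliding-window loop; the parameter names and defaults are illustrative, and `score` stands for any descriptor-distance function (lower is better).

```python
def sliding_window_search(width, height, score,
                          scales=(1.0, 0.75, 0.5), base=32, step=8):
    # Exhaustively score candidate square windows over positions and
    # scales and keep the best one. The cost is dominated by the score
    # function, which is why fast region descriptors matter.
    best_score, best_box = float("inf"), None
    for s in scales:
        size = int(base * s)
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                v = score(x, y, x + size, y + size)
                if v < best_score:
                    best_score, best_box = v, (x, y, size)
    return best_box
```

With a naive descriptor, each call to `score` would rescan all pixels of the window; the intermediate representations discussed next reduce that per-window cost to a few lookups, which is what makes the exhaustive search real-time capable.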

In particular, we developed the concept of the Integral P-Channel representation [5], which permits fast computation of object descriptors for a large number of regions in an image. The P-Channel representation can be seen as an extension of histograms, where each bin (each channel) additionally stores the sum of offsets from the channel center. Another difference is the memory footprint of the two methods: while a histogram with \(n\) bins per dimension for \(D\) individual features requires storing \(n^D\) values, the complete P-Channel representation can be stored using \(n^D\times (D+1)\) values (see [1] for details).
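To illustrate the "integral" part of the idea, the sketch below builds one integral image per channel so that a per-channel sum over any rectangle costs four lookups. It covers only the histogram (count) part of the representation, not the offset sums that full P-Channels also integrate; all names are illustrative.

```python
def build_integral_channels(img, n_bins, vmax):
    # One integral image per channel (bin). After this O(h*w) pass,
    # the channel counts of ANY rectangular region can be read out in
    # constant time, independent of the region size.
    h, w = len(img), len(img[0])
    ii = [[[0] * (w + 1) for _ in range(h + 1)] for _ in range(n_bins)]
    for y in range(h):
        for x in range(w):
            b = min(int(img[y][x] * n_bins / vmax), n_bins - 1)
            for c in range(n_bins):
                ii[c][y + 1][x + 1] = (ii[c][y][x + 1] + ii[c][y + 1][x]
                                       - ii[c][y][x] + (1 if c == b else 0))
    return ii

def region_histogram(ii, x0, y0, x1, y1):
    # Four lookups per channel, regardless of how large the region is.
    return [ch[y1][x1] - ch[y0][x1] - ch[y1][x0] + ch[y0][x0] for ch in ii]
```

This constant-time region query is exactly what makes the exhaustive sliding-window search over many candidate regions affordable.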

5 Learning-based partial pose estimation

In some situations, it can be helpful to recover more information than only the position and size of the object. In these scenarios, additional knowledge about the pose of the real object can be obtained directly from the object's appearance. This knowledge can be, for example, a local affine transformation of the object, or a planar homography in the case of locally planar objects. A first advantage is that the visual appearance of the augmentation can be adapted to the local transformation. For example, if the object is planar (such as a book) or can be coarsely approximated by a plane (e.g. building facades), virtual information such as written text can be warped to appear as if it followed the object's transformation. This property can be used, for example, in mobile Augmented Reality, where full pose estimation is not possible but affine or homographic transformations can be computed.

In this context, we propose a method where learning techniques are used to generate the Reality Model [4]. The idea is that if a database containing a large number of views of a particular object under various illumination conditions is available, this database can be used to learn the object features that are most useful for object recognition or pose estimation. The advantage is that Machine Learning methods automatically select the object features best suited to the problem at hand. However, these methods rely on the existence of a large database of examples, each consisting of an input image paired with an output pose. Because such a database does not necessarily exist, we show that it can be artificially generated from one single view of the object using image synthesis techniques.
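A sketch of such synthetic training-set generation, simplified to random affine warps (the actual method synthesizes full perspective views; sampling ranges and helper names are illustrative): each call yields a (warped patch, pose parameters) pair usable as one training example.

```python
import math
import random

def random_affine(max_rot=math.pi / 6, max_scale=0.2):
    # Draw a random rotation + scale; the parameters become the
    # regression target for the pose estimator.
    a = random.uniform(-max_rot, max_rot)
    s = 1.0 + random.uniform(-max_scale, max_scale)
    c, si = s * math.cos(a), s * math.sin(a)
    return ((c, -si), (si, c)), (a, s)

def warp_patch(patch, A, cx, cy):
    # Inverse-map each output pixel through A (about the patch centre)
    # with nearest-neighbour sampling; out-of-bounds pixels stay 0.
    (a, b), (c, d) = A
    det = a * d - b * c
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    h, w = len(patch), len(patch[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - cx, y - cy
            sx, sy = ia * dx + ib * dy + cx, ic * dx + id_ * dy + cy
            xi, yi = int(round(sx)), int(round(sy))
            if 0 <= xi < w and 0 <= yi < h:
                out[y][x] = patch[yi][xi]
    return out
```

Repeating this a few thousand times on the single reference view produces the input/output pairs the regressor needs, without ever photographing the object under different poses.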
Fig. 5

The local orientation is computed from the patch appearance, even under severe occlusion

In particular, we present a new method for inferring the local 3D orientation of keypoints from their appearance. The method is based on the idea that the relation between keypoint appearance and pose can be learned efficiently with an adequate regressor. Using one reference view of a keypoint, it is possible to train a keypoint-specific regressor that takes the point's appearance as input and delivers the local perspective transformation as output. We show that an elegant choice of regressor is a set of sparse regressors applied sequentially in a cascade. In our case, we use a set of parametrized multivariate relevance vector machines (MVRVM) to learn the local 8-dimensional homography from the patch's normalized pixel values. We show that using a cascade of regressors, ranging from coarse pose approximation to fine rectifications, considerably speeds up the identification and pose estimation process. Moreover, we show that our method improves the precision of classical point detectors, as the location of the point is rectified together with the homography. Because the input of the regressor consists only of intensities inside the patch, the local orientation can be recovered even in the case of severe occlusion, as shown in Fig. 5. The resulting system is able to recover the orientation of patches in real time.
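The coarse-to-fine cascade principle can be sketched generically: each stage regresses a correction to the estimate left by the previous stages. The toy scalar stages below are purely illustrative and stand in for the MVRVM regressors of the actual method.

```python
def cascade_predict(x, stages):
    # Run the stages sequentially; each one sees the input and the
    # current estimate and returns a correction. Early stages can be
    # cheap and coarse, later stages small and precise.
    y = 0.0
    for stage in stages:
        y = y + stage(x, y)
    return y

# Toy example: predict f(x) = 3x + 1 with a coarse stage that only
# sees a quantized input, followed by an oracle residual corrector.
coarse = lambda x, y: 3.0 * round(x)
fine = lambda x, y: (3.0 * x + 1.0) - y
```

The same structure carries over to pose regression: the first stages bring the homography estimate roughly right, so the final stages only model small residual transformations, which keeps each individual regressor sparse and fast.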

6 Conclusion

Looking at Augmented Reality under the aspect of Reality Models led us to a study of a variety of techniques for diverse AR applications. We identified 3D registration and AR labeling as the two major possibilities for registration and derived several novel solutions for these two types of challenges. As a starting point for future research, we would like to investigate dynamic Reality Models for non-static scenes.



This work was partially funded by the German BMBF project DENSITY (01IW12001).


References

  1. Pagani A (2014) Reality Models for efficient registration in Augmented Reality. PhD thesis, University of Kaiserslautern, Dr. Hut Verlag
  2. Pagani A, Gava C, Cui Y, Krolla B, Hengen JM, Stricker D (2011) Dense 3D point cloud generation from multiple high-resolution spherical images. In: Proceedings of the International Symposium on Virtual Reality, Archaeology and Cultural Heritage (VAST)
  3. Pagani A, Koehler J, Stricker D (2011) Circular markers for camera pose estimation. In: Proceedings of the International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS)
  4. Pagani A, Stricker D (2009) Learning local patch orientation with a cascade of sparse regressors. In: Proceedings of the British Machine Vision Conference (BMVC)
  5. Pagani A, Stricker D, Felsberg M (2009) Integral P-channels for fast and robust region matching. In: Proceedings of the IEEE International Conference on Image Processing (ICIP)

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

DFKI GmbH, Kaiserslautern, Germany
