Modeling Reality for Camera Registration in Augmented Reality Applications
One of the central problems of Augmented Reality is to make reality and virtual objects coincide in some way. Any technique aiming at solving this problem requires an internal representation of the real objects of interest, i.e. the objects that the system expects to see. We name these representations Reality Models and, in this work, take a closer look at the various ways of representing reality in the context of AR. We propose a classification of AR applications based on their requirements on Reality Models. AR applications can first be classified into applications where the full 6DOF camera pose is recovered and applications where a 2D object localization in image space is sufficient. We further refine this classification by providing examples of planar fiducials and textured objects in the first case, and of object detection and local pose estimation in the second. In all provided examples we extend the state of the art, either by introducing new Reality Models or by proposing better ways to construct and use existing ones.
A look at the state of the art shows that this aspect of Augmented Reality has never been analyzed thoroughly. In this work, our goal is to investigate how real objects and scenes can be efficiently modeled for AR applications and to develop solutions for reality modeling and pose estimation for different types of AR scenarios. We therefore propose a classification of Reality Models based on the type of real objects used, the nature of the virtual augmentation and the type of registration they permit. In this context, we present novel representations, and provide a detailed analysis of their use in AR. Depending on the type of information available on the real objects present in the scene, and on the type of AR application being built, two different approaches for registration can be developed: a full 3D registration of the camera or a 2D object labeling approach (see Fig. 1). A full 3D registration means that all the parameters of the camera are known. As a consequence, virtual objects can be inserted in the real scene at an exact position and they appear as if they were completely integrated in the environment. Reality Models for this approach can be further divided into marker-based models, and models based on textured objects. 2D object labeling is used when the focus of the application is not the integration of virtual objects in a real scene, but rather the automatic identification of objects or scene parts in order to provide contextual information to the user. For this approach, the Reality Models can use object detection or extended object detection (Fig. 1).
In the remainder of the article, we review the different types of Reality Models and detail our proposed improvements over the state of the art. Further details can be found in the PhD thesis [1].
2 Marker-based 3D registration
3 Texture and 3D reconstruction
In some situations, standard markers cannot be used directly. This happens, for example, when the environment should look natural to the user, as in augmented advertisements in magazines. In this case, it is more convenient to use known images instead of black-and-white markers as fiducial objects. When these images are printed on paper, the user can place them anywhere in the scene, or use them as the support of the augmentation. To this end, salient points on the surface of the poster are recognized in a matching process, and the pose of the camera is computed by solving the Perspective-\(n\)-Point (PnP) problem.
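As a minimal illustration of this registration step, the sketch below solves the closely related linear resection problem with the Direct Linear Transform: given six or more exact 2D–3D correspondences (here synthetic, standing in for matched salient points on a poster), it recovers the full projection matrix. This is a simplified stand-in, not the PnP solver of the thesis: a proper PnP formulation keeps the known intrinsics fixed and estimates only rotation and translation.

```python
import numpy as np

def project(P, X):
    """Project Nx3 world points with a 3x4 projection matrix P."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]

def dlt_resection(X, x):
    """Recover a 3x4 projection matrix (up to scale) from >= 6 exact
    2D-3D correspondences via the Direct Linear Transform (SVD)."""
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        Xh = [Xw, Yw, Zw, 1.0]
        A.append([0.0] * 4 + [-c for c in Xh] + [v * c for c in Xh])
        A.append(Xh + [0.0] * 4 + [-u * c for c in Xh])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)

# synthetic ground truth: intrinsics K, rotation R about the y axis, translation t
K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
a = 0.3
R = np.array([[np.cos(a), 0., np.sin(a)],
              [0., 1., 0.],
              [-np.sin(a), 0., np.cos(a)]])
t = np.array([[0.1], [-0.2], [5.0]])
P_true = K @ np.hstack([R, t])

rng = np.random.default_rng(1)
pts3d = rng.uniform(-1, 1, size=(10, 3))   # stand-ins for matched keypoints
pts2d = project(P_true, pts3d)             # their noise-free image positions
P_hat = dlt_resection(pts3d, pts2d)
reproj_err = np.abs(project(P_hat, pts3d) - pts2d).max()
```

On noise-free data the reprojection error of the recovered matrix is numerically zero; with real detections, a robust scheme such as RANSAC would wrap this estimation.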
4 Object detection for AR labeling
The idea of AR labeling is to provide localized information to the user in the form of text or icons in the vicinity of the 2D location of an object. This information can be, for example, an indicator drawing the user's attention to a specific object in the scene (see Fig. 4). The use of 2D labels for Augmented Reality applications is quite recent. The first goal of computer vision techniques applied to AR was to recover the exact camera position and orientation (pose estimation) as well as the camera's internal parameters (camera calibration), in order to seamlessly integrate virtual objects into a real scene. AR labeling originally emerged from multi-sensor technologies integrating a camera, a GPS sensor, a compass, and sometimes inertial measurement units in a single portable device. The first applications of AR labeling therefore used the camera merely as visual support for displaying the information, meaning that no image processing was involved in the labeling process. In general, using image information in addition to the other sensors can refine the first position estimate and help make AR labeling applications more robust. We therefore investigated vision-based methods for object recognition and detection.
In the context of Augmented Reality labeling, our aim is to recognize a specific object among similar ones. We therefore need a descriptor for one specific instance of an object. The descriptors for object recognition are thus tuned to be highly distinctive for a particular object, and the best descriptors cannot be automatically computed from a large database of similar objects. We are therefore interested in a representation of objects based on their unique features. In addition, the object of interest should be localized in the picture in real time, without prior information about its position. As the object can be seen under various scales and viewing angles, the recognition algorithm must test many possible locations in the image. We therefore focused on object representations that permit a fast exhaustive search, and show how the computation of intermediate image representations can speed up the process.
In particular, we developed the concept of the Integral P-Channel representation [5], which permits fast computation of object descriptors for a large number of regions in an image. The P-Channel representation can be seen as an extension of histograms, where each bin (each channel) stores as additional information the sum of offsets from the channel center. Another difference is the memory footprint of the two methods: while a histogram with \(n\) bins per dimension and \(D\) individual features requires storing \(n^D\) values, the complete P-Channel representation can be stored using \(n^D\times (D+1)\) values (see [5] for details).
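The underlying integral-image idea can be sketched for the plain histogram special case (the full P-Channel representation additionally accumulates the per-channel offset sums): after a single pass over the image, the histogram of any axis-aligned region can be read off with four lookups per bin, which is what makes an exhaustive search over many candidate regions fast.

```python
import numpy as np

def integral_histogram(img, n_bins):
    """Per-bin integral images for an image with values in [0, 1):
    O(N) precomputation, then any region histogram costs O(n_bins)."""
    bins = np.minimum((img * n_bins).astype(int), n_bins - 1)
    one_hot = (bins[..., None] == np.arange(n_bins)).astype(np.int64)
    # 2D prefix sums per bin, padded with a zero row and column
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1, n_bins), np.int64)
    ii[1:, 1:] = one_hot.cumsum(axis=0).cumsum(axis=1)
    return ii

def region_hist(ii, r0, c0, r1, c1):
    """Histogram of rows r0..r1-1, cols c0..c1-1 via 4 lookups per bin."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

rng = np.random.default_rng(0)
img = rng.random((20, 30))                 # toy grayscale image in [0, 1)
ii = integral_histogram(img, 8)
hist = region_hist(ii, 3, 5, 15, 20)       # constant-time region descriptor
```

This is an illustration of the integral-descriptor principle only; the actual representation of [5] stores offset sums alongside the counts, as described above.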
5 Learning-based partial pose estimation
In some situations, it can be helpful to recover more information than just the position and size of the object. In these scenarios, additional knowledge about the pose of the real object can be gained directly from the object's appearance. This knowledge can be, for example, a local affine transformation of the object, or a planar homography in the case of locally planar objects. A first advantage is that the visual appearance of the augmentation can be adapted to the local transformation. For example, if the object is planar (a book, for instance) or can be coarsely approximated by a plane (e.g., the facade of a building), virtual information such as written text can be warped to appear as if it were following the object's transformation. This property can be used, for example, in mobile Augmented Reality, where full pose estimation is not possible but affine or homographic transformations can be computed.
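The warping step can be illustrated in a few lines of numpy: once the homography of the planar object is known, the corners of a virtual label are mapped through it so that the label follows the object's transformation. The matrix used here is an arbitrary example, not a pose recovered from an image.

```python
import numpy as np

def apply_homography(H, pts):
    """Map Nx2 points through a 3x3 homography, including the
    projective division by the third coordinate."""
    ph = np.hstack([pts, np.ones((len(pts), 1))])
    q = (H @ ph.T).T
    return q[:, :2] / q[:, 2:3]

# corners of a virtual text label in the object's local plane coordinates
label = np.array([[0., 0.], [100., 0.], [100., 20.], [0., 20.]])

# example homography (a scale plus translation); in an AR system this
# would be the transformation estimated for the locally planar object
H = np.array([[2., 0., 1.],
              [0., 2., 2.],
              [0., 0., 1.]])

warped = apply_homography(H, label)   # label corners in image coordinates
```

Rendering the label inside the warped quadrilateral then makes the virtual text appear attached to the real plane.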
In particular, we present a new method [4] for inferring the local 3D orientation of keypoints from their appearance. The method is based on the idea that the relation between keypoint appearance and pose can be learnt efficiently with an adequate regressor. Using one reference view of a keypoint, it is possible to train a keypoint-specific regressor that takes the point's appearance as input and delivers the local perspective transformation as output. We show that an elegant choice of regressor is a set of sparse regressors applied sequentially in a cascade. In our case, we use a set of parametrized multivariate relevance vector machines (MVRVM) to learn the local 8-dimensional homography from the patch's normalized pixel values. We show that using a cascade of regressors, ranging from coarse pose approximation to fine rectification, considerably speeds up the identification and pose estimation process. Moreover, we show that our method improves the precision of classical point detectors, as the location of the point is rectified together with the homography. Because the input of the machine is only the intensity inside the object, the local orientation can be recovered even under severe occlusion, as shown in Fig. 5. The resulting system is able to recover the orientation of patches in real time.
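The coarse-to-fine cascade can be sketched as follows, with ordinary least-squares regressors standing in for the MVRVMs of the method and purely synthetic patch data (the real system regresses the 8 homography parameters from normalized pixel intensities of a detected keypoint patch): the first stage produces a rough estimate from a coarse view of the patch, and the second stage regresses the remaining residual from the full patch.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in data: 64-pixel patches -> 8 homography parameters
X = rng.normal(size=(500, 64))              # normalized patch intensities
W_true = rng.normal(size=(64, 8))           # unknown appearance-to-pose map
Y = X @ W_true + 0.01 * rng.normal(size=(500, 8))

def fit_linear(F, T):
    """Least-squares regressor, a stand-in for one MVRVM of the cascade."""
    W, *_ = np.linalg.lstsq(F, T, rcond=None)
    return W

# stage 1: coarse regressor on a downsampled view of the patch
coarse = X[:, ::4]                          # every 4th "pixel"
W1 = fit_linear(coarse, Y)
Y1 = coarse @ W1                            # coarse homography estimate

# stage 2: fine regressor corrects the stage-1 residual from the full patch
W2 = fit_linear(X, Y - Y1)
Y2 = Y1 + X @ W2                            # refined estimate

err1 = np.abs(Y - Y1).mean()                # error after the coarse stage
err2 = np.abs(Y - Y2).mean()                # error after refinement
```

The refined estimate is markedly more accurate than the coarse one, which mirrors the coarse-to-fine behaviour of the cascade; the sparse MVRVM regressors of the actual method additionally keep each stage cheap to evaluate.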
Looking at Augmented Reality under the aspect of Reality Models led us to a study of a variety of techniques for diverse AR applications. We identified 3D registration and AR labeling as two major possibilities for registration and derived several novel solutions for these two types of challenges. As a starting point for future research, we would like to investigate dynamic Reality Models for non-static scenes.
This work was partially funded by the German BMBF project DENSITY (01IW12001).
- 1. Pagani A (2014) Reality Models for efficient registration in Augmented Reality. PhD thesis, University of Kaiserslautern, Dr. Hut Verlag
- 2. Pagani A, Gava C, Cui Y, Krolla B, Hengen JM, Stricker D (2011) Dense 3D point cloud generation from multiple high-resolution spherical images. In: Proceedings of the International Symposium on Virtual Reality, Archaeology and Cultural Heritage (VAST)
- 3. Pagani A, Koehler J, Stricker D (2011) Circular markers for camera pose estimation. In: Proceedings of the International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS)
- 4. Pagani A, Stricker D (2009) Learning local patch orientation with a cascade of sparse regressors. In: Proceedings of the British Machine Vision Conference (BMVC)
- 5. Pagani A, Stricker D, Felsberg M (2009) Integral P-channels for fast and robust region matching. In: Proceedings of the IEEE International Conference on Image Processing (ICIP)