Introduction

Augmented Reality (AR) combines real-world imagery with synthetic content generated by a computer. The first comprehensive review of AR [1] identified a broad range of applications of this technology, including medicine, manufacturing and robot path planning. Subsequently, AR has been applied in cultural heritage [2], for example in the reconstruction of ancient Olympia in Greece [3, 4]. Although AR is reported to enrich human perception in general [5], the principal reason for performing AR reconstructions in cultural heritage is that the owners of sites are usually reluctant to permit physical reconstructions in situ, so that the archaeology remains undisturbed for future generations [3].

Developments in multimedia technology facilitate the learning experience in cultural heritage [6] with the aid of improved user interaction methods. Models and virtual tours of reconstructed archaeological sites (e.g. [7, 8]) provide entertaining means of learning. However, ex situ reconstructions such as models and movies are difficult to visualize in the context of the archaeological remains. AR reconstructions can be produced in situ with minimal physical disturbance, an attractive property, even though they may take a significant time to develop [9].

There are several forms that AR reconstructions may take, and the work reported here is directed towards a kind of ‘historical mirror,’ in which virtual buildings are built around flat surfaces visible in the real world and human participants are clothed appropriately for the historical period. The literature presents examples of using Kinect for cultural heritage [10–12]. It was used as a 3D scanner in [10] and as a motion tracker for navigating virtual reality reconstructions in [11, 12]. User experience with large screens and Kinect in a cultural heritage setting, such as an exhibition, has been demonstrated in [13].

This paper uses Kinect to establish 3D world and 2D image correspondences and, from them, determine camera pose. It then finds planar objects within the real world and augments them, the aim being to render their appearance in antiquity in front of the real-world features. The advantage of this work over the system in [14], which augments synthetic objects over camera images, is that no offline phase is needed to create a map of the environment: data from the Kinect are used in real time to augment the appearance of flat surfaces in the real world. Moreover, this can be achieved without GPU programming, which is particularly promising for low-power or mobile systems.

Using the human tracking functionality of the OpenNI library [15], humans within the environment are identified and their appearances are also augmented; existing AR systems in the cultural heritage domain do not attempt the latter, yet it is essential if participants are truly to experience a sense of presence following the ideas presented in [16] and [17].

The rest of the paper is structured as follows: "Background" presents the use of vision-based methods for augmented reality and gives examples from the literature making use of Kinect for cultural heritage. "In situ augmentation" describes a method for determining the camera pose and registering 3D models with real world objects. "Augmenting participants" then describes the use of skeleton tracking to augment participants with clothes etc. "Creating 3D models" describes the modelling phase, while "Results" assesses the effectiveness of the in situ augmentation algorithm and presents the results of augmentation. Finally, conclusions are drawn in "Conclusions".

Background

The principal problem faced in AR applications concerns the accuracy of user tracking and, consequently, the registration of 3D models with real-world features [1, 18, 19]. When the 3D structure of the environment (feature positions) and the internal and external parameters of the camera are known, sufficient accuracy can be obtained to match virtual and real objects seamlessly; as a rule of thumb, this involves identifying the direction of gaze to ~1° in azimuth and elevation, and locating the position to ~0.1 m.

It is sensible to use a camera to perform tracking, since images of the real world can be used both for rendering the synthetic content and for finding the viewpoint of the user [20]. This sort of vision-based tracking can be classified into two groups, namely marker-based and markerless methods [19].

A marker is a distinguishable element placed in the environment so that it can be identified apart from other objects [21]. Once the marker is detected, the pose of the camera can be obtained and models can be rendered on the camera image. An example of this is shown in Figure 1, where ARToolkit^a is used to track a marker and place 3D models in the scene.

Figure 1: Stages of augmentation using markers.

Placing markers is not always practical, especially when preserving the original structure of the ruins is a concern. In such cases, markerless methods can be used to extract features that are already present in the environment. These features can be points [22], lines [23] or higher-level geometric structures such as planes [24]. These natural features must be robust and stable if tracking is to be performed accurately.

Estimation of the camera pose in 3D is a widely-studied research topic [25, 26], normally addressed using vision-only methods such as the Perspective-n-Point (PnP) solution, which recovers the camera pose from the positions of 3D features and their projections. For instance, [14] recently used OpenCV’s PnP functionality for real-time AR: in a preliminary (offline) phase, a map of features was first created from known 3D positions in the real world, each point being described by a SURF descriptor [27]. Subsequently, an online system matched SURF features from the camera to those of the map, and the camera pose was then calculated. The time taken for calculating and matching SURF features is significant, so this had to be performed on a GPU.

The advent of the Kinect [28] permits both colour and depth of a scene to be sensed concurrently. This allows the 3D structure of the surroundings to be extracted and a representation of the environment to be created; the use of Kinect data with the PnP algorithm has already been investigated for image-based registration [29]. That analysis determined the relative and absolute accuracies of the Kinect and another sensor according to the camera pose obtained using the Efficient PnP (EPnP) algorithm [30].

Kinect has also been used in the context of cultural heritage. For instance, Remondino [10] presented a review of using different types of imaging and depth sensors including Kinect to perform 3D scanning of archaeological objects for the purposes of digital recording, historical documentation and preservation of cultural heritage. Richards-Rissetto et al. [11] used Kinect’s body motion detection features to perform navigation in a 3D reconstructed model of an ancient Maya city using virtual reality.

In addition to the uses described above, the sensor also provides new opportunities for the on-site learning described in [6], by presenting learners with a visualization of structures overlaid on top of the ruins, as this study aims to do.

In situ augmentation

Since viewpoint information is required for augmentation, the first step is to establish correspondences between 2D image features and their depths, from which the camera pose can be calculated. With that knowledge, rectangular features in the image are identified and augmented with columns, a common requirement in cultural heritage reconstructions in the eastern Mediterranean, as shown in Figure 2.

Figure 2: Views of the Hadrian Temple in Ephesus, Turkey; panel b obtained from [31, 32].

Generating 3D–2D correspondences

The distance between the two sensors of the Kinect means that the same 3D point projects to different locations in the RGB and depth images. To compensate for this, the calibration parameters of both sensors must be estimated, or errors will ensue [33]. These parameters can be computed using the Kinect Calibration Toolbox [34], or obtained using a method [35] similar to stereo calibration [36]. The first method provides a semi-automatic way of calibrating the cameras: the user manually selects corners of a calibration target (a chess-board pattern) in the RGB image and the plane on which the target lies in the depth image; these selections are used as initial guesses for the calibration parameters over a set of RGB and depth images (~30 images), and a non-linear minimization then reduces the projection errors. This work uses the calibration parameters obtained by the second method [35], which again finds corners on the calibration target but performs the calibration only for the RGB camera; the depth camera is then calibrated by manually identifying corresponding corners in the depth image.
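For concreteness, the RGB-camera part of this calibration can be sketched with OpenCV's chess-board routines. This is only an illustrative sketch: the board dimensions, square size and file names are assumptions, and the manual depth-camera correspondence step described above is not reproduced.

```python
# Sketch of RGB-camera calibration from chess-board images (assumed filenames).
import glob
import cv2
import numpy as np

BOARD = (9, 6)          # inner corners of the assumed chess-board pattern
SQUARE = 0.025          # square size in metres (assumed)

# 3D corner positions on the board plane (z = 0)
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for fname in glob.glob("rgb_calib_*.png"):        # ~30 RGB images, as in the text
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# K_C and the distortion coefficients (k1, k2, p1, p2, k3) of the RGB sensor
rms, K_C, dist_C, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```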

The transformations (\(\mathbf {R}\) and \(\mathbf {T}\)) between the depth and RGB sensors of the Kinect are used to align the two sensors and are obtained as calibration results along with the intrinsic parameters (focal lengths and distortions) for both sensors. The camera matrices \(\mathbf {K_D}\) and \(\mathbf {K_C}\) for the depth and RGB sensors respectively are:

$$\begin{aligned} {\mathbf {K}}_{\mathbf {D}}=\left[ \begin{array}{ccc} f_{x_D} & 0 & c_{x_D} \\ 0 & f_{y_D} & c_{y_D} \\ 0 & 0 & 1 \end{array} \right] \end{aligned}$$
(1)
$$\begin{aligned} {\mathbf {K}}_{\mathbf {C}}=\left[ \begin{array}{ccc} f_{x_C} & 0 & c_{x_C} \\ 0 & f_{y_C} & c_{y_C} \\ 0 & 0 & 1 \end{array} \right] \end{aligned}$$
(2)

where \(f_x\) and \(f_y\) are the focal lengths in the x and y directions and \(c_x\) and \(c_y\) are the centre coordinates of the relevant sensor.

Figure 3 illustrates the process of back-projecting raw depth values into 3D coordinates and then projecting them onto the RGB image.

Figure 3: Back-projection of depth data and projection of 3D points; shaded (red) values in the depth data indicate unreliable raw disparities.

Given raw depth (d) values from the depth image captured by the Kinect, the world coordinates of a 3D point (\(\mathbf {p} \equiv (\mathbf {p}_x, \mathbf {p}_y, \mathbf {p}_z)^T\)) are calculated using

$$\begin{aligned} \begin{array}{l} \displaystyle \mathbf {p}_x=\left( x - c_{x_D}\right) \times \frac{d_m}{f_{x_D}}\\ \displaystyle \mathbf {p}_y=\left( y - c_{y_D}\right) \times \frac{d_m}{f_{y_D}}\\ \displaystyle \mathbf {p}_z=d_m \end{array} \end{aligned}$$
(3)

where x and y denote the pixel position in the depth data and \(d_m\) is the raw depth value (d) converted to metres [37], calculated as:

$$\begin{aligned} d_m=0.1236\times \tan \left( \frac{d}{2842.5}+1.1863\right) \end{aligned}$$
(4)

Unreliable values (those having the maximum raw disparity value of 2047 [29], as illustrated in Figure 3) are discarded and the remainder stored.
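A minimal sketch of this back-projection (Eqs. (3) and (4)) in Python/NumPy is given below; the function and variable names are illustrative, and the calibration values are assumed to come from the procedure described earlier.

```python
# Sketch of Eqs. (3)-(4): raw Kinect disparities to 3D points in the depth
# camera's frame; the value 2047 marks unreliable disparities and is discarded.
import numpy as np

def back_project(depth_raw, fx_d, fy_d, cx_d, cy_d):
    """depth_raw: HxW array of raw 11-bit Kinect disparity values."""
    h, w = depth_raw.shape
    x, y = np.meshgrid(np.arange(w), np.arange(h))

    valid = depth_raw < 2047                                # keep reliable values only
    d_m = 0.1236 * np.tan(depth_raw / 2842.5 + 1.1863)      # raw -> metres, Eq. (4)

    px = (x - cx_d) * d_m / fx_d                            # Eq. (3)
    py = (y - cy_d) * d_m / fy_d
    pz = d_m
    points = np.stack([px, py, pz], axis=-1)                # HxWx3

    # Return the valid 3D points together with their depth-image pixel positions
    return points[valid], np.stack([x[valid], y[valid]], axis=-1)
```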

Using the transformation parameters and the distortion parameters (which model the radial and tangential distortions arising from inaccuracies in the manufacturing process) for the Kinect sensor [24], the 3D points are projected onto the RGB image [36]. For a 3D point \(\mathbf {p}\), the transformation parameters (\(\mathbf {R}\) and \(\mathbf {T}\)) are first applied to \(\mathbf {p}\) to find \(\mathbf {p}'\equiv (p'_x, p'_y, p'_z)^T\):

$$\begin{aligned} \mathbf {p}' = \mathbf {R}\mathbf {p} + \mathbf {T} \end{aligned}$$
(5)

from which the normalized image coordinates (\(p'_x\), \(p'_y\)) are obtained by dividing by \(p'_z\):

$$\begin{aligned} \begin{array}{l} p'_x = \frac{p'_x}{p'_z}\\ p'_y = \frac{p'_y}{p'_z}\\ p'_z = \frac{p'_z}{p'_z}=1\\ \end{array} \end{aligned}$$
(6)

The radial (\(k_1\), \(k_2\), \(k_3\)) and tangential (\(p_1\), \(p_2\)) distortion parameters of the RGB camera are applied to get \(p''\equiv (p''_x, p''_y)^T\):

$$\begin{aligned} p''_x &= p'_x\left( 1+k_{1}r^2+k_{2}r^4+k_{3}r^6\right) \\ & \quad + 2p_1p'_xp'_y + p_2\left( r^2+2{p'_x}^2\right) \\ p''_y &= p'_y\left( 1+k_{1}r^2+k_{2}r^4+k_{3}r^6\right) \\ & \quad + p_1\left( r^2+2{p'_y}^2\right) +2p_{2}p'_{x}p'_y \end{aligned}$$
(7)

where \(r=\sqrt{{p'_x}^2+{p'_y}^2}\). Finally, the pixel locations are obtained using the camera matrix \(\mathbf {K_C}\):

$$\begin{aligned} p'''=\left[ \begin{array}{c} p'''_x \\ p'''_y \\ 1 \end{array} \right] =\mathbf {K_C} \left[ \begin{array}{c} p''_x\\ p''_y\\ 1 \end{array} \right] \end{aligned}$$
(8)

The 3D points resulting from this back-projection of the depth data and their corresponding 2D projections are stored in a data structure (Table 1) that allows the 3D position corresponding to a 2D pixel (\(p'''\)) to be obtained quickly.

Table 1 Data structure to store point data
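The projection of Eqs. (5)–(8) and the construction of the lookup structure can be sketched with OpenCV, whose cv2.projectPoints routine applies the same rigid transformation, distortion model and camera matrix. The dictionary-based lookup is an assumption standing in for the data structure of Table 1.

```python
# Sketch of Eqs. (5)-(8): project 3D points into the RGB image and store a
# pixel -> 3D point lookup (the role of Table 1).
import cv2
import numpy as np

def build_lookup(points_3d, R, T, K_C, dist_C, image_size):
    """points_3d: Nx3 points in the depth camera frame (e.g. from back_project)."""
    rvec, _ = cv2.Rodrigues(R)                       # rotation matrix -> vector
    proj, _ = cv2.projectPoints(points_3d.astype(np.float64),
                                rvec, T.reshape(3, 1), K_C, dist_C)
    proj = proj.reshape(-1, 2)

    w, h = image_size
    lookup = {}                                      # (u, v) -> 3D point
    for (u, v), p in zip(np.round(proj).astype(int), points_3d):
        if 0 <= u < w and 0 <= v < h:
            lookup[(u, v)] = p
    return lookup
```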

Finding the camera pose

The camera position must be found relative to the 3D points obtained from the Kinect sensor, and that is best performed by tracking reliable image features over several frames. The set of points used for tracking needs to be stable in order to achieve robust localization of the camera. To achieve this, the FAST feature detector [38] was used to find features in the RGB image. This detector produces features that are repeatable [39] and widely scattered across the image [40], important characteristics if the resulting homography matrix is to be accurate and the system is to operate at video rate. The key-points obtained from FAST were described using the BRIEF descriptor [41], which is also known to be robust and fast enough for video-rate operation. The binary structure of the BRIEF descriptor allows two descriptors to be compared using XOR and bit-count instructions, so matching is rapid to execute.
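An illustrative sketch of this feature stage follows. Note that BRIEF is provided by the opencv-contrib package (cv2.xfeatures2d); the FAST threshold is an assumption.

```python
# Sketch of the feature stage: FAST corners described with BRIEF and matched
# with Hamming distance (XOR + population count).
import cv2

fast = cv2.FastFeatureDetector_create(threshold=20)
brief = cv2.xfeatures2d.BriefDescriptorExtractor_create()
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def describe(gray):
    """Detect FAST key-points and compute their BRIEF descriptors."""
    kps = fast.detect(gray, None)
    kps, desc = brief.compute(gray, kps)
    return kps, desc

def match(desc_ref, desc_cur):
    """Match descriptors between the reference and current frame."""
    return matcher.match(desc_ref, desc_cur)
```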

After the initial set of features has been obtained, these points are matched against the features detected in subsequent frames, with outliers rejected using RANSAC [42]. 3D information for matched points is obtained from the data structure (Table 1) described in the previous section. There can be cases where a pixel position has no associated 3D datum, perhaps due to reflection of the infra-red beam or the disparity between the depth sensor and camera; when this happens, the closest 2D point in the data structure is used and the corresponding depth is calculated as the mean of depths within a \(21\times 21\)-pixel region.
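The outlier rejection and the depth fallback might be sketched as below; cv2.findHomography with RANSAC stands in for the outlier rejection, and the 21×21 fallback averages the 3D points found in the neighbourhood, which is one plausible reading of the description above.

```python
# Sketch of RANSAC-based match filtering and the 21x21 depth fallback.
import cv2
import numpy as np

def inlier_matches(kps_ref, kps_cur, matches):
    """Keep only matches consistent with a RANSAC-estimated homography."""
    src = np.float32([kps_ref[m.queryIdx].pt for m in matches])
    dst = np.float32([kps_cur[m.trainIdx].pt for m in matches])
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return [m for m, keep in zip(matches, mask.ravel()) if keep]

def point_3d_at(lookup, u, v, half=10):              # 21x21 neighbourhood
    """Return the 3D point behind pixel (u, v), averaging nearby entries if missing."""
    if (u, v) in lookup:
        return lookup[(u, v)]
    nearby = [lookup[(x, y)]
              for x in range(u - half, u + half + 1)
              for y in range(v - half, v + half + 1)
              if (x, y) in lookup]
    return np.mean(nearby, axis=0) if nearby else None
```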

Algorithms for calculating the position and orientation of the camera are well-established [43, 44], provided that the intrinsic parameters (focal length etc.) are known through calibration. Correspondences between 3D points and their 2D projections are used to recover the camera pose. For three correspondences, up to four possible solutions exist, whereas a unique solution can be found for six or more correspondences [45]. The corresponding 2D and 3D points stored in Table 1 are used to calculate the camera position using a recent PnP solution known as EPnP [30, 46] as the initial estimate.
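A sketch of this step using OpenCV's EPnP solver is shown below; the conversion of the solver's output to a camera centre is included because the filtering described next operates on the camera position.

```python
# Sketch of the pose step: EPnP applied to the stored 3D-2D correspondences.
import cv2
import numpy as np

def estimate_pose(points_3d, points_2d, K_C, dist_C):
    """points_3d: Nx3, points_2d: Nx2, with N >= 6 for a unique solution."""
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(points_3d, np.float64),
        np.asarray(points_2d, np.float64),
        K_C, dist_C, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)                       # rotation vector -> matrix
    cam_pos = -R.T @ tvec                            # camera centre in world frame
    return ok, R, tvec, cam_pos
```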

As discussed in [44], PnP solutions are easily affected by noise, and this manifests itself as an unpleasant jittering of the camera position and hence of the rendered imagery. To reduce this, the estimate of the camera pose is filtered. As will be shown below, a sliding window filter of size 15 reduces this effect but does not eliminate it, while a Kalman filter [47] was found to be more effective. The state of this filter comprises the x, y, z coordinates of the camera position and their velocities \(V_x\), \(V_y\) and \(V_z\). Measurements of (x, y, z) are obtained from the EPnP algorithm discussed above. The velocities were initialized to 0.5 units/frame for the three axes, reasonable for a user standing and observing ancient ruins (i.e., little or no motion). The transition matrix F is:

$$\begin{aligned} F=\left[ \begin{array}{cccccc} 1 & 0 & 0 & \Delta t & 0 & 0\\ 0 & 1 & 0 & 0 & \Delta t & 0\\ 0 & 0 & 1 & 0 & 0 & \Delta t\\ 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 \end{array} \right] \end{aligned}$$
(9)

At each frame, the transition matrix F is first applied to the current state as the prediction step (\(\Delta t\) is the time elapsed between two consecutive frames). The camera pose measurements calculated using EPnP are then used in the update step to refine this prediction, and the resulting state is used as the viewpoint for augmentation. It was found that this filter reduced jittering to the point where it was imperceptible.
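A minimal sketch of this filter with cv2.KalmanFilter is given below; the process and measurement noise covariances are assumptions, since the text does not state them.

```python
# Sketch of the constant-velocity Kalman filter of Eq. (9): 6-element state
# (x, y, z, Vx, Vy, Vz), 3-element measurement from EPnP.
import cv2
import numpy as np

def make_pose_filter(dt):
    kf = cv2.KalmanFilter(6, 3)
    F = np.eye(6, dtype=np.float32)
    F[0, 3] = F[1, 4] = F[2, 5] = dt                 # transition matrix of Eq. (9)
    kf.transitionMatrix = F
    kf.measurementMatrix = np.hstack(
        [np.eye(3), np.zeros((3, 3))]).astype(np.float32)
    kf.processNoiseCov = 1e-4 * np.eye(6, dtype=np.float32)     # assumed
    kf.measurementNoiseCov = 1e-2 * np.eye(3, dtype=np.float32) # assumed
    # Velocities initialized to 0.5 units/frame, as in the text
    kf.statePost = np.array([0, 0, 0, 0.5, 0.5, 0.5], np.float32).reshape(6, 1)
    return kf

def filter_pose(kf, cam_pos):
    kf.predict()                                     # apply F to the current state
    state = kf.correct(np.asarray(cam_pos, np.float32).reshape(3, 1))
    return state[:3].ravel()                         # smoothed camera position
```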

Finding objects for augmentation

After determining the camera pose, the next step is to find planar objects, to be augmented, in this work, by synthetic column models. Whether columns are cylindrical or fluted, their projection in the 2D image will be rectangular, and such shapes are relatively easy to detect. Visual noise in the image was first reduced by Gaussian smoothing, then the Canny edge detector [48] was applied to each of the colour channels independently. Contours were extracted from the edge images using an approximation method [49]; contours containing four vertices, with the angles between pairs of lines joining these vertices close to 90°, form reliable rectangles (Figure 4). Small rectangles and rectangles with aspect ratios inappropriate for columns were rejected.
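The detector might be sketched as follows; the blur kernel, Canny thresholds, area threshold and angle tolerance are assumptions, and the aspect-ratio test described above is omitted for brevity.

```python
# Sketch of rectangle detection: Gaussian blur, per-channel Canny edges,
# polygonal contour approximation and a near-90-degree corner check.
import cv2
import numpy as np

def corner_cosine(p0, p1, p2):
    """Absolute cosine of the angle at p1 formed with p0 and p2 (0 for 90 deg)."""
    a, b = np.float32(p0) - p1, np.float32(p2) - p1
    return abs(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def find_rectangles(frame, min_area=2000):
    rects = []
    blurred = cv2.GaussianBlur(frame, (5, 5), 0)
    for ch in cv2.split(blurred):                    # each colour channel
        edges = cv2.Canny(ch, 50, 150)
        contours, _ = cv2.findContours(edges, cv2.RETR_LIST,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
            if len(approx) == 4 and cv2.contourArea(approx) > min_area:
                pts = approx.reshape(4, 2)
                if max(corner_cosine(pts[i - 1], pts[i], pts[(i + 1) % 4])
                       for i in range(4)) < 0.3:     # all corners close to 90 deg
                    rects.append(pts)
    return rects
```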

Figure 4: Rectangular features (marked in black) located within an image.

The rectangles detected using the method described above tend to disappear and re-appear from frame to frame due to changes in lighting conditions. For this reason, the vertices of these rectangles were tracked using the Condensation algorithm [50], essentially a particle filter for visual tracking that supports a multi-modal estimate. For a rectangle centred at \(\mathbf {X}={(x_c, y_c)}^T\), the state is initialized from the rectangle detector and the particles are scattered within a 20-pixel radius of that point. The state \(\mathbf {X}\) was updated using two models, one dynamical and the other observational.

The dynamic model \(\mathbf {A}\) generates the predicted (hypothetical) state \(\hat{\mathbf {X}}\) and is initialized to the identity matrix for simplicity:

$$\begin{aligned} A=\left[ \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right] \end{aligned}$$
(10)

At each frame the predicted state is calculated for all particles (\(1 \ldots n\)):

$$\begin{aligned} \hat{\mathbf {X}}_n=\mathbf {A} \mathbf {X}_n \end{aligned}$$
(11)

Particle weights are updated using the observation model. Measurements (\(\mathbf {M} = (m_{c_x}, m_{c_y})\)) from the rectangle detection algorithm are used to refine the particle confidences using

$$\begin{aligned} w_{i_x} &= e^{\frac{-1}{2\sigma ^2_x}\left( m_{c_x}-p_{i_{x}}\right) ^2}\\ w_{i_y} &= e^{\frac{-1}{2\sigma ^2_y}\left( m_{c_y}-p_{i_{y}}\right) ^2}\\ w_i &= w_{i_x}\times w_{i_y}\\ \end{aligned}$$
(12)

where \(w_i\) denotes the weight of particle i and \(\sigma ^2_x\) and \(\sigma ^2_y\) are the variances of the samples for the x and y coordinates of the centre points. The hypothetical values of the rectangle centre are stored in \(p_{i_x}\) and \(p_{i_y}\). From (12), it can be seen that a Gaussian probability distribution is used for the update of the particle confidences.

This process is followed by re-sampling, in which the new confidences are used to create a new set of particles of the same size [36]. The number of particles for the algorithm was selected as \(N=50\), which was found to be sufficient (i.e., not causing particle deprivation [51]). At each frame, the particles are updated using the result of the rectangle detection algorithm described above, resulting in the centre of the rectangle being tracked robustly, as depicted in Figure 5. After the re-sampling update is performed, the particle with the largest weight is selected both for augmentation and for propagation into the next state. The 3D position of each rectangle’s centre is retrieved and the column models are rendered at these 3D positions.
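A compact sketch of this tracker is given below. The identity dynamic model of Eq. (10), the Gaussian weighting of Eq. (12) with sample variances, and weight-proportional resampling are implemented directly; the diffusion noise scale is an assumption.

```python
# Sketch of the Condensation-style tracker of Eqs. (10)-(12) for one rectangle centre.
import numpy as np

N = 50                                               # number of particles

def init_particles(centre, radius=20.0):
    """Scatter N particles within the given radius of the detected centre."""
    return np.asarray(centre, np.float64) + np.random.uniform(-radius, radius, (N, 2))

def track_step(particles, measurement, diffusion=3.0):
    # Predict: A is the identity, so only diffusion noise is added (assumed scale)
    particles = particles + np.random.normal(0.0, diffusion, particles.shape)

    # Weight each particle with the Gaussian model of Eq. (12),
    # using the sample variances of the particle x and y coordinates
    sx = particles[:, 0].std() + 1e-6
    sy = particles[:, 1].std() + 1e-6
    wx = np.exp(-(measurement[0] - particles[:, 0]) ** 2 / (2 * sx ** 2))
    wy = np.exp(-(measurement[1] - particles[:, 1]) ** 2 / (2 * sy ** 2))
    w = wx * wy
    total = w.sum()
    w = w / total if total > 0 else np.full(N, 1.0 / N)

    best = particles[np.argmax(w)]                   # used for augmentation

    # Resample N particles in proportion to their weights
    idx = np.random.choice(N, size=N, p=w)
    return particles[idx], best
```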

Figure 5: Tracking selected rectangles. Small circles show the estimates of their centres; the large circle is the particle with the largest weight.

The complete augmentation algorithm is given in Algorithm 1.


Augmenting participants

This section describes the use of the skeleton tracking features of Kinect to augment a participant with clothes from ancient times. The OpenNI library [15] processes depth information from the Kinect and detects body parts, identified as joints, as shown in Figure 6. In this work, three ‘joints’ are used: the head, torso and right hand, for superimposing a galea (Roman helmet), a toga and a sword respectively.

Figure 6: Skeleton joints identified by OpenNI.

As each joint is recognized by OpenNI, its position and orientation are returned. These transformations are absolute, not relative, and so can be used directly for rendering, with the exception that the rotation matrix \(R_M\) describing the orientation must be converted to a vector of rotation angles \(R\):

$$\begin{aligned} R=\left[ \begin{array}{l} R_y\\ R_x\\ R_z \end{array} \right] =\left[ \begin{array}{l} \sin ^{-1}\left(R_{M_{0,2}}\right)\\ \sin ^{-1}\left(\dfrac{-R_{M_{1,2}}}{\cos R_y}\right)\\ \sin ^{-1}\left(\dfrac{-R_{M_{0,1}}}{\cos R_y}\right) \end{array} \right] \end{aligned}$$
(13)
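A direct transcription of Eq. (13) in NumPy (using the same row/column indexing of \(R_M\)) might look as follows.

```python
# Sketch of Eq. (13): extract the three rotation angles from the 3x3 joint
# orientation matrix returned by the skeleton tracker.
import numpy as np

def rotation_angles(R_M):
    ry = np.arcsin(R_M[0, 2])
    rx = np.arcsin(-R_M[1, 2] / np.cos(ry))
    rz = np.arcsin(-R_M[0, 1] / np.cos(ry))
    return np.array([ry, rx, rz])                    # ordered (R_y, R_x, R_z) as in Eq. (13)
```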

When using optical see-through displays, images from the real environment are obtained automatically using a mirror in the display device; when the graphics are rendered, augmentation comes at no cost [52]. However, when using video see-through displays or a camera as the input source, the images of the real environment must also be rendered together with other graphics in order to generate the final image. The approach commonly adopted is to convert camera images to the texture format of the rendering engine (Irrlicht in our case); then, at each call to the function that draws the whole scene, the camera image is first rendered and then the computer models are superimposed on it.
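A minimal sketch of this draw order is given below; in the actual system the camera image is uploaded as an Irrlicht texture and the models are rendered by the engine, so the NumPy compositing here only illustrates the ordering of the background image and a rendered (RGBA) model layer, which is an assumed intermediate.

```python
# Sketch of video see-through composition: camera frame first, models on top.
import numpy as np

def compose_frame(camera_bgr, model_rgba):
    """camera_bgr: HxWx3 camera image; model_rgba: HxWx4 rendered overlay."""
    alpha = model_rgba[:, :, 3:4].astype(np.float32) / 255.0
    model = model_rgba[:, :, :3].astype(np.float32)
    out = (1.0 - alpha) * camera_bgr.astype(np.float32) + alpha * model
    return out.astype(np.uint8)
```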

Creating 3D models

The approach followed for creating the models used for AR applications is presented briefly in this section.

Modelling and optimization

The ancient columns and other models were created using 3D Max [53]. Buildings were modelled [8] in accordance with reconstruction images prepared by archaeologists [31, 32]. AutoCAD [53] was used to draw 2D profiles of buildings and other structures such as columns, as shown in Figure 7a. These profiles were then exported to 3D Max, and 3D objects were created using modifiers such as Bevel Profile, which sweeps the given profile around a boundary shape (square, circle, etc.) to produce a 3D object (Figure 7b).

Figure 7: Creating 3D models of the columns in front of the Hadrian temple by spinning a profile around a circle, based on the structure of the ruins and the reconstruction image in Figure 2.

3D models with a large number of vertices (and hence faces) reduce the frame rate of a real-time application. For this reason, the created models were optimized using the MultiRes modifier, which first computes the number of vertices and faces in a model and then allows the user to eliminate some of them manually. This method proved effective for 3D models consisting of thousands of faces.

Texture baking

Texture baking (also known as “render to texture”) [54] is the process of creating a single texture map from multiple materials that have been applied to a model. There are several ways to perform this; the following method was found to produce the best results.

To produce a single texture from several materials applied to a model, an Unwrap UVW modifier is first used to store the current material map, which defines how the texture wraps around an object with a complex structure. The next step is to unwrap all the individual faces of the model, as shown in Figure 8, and render the material information into the output texture.

Figure 8: Unwrapping the faces of a model for texture baking.

When rendering, a diffuse map was selected rather than the complete map (which includes maps for lighting and surface normals but failed to create the texture for the hidden side of the model). This single texture was then applied to the model after removing the previous material and, finally, the stored map was applied to restore the model’s original appearance before texture baking.

Results

This section first presents the models created, then a quantitative analysis of the camera pose estimation, and finally the results of the complete augmentation process, which runs at video rate (25 frames per second).

Models used in the experiments

The models created are shown in Figure 9. The optimization step reduced the number of faces by \(10{-}25\%\) across the models. In general, the more complex the model, the higher the percentage of faces that can be removed without a noticeable change in appearance; here, optimization removed the most faces from the galea model (Figure 9a) because of its roughly spherical shape (many faces are required for a smooth surface, so many can be removed), whereas the column (Figure 9e) could not accommodate the removal of many vertices without losing detail of the capital (top) or pediment (base) (Figure 7a).

Figure 9: Models created for augmenting participants.

Performance of the in situ augmentation algorithm

The errors in the initial estimation are shown in Figure 10. The relative error is calculated for the initial set of 3D–2D correspondences by re-projecting 3D points using the estimated translation and rotation for the camera, while the true re-projection error is calculated using ground truth, obtained from the knowledge that the camera is stationary. A mean difference of \(\simeq\)0.26 pixels was obtained between the true and estimated re-projection errors, showing that the estimation is reasonably accurate.
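The re-projection error used here can be computed as sketched below, projecting the 3D points with either the estimated or the ground-truth pose; the function name is illustrative.

```python
# Sketch of the re-projection error of Figure 10: mean pixel distance between
# projected 3D points and the measured 2D features.
import cv2
import numpy as np

def reprojection_error(points_3d, points_2d, rvec, tvec, K_C, dist_C):
    proj, _ = cv2.projectPoints(np.asarray(points_3d, np.float64),
                                rvec, tvec, K_C, dist_C)
    return np.mean(np.linalg.norm(proj.reshape(-1, 2) - points_2d, axis=1))
```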

Figure 10: Estimated (red) and true (blue) re-projection errors calculated with EPnP.

Figure 11: Rotational (blue) and translational (red) errors for the camera position estimate.

Again using the ground truth, the calculated errors in rotation and translation are given in Figure 11. (Calculating the rotation error involves converting the actual and estimated rotation matrices into quaternions and finding the distance between the two quaternions; for the translation error, the Euclidean distance is calculated between the actual and estimated translations.) Fluctuations are substantially higher for the translational error (\(\sigma _{t_{error}}=0.009\)) than for the rotational error (\(\sigma _{r_{error}}=0.0004\)). As alluded to above, the magnitude of this translational error makes the rendered augmentation unstable (‘jittering’).
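These two error measures might be computed as follows; SciPy's Rotation class is used here for the matrix-to-quaternion conversion, which is an implementation choice for the sketch rather than the paper's own.

```python
# Sketch of the error measures of Figure 11: quaternion distance for rotation,
# Euclidean distance for translation.
import numpy as np
from scipy.spatial.transform import Rotation

def pose_errors(R_true, t_true, R_est, t_est):
    q_true = Rotation.from_matrix(R_true).as_quat()
    q_est = Rotation.from_matrix(R_est).as_quat()
    # Quaternion distance, insensitive to the q / -q sign ambiguity
    rot_err = min(np.linalg.norm(q_true - q_est), np.linalg.norm(q_true + q_est))
    trans_err = np.linalg.norm(np.asarray(t_true) - np.asarray(t_est))
    return rot_err, trans_err
```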

Figure 12 shows the results from sliding window and Kalman filters, and it is clear that the Kalman filter produces more stable positions. The jitter is reduced from \(\bar{\sigma }_{raw}=0.43\) to \(\bar{\sigma }_{Kalman}=0.07\), where \(\bar{\sigma }_{raw}\) and \(\bar{\sigma }_{Kalman}\) are the standard deviations of the camera coordinates for the initial estimate and Kalman filtered results.

Figure 12: Camera coordinate estimates from the raw estimate (blue), sliding window filter (red) and Kalman filter (green).

Augmentation results

To illustrate the augmentation algorithms presented above, an application was developed to augment rectangular regions of a specific size and aspect ratio in the Kinect imagery. When the camera position and orientation had been found and the centres of suitable rectangles identified, the 3D column models described in the previous section were rendered in front of them. When two rectangles were found a particular distance apart, an arch could be placed above the corresponding columns, as shown in Figure 13.

Figure 13: Augmenting columns over rectangles.

The result of augmenting users is shown in Figure 14a for a single user and in Figure 14b for two users. The general effect is acceptable for the toga and sword, but some minor registration errors are apparent for the galea model in the case of a single user. These registration errors become more severe when multiple users are present; this appears to be because the accuracy of skeleton tracking decreases when a user is not centred in the field of view.

Figure 14: Augmenting participants.

Conclusions

It is clear that the Kinect has great potential in AR applications for cultural heritage. Different studies have shown that Kinect can be used as an imaging and depth sensor in order to scan and record archaeological objects [10]. It can also be used as an interaction device tracking the movements of users [11].

In this paper, we presented two novel uses of Kinect within the context of cultural heritage. The first contribution is the in situ AR application using the algorithm described in "In situ augmentation", which allows the camera pose to be estimated reasonably precisely and robustly from Kinect imagery and depth values. Planar, rectangular regions can be identified robustly through Condensation and augmented with features from historical buildings. The second contribution is the use of Kinect data to augment the appearance of humans, so that they can be clothed in a way that matches their surroundings.

The analysis and augmentation presented in this paper can be achieved in real time using a single computer equipped with a Kinect. The lack of any set-up phase (in contrast to [14]) and this speed of processing are important practical considerations for installations in museums that are to be used without supervision, or for use in educational games. Indeed, we believe that AR applications similar to the ones presented in this paper will improve the on-site learning experience [6] and provide people with an incentive to learn about their own and other people’s past, and to protect our historical artefacts and monuments as a memory of the past.

Endnote

a http://www.hitl.washington.edu/artoolkit.