2.1. Main System Setup
Figure 3 shows the components of the system. It consists of an optical-see-through AR headset Visette 45SXGA from Cybermind, a Prosilica CV 1280 camera, and an MTx inertia tracker from Xsens. A backpack contains the control box for the headset, LiPo batteries, and a Dell Inspiron 9400 laptop running Ubuntu, with video outputs for the left and right images. This hardware was selected to make the system wearable and at the same time powerful enough for many different applications. The Visette45 is the most affordable high-resolution (1280 × 1024) stereo OST HMD with an opening angle of .
The Prosilica FireWire camera was chosen for its high resolution, and the MTx is one of the most widely used inertia trackers available. We chose the Dell Inspiron laptop because it has enough processing and graphics power for our system and offers usable dual external display outputs, which is not common.
Note that Figure 2 shows a prototype AR headset that, in our project, was designed by Niels Mulder, a student of the Postgraduate Course Industrial Design of the Royal Academy of Art, on the basis of the Visette 45SXGA.
Virtual content is created off-line using Cinema-4D; its OpenGL output is rendered online on the laptop to generate the left-eye and right-eye images for the stereo headset. The user's current viewpoint for the rendering is taken from a pose prediction algorithm, also running online on the laptop, which is based on the fusion of data from the inertia tracker and the camera, looking at one or more markers in the image. When multiple markers are used, their absolute positions in the world are known. Note that markers with no fixed relation to the real world can also be used, for example, to represent moveable virtual objects such as furniture.
For interaction with virtual objects a 5DT data glove is used. A data glove with an RFID reader (not shown here) was made to make it possible to change or manipulate virtual objects when a tagged real object is touched.
2.2. Head Pose Tracking
The Xsens MTx inertia tracker contains three solid-state accelerometers to measure acceleration in three orthogonal directions, three solid-state gyroscopes to measure the angular velocity in three orthogonal directions, and three magnetic field sensors (magnetometers) that sense the earth's magnetic field in three orthogonal directions. The combination of magnetometers and accelerometers can be used to determine the absolute 3D orientation with respect to the earth. The inertia tracker makes it possible to follow changes in position and orientation with an update rate of 100 Hz. However, because we integrate the angular velocities to obtain angle changes and double integrate the accelerations to obtain position changes, inaccuracies in the sensors accumulate and the tracker can only track reliably for a short period. The position error can grow to 10 to 100 m within a minute. The largest contribution comes from errors in the orientation, which lead to an incorrect correction for the earth's gravitational pull. This should be corrected by the partial, absolute measurements of the magnetometers, as over short distances the earth's magnetic field is continuous; but this field is very weak and can be distorted by metallic objects nearby. Therefore, although the magnetic field can be used to help "anchor" the orientation to the real world, the systematic error can be large depending on the environment. We measured deviations of 50° near office tables. Hence, in addition to the magnetometers, other positioning systems with lower drift are necessary to correct the accumulating errors of the inertia tracker.
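To illustrate why orientation errors dominate the drift, the following minimal sketch assumes a stationary tracker whose orientation estimate is off by a small constant tilt, so that a residual g·sin(tilt) of gravity is left uncompensated and is double integrated at 100 Hz. The function name, the tilt values, and g = 9.81 m/s² are illustrative assumptions, not measurements of the MTx.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)

def drift_from_tilt(tilt_deg, duration_s, rate_hz=100):
    """Position error from double integrating the gravity residual
    that a constant tilt error leaves in the acceleration signal."""
    dt = 1.0 / rate_hz
    residual = G * math.sin(math.radians(tilt_deg))  # uncompensated acceleration
    velocity = 0.0
    position = 0.0
    for _ in range(int(round(duration_s * rate_hz))):
        position += velocity * dt + 0.5 * residual * dt * dt
        velocity += residual * dt
    return position

# Hypothetical tilt errors; the drift grows quadratically with time.
for tilt in (0.05, 0.1, 0.5):
    print(f"tilt {tilt:4.2f} deg: drift after 60 s = {drift_from_tilt(tilt, 60):6.1f} m")
```

Under these assumptions, even a tilt error of a tenth of a degree already produces a drift of tens of meters within a minute, which is why an absolute reference is needed.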
A useful technique for this is to use visual information acquired by video cameras. Visual markers are cheap to construct and easily mounted (and relocated) on walls, doors, and other objects. A marker has a set of easily detectable features such as corners or edges that enable recognition of the marker and provide positional information. Many different marker types exist, for example, circular or barcode-like. We chose a marker with a rectangular border so that the marker can be detected and localized easily, and we chose a 2D barcode because its identity is detectable even when the marker is very small (Figure 4).
If the marker is unique, then the detection of the marker itself already restricts the possible camera positions. From four coplanar points, the full 6D pose can be calculated with respect to the marker with an accuracy that depends on the distance to the marker and on the distance between the points. When several markers are seen at the same time and their geometric relation is known, our pose estimation uses all detected points to obtain a more precise estimate. In a demo situation with multiple markers, the marker positions are usually measured by hand.
Tracking is not restricted to markers; pictures, doorposts, lamps, or anything else that is visible could also be used. However, finding and tracking natural features, for example, using SIFT [23, 24], GLOH, or SURF, comes at the cost of long processing times (up to seconds, as we use images of 1280 × 1024 pixels), which is undesirable in AR because a human can turn his head very quickly. To give an impression: in case of a visual event in the peripheral area of the human retina, after a reaction time of about 130 ms in which the eye makes a saccade to that periphery, the head starts to rotate accelerating with to a rotational speed of to get the object of interest in the fovea. When the eye is tracking a slow moving object (smooth pursuit) the head rotates with about [27, 28].
Moreover, sets of natural features have to be found that later enable recognition from various positions and under various lighting conditions to provide position information. The biggest issue with natural features is that their 3D position is not known in advance and should be estimated using, for instance, known markers or odometry (Simultaneous Localization And Mapping [29, 30]). Hence, we think that accurate marker localization will remain crucial for a while in mobile immersive AR.
2.3. Required Pose Accuracy
The question arises what accuracy a tracking system should have if we want adequate alignment of virtual and real objects. For an eye with a visual acuity of about , looking through a head-mounted display at 10 cm distance with an opening angle of , we actually need a resolution of about pixels. As our HMD has pixels the maximum accuracy we can obtain is one pixel of our display, which translates to roughly or 0.5 mm at 1 meter distance of the eye. Hence, currently an AR user at rest will always perceive static misalignment due to the limitations of the HMD. Dynamically, we can present virtual objects on our HMD at a rate of 60 Hz. Assuming instantaneous head pose information from the pose measuring system, and assuming head movements in smooth pursuit we obtain a misalignment lag of . If we assume head motions as reaction on attention drawing, we obtain a temporary misalignment lag due to head movements of . Consequently, with the current headset technology the user will inevitably notice both static and dynamic misalignment due to head motion. Reasoning the other way around, the extra dynamic misalignment due to the current headset cannot be noticed (less than the static misalignment) if we rotate our head with less than . In conclusion, the target accuracies for our pose measurement system are based on the accuracies for the pose of virtual objects that can be realized by the current HMD, and we can distinguish three scenarios.
A static misalignment of 0.0, that is, a position misalignment of 0.05 cm of a virtual object at 1 m.
A dynamic misalignment of 0. when smoothly pursuing an object, that is, a temporal position error of cm of a virtual object at 1 m.
A dynamic misalignment of 2. when another event in the image draws the attention and the head rotates quickly, that is, a position error of 4.3 cm of virtual object at 1 m.
These are theoretical values. Given the flexible and versatile human vision system users might not find these errors disturbing. We address this in Section 3.
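The relation between angular misalignment, positional misalignment at a given distance, and display lag can be sketched as follows. The helper names are ours, and the head speed of 30°/s in the usage example is a hypothetical input chosen only for illustration; only the 60 Hz display rate and the offsets of 0.05 cm and 4.3 cm at 1 m are taken from the text.

```python
import math

def offset_cm(angle_deg, distance_m=1.0):
    """Lateral offset (cm) of a virtual object caused by an angular error."""
    return 100.0 * distance_m * math.tan(math.radians(angle_deg))

def angle_deg(offset_cm_value, distance_m=1.0):
    """Angular error (degrees) that corresponds to a lateral offset in cm."""
    return math.degrees(math.atan2(offset_cm_value / 100.0, distance_m))

def lag_deg(head_speed_deg_s, frame_rate_hz=60.0):
    """Head rotation (degrees) during one display frame."""
    return head_speed_deg_s / frame_rate_hz

# Offsets quoted in the text, converted back to angles at 1 m.
print(f"0.05 cm at 1 m: {angle_deg(0.05):.3f} deg")
print(f"4.3 cm at 1 m:  {angle_deg(4.3):.2f} deg")

# Hypothetical head speed, only to show how the lag is obtained.
print(f"lag at 30 deg/s and 60 Hz: {lag_deg(30.0):.2f} deg, "
      f"i.e. {offset_cm(lag_deg(30.0)):.2f} cm at 1 m")
```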
2.4. Camera-Only Tracking
Below we describe our methods to calculate the pose of a camera from an image of a known marker. Our aim was to use as few markers as possible, ultimately a single marker seen from quite a distance. Hence, we also use a lens with a very large opening angle of . We investigated the influence of image noise and parameters such as line thickness and marker size on the accuracy of the estimated pose. We used a rectangular pattern with a thick black border on a white field, containing a 2D barcode to identify the individual markers [7, 8] (see Figure 4). Figure 5 shows the real-time image processing steps that we use to track the pose of the camera with respect to a marker.
To minimize latency we need fast methods. Therefore, we first detect candidate markers (single closed contours) using a Canny edge detector, with a fixed threshold on the gradient to suppress noise from the imaging system. While following the edges in the Canny algorithm we keep track of connected edge points and count the number of points that are not part of a line (end-points, T crossings, etc.). Only contours with no special points (single closed contour) are interesting.
Then we search for corners only along these contours and keep contours with four corners. The corners are found by using a modified Haralick-Shapiro corner detector [31, 32]. As the gradients are high on the edge, we only need a threshold on their circularity measure and search for local maxima of that measure along the edge. After splitting the contour into four segments, we find the accurate location of the edge points, correct for lens distortions, and fit a line through each segment. The intersections of the lines give an unbiased location of the four corners needed for pose estimation. Other corner detectors such as [31–33] did not perform well, as they either need a large patch around the corner (which impairs speed and makes them less robust against other nearby edges) or have a bias in their estimate. To reach our unbiased estimate we had to correct the location of the edge points for lens distortion prior to fitting the lines.
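The last step, fitting a line through each contour segment and intersecting neighboring lines to obtain a corner, can be sketched as below. This is a minimal illustration that assumes the edge points are already subpixel accurate and corrected for lens distortion; it uses a total-least-squares fit, which is our choice for the example and not necessarily the fit used in the system, and all function names are hypothetical.

```python
import math

def fit_line(points):
    """Total-least-squares line through 2D points.
    Returns (nx, ny, d) with nx*x + ny*y = d and (nx, ny) a unit normal."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # direction of largest spread
    nx, ny = -math.sin(theta), math.cos(theta)      # normal to that direction
    return nx, ny, nx * cx + ny * cy

def intersect(line_a, line_b):
    """Intersection point of two lines given as (nx, ny, d)."""
    a1, b1, d1 = line_a
    a2, b2, d2 = line_b
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        raise ValueError("lines are parallel")
    return ((d1 * b2 - d2 * b1) / det, (a1 * d2 - a2 * d1) / det)

# Two synthetic edge segments: y = 0.5 x + 1 and the vertical line x = 4.
segment_a = [(float(x), 0.5 * x + 1.0) for x in range(-10, 3)]
segment_b = [(4.0, float(y)) for y in range(5, 18)]
corner = intersect(fit_line(segment_a), fit_line(segment_b))
print(corner)
```

Because the corner is computed from many edge points on both sides, it does not depend on the image content in the immediate neighborhood of the corner itself.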
Accurate edge-point locations are crucial to find accurate corner points; hence, we must eliminate systematic errors and noise as well as possible [34, 35]. Using the step-edge model (Gaussian blurred edge)
we can calculate the edge location accurately from three pixels centered on and perpendicular to the edge. To increase processing speed we evaluate three pixels along the horizontal or vertical direction, depending on which one is most perpendicular to the edge.
Where usually the gradient magnitudes are used to find the location as the top of a parabola, we use the logarithm of the gradients. This makes sure that the parabolic profile assumption is valid for sharp images as well, and an unbiased estimate for the edge location of our model edge is obtained. In an experiment with a linearly moving edge the bias in location was measured to be up to 0.03 px without the logarithm, and 0.01 px with the logarithm.
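The effect of the logarithm can be checked with a small numerical sketch. It assumes an ideal, noise-free Gaussian gradient profile sampled at three pixels; the function names and the test values (edge at 0.3 px, scale 0.8 px) are illustrative assumptions, so the resulting bias is not comparable with the measured values quoted above.

```python
import math

def parabola_peak(g_minus, g_zero, g_plus):
    """Offset of the vertex of the parabola through samples at -1, 0, +1."""
    denom = g_minus - 2.0 * g_zero + g_plus
    return 0.5 * (g_minus - g_plus) / denom

def edge_offset_linear(gradients):
    """Subpixel edge position from the gradient magnitudes themselves."""
    return parabola_peak(*gradients)

def edge_offset_log(gradients):
    """Subpixel edge position from the logarithm of the gradient magnitudes."""
    return parabola_peak(*(math.log(g) for g in gradients))

def gaussian_gradient(x, edge_pos, scale):
    """Gradient magnitude of a Gaussian-blurred step edge (unit height)."""
    return math.exp(-0.5 * ((x - edge_pos) / scale) ** 2)

true_pos, scale = 0.3, 0.8
samples = [gaussian_gradient(x, true_pos, scale) for x in (-1, 0, 1)]
print("linear estimate:", edge_offset_linear(samples))
print("log estimate:   ", edge_offset_log(samples))
```

Since the logarithm of a Gaussian is exactly a parabola, the log-based estimate recovers the edge position of this model edge without bias, whereas the plain parabola fit does not.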
We first investigated the influence of the thickness of the black border on our step-edge locator. We found that when the black border is thicker than 8 pixels in the image, the edge points on the outer contour of the border can be located with practically zero bias and an RMS error 0.01 pixel using integer Gaussian derivative operators with a scale of 1.0 px. We use integer approximations of the Gaussians because of their fast implementations using SIMD instructions. Using simpler derivatives, this bias will stay low even at a thickness of 3–5 pixels; however, this error is then symmetrically dependent on the subpixel location of the edge. If a large number of points is used for fitting a line through the edge points (usually 12–30 points are used), the bias error can be regarded as a zero-mean noise source, but for short edges the fit will have an offset. We tried several edge detectors/locators. In the presence of noise, the most accurate and robust detector used an integer Gaussian derivative filter and calculated the edge position from three gradient magnitude values taken not at neighboring pixels but at pixels two pixels apart, provided that the line thickness was large enough.
We used this detector but with three neighboring pixels, as we expect line thicknesses of about five pixels (markers at a distance of a few meters). The detector to use in other situations should be chosen on the basis of the expected line thickness and noise, which depend on, for example, marker distance, marker viewing angle, and illumination (indoor/outdoor) circumstances.
We then determined the size of the marker pattern that is needed when it should be detected at 5 m distance under an angle of . With a 5-pixel line thickness and leaving pixels for the black and white blocks, the minimum size of a marker is cm, fitting on A4. The bias per edge location will then be between 0.01 and 0.04 pixels, depending on the scale of the edge. When the camera is not moving, the scale is 0.8 pixels corresponding to a bias of 0.01 pixels. Because the edge location has only a small bias, the error of our algorithm is noise limited, and in the absence of noise, it is model limited.
We then verified our step-edge model and found that it fits the experimental data well. We still found a bias of around 0.004 pixel and an RMS error of around 0.004 pixel as well. We attribute this bias to the small error we still make in assuming a Gaussian point spread function for the imaging system. When the contrast-to-noise ratio (CNR) is around 26 dB, the standard deviation of the edge location is 0.1 pixel. This is also the residual error of the saddle points after a lens calibration.
When the CNR is higher, the biggest source of error in our experimental setup seems to be the (model of the) lens. In order to be able to use a pinhole camera model, we tried to calibrate all distortions away, but even with an elaborate lens distortion model we obtained a residual calibration error of 0.37 pixel maximum (standard deviation 0.1 pixel). We found an increased blurring at the borders of the image, suggesting lens artifacts. In photography, these artifacts are minimized using more elaborate lens systems. More research is needed to investigate how to further reduce this systematic error, with a better lens (model) as a starting point. Our lens distortion model is given by
with = ; and denote distorted/undistorted metric sensor plane coordinates. This model performs better in our case than the other models we tried [36–39]. The parameters were estimated using the Zhang calibration method .
We found that we can detect the contours of a marker robustly down to a CNR of 20 dB, so that we only need to worry about the detection of the four corners along these contours. The Haralick-Shapiro corner detector [31, 32] is the least sensitive to noise while it performs well along the Canny edge, and we found it can be used with CNR values higher than 20 dB. Along the edge we can reliably detect corners with an angle of less than . When the CNR is 25 dB, corners can be detected up to . Corner angles of and relate to marker pitch angles of and , respectively. To realize our target of detecting the marker up to pitch angles of , we need the CNR to be around 25 dB.
For online estimation of the pose from four corners we used a variation of the Zhang calibration algorithm; only the external parameters need to be estimated. Using static measurements to determine the accuracy of our pose estimation algorithm we determined that the position of a marker in camera coordinates is very accurate when the marker is on the optical axis at 5 m, that is, less than 0.5 mm in and , and less than 1 cm along the optical axis. The marker orientation accuracy, however, highly depends on that orientation. The angular error is less than ( due to noise) when the marker pitch is less than at 5 m. When we convert the marker pose in camera coordinates to the camera pose in marker coordinates, the stochastic orientation error results in an error in position of 2.7 cm/m. With a pitch larger than , the orientation accuracy is much better, that is, less than ( due to noise), resulting in a stochastic positional error of the camera of less than 0.9 cm/m. Hence, markers can best be viewed not frontally but under a camera pitch of at least .
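The conversion from the marker pose in camera coordinates to the camera pose in marker coordinates, and the way an orientation error turns into a position error that scales with distance, can be sketched as follows. The convention x_cam = R·x_marker + t, the function names, and the orientation error of 1° are assumptions made for this example only.

```python
import math

def rot_y(angle_deg):
    """Rotation matrix for a rotation about the y axis."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def camera_position_in_marker_frame(R, t):
    """Camera origin in marker coordinates, -R^T t, for a marker pose
    (R, t) given in camera coordinates (x_cam = R x_marker + t)."""
    return [-sum(R[r][c] * t[r] for r in range(3)) for c in range(3)]

distance_m = 5.0
t = [0.0, 0.0, distance_m]       # marker on the optical axis
true_cam = camera_position_in_marker_frame(rot_y(0.0), t)
noisy_cam = camera_position_in_marker_frame(rot_y(1.0), t)  # hypothetical 1 deg error

error_m = math.dist(true_cam, noisy_cam)
print(f"camera position error: {100.0 * error_m:.1f} cm "
      f"({100.0 * error_m / distance_m:.2f} cm per meter of distance)")
```

The marker position t is unchanged in this example; the entire camera position error comes from the orientation error, which acts as a lever over the distance to the marker.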
Finally, with this data, we can determine the range where virtual objects should be projected around a marker to achieve the required precision for our AR system. We found that with one marker of size cm (at 1.5 m–6 m from the camera), a virtual object should not be projected at more than 60 cm from that marker in the depth direction, or within 1 m from that marker in the lateral direction to achieve the target accuracy of error in the perceived virtual object position.
2.5. Camera Data Fused with Inertia Data
We need fast inertia data to keep up with fast head movements. However, cheap solid-state inertia trackers build up severe pose errors within a second. Consequently, these pose measurements should be corrected using the much slower but more stable camera pose data that is acquired by locking onto features of markers in the real world. We used an inertia tracker fixed onto a camera. Our sensor-fusing Kalman filter [40, 41] combines the absolute pose estimate from the camera with the data from the acceleration sensors, angular velocity sensors, and magnetic sensors to get a better estimate of the HMD pose. The Kalman filter is also necessary to interpolate the pose in between the slow pose estimates from the camera. Figure 6 shows the problem we encounter when we fuse pose data from the camera with pose data from the inertia tracker. The inertia pose data has a frequency of 100 Hz. The camera with image processing has an update rate of about 15 Hz. Note that the online viewpoint-based rendering also takes time. The Kalman filter with inertia tracker data can be used to predict the head pose at the precise moment at which we display the virtual objects on the headset, so that they appear precisely aligned.
From now on, we refer to the pose of the camera with respect to a marker at a certain point in time as its state. This state does not only include the position and orientation of the camera at that point in time, but also its velocity and angular velocity, and where necessary their derivatives. The error state is the estimation of the error that we make with respect to the true state of the camera.
Our fusion method takes latencies explicitly into account to obtain the most accurate estimate; other work assumes synchronized sensors [42, 43] or incorporates measurements only when they arrive, ignoring the ordering according to the time of measurement.
Our filter is event based, which means that we incorporate measurements when they arrive, but measurements might be incorporated multiple times as explained next.
We synchronize the camera data with the filter by rolling back the state updates to the point in time at which the camera has acquired its image. We then perform the state update using the camera pose data and use the stored subsequent inertia data again to obtain a better estimate of the head pose for the current point in time, and to predict the pose for a point in time in the near future, as we need to predict the pose of the moving head at the moment that the image of the virtual objects is projected onto the LCD displays of the headset. In this way, we not only get a better estimate for the current time, but also for all estimates after the time of measurement; this was crucial in our case as camera pose calculations could have a delay of up to 80 ms, which translates to 8 inertia measurements.
A Kalman filter can only contribute to a limited extent to the total accuracy of the pose estimates. The estimate can only be made more accurate when the filter model is accurate enough, that is, when the acceleration/angular speed is predictable and the inertia sensors are accurate enough. A bias in the sensors (for instance, caused by a systematic estimation error or an unknown delay in the time of measurement) will prevent the filter from giving a more accurate result than the camera alone. We minimized the errors introduced by the Kalman filter by using robust methods to represent the orientation and the time update of the orientation, and decreased the nonlinearity by using a nonadditive error-state Kalman filter in which the error state is combined with the real state using a nonlinear function (see the transfer of the orientation error in Figure 8). We used quaternions for a stable differentiable representation. To make the orientation model more linear, we used an indirect Kalman filter setup where the error states are estimated instead of the actual state. Due to this choice the error-state update is independent of the real state. Effectively we created an extended Kalman filter for the error state. If the error state is kept at zero rotation by transferring the error-state estimate to the real state estimate immediately after each measurement update, the linearization process for the extended Kalman filter becomes very simple and accurate. In addition, we convert all orientation measurements to error-quaternions: . This makes the measurement model linear (the state is also an error-quaternion) and stable in case of large errors, at the expense of a nonlinear calculation of the measurement and its noise.
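The conversion of an orientation measurement into an error-quaternion, and the transfer of that error back to the real state, can be sketched as follows. The multiplication order (measured = error ⊗ estimated) and the (w, x, y, z) component order are assumptions for this illustration; the filter itself may use a different convention, and all function names are hypothetical.

```python
import math

def q_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def q_conj(q):
    """Conjugate, which is the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def q_from_axis_angle(axis, angle_deg):
    """Unit quaternion for a rotation about a unit axis."""
    half = math.radians(angle_deg) / 2.0
    s = math.sin(half)
    return (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)

def error_quaternion(q_measured, q_estimated):
    """Assumed convention: q_measured = dq * q_estimated."""
    return q_mul(q_measured, q_conj(q_estimated))

def apply_error(dq, q_estimated):
    """Transfer the error to the real state; the error state is then identity."""
    return q_mul(dq, q_estimated)

def rotation_angle_deg(q):
    """Rotation angle represented by a unit quaternion."""
    vec_norm = math.sqrt(q[1] ** 2 + q[2] ** 2 + q[3] ** 2)
    return math.degrees(2.0 * math.atan2(vec_norm, q[0]))

z_axis = (0.0, 0.0, 1.0)
q_est = q_from_axis_angle(z_axis, 10.0)    # current orientation estimate
q_meas = q_from_axis_angle(z_axis, 12.0)   # orientation measured by the camera
dq = error_quaternion(q_meas, q_est)
print("error angle (deg):", rotation_angle_deg(dq))
print("corrected state:  ", apply_error(dq, q_est))
```

For small errors the error-quaternion stays close to the identity (1, 0, 0, 0), which is what keeps the linearization around zero error simple.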
In simulations we found that the position sensor accuracy has the largest influence on the total filter accuracy in absence of orientation errors. Changing the sampling rates or using more accurate acceleration measurements had less influence. We can argue that when the process noise in acceleration (or angular velocity for that matter) due to the user's motion is high compared to the measurement noise of the inertia sensors, it is of little use to filter the inertia sensor measurements, meaning that a computationally cheaper model can be used in which the inertia sensors are treated as an input during the time update.
Figure 7 shows the process models of the two Kalman filters as we implemented them. The orientation-error Kalman filter at the top estimates errors in orientation and errors in gyroscope bias. The position-error filter estimates errors in position, speed, and accelerometer bias. When gyroscope and accelerometer data are received (they are transmitted simultaneously by the inertia tracker), all real states are updated. In addition, both filters perform a prediction step using their respective process models. In our current setup, we immediately transfer predicted errors to the real states, so the error states will always be zero or, more precisely, they indicate zero error. With zero error input, the output of the prediction step will also be zero. However, the uncertainty of this zero error will increase due to the noisy measurements and the expected change in the acceleration and angular velocity. These expected changes should be provided by the application. In our demos we did not make special assumptions for the motions and used the same process noise values for all axes. For the position-error filter we could find a full solution for the process noise due to acceleration change and bias change. We could also find a full solution for the orientation-error filter's process noise. The resulting equation, however, was not practical for implementation. We therefore further assumed the angular velocity to be zero and used the result presented in the figure. The process noise values can be increased a bit to account for the error in this extra assumption, but in practice these values are determined experimentally anyway.
Figure 8 shows how position and orientation measurements are incorporated in the observation update steps. The camera measurements have a delay and in order to calculate the best estimate, we reorder all measurements by their measurement time. Therefore, when a camera measurement is received, both error-state filters and the states themselves are rolled back synchronously to the closest state to the time , the capture time of the image for the camera pose measurement. All measurements taken after time will now be processed again, ordered in time. This reprocessing starts at state . Gyroscope and accelerometer measurements are again processed using the process models, and they will advance the state . Position and orientation measurements will be used to update the a priori estimates at state to a posteriori estimates in the observation update steps of the Kalman filters. First, these measurements need to be transformed into error observations. We do this using the nonlinear transformations, and thereby circumvent the linearization step of the measurement model for better accuracy. Then, these error measurements are incorporated using the standard Kalman observation update equations. The resulting estimates of the errors are transferred to the separately maintained states of position, orientation, bias and so forth. Hence, all pose estimates up to the present time will benefit from this update.
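The rollback and reprocessing scheme can be illustrated with a strongly simplified one-dimensional sketch. It keeps a history of states and inertia samples, and when a delayed camera measurement arrives it restores the state at the capture time, applies the correction, and replays the stored inertia samples. The class name is hypothetical, the state is reduced to position and velocity, the rollback target is taken as the latest stored state at or before the capture time, and a fixed gain stands in for the Kalman observation update; none of this is the actual filter.

```python
from bisect import bisect_right

class RollbackTracker:
    """1D position/velocity tracker with delayed position measurements."""

    def __init__(self, t0, position, velocity, gain=1.0):
        self.gain = gain                      # fixed gain (assumption, not a Kalman gain)
        self.snapshots = [(t0, position, velocity)]
        self.inertia = []                     # stored (time, acceleration) samples

    @staticmethod
    def _step(state, t, accel):
        """Time update with the accelerometer sample treated as an input."""
        t_prev, p, v = state
        dt = t - t_prev
        return (t, p + v * dt + 0.5 * accel * dt * dt, v + accel * dt)

    def on_inertia(self, t, accel):
        self.inertia.append((t, accel))
        self.snapshots.append(self._step(self.snapshots[-1], t, accel))

    def on_camera(self, t_capture, measured_position):
        # Roll back to the latest stored state at or before the capture time.
        times = [s[0] for s in self.snapshots]
        k = max(bisect_right(times, t_capture) - 1, 0)
        t_k, p, v = self.snapshots[k]
        # Observation update at the capture time.
        p += self.gain * (measured_position - p)
        self.snapshots = self.snapshots[:k] + [(t_k, p, v)]
        # Replay all inertia samples measured after the capture time.
        for t, accel in [m for m in self.inertia if m[0] > t_k]:
            self.snapshots.append(self._step(self.snapshots[-1], t, accel))

    @property
    def position(self):
        return self.snapshots[-1][1]

# Head moves at 1 m/s; the initial position estimate is off by 0.5 m.
tracker = RollbackTracker(t0=0.0, position=0.5, velocity=1.0)
for i in range(1, 11):                  # inertia samples at 100 Hz, zero acceleration
    tracker.on_inertia(i / 100, 0.0)
print("before camera update:", tracker.position)
tracker.on_camera(2 / 100, 0.02)        # image captured at 20 ms, processed 80 ms later
print("after camera update: ", tracker.position)
```

Applying the delayed measurement directly to the newest state would pull the estimate back to where the head was at capture time; the rollback avoids this and lets every state after the capture time benefit from the correction.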
2.6. AR System Accuracies
Finally, we measured our complete tracking system: camera, inertia tracker and Kalman filter, using an industrial robot as controllable motion source and a marker at 3.2 m. The robot motions are shown in Figure 9. The positional accuracy of the system is shown in Figure 10. The values along the -axis were the most inaccurate. Without the filter to correct for the slow and delayed camera measurements, the positional error would be up to 20 cm depending on the speed of the robot (Figure 10(a)). With the filter, the accuracy is generally just as good as the accuracy of the camera measurements.
The camera pose shows a position dependent systematic error of up to 3 cm (Figure 10(b)). This proved to be due to a systematic error in the calculated orientation from the camera. When we correct for the orientation error, the positional error becomes less than 1 cm (Figure 10(c)). However, in normal situations the ground truth orientation will not be available. Using the orientation from the inertia tracker did not help in our experiments; the high accelerations are misinterpreted as orientation offsets, which introduces a systematic error in its output.
From our experiments we conclude that our data fusion does its task of interpolating the position in between camera measurements very well.
The tracking system has an update rate of 100 Hz. However, the pose estimates—albeit at 100 Hz—were less accurate than the estimates from the camera because of the high process noise (unknown jerk and angular acceleration from user movements).
We measured that the required orientation accuracy of 0. when moving slowly can be met only when the encountered systematic error in camera pose estimation is ignored: 1 cm at 3 m translates to . Since the camera is the only absolute position sensor, the encountered error of up to 4 cm () cannot be corrected by inertia tracker data.
Ways to diminish this static error are the following.
View markers under an angle 2. Viewing a marker straight on can introduce static pose errors in the range of . Markers should be placed such that the camera observes them mostly under an angle of greater than .
Use multiple markers, spread out over the image; this will average the pose errors.
Find ways to calibrate the lens better, especially at the corners.
Use a better lens with less distortion.
A systematic static angular error causes the acceleration measured by the inertia tracker to be corrected wrongly. This is also visible in static situations because of the acceleration due to gravity. For example with a error, the Kalman filter will first output an acceleration of cm/s2, which is slowly adjusted by the filter since the camera indicates that there is no acceleration. When the camera faces the marker again with a zero error, the wrongly estimated accelerometer bias now generates the same error but in the other direction, and hence this causes jitter in the pose of the virtual object. We found that the bias of the accelerometer itself is very stable. When the process noise for this bias is set very small, the bias will not suffer much from this systematic error. To counter a systematic orientation error it seems more appropriate to estimate a bias in the orientation. However, when the user rotates, other markers will come into view at another location in the image, with another bias. The only really effective solution is to minimize camera orientation errors. However, knowing that systematic errors occur, we can adapt our demos such that these errors are not disturbing, for instance by letting virtual objects fly. Of all errors, jitter is the most worrying. This jitter is due to noise in the camera image in bad illumination conditions and due to the wrong correction for the earth's gravitational field. Note that the first type of jitter also occurs in, for example, ARToolKit. Jitter in virtual objects draws the attention of the user, as the human eye cannot suppress saccades to moving objects.
Finally, to make a working optical-see-through AR system, many extra calibrations are needed, such as the poses of the sensors, displays, and the user's eyes, all of them crucial for accurate results. Most of these calibrations were done by hand, verifying a correct overlay of the virtual world with the real world.