Fast, robust, and accurate monocular peer-to-peer tracking for surgical navigation

Purpose: This work presents a new monocular peer-to-peer tracking concept overcoming the distinction between tracking tools and tracked tools for optical navigation systems. A marker model concept based on marker triplets combined with a fast and robust algorithm for assigning image feature points to the corresponding markers of the tracker is introduced. Also included is a new and fast algorithm for pose estimation.
Methods: A peer-to-peer tracker consists of seven markers, which can be tracked by other peers, and one camera which is used to track the position and orientation of other peers. The special marker layout enables a fast and robust algorithm for assigning image feature points to the correct markers. The iterative pose estimation algorithm is based on point-to-line matching with Lagrange–Newton optimization and does not rely on initial guesses. Uniformly distributed quaternions in 4D (the vertices of a hexacosichoron) are used as starting points and always provide the global minimum.
Results: Experiments have shown that the marker assignment algorithm robustly assigns image feature points to the correct markers even under challenging conditions. The pose estimation algorithm works fast, robustly and always finds the correct pose of the trackers. Image processing, marker assignment, and pose estimation for two trackers are handled in less than 18 ms on an Intel i7-6700 desktop computer at 3.4 GHz.
Conclusion: The new peer-to-peer tracking concept is a valuable approach to a decentralized navigation system that offers more freedom in the operating room while providing accurate, fast, and robust results.


Introduction
Optical navigation systems normally consist of one tracking unit and several tracked tools. While the tracking unit usually uses line scan cameras (e.g., Stryker FP 6000) or plane image sensors (e.g., NDI Polaris), the markers are either active [light-emitting diodes (LEDs)] or passive (retroreflective spheres, black-white targets). If at least three markers can be triangulated, the pose of the tracked tool is calculated using point-to-point matching. This concept is "centralized," and the tracking unit needs an unobstructed view to all tracked objects and therefore has to be placed at a significant distance away from the trackers. This can be problematic in an operating room, where the situs is surrounded by OR personnel. Furthermore, this centralized concept lacks redundancy: The transformation between two tracked tools depends on the unobstructed view to both of them.
In [15], we proposed the novel concept of peer-to-peer trackers which are tracking units and tracked tools at the same time. The tracking of other peers is realized using one camera in conjunction with a novel pose estimation algorithm. In this work, the underlying tracker layout concept is presented which is based on marker triplets together with a fast and robust marker assignment algorithm which assigns the markers found in the camera image to the correct markers of the tracked tools. Furthermore, a new iterative pose estimation algorithm not relying on initial guesses or previously tracked poses is introduced. It is shown that tracking with the proposed concept and the novel algorithms is fast, accurate, and robust even in the presence of accidental marker detections like reflections or unwanted light sources. The correct assignment of markers is of central importance since testing all possible combinations would take too much time.
In contrast to point-to-point matching with at least three point correspondences [1], monocular pose estimation needs at least four point-to-line correspondences (see, e.g., Oberkampf et al. [12]). While Oberkampf presents an iterative pose estimation algorithm for large distances between camera and object, this work focuses on small distances and high accuracy. Furthermore, Oberkampf assigns quality measures to each step of the iteration in order to get the best possible solution. Here, if more than one possible constellation for the same tracker type is found, the smallest residual error of an appropriate objective function is used to find the correct pose of the particular tracker type.
Tjaden et al. [17] use a tracker with seven non-planar LED markers in cross-shape for the marker assignment algorithm. The image features are correctly assigned by analyzing the two lines of the LED cross in the 2D image. Especially in case of highly distorting optics (e.g., fisheye lenses), this requires an image distortion correction. Moreover, Tjaden only introduces one tracker layout and uses a k-means approach to distinguish clearly separated clusters of features in the image. The presented algorithms are able to distinguish four different tracker types even if they strongly overlap.
Dornaika and Garcia [6] describe two pose estimation algorithms for weak perspective and paraperspective projection which both assume that the camera exhibits a unique projection center where all viewing rays pass through. The pose estimation algorithm described in "Pose estimation algorithm" section overcomes this restriction and works perfectly for arbitrary constellations. As stated in [12,14,19], most pose estimation algorithms show difficulties using planar targets that lead to pose ambiguities and can only be solved with time-consuming calculations. Solutions often lack the ability to truly support feature point assignment under occlusion or with many accidental feature points [17]. This work is robust against such influences.
The presented work was developed for surgical navigation purposes, but is definitely not restricted to this application area. In many different disciplines, methods are needed to determine the location and orientation of objects. These methods have to be fast, cheap, and still accurate enough for the specific fields, e.g., satellite navigation or the calculation of the relative transformation between various unmanned aerial vehicles such as quadrocopters [11]. In recent years, augmented reality applications relying on cheap solutions with fast and accurate transformations between real and virtual objects are on the rise [10].
Teixeira et al. [16] present a solution in which pose estimation with LEDs is used to determine the position and orientation of a flying quadrocopter. They also state that in the field of aerial robotics, a cheap solution is needed for pose estimation. Their achieved accuracies do not meet the requirements for surgical navigation. Faessler et al. [8] use a monocular setup and an iterative algorithm that uses the last detected pose as a starting value for the pose estimation of the next frame. This can result in subsequent errors if the last calculation was incorrect or too old.
The main focus of this work is a fast and robust marker assignment algorithm together with the underlying marker layout concept as well as a novel pose estimation algorithm which works without initial guesses. Figure 1 shows the actual design of the proposed peer-to-peer trackers with integrated cameras for pose estimation. The seven markers (LEDs) are arranged in a square configuration with four corners, one middle marker, and two markers ($L_2$ and $L_6$) tagging one of the two outer markers of the respective side triplet. Depending on the position of these tagging markers, four different types of trackers can be realized (see Fig. 1). All in all, each tracker type includes four collinear LED triplets: two side triplets with an asymmetric and two middle triplets with a symmetric LED layout. The cameras are calibrated with the pixel-wise and model-free calibration method introduced in [9]. In [15], it was shown that pose estimation with one camera is accurate enough for surgical navigation purposes.

Marker assignment algorithm
As stated in "Introduction" section, the fast and correct mapping of image feature points to the markers is crucial for pose estimation. These points are found using a weighted center blob detection which results in a set of N 2D pixel coordinates. Each blob center defines a straight line containing all world points that are projected on the specific pixel coordinates. These straight lines can be defined by two points marking the beginning (near point) and the end (far point) of the calibration area (see Fig. 2).

Marker triplets
Given that three markers are collinear, their corresponding viewing lines, and therefore their near and far points, are coplanar (see Fig. 2). All $\binom{N}{3}$ possible combinations of three blobs are investigated by calculating the regression plane through their six near and far points. If the sum of squared distances (SSD) of the points to the regression plane is below a defined threshold $t_{SSD}$, the triplet is stored. This results in a set of possible triplet candidates that are processed further on.

Fig. 1 The four possible peer-to-peer trackers with seven markers (LEDs) and one camera (left) and the underlying design concept (right)
Fig. 2 The three near and three far points of a marker triplet lie on a single plane
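The coplanarity test can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the threshold value passed as `t_ssd`, and the closed-form eigenvalue routine are assumptions made for this example. It relies on the fact that the SSD of a point set to its orthogonal regression plane equals the smallest eigenvalue of the 3×3 scatter matrix of the points.

```python
import itertools
import math


def smallest_eigenvalue_sym3(a):
    """Smallest eigenvalue of a symmetric 3x3 matrix (trigonometric closed form)."""
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    if p2 == 0.0:
        return q  # matrix is a multiple of the identity
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))
    phi = math.acos(r) / 3.0
    return q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)


def plane_fit_ssd(points):
    """Sum of squared orthogonal distances of 3D points to their regression plane."""
    n = len(points)
    c = [sum(pt[k] for pt in points) / n for k in range(3)]
    scatter = [[sum((pt[i] - c[i]) * (pt[j] - c[j]) for pt in points)
                for j in range(3)] for i in range(3)]
    return max(0.0, smallest_eigenvalue_sym3(scatter))


def triplet_candidates(near_points, far_points, t_ssd):
    """Return all blob index triples whose six near/far points are nearly coplanar.

    near_points[i] and far_points[i] span the viewing line of blob i.
    t_ssd is a hypothetical threshold; the paper does not state its value.
    """
    candidates = []
    for i, j, k in itertools.combinations(range(len(near_points)), 3):
        pts = [near_points[i], near_points[j], near_points[k],
               far_points[i], far_points[j], far_points[k]]
        if plane_fit_ssd(pts) < t_ssd:
            candidates.append((i, j, k))
    return candidates
```

With N blobs the loop visits N(N-1)(N-2)/6 triples, which is why the later steps try to discard as many candidates as possible before pose estimation.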

Marker tagging
The middle marker of a triplet can be used to distinguish between triplets with symmetric and asymmetric layout. In the latter case, the triplet's middle marker is placed at 20% or 80% of the distance between the two outer ones. The differentiation is realized by calculating and sorting the three angles ($\alpha_{min}$, $\alpha_{mid}$, $\alpha_{max}$) between the straight lines corresponding to one triplet. The two lines enclosing the largest angle $\alpha_{max}$ correspond to the two outer markers, and the tagged marker also contributes to $\alpha_{min}$. See Fig. 3 (left) for further details. If the triplet is observed from a distance significantly larger than the extension of the triplet, the ratio $f = \alpha_{mid} / \alpha_{min}$ is 1.0 for symmetric triplets and 4.0 for the chosen percentage of 20% or 80%. But the smaller this distance gets, the more this ratio differs from these values. Figure 3 (right) shows $f$ for a triplet extension of 80 mm and a viewing distance of 200 mm for all possible viewing angles. It ranges from 2.7 to 6.0 for asymmetric (green line) and from 0.7 to 1.5 for symmetric triplets (red line). Therefore, $f$ can be used to distinguish the two kinds of triplets: $f < 2.0$ for symmetric, $f > 2.0$ for asymmetric triplets. Triplets with $f > 6.0$ are rejected (the markers align coincidentally).
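The classification by angle ratio might look like the following sketch. The function names and the return format are assumptions for illustration; the thresholds 2.0 and 6.0 are the values given in the text. Viewing lines are represented here by direction vectors only, which is a simplification of the model-free near/far point representation.

```python
import math


def angle_between(u, v):
    """Angle in radians between two 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def classify_triplet(directions, f_split=2.0, f_reject=6.0):
    """Classify a triplet from the viewing directions of its three blobs.

    Returns (kind, f, middle_index, tagged_index); tagged_index is the outer
    marker next to the tagging marker and is None unless kind is "asymmetric".
    """
    pairs = sorted((angle_between(directions[i], directions[j]), (i, j))
                   for i, j in ((0, 1), (0, 2), (1, 2)))
    (a_min, min_pair), (a_mid, _), (_, max_pair) = pairs
    # The two lines enclosing the largest angle belong to the outer markers.
    middle = ({0, 1, 2} - set(max_pair)).pop()
    if a_min == 0.0:
        return "rejected", math.inf, middle, None
    f = a_mid / a_min
    if f > f_reject:
        return "rejected", f, middle, None
    if f < f_split:
        return "symmetric", f, middle, None
    # The tagged outer marker shares the smallest angle with the middle marker.
    tagged = min_pair[0] if min_pair[1] == middle else min_pair[1]
    return "asymmetric", f, middle, tagged
```

For an 80 mm triplet seen from 2000 mm with the middle marker at 20%, this yields a ratio close to the far-field value of 4.0 mentioned above.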
All in all, this part of the algorithm results in a number of M symmetric middle triplets and S asymmetric side triplets.

Combining triplets
Now, all $\binom{M}{2}$ combinations of two middle triplets are checked for a shared middle marker. If a pair was found, the blob IDs are stored as follows:
$p_{id}$ ... outer marker of triplet 1 with smaller blob ID
$q_{id}$ ... outer marker of triplet 2 with smaller blob ID
$r_{id}$ ... outer marker of triplet 1 with larger blob ID
$s_{id}$ ... outer marker of triplet 2 with larger blob ID
$m_{id}$ ... blob ID of shared middle marker
$p_{id}$, $q_{id}$, $r_{id}$, and $s_{id}$ (in this order) define the blob IDs of the corners of the square in clockwise or counterclockwise order. Now, the first side triplet is searched which connects $p_{id}$ and $q_{id}$, $q_{id}$ and $r_{id}$, $r_{id}$ and $s_{id}$, or $s_{id}$ and $p_{id}$. If found, a second side triplet is searched which shares one corner with the first side triplet. Let $u_{id}$ and $v_{id}$ be the outer marker blob IDs of this second side triplet. If the first triplet connects, e.g., $p_{id}$ and $q_{id}$, there are two possible cases for the second side triplet:
1. ($u_{id} = p_{id}$ and $v_{id} = s_{id}$) or ($v_{id} = p_{id}$ and $u_{id} = s_{id}$) (the second triplet connects $s_{id}$ and $p_{id}$). Then, corner $p_{id}$ is the shared marker and the involved blob IDs are stored in the following order: $p_{id}$ / tagging marker ID of side triplet 1 / $q_{id}$ / $r_{id}$ / $s_{id}$ / tagging marker ID of side triplet 2 / $m_{id}$
2. ($u_{id} = q_{id}$ and $v_{id} = r_{id}$) or ($v_{id} = q_{id}$ and $u_{id} = r_{id}$) (the second triplet connects $q_{id}$ and $r_{id}$). Then, corner $q_{id}$ is the shared marker and the involved blob IDs are stored in the following order: $q_{id}$ / tagging marker ID of side triplet 2 / $r_{id}$ / $s_{id}$ / $p_{id}$ / tagging marker ID of side triplet 1 / $m_{id}$
Figure 4 (left) illustrates the two described cases in counterclockwise order. Other cases are handled analogously.
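The first two steps of this combination scheme, pairing middle triplets and searching a side triplet along one edge of the square, can be sketched as below. The data layout (each triplet as a tuple `(outer_a, middle, outer_b)` of blob IDs, squares as dictionaries) and the function names are assumptions chosen for this example only.

```python
from itertools import combinations


def pair_middle_triplets(middle_triplets):
    """Combine symmetric middle triplets that share their middle marker.

    Each triplet is (outer_a, middle, outer_b). The corners p, q, r, s
    alternate between triplet 1 and triplet 2, so they run around the square.
    """
    squares = []
    for (a1, m1, b1), (a2, m2, b2) in combinations(middle_triplets, 2):
        if m1 != m2:
            continue
        squares.append({
            "p": min(a1, b1),  # outer marker of triplet 1 with smaller blob ID
            "q": min(a2, b2),  # outer marker of triplet 2 with smaller blob ID
            "r": max(a1, b1),  # outer marker of triplet 1 with larger blob ID
            "s": max(a2, b2),  # outer marker of triplet 2 with larger blob ID
            "m": m1,           # shared middle marker
        })
    return squares


def find_side_triplet(square, side_triplets):
    """Return (edge, triplet) for the first side triplet lying on a square edge."""
    edges = (("p", "q"), ("q", "r"), ("r", "s"), ("s", "p"))
    for triplet in side_triplets:
        outer_a, _tag, outer_b = triplet
        for edge in edges:
            if {outer_a, outer_b} == {square[edge[0]], square[edge[1]]}:
                return edge, triplet
    return None
```

Searching the second side triplet and storing the seven IDs in the order given in the two cases would follow the same pattern.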
The handedness of the found constellation is investigated by calculating the determinant of the viewing directions of the first, fifth, and third entry of the current square candidate. If this determinant is negative, the square IDs are stored clockwise and have to be resorted as follows: 7 The last step of the algorithm determines the type of the square tracker by investigating which outer markers are tagged by the tagging markers of the two side triplets (see

Pose estimation algorithm
Optimization function
T can be eliminated by solving $\partial f / \partial T = 0$ for T and inserting it back into (1). After parameterizing R with the elements of the corresponding unit quaternion $q^T = (q_0, q_1, q_2, q_3)$, the optimization function only depends on q and has the form (2), where $B_{ij}$ are symmetric 4×4 matrices and $k_{ij}$ are scalars depending on $p_i$, $a_i$, and $d_i$. A comprehensive derivation of (2) can be found in [15].

Optimization
While Olsson et al. [13] use a branch and bound algorithm for minimizing (2), unfortunately without presenting its calcu- where the gradient of $F(q, \lambda)$ has to be zero: This nonlinear equation system can be solved iteratively: The elements of the symmetric Hessian matrix $H_F(q, \lambda)$ are for $0 \le k, l \le 3$ where $\delta_{kl}$ is the Kronecker delta and $h_{ij} = B_{ij} \cdot q$. Furthermore, $H_{4k} = H_{k4} = 2q_k$ and $H_{44}$ equals zero.
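To illustrate the structure of a Lagrange-Newton iteration with a unit-norm constraint on q, the following sketch minimizes a stand-in objective. It is not the paper's objective function (2): the toy function $f(q) = q^T A q$, the Lagrangian $F(q, \lambda) = f(q) + \lambda (q^T q - 1)$, the initial value for $\lambda$, and all function names are assumptions for this example. The bordered Hessian has the same constraint entries as described above ($H_{4k} = H_{k4} = 2q_k$, $H_{44} = 0$).

```python
def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-14:
            raise ValueError("singular system")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x


def lagrange_newton(A, q, iterations=50, tol=1e-12):
    """Find a stationary point of q^T A q subject to q^T q = 1 (toy objective)."""
    q = list(q)
    Aq = [sum(A[k][l] * q[l] for l in range(4)) for k in range(4)]
    # Assumed initial multiplier: negative Rayleigh quotient of the start vector.
    lam = -sum(q[k] * Aq[k] for k in range(4)) / sum(c * c for c in q)
    for _ in range(iterations):
        Aq = [sum(A[k][l] * q[l] for l in range(4)) for k in range(4)]
        grad = [2.0 * Aq[k] + 2.0 * lam * q[k] for k in range(4)]
        grad.append(sum(c * c for c in q) - 1.0)
        if max(abs(g) for g in grad) < tol:
            break
        # Bordered Hessian: objective part plus 2*lam*delta_kl, constraint border 2*q_k.
        H = [[2.0 * A[k][l] + (2.0 * lam if k == l else 0.0) for l in range(4)]
             + [2.0 * q[k]] for k in range(4)]
        H.append([2.0 * q[k] for k in range(4)] + [0.0])
        step = solve_linear(H, [-g for g in grad])
        q = [q[k] + step[k] for k in range(4)]
        lam += step[4]
    return q, lam
```

A Newton iteration of this kind only converges to a nearby stationary point, which is why the choice of starting values matters.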

Starting values for the Lagrange-Newton iteration
The above described iterative optimization relies on suitable starting values for q. An inevitable demand for this work was The solution is using the vertices of the 600-cell or hexacosichoron, one of the six platonic solids in 4D and the equivalent of the icosahedron in 3D. The 120 vertices of the 600-cell are the 16 possible combinations of $(\pm\frac{1}{2}, \pm\frac{1}{2}, \pm\frac{1}{2}, \pm\frac{1}{2})$, the 8 permutations of $(\pm 1, 0, 0, 0)$, and the 96 even permutations of $(\pm\frac{\tau}{2}, \pm\frac{1}{2}, \pm\frac{1}{2\tau}, 0)$ with $\tau = (1 + \sqrt{5})/2$. Since q and −q define the same rotation, this results in 60 uniformly distributed quaternions.
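The vertex construction can be written down directly. The sketch below generates the 120 vertices and keeps one representative of each antipodal pair; the function names and the rule for picking the representative (first nonzero component positive) are assumptions for this example.

```python
import itertools
import math

TAU = (1.0 + math.sqrt(5.0)) / 2.0  # golden ratio


def is_even(perm):
    """True if the permutation (a tuple of indices) has an even number of inversions."""
    inversions = sum(1 for i in range(len(perm)) for j in range(i + 1, len(perm))
                     if perm[i] > perm[j])
    return inversions % 2 == 0


def cell600_vertices():
    """Return the 120 unit-length vertices of the 600-cell as 4-tuples."""
    verts = set()
    # 16 sign combinations of (1/2, 1/2, 1/2, 1/2)
    for v in itertools.product((-0.5, 0.5), repeat=4):
        verts.add(v)
    # 8 permutations of (+-1, 0, 0, 0)
    for axis in range(4):
        for sign in (-1.0, 1.0):
            v = [0.0, 0.0, 0.0, 0.0]
            v[axis] = sign
            verts.add(tuple(v))
    # 96 even permutations of (+-tau/2, +-1/2, +-1/(2 tau), 0)
    base = (TAU / 2.0, 0.5, 1.0 / (2.0 * TAU))
    for perm in itertools.permutations(range(4)):
        if not is_even(perm):
            continue
        for signs in itertools.product((-1.0, 1.0), repeat=3):
            signed = (signs[0] * base[0], signs[1] * base[1], signs[2] * base[2], 0.0)
            verts.add(tuple(signed[perm[i]] for i in range(4)))
    return sorted(verts)


def start_quaternions():
    """Keep one of each pair (q, -q): the one whose first nonzero entry is positive."""
    starts = []
    for v in cell600_vertices():
        first_nonzero = next(c for c in v if c != 0.0)
        if first_nonzero > 0.0:
            starts.append(v)
    return starts
```

Each of the 60 returned quaternions would then serve as one starting value for the iteration, and the result with the smallest objective value is kept.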
While the derivation of the optimization function (2) is state of the art, uniformly distributing starting rotations as described above has, to the best of the authors' knowledge, not been presented in the literature before. All experiments have shown that the global minimum is always found if these 60 quaternions are used as starting values.

Results
In a MATLAB simulation, all four possible tracker types (28 LEDs) together with 4 extra LEDs (possible reflections or other accidental light sources) were randomly placed in front of a virtual camera 100,000 times and analyzed as described in "Marker assignment algorithm" section. Figure 6 (left) shows a typical camera image of the 32 LEDs, and Fig. 6 (right) shows the same scene after analyzing the possible tracker constellations. The simulation was performed without preventing the trackers from overlapping and with the following parameters: 64 mm tracker side length; 150-200 mm distance range of tracker center to projection center; 140 mm maximum distance between tracker center and camera's principal axis; 85° maximum angle between tracker normal and principal axis.
In 100% of the trials, all four trackers were found. With a relative frequency of 71.6%, only the four trackers and no other valid candidates were found. Furthermore, with a relative frequency of 21.2%, the actual four trackers and one more possible candidate were found. Further details can be found in Fig. 7. Samples with 7 or more possible candidates only occur if the four randomly placed trackers strongly overlap or if one or more of the four extra LEDs are very close to one of the tracker LEDs, which is rather unlikely in real scenes (see Fig. 8).
Note 1: The fact that the algorithm finds more possible candidates than present in the image does not mean that the correct poses of the trackers cannot be found. The marker assignment algorithm has to heavily reduce the number of possible candidates before the pose estimation algorithm optimizes the objective function (2) of "Pose estimation algorithm" section. The correct poses correspond to the smallest residual objective function values for the particular tracker types.
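The selection rule of Note 1 amounts to keeping, per tracker type, the candidate with the smallest residual. A minimal sketch, assuming each candidate is a hypothetical tuple `(tracker_type, pose, residual)` produced by the pose estimation step:

```python
def best_pose_per_type(candidates):
    """Keep the candidate with the smallest residual for every tracker type.

    candidates: iterable of (tracker_type, pose, residual) tuples.
    Returns a dict mapping tracker_type to (pose, residual).
    """
    best = {}
    for tracker_type, pose, residual in candidates:
        if tracker_type not in best or residual < best[tracker_type][1]:
            best[tracker_type] = (pose, residual)
    return best
```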
Note 2: State-of-the-art surgical navigation systems (e.g., NDI Polaris, Stryker FP 6000) triangulate marker positions and calculate transformations by means of point-to-point matching for which at least three markers are needed. If one of these three markers is occluded, the calculation fails. While the minimum number of markers for monocular tracking (point-to-line matching) is four, the presented algorithms rely on the visibility of seven markers in order to achieve a robust and fast marker assignment which is essential for monocular tracking. The transformation can only be calculated if all seven LEDs are visible.
All subsequent tests were performed under real conditions. The first test scenario included a calibrated Ximea MQ013MG-E2 with all four tracker types simultaneously placed at distances ranging from 200 to 350 mm. A typical result after marker assignment can be found in Fig. 9. The underlying software was written in C++ using the "Armadillo library for linear algebra & scientific computing" and executed on an Intel i7-6700 at 3.4 GHz. The mean calculation time for one pose estimation with 60 uniformly distributed start rotations was 5.5 ms per tracker. Therefore, the overall calculation time is below 18 ms for a common use case of two trackers tracked by another peer (6.5 ms for marker assignment, two times 5.5 ms for pose estimation).

Accuracy tests
In order to assess the accuracy of the proposed peer-to-peer tracking concept, three different tests were conducted. They each utilize the same two cameras (The Imaging Source DMM 37UX273-ML), the same camera calibrations, as well as the same tool center point (TCP) distance of 150 mm.

Stereo tracking versus monocular tracking
A stereo camera system is used to calculate the pose of a peer-to-peer tracker utilizing both cameras simultaneously (triangulation and point-to-point matching) and each camera separately (point-to-line matching). This results in three transformations from tracker to camera coordinates. In the first test, the tracker is oriented frontally, at 45° in the second test, and at arbitrary angles in the third. Table 2 shows the measured mean deviations for 100 test positions each. In case of LED deviations, the triangulated LED positions are taken as ground truth and the deviations of the transformed LED positions are used to calculate the mean deviations. For the tool center points located 150 mm away from the tracker center, the transformation resulting from triangulation was used to calculate the ground truth positions, which are compared to those resulting from point-to-line matching. The results show that the achieved accuracy is practically independent of the viewing angle.
The same test is performed using a linear guide rail together with two rotary joints which are used to automatically control the distance as well as the pitch and yaw angle of the peer-to-peer tracker with respect to the cameras. Again, the stereo transformation is taken as ground truth and the TCP deviations are calculated for more than 2500 positions. Figure 10 shows these deviations for the resulting distances and viewing angles. Please note that the cameras are only calibrated up to a maximum distance of 350 mm which explains higher deviations beyond this limit.

Peer-to-peer tracking versus Stryker FP 6000
In the following two tests, the accuracy of the presented peer-to-peer tracking concept is compared to the accuracy of Stryker's surgical navigation camera FP 6000. For the first test, a bearing ball is rigidly attached 150 mm away from the center of a Stryker universal tracker which itself is mounted back to back on a peer-to-peer tracker. Now, the bearing ball is pivoted inside its matching counterpart and for each of the two trackers, 1000 different poses are recorded. Finally, the pivot points are optimized in both coordinate systems and the resulting deviations to the pivot points are calculated for all transformations. Table 3 shows the mean, standard, and maximum deviations for both navigation systems. The results clearly show that the presented peer-to-peer tracking is nearly as accurate as Stryker's state-of-the-art navigation camera and definitely accurate enough for surgical navigation.
For the second test, Stryker's universal tracker is again mounted back to back on a peer-to-peer tracker and the two navigation cameras are rigidly attached to the same table such that their relative transformation stays constant (see Fig. 11). Now, 100 arbitrary pairs of corresponding poses $F_{CT,i}$ and $F_{SU,i}$ are recorded and used to optimize the static transformations $F_{UT}$ and $F_{CS}$. Afterward, tool center points $p^T$ (still 150 mm away from the tracker center) defined constantly in peer-to-peer tracker coordinates T are transformed to navigation camera coordinates C directly using $F_{CT,i}$ and indirectly using the Stryker transformation as ground truth:
$$p^C_{i,P2P} = F_{CT,i} \, p^T, \qquad p^C_{i,Stryker} = F_{CS} \, F_{SU,i} \, F_{UT} \, p^T$$
Statistically analyzing the distances $d_i = |p^C_{i,P2P} - p^C_{i,Stryker}|$ results in a mean deviation of 0.50 mm and a root mean square of 0.57 mm. All in all, it turns out that the accuracy of the presented peer-to-peer tracking concept is definitely comparable to the accuracy of a state-of-the-art surgical navigation system.

Peer-to-peer tracking versus ground truth
A thorough comparison of Stryker's FP 6000 (used in "Peer-to-peer tracking versus Stryker FP 6000" section) against a coordinate measurement machine can be found in [7]. Elfring et al. conclude that the Stryker camera exhibits best-in-class accuracy with a trueness of 0.07 mm.
In order to compare the peer-to-peer tracking concept against a ground truth, two linear guide rails are used to position the pivot mold of the test described in "Peer-to-peer tracking versus Stryker FP 6000" section with an absolute accuracy of 0.02 mm at rasterized grid positions with a grid spacing of 37.5 mm in x-direction and 40 mm in z-direction. For 40 grid positions, 500 transformations of the pivoted peer-to-peer tracker are recorded. This results in 40 pivot points which are matched to the exact grid positions using point-to-point matching. Figure 12 shows the pivot-to-grid point deviations after matching for all 40 positions. The mean deviation equals 0.14 mm and the maximum deviation is 0.31 mm.

Conclusion and discussion
In this work and in [15], a new peer-to-peer tracking concept was presented which overcomes the separation between tracking tools and tracked tools. It was shown that one camera per tracker provides sufficient accuracy for surgical navigation. Furthermore, novel algorithms for pose estimation and marker assignment as well as a whole new tracker layout and coding concept based on marker triplets were introduced. Simulations and real experiments showed that the overall concept works fast, accurately, and robustly.
The promising results shown in "Results" section suggest that the presented approach is a feasible alternative to the current state-of-the-art surgical navigation and has the potential to entail new applications that are not possible using current technology. For example, peer-to-peer tracking opens up the possibility to build tracking chains (one peer tracking the next) and to navigate "around the corner." Furthermore, the presented trackers only consist of a housing, seven LEDs, and a low-cost camera and can therefore be realized as disposable items, which in turn means that they do not have to be sterilizable.
The presented approach also implies advantages with respect to the emerging field of glasses-based augmented reality in the operating theater: Instead of trying to get the tracked see-through glasses (worn by the surgeon) and the tracked tools into the working volume of a conventional surgical navigation system (which is nearly impossible), the glasses and tools only have to be equipped with one of the proposed peer-to-peer trackers in order to realize precise overlays. Corresponding use scenarios can be found in [2,5]. A similar approach for augmented reality was presented by Vogt et al. [18].
Last but not least, although the concept was developed for surgical navigation, it is also suitable for many other fields where cheap, fast, and accurate methods of tracking are of special interest, e.g., unmanned aerial vehicles or self-organizing robot swarms [3,4]. Future work will concentrate on parallelizing and optimizing the algorithms to ensure an update rate of at least 100 Hz. An achievable goal is reducing the processing time for marker assignment from 7 ms to 5 ms and parallelizing pose estimation such that the poses of all trackers can be calculated in 5 ms.