Background

The insertion of pedicle screws is a critical step in spine fixation surgery. Conventional open surgery is performed through a mid-line incision exposing the posterior aspect of the spine. However, there is a trend toward increased use of minimally invasive surgical techniques, owing to reductions in blood loss, length of hospital stay, and surgical site infections [1]. Minimally invasive surgery (MIS) is performed through small skin incisions, and the vertebrae are reached using tubular retractors [2]. Because of the reduced visibility during MIS procedures, intraoperative imaging such as fluoroscopy is frequently used. However, to reduce radiation exposure and increase accuracy, a number of computer-assisted navigation solutions have been devised [3,4,5,6]. Clinical studies have shown that intraoperative three-dimensional (3D) imaging coupled to a navigation system leads to higher accuracy than competing technologies [7].

All navigation technologies require co-registration of the patient and the pre- or intraoperative images, so that both the patient and the surgical instruments can be tracked relative to the medical images. Conventional navigation solutions typically include infra-red camera systems tracking a dynamic reference frame attached to a vertebra [8]. Efforts have been made to design patient tracking methods based on unobtrusive markers or no markers at all. One such system, using non-invasive optical markers, has been described by Malham et al. [9, 10] (SpineMask, Stryker, Kalamazoo, Michigan, USA) and enables high-accuracy placement of minimally invasive lumbar pedicle screws. Markerless tracking solutions have been used experimentally on phantoms in other surgical fields, but studies on spine surgery are lacking [11, 12]. A robotic system using structured light to track the bony anatomy and perform pedicle screw placement was recently presented; the device was validated on cervical vertebra phantoms, reaching a mean positional error of 0.28 ± 0.16 mm [13]. The Microsoft HoloLens uses surface matching for tracking and has been used experimentally on non-medical phantoms with an accuracy ranging from 9 to 45 mm [14], while in a spine phantom study an accuracy of roughly 5 mm was achieved [15].

The navigation technology used in this study is an augmented-reality surgical navigation (ARSN) system relying on adhesive optical skin markers for motion tracking and compensation [16, 17]. Four high-resolution optical cameras are integrated in the flat detector of a C-arm with cone-beam computed tomography (CBCT) capability. The markers are recognized by the cameras, and their relative positions in space are used to create a virtual reference grid, which is co-registered to the patient during CBCT acquisition. Optical markers attached to the patient's skin have been used for respiratory motion tracking [18] and for medical imaging applications [19]. The use of optical markers for motion tracking can be combined with digital image correlation and tracking techniques [20,21,22,23]. Recently, Xue et al. [24] demonstrated that ink dots on the skin could be video-tracked with high precision and that the post-processing retrieved more detailed information than marker-based methods. Similarly, direct tracking of spine features and tracking of skin features using hyperspectral cameras have recently been proven feasible for spine surgery.

In this study, a new markerless tracking technology based on skin feature detection with grey-scale video cameras was evaluated. Image analysis techniques were applied to detect and track natural features of the skin. Refraining from optical markers for motion tracking offers several advantages. First, the workflow of the procedure can be improved by simplifying the protocol for patient preparation and by increasing the reliability of tracking during the surgical procedure. Second, the risk of losing sight of the markers is eliminated when skin features are used as a reference. The well-known feature detection algorithms Maximally Stable Extremal Regions (MSER) and Speeded Up Robust Features (SURF) were applied to detect and extract skin features such as moles and pigment spots. These methods were chosen because they offer good reproducibility across different image views, being invariant to rotation, scaling, and affine transformation [25,26,27,28]. The proposed 3D-localization framework used multi-view geometry principles to perform image rectification, enhance feature detectability, improve feature matching, and calculate and assess each triangulated feature. The sum of squared differences (SSD) was used as a feature-matching metric along scan lines between multiple-view acquisitions. To remove 3D outliers, a second feature selection step was applied, based on the z-coordinate mismatch after triangulation of all pairs of matched features. The final inliers were evaluated by computing the overall mean triangulation error. In summary, the contribution of this paper is an alternative to marker-based tracking. We hypothesized that the camera feed provides enough detail for skin feature detection and tracking, and the achieved sub-millimeter localization accuracy proved sufficient for surgical navigation. The framework included 3D reconstruction and feature localization over multiple-view acquisitions and was validated on eight clinical spinal surgery cases performed in an academic tertiary medical center.

This paper concentrates on the application of skin feature detection techniques to achieve accurate markerless tracking in spinal surgery. In the development of tracking systems, feature detectors and descriptors are widely investigated, since they account for the largest share of the processing time. The former depend on the available image information, while the latter define the encoding [29]. Aspects such as robustness to image transformations and the handling of mismatched features need to be evaluated, as they can potentially affect tracking [30].

In this paper, we evaluate different feature detectors and extractors (SURF and MSER) by studying the number of inliers and the overall localization error on multi-view images from several spine surgery patients acquired under different illumination conditions. In addition, to improve tracking stability by eliminating mismatched multi-view keypoints and to improve image matching, we propose a 3D outlier removal step that constrains the matching to keypoints lying on the same epipolar line. The 3D triangulations were obtained only from the matches satisfying the epipolar constraint. The contributions are: (1) building a computer vision framework for preprocessing optical skin images and for detecting and matching local invariant image regions across two different image views; (2) assessing which feature detection approach is the most stable in terms of reliability and accuracy; (3) improving the image matching by introducing a 3D epipolar constraint; (4) validating the methodology on optical images acquired in eight patients to assess the 3D-localization error of the matched features.

Results

Optical data for assessing the 3D localization were collected from two sources: a cadaver study and a prospective clinical observational study. The cadaver study was performed according to all applicable laws and directives. The clinical study was approved by the local ethics committee, and all enrolled patients signed informed consent. The data that support the findings of this study were generated by Philips Electronics B.V., Best, The Netherlands and Karolinska University Hospital, Stockholm, Sweden. All images of the datasets were acquired at the same UHD resolution of 2592 × 1920 pixels. The first dataset consisted of one multi-view acquisition (four images) of a cadaver, to which MSER and SURF were applied in several selected regions in a first multi-view experiment for skin localization. The second dataset consisted of image data from eight patients included in a spine navigation study, acquired during the surgical procedures. These data were used to perform two different experiments: first a skin feature localization and later an optical marker localization with ground-truth comparison. Table 1 reports the total number of analyzed frames during the acquisitions and the corresponding acquisition times. All patients were classified by the physicians as Fitzpatrick skin type I, II, or III.

The first feasibility study was performed by analyzing the skin of the cadaver. The localization system was applied to four selected flat regions using the c1|c3 camera pair. The features detected for this dataset were triangulated with a mean triangulation error of 0.239 mm for MSER and 0.218 mm for SURF. By design, the MSER algorithm detects multiple blobs located at the same coordinates, which explains why MSER appears able to detect more features than SURF. The discarded-feature ratio of matched features to selected inliers was 3.96 ± 0.80 for MSER and 2.93 ± 0.45 for SURF.

The clinical dataset involved patients undergoing open surgical procedures via mid-line incisions along the spine. For some patients (the 2nd, 4th, and 7th), in whom the skin was partially covered by blood, several planar regions were carefully selected. The performance of the localization framework for all eight patients, including descriptive statistics of the triangulation error for each detection method, is reported in Tables 2 and 3. Figure 1 shows two examples of MSER and SURF feature detection and corresponding matches at the same location for an image pair.

Table 1 Patient data acquisitions
Table 2 Triangulation error analysis—MSER
Table 3 Triangulation error analysis—SURF
Fig. 1
figure 1

Two examples of MSER and SURF feature detections (a, b) and corresponding matches (c, d) at the same location for an image pair

For a total of 4934 (MSER) and 1727 (SURF) features, mean triangulation errors of 0.207 mm and 0.204 mm were reached, respectively. An important observation was that 75% of the detected features had a triangulation error within 0.3 mm (Fig. 2), which is appropriate for spinal surgery applications. The discarded-feature ratio of matched features to selected inliers in this case was 3.73 ± 2.69 for MSER and 2.61 ± 1.70 for SURF. The median errors show similar variability in the triangulation error for SURF and MSER (Fig. 3a, b). The variability and the outliers may be caused by lighting differences or limited visibility of the skin area [31]. The triangulation error for each individual case using SURF and MSER is depicted in Fig. 4.

Fig. 2
figure 2

Triangulation error distribution (MSER and SURF)—patient data

Fig. 3
figure 3

Boxplots of the mean triangulation error in eight patients. The red line represents the median. Upper and lower box limits depict the 75th and 25th percentiles, respectively. The minimum and maximum values are visualized with whiskers, and the outliers are shown with a '+' symbol

Fig. 4
figure 4

Bar plot visualizing the triangulation error (MSER and SURF)

Two-sample t tests were used to assess differences between MSER and SURF. A p value of less than 0.05 was considered statistically significant. No statistically significant difference between the two methods was found (p > 0.05). A two-sample t test was also performed to assess the accuracy of the markerless approach, which was found to be superior (p < 0.05) to that of the ground truth (marker-based detection). A statistically significant difference was also found when detecting skin features among different patients (p < 0.001). This can reflect differences in the number of analyzed frames per patient, illumination conditions, number of detected features, and skin type. In this case, the F test rejects the null hypothesis at the default 5% significance level, suggesting that the true variance is greater than 25%.
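As an illustration of the statistical comparison performed here, the sketch below runs a two-sample t test with SciPy. The arrays are synthetic placeholders (means taken from the values reported above; the spread and sample generation are assumptions), standing in for the per-feature triangulation errors produced by each method.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder data for illustration only: per-feature triangulation
# errors (mm). Means follow the reported values; the spread is assumed.
rng = np.random.default_rng(0)
err_mser = rng.normal(loc=0.207, scale=0.1, size=4934).clip(min=0)
err_surf = rng.normal(loc=0.204, scale=0.1, size=1727).clip(min=0)

# Two-sample t test; p < 0.05 was considered statistically significant
t_stat, p_value = stats.ttest_ind(err_mser, err_surf)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, significant = {p_value < 0.05}")
```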

The computation time for the skin feature detection was on average 0.19 s and 1.86 s per frame for SURF and MSER, respectively. Per-patient results are visualized in Fig. 5a. With a mean rate of about 5 fps, SURF is the more suitable candidate for a future real-time implementation. The preprocessing step required a computation time of 1.14 s. Notably, for real-time navigation, shorter preprocessing times may be achieved with improved lighting conditions.

Fig. 5
figure 5

Computation time for the preprocessing and skin feature detection

Marker localization and ground-truth comparison

The marker detection and ground-truth comparisons were performed by applying both the MSER and SURF feature detection algorithms to detect the optical markers positioned on the patient. The mean triangulation error of the tracked markers was 0.290 mm for MSER and 0.303 mm for SURF, as shown in Table 4. Table 4 also shows that average Euclidean distances of 0.627 mm for MSER and 0.622 mm for SURF were reached with respect to the ground truth. Descriptive statistics for the triangulation error and the Euclidean distance when the detection methods are applied to the optical markers are reported in Table 4. Figure 6a, b shows the frequency distributions of triangulation errors and Euclidean distances for the MSER and SURF detection methods. Notably, the thresholding performed to segment the optical markers prior to applying the feature detection can cause a non-ideal identification of the markers and decrease the triangulation accuracy. This is the main reason why the coordinates of the triangulated markers differ slightly from the ground truth. However, all markers were triangulated with sub-millimeter accuracy, resulting in a triangulation error comparable to that obtained with the skin features.

Table 4 Optical marker comparison, triangulation error, and Euclidean distance
Fig. 6
figure 6

a Triangulation error and b Euclidean distance of detected optical markers for ground-truth comparison

Discussion

This feasibility study proposes an innovative, accurate, and unobtrusive alternative approach for skin feature localization that can be used for patient tracking in surgical navigation. This result was achieved by direct detection of features on the patient's skin using high-resolution grey-scale cameras and subsequent analysis of the captured multi-view images. The framework is based on the MSER and SURF feature detection algorithms to localize the visual skin features.

The accuracy of roughly 0.6 mm at skin level achieved with the current framework should be seen in light of previous results obtained by the ARSN system relying on adhesive skin markers. In a recent study using ARSN, Burström et al. [32] demonstrated a technical accuracy of 0.94 ± 0.59 mm and 1.97 ± 1.33 mm for cadaveric and clinical cases of pedicle screw placement, respectively. The difference in accuracy could be attributed to the actual placement of pedicle screws in a clinical situation. Studies on markerless tracking in surgery are scarce. In the early 2000s, a series of articles regarding augmented reality for surgical planning using video projectors were published [33,34,35,36,37,38]. An off-the-shelf video camera and a 3D surface scanner were used to create a representation of the surface of the patient, onto which previously segmented structures could then be projected. This system made dynamic tracking of the patient possible and reached a high accuracy of 1.5 mm. It has been used for radiotherapy and in craniofacial surgery, but no use in spine surgery has been published [39]. Microsoft HoloLens has recently been used in many applications, with accuracies ranging from 9 to 45 mm in non-medical phantoms to roughly 5 mm in a spine phantom study [14, 15]. A recent robotic study using a structured light camera for markerless tracking of the bony anatomy reached a precision of 0.28 ± 0.16 mm [13]. A study using an approach similar to the one applied here, but for spine feature tracking, reached an accuracy of 0.5 mm [31]. When exploring the use of hyperspectral imaging (HSI) to detect skin features in 2D on healthy volunteers, an accuracy better than 0.25 mm was achieved [40].

In the current study, the feasibility of 3D localization of skin features in patients undergoing spine surgery was demonstrated, employing a pre-existing surgical navigation system with optical cameras. The use of grey-scale cameras rather than HSI is motivated by several factors. First, HSI is highly dependent on proper lighting conditions, and surgical lights illuminating the skin surface can interfere with the image acquisition process [40]. Second, the HSI system did not reach deeper than 1 mm into the skin, limiting its added value. Third, the integration of two or more hyperspectral cameras into the navigation system, to enable stereo vision, would come at a considerable cost. In this scenario, the use of grey-scale cameras represents a good compromise.

The results obtained by the proposed framework, using grey-scale cameras, are thus well in line with these previous results. A markerless tracking framework has the advantage of building a virtual reference grid that cannot be dislodged or completely obscured during surgery. Furthermore, compared to conventional dynamic reference frames that track a single vertebra, a markerless framework can track the entire spine and compensate for inherent movements within the spinal column during surgery. In this study, several datasets were used for validation. MSER showed a better capability of detecting features corresponding to observable anatomical skin details. It was visually verified that MSER provided a higher number of detected features, contributing to a better plane selection in the 3D outlier removal step. The multi-camera system enabled triangulation of each feature, yielding accurate 3D localization. This performance could potentially be monitored by computing the triangulation error continuously and in real time within a software function for navigation. The proposed framework may simplify the existing patient preparation procedure and improve the reliability of the tracking process by relying on skin features instead of optical markers or reference frames.

Limitations

The limitations of this study are the small sample size and the retrospective setup. The results were validated on eight clinical cases; the addition of more cases would strengthen the conclusions. A prospective study comparing different modes of patient tracking (different types of optical markers versus different frameworks for feature tracking) is warranted. Another important issue is reaching sufficient computational power for real-time tracking of movements and surgical navigation.

Conclusion

This study demonstrates the feasibility of skin feature localization by exploiting an optical multi-view, grey-scale camera system combined with image analysis and tracking techniques. The system was tested on several patients undergoing spinal surgery and achieved sub-millimeter accuracy. This study can serve as a basis for future surgical applications where optical patient tracking is required.

Methods

Image preprocessing

The principles of multi-view geometry [41] assume a pinhole camera model, which was applied to correct the camera images with respect to their intrinsic parameters. As a preprocessing step, the Contrast-Limited Adaptive Histogram Equalization (CLAHE) algorithm was used to maximize the detectability of skin features and reduce acquisition noise [42]. The fundamental matrix F, computed with the normalized eight-point algorithm, was used for image rectification [43]. The fundamental matrix F was defined as:

$$\begin{aligned} (x')^T\cdot F\cdot x = 0. \end{aligned}$$
(1)

For any pair of matching points \(x'\) and \(x\) in the two images, Eq. (1) holds. After rectification, both images share the same coordinate system, so that corresponding points lie on the same scan line. The obtained pixel points were then mapped onto the corrected input images to enable feature detection.
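The preprocessing and rectification described above can be sketched with standard OpenCV calls. This is a minimal illustration under assumptions, not the ARSN implementation: the matching point sets `pts_i` and `pts_j` used to estimate F, as well as the CLAHE parameters, are hypothetical.

```python
import cv2

def preprocess(gray_img):
    """CLAHE contrast enhancement of a grey-scale frame (parameters are assumptions)."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray_img)

def rectify_pair(img_i, img_j, pts_i, pts_j):
    """Estimate F with the (normalized) eight-point algorithm and rectify an image
    pair so that corresponding points end up on the same scan line.
    pts_i, pts_j: N x 2 arrays (N >= 8) of matching pixel coordinates."""
    F, _mask = cv2.findFundamentalMat(pts_i, pts_j, cv2.FM_8POINT)
    h, w = img_i.shape[:2]
    ok, H_i, H_j = cv2.stereoRectifyUncalibrated(pts_i, pts_j, F, (w, h))
    rect_i = cv2.warpPerspective(img_i, H_i, (w, h))   # rectified view from camera c_i
    rect_j = cv2.warpPerspective(img_j, H_j, (w, h))   # rectified view from camera c_j
    return F, rect_i, rect_j
```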

Feature detection, extraction, and matching

Let us consider an image pair captured with different cameras \(c_i\) and \(c_j\) from different views: \(Ic_i\) and \(Ic_j\). For both images, corresponding sets of \(n(c_i)\) and \(n(c_j)\) features, respectively, were extracted and saved in dedicated object ensembles \(F_{c_i}\) and \(F_{c_j}\), capturing all information of the detected features \(f(c,n)\):

$$\begin{aligned} F_{c_i} = f(c_i,1),\, f(c_i,2),\, \ldots,\, f(c_i,n(c_i)), \end{aligned}$$
(2)
$$\begin{aligned} F_{c_j} = f(c_j,1),\, f(c_j,2),\, \ldots,\, f(c_j,n(c_j)). \end{aligned}$$
(3)

The MSER and SURF algorithms were applied for blob-like feature detection. Afterward, the SURF feature descriptor was applied for feature extraction. The regions of interest (ROIs) on the skin were selected manually and saved as bounding-box coordinates for subsequent iterations. A manual selection was performed because the corresponding regions of the same subject had to be identified in the different views. It was, however, not possible to manually select exactly the same region boundary in multiple views. For this reason, the matching process contained some outliers, which were filtered out in a later processing step. Care was taken to select regions where the skin was as flat as possible. The chosen skin area was located around the optical markers (used as ground truth), so that both the markers and the skin area had the same illumination conditions. This manual extraction may represent a limitation for a real-time application. For every detected feature, a feature vector, known as a descriptor, was extracted to encode information about the feature, in this case the pixels surrounding the center of the blob. For this purpose, the SURF algorithm was used to extract the feature vector [27]. This method was adopted because it offers high reproducibility, even under different viewing conditions. The descriptor vectors were then saved in the following sets of dedicated descriptor vectors:

$$\begin{aligned} \Delta_{c_i} = \delta(c_i,1),\, \delta(c_i,2),\, \ldots,\, \delta(c_i,n(c_i)), \end{aligned}$$
(4)
$$\begin{aligned} \Delta_{c_j} = \delta(c_j,1),\, \delta(c_j,2),\, \ldots,\, \delta(c_j,n(c_j)). \end{aligned}$$
(5)

Here, every descriptor \(\delta(c,n)\) consists of a SURF descriptor vector and the y-coordinate of a specific feature n from a generic camera c. Using the dedicated descriptor vectors \(\Delta_{c_i}\) and \(\Delta_{c_j}\), the feature matching step matched the \(n(c_i)\) features detected in one view against the \(n(c_j)\) features from the other view and provided an index of correspondences between the two descriptor vectors \(\Delta_{c_i}\) and \(\Delta_{c_j}\) and the feature ensembles \(F_{c_i}\) and \(F_{c_j}\). These correspondences were obtained by computing the SSD between the SURF descriptor vectors of those features lying within the scan lines of interest. At this point, enforcing the epipolar constraint was crucial, since it restricts the matching process to features shifted along an epipolar line within a specific range. Thanks to this top–bottom scan-line stereo matching, matches between features lying on different epipolar lines were omitted, reducing the computational cost and maximizing the chance of a good match. This step returned two vectors with the indexes of the matched features. Using these indexes, two new dedicated object ensembles \(M_{c_i}\) and \(M_{c_j}\) of equal size, containing the matched features, were built.
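A minimal sketch of the detection, description, and scan-line matching steps, assuming OpenCV built with the contrib (xfeatures2d) module for SURF. The ROI handling, the y-tolerance `y_tol`, and the SURF Hessian threshold are illustrative assumptions, not values from the study.

```python
import cv2
import numpy as np

def detect_and_describe(rect_img, roi, method="SURF"):
    """Detect keypoints (MSER or SURF) inside a manually selected ROI of a
    rectified image and describe them with SURF descriptors. roi = (x, y, w, h)."""
    x, y, w, h = roi
    patch = rect_img[y:y + h, x:x + w]
    detector = cv2.MSER_create() if method == "MSER" else cv2.xfeatures2d.SURF_create(400)
    surf = cv2.xfeatures2d.SURF_create(400)           # descriptor is always SURF
    keypoints = detector.detect(patch)
    keypoints, descriptors = surf.compute(patch, keypoints)
    for kp in keypoints:                               # shift back to full-image coordinates
        kp.pt = (kp.pt[0] + x, kp.pt[1] + y)
    return keypoints, descriptors

def match_scanlines(kp_i, desc_i, kp_j, desc_j, y_tol=1.5):
    """SSD matching restricted to keypoints on (nearly) the same scan line of the
    rectified pair, i.e. the epipolar constraint described above."""
    matches = []
    for idx_a, (a, da) in enumerate(zip(kp_i, desc_i)):
        best_idx, best_ssd = None, np.inf
        for idx_b, (b, db) in enumerate(zip(kp_j, desc_j)):
            if abs(a.pt[1] - b.pt[1]) > y_tol:         # different scan line: skip
                continue
            ssd = float(np.sum((da - db) ** 2))        # sum of squared differences
            if ssd < best_ssd:
                best_idx, best_ssd = idx_b, ssd
        if best_idx is not None:
            matches.append((idx_a, best_idx, best_ssd))
    return matches
```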

Feature triangulation and outlier removal

In computer vision, triangulation allows determination of the 3D position of a point, given that the same point has been matched in at least two different views [41]. This is achieved with lines projected from each camera center through the respective point on the image plane. In practice, the projected lines do not always intersect in a single 3D point, so it was important to evaluate and quantify the accuracy of this method. In this feasibility study, the triangulation error was used as evaluation metric to obtain an index of the triangulation accuracy, and the 3D point locations were then used as a benchmark against an existing tracking system. The triangulation error was computed from the shortest line segment connecting the two projected lines: the midpoint of this segment is the triangulated point, and its length is the triangulation error, expressed in millimeters (or micrometers). The triangulation function of the ARSN system returns the 3D Cartesian coordinates of the triangulated point and the corresponding triangulation error through the projections from the camera centers, resulting in:

$$\begin{aligned} \Psi(c_i|c_j,n) = \big(x(c_i|c_j,n),\, y(c_i|c_j,n),\, z(c_i|c_j,n),\, e(c_i|c_j,n)\big), \end{aligned}$$
(6)

where \(c_i|c_j\) denotes the pair of cameras used for the triangulation, \(x(c_i|c_j,n)\), \(y(c_i|c_j,n)\), and \(z(c_i|c_j,n)\) are the 3D Cartesian coordinates of the triangulated feature n, and \(e(c_i|c_j,n)\) is the triangulation error of the triangulated feature n with cameras \(c_i|c_j\). These 3D coordinates were stored in a dedicated object vector \(\Psi_{c_i|c_j}\) for further analysis. Triangulation requires at least two cameras, i.e., one camera pair. Rectification could be obtained for all six camera pairs, \(c_i|c_j = c_1|c_2\), \(c_1|c_3\), \(c_1|c_4\), \(c_2|c_3\), \(c_2|c_4\), and \(c_3|c_4\), and the features could then be detected, matched, and triangulated with at least one of these pairs, mainly selected by choosing the more favorable line-of-sight avoiding occlusion of the analyzed regions. The overall mean triangulation error was computed by averaging the triangulation errors of all the features detected over multiple regions.

At this point, most of the matches had been obtained by relying only on the epipolar constraint. In the outlier removal step, an additional 3D constraint was imposed by approximating the skin surface with a simple planar representation on which all the detected features were assumed to lie. The M-estimator SAmple Consensus (MSAC) algorithm [44] was then used to fit a plane to the 3D point cloud and remove triangulated points lying more than a maximum distance (expressed in mm) above or below the plane model. The reference orientation constraint was inferred from the mechanical parameters of the C-arm once the position of the multi-view camera system with respect to the skin surface was obtained.

To compare the algorithms and assess the applicability of the approach, the size of the considered region of interest was quantified. For this purpose, a simple method for skin-area estimation was applied by considering the triangulated features included in the approximated skin-model plane. A flat mesh was created from the coordinates obtained by projecting the features orthogonally onto the model plane. The overall area was computed as the sum of all the Delaunay triangle areas included within the boundaries of the region [45], and the size of the analyzed region was expressed in square centimeters. Figure 7 shows the 3D point clouds of the inliers (blue) and of the discarded outliers (red), both reconstructed from the four high-resolution optical cameras illustrated on top.
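The midpoint construction of the triangulated point and its error can be sketched as follows. This is a generic implementation of the closest-segment-between-rays idea described above, not the ARSN system's triangulation function; the camera centres and unit ray directions are assumed to come from the calibrated camera model.

```python
import numpy as np

def triangulate_midpoint(o1, d1, o2, d2):
    """
    Midpoint triangulation of two viewing rays.
    o1, o2: camera centres (3,); d1, d2: unit direction vectors (3,) through the
    matched image points. Returns the triangulated 3D point and the triangulation
    error, i.e. the length of the shortest segment connecting the two rays.
    """
    # Solve for ray parameters s, t minimizing |(o1 + s*d1) - (o2 + t*d2)|
    w0 = o1 - o2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b                      # ~0 only for (near-)parallel rays
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = o1 + s * d1                           # closest point on ray 1
    p2 = o2 + t * d2                           # closest point on ray 2
    point = 0.5 * (p1 + p2)                    # triangulated 3D point (midpoint)
    error = float(np.linalg.norm(p1 - p2))     # triangulation error in mm
    return point, error
```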

Fig. 7
figure 7

3D outlier removal: the blue 3D point cloud represents the inliers, while the red point cloud depicts the outliers, i.e., the discarded keypoints. On top, the numbered cameras
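The plane-based outlier rejection can be illustrated with a simplified MSAC-style plane fit. This sketch is a stand-in for a library MSAC routine [44]; the distance threshold `max_dist` and the number of iterations are hypothetical parameters.

```python
import numpy as np

def fit_plane_msac(points, max_dist=1.0, n_iter=500, seed=0):
    """
    Robust plane fit to an N x 3 point cloud (mm) in the spirit of MSAC:
    candidate planes from random 3-point samples are scored with residuals
    truncated at max_dist; points farther than max_dist from the best plane
    are reported as outliers.
    """
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    best_cost, best_model = np.inf, None
    for _ in range(n_iter):
        sample = pts[rng.choice(len(pts), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                                  # degenerate (collinear) sample
            continue
        normal /= norm
        dist = np.abs((pts - sample[0]) @ normal)        # point-to-plane distances
        cost = np.minimum(dist, max_dist).sum()          # MSAC: truncated residual cost
        if cost < best_cost:
            best_cost, best_model = cost, (normal, sample[0])
    normal, p0 = best_model
    inlier_mask = np.abs((pts - p0) @ normal) <= max_dist
    return best_model, inlier_mask                       # inliers kept, outliers discarded
```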

Marker localization and ground-truth comparison

The accuracy of the proposed framework was evaluated by detection and 3D triangulation of the optical markers, which were considered as ground truth. The detection and matching of the optical markers were performed in two different views of the patients. For this purpose, a preprocessing step consisting of a segmentation mask and simple binarization by thresholding was applied. Finally, the MSER and SURF methods were used to detect the bright circular region of every optical marker. The final matching process for a pair of images is shown in Fig. 8. Once the markers were matched, their triangulation was computed and their 3D locations were obtained. The benchmark was performed against the ground-truth position of each optical marker provided by the ARSN system. The 3D Euclidean distances between the detected markers and the corresponding ground-truth marker coordinates were computed to assess the localization accuracy.

Fig. 8
figure 8

Matching of the optical markers
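As a rough illustration of this ground-truth comparison, the sketch below segments bright marker blobs by thresholding and computes the 3D Euclidean distance to a ground-truth coordinate. Connected-component centroids are used here as a simpler stand-in for the MSER/SURF detection of the marker regions, and the threshold and area limits are assumed values.

```python
import cv2
import numpy as np

def detect_marker_centers(gray_img, roi_mask, thresh=200, min_area=20):
    """Segment bright circular marker regions by simple thresholding within a
    mask, then return the centroid of each sufficiently large blob."""
    _, binary = cv2.threshold(gray_img, thresh, 255, cv2.THRESH_BINARY)
    binary = cv2.bitwise_and(binary, binary, mask=roi_mask)
    n, _labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    # Label 0 is the background; keep blobs above the minimum area
    return [tuple(centroids[i]) for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_area]

def localization_error(detected_xyz, ground_truth_xyz):
    """3D Euclidean distance between a triangulated marker and its ARSN ground truth (mm)."""
    return float(np.linalg.norm(np.asarray(detected_xyz) - np.asarray(ground_truth_xyz)))
```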