Introduction

In recent years, enhancing the accuracy and robustness of feature tracking has become a fundamental topic with extensive applications, particularly in simultaneous localization and mapping (SLAM). Traditional frame-based cameras, which capture images at predetermined intervals, have been the primary sensor type for feature tracking. However, because of their synchronous operation, these cameras have a limited ability to adapt to rapid changes in the surroundings. Recent research has sought to fuse data from multiple sensors, including cameras, lidars, and inertial measurement units (IMUs). Nonetheless, these multi-sensor solutions face challenges in scenarios characterized by high-speed motion and a wide dynamic range. A novel solution has emerged with the introduction of bio-inspired event cameras, notably the dynamic vision sensor (DVS) [1], which has garnered significant attention in the fields of robotics and computer vision.

Unlike conventional cameras, which produce absolute intensity information at fixed intervals between frames, event cameras yield asynchronous event streams that respond promptly to local pixel-level changes in brightness. Typical event data comprise timestamps, polarity, and pixel coordinates. Event cameras have several advantages, including low power consumption, an expansive dynamic range, and high temporal resolution. Furthermore, these sensors are sensitive to scene motion and can capture pixel-level brightness changes with latency as low as 1 \(\upmu \)s.

Another notable bio-inspired sensor, the dynamic and active-pixel vision sensor (DAVIS) [2], is capable of delivering both asynchronous event streams and traditional intensity images. Feature tracking based on event cameras plays a pivotal role in numerous high-level vision tasks such as feature-based odometry [3], visual recognition [4], and image registration [5]. Event-based feature tracking has unique advantages in handling challenging scenarios characterized by variations in lighting conditions, occlusions, motion blur, and the presence of dynamic background elements. Owing to the inherent characteristics of event cameras, namely high temporal resolution and sensitivity to local pixel-level brightness changes, these challenges can be effectively addressed. By leveraging these capabilities, event-based feature tracking contributes significantly to the robustness and accuracy of vision systems in dynamic environments.

It is noteworthy that the asynchronous event streams produced by event cameras exhibit substantial distinctions from traditional intensity images. Hence, algorithms tailored for frame-based cameras are inherently unsuitable for direct implementation in this context. The development of innovative algorithms that are specifically adept at processing asynchronous event streams is imperative for ensuring efficacious feature tracking. An effective strategy involves leveraging the complementary attributes of temporal information derived from both frames and events.

To address the challenges associated with asynchronous data, we present a novel feature tracking method that integrates data from both event and frame-based cameras. Our method is dedicated to enhancing the robustness of feature tracking by leveraging the strengths of each camera. The frame-based camera provides rich intensity information, while the event camera offers high-temporal-resolution information, allowing for more accurate and reliable tracking in scenarios characterized by high-speed motion and large changes in illumination. In this paper, we introduce a feature tracking method composed of three core modules:

  • FAST-based feature patch initialization: This module forms the cornerstone of the method. By generating corner points that respond rapidly to local pixel brightness changes, it lays a robust foundation for the subsequent tracking process.

  • Quality assessment for patch optimization: The integration of quality assessment within the patch optimization module represents another key improvement. This feature empowers the system to proactively identify and eliminate tracks with sub-optimal tracking quality, thereby enhancing the overall accuracy and reliability of the tracking process.

  • Nearest neighbor (NN) based patch association: The third module focuses on nearest neighbor based patch association, facilitating the establishment of connections between newly detected features and existing features. This strategic association mechanism ensures the continuity of feature tracking, contributing to the method’s overall robustness and stability.

The primary contribution of this paper lies in the reduction of average tracking error and the extension of feature age, which enhances the accuracy and robustness of the tracking method through the integration of the three aforementioned modules. We systematically evaluate our method using both public and self-collected datasets, focusing on metrics such as average tracking error and feature age to gauge tracking robustness. Our results indicate that our method outperforms the state-of-the-art method, achieving a reduction in average tracking error and an enhancement in feature age. This improvement is attributed to the effective fusion of frame and event data, which optimizes the utilization of event information in the asynchronous feature tracking paradigm. The remainder of this paper is organized as follows: “Related work” reviews prior work on event-based feature detection and tracking. “Method” describes the details of the event feature tracking method. “Result” presents the experimental results and corresponding analysis. Finally, we discuss our findings and draw conclusions in “Discussion” and “Conclusion”.

Related work

Event-based feature detection and tracking have proven valuable for facilitating data association in visual odometry. Kueng et al. [6] introduced a low-latency visual odometry approach based on feature tracking methods to estimate the camera pose. The successful implementation of this low-latency visual odometry relies on the integration of both frame and event cameras. The fundamental workflow involves the detection of corner points within the frame to extract feature points, followed by the utilization of two distinct methods for feature tracking within the event stream. One method aligns weighted point sets, enabling effective short-term feature tracking. The other approach provides a robust solution for long-term feature tracking by comparing the spatial histograms of events. Zhu et al. [7] introduced an odometry method that achieves highly accurate six-degree-of-freedom camera pose estimation using both event and IMU data. This method involves detecting corner points on frames and tracking them within a spatio-temporal window around the corner points. The method incorporates the IMU information by using an extended Kalman filter. Guan et al. [8] proposed a feature-based visual-inertial odometry (VIO) method that uses event-based corner points, event-based line features, and frame-based features to optimize the pose of a monocular camera in real time. The method uses point features from natural environments and line features from artificial scenes to provide additional structure or constraint information. This method achieves better performance than state-of-the-art frame-based VIO or line-based VIO. Le Gentil et al. [9] proposed an optimization-based framework for inertial measurement units and event cameras using line features, called IDOL. The method first extracts the line features of the scene in the event stream and subsequently estimates the camera pose by minimizing the point-line distance between individual events and the lines projected onto the image plane. Expanding upon these foundational works, it becomes evident that while significant strides have been made in leveraging event-based data for visual odometry, current methods have yet to reach their full potential, especially with regard to improving the durability and robustness of feature tracking.

In the domain of feature detection, repeatability is a critical property, ensuring that features detected in different images of the same scene remain consistent. Recent studies have focused on transferring popular image-based corner detectors into the realm of event-based tracking. Vasco et al. [10] adapted the conventional Harris corner detection method [11] to event data, which captures pixel-level brightness changes exclusively in response to motion or scene alterations. However, this approach necessitates gradient computation and convolution. Building upon the conventional FAST corner point detection method [12], Mueggler et al. [13] proposed an event-based corner point detection method called eFAST, which leverages the surface of active events (SAE) and requires only comparison operations. The SAE is a 2D representation of the event stream that stores the timestamp of the most recent event at each pixel position. To enhance the robustness of eFAST, Alzugaray and Chli [14] introduced an event feature detection method named Arc*. Additionally, Li et al. [15] presented FA-Harris, a select-and-refine strategy that uses an improved eFAST to select candidate points and filters them with an improved eHarris criterion. Mohamed et al. [16] introduced the TLF-Harris method, which uses a three-layer filtering stage and a low-complexity Harris score. Tedaldi et al. [17] proposed a feature tracking method using both frame-based cameras and event cameras. The method initializes the feature patches based on large spatial contrast variations, followed by alignment using the generated events to achieve tracking. Zhu et al. [18] introduced a probability-based association method for linking event streams and features. They characterized event corner point tracking as an optimization problem that involves matching the current view with feature templates, which requires evaluating a discrete set of tracking assumptions. Alzugaray and Chli [19] proposed an asynchronous event-corner tracking algorithm based on a normalized descriptor for the extracted event corners. Alzugaray and Chli [20] also introduced an event feature tracking method with multiple data association possibilities based on a tree structure, in which each node is an event corner. Its matching mechanism for adding nodes is based on spatio-temporal constraints [21]. Duo and Zhao [22] added constraints on the direction of corner points to the tree structure based on spatio-temporal constraints. This improvement can cut off branches and simplify the tree structure. Li et al. [23] proposed a gradient descriptor based on a speed-invariant time surface and used it to match two event corner points in a tree structure. Nevertheless, the above methods have not yet fully utilized event information. They mostly leverage spatio-temporal information but neglect the polarity information contained in event data. Furthermore, they do not take full advantage of information from frame-based cameras as a supplement.

Fig. 1

Flow chart of the asynchronous feature tracking method using event streams and frames

In addition to feature tracking, there have been significant advances in other areas that are worth learning from. An optimal iterative learning control approach for linear systems with nonuniform trial lengths under input constraints was proposed in [24], demonstrating the potential of iterative learning in optimizing control tasks. This aligns with the pursuit of robust feature tracking through iteratively refining the tracking process based on event data. The concept of self-triggered control, as discussed in [25] for discrete-time Markov jump systems, introduces an adaptability that is highly relevant to the feature tracking method. By adapting control actions based on the current system state, we can achieve a more robust tracking mechanism. Furthermore, [26] introduced a work on bipartite synchronization for cooperative-competitive neural networks with reaction–diffusion terms via a dual event-triggered mechanism. This parallels the objective of synchronizing feature tracking across event and frame-based cameras, which operate under distinct dynamics. The patent [27] on Internet of Things (IoT) integration for environmental monitoring leads us to contemplate the role of event cameras as ‘environmental sensors’ within the context of computer vision tasks. Hence, our research focuses on developing a feature tracking method that not only improves upon existing methods but also adapts to the variability and unpredictability of real environments.

To bridge the gap between present approaches and the changing demands of feature tracking, it is critical to recognize the current limitations in fully utilizing event-based information and leveraging the complementary advantages provided by frame-based cameras. Gehrig et al. proposed a more effective method called event-based Kanade–Lucas–Tomasi (EKLT) to fuse frame and event data. The EKLT excels in identifying features in intensity frames and tracking them using only event data, achieving asynchronous tracking and fully capitalizing on polarity information [28]. On public datasets, the EKLT method performs better than the EOF [18], ACE [19], AMH [20], and AEG trackers [23] in terms of both average tracking error and feature age. This suggests that EKLT might be a preferred choice for tracking objects in various applications. However, the EKLT suffers from robustness challenges, particularly in maintaining long feature ages.

To address these limitations, we propose a method built on hybrid modules that augment the EKLT framework. Our method contributes to this developing landscape by introducing a novel feature tracking approach that combines data from both event and frame-based cameras, using their respective strengths. We aim to increase feature age and improve tracking robustness by using FAST-based feature patch initialization, quality assessment for patch optimization, and nearest neighbor patch association, thereby extending the bounds of asynchronous feature tracking in computer vision applications.

Method

Overview

The algorithm for updating feature point locations based on event information is composed of three modules, as depicted in Fig. 1: patch initialization, patch optimization, and patch association. The patch initialization module extracts initial patches around corner points by applying the FAST corner detection algorithm to the frame. Initial corner points that react to variations in local pixel brightness can be identified through the FAST-based feature patch initialization module. In the patch optimization module, the motion parameters (optical flow and warp parameter) modify patch positions based on the difference between the observed and predicted brightness-increment images. To create the observed brightness-increment image, the polarities of the event streams are aggregated pixel by pixel on the patch until the number of events reaches a threshold. Concurrently, gradients and optical flow are employed to obtain the predicted brightness-increment image. A novel method is proposed in this module to evaluate the quality of optimization for lost-condition checking, leading to the early removal of tracks exhibiting poor tracking quality. The patch association module introduces the nearest neighbor (NN) algorithm to address the patch association problem, effectively establishing associations between new and existing features. This ensures the continuity of features throughout the tracking process.

Patch initialization

In contrast to the EKLT method, which only identifies Harris corner points in regions lacking corners, our approach detects FAST corner points across the entire frame. The original FAST corner point detection method proposed by Rosten and Drummond [12] is depicted in Fig. 2, where p represents a candidate corner point. The pixels numbered 1 to 16 clockwise are used to determine whether 12 contiguous pixels are brighter than p. This method has several limitations, including implicit assumptions about the distribution of feature appearance and the discarding of knowledge gained from the first four pixel tests. Consequently, we employ a machine learning-based FAST corner detection method [29], constructing a decision tree [30] for corner point detection and applying non-maximal suppression to eliminate corners that have an adjacent corner with a higher score function value. After FAST corner points have been detected on the whole frame, we extract initial patches around them.
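To make this step concrete, the following is a minimal C++ sketch of the initialization module using OpenCV’s FAST detector with non-maximal suppression. The Patch struct, the helper name initializePatches, and the default parameter values are illustrative assumptions rather than the exact implementation used in the paper.

```cpp
// Minimal sketch of FAST-based feature patch initialization using OpenCV.
// The Patch struct, helper name, and default parameter values are assumptions.
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

struct Patch {
  cv::Point2f center;  // feature position in the frame
  cv::Mat pixels;      // intensity patch extracted around the corner
};

std::vector<Patch> initializePatches(const cv::Mat& frame,
                                     int fastThreshold = 40,   // assumed value
                                     int patchSize = 25) {
  // Machine-learning FAST [29] with non-maximal suppression enabled.
  auto detector =
      cv::FastFeatureDetector::create(fastThreshold, /*nonmaxSuppression=*/true);
  std::vector<cv::KeyPoint> corners;
  detector->detect(frame, corners);

  std::vector<Patch> patches;
  const int half = patchSize / 2;
  for (const cv::KeyPoint& kp : corners) {
    // Skip corners whose patch would extend beyond the image borders.
    if (kp.pt.x < half || kp.pt.y < half ||
        kp.pt.x >= frame.cols - half || kp.pt.y >= frame.rows - half) {
      continue;
    }
    Patch p;
    p.center = kp.pt;
    cv::getRectSubPix(frame, cv::Size(patchSize, patchSize), kp.pt, p.pixels);
    patches.push_back(p);
  }
  return patches;
}
```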

Fig. 2

(Figure is adapted from Rosten and Drummond [12])

Original FAST corner point detection method.

Patch optimization

The motion parameters are used to update the positions of the initial patches in order to accomplish feature tracking. The cost function is used to optimize the motion parameters by minimizing the difference between the observed and predicted brightness-increment images. The observed brightness-increment image is obtained from the event streams, while the predicted brightness-increment image is obtained from the image gradients and the motion parameters (optical flow and warp parameter). The event camera model is first presented to elucidate the principle behind the creation of the observed brightness-increment image. The camera behaves asynchronously, producing output event streams formed as the discrete responses of individual pixels to logarithmic changes in brightness; each event contains only spatial, temporal, and polarity information. The event \(\{{\varvec{u}},pol,t\}\) comprises the 2D position \({\varvec{u}}=\{x,y\}\) of the pixel, the polarity \({ pol } \in \{+1,-1\}\) indicating the sign of the brightness change, and the timestamp t denoting the occurrence of the event. An event occurs when the logarithmic change in brightness at the pixel location between time t and \(t-\Delta t\) exceeds the threshold \(\pm C\) \((C>0)\). The equation \({\varvec{q}}({\varvec{u}}, t)= \log ({\varvec{I}}({\varvec{u}},t))\) represents the logarithmic brightness image. The logarithmic brightness-increment image \(\Delta {\varvec{q}}({\varvec{u}}, t)\) can then be expressed as:

$$\begin{aligned} \Delta {\varvec{q}}({\varvec{u}}, t)={\varvec{q}}({\varvec{u}}, t)-{\varvec{q}}({\varvec{u}}, t-\Delta t)=pol * C \end{aligned}$$
(1)

where \(t-\Delta t\) represents the timestamp of the previous event that occurred at the same pixel. This stage exploits the generative model of event data to approximate the inverse of the event generation process. The event polarities at each pixel in the patch \({\textbf{P}}\) are accumulated until the number of events reaches the adaptive threshold \(N_e\) within a time interval \(\Delta \tau .\) The sum \(\Delta {\varvec{q}}({\varvec{u}}, t),\) as depicted in Eq. (2), is commonly referred to as the observed brightness-increment image. It represents the cumulative total of all events whose positions lie within the patch throughout the time interval \(\Delta \tau .\)

$$\begin{aligned} \Delta {\varvec{q}}({\varvec{u}}, t)=\int _t^{t+\Delta \tau } C * f({\varvec{u}}, t) dt \quad {\varvec{u}} \in {\textbf{P}} \end{aligned}$$
(2)

where \(f({\varvec{u}}, t)\) represents the polarity of the event occurring at pixel position \({\varvec{u}}\) and timestamp t. It should be noted that the polarity of an event is either \(-1\) or \(+1,\) and its occurrence is determined by the threshold value C. The accumulated brightness-increment image is subsequently normalized. The adaptive threshold is initially set to a fixed value (e.g., 100) based on empirical knowledge. Subsequently, it is recalculated during optimization using the image gradients \(\nabla {\varvec{q}}({\varvec{u}}, t)\) and optical flow \({\varvec{v}}\):

$$\begin{aligned} N_e \approx \int _{{\varvec{u}} \in {\textbf{P}}}\left| \nabla {\varvec{q}}({\varvec{u}}, t) * \frac{{\varvec{v}}}{\Vert {\varvec{v}}\Vert }\right| d {\varvec{u}}. \end{aligned}$$
(3)
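To illustrate Eq. (2), the sketch below accumulates the polarities of events falling inside a patch until \(N_e\) events have been collected and then L2-normalizes the result. The Event struct and function name are assumed for illustration, and the adaptive recomputation of \(N_e\) via Eq. (3) is omitted.

```cpp
// Hedged sketch of the observed brightness-increment image (Eq. (2)):
// accumulate event polarities inside the patch until N_e events are reached,
// then L2-normalize. The Event struct and names are illustrative.
#include <opencv2/core.hpp>
#include <deque>

struct Event {
  float x, y;   // pixel coordinates u
  int pol;      // polarity in {-1, +1}
  double t;     // timestamp
};

cv::Mat observedIncrementImage(const std::deque<Event>& events,
                               const cv::Point2f& patchCenter,
                               int patchSize, int Ne) {
  cv::Mat dq = cv::Mat::zeros(patchSize, patchSize, CV_32F);
  const float half = patchSize / 2.0f;
  int accumulated = 0;
  for (const Event& e : events) {
    // Coordinates relative to the top-left corner of the patch.
    const float u = e.x - (patchCenter.x - half);
    const float v = e.y - (patchCenter.y - half);
    if (u < 0.0f || v < 0.0f || u >= patchSize || v >= patchSize) continue;
    dq.at<float>(static_cast<int>(v), static_cast<int>(u)) +=
        static_cast<float>(e.pol);
    if (++accumulated >= Ne) break;  // stop once N_e events fall in the patch
  }
  const double norm = cv::norm(dq, cv::NORM_L2);
  if (norm > 0.0) dq /= norm;        // normalization used later in Eq. (9)
  return dq;
}
```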

The observed brightness-increment image is generated directly from the event streams and can therefore be regarded as the measured value. When generating the predicted brightness-increment image, we first consider only the scenario in which the optical flow is unknown. It is assumed that the gradients and the logarithmic brightness image within the patch remain constant over the time interval \(\Delta \tau ,\) where \(\Delta \tau \) represents a brief period preceding t. Under the brightness constancy assumption, the derivative of \({\varvec{q}}({\varvec{u}}, t)\) satisfies:

$$\begin{aligned} \frac{\partial {\varvec{q}}}{\partial t}+\nabla {\varvec{q}}({\varvec{u}}, {\textrm{t}}) * {\varvec{v}}=0 \end{aligned}$$
(4)

where \(\nabla {\varvec{q}}({\varvec{u}}, {\textrm{t}})\) represents the brightness gradients at pixel position \({\varvec{u}},\) while \({\varvec{v}}=d{\varvec{u}}/dt\) denotes the optical flow. The equation can be approximated using Taylor’s series expansion as follows:

$$\begin{aligned} \Delta {\varvec{q}}({\varvec{u}}, t) \approx {\varvec{q}}({\varvec{u}}, t)-{\varvec{q}}({\varvec{u}}, t-\Delta \tau )=\frac{\partial {\varvec{q}}}{\partial t} \Delta \tau . \end{aligned}$$
(5)

Substituting Eq. (4) into the Taylor approximation of Eq. (5) yields:

$$\begin{aligned} \Delta \widehat{{\varvec{q}}}({\varvec{u}}, {\varvec{v}}, t)=-\nabla {\varvec{q}}({\varvec{u}}, t) * {\varvec{v}} * \Delta \tau \end{aligned}$$
(6)

where \(\Delta \widehat{{\varvec{q}}}({\varvec{u}}, {\varvec{v}}, t)\) is called predicted brightness-increment image, as it is calculated by image gradients and optical flow, with the optical flow being an unknown variable. The gradients undergo variations during this period in accordance with the warp parameter \({\varvec{p}}\):

$$\begin{aligned} {\varvec{W}}({\varvec{u}}, {\varvec{p}})={\varvec{R}}({\varvec{p}}) * {\varvec{u}}+{\varvec{T}}({\varvec{p}}) \end{aligned}$$
(7)

where \(({\varvec{R}}, {\varvec{T}}) \in SE(2)\) represents the rotation and translation. Similarly, \({\varvec{p}}\in se(2)\) denotes the corresponding Lie algebra. The predicted brightness-increment image resulting from an affine transformation of the gradients can be expressed as:

$$\begin{aligned} \Delta \widehat{{\varvec{q}}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}), {\varvec{v}}, {\varvec{t}})=-\nabla {\varvec{q}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}), {\varvec{t}}) * {\varvec{v}} * \Delta \tau . \end{aligned}$$
(8)
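A hedged sketch of Eq. (8) is given below: the patch gradients are warped by the SE(2) parameters \({\varvec{p}}\) and projected onto the optical flow \({\varvec{v}}\) over the interval \(\Delta \tau .\) The gradients are assumed to be stored as single-precision OpenCV matrices, and all names are illustrative.

```cpp
// Hedged sketch of Eq. (8): warp the patch gradients by the SE(2) parameters
// p = (theta, tx, ty), then combine them with the optical flow v over dtau.
// Gradients are assumed to be CV_32F single-channel patch images.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <cmath>

cv::Mat predictedIncrementImage(const cv::Mat& gradX, const cv::Mat& gradY,
                                double theta, double tx, double ty,  // warp p
                                double vx, double vy,                // flow v
                                double dtau) {
  // 2x3 affine matrix implementing W(u, p) = R(p) * u + T(p).
  const double c = std::cos(theta), s = std::sin(theta);
  cv::Mat W = (cv::Mat_<double>(2, 3) << c, -s, tx,
                                         s,  c, ty);
  cv::Mat gx, gy;
  cv::warpAffine(gradX, gx, W, gradX.size(), cv::INTER_LINEAR);
  cv::warpAffine(gradY, gy, W, gradY.size(), cv::INTER_LINEAR);

  // Delta q_hat(W(u, p), v, t) = -(grad q(W(u, p), t) . v) * dtau, pixel-wise.
  cv::Mat dqHat = -(gx * vx + gy * vy) * dtau;
  return dqHat;
}
```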

The motion parameters (optical flow \({\varvec{v}}\) and warp parameter \({\varvec{p}}\)) and the gradients are used to generate the predicted brightness-increment image. Because the motion parameters are unknown, this image is interpreted as the estimated value. The optimal motion parameters are determined by minimizing the difference between the observed and predicted brightness-increment images. Therefore, the cost function Q is defined as the normalized difference between the predicted and observed values:

$$\begin{aligned} Q = \min _{{\varvec{p}}, {\varvec{v}}} \left\| \frac{\Delta \widehat{{\varvec{q}}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}), {\varvec{v}}, t)}{\Vert \Delta \widehat{{\varvec{q}}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}), {\varvec{v}}, t)\Vert _{L^2({\textbf{P}})}} - \frac{\Delta {\varvec{q}}({\varvec{u}}, t)}{\Vert \Delta {\varvec{q}}({\varvec{u}}, t)\Vert _{L^2({\textbf{P}})}} \right\| _{L^2({\textbf{P}})}^2. \end{aligned}$$
(9)

A non-linear least-squares algorithm is employed to minimize the cost function and estimate the motion parameters. Subsequently, the motion parameters are used to revise the positions of the patches:

$$\begin{aligned} {\textbf{u}}^{\prime }={\varvec{R}}({\varvec{p}})^{-1} * {\varvec{u}}-{\varvec{R}}({\varvec{p}})^{-1} * {\varvec{T}}({\varvec{p}}). \end{aligned}$$
(10)
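After the non-linear least-squares solver converges, the position update of Eq. (10) amounts to applying the inverse SE(2) warp to the patch position. A small sketch follows, under an assumed parameterization of the warp as a rotation angle and a translation; names are illustrative.

```cpp
// Sketch of Eq. (10): apply the inverse SE(2) warp to the patch position once
// the warp parameters p = (theta, tx, ty) have been optimized.
#include <opencv2/core.hpp>
#include <cmath>

cv::Point2f updatePatchPosition(const cv::Point2f& u,
                                double theta, double tx, double ty) {
  // u' = R(p)^{-1} * u - R(p)^{-1} * T(p) = R(p)^T * (u - T(p)).
  const double c = std::cos(theta), s = std::sin(theta);
  const double dx = u.x - tx, dy = u.y - ty;
  return cv::Point2f(static_cast<float>( c * dx + s * dy),
                     static_cast<float>(-s * dx + c * dy));
}
```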

In the EKLT approach, the effectiveness of the motion parameters is contingent upon the quality of the optimization. Hence, we present a method for assessing the efficacy of the optimization in the context of lost-condition checking. Assuming the optimization result is valid, the patch position is updated by the warp parameter \({\varvec{p}}.\) The optimization outcome is evaluated through the final value of the cost function, \(Q_{last}.\) With the optimized parameter values denoted as \(p_{last}\) and \(v_{last},\) the cost function value at the end of the optimization process is:

$$\begin{aligned} Q_{\text {last}}= \left\| \frac{\Delta \widehat{{\varvec{q}}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}_{\text {last}}), {\varvec{v}}_{\text {last}}, t)}{\Vert \Delta \widehat{{\varvec{q}}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}_{\text {last}}), {\varvec{v}}_{\text {last}}, t)\Vert _{L^2({\textbf{P}})}} - \frac{\Delta {\varvec{q}}({\varvec{u}}, t)}{\Vert \Delta {\varvec{q}}({\varvec{u}}, t)\Vert _{L^2({\textbf{P}})}} \right\| _{L^2({\textbf{P}})}^2. \end{aligned}$$
(11)

The average value of the final cost function over the last n optimizations is then computed as:

$$\begin{aligned} Q_{\text {average}}=\frac{\sum _{i=1}^n Q_{\text {last}}^i}{n}. \end{aligned}$$
(12)

The quality of the optimization result is assessed by comparing the value of \(Q_{\text {average}}\) to a preset threshold parameter \(c_{\text {threshold}}.\) If \(Q_{\text {average}}\) surpasses \(c_{\text {threshold}}\) during optimization, the patch state is set to “lost,” indicating that optimization has failed. Otherwise, the patch positions are updated according to Eq. (10). This approach ensures that tracks with poor tracking quality are identified and eliminated early in the optimization process.
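One possible realization of this lost-condition check keeps a sliding window of the n most recent final cost values and compares their average with \(c_{\text {threshold}},\) as sketched below; the class and member names are illustrative assumptions.

```cpp
// Hedged sketch of the quality assessment (Eqs. (11)-(12)): keep the last n
// final cost values Q_last and flag the patch as "lost" when their average
// exceeds c_threshold. Class and member names are illustrative.
#include <cstddef>
#include <deque>
#include <numeric>

class TrackQualityMonitor {
 public:
  TrackQualityMonitor(std::size_t windowSize, double cThreshold)
      : windowSize_(windowSize), cThreshold_(cThreshold) {}

  // Returns true if the track should be marked as "lost".
  bool update(double qLast) {
    history_.push_back(qLast);
    if (history_.size() > windowSize_) history_.pop_front();
    const double qAverage =
        std::accumulate(history_.begin(), history_.end(), 0.0) /
        static_cast<double>(history_.size());
    return qAverage > cThreshold_;  // Q_average vs. c_threshold, Eq. (12)
  }

 private:
  std::size_t windowSize_;
  double cThreshold_;
  std::deque<double> history_;
};
```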

Patch association

This module employs the nearest neighbor (NN) method to solve the patch association problem. The module receives initial features and patches that are tracked by events as its inputs. AssociateFeatures is used to store the feature points that are associated with the existing patches. The parameter \(d_{\text {threshold}}\) is a pre-determined constant value. PatchSize represents the pixel size of the feature block.

Firstly, iterate through all existing patches, find the closest initial feature based on Euclidean distance, and ensure that its distance is lower than the threshold. Secondly, add the features that meet the above conditions to AssociateFeatures. Finally, new patches are extracted centered on the features not added to AssociateFeatures. The details of the whole process are shown in Algorithm 1.

Algorithm 1

Patch Association
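For readers who prefer code, the following C++ rendering mirrors the association procedure of Algorithm 1. The TrackedPatch structure and the function signature are assumptions, and the extraction of pixel data for newly spawned patches is omitted for brevity.

```cpp
// Illustrative rendering of Algorithm 1 (nearest-neighbor patch association).
// TrackedPatch and the function signature are assumptions.
#include <opencv2/core.hpp>
#include <cmath>
#include <cstddef>
#include <vector>

struct TrackedPatch {
  cv::Point2f center;  // current feature position of an existing patch
};

void associatePatches(const std::vector<cv::Point2f>& initialFeatures,
                      std::vector<TrackedPatch>& patches,
                      float dThreshold) {
  std::vector<bool> associated(initialFeatures.size(), false);

  // 1) For every existing patch, find the closest initial feature and accept
  //    it only if the Euclidean distance is below d_threshold.
  for (const TrackedPatch& patch : patches) {
    int best = -1;
    float bestDist = dThreshold;
    for (std::size_t i = 0; i < initialFeatures.size(); ++i) {
      const float dx = patch.center.x - initialFeatures[i].x;
      const float dy = patch.center.y - initialFeatures[i].y;
      const float d = std::hypot(dx, dy);
      if (d < bestDist) {
        bestDist = d;
        best = static_cast<int>(i);
      }
    }
    // 2) Features meeting the condition are added to AssociateFeatures.
    if (best >= 0) associated[static_cast<std::size_t>(best)] = true;
  }

  // 3) New patches are extracted centered on the features that were not
  //    added to AssociateFeatures.
  for (std::size_t i = 0; i < initialFeatures.size(); ++i) {
    if (!associated[i]) patches.push_back(TrackedPatch{initialFeatures[i]});
  }
}
```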

To start the algorithm, the following parameters and initial conditions are specified (a possible grouping into a configuration structure is sketched after this list):

Feature detection parameters: For the FAST-based feature patch initialization, parameters such as the threshold value for corner detection need to be defined. This value determines the sensitivity of the algorithm in detecting corner points in the frame images.

Patch size: The size of the patch around the detected features is an essential initial condition. It defines the amount of context considered for each feature point and has a direct impact on the tracking process.

Distance threshold \((d_{\text {threshold}})\): This value is used during the patch association module to determine the maximum allowable distance for feature point matching. It is crucial for maintaining the continuity of feature tracks across frames.

Quality assessment threshold \((c_{\text {threshold}})\): For the patch optimization module, a threshold value is needed to assess the quality of the optimization process. This value helps in deciding whether to maintain or discard a feature track based on the optimization quality.

Initial feature points: The algorithm requires an initial set of feature points detected in the first frame. These points serve as the starting point for the tracking process.
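These initial conditions can be grouped into a single configuration structure, as in the sketch below. The patch size and the two thresholds echo the public-dataset settings reported in “Result”, while the FAST threshold and the averaging window n are assumed placeholder values.

```cpp
// Possible grouping of the initial conditions into one configuration struct.
// patchSize, dThreshold, and cThreshold follow the public-dataset settings
// reported in the Results; fastThreshold and qualityWindow are assumed.
struct TrackerConfig {
  int fastThreshold = 40;    // FAST corner detection sensitivity (assumed)
  int patchSize = 25;        // side length of the feature patch in pixels
  double dThreshold = 1.5;   // max association distance d_threshold (pixels)
  double cThreshold = 0.5;   // quality-assessment threshold c_threshold
  int qualityWindow = 5;     // n in Eq. (12), number of averaged costs (assumed)
};
```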

Result

In this section, we present a comprehensive analysis of the experimental data, focusing on the average tracking error and feature age for each dataset. Average tracking error serves as a key metric in evaluating the performance of feature tracking methods. A lower average tracking error indicates higher accuracy in tracking features. Feature age provides valuable insights into the robustness and stability of feature tracking methods. A higher feature age signifies that features remain tracked for longer durations without losing accuracy. The data is compared across different methods, including EKLT, the combination of individual modules, and the full proposed method. This analysis aims to assess the effectiveness of the proposed enhancements in asynchronous feature tracking.

Tracking performance on all datasets

The public event camera datasets generated with a DAVIS240C, which has a spatial resolution of \(240 \times 180\) pixels, are adopted to perform the comparison. The dataset comprises event streams and intensity images. We selected several scenes with different complex textures, including “shapes_6dof”, “shapes_rotation”, “boxes_6dof”, “boxes_rotation”, “poster_6dof”, and “poster_rotation” from the Event Camera Dataset and Simulator [31,32,33], and “pipe_2” and “bicycles” from the paper [28]. We recorded two datasets, “computer_rotation” and “keyboard_translation”, using the Celex5 camera. This camera has a spatial resolution of \(1280 \times 800\) pixels. The datasets “shapes_rotation” and “shapes_6dof” are simple black and white scenes in which the camera captures multiple geometric patterns attached to a wall; “boxes_6dof”, “boxes_rotation”, “poster_6dof”, and “poster_rotation” are highly textured scenes in which the camera captures multiple boxes with different textures and highly textured posters; “pipe_2” and “bicycles” are natural scenes in which the camera captures outdoor pipes and outdoor parked bicycles; “computer_rotation” and “keyboard_translation” are low-texture scenes in which the camera captures a computer and a keyboard. The dataset name suffixes, such as “rotation”, “translation”, and “6dof”, correspond to the camera’s mode of motion. To provide visual context, we display frames from all datasets in Fig. 3.

Fig. 3

Description of each dataset. a is a simple black and white scene; b and c are high-texture scenes; d and e are natural scenes; f and g are low-texture scenes

We conducted our analysis using the open-source feature tracking evaluation package, “rpg_feature_analysis,” available on GitHub, which is also employed for evaluating the EKLT method. This package provides ground truth information, established through the Kanade–Lucas–Tomasi (KLT) tracking method on DAVIS image frames. The EKLT method relies on local brightness invariance and motion continuity assumptions for tracking corner points across frames.

As event-based tracking results have higher temporal resolution, we evaluated errors by comparing each ground truth sample with the estimated position, determined through linear interpolation of the two tracked positions closest in time (a small interpolation sketch follows this paragraph). Our approach to parameter selection was based on both theoretical considerations and empirical testing. We started by reviewing the current literature on event-based vision systems and feature tracking algorithms to find a good starting point for our parameter values. This initial selection was guided by common practices and recommendations found in the literature. Following that, we ran a number of early tests on public and self-collected datasets to fine-tune the parameters. We changed each parameter individually and in conjunction with others to see how they affected tracking performance measures such as average tracking error and feature age. Using this iterative procedure, we determined the parameter values that best balance tracking precision and computational efficiency. In experiments using publicly available datasets, the patch size was set to \(25 \times 25\) pixels, and the threshold values for \(d_{\text {threshold}}\) and \(c_{\text {threshold}}\) were set to 1.5 and 0.5, respectively. In experiments using self-collected datasets, the patch size was increased to \(40 \times 40\) pixels, and the threshold values for \(d_{\text {threshold}}\) and \(c_{\text {threshold}}\) were set to 3.0 and 1.5, respectively.
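The temporal alignment used in this evaluation can be illustrated with the small helper below, which linearly interpolates the tracked position between the two estimates closest in time to a ground-truth sample; the types and names are illustrative.

```cpp
// Sketch of the evaluation step: the tracked position at a ground-truth
// timestamp tQuery is linearly interpolated between the two temporally
// closest estimates (t0, p0) and (t1, p1). Names are illustrative.
#include <opencv2/core.hpp>

cv::Point2f interpolateAt(double tQuery,
                          double t0, const cv::Point2f& p0,
                          double t1, const cv::Point2f& p1) {
  const double alpha = (t1 > t0) ? (tQuery - t0) / (t1 - t0) : 0.0;
  return p0 + static_cast<float>(alpha) * (p1 - p0);
}
// The tracking error for that sample is the Euclidean distance between the
// interpolated position and the ground-truth position.
```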

Figure 4 depicts the visualization process of our asynchronous feature tracking. It displays the projection of features onto the current frame, with the magenta arrow (marked by an ‘x’ at the end) indicating the direction of feature tracking in the image plane, along with the flow angle estimation.

Fig. 4

Visualization of asynchronous event feature tracking. The trajectory of feature tracking is plotted in carmine in this figure

To validate the effectiveness of the three modules proposed in our approach, we designed ablation experiments. The feature block initialization method based on FAST is referred to as A, the method for evaluating optimization quality is referred to as B, and the nearest neighbor feature block association method is referred to as C. The plus sign \((+)\) is used to represent combinations of different methods.

The results are listed in Tables 1 and 2, respectively presenting the average tracking error and feature age obtained from testing on ten datasets under various method combinations, along with a comparison to the EKLT method.

In Table 1, the EKLT method achieves sub-pixel level average tracking error across all datasets, but its accuracy is affected by track quality and sensor noise, exhibiting tracking errors above 0.6 pixels in some cases. The addition of individual modules (A, B, C) to the EKLT consistently results in a reduction of tracking error. However, the combination of all three modules in Ours (A + B + C) yields the most significant reduction in tracking error, achieving the lowest values across all datasets. It’s worth noting that in the “boxes_rotation” dataset, our proposed method achieves significant improvements, reducing the average tracking error from 0.89 to 0.63 pixels, which is a reduction of 29.2% compared to the EKLT. Across all datasets, our method consistently reduces the average tracking error by at least 1.3%.

In Table 2, the EKLT method exhibits varying feature ages across datasets, performing well in some scenarios but struggling in others. The addition of individual modules consistently increases feature age, indicating more stable tracking. However, the most significant improvements are observed when all three modules are combined. This combination leads to substantial feature age increases across all datasets, demonstrating the efficacy of the proposed method. Notably, in the “pipe_2” dataset, our method achieves substantial improvements, increasing the feature age from 0.78 to 1.03 s, which represents a growth of 32.1% compared to the EKLT. Across all datasets, our method consistently increased the feature age by at least 9.6%. It also showed a more pronounced increase in feature age for datasets with simple black-and-white scenes or high levels of texture complexity.

The results indicate that our enhancements to patch initialization, optimization, and association contribute to the elimination of low-quality tracks, thereby reducing tracking error and enhancing feature age. The bar charts in Figs. 5 and 6 clearly illustrate our experimental findings compared to the EKLT across all ten datasets, with the horizontal coordinates corresponding to the dataset names in the tables. They illustrate that the combination of enhanced modules in our method greatly improves the effectiveness and stability of feature tracking across a range of motion scenarios.

Table 1 Average tracking error
Table 2 Feature age

Computational performance on all the datasets

All experiments were conducted on a Linux system using C++ programming language, leveraging open-source libraries such as OpenCV, Eigen, Sophus, and Ceres. The computations ran on a computer equipped with an AMD Ryzen 7 4700G processor, clocked at 3.60 GHz, and 16 GB of RAM.

Our method employed the FAST corner detection algorithm for the initialization module, which outperformed the Harris corner detection used in the EKLT in terms of speed. While the other two modules involved additional computations, both methods met real-time requirements for feature detection and tracking, ensuring no compromise on speed.

The number of corner points processed per second serves as an indicator of a method’s efficiency in handling corner points during feature detection and tracking. To demonstrate the efficacy of our patch association module improvement, we conducted a comparison of the number of maintained features per second between the EKLT and our method. In all ten datasets, we achieved faster feature processing with our method, as seen in Table 3 and Fig. 7. The relative percentage indicates the improvement in feature processing speed compared to the EKLT. The improvement ranges from 1.2 to 7.6%, demonstrating that our method is more efficient in handling feature points during feature detection and tracking. This enhanced computational performance is a valuable attribute, as it ensures that our feature tracking method can process a larger number of features in a given amount of time, making it well suited for real-time applications and scenarios where quick and efficient processing is essential.

Discussion

Our research introduces an innovative approach to asynchronous feature tracking in event cameras by integrating event and frame streams. The key contributions of our method are highlighted through the following discussion:

Enhanced hybrid data-driven modules: The employment of the FAST corner detection algorithm in our patch initialization module has proven to be a cornerstone for rapid and robust feature detection. By swiftly responding to local pixel brightness changes, our method ensures a strong foundation for the subsequent tracking process. The machine learning-based enhancement and non-maximal suppression further refine the selection of corner points, leading to more accurate feature patch initialization compared to traditional methods.

Our patch association module, which utilizes the nearest neighbor algorithm, demonstrates significant improvements in establishing effective connections between newly detected features and existing tracks. This strategic association mechanism is pivotal for maintaining the continuity of feature tracking and contributes substantially to the overall robustness and stability of our method.

The integration of a quality assessment within the patch optimization module is a key innovation of our approach. By proactively identifying and eliminating tracks with sub-optimal tracking quality, our method enhances the accuracy and reliability of the tracking process. The use of a cost function to evaluate optimization quality allows for the early removal of poor tracks, thereby improving the overall tracking performance.

Fig. 5

Comparison results of average tracking error on all the datasets

Fig. 6

Comparison results of feature age on all the datasets

Performance across diverse datasets: One of the most compelling aspects of our method is its consistent performance across a range of public and self-collected datasets. Our method’s ability to reduce average tracking error and increase feature age, as demonstrated in the results, underscores its effectiveness in handling diverse motion scenarios and environmental conditions. This adaptability demonstrates the generalizability of our method and highlights its potential for practical applications.

Table 3 Computational performance
Fig. 7

Comparison results of computational performance on all the datasets

Computational efficiency: The computational performance of our method is noteworthy, with improvements in feature processing speed over the EKLT method. This efficiency is crucial for real-time applications where quick and accurate processing is essential. Our method’s balance between accuracy and speed positions it as a strong candidate for hybrid data-driven visual odometry and other computer vision tasks based on multiple sensors.

Limitations and future work: While our method has shown promising results, there are areas where further improvements can be made. For instance, our method may encounter challenges in highly dynamic scenes with rapid motion and significant changes in lighting conditions. In addition, the computational complexity of the quality assessment in patch optimization could be reduced for even faster processing speeds. The generalization of our method to other types of event cameras and broader scenarios also requires further investigation. In the future, we plan to introduce adaptive mechanisms to handle dynamic scenes more effectively. This involves altering parameters in real time based on the current scene conditions, as well as using more powerful models capable of handling rapid motion and varied illumination. We will explore optimization strategies, including algorithmic improvements and parallel processing, to reduce the computational overhead of our method. This will make it more suitable for real-time applications and devices with limited computational power. To improve generalization, we will test our method on a wider variety of datasets, including those with different types of motion, lighting conditions, and scene complexity, and assess the uncertainty bounds of our method in future work.

Conclusion

In conclusion, this study has presented a comprehensive evaluation of an enhanced asynchronous feature tracking method, integrating FAST-based feature patch initialization, quality-driven patch optimization, and Euclidean distance-based patch association. Through rigorous experimentation on diverse datasets, we have demonstrated the superior performance of our method in comparison to the established EKLT method. The significant reduction in average tracking error and concurrent increase in feature age underscore the effectiveness of our approach in achieving more accurate and stable feature tracking. The ablation experiments have validated the individual contributions of each module, emphasizing their synergistic effect in enhancing tracking capabilities. Our method’s consistent outperformance across varying complexities and motion scenarios attests to its robustness and adaptability. Furthermore, the evaluation of computational performance has affirmed the real-time feasibility of our method, with notable efficiency gains in feature processing speed compared to the EKLT. These results position our method as a promising solution for applications demanding efficient and accurate feature tracking, particularly in the context of a fully event- and frame-driven visual odometry pipeline.

Overall, this research introduces a novel and effective method for asynchronous feature tracking, contributing valuable insights to the broader field of computer vision. The demonstrated improvements in tracking accuracy, stability, and computational efficiency pave the way for the development of state-of-the-art fully event-driven visual odometry pipelines and other computer vision tasks that require processing the asynchronous discrete outputs of event cameras.