Introduction

In recent years, enhancing the accuracy and robustness of feature tracking has become a fundamental topic with extensive applications, particularly in simultaneous localization and mapping (SLAM). Traditional frame-based cameras, which capture images at predetermined intervals, have been the primary sensor type for feature tracking. However, because of their synchronous operation, these cameras have a limited ability to adapt to rapid changes in the surroundings. Recent research has sought to fuse data from multiple sensors, including cameras, lidars, and inertial measurement units (IMUs). Nonetheless, these multi-sensor solutions face challenges in scenarios characterized by high-speed motion and a wide dynamic range. A novel solution has emerged with the introduction of bio-inspired event cameras, notably the dynamic vision sensor (DVS) [1], which has garnered significant attention in the fields of robotics and computer vision.

Unlike conventional cameras, which produce absolute intensity information at fixed intervals between frames, event cameras yield asynchronous event streams that respond promptly to local pixel-level changes in brightness. Typical event data comprise timestamps, polarity, and pixel coordinates. Event cameras have several advantages, including low power consumption, an expansive dynamic range, and high temporal resolution. Furthermore, these sensors are sensitive to scene motion and can capture pixel-level brightness changes with latency as low as 1 \(\upmu \)s.

Another notable bio-inspired sensor, the dynamic and active-pixel vision sensor (DAVIS) [2], is capable of delivering both asynchronous event streams and traditional intensity images. Feature tracking based on event cameras plays a pivotal role in numerous high-level vision tasks such as feature-based odometry [3], visual recognition [4], and image registration [5]. Event-based feature tracking has unique advantages in handling challenging scenarios characterized by variations in lighting conditions, occlusions, motion blur, and the presence of dynamic background elements. Owing to the inherent characteristics of event cameras, namely high temporal resolution and sensitivity to local pixel-level brightness changes, these challenges can be effectively addressed. By leveraging these capabilities, event-based feature tracking contributes significantly to the robustness and accuracy of vision systems in dynamic environments.

It is noteworthy that the asynchronous event streams produced by event cameras exhibit substantial distinctions from traditional intensity images. Hence, algorithms tailored for frame-based cameras are inherently unsuitable for direct implementation in this context. The development of innovative algorithms that are specifically adept at processing asynchronous event streams is imperative for ensuring efficacious feature tracking. An effective strategy involves leveraging the complementary attributes of temporal information derived from both frames and events.

To address the challenges associated with asynchronous data, we present a novel feature tracking method that integrates data from both event and frame-based cameras. Our method is dedicated to enhancing the robustness of feature tracking by leveraging the strengths of each camera. The frame-based camera provides rich intensity information, while the event camera offers high-temporal-resolution information, allowing for more accurate and reliable tracking in scenarios characterized by high-speed motion and large changes in illumination. In this paper, we introduce a feature tracking method composed of three core modules:

  • FAST-based feature patch initialization: This module forms the cornerstone of the method. By generating corner points that respond rapidly to local pixel brightness changes, it lays a robust foundation for the subsequent tracking process.

  • Quality assessment for patch optimization: The integration of quality assessment within the patch optimization module represents another key improvement. This feature empowers the system to proactively identify and eliminate tracks with sub-optimal tracking quality, thereby enhancing the overall accuracy and reliability of the tracking process.

  • Nearest neighbor (NN) based patch association: The third module focuses on nearest neighbor based patch association, facilitating the establishment of connections between newly detected features and existing features. This strategic association mechanism ensures the continuity of feature tracking, contributing to the method’s overall robustness and stability.

The primary contribution of this paper lies in the reduction of average tracking error and the extension of feature age, which enhances the accuracy and robustness of the tracking method through the integration of the three aforementioned modules. We systematically evaluate our method using both public and self-collected datasets, focusing on metrics such as average tracking error and feature age to gauge tracking robustness. Our results indicate that our method outperforms the state-of-the-art method, achieving a reduction in average tracking error and an enhancement in feature age. This improvement is attributed to the effective fusion of frame and event data, which optimizes the utilization of event information in the asynchronous feature tracking paradigm. The remainder of this paper is organized as follows: “Related work” reviews prior work on event-based feature detection and tracking. “Method” describes the details of the event feature tracking method. “Result” presents the experimental results and corresponding analysis. Finally, we discuss our findings and draw conclusions in “Discussion” and “Conclusion”.

Related work

Event-based feature detection and tracking have proven valuable for facilitating data association in visual odometry. Kueng et al. [6] introduced a low-latency visual odometry approach based on feature tracking methods to estimate the camera pose. The successful implementation of this low-latency visual odometry relies on the integration of both frame and event cameras. The fundamental workflow involves the detection of corner points within the frame to extract feature points, followed by the utilization of two distinct methods for feature tracking within the event stream. One method aligns weighted point sets, enabling effective short-term feature tracking. The other approach provides a robust solution for long-term feature tracking by comparing the spatial histograms of events. Zhu et al. [7] introduced an odometry method that achieves highly accurate six-degree-of-freedom camera pose estimation using both event and IMU data. This method involves detecting corner points on frames and tracking them within a spatio-temporal window around the corner points. The method incorporates the IMU information by using an extended Kalman filter. Guan et al. [8] proposed a feature-based visual-inertial odometry (VIO) method that uses event-based corner points, event-based line features, and frame-based features to optimize the pose of a monocular camera in real time. The method uses point features from natural environments and line features from artificial scenes to provide additional structure or constraint information. This method achieves better performance than state-of-the-art frame-based VIO or line-based VIO. Le Gentil et al. [9] proposed an optimization-based framework for inertial measurement units and event cameras using line features, called IDOL. The method first extracts the line features of the scene in the event stream and subsequently estimates the camera pose by minimizing the point-line distance between individual events and the lines projected onto the image plane. Expanding upon these foundational works, it becomes evident that while significant strides have been made in leveraging event-based data for visual odometry, current methods have yet to reach their full potential, especially with regard to improving the durability and robustness of feature tracking.

In the domain of feature detection, repeatability is a critical property, ensuring that features detected in different images of the same scene remain consistent. Recent studies have focused on transferring popular image-based corner detectors into the realm of event-based tracking. Vasco et al. [10] adapted the conventional Harris corner detection method [11] to event data, which captures pixel-level brightness changes exclusively in response to motion or scene alterations. However, this approach necessitates gradient computation and convolution. Building upon the conventional FAST corner point detection method [12], Mueggler et al. [13] proposed an event-based corner point detection method called eFAST, which leverages the surface of active events (SAE) and requires only comparison operations. The SAE is a 2D representation of the event stream that stores the timestamp of the most recent event at each pixel position. To enhance the robustness of eFAST, Alzugaray and Chli [14] introduced an event feature detection method named Arc*. Additionally, Li et al. [15] presented FA-Harris, a select-and-refine strategy that uses an improved eFAST to select candidate points and filters them with an improved eHarris criterion. Mohamed et al. [16] introduced the TLF-Harris method, which uses a three-layer filtering stage and a low-complexity Harris score. Tedaldi et al. [17] proposed a feature tracking method using both frame-based cameras and event cameras. The method initializes the feature patches based on large spatial contrast variations, followed by alignment using the generated events to achieve tracking. Zhu et al. [18] introduced a probability-based association method for linking event streams and features. They characterized event corner point tracking as an optimization problem that involves matching the current view with feature templates, which requires evaluating a discrete set of tracking assumptions. Alzugaray and Chli [19] proposed an asynchronous event-corner tracking algorithm based on a normalized descriptor for the extracted event corners. Alzugaray and Chli [20] also introduced an event feature tracking method with multiple data association possibilities based on a tree structure, in which each node is an event corner. Its matching mechanism for adding nodes is based on spatio-temporal constraints [21]. Duo and Zhao [22] added constraints on the direction of corner points to the tree structure based on spatio-temporal constraints. This improvement can cut off branches and simplify the tree structure. Li et al. [23] proposed a gradient descriptor based on a speed-invariant time surface and used it to match two event corner points in a tree structure. Nevertheless, the above methods have not yet fully utilized event information. They mostly leverage spatio-temporal information but neglect the polarity information contained in event data. Furthermore, they do not take full advantage of information from frame-based cameras as a supplement.

Fig. 1

Flow chart of the asynchronous feature tracking method using event streams and frames

In addition to feature tracking, there have been significant advances in other areas that are worth learning from. An optimal iterative learning control approach for linear systems with nonuniform trial lengths under input constraints was proposed in [24], demonstrating the potential of iterative learning in optimizing control tasks. This aligns with the pursuit of robust feature tracking through iteratively refining the tracking process based on event data. The concept of self-triggered control, as discussed in [25] for discrete-time Markov jump systems, introduces an adaptability that is highly relevant to the feature tracking method. By adapting control actions based on the current system state, we can achieve a more robust tracking mechanism. Furthermore, [26] introduced a work on bipartite synchronization for cooperative-competitive neural networks with reaction–diffusion terms via a dual event-triggered mechanism. This parallels the objective of synchronizing feature tracking across event and frame-based cameras, which operate under distinct dynamics. The patent [27] on Internet of Things (IoT) integration for environmental monitoring leads us to contemplate the role of event cameras as ‘environmental sensors’ within the context of computer vision tasks. Hence, our research focuses on developing a feature tracking method that not only improves upon existing methods but also adapts to the variability and unpredictability of real environments.

To bridge the gap between present approaches and the changing demands of feature tracking, it is critical to recognize the current limitations in fully utilizing event-based information and leveraging the complementary advantages provided by frame-based cameras. Gehrig et al. proposed a more effective method called event-based Kanade–Lucas–Tomasi (EKLT) to fuse frame and event data. The EKLT excels in identifying features in intensity frames and tracking them using only event data, achieving asynchronous tracking and fully capitalizing on polarity information [28]. On public datasets, the EKLT method performs better than the EOF [18], ACE [19], AMH [20], and AEG trackers [23] in terms of both average tracking error and feature age. This suggests that EKLT might be a preferred choice for tracking objects in various applications. However, the EKLT suffers from robustness challenges, particularly in maintaining long feature ages.

To address these limitations, we propose a method built on hybrid modules that augment the EKLT framework. Our method contributes to this developing landscape by introducing a novel feature tracking approach that combines data from both event and frame-based cameras, using their respective strengths. We aim to increase feature age and improve tracking robustness by using FAST-based feature patch initialization, quality assessment for patch optimization, and nearest neighbor patch association, thereby extending the bounds of asynchronous feature tracking in computer vision applications.

Method

Overview

The algorithm for updating feature point locations based on event information is composed of three modules, as depicted in Fig. 1: patch initialization, patch optimization, and patch association. The patch initialization module extracts initial patches around corner points by applying the FAST corner detection algorithm to the frame. Initial corner points that react to variations in local pixel brightness can be identified through the FAST-based feature patch initialization module. In the patch optimization module, the motion parameters (optical flow and warp parameter) modify patch positions based on the difference between the observed and predicted brightness-increment images. To create the observed brightness-increment image, the polarities of the event streams are aggregated pixel by pixel on the patch until the number of events reaches a threshold. Concurrently, gradients and optical flow are employed to obtain the predicted brightness-increment image. A novel method is proposed in this module to evaluate the quality of optimization for lost-condition checking, leading to the early removal of tracks exhibiting poor tracking quality. The patch association module introduces the nearest neighbor (NN) algorithm to address the patch association problem, effectively establishing associations between new and existing features. This ensures the continuity of features throughout the tracking process.

Patch initialization

In contrast to the EKLT method, which only identifies Harris corner points in regions lacking corners, our approach detects FAST corner points across the entire frame. The original FAST corner point detection method proposed by Rosten and Drummond [12] is depicted in Fig. 2, where p represents a candidate corner point. The pixels numbered 1 to 16 clockwise are used to determine whether 12 contiguous pixels are brighter than p. This method has several limitations, including implicit assumptions about the distribution of feature appearance and the discarding of knowledge gained from the first four pixel tests. Consequently, we employ a machine learning-based FAST corner detection method [29], constructing a decision tree [30] for corner point detection and applying non-maximal suppression to eliminate corners that have an adjacent corner with a higher score function value. After FAST corner points have been detected on the whole frame, we extract initial patches around them.
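To make this step concrete, the following is a minimal C++ sketch of the initialization module using OpenCV’s FAST detector with non-maximal suppression. The Patch struct, the helper name initializePatches, and the default parameter values are illustrative assumptions rather than the exact implementation used in the paper.

```cpp
// Minimal sketch of FAST-based feature patch initialization using OpenCV.
// The Patch struct, helper name, and default parameter values are assumptions.
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

struct Patch {
  cv::Point2f center;  // feature position in the frame
  cv::Mat pixels;      // intensity patch extracted around the corner
};

std::vector<Patch> initializePatches(const cv::Mat& frame,
                                     int fastThreshold = 40,   // assumed value
                                     int patchSize = 25) {
  // Machine-learning FAST [29] with non-maximal suppression enabled.
  auto detector =
      cv::FastFeatureDetector::create(fastThreshold, /*nonmaxSuppression=*/true);
  std::vector<cv::KeyPoint> corners;
  detector->detect(frame, corners);

  std::vector<Patch> patches;
  const int half = patchSize / 2;
  for (const cv::KeyPoint& kp : corners) {
    // Skip corners whose patch would extend beyond the image borders.
    if (kp.pt.x < half || kp.pt.y < half ||
        kp.pt.x >= frame.cols - half || kp.pt.y >= frame.rows - half) {
      continue;
    }
    Patch p;
    p.center = kp.pt;
    cv::getRectSubPix(frame, cv::Size(patchSize, patchSize), kp.pt, p.pixels);
    patches.push_back(p);
  }
  return patches;
}
```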

Fig. 2

(Figure is adapted from Rosten and Drummond [12])

Original FAST corner point detection method.

Patch optimization

The motion parameters are used to update the positions of the initial patches in order to accomplish feature tracking. The cost function is used to optimize the motion parameters by minimizing the difference between the observed and predicted brightness-increment images. The observed brightness-increment image is obtained from the event streams, while the predicted brightness-increment image is obtained from the image gradients and the motion parameters (optical flow and warp parameter). The event camera model is first presented to elucidate the principle behind the creation of the observed brightness-increment image. The camera behaves asynchronously, producing output event streams formed as the discrete responses of individual pixels to logarithmic changes in brightness; each event contains only spatial, temporal, and polarity information. The event \(\{{\varvec{u}},pol,t\}\) comprises the 2D position \({\varvec{u}}=\{x,y\}\) of the pixel, the polarity \({ pol } \in \{+1,-1\}\) indicating the sign of the brightness change, and the timestamp t denoting the occurrence of the event. An event occurs when the logarithmic change in brightness at the pixel location between time t and \(t-\Delta t\) exceeds the threshold \(\pm C\) \((C>0)\). The equation \({\varvec{q}}({\varvec{u}}, t)= \log ({\varvec{I}}({\varvec{u}},t))\) represents the logarithmic brightness image. The logarithmic brightness-increment image \(\Delta {\varvec{q}}({\varvec{u}}, t)\) can then be expressed as:

$$\begin{aligned} \Delta {\varvec{q}}({\varvec{u}}, t)={\varvec{q}}({\varvec{u}}, t)-{\varvec{q}}({\varvec{u}}, t-\Delta t)=pol * C \end{aligned}$$
(1)

where \(t-\Delta t\) represents the timestamp of the previous event that occurred at the same pixel. This stage exploits the generative model of event data to approximate the inverse of the event generation process. The event polarities at each pixel in the patch \({\textbf{P}}\) are accumulated until the number of events reaches the adaptive threshold \(N_e\) within a time interval \(\Delta \tau .\) The sum \(\Delta {\varvec{q}}({\varvec{u}}, t),\) as depicted in Eq. (2), is commonly referred to as the observed brightness-increment image. It represents the cumulative total of all events whose positions lie within the patch throughout the time interval \(\Delta \tau .\)

$$\begin{aligned} \Delta {\varvec{q}}({\varvec{u}}, t)=\int _t^{t+\Delta \tau } C * f({\varvec{u}}, t) dt \quad {\varvec{u}} \in {\textbf{P}} \end{aligned}$$
(2)

where \(f({\varvec{u}}, t)\) represents the polarity of the event occurring at pixel position \({\varvec{u}}\) and timestamp t. It should be noted that the polarity of an event is either \(-1\) or \(+1,\) and its occurrence is determined by the threshold value C. The accumulated brightness-increment image is subsequently normalized. The adaptive threshold is initially set to a fixed value (e.g., 100) based on empirical knowledge. Subsequently, it is recalculated during optimization using the image gradients \(\nabla {\varvec{q}}({\varvec{u}}, t)\) and optical flow \({\varvec{v}}\):

$$\begin{aligned} N_e \approx \int _{{\varvec{u}} \in {\textbf{P}}}\left| \nabla {\varvec{q}}({\varvec{u}}, t) * \frac{{\varvec{v}}}{\Vert {\varvec{v}}\Vert }\right| d {\varvec{u}}. \end{aligned}$$
(3)
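To illustrate Eq. (2), the sketch below accumulates the polarities of events falling inside a patch until \(N_e\) events have been collected and then L2-normalizes the result. The Event struct and function name are assumed for illustration, and the adaptive recomputation of \(N_e\) via Eq. (3) is omitted.

```cpp
// Hedged sketch of the observed brightness-increment image (Eq. (2)):
// accumulate event polarities inside the patch until N_e events are reached,
// then L2-normalize. The Event struct and names are illustrative.
#include <opencv2/core.hpp>
#include <deque>

struct Event {
  float x, y;   // pixel coordinates u
  int pol;      // polarity in {-1, +1}
  double t;     // timestamp
};

cv::Mat observedIncrementImage(const std::deque<Event>& events,
                               const cv::Point2f& patchCenter,
                               int patchSize, int Ne) {
  cv::Mat dq = cv::Mat::zeros(patchSize, patchSize, CV_32F);
  const float half = patchSize / 2.0f;
  int accumulated = 0;
  for (const Event& e : events) {
    // Coordinates relative to the top-left corner of the patch.
    const float u = e.x - (patchCenter.x - half);
    const float v = e.y - (patchCenter.y - half);
    if (u < 0.0f || v < 0.0f || u >= patchSize || v >= patchSize) continue;
    dq.at<float>(static_cast<int>(v), static_cast<int>(u)) +=
        static_cast<float>(e.pol);
    if (++accumulated >= Ne) break;  // stop once N_e events fall in the patch
  }
  const double norm = cv::norm(dq, cv::NORM_L2);
  if (norm > 0.0) dq /= norm;        // normalization used later in Eq. (9)
  return dq;
}
```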

The observed brightness-increment image is generated directly from the event streams and can therefore be regarded as the measured value. When generating the predicted brightness-increment image, we first consider only the scenario in which the optical flow is unknown. It is assumed that the gradients and the logarithmic brightness image within the patch remain constant over the time interval \(\Delta \tau ,\) where \(\Delta \tau \) represents a brief period preceding t. Under the brightness constancy assumption, the derivative of \({\varvec{q}}({\varvec{u}}, t)\) satisfies:

$$\begin{aligned} \frac{\partial {\varvec{q}}}{\partial t}+\nabla {\varvec{q}}({\varvec{u}}, {\textrm{t}}) * {\varvec{v}}=0 \end{aligned}$$
(4)

where \(\nabla {\varvec{q}}({\varvec{u}}, {\textrm{t}})\) represents the brightness gradients at pixel position \({\varvec{u}},\) while \({\varvec{v}}=d{\varvec{u}}/dt\) denotes the optical flow. The equation can be approximated using Taylor’s series expansion as follows:

$$\begin{aligned} \Delta {\varvec{q}}({\varvec{u}}, t) \approx {\varvec{q}}({\varvec{u}}, t)-{\varvec{q}}({\varvec{u}}, t-\Delta \tau )=\frac{\partial {\varvec{q}}}{\partial t} \Delta \tau . \end{aligned}$$
(5)

Substituting Eq. (4) into the Taylor approximation of Eq. (5) yields:

$$\begin{aligned} \Delta \widehat{{\varvec{q}}}({\varvec{u}}, {\varvec{v}}, t)=-\nabla {\varvec{q}}({\varvec{u}}, t) * {\varvec{v}} * \Delta \tau \end{aligned}$$
(6)

where \(\Delta \widehat{{\varvec{q}}}({\varvec{u}}, {\varvec{v}}, t)\) is called predicted brightness-increment image, as it is calculated by image gradients and optical flow, with the optical flow being an unknown variable. The gradients undergo variations during this period in accordance with the warp parameter \({\varvec{p}}\):

$$\begin{aligned} {\varvec{W}}({\varvec{u}}, {\varvec{p}})={\varvec{R}}({\varvec{p}}) * {\varvec{u}}+{\varvec{T}}({\varvec{p}}) \end{aligned}$$
(7)

where \(({\varvec{R}}, {\varvec{T}}) \in SE(2)\) represents the rotation and translation. Similarly, \({\varvec{p}}\in se(2)\) denotes the corresponding Lie algebra. The predicted brightness-increment image resulting from an affine transformation of the gradients can be expressed as:

$$\begin{aligned} \Delta \widehat{{\varvec{q}}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}), {\varvec{v}}, {\varvec{t}})=-\nabla {\varvec{q}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}), {\varvec{t}}) * {\varvec{v}} * \Delta \tau . \end{aligned}$$
(8)
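A hedged sketch of Eq. (8) is given below: the patch gradients are warped by the SE(2) parameters \({\varvec{p}}\) and projected onto the optical flow \({\varvec{v}}\) over the interval \(\Delta \tau .\) The gradients are assumed to be stored as single-precision OpenCV matrices, and all names are illustrative.

```cpp
// Hedged sketch of Eq. (8): warp the patch gradients by the SE(2) parameters
// p = (theta, tx, ty), then combine them with the optical flow v over dtau.
// Gradients are assumed to be CV_32F single-channel patch images.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <cmath>

cv::Mat predictedIncrementImage(const cv::Mat& gradX, const cv::Mat& gradY,
                                double theta, double tx, double ty,  // warp p
                                double vx, double vy,                // flow v
                                double dtau) {
  // 2x3 affine matrix implementing W(u, p) = R(p) * u + T(p).
  const double c = std::cos(theta), s = std::sin(theta);
  cv::Mat W = (cv::Mat_<double>(2, 3) << c, -s, tx,
                                         s,  c, ty);
  cv::Mat gx, gy;
  cv::warpAffine(gradX, gx, W, gradX.size(), cv::INTER_LINEAR);
  cv::warpAffine(gradY, gy, W, gradY.size(), cv::INTER_LINEAR);

  // Delta q_hat(W(u, p), v, t) = -(grad q(W(u, p), t) . v) * dtau, pixel-wise.
  cv::Mat dqHat = -(gx * vx + gy * vy) * dtau;
  return dqHat;
}
```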

The motion parameters (optical flow \({\varvec{v}}\) and warp parameter \({\varvec{p}}\)) and the gradients are used to generate the predicted brightness-increment image. Because the motion parameters are unknown, this image is interpreted as the estimated value. The optimal motion parameters are determined by minimizing the difference between the observed and predicted brightness-increment images. Therefore, the cost function Q is defined as the normalized difference between the predicted and observed values:

$$\begin{aligned} Q = \min _{{\varvec{p}}, {\varvec{v}}} \left\| \frac{\Delta \widehat{{\varvec{q}}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}), {\varvec{v}}, t)}{\Vert \Delta \widehat{{\varvec{q}}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}), {\varvec{v}}, t)\Vert _{L^2({\textbf{P}})}} - \frac{\Delta {\varvec{q}}({\varvec{u}}, t)}{\Vert \Delta {\varvec{q}}({\varvec{u}}, t)\Vert _{L^2({\textbf{P}})}} \right\| _{L^2({\textbf{P}})}^2. \end{aligned}$$
(9)

A non-linear least-squares algorithm is employed to minimize the cost function and estimate the motion parameters. Subsequently, the motion parameters are used to revise the positions of the patches:

$$\begin{aligned} {\textbf{u}}^{\prime }={\varvec{R}}({\varvec{p}})^{-1} * {\varvec{u}}-{\varvec{R}}({\varvec{p}})^{-1} * {\varvec{T}}({\varvec{p}}). \end{aligned}$$
(10)
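After the non-linear least-squares solver converges, the position update of Eq. (10) amounts to applying the inverse SE(2) warp to the patch position. A small sketch follows, under an assumed parameterization of the warp as a rotation angle and a translation; names are illustrative.

```cpp
// Sketch of Eq. (10): apply the inverse SE(2) warp to the patch position once
// the warp parameters p = (theta, tx, ty) have been optimized.
#include <opencv2/core.hpp>
#include <cmath>

cv::Point2f updatePatchPosition(const cv::Point2f& u,
                                double theta, double tx, double ty) {
  // u' = R(p)^{-1} * u - R(p)^{-1} * T(p) = R(p)^T * (u - T(p)).
  const double c = std::cos(theta), s = std::sin(theta);
  const double dx = u.x - tx, dy = u.y - ty;
  return cv::Point2f(static_cast<float>( c * dx + s * dy),
                     static_cast<float>(-s * dx + c * dy));
}
```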

In the EKLT approach, the effectiveness of the motion parameters is contingent upon the quality of the optimization. Hence, we present a method for assessing the efficacy of the optimization in the context of lost-condition checking. Assuming the optimization result is valid, the patch position is updated by the warp parameter \({\varvec{p}}.\) The optimization outcome is evaluated through the final value of the cost function, \(Q_{last}.\) With the optimized parameter values denoted as \(p_{last}\) and \(v_{last},\) the cost function value at the end of the optimization process is:

$$\begin{aligned} Q_{\text {last}}= \left\| \frac{\Delta \widehat{{\varvec{q}}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}_{\text {last}}), {\varvec{v}}_{\text {last}}, t)}{\Vert \Delta \widehat{{\varvec{q}}}({\varvec{W}}({\varvec{u}}, {\varvec{p}}_{\text {last}}), {\varvec{v}}_{\text {last}}, t)\Vert _{L^2({\textbf{P}})}} - \frac{\Delta {\varvec{q}}({\varvec{u}}, t)}{\Vert \Delta {\varvec{q}}({\varvec{u}}, t)\Vert _{L^2({\textbf{P}})}} \right\| _{L^2({\textbf{P}})}^2. \end{aligned}$$
(11)

The average value of the final cost function over the last n optimizations is then computed as:

$$\begin{aligned} Q_{\text {average}}=\frac{\sum _{i=1}^n Q_{\text {last}}^i}{n}. \end{aligned}$$
(12)

The quality of the optimization result is assessed by comparing the value of \(Q_{\text {average}}\) to a preset threshold parameter \(c_{\text {threshold}}.\) If \(Q_{\text {average}}\) surpasses \(c_{\text {threshold}}\) during optimization, the patch state is set to “lost,” indicating that optimization has failed. Otherwise, the patch positions are updated according to Eq. (10). This approach ensures that tracks with poor tracking quality are identified and eliminated early in the optimization process.
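One possible realization of this lost-condition check keeps a sliding window of the n most recent final cost values and compares their average with \(c_{\text {threshold}},\) as sketched below; the class and member names are illustrative assumptions.

```cpp
// Hedged sketch of the quality assessment (Eqs. (11)-(12)): keep the last n
// final cost values Q_last and flag the patch as "lost" when their average
// exceeds c_threshold. Class and member names are illustrative.
#include <cstddef>
#include <deque>
#include <numeric>

class TrackQualityMonitor {
 public:
  TrackQualityMonitor(std::size_t windowSize, double cThreshold)
      : windowSize_(windowSize), cThreshold_(cThreshold) {}

  // Returns true if the track should be marked as "lost".
  bool update(double qLast) {
    history_.push_back(qLast);
    if (history_.size() > windowSize_) history_.pop_front();
    const double qAverage =
        std::accumulate(history_.begin(), history_.end(), 0.0) /
        static_cast<double>(history_.size());
    return qAverage > cThreshold_;  // Q_average vs. c_threshold, Eq. (12)
  }

 private:
  std::size_t windowSize_;
  double cThreshold_;
  std::deque<double> history_;
};
```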

Patch association

This module employs the nearest neighbor (NN) method to solve the patch association problem. The module receives initial features and patches that are tracked by events as its inputs. AssociateFeatures is used to store the feature points that are associated with the existing patches. The parameter \(d_{\text {threshold}}\) is a pre-determined constant value. PatchSize represents the pixel size of the feature block.

Firstly, iterate through all existing patches, find the closest initial feature based on Euclidean distance, and ensure that its distance is lower than the threshold. Secondly, add the features that meet the above conditions to AssociateFeatures. Finally, new patches are extracted centered on the features not added to AssociateFeatures. The details of the whole process are shown in Algorithm 1.

Algorithm 1

Patch Association
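For readers who prefer code, the following C++ rendering mirrors the association procedure of Algorithm 1. The TrackedPatch structure and the function signature are assumptions, and the extraction of pixel data for newly spawned patches is omitted for brevity.

```cpp
// Illustrative rendering of Algorithm 1 (nearest-neighbor patch association).
// TrackedPatch and the function signature are assumptions.
#include <opencv2/core.hpp>
#include <cmath>
#include <cstddef>
#include <vector>

struct TrackedPatch {
  cv::Point2f center;  // current feature position of an existing patch
};

void associatePatches(const std::vector<cv::Point2f>& initialFeatures,
                      std::vector<TrackedPatch>& patches,
                      float dThreshold) {
  std::vector<bool> associated(initialFeatures.size(), false);

  // 1) For every existing patch, find the closest initial feature and accept
  //    it only if the Euclidean distance is below d_threshold.
  for (const TrackedPatch& patch : patches) {
    int best = -1;
    float bestDist = dThreshold;
    for (std::size_t i = 0; i < initialFeatures.size(); ++i) {
      const float dx = patch.center.x - initialFeatures[i].x;
      const float dy = patch.center.y - initialFeatures[i].y;
      const float d = std::hypot(dx, dy);
      if (d < bestDist) {
        bestDist = d;
        best = static_cast<int>(i);
      }
    }
    // 2) Features meeting the condition are added to AssociateFeatures.
    if (best >= 0) associated[static_cast<std::size_t>(best)] = true;
  }

  // 3) New patches are extracted centered on the features that were not
  //    added to AssociateFeatures.
  for (std::size_t i = 0; i < initialFeatures.size(); ++i) {
    if (!associated[i]) patches.push_back(TrackedPatch{initialFeatures[i]});
  }
}
```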

To start the algorithm, the following parameters and initial conditions are specified (a possible grouping into a configuration structure is sketched after this list):

Feature detection parameters: For the FAST-based feature patch initialization, parameters such as the threshold value for corner detection need to be defined. This value determines the sensitivity of the algorithm in detecting corner points in the frame images.

Patch size: The size of the patch around the detected features is an essential initial condition. It defines the amount of context considered for each feature point and has a direct impact on the tracking process.

Distance threshold \((d_{\text {threshold}})\): This value is used during the patch association module to determine the maximum allowable distance for feature point matching. It is crucial for maintaining the continuity of feature tracks across frames.

Quality assessment threshold \((c_{\text {threshold}})\): For the patch optimization module, a threshold value is needed to assess the quality of the optimization process. This value helps in deciding whether to maintain or discard a feature track based on the optimization quality.

Initial feature points: The algorithm requires an initial set of feature points detected in the first frame. These points serve as the starting point for the tracking process.
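These initial conditions can be grouped into a single configuration structure, as in the sketch below. The patch size and the two thresholds echo the public-dataset settings reported in “Result”, while the FAST threshold and the averaging window n are assumed placeholder values.

```cpp
// Possible grouping of the initial conditions into one configuration struct.
// patchSize, dThreshold, and cThreshold follow the public-dataset settings
// reported in the Results; fastThreshold and qualityWindow are assumed.
struct TrackerConfig {
  int fastThreshold = 40;    // FAST corner detection sensitivity (assumed)
  int patchSize = 25;        // side length of the feature patch in pixels
  double dThreshold = 1.5;   // max association distance d_threshold (pixels)
  double cThreshold = 0.5;   // quality-assessment threshold c_threshold
  int qualityWindow = 5;     // n in Eq. (12), number of averaged costs (assumed)
};
```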

Result

In this section, we present a comprehensive analysis of the experimental data, focusing on the average tracking error and feature age for each dataset. Average tracking error serves as a key metric in evaluating the performance of feature tracking methods. A lower average tracking error indicates higher accuracy in tracking features. Feature age provides valuable insights into the robustness and stability of feature tracking methods. A higher feature age signifies that features remain tracked for longer durations without losing accuracy. The data is compared across different methods, including EKLT, the combination of individual modules, and the full proposed method. This analysis aims to assess the effectiveness of the proposed enhancements in asynchronous feature tracking.

Tracking performance on all datasets

The public event camera datasets generated with a DAVIS240C, which has a spatial resolution of \(240 \times 180\) pixels, are adopted to perform the comparison. The dataset comprises event streams and intensity images. We selected several scenes with different complex textures, including “shapes_6dof”, “shapes_rotation”, “boxes_6dof”, “boxes_rotation”, “poster_6dof”, and “poster_rotation” from the Event Camera Dataset and Simulator [31,32,33], and “pipe_2” and “bicycles” from the paper [28]. We recorded two datasets, “computer_rotation” and “keyboard_translation”, using the Celex5 camera. This camera has a spatial resolution of \(1280 \times 800\) pixels. The datasets “shapes_rotation” and “shapes_6dof” are simple black and white scenes in which the camera captures multiple geometric patterns attached to a wall; “boxes_6dof”, “boxes_rotation”, “poster_6dof”, and “poster_rotation” are highly textured scenes in which the camera captures multiple boxes with different textures and highly textured posters; “pipe_2” and “bicycles” are natural scenes in which the camera captures outdoor pipes and outdoor parked bicycles; “computer_rotation” and “keyboard_translation” are low-texture scenes in which the camera captures a computer and a keyboard. The dataset name suffixes, such as “rotation”, “translation”, and “6dof”, correspond to the camera’s mode of motion. To provide visual context, we display frames from all datasets in Fig. 3.

Fig. 3

Description of each dataset. a is a simple black and white scene; b and c are high-texture scenes; d and e are natural scenes; f and g are low-texture scenes

We conducted our analysis using the open-source feature tracking evaluation package, “rpg_feature_analysis,” available on GitHub, which is also employed for evaluating the EKLT method. This package provides ground truth information, established through the Kanade–Lucas–Tomasi (KLT) tracking method on DAVIS image frames. The EKLT method relies on local brightness invariance and motion continuity assumptions for tracking corner points across frames.

As event-based tracking results have higher temporal resolution, we evaluated errors by comparing each ground truth sample with the estimated position, determined through linear interpolation of the two tracked positions closest in time (a small interpolation sketch follows this paragraph). Our approach to parameter selection was based on both theoretical considerations and empirical testing. We started by reviewing the current literature on event-based vision systems and feature tracking algorithms to find a good starting point for our parameter values. This initial selection was guided by common practices and recommendations found in the literature. Following that, we ran a number of early tests on public and self-collected datasets to fine-tune the parameters. We changed each parameter individually and in conjunction with others to see how they affected tracking performance measures such as average tracking error and feature age. Using this iterative procedure, we determined the parameter values that best balance tracking precision and computational efficiency. In experiments using publicly available datasets, the patch size was set to \(25 \times 25\) pixels, and the threshold values for \(d_{\text {threshold}}\) and \(c_{\text {threshold}}\) were set to 1.5 and 0.5, respectively. In experiments using self-collected datasets, the patch size was increased to \(40 \times 40\) pixels, and the threshold values for \(d_{\text {threshold}}\) and \(c_{\text {threshold}}\) were set to 3.0 and 1.5, respectively.
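The temporal alignment used in this evaluation can be illustrated with the small helper below, which linearly interpolates the tracked position between the two estimates closest in time to a ground-truth sample; the types and names are illustrative.

```cpp
// Sketch of the evaluation step: the tracked position at a ground-truth
// timestamp tQuery is linearly interpolated between the two temporally
// closest estimates (t0, p0) and (t1, p1). Names are illustrative.
#include <opencv2/core.hpp>

cv::Point2f interpolateAt(double tQuery,
                          double t0, const cv::Point2f& p0,
                          double t1, const cv::Point2f& p1) {
  const double alpha = (t1 > t0) ? (tQuery - t0) / (t1 - t0) : 0.0;
  return p0 + static_cast<float>(alpha) * (p1 - p0);
}
// The tracking error for that sample is the Euclidean distance between the
// interpolated position and the ground-truth position.
```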

Figure 4 depicts the visualization process of our asynchronous feature tracking. It displays the projection of features onto the current frame, with the magenta arrow (marked by an ‘x’ at the end) indicating the direction of feature tracking in the image plane, along with the flow angle estimation.

Fig. 4

Visualization of asynchronous event feature tracking. The trajectory of feature tracking is plotted in carmine in this figure

To validate the effectiveness of the three modules proposed in our approach, we designed ablation experiments. The feature block initialization method based on FAST is referred to as A, the method for evaluating optimization quality is referred to as B, and the nearest neighbor feature block association method is referred to as C. The plus sign \((+)\) is used to represent combinations of different methods.

The results are listed in Tables 1 and 2, respectively presenting the average tracking error and feature age obtained from testing on ten datasets under various method combinations, along with a comparison to the EKLT method.

In Table 1, the EKLT method achieves sub-pixel level average tracking error across all datasets, but its accuracy is affected by track quality and sensor noise, exhibiting tracking errors above 0.6 pixels in some cases. The addition of individual modules (A, B, C) to the EKLT consistently results in a reduction of tracking error. However, the combination of all three modules in Ours (A + B + C) yields the most significant reduction in tracking error, achieving the lowest values across all datasets. It’s worth noting that in the “boxes_rotation” dataset, our proposed method achieves significant improvements, reducing the average tracking error from 0.89 to 0.63 pixels, which is a reduction of 29.2% compared to the EKLT. Across all datasets, our method consistently reduces the average tracking error by at least 1.3%.

In Table 2, the EKLT method exhibits varying feature ages across datasets, performing well in some scenarios but struggling in others. The addition of individual modules consistently increases feature age, indicating more stable tracking. However, the most significant improvements are observed when all three modules are combined. This combination leads to substantial feature age increases across all datasets, demonstrating the efficacy of the proposed method. Notably, in the “pipe_2” dataset, our method achieves substantial improvements, increasing the feature age from 0.78 to 1.03 s, which represents a growth of 32.1% compared to the EKLT. Across all datasets, our method consistently increased the feature age by at least 9.6%. It also showed a more pronounced increase in feature age for datasets with simple black-and-white scenes or high levels of texture complexity.

The results indicate that our enhancements to patch initialization, optimization, and association contribute to the elimination of low-quality tracks, thereby reducing tracking error and enhancing feature age. The bar charts in Figs. 5 and 6 clearly illustrate our experimental findings compared to the EKLT across all ten datasets, with the horizontal coordinates corresponding to the dataset names in the tables. They illustrate that the combination of enhanced modules in our method greatly improves the effectiveness and stability of feature tracking across a range of motion scenarios.

Table 1 Average tracking error
Table 2 Feature age

Computational performance on all the datasets

All experiments were conducted on a Linux system using C++ programming language, leveraging open-source libraries such as OpenCV, Eigen, Sophus, and Ceres. The computations ran on a computer equipped with an AMD Ryzen 7 4700G processor, clocked at 3.60 GHz, and 16 GB of RAM.

Our method employed the FAST corner detection algorithm for the initialization module, which outperformed the Harris corner detection used in the EKLT in terms of speed. While the other two modules involved additional computations, both methods met real-time requirements for feature detection and tracking, ensuring no compromise on speed.

The number of corner points processed per second serves as an indicator of a method’s efficiency in handling corner points during feature detection and tracking. To demonstrate the efficacy of our patch association module improvement, we conducted a comparison of the number of maintained features per second between the EKLT and our method. In all ten datasets, we achieved faster feature processing with our method, as seen in Table 3 and Fig. 7. The relative percentage indicates the improvement in feature processing speed compared to the EKLT. The improvement ranges from 1.2 to 7.6%, demonstrating that our method is more efficient in handling feature points during feature detection and tracking. This enhanced computational performance is a valuable attribute, as it ensures that our feature tracking method can process a larger number of features in a given amount of time, making it well suited for real-time applications and scenarios where quick and efficient processing is essential.

Discussion

Our research introduces an innovative approach to asynchronous feature tracking in event cameras by integrating event and frame streams. The key contributions of our method are highlighted through the following discussion:

Enhanced hybrid data-driven modules: The employment of the FAST corner detection algorithm in our patch initialization module has proven to be a cornerstone for rapid and robust feature detection. By swiftly responding to local pixel brightness changes, our method ensures a strong foundation for the subsequent tracking process. The machine learning-based enhancement and non-maximal suppression further refine the selection of corner points, leading to more accurate feature patch initialization compared to traditional methods.

Our patch association module, which utilizes the nearest neighbor algorithm, demonstrates significant improvements in establishing effective connections between newly detected features and existing tracks. This strategic association mechanism is pivotal for maintaining the continuity of feature tracking and contributes substantially to the overall robustness and stability of our method.

The integration of a quality assessment within the patch optimization module is a key innovation of our approach. By proactively identifying and eliminating tracks with sub-optimal tracking quality, our method enhances the accuracy and reliability of the tracking process. The use of a cost function to evaluate optimization quality allows for the early removal of poor tracks, thereby improving the overall tracking performance.

Fig. 5

Comparison results of average tracking error on all the datasets

Fig. 6

Comparison results of feature age on all the datasets

Performance across diverse datasets: One of the most compelling aspects of our method is its consistent performance across a range of public and self-collected datasets. Our method’s ability to reduce average tracking error and increase feature age, as demonstrated in the results, underscores its effectiveness in handling diverse motion scenarios and environmental conditions. This adaptability demonstrates the generalizability of our method and highlights its potential for practical applications.

Table 3 Computational performance
Fig. 7

Comparison results of computational performance on all the datasets

Computational efficiency: The computational performance of our method is noteworthy, with improvements in feature processing speed over the EKLT method. This efficiency is crucial for real-time applications where quick and accurate processing is essential. Our method’s balance between accuracy and speed positions it as a strong candidate for hybrid data-driven visual odometry and other computer vision tasks based on multiple sensors.

Limitations and future work: While our method has shown promising results, there are areas where further improvements can be made. For instance, our method may encounter challenges in highly dynamic scenes with rapid motion and significant changes in lighting conditions. In addition, the computational complexity of the quality assessment in patch optimization could be reduced for even faster processing speeds. The generalization of our method to other types of event cameras and broader scenarios also requires further investigation. In the future, we plan to introduce adaptive mechanisms to handle dynamic scenes more effectively. This involves altering parameters in real time based on the current scene conditions, as well as using more powerful models capable of handling rapid motion and varied illumination. We will explore optimization strategies, including algorithmic improvements and parallel processing, to reduce the computational overhead of our method. This will make it more suitable for real-time applications and devices with limited computational power. To improve generalization, we will test our method on a wider variety of datasets, including those with different types of motion, lighting conditions, and scene complexity, and assess the uncertainty bounds of our method in future work.

Conclusion

In conclusion, this study has presented a comprehensive evaluation of an enhanced asynchronous feature tracking method, integrating FAST-based feature patch initialization, quality-driven patch optimization, and Euclidean distance-based patch association. Through rigorous experimentation on diverse datasets, we have demonstrated the superior performance of our method in comparison to the established EKLT method. The significant reduction in average tracking error and concurrent increase in feature age underscore the effectiveness of our approach in achieving more accurate and stable feature tracking. The ablation experiments have validated the individual contributions of each module, emphasizing their synergistic effect in enhancing tracking capabilities. Our method’s consistent outperformance across varying complexities and motion scenarios attests to its robustness and adaptability. Furthermore, the evaluation of computational performance has affirmed the real-time feasibility of our method, with notable efficiency gains in feature processing speed compared to the EKLT. These results position our method as a promising solution for applications demanding efficient and accurate feature tracking, particularly in the context of a fully event- and frame-driven visual odometry pipeline.

Overall, this research introduces a novel and effective method for asynchronous feature tracking, contributing valuable insights to the broader field of computer vision. The demonstrated improvements in tracking accuracy, stability, and computational efficiency pave the way for the development of state-of-the-art fully event-driven visual odometry pipelines and other computer vision tasks that require processing the asynchronous discrete outputs of event cameras.