Introduction

Multi-object tracking (MOT) entails analyzing video footage to detect and track one or more targets. To achieve this, the targets of interest must be detected in each frame of the video, identical targets must be correctly associated across successive frames, and newly appearing or disappearing targets must be handled accurately.

In scenarios like video surveillance [27] and autonomous driving [16], multi-object tracking algorithms are frequently employed to detect and track pedestrian targets, aiding in the comprehension and analysis of their movement trajectories. This tracking capability facilitates early warnings of abnormal pedestrian behavior or contributes to effective vehicle control. However, pedestrian targets are subject to various factors, including changes in the external environment, pedestrian posture variations, and object occlusion [21]. Consequently, maintaining a consistent ID for a specific pedestrian target throughout the tracking process becomes challenging, leading to a degradation in tracking effectiveness, as illustrated in Fig. 1.

Fig. 1

Illustration of the tracking results of FairMOT [39] and our SCGTracker. Two pedestrian objects with \({ID}_{1}\) and \({ID}_{2}\) (shown in red and yellow bounding boxes, respectively) move in opposite directions in frame 1, are partially occluded in frame 2, and separate again in frame 3. In (a), FairMOT [39] treats the original \({ID}_{2}\) object (yellow bounding box) as a newly emerging object in frame 3 and assigns it a new ID, whereas in (b) our proposed SCGTracker maintains the original ID for \({ID}_{2}\) in frame 3. The same observation holds for the subsequent frames

To address the challenge of frequent switching of pedestrian target IDs, multi-object tracking (MOT) algorithms predominantly focus on enhancing the appearance feature representation of pedestrian targets. The prevalent approach [2, 28, 31, 33] trains convolutional neural network (CNN) models using both historical frames and the current frame as inputs, so that the CNN learns associations between historical frames and the current frame and exploits these associations to improve the feature representation of the current frame. However, a notable drawback is that features of pedestrian objects are typically extracted independently of each other, with minimal consideration given to the interactions between objects. Consequently, several studies have tackled the multi-object tracking task as a spatio-temporal graph modeling problem. For example, graph neural networks (GNNs) [41] are employed to capture the interrelationships and contextual information between objects. Studies [13, 36] use a graph representation for potential connections between trajectories and detection results. TransMOT [4] establishes connections between trajectories in both the temporal and spatial domains using transformer encoders, treating the connections of tracked objects as sparse graphs. However, the majority of existing graph-based MOT algorithms fall short in addressing the interrelationships among targets within the same frame and do not consider object occlusion scenarios, where the extracted features may be compromised, potentially leading to association errors and error propagation over consecutive frames.

In response to these challenges, we present a joint object detection and tracking method that incorporates a Self-Cross attention Graph to improve feature representation for better multi-object tracking, which we term SCGTracker. SCGTracker seamlessly integrates object detection and tracking, leveraging the intrinsic characteristics of moving objects: low, nearly constant velocity and predictable spatial relationships between neighboring objects. SCGTracker is built on the highly efficient joint detection and embedding (JDE [31]) framework. Given the JDE-detected objects and their corresponding feature embeddings, we model the relationships between individual objects in both the spatial and temporal domains by building a spatial–temporal object graph over two consecutive frames of a video stream. To reduce the number of ID switches, caused mainly by tracking occluded objects, we apply a self- and cross-attention mechanism to the spatial–temporal object graph. Specifically, the self-attention aggregates, for each object, the information of all neighboring objects in a frame, while the cross-attention maps objects with similar contexts in consecutive frames into a shared space by aggregating relevant information across frames. Through message passing, the self-cross attention enhances each object's features by considering both the spatial relationships between objects within a frame and the temporal correspondence between objects across frames. SCGTracker is an efficient online MOT method that optimizes the association of targets in consecutive frames. Extensive experiments show that it obtains the best performance in terms of both tracking accuracy and the number of ID switches.

The contributions of this study are threefold.

  1.

    The SCGTracker, our proposed solution, is an end-to-end framework designed for seamless integration of pedestrian target detection and tracking, utilizing graph neural networks. Through this innovative approach, we aim to improve the features associated with pedestrian targets and achieve a globally optimized solution for both detection and tracking tasks.

  2.

    We examine the interrelationships among targets within the same frame by constructing an object graph in the spatial dimension for that frame. Additionally, exploiting the small target displacement between consecutive frames, we model the targets in successive frames to create an object graph spanning different frames in the temporal dimension.

  3.

    We incorporate a graph neural network, specifically a Self-Cross Attention Graph, to reduce missed tracking of occluded targets. This is accomplished by spatially aggregating target context information within the same frame through the self-attention mechanism, temporally aggregating target information across consecutive frames using the cross-attention mechanism, and updating target features through message passing to derive highly discriminative pedestrian appearance features.

Related work

The tracking-by-detection (TBD)-based MOT algorithms

Numerous Multi-Object Tracking (MOT) algorithms adopt the tracking-by-detection framework, which entails dividing the multi-target tracking task into object detection and trajectory association. A robust detector, such as Faster R-CNN [25], CenterNet [6], or YOLOv5 [43], is crucial for predicting the object's bounding box. These bounding boxes are then linked through data association to establish trajectories. To accomplish this, Bewley et al. [2] initially proposed using Kalman filtering [32] to predict the position of bounding boxes from the previous frame in the current frame. They then utilized the Hungarian algorithm [20] and IOU distance to match these predicted positions with the bounding boxes in the current frame for trajectory association. Subsequently, Wojke et al. [33] introduced the Re-ID network to extract appearance features of the bounding boxes, resulting in improved performance. However, this method demands substantial computational resources due to the necessity for additional Re-ID networks.
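As a concrete illustration of this association step, the following minimal sketch (Python, with NumPy and SciPy assumed available) matches Kalman-predicted boxes to current detections using an IoU cost and the Hungarian algorithm; the box format and gating threshold are illustrative assumptions, not the exact settings of [2].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def associate(predicted_boxes, detected_boxes, iou_threshold=0.3):
    """Hungarian matching on an IoU cost; keeps pairs above the threshold."""
    cost = np.array([[1.0 - iou(p, d) for d in detected_boxes]
                     for p in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= iou_threshold]
```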

The joint detection and tracking (JDE)-based MOT algorithms

The Joint Detection and Embedding (JDE)-based MOT algorithm integrates target detection and re-identification (Re-ID) tasks within a single network. Zhou et al. [41] suggested predicting the offset of object centroids between consecutive frames and utilized it for data association. Wang et al. [31] enhanced the original detection task by incorporating the Re-ID task, achieved by modifying the predictor head of the detector; computational efficiency was further optimized through feature sharing and multi-task learning. Zhang et al. [39] proposed an architecture based on anchor-free object detection, employing different feature maps for the detection and Re-ID tasks to alleviate competition between them. Despite these advancements within the JDE paradigm, there remains potential for further improvement in the accuracy of trackers.

The graph neural network-based MOT algorithms

LGM [8] transforms the target association problem into a graph matching problem by modeling a graph based on relationships between trajectories and detections, relaxing the undirected graph matching into a continuous quadratic programming problem. TrackMPNN [24] introduces a framework based on dynamic undirected graphs, leveraging message passing graph neural networks (GNNs) [41] to generate association likelihoods for each target. Reference [13] constructs an undirected graph between trackers and detections, using target appearance features as node features and pose features as edge features; node features are updated according to node similarity and the aggregated, updated edge features. TransMOT [4] establishes trajectory links by constructing encoders in both the temporal and spatial domains, treating tracked targets as a sparse weighted graph, and its decoder predicts the correspondence between the encoder outputs and the graph representation of the current frame. However, this structure requires substantial computational resources.

Many existing algorithms in this domain often neglect the interdependencies among targets within the same frame, resulting in a diminished correlation between consecutive frames. Moreover, they frequently overlook the impact of occlusion, where the features of a detected target are influenced by unfavorable factors. As a result, the interaction between the detected target and the trajectory target through the graph neural network [41] may inadvertently compromise the initially favorable features of the trajectory target.

Methodology

In naturally captured videos, we assume that the relationships between multiple moving objects are invariant over a short time period: even if an object is temporarily occluded by obstacles, the relative relationship between this object and the others is maintained. Hence, besides the appearance feature of an individual object, the relative relationships between objects in frame t and the correspondence relationships across consecutive frames are also important cues for object association in MOT (a toy illustration of this assumption follows this paragraph). Motivated by this assumption, we propose in this paper a joint object detection and tracking method based on the JDE [31] framework, which we term SCGTracker. As shown in Fig. 2, SCGTracker takes two consecutive frames as inputs, and CNN-based joint object detection and feature embedding is applied to both frames. A spatial–temporal object graph is then built by taking the feature embeddings as node descriptions, the relative positions within a frame as spatial edges, and the object correspondences between two consecutive frames as temporal edges. To improve the feature representation for re-identification of temporarily occluded objects, self-cross attention is applied to the spatial–temporal object graph. During training of the self-cross attention-based spatial–temporal object graph, the message passing process transfers the correlation information of neighboring objects to a given object, and the aggregation step collects all the context information to update that object's feature description. As a result, in addition to the appearance feature, the relative position and the temporal correlation are all taken into consideration for target matching, leading to better tracking accuracy and fewer ID switches than MOT methods that match on object appearance features alone. This paper provides some definitions as follows (see Table 1):
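The following toy example (Python/NumPy; the coordinates and drift are made up for illustration) shows the intuition: when all objects share a common motion over one frame step, their absolute positions change but the pairwise offsets between them do not, which is precisely the relative-relationship cue SCGTracker exploits.

```python
import numpy as np

# Hypothetical object centres in frame t-1 and a shared per-frame drift.
centers_prev = np.array([[100.0, 200.0], [150.0, 210.0], [300.0, 190.0]])
drift = np.array([4.0, -1.0])
centers_curr = centers_prev + drift

# Pairwise offsets between objects before and after the step.
offsets_prev = centers_prev[:, None] - centers_prev[None, :]
offsets_curr = centers_curr[:, None] - centers_curr[None, :]

print(np.abs(offsets_curr - offsets_prev).max())  # 0.0: relative layout preserved
```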

Fig. 2

Overview of SCGTracker. The current frame \({I}_{t}\) is fed to the backbone network to obtain the feature map \({F}_{t}\). The cross-attention network [12] decomposes \({F}_{t}\) into two separate feature maps: \({F}_{t1}\) for the Re-ID task and \({F}_{t2}\) for the target detection task. The target feature vector \({C}_{t-1}\) extracted from the previous frame \({I}_{t-1}\) is combined with the feature map \({F}_{t1}\) and passed through the self-cross graph module to obtain the target feature vectors at time \(t\)

Table 1 Notation

\({I}_{t}\): the current frame image
\({F}_{t}\): the features extracted from the backbone network for image \({I}_{t}\)
\({F}_{t1}\): features from \({F}_{t}\) used for the re-identification (Re-ID) task
\({I}_{t-1}\): the previous frame image
\({F}_{t-1}\): the features extracted from the backbone network for image \({I}_{t-1}\)
\({C}_{t-1}\): the feature vector extracted from \({I}_{t-1}\) for a specific target
\({D}_{p}^{t-1}\): the feature information of the p-th target in \({I}_{t-1}\)
\({D}^{t-1}=\{{D}_{1}^{t-1},{D}_{2}^{t-1},\cdots,{D}_{{n}_{t-1}}^{t-1}\}\): the feature information of all targets in \({I}_{t-1}\)
\({n}_{t-1}\): the total number of targets in \({I}_{t-1}\)
\({G}_{t-1}\): the target graph for \({I}_{t-1}\)
\({G}_{t}\): the target graph for \({I}_{t}\)
\({E}_{{\text{self}}}\): the edges within \({G}_{t}\) and within \({G}_{t-1}\)
\({E}_{{\text{cross}}}\): the edges between \({G}_{t}\) and \({G}_{t-1}\)
\({C}_{t-1}{\prime}\): the target feature vector of \({I}_{t-1}\) after Self-Cross Attention Graph updates
\({F}_{t1}{\prime}\): the target feature map of \({I}_{t}\) after Self-Cross Attention Graph updates
\({\varphi }_{x,y}\): the target feature extracted from \({F}_{t1}{\prime}\) with \((x, y)\) as the center

Architecture of proposed method

SCGTracker is designed to achieve efficient and accurate tracking of multi-objects, analyzing their movement trajectories for real-time tracking applications such as autonomous driving. The approach employs a graph neural network [41] to map two pedestrian targets with high similarity in consecutive frames to a common space. This mapping aggregates object information to generate highly expressive object appearance features, thereby preventing confusion in object association. Leveraging these tools, SCGTracker offers a reliable method for pedestrian target tracking.

Our pedestrian detection strategy employs an enhanced Deep Layer Aggregation network (DLA-34) [35] as the backbone and follows the idea of detecting pedestrian targets by their centroids. We input two consecutive frames into the network, using the image \({I}_{t-1}\) at time \(t-1\) to obtain the feature map \({F}_{t-1}\). Since the positions of the pedestrian target centers in \({I}_{t-1}\) are directly available, we extract the appearance feature vector \({C}_{t-1}\) of each pedestrian target from the feature map \({F}_{t-1}\) at the corresponding position (a small sketch of this step follows this paragraph). This approach enables effective detection of pedestrian targets and extraction of their features for tracking.
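A minimal sketch of this feature extraction step is given below (PyTorch assumed; the feature-map stride and the function name are illustrative assumptions): the appearance vector of each target is read from the backbone feature map at the cell containing its centre.

```python
import torch


def sample_center_features(feature_map, centers, stride=4):
    """feature_map: (C, H, W) tensor; centers: (N, 2) pixel coords (x, y).

    Returns the (N, C) appearance vectors read at each object's centre cell.
    """
    grid = (centers / stride).long()  # map pixel coords to feature-map cells
    x = grid[:, 0].clamp(0, feature_map.shape[2] - 1)
    y = grid[:, 1].clamp(0, feature_map.shape[1] - 1)
    return feature_map[:, y, x].t()   # advanced indexing gives (C, N) -> (N, C)
```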

In addition to the feature map \({F}_{t-1}\), we input the image \({I}_{t}\) at time \({t}\) into our backbone network. The feature map \({F}_{t}\) obtained from this process encompasses information regarding object class confidence, object size, and object appearance. Recognizing the distinctions in tasks, we employ the Cross Attention Network (CAN) module to understand both the commonalities and specificities of detection and Re-ID task features. The CAN module learns self-relationships between different feature channels to enhance the feature representation of each task. Simultaneously, it employs a cross-relationship mechanism to capture shared information between the two tasks for commonality learning. Finally, we decompose the feature map \({F}_{t}\) into two separate feature maps: \({F}_{t1}\) for the Re-ID task and \({F}_{t2}\) for the object detection task.

To facilitate the data association process, we construct keypoints based on the appearance features of pedestrian objects. Specifically, we use the appearance feature vector \({C}_{t-1}\) of the pedestrian objects detected in image \({I}_{t-1}\) as the keypoint information \({D}_{p}^{t-1}\), which is then organized into \({D}^{t-1}=\{{D}_{1}^{t-1} ,{D}_{2}^{t-1} ,\ldots ,{D}_{{n}_{t-1}}^{t-1}\}\), where \({n}_{t-1}\) is the number of objects detected in \({I}_{t-1}\). For image \({I}_{t}\), since the positions of pedestrian objects are not directly available, we use the feature vector at each position of the Re-ID feature map \({F}_{t1}\) as the keypoint information. To aggregate this information, we utilize the graph neural network [41], the Self-Cross Attention Graph, to combine \({D}^{t-1}\) and \({F}_{t1}\). This process merges the appearance features of pedestrian objects from the previous frame with those in the current frame, resulting in a more expressive representation of pedestrian object appearance in the current frame (the overall flow is illustrated in Fig. 2; the training process is summarized in Table 2).

Table 2 SCGTracker training process

Self-cross attention graph

In real-world environments, pedestrian targets can be occluded or affected by motion blur, which complicates the tracking process. Existing tracking algorithms frequently feed the Re-ID features of targets directly into the data association step, without accounting for potential interdependencies between targets. This may weaken the correlation between frames, causing the ID of the same pedestrian target to switch repeatedly. As a result, the tracking outcomes become unstable, notably degrading multi-object tracking performance.

To tackle this challenge, we introduce the Self-Cross Attention Graph. The primary innovation of this approach lies in its spatial modeling of targets within the same frame. It achieves this by leveraging the self-attention mechanism [29] to comprehensively capture information within the target area. Moreover, it aggregates target context information within the same frame and updates target features through message passing. The method further extends spatial correlation to the temporal dimension by modeling targets between successive frames, exploiting the consistent contextual relationships of pedestrian targets over a short period. The cross-attention mechanism [12] enhances focus on target information, and message passing is employed to bring targets with similar contexts in different frames closer in terms of spatial distance. Consequently, this process significantly improves the representation of target features.

The Self-Cross Attention Graph entails the construction of object graphs within the same frame (the previous-frame graph \({G}_{t-1}\) and the current-frame graph \({G}_{t}\)) and between consecutive frames. The nodes in these graphs correspond to the keypoints in the two images. Within a frame, the object graph connects node \(i\) to all other nodes via \({E}_{{\text{self}}}\), while between frames node \(i\) is connected to all the keypoints in the other object graph through \({E}_{{\text{cross}}}\). Both \({E}_{{\text{self}}}\) and \({E}_{{\text{cross}}}\) are undirected edges; a sketch of the two edge sets follows.
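A small sketch of the two edge sets (plain Python; the index-pair representation is our assumption): \({E}_{{\text{self}}}\) fully connects the keypoints inside one frame, while \({E}_{{\text{cross}}}\) connects every keypoint of \({G}_{t-1}\) to every keypoint of \({G}_{t}\).

```python
def build_edges(n_prev, n_curr):
    """Edge sets for the Self-Cross Attention Graph as undirected index pairs."""
    e_self_prev = [(i, j) for i in range(n_prev)
                   for j in range(n_prev) if i != j]   # within G_{t-1}
    e_self_curr = [(i, j) for i in range(n_curr)
                   for j in range(n_curr) if i != j]   # within G_t
    e_cross = [(i, j) for i in range(n_prev)
               for j in range(n_curr)]                 # between the two graphs
    return e_self_prev, e_self_curr, e_cross
```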

The nodes in graphs \({G}_{t-1}\) and \({G}_{t}\) update their representations through messages propagated along their edges \(E\). Following the message passing phase, SCGTracker derives the updated pedestrian target feature vector \({C}_{t-1}{\prime}\) for image \({I}_{t-1}\) and the updated feature map \({F}_{t1}{\prime}\) for image \({I}_{t}\). From the feature map \({F}_{t1}{\prime}\), the Re-ID feature \({\varphi }_{x,y}\) of a pedestrian target in the current frame is directly extracted, with \((x, y)\) serving as the center.

Attentional aggregation Multi-target tracking often faces challenges such as occlusion, pose changes, scale variations, and partially out-of-view or invisible regions, all of which can degrade target feature information. As a remedy, we introduce an attention mechanism designed to prioritize the undisturbed portion of the target, thereby minimizing sensitivity to disruptive factors.

Self-attention mechanism [29]: Given the similarity in structure between two object graphs in consecutive frames, aggregating self-information within the same object graph can be beneficial for identifying similar nodes.

Cross-attention mechanism [12]: To augment the expressiveness of pedestrian object features in the current frame, a comprehensive comparison is needed between all the keypoints \({D}_{p}^{t-1}\) in the previous frame and the feature map \({F}_{t1}\) of the current frame. This entails searching for contextual cues that help distinguish a true match from other similar candidates and identifying the keypoints in the current frame that correspond to \({D}_{p}^{t-1}\). This iterative process focuses attention on specific locations, facilitating information transfer between the two object graphs (as illustrated in Fig. 3).

Fig. 3

Self-Cross Attention Graph. Object graphs \({G}_{t-1}\) and \({G}_{t}\) are constructed from the object feature vectors of frames \({I}_{t-1}\) and \({I}_{t}\). First, intra-frame object information aggregation is performed, followed by inter-graph information aggregation based on cross-attention [12]. Through this process, we obtain more expressive pedestrian appearance features

To implement self-attention-based information aggregation [29] within an object graph, we connect node \(i\) to all other nodes in the same graph via the edges \({E}_{{\text{self}}}\), both in graph \({G}_{t-1}\) and in graph \({G}_{t}\). Additionally, node \(i\) in graph \({G}_{t-1}\) undergoes cross-attention-based [12] information aggregation by connecting to the nodes in \({G}_{t}\). Information between nodes is aggregated through the edges \({E}_{{\text{self}}}\) and \({E}_{{\text{cross}}}\), and the node representations are updated at each layer of the graph neural network [41].

In the aggregation process, the attention mechanism is utilized to account for the relationship between a node and its neighbors during information aggregation. The node aggregation formula is as follows:

$$ H_{i}^{l} = \left[ x_{i}^{l} \,\|\, m_{E \to i}^{l} \right]. $$
(1)

Here, \({x}_{i}^{l}\) represents node \(i\) in graph \({G}_{t}\), and \(l\) denotes the layer index of the graph neural network [41]. The message to node \(i\), denoted \({m}_{E\to i}^{l}\), is the result of aggregation over all nodes \(\{j:(i,j)\in E\}\), where \(E\in \{{E}_{{\text{self}}},{E}_{{\text{cross}}}\}\). The notation \([\cdot \| \cdot]\) denotes concatenation.

We employ the attention mechanism to perform aggregation and compute the node message \({m}_{E\to i}\). To obtain the attention of other nodes toward node \(i\), we compute the representation of node \(i\) as a query \({q}_{i}\) and retrieve the values \({v}_{j}\) of the connected nodes based on their keys \({k}_{j}\). A weighted average of this information then yields the aggregated message.

$$ m_{E \to i}^{l} = \sum_{j:(i,j) \in E} \alpha_{i,j}\, v_{j} $$
(2)

The attention weight \({\alpha }_{i,j}\) between nodes \(i\) and \(j\) is obtained by applying a softmax to the similarity between the query \({q}_{i}\) and the key \({k}_{j}\):

$$ \alpha_{i,j} = \mathrm{Softmax}\left( q_{i}^{T} k_{j} \right). $$
(3)

To calculate the query \({q}_{i}\), key \({k}_{j}\), and value \({v}_{j}\), we use linear projections of the deep features of the graph neural network [41]:

$$ \begin{aligned} q_{i} & = W_{1}^{l} x_{i}^{l} + b_{1}^{l} , \\ k_{j} & = W_{2}^{l} x_{j}^{l} + b_{2}^{l} , \\ v_{j} & = W_{3}^{l} x_{j}^{l} + b_{3}^{l} \\ \end{aligned} $$
(4)
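The following single-head sketch (PyTorch assumed; the class and argument names are ours) implements Eqs. (2)–(4): queries, keys, and values are linear projections of the node features, and each node's incoming message is an attention-weighted sum over its neighbors. Passing the same node set as both arguments realizes self-attention over \({E}_{{\text{self}}}\); passing the other graph's nodes realizes cross-attention over \({E}_{{\text{cross}}}\).

```python
import torch
import torch.nn as nn


class AttentionAggregation(nn.Module):
    """One attention head computing the message m_{E->i} of Eq. (2)."""

    def __init__(self, dim):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)  # q_i = W1 x_i + b1, Eq. (4)
        self.proj_k = nn.Linear(dim, dim)  # k_j = W2 x_j + b2
        self.proj_v = nn.Linear(dim, dim)  # v_j = W3 x_j + b3

    def forward(self, x_dst, x_src):
        # x_dst: (N, dim) receiving nodes; x_src: (M, dim) neighbor nodes.
        q, k, v = self.proj_q(x_dst), self.proj_k(x_src), self.proj_v(x_src)
        attn = torch.softmax(q @ k.t(), dim=-1)  # alpha_{i,j}, Eq. (3)
        return attn @ v                          # m_{E->i},   Eq. (2)
```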

To enhance the representational power of the model, we use a multi-headed attention mechanism with \(h\) attention heads in practice.

$$ m_{E\to i}^{l} = W^{l}\left( m_{E\to i}^{l,1} \,\|\, m_{E\to i}^{l,2} \,\|\cdots\|\, m_{E\to i}^{l,h} \right) $$
(5)

During the update process of the graph neural network [41], the neighborhood information \({H}_{i}^{l}\) obtained from aggregation is utilized to update the features of the current node \(i\):

$${x}_{i}^{l+1}={x}_{i}^{l}+MLP\left({H}_{i}^{l}\right).$$
(6)
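Combining Eqs. (1), (5), and (6), one update layer can be sketched as follows (PyTorch; reusing the AttentionAggregation sketch above, with the hidden sizes as assumptions): the \(h\) head messages are concatenated and projected (Eq. 5), the merged message is concatenated with the node feature (Eq. 1), and a residual MLP produces the updated node feature (Eq. 6).

```python
import torch
import torch.nn as nn


class SelfCrossLayer(nn.Module):
    """One message-passing layer: multi-head messages + residual MLP update."""

    def __init__(self, dim, heads=3):
        super().__init__()
        self.heads = nn.ModuleList([AttentionAggregation(dim)
                                    for _ in range(heads)])
        self.merge = nn.Linear(dim * heads, dim)           # W^l in Eq. (5)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim),  # acts on [x || m]
                                 nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, x_dst, x_src):
        msgs = torch.cat([head(x_dst, x_src) for head in self.heads], dim=-1)
        m = self.merge(msgs)                # Eq. (5): merge the h heads
        h = torch.cat([x_dst, m], dim=-1)   # Eq. (1): H_i = [x_i || m]
        return x_dst + self.mlp(h)          # Eq. (6): residual MLP update
```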

The Self-Cross Attention Graph incorporates two types of aggregation mechanisms: self-attention [29] and cross-attention [12]. In the self-attention mechanism, the input is \({G}_{t}\)/\({G}_{t-1}\), which aggregates contextual information of nodes within the same feature map, enhancing the expressiveness of the node features in \({G}_{t}\)/\({G}_{t-1}\). Simultaneously, this information is fed into the cross-attention mechanism [12] to enable interaction between nodes connected through continuous edges. The node information messages from \({G}_{t-1}\) are utilized to strengthen the node features in \({G}_{t}\), resulting in improved outcomes during the data association phase.

Network loss

Our proposed network output comprises a detection task and a pedestrian re-identification task. The target detection task follows the loss design of centroid-based detection networks. We use cross-entropy loss [17] for the target centroid classification loss \({L}_{{\text{cls}}}\), and \({L}_{1}\) loss [40] for the target centroid offset loss \({L}_{{\text{off}}}\) and the target box size loss \({L}_{{\text{size}}}\).

The detection loss is computed during training as follows:

$${L}_{{\text{det}}}={L}_{{\text{cls}}}+{\lambda }_{{\text{off}}}{L}_{{\text{off}}}+{\lambda }_{{\text{size}}}{L}_{{\text{size}}},$$
(7)

where \({\lambda }_{{\text{off}}}=1\) and \({\lambda }_{{\text{size}}}=0.1\).

To learn identity-discriminative features in the Re-ID task, we treat it as a classification task. During training, all objects in the dataset sharing the same identity are treated as one class, with the identity ID as the classification label. From the bounding box \({b}^{i}=({x}_{1}^{i}, {y}_{1}^{i}, {x}_{2}^{i}, {y}_{2}^{i})\), we obtain the target center location \(({C}_{x}^{i}, {C}_{y}^{i})\) on the heat map. The Re-ID feature vector \({E}_{({C}_{x}^{i}, {C}_{y}^{i})}\) extracted at the target centroid is mapped to a class distribution vector \(P=\{p\left(k\right), k\in [1,K]\}\) using a fully connected layer and a softmax operation. The classification labels are one-hot encoded as \({L}^{i}(k)\). The Re-ID loss is then computed as follows:

$$ L_{\text{identity}} = - \sum_{i=1}^{N} \sum_{k=1}^{K} L^{i}(k) \log\left(p(k)\right), $$
(8)

where \(K\) represents the number of identities of all targets in the training set. In summary, the overall loss \({L}_{{\text{total}}}\) is calculated as follows:

$${L}_{{\text{total}}}=\frac{1}{2}\left(\frac{1}{{e}^{{w}_{1}}}{L}_{{\text{det}}}+\frac{1}{{e}^{{w}_{2}}}{L}_{{\text{identity}}}+{w}_{1}+{w}_{2}\right),$$
(9)

where \({w}_{1}\) and \({w}_{2}\) are learnable parameters that balance these two tasks.
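A direct sketch of Eq. (9) in PyTorch (the module name is ours): \({w}_{1}\) and \({w}_{2}\) are registered as learnable parameters, and \(1/{e}^{w}\) is computed as \({e}^{-w}\).

```python
import torch
import torch.nn as nn


class UncertaintyWeightedLoss(nn.Module):
    """Total loss of Eq. (9) with learnable task-balancing parameters."""

    def __init__(self):
        super().__init__()
        self.w1 = nn.Parameter(torch.zeros(1))
        self.w2 = nn.Parameter(torch.zeros(1))

    def forward(self, l_det, l_identity):
        return 0.5 * (torch.exp(-self.w1) * l_det
                      + torch.exp(-self.w2) * l_identity
                      + self.w1 + self.w2)
```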

Experiment

Training details & parameter settings

Datasets We conducted our experiments on the MOTChallenge benchmarks, specifically the MOT16 [19] and MOT17 [19] pedestrian datasets. Both datasets consist of the same videos, with seven assigned for training and seven for testing. However, while MOT16 provides only one detector, MOT17 offers three detectors, namely DPM, Faster R-CNN [25], and SDP [9]. Additionally, we employed MOTSynth [7], a large-scale synthetic dataset designed to replace real data for pedestrian detection, tracking, and segmentation. MOTSynth [7] encompasses a wide range of variations, including changes in environment, camera perspective, object texture, lighting conditions, weather, season, and object identity. By leveraging this diversity, MOTSynth [7] aims to bridge the gap between synthetic and real data, enhancing the robustness and generalizability of our method.

Evaluation metrics The MOT dataset not only offers data support for video sequences but also provides a range of related metrics for evaluating multi-object tracking algorithms comprehensively. These metrics [1] assess various aspects of performance, including detection and identity tracking. Table 3 presents the algorithm evaluation criteria and their descriptions provided by the MOT dataset.

Table 3 Evaluation metrics for multi-object tracking algorithms

In the context of algorithm research, it is essential to concentrate on the metrics that align with the targeted application requirements. When evaluating multi-target tracking algorithms, certain metrics are particularly informative: MOTA, IDF1, ID switches (IDs), ML, and MT. MOTA (Multiple Object Tracking Accuracy) and IDF1 (ID F1 score) are comprehensive metrics that provide a holistic assessment of algorithm performance; MOTA emphasizes detector performance, while IDF1 prioritizes accuracy in trajectory matching. The formulas for MOTA and IDF1 are as follows:

$$ \text{MOTA} = 1 - \frac{\sum_{t}\left(FN_{t} + FP_{t} + IDSW_{t}\right)}{\sum_{t} GT_{t}}, $$
(10)
$$ \text{IDF1} = \frac{IDTP}{IDTP + 0.5\,IDFP + 0.5\,IDFN} $$
(11)
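For reference, Eqs. (10) and (11) transcribe directly into code (plain Python; all arguments are counts accumulated over the sequence):

```python
def mota(fn, fp, idsw, gt):
    """Multiple Object Tracking Accuracy, Eq. (10)."""
    return 1.0 - (fn + fp + idsw) / gt


def idf1(idtp, idfp, idfn):
    """ID F1 score, Eq. (11)."""
    return idtp / (idtp + 0.5 * idfp + 0.5 * idfn)
```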

We conducted the experiments on Ubuntu 20.04 LTS and trained the model on a GeForce RTX 3090 GPU. To train the network on the MOT17 [19] dataset and accelerate the process, we first pre-trained on the CrowdHuman [26] dataset, which improved human detection performance while providing strong domain generalization. The network takes inputs at a resolution of 1088 × 608 and is trained for 40 epochs with an initial learning rate of 0.00001 and a batch size of 12; the learning rate is reduced by a factor of 10 every 20 epochs. A sketch of this schedule follows.
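A minimal sketch of this schedule (PyTorch; the optimizer choice and the stand-in model are assumptions, since the text does not specify them):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the SCGTracker network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# Reduce the learning rate by a factor of 10 every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(40):
    # ... one training epoch over the MOT17 data would run here ...
    scheduler.step()
```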

Experimental results and analysis

Experimental results and analysis of different modules To verify the impact of the cross-attention network (CAN) and the graph neural network [41] module SCG on the multi-object tracking algorithm, we used the MOT20 [5] training set as the validation set and performed ablation experiments on it.

Table 4 displays the experimental results of the different modules on the validation set. It is evident that when using only SCG, our method exhibits a slight improvement in the MOTA metric and a significant decrease in the FP metric. These outcomes indicate that SCG can effectively enhance the representation of pedestrian appearance features.

Table 4 Experimental results of different modules on the verification set

After incorporating only the Cross Attention Network (CAN), a notable enhancement in tracking performance and tracking persistence was observed. This observation suggests a genuine competition between the Re-ID task and the object detection task. To enhance the model's performance, we decoupled these two tasks. The best results on the validation set were achieved when all modules were added to the model.

Experimental results and analysis of the number of layers in the graph neural network We conducted an experiment to investigate the influence of the number of layers in the graph neural network [41] on the overall performance of the algorithm. To identify the most suitable number of Self-Cross Graph (SCG) layers, we systematically increased the number of layers. We introduced two hyperparameters to control the depth of the graph neural network for the self-attention [29] and cross-attention [12] mechanisms: \({l}_{s}\) determines the GNN depth for self-attention, i.e., the number of layers through which information is propagated and aggregated within each node's local neighborhood, and \({l}_{c}\) controls the GNN depth for cross-attention.

Table 5 reports the tracking results obtained by the SCG with varying numbers of layers on the MOT17 [19] validation set. Notably, when the number of SCG layers reaches 3, the tracking performance declines compared to configurations with fewer layers. This can be attributed to the increased neighborhood aggregation of GNN nodes, which causes the loss of node diversity within the graph: the vector representations become more similar, resulting in over-smoothed node features. Compared to a single SCG layer, two SCG layers yield the best results across the various indicators. This can be explained by the fact that with too few graph neural network layers, the information propagation path is limited, impeding the network's ability to capture long-range relationships and contextual information between nodes; the network may then struggle to capture global patterns and structures within the graph data. For the final algorithm configuration, we opt for a two-layer graph neural network with \({l}_{s}=1\) and \({l}_{c}=1\).

Table 5 Number of self-attention layers and cross-attention layers

Experimental results and analysis of different features To assess the effectiveness of augmenting object features, our study compares the use of solely object appearance features to the inclusion of object location information.

Table 6 shows a decrease of 0.5% in both MOTA and IDF1 when incorporating geometric features (Geom.), suggesting that adding location information may introduce similar target location characteristics, potentially causing the tracking algorithm to incorrectly associate distinct targets as a single target. In light of this observation, we chose to use only the appearance features of the objects.

Table 6 Ablation study on the effect of using geometric features during affinity computation

Experimental results and analysis of the number of attention heads We investigate the influence of the number of heads in the multi-head attention mechanism on the overall performance of the algorithm.

We conducted ablation experiments, varying the number of attention heads, denoted as \(h\). The results, presented in Table 7, demonstrate the performance of the model under different configurations.

Table 7 Number of different multi-head attentions in results on MOT17 validation set

Interestingly, we observed that the best overall metrics were achieved when the number of attention heads was set to 3, with the exception of the FP metric. This can be attributed to the fact that each attention head focuses on different subspaces of features, allowing for a more comprehensive understanding of the data. However, when the number of attention heads is 4, the model may overly emphasize noise or less significant features in the training data, leading to reduced generalization ability. On the other hand, a smaller number of attention heads may limit the model's capacity to explore the diversity present in the data.

Choosing the optimal number of attention heads (3 in this case) balances capturing relevant features against overfitting or underutilizing important information, resulting in improved model performance and generalization ability.

In terms of computational and parameter requirements (as illustrated in Table 8), our model exhibits a slight increase compared to FairMOT [39]. This is primarily due to our algorithm's focus on addressing the issue of frequent ID switching among high-density pedestrians. The attention module we have devised involves calculating and modeling correlations between multiple elements, resulting in higher computational complexity. However, the attention mechanism also enhances the model's modeling and representation capabilities, despite typically having a relatively small number of parameters. When evaluating the frames per second (FPS) on the MOT17 [19] dataset, SCGTracker demonstrates competitive performance while simultaneously improving tracking accuracy.

Table 8 Comparison of efficiency between SCGTracker and FairMOT

The loss curves for MOT17 [19] are illustrated in Fig. 4. The metric train_hm_loss reflects the detection loss, train_id_loss the Re-ID feature loss, and train_loss the overall loss. Notably, at epoch 15 the train_id_loss curve starts to flatten, whereas train_hm_loss continues to decline before stabilizing around epoch 29. train_loss reaches a plateau by epoch 29, indicating diminishing improvement in the overall loss. Consequently, training is halted at epoch 26 to avoid further iterations that would yield minimal gains.

Fig. 4

The loss chart of MOT17 [19]

Comparison with other algorithms: To demonstrate the state-of-the-art performance of our algorithm on the MOT challenge dataset, we conducted a comparative analysis with other top-performing tracking algorithms. By examining Tables 9 and 10, it becomes apparent that our algorithm does not exhibit a substantial improvement in the MOTA metric for the MOT16 [19] and MOT17 [19] datasets. This is primarily attributed to the fact that our algorithm primarily focuses on addressing the issue of ID switching in high-density pedestrian datasets.

Table 9 Comparison of ours and other algorithms on MOT16 test set
Table 10 Comparison of ours and other algorithms on MOT17 test set

One crucial measure for evaluating the effectiveness of our approach is the IDF1 metric, which measures the ratio of correctly identified detections to the average of ground-truth and computed detections, while the IDs metric counts the number of trajectory identity switches. It is worth noting that our method achieved the best results in terms of IDF1 and IDs, indicating that it effectively alleviates the problem of pedestrian ID switching in dense scenes.

While the improvement in the MOTA metric may not be substantial, the exceptional performance in IDF1 and IDs demonstrates the efficacy of our method in mitigating the challenges associated with ID switching in crowded pedestrian scenarios. This showcases the unique contribution and value of our approach in addressing this specific problem, even if it does not lead to a significant improvement in overall MOTA performance (The best results are highlighted in red, and the second-best results are highlighted in blue).

To assess the robustness of our algorithm, we conducted an experiment on the MOTSynth [7] dataset, comparing it with existing graph neural network-based multi-target tracking algorithms. The results of this experiment are presented in Table 11, alongside the findings from other relevant papers.

Table 11 Comparison of ours and other algorithms on MOTSynth test set

Upon analyzing Table 11, we observe that while our algorithm does not achieve the highest scores on the MOTA indicator, it performs on par with other methods in terms of MOTA, ML, MT, and IDs. Here, MT (mostly tracked) denotes the proportion of ground-truth trajectories covered by track hypotheses for at least 80% of their length, while ML (mostly lost) denotes the proportion covered for at most 20%.

This suggests that our algorithm effectively obtains discriminative target features, contributing to a reduction in the number of target ID switches. This ability to capture strongly discriminative target features is a notable strength of our algorithm, contributing to its robustness in multi-target tracking scenarios.

Visualization results

As depicted in Fig. 5, three distinct tracking scenarios from the test set were chosen for qualitative display. The first row of the figure shows the detection results without the decoupling module (CAN), while the second row shows the results with decoupling by the Cross Attention Network (CAN).

Fig. 5

Impact of CAN on detection tasks

As depicted in the detection effect plots, the response area of the pedestrian targets enclosed within the red circles in each image of the first row is notably smaller than that of the second row. This observation suggests that the competition between the detection task and the Re-ID task has a substantial impact, not only on the detection task but also on the tracking task. Hence, it is imperative to decouple them using a cross-attention network.

The tracking performance of our method on the MOT17 test set is illustrated in Fig. 6. Each row of the figure corresponds to a video sequence from the MOT17 test set, and each column from left to right represents our tracking results every 30 frames.

Fig. 6

Tracking effects on the MOT17 test set

Discussion

The SCGTracker is an online algorithm designed for end-to-end multi-target tracking. It leverages an attention mechanism to aggregate information surrounding the targets. Additionally, it employs message passing to interact with target feature information, thereby identifying highly discriminative characteristics. However, there are some notable drawbacks that need to be addressed. First, the performance in target detection falls short of expectations, as evidenced by unsatisfactory results obtained from the MOTA indicator. As previously discussed, our method primarily focuses on enhancing target features while disregarding the crucial data association module. Second, the SCGTracker fails to fully exploit the positional information of the targets. Our experiments reveal that incorporating the positional information at the pixel level in the current frame may introduce similar target position features, resulting in errors within the tracking algorithm. Resolving these aforementioned issues constitutes a significant research area within the context of the MOT framework based on graph neural networks.

Conclusion

We conducted a literature review on the application of graph neural networks (GNNs) [41] for enhancing target re-identification (Re-ID). Our findings reveal that existing algorithms often overlook the interdependencies among targets within the same frame. Additionally, under occlusion, the degraded features of a detected target can unintentionally compromise the high-quality features of the trajectory target. To address this issue, our paper introduces the construction of object graphs for each frame and between consecutive frames. We leverage the self-attention mechanism to aggregate target features within the same frame and employ cross-attention to gather information from pedestrian targets in two consecutive frames, effectively capturing their correlations. The target features are then updated using a graph neural network [41]. Experimental evaluations on the MOT17 dataset demonstrate that our proposed method is highly competitive with state-of-the-art tracking methods, achieving comparable or superior results across almost all evaluation metrics.