Abstract
The challenge of multi-object tracking is a fundamental focus in computer vision research, with widespread applications in public safety, transportation, autonomous vehicles, robotics, and other domains involving artificial intelligence. Given the intricate nature of natural scenes, object occlusion and semi-occlusion are commonplace in tracking tasks. These factors often cause ID switching, object loss, detection errors, and misaligned bounding boxes, significantly degrading the precision of multi-object tracking. This paper addresses these issues and proposes a novel multi-object tracker incorporating relative location mapping (RLM) and target region density (TRD) modeling. The new tracker is more sensitive to differences in the spatial relationships between targets, allowing it to dynamically introduce low-scoring detection boxes into different regions based on the density of target regions in the image. This improves tracking accuracy while avoiding substantial additional computational cost. Our results indicate that applying this method to state-of-the-art multi-object tracking approaches improves the HOTA and IDF1 metrics by 0.4 to 0.8 points on the MOT17 and MOT20 datasets, demonstrating the effectiveness of the proposed method in enhancing multi-object tracking performance.
1 Introduction
Multi-object tracking (MOT) is a crucial task in the field of computer vision, and in recent years it has found widespread application in areas such as autonomous driving, robot navigation, and public video analysis. Current paradigms for multi-object tracking mainly consist of tracking by detection (TBD) and joint detection and tracking (JDT). The TBD paradigm detects targets with an object detector and then associates target objects and trajectories through techniques such as target re-identification (Re-ID) [1] or Kalman filtering [2] to obtain complete target paths. The JDT paradigm, on the other hand, performs detection and tracking simultaneously in a single stage. TBD has the advantage of leveraging the latest object detectors, ensuring the use of high-performance detection models for tracking tasks. In contrast, the strength of JDT lies in its ability to design end-to-end tracking models while achieving faster inference.
However, regardless of the paradigm used, the object detector plays a crucial role, as its accuracy directly affects multi-object tracking performance. Occlusion is a common factor that degrades detector accuracy. In real-world scenarios, occlusion frequently occurs, resulting in incomplete detected bounding boxes and thus reduced tracking accuracy. Moreover, variations in target size, appearance, etc. can lead to a mismatch between detected boxes and actual target dimensions, causing errors. Enhancing multi-object tracking accuracy and robustness remains a core research problem.
In multi-object tracking scenarios, the relative positions of targets in the camera image do not reflect their true distance relationships, owing to the shooting angle. Specifically, the same pixel interval corresponds to a greater real-world distance between far targets than between near ones. In addition, even if targets move at constant speed, the image may create the visual illusion that far and near targets move at different speeds. Our experiments found that this discrepancy between relative and actual target positions increases errors in tracking algorithms: target motions that never actually cross can falsely appear to cross in the image, causing visual and computational errors.
Real-time performance is a crucial requirement for multi-object tracking, especially for many downstream tasks that need rapid and accurate identification and tracking of targets. These applications typically require the tracking system to efficiently process large-scale data streams, adapt to changing environments, and provide instant decision support. In addition, rational allocation of computing resources also directly affects the efficiency of multi-object tracking systems. For example, in a video, more resources can be allocated to regions where targets are present or dense, while relatively fewer resources to other regions.
We propose two overarching solutions for improvement. Firstly, we introduce a ground-centric bounding box weighting strategy, involving the projection of targets onto a latent plane that accurately reflects inter-target distances and motion relationships. This is achieved by considering the detection box center as a reference point and implementing advanced relative position mapping algorithms. On this latent plane, target coordinates replace the original coordinates for the purpose of tracking. The fundamental concept is to use the ground as a stable reference, projecting targets onto a virtual plane to enhance the capture of relative positional relationships between targets and, consequently, elevate tracking accuracy.
Furthermore, we have developed a method that leverages a regional density model to judiciously allocate computational resources based on target density within a specific region. In areas characterized by high target density, a more substantial allocation of resources is directed towards achieving precise multi-object tracking. Conversely, in regions with lower target density, a proportionally smaller allocation of resources is made to mitigate unnecessary computational overhead.
In this paper, by analyzing the reasons for tracking errors generated by previous trackers, we propose a simple and effective solution to enhance multi-object tracking, with the following main contributions:
1. The proposed target relative location mapping model strengthens the restoration of the actual position relationships between targets during tracking, effectively reducing the probability of ID switching errors in multi-object tracking tasks caused by factors such as target size, posture, and occlusion.
2. The proposed target region density model quantifies the target density in different regions of the video image, allowing the tracker to adaptively calculate an appropriate threshold for low-scoring detection boxes based on region density. This reduces interference from low-scoring detection boxes in low-density target areas and also lowers computational costs.
3. Using the target relative location mapping model to project targets in the video image onto a latent plane, we employ a standardized bounding box approach for the first time to uniformly label the actual positions of targets in the latent plane. This method can reduce the occurrence of occlusion phenomena due to inconsistent target box sizes and enhance the tracker’s perception of target movement speed and position changes, thereby improving tracking accuracy (Fig. 1).
2 Related work
The current research in the field of multi-object tracking primarily consists of two main approaches: the Tracking by Detection (TBD) paradigm [3,4,5,6,7,8] and the joint detection and tracking (JDT) paradigm [9,10,11,12,13,14]. These two categories of research methods have different emphases and focuses.
2.1 Tracking by detection paradigm
The Tracking by Detection (TBD) paradigm is one of the most widely applied methods in current multi-object tracking. It decomposes the multi-object tracking task into independent models for detection and tracking. Typically, in the detection phase, state-of-the-art or specifically tailored object detection models such as Faster R-CNN [15], YOLO [16], Transformer [17], and their latest optimized versions like Cascade R-CNN [18], YOLOX [19], DINO [20], and Swin Transformer [21] are used. Object detection, as one of the fundamental research areas in computer vision, encompasses various detection models that have achieved outstanding results in terms of detection speed and accuracy. Therefore, researchers employing the TBD paradigm can focus more on target association research, thereby improving the accuracy of multi-object tracking.
Based on the real-time processing level of video image data, the TBD paradigm can be categorized into online tracking and offline tracking. Online tracking [22,23,24,25,26,27,28] is a real-time processing approach that tracks target trajectories based on the current and previous video frames. In contrast, offline tracking [29, 30] is more commonly used to process all frames or batches of frames from offline videos to address the target tracking problem. Since multi-object tracking is primarily applied in practical engineering applications such as autonomous driving, surveillance, and robotics, the online tracking approach has a broader range of applications, and processing video frames as they arrive usually yields higher efficiency.
In early research, SORT [31] utilized a Kalman filter to predict the future position of targets, followed by the Hungarian algorithm to associate tracking results. While simple and effective, this method tends to lose targets and make mistakes in the presence of occlusion. To enhance association, DeepSORT [32] introduced deep appearance features of targets to improve the association process and reduce the impact of occlusion on tracking results. The IQHAT framework proposed in [30] allows multiple target identities to correspond to multiple targets, demonstrating good performance in handling occlusion in crowded scenes. StrongSORT [4], an upgraded version of DeepSORT, introduces an appearance-free link model to connect short trajectories into complete ones and uses Gaussian-smoothed interpolation to compensate for missing detections, achieving better tracking performance. The method proposed in [3] balances detection and association weights, suggesting that shallow target features yield common features for the same target more readily than deep features. Furthermore, the research in [6] handles low-score detections cautiously to confirm whether they belong to occluded targets, significantly improving tracking accuracy.
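To make the SORT-style association step concrete, the sketch below matches track boxes to detection boxes by IoU. It is a simplified illustration, not the cited implementations: a greedy matcher stands in for the Hungarian algorithm, and the Kalman prediction step is omitted.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match track boxes to detection boxes by descending IoU.

    Returns (matches, unmatched_track_ids, unmatched_detection_ids).
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_threshold:
            break
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    unmatched_t = [i for i in range(len(tracks)) if i not in used_t]
    unmatched_d = [i for i in range(len(detections)) if i not in used_d]
    return matches, unmatched_t, unmatched_d

tracks = [(10, 10, 50, 90), (200, 40, 240, 120)]
dets = [(205, 45, 245, 125), (12, 12, 52, 92), (400, 0, 430, 60)]
m, ut, ud = associate(tracks, dets)
print(m, ut, ud)  # → [(0, 1), (1, 0)] [] [2]
```

Unmatched detections (here the third box) would spawn new tracks, and unmatched tracks would be aged out after a few frames, mirroring the SORT bookkeeping described above.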
2.2 Joint detection and tracking paradigm
With the continuous maturation of multi-task learning techniques, some studies [33,34,35] have begun to explore the more challenging collaborative tracking and detection paradigm, which involves target estimation and identity recognition, moving beyond the traditional detection-then-tracking pipeline. These studies integrate re-identification (Re-ID) branches into the backbone network to obtain features for each target. Early research often employed a direct approach: cropping target regions after detection and feeding them into pre-trained Re-ID models, with deep neural networks then computing target feature representations. However, this method significantly increased computational costs, as it required more parameters to account for target variations and obtain clearer intra-class features. To address this issue, subsequent research explored more efficient methods for target feature extraction, aiming to reduce computational costs and enhance recognition performance. This includes sharing features of anchor- or point-based detectors in the detection backbone network [36], reducing redundant computation of target image features. Facing the challenge of target occlusion, some studies [37] attempted to detect only the visible parts of targets or divided targets into multiple parts for sequential detection and comparison. Others addressed occlusion by making multiple predictions for a single detection box, and point-based detectors [25] aimed to solve the problems of excessive overlap among target boxes and the removal of low-score boxes by non-maximum suppression (NMS).
In addition to traditional CNN approaches, breakthroughs in the Natural Language Processing (NLP) field have led to the increasing exploration of transformer-based single-model architectures in the image domain. The JDE method designed in [36] combined target detection and ID association in the same model with shared parameters, improving the efficiency of multi-target detection while extracting target association features through deep learning, to some extent reducing the impact of occlusion. The MOTR method in [38] extended DETR and modeled temporal relationships in video sequences to improve tracking accuracy. The research in [39] using CSTrack analyzed the performance problems caused by the overlooked differences when handling detection and association with a single model. It proposed the interactive network REN and the Scale-Aware Attention Network SAAN. The study in [25], using CenterTrack, replaced the target detection box with the center point heatmap of the target and the previous frame’s RGB image as the target tracking point trajectory. This approach reduced the complexity of detecting and tracking target boxes, achieving an optimal balance between efficiency and accuracy. The research in [5, 40,41,42] utilized attention mechanisms to enhance the connection between frames and improve association efficiency.
However, compared to the previous two-step detection and tracking paradigm, the collaborative tracking and detection paradigm exhibits a decrease in overall performance. The main reason is that adding a Re-ID branch to the backbone network to maintain target identity information consumes a significant amount of computational and storage resources. These resources are used to maintain potentially unnecessary identity information, resulting in redundancy in the system and impacting performance. It’s worth noting that the anchor-based approach of one-shot trackers is not suitable for simultaneously extracting features for detection objects and Re-ID features because they use different feature types and dimensions. This method introduces additional complexity, increases computational burden, and may lead to performance degradation. The study in [43] analyzed why the effectiveness of a single-model multi-target tracker is lower than that of a dual-model structure. The difference is attributed to the differences in feature indicators required for target detection and target association. Target detection requires large inter-class feature distinctiveness and small intra-class distinctiveness, while target association requires large intra-class distinctiveness. Therefore, a single model cannot effectively meet both of these requirements simultaneously.
2.3 This paper
This paper adopts the tracking by detection paradigm and employs a series of innovative methods to address occlusion issues, thereby enhancing the performance and robustness of multi-object tracking systems. Firstly, a positional mapping operation is applied to the target box positions obtained by detectors in different scenes, projecting the targets in the video image onto a ground-based mapping plane. This step considers the optical relationships, helping alleviate tracking issues caused by occlusion, image angles, and other factors. In terms of target detection, the YOLOX model [19], trained on an additional dataset, is utilized, while drawing inspiration from the ByteTrack [6] approach. This involves using low-scoring bounding boxes to associate and track targets, considering the information from low-scored target bounding boxes and thereby improving target association accuracy.
Furthermore, a method to generate standardized target bounding boxes in the mapping image is redesigned, and a Target Region Density Model is introduced to quantify areas prone to occlusion. By adaptively adjusting the size of standardized bounding boxes and the threshold for low-scored target bounding boxes based on the occlusion probability, the system better adapts to various occlusion scenarios, enhancing tracking accuracy. Real-time adjustments to the Kalman gain coefficient are made using the region density model, thereby strengthening tracking stability.
Experimental results demonstrate that these improvement methods effectively reduce the probability of identity switches and target losses during the target tracking process in crowded scenes. They provide new and effective approaches to address occlusion issues, enhancing the performance and robustness of multi-object tracking systems. These contributions are valuable for addressing complex multi-object tracking scenarios in practical applications such as surveillance, autonomous driving, and robot navigation.
3 Relative location mapping
In video images, cameras typically capture pedestrians on roads from certain angles. The camera’s focus is usually directed towards areas with crowds or dense pedestrian traffic to capture the scene as comprehensively as possible. However, this shooting approach may lead to issues of overlay and occlusion of target pedestrians in the image. To mitigate the impact of these factors, it becomes crucial to project the positions of targets in the video image onto a ground-based potential plane, where the relationship between the potential plane and the actual positions of targets in the scene is more aligned. As illustrated in Fig. 2, through this mapping operation, we can better understand the relative positions and motion directions between targets. The potential plane generated by this mapping helps alleviate issues caused by occlusion, enhancing the accuracy and stability of the multi-object tracking system.
3.1 Relative location mapping model
We represent the position of the target on the image using the coordinates of the midpoint at the bottom of the target bounding box, as illustrated in the vertical view of Fig. 3. The ratio of the vertical distance between the target position and the lower boundary of the image to the overall image height corresponds to the ratio between the target imaging angle \(\alpha _v\) and the imaging angle of the camera device \(\beta _v\). Equation 1 expresses the relationship between the size of the target imaging angle and the coordinates x, y, denoted by \(\mathcal {G}\). As the target moves to different positions in the video image, corresponding real-world locations exhibit a nonlinear trend. We typically interpret these differences as deformations generated by the video.
The variable y represents the distance between the target’s position on the video image and the bottom of the image. As y changes, the imaging angle \(\alpha _v\) undergoes nonlinear variations. The formula for the vertical mapping function \(\mathcal {G}(y)\) is as follows:
In Eq. 2, H represents the total height of the video image, and \(\beta _v\) denotes the vertical viewing angle of the camera device. The camera device’s viewing angle is often fixed over a segment of monitoring video based on focal length and factory configurations. Alternatively, we can obtain an empirical viewing angle through methods such as deep learning algorithms.
From the formula, it can be observed that within the vertical viewing angle of the camera device, the size of the target imaging angle is related to the maximum vertical viewing angle of the camera device and the height of the image. As shown in Fig. 4, where the horizontal axis represents the size of the target position in the image, set here as 1080, and the vertical axis represents the angle \(\alpha _v\), the curves in each graph depict the relationship between the target position and the target imaging angle for different maximum vertical viewing angles \(\beta _v\) of the camera device. The larger the visible angle of the camera device, the more pronounced the nonlinear relationship between the target image position and the real position.
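Since Eq. 2 is not reproduced above, the sketch below gives one plausible pinhole-camera instantiation of the vertical mapping \(\mathcal {G}(y)\), chosen only to match the behaviour described here: the angle runs from 0 at the bottom edge of the image to \(\beta _v\) at the top, and the nonlinearity grows as \(\beta _v\) increases. The paper's exact formula may differ.

```python
import math

def alpha_v(y, H=1080, beta_v_deg=120.0):
    """Vertical imaging angle (degrees) of a target y pixels above the image
    bottom, for a camera with vertical field of view beta_v.

    Hypothetical pinhole instantiation of the paper's G(y): the angle is
    measured from the lower edge of the field of view, assuming the
    principal point sits at the image centre.
    """
    half = math.radians(beta_v_deg) / 2.0
    # Normalised image coordinate in [-1, 1], measured from the centre row.
    t = 2.0 * y / H - 1.0
    return beta_v_deg / 2.0 + math.degrees(math.atan(t * math.tan(half)))

# Endpoints recover the field-of-view limits regardless of beta_v ...
print(round(alpha_v(0), 6), round(alpha_v(1080), 6))  # → 0.0 120.0
# ... while mid positions deviate from the linear reading y / H * beta_v,
# and the deviation grows with beta_v, matching Fig. 4's qualitative trend.
print(round(alpha_v(270), 2), round(alpha_v(270, beta_v_deg=40.0), 2))
```

For a narrow \(\beta _v\) the tangent terms are nearly linear, so the curve flattens toward the proportional relationship, which is consistent with the figure's description.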
In terms of horizontal mapping, \(\alpha _h\) is calculated in a similar manner, as illustrated on the right side of Fig. 3. The angle size is related to the width W of the image and the maximum horizontal viewing angle \(\beta _h\) of the camera device. The formula is as follows:
x represents the horizontal coordinate length of the target position in the image.
The target mapping coefficient refers to the ratio between the target’s position in the image and its position on the mapping plane. Due to perspective issues, there is deformation in the relative positions of image and actual locations. Therefore, the same target imaging angle will have differences between image positions and actual positions. As shown in Fig. 3, the ratio of \(x'\), \(y'\) to x, y is the target mapping coefficients \(\varphi _h, \varphi _v\) as expressed in Eq. 4. These coefficients directly reflect the impact of image deformation on the relative positions of the target in actual space.
Equation 4 represents the mapping coefficients for the target in the horizontal and vertical directions. It can be observed that these coefficients are related to the target imaging angle, the maximum viewing angle of the camera device, and the tilt angle.
As per the illustration in Fig. 3, the formula for calculating \(y'\) in the vertical direction is as follows:
\(\theta \) represents the angle between the lower edge of the camera device’s field of view and the vertical line. Assuming the camera device is at a height h above the ground, \(y'\) is given by:
The formula for calculating y is as follows:
The formula for calculating \(\varphi _v\) is obtained as follows:
The above equation represents the deformation ratio coefficient of the target’s vertical coordinate in the video image to its actual position in the vertical direction. This coefficient reflects the vertical deformation ratio of the video image at a specific point.
From Fig. 5, we can see a schematic relationship between the target’s vertical mapping coefficient \(\varphi _v\) and the camera device’s tilt angle. It is evident that as the tilt angle of the camera device decreases, the curve depicting the growth of the vertical mapping coefficient in the vertical direction becomes steeper. This effectively reflects the deformation characteristics of the video image.
Regarding the horizontal mapping coefficient, based on the schematic on the right side of Fig. 3, it can be observed that the increase in horizontal deformation occurs with an increase in vertical distance in the image. The horizontal mapping coefficient, as indicated in Eq. 4, \(\varphi _h\), is only related to the maximum vertical viewing angle and the tilt angle of the camera device. It can be expressed by the following formula:
\(\theta \) is obtained from Eq. 5, and \(y'\) is given by Eq. 6.
Figure 6 illustrates a comparison between \(\varphi _v\) and \(\varphi _h\) when \(\beta _v = 120\) and \(\gamma = 20\). The graph shows that the video image mapping coefficients increase as the target imaging angle grows, and that the increase in the horizontal direction is greater than in the vertical direction. However, this observation holds only when \(\beta _v\) is sufficiently large; as \(\beta _v\) decreases, the horizontal mapping coefficient gradually approaches or falls below the vertical growth rate.
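The geometry behind \(\varphi _v\) and \(\varphi _h\) can be sketched numerically. The model below is an assumption rather than the paper's Eqs. 5–9: the camera sits at a hypothetical height with the lower field-of-view edge tilted \(\theta \) from the vertical, the pixel position is read linearly from the imaging angle, and the horizontal coefficient is taken as the growth of the slant range relative to the lower edge.

```python
import math

H, BETA_V = 1080.0, 120.0   # image height (px), vertical FOV (deg)
CAM_H, THETA = 6.0, 20.0    # camera height (m) and tilt of the lower FOV
                            # edge from the vertical (deg); both hypothetical

def ground_y(alpha_deg):
    """Ground distance (m) from the foot of the lower field-of-view edge to
    a point seen alpha_deg above that edge: h * (tan(theta+alpha) - tan(theta))."""
    return CAM_H * (math.tan(math.radians(THETA + alpha_deg))
                    - math.tan(math.radians(THETA)))

def phi_v(alpha_deg, d_alpha=0.5):
    """Vertical mapping coefficient: metres of ground per pixel of image,
    estimated as a finite-difference slope, with the linear pixel-angle
    reading y = alpha / beta_v * H (an assumption, not the paper's form)."""
    dy_ground = ground_y(alpha_deg + d_alpha) - ground_y(alpha_deg)
    dy_pixels = d_alpha / BETA_V * H
    return dy_ground / dy_pixels

def phi_h(alpha_deg):
    """Horizontal mapping coefficient, read as the growth of the slant
    range h / cos(theta + alpha) relative to its value at the lower edge."""
    return (math.cos(math.radians(THETA))
            / math.cos(math.radians(THETA + alpha_deg)))

for a in (0.0, 20.0, 40.0):
    print(a, round(phi_v(a), 4), round(phi_h(a), 4))
```

Both coefficients grow monotonically with the imaging angle, and the growth steepens as the tilt decreases, reproducing the qualitative behaviour shown in Figs. 5 and 6.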
At this point, the mapping coordinates expressing the relative relationship between the target’s video image position and its actual position are as follows:
Achieving the mapping relationship from the video image to the actual ground is now complete.
3.2 Target region density
In different scenarios, the area where targets appear may vary. Relative to the entire image, we often pay more attention to specific regions where targets are present. Additionally, regions with fewer targets are less likely to experience occlusion. To address this characteristic, we introduce a technique called the target region density (TRD) similar to attention mechanisms in object detection. This method is particularly beneficial for multi-object tracking tasks, as it enables the tracking system to quickly perceive potential occluded and interference areas. Through the target region density method, we can reduce the weighting of detection box selection, concentrating the attention of the tracking system on regions where targets exist. This helps improve the accuracy of the tracker by rapidly capturing potentially overlooked tracking points.
Furthermore, the Target Region Density Method allows us to tighten the standardized bounding boxes to assist tracking effects in a more consistent manner. It also aids the tracker in more accurately capturing tracking points that may be missed, thus enhancing the overall accuracy of the tracking process. Importantly, the parameters of the Target Region Density Method are updated in real-time, providing high flexibility to dynamically adjust bounding boxes based on different scenes and requirements for better tracking results. This method holds promise for enhancing the performance of multi-object tracking systems, making them more intelligent, accurate, and adaptable to diverse scenarios.
We divide the entire video image into nine regions, as visualized in Fig. 7, which clearly shows the region with the highest density in the image. The density of each region is computed from the weighted portions of target detection boxes that fall within it, and the values are then normalized with the maximum-density region as the reference. Densities are accumulated from the positions, sizes, and scores of the detection boxes, with bottom-weighted coefficients higher than those for the upper part, especially when a target spans multiple regions. The formula for calculating target region density is as follows:
In Eq. 13, Bbox represents the proportion of the intersection between the original bounding box within the target region and the original bounding box, Score represents the score of the original bounding box, \(\varpi \) represents the proportion coefficient of the target box region after bottom-weighted addition, and n represents the number of original bounding boxes in the target region. In Eq. 14, m represents the number of target regions, \(\rho _i\) represents the calculated density values for each region, \(\rho _{max}\) represents the maximum calculated density for each region, and Q is the mask matrix. After multiplying with the region matrix W, it yields the normalized density matrix \(A_\rho \).
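A minimal version of the TRD computation can be sketched as follows. The 3×3 grid, the score weighting by intersection fraction, and the normalization by the densest cell follow the description above, but the bottom-weighting coefficient `omega` is a hypothetical stand-in for the paper's \(\varpi \).

```python
def region_density(boxes, width, height, omega=1.5):
    """boxes: list of (x1, y1, x2, y2, score) with y growing downwards.
    Returns a 3x3 list of densities normalised so the densest cell is 1."""
    cw, ch = width / 3.0, height / 3.0
    grid = [[0.0] * 3 for _ in range(3)]
    for x1, y1, x2, y2, score in boxes:
        area = max(1e-9, (x2 - x1) * (y2 - y1))
        for r in range(3):
            for c in range(3):
                # Fraction of the box's area falling inside cell (r, c).
                ix = max(0.0, min(x2, (c + 1) * cw) - max(x1, c * cw))
                iy = max(0.0, min(y2, (r + 1) * ch) - max(y1, r * ch))
                frac = ix * iy / area
                # Weight the row containing the box bottom more strongly
                # when the box spans several rows: feet anchor the target.
                w = omega if (r + 1) * ch >= y2 else 1.0
                grid[r][c] += score * frac * w
    peak = max(max(row) for row in grid) or 1.0
    return [[v / peak for v in row] for row in grid]

dets = [(100, 200, 180, 500, 0.9), (120, 220, 200, 520, 0.8),
        (700, 700, 780, 1000, 0.5)]
rho = region_density(dets, width=1920, height=1080)
print([[round(v, 2) for v in row] for row in rho])
# → [[0.67, 0.0, 0.0], [1.0, 0.03, 0.0], [0.0, 0.55, 0.0]]
```

The resulting matrix plays the role of \(A_\rho \): regions with value near 1 receive the full low-score threshold scaling, while empty regions contribute nothing.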
3.3 Standardized bounding boxes
We have obtained the key point distribution of the target based on the ground in the original video image, forming a map of the actual relationship of the target. In the original image, the depth of the target is typically represented by the size or area of the target bounding box. However, in real-world scenarios, differences in the physical shape, behavior, and carried items of different targets can result in similar-looking bounding boxes in the image. This can lead to unexpected identity switches, such as when an adult in the distance and a child nearby may have target boxes of similar sizes.
RLM can authentically reflect the physical spatial relationships of targets in the actual environment, independent of the area of the target bounding box. Therefore, after mapping to the new target position, we can use bounding boxes with a fixed width to represent the target’s location. However, it is essential to maintain the aspect ratio of the original bounding boxes in the image to increase the distinction between different targets. As shown in Fig. 8, the mapped standardized bounding boxes project the targets from the original video image onto a horizontal plane coordinate system. At the same time, they replace the original bounding boxes of different sizes with bounding boxes of the same width. This mapping operation helps better represent the target’s position, reduce identity switches, and enhance the accuracy of the multi-object tracking system.
The size of the mapped target standardized bounding box is determined by the mapping coefficients \(\varphi \), the region density \(A_{\rho }\), the aspect ratio \(\gamma \) of the original detected target box, and the video frame rate f. Our goal is to achieve optimal intra-class recognition performance while better avoiding identity switching issues caused by overlapping bounding boxes. Therefore, we position the bounding box at a balanced location between maximum scale and minimum overlap to achieve optimal performance.
By continuously tracking image density, we can iteratively optimize the size of standardized bounding boxes to improve tracking effectiveness. Equation 16 illustrates the ongoing optimization through the density matrix to compute the area values of standardized bounding boxes. In this formula, \(A_{\varphi i}\) represents the mean matrix of target mapping coefficients in the recognition region, \(\mathcal {L}_{arg}\) denotes the average spacing between targets in that region, \(\overline{W}\) represents the average obtained standard width of bounding boxes, f is the video frame rate, and c is a frame rate control constant. This formula indicates that, when the video frame rate is higher, the bounding box area can be set smaller.
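Because Eq. 16 itself is not reproduced here, the function below is a purely illustrative composition of the stated dependencies, not the paper's formula: the standardized-box area grows with the mean mapping coefficient, the average target spacing, and the average width, and shrinks as the frame rate or the region density rises.

```python
def std_box_area(phi_mean, l_avg, w_bar, fps, c=30.0, rho=1.0):
    """Illustrative stand-in for Eq. 16 (exact form unknown):
    phi_mean - mean target mapping coefficient in the region (A_phi)
    l_avg    - average spacing between targets in the region (L_arg)
    w_bar    - average standard width of bounding boxes (W-bar)
    fps      - video frame rate f; c is a frame-rate control constant
    rho      - region density from the TRD model (damps dense regions)
    """
    return phi_mean * l_avg * w_bar * (c / fps) / (1.0 + rho)

# Higher frame rate -> smaller area, as stated in the text.
print(std_box_area(1.2, 4.0, 0.6, fps=30) > std_box_area(1.2, 4.0, 0.6, fps=60))  # → True
```

Only the monotonic trends are meaningful here; the actual coefficients and functional form would come from the paper's density-driven optimization.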
3.4 Relative location mapping tracking algorithm
Based on the aforementioned model, the relative location mapping tracking algorithm involves mapping the target positions in the surveillance video images to a potential plane. Following this mapping, the algorithm calculates the association relationships between various target points on the potential plane. The detailed implementation process of the relative location mapping tracking algorithm is elaborated below (Fig. 9).
The algorithm presented in Algorithm 3.1 outlines the entire process of the tracker using this model for video image tracking tasks. This algorithm draws inspiration from previous work [3] that introduced the concept of low-scoring bounding boxes. Unlike previous methods, it employs a series of stages, including detection box filtering, relative location mapping, initial tracker association, secondary tracker association, and tracker output. The following provides a detailed explanation of these stages:
Detection Box Filtering Firstly, the detector performs frame-by-frame detection on the video images. Then, it selects high-scoring target bounding boxes with scores greater than \(\tau _{high}\). These high-scoring bounding boxes anchor the target position coordinates to the bottom center of the bounding box.
Relative Location Mapping Next, the tracker calculates the target region density matrix \(A_\rho \) based on the density of targets. After obtaining the target region density matrix, the algorithm filters out low-scoring target bounding boxes with scores greater than \(A_\rho * \tau _{low}\) based on the density of different regions. These bounding boxes will participate in subsequent tracking associations.
Initial Tracker Association At this point, the positions of high-scoring and low-scoring bounding boxes are transformed to the mapping plane using the RLM model, obtaining relative position coordinates \(\mathcal {R}_{high}, \mathcal {R}_{low}\). Subsequently, complete standardized bounding boxes \(\mathcal {D}_{high}, \mathcal {D}_{low}\) are generated from the relative position coordinates on the mapping plane. This step completes the relative location mapping transformation.
Secondary Tracker Association The parameters of the transformed target bounding boxes and their region density values are fed into the Kalman filtering algorithm to calculate the predicted position of the next frame’s target in the mapping plane. The tracker performs association operations on the target positions for each frame, first associating high-scoring target bounding boxes and then supplementing the tracking task with low-scoring bounding boxes to obtain \(\mathcal {T}'\).
Tracker Output Finally, the associated results need to undergo the reverse operation of relative location mapping to obtain the final set of tracked target bounding boxes \(\mathcal {T}\), completing the tracking task.
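The two-stage association described above can be sketched as follows. The greedy nearest-neighbour matcher is a simplified stand-in for the paper's Kalman/IoU association, and `two_stage_associate` with its `gate` parameter is a hypothetical name, not the paper's API.

```python
import numpy as np

def greedy_match(tracks, dets, gate):
    """Greedy nearest-neighbour matching on mapped (x, y) coordinates --
    a simplified stand-in for the paper's Kalman/IoU association."""
    matches, free = [], list(range(len(tracks)))
    for j, d in enumerate(dets):
        if not free:
            break
        i = min(free, key=lambda k: np.hypot(*np.subtract(tracks[k], d)))
        if np.hypot(*np.subtract(tracks[i], d)) <= gate:
            matches.append((i, j))
            free.remove(i)
    return matches, free

def two_stage_associate(predictions, high_dets, low_dets, gate=1.0):
    """Associate high-score detections first, then let low-score ones
    claim the remaining unmatched tracks (all on the mapping plane)."""
    m1, free = greedy_match(predictions, high_dets, gate)
    m2, lost = greedy_match([predictions[i] for i in free], low_dets, gate)
    # re-index second-stage results back to the original track ids
    return m1 + [(free[i], j) for i, j in m2], [free[i] for i in lost]
```

Tracks left unmatched after both stages are the candidates for the 30-frame lost-trajectory buffer described in Sect. 4.1.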
This framework effectively addresses occlusion in multi-object tracking: by introducing the relative location mapping model and the target region density model, it improves both the accuracy and the robustness of the tracking system.
4 Experiments
4.1 Configuration
Datasets and Backbone Commonly used datasets in the multi-object tracking domain include MOT17 [44] and MOT20 [45]. Both provide training and testing data; however, since no separate validation set is available, we follow [46] and split each training set into two halves, using the first half for training and the second to validate the algorithm's tracking performance.
The MOT17 dataset consists of 7 training videos and 7 testing videos, with lengths ranging from 7 to 90 s and frame rates between 14 and 30 frames per second; most videos have a resolution of 1920\(\times \)1080. MOT20 targets multi-person tracking in crowded, complex scenes, with an average of about 170 people per frame. For improved detection performance, we train on a mixture of the CrowdHuman, CityPersons, and ETHZ datasets under the private detection protocol. To achieve strong tracking results in a consistent validation environment, we employ the YOLOX [47] object detection framework as the backbone; it offers excellent detection accuracy and speed, meeting the real-time demands of multi-object tracking.
Evaluation Metrics For validation evaluation, we utilize the main evaluation measures provided by MOTChallenge, including Multiple-Object Tracking Accuracy (MOTA), IDF1 [48] score, Higher-Order Tracking Accuracy (HOTA) [49], false positives (FP), detection accuracy (DetA), and other dimensions. The TrackEval [50] tool is used to verify the evaluation metrics. These evaluation metrics contribute to a comprehensive assessment of the performance of our multi-object tracking algorithm.
Implementation Details In our experiments, we use 8 NVIDIA Tesla V100S GPUs for parallel training and a single card for testing. The experimental code is implemented with the PyTorch [51] framework.
By default, we set the threshold \(\tau _{high}\) to 0.6, while \(\tau _{low}\) starts from a base value of 0.3 and is scaled according to the density of each region. Lost trajectories are retained for 30 frames so that reappearing targets can be re-associated.
The RLM tracking algorithm requires preliminary estimation of camera variables. Typically, in target recognition work under surveillance environments, a fixed-angle monitoring camera system is used to obtain stable video shots. From Eq. 9, it is clear that the camera’s field of view and tilt angle affect the mapping relationship of targets in the image. In the experiment, we manually set the camera parameters to adapt to the current video scene. This helps to obtain more accurate target mapping positions.
Preprocessing The MOT17 and MOT20 datasets contain scenes with moving shots and screen jitter. Although the camera's field of view remains unchanged, slight oscillations in its tilt angle cause irregular variations in the trajectories of moving objects, and camera shake introduces changes between image frames that reduce tracking accuracy. To maximize the continuity of object motion, we applied real-time affine transformation to the videos using the OpenCV [52] and vidgear [53] libraries, preserving the original object trajectories as far as possible. The affine transformation is given in Eq. 19, where matrix A represents the rotation of the image and matrix B the translation. Matrix \(M_{k-1\vert k}\) in Eq. 20 denotes the transformation from frame \(k-1\) to frame k of the video.
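Eqs. 19 and 20 describe a standard \(2\times 3\) affine transform \(M = [A \mid B]\). A numpy-only sketch of building such a matrix and applying it to point coordinates (function names here are illustrative, not the paper's code):

```python
import numpy as np

def make_affine(angle_deg, tx, ty):
    """Build a 2x3 affine matrix M = [A | B]: A rotates the image
    plane and B translates it, in the sense of Eq. 19."""
    a = np.deg2rad(angle_deg)
    A = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    B = np.array([tx, ty])
    return np.hstack([A, B[:, None]])

def warp_points(M, pts):
    """Apply a frame-to-frame transform M_{k-1|k} (Eq. 20) to an
    (N, 2) array of point coordinates."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]
```

In practice the per-frame matrix would be estimated from feature correspondences between consecutive frames (e.g. via OpenCV) rather than constructed from known angles; warping the previous frame's boxes by it compensates for the camera jitter before association.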
4.2 Ablation experiments
The main objective of the ablation experiments is to validate the combination of the proposed models with various trackers and to explain the performance gains the RLM algorithm obtains when mapping video images in different directions. The experiments also investigate the improvement the TRD model provides across scene distributions in the MOT20 dataset, especially as the number of targets grows. Together, these experiments validate the effectiveness and robustness of the proposed approach and clarify its behavior in diverse contexts.
Baseline Ablation Experiments In this section of ablation experiments, we systematically tested the proposed RLM algorithm and the TRD model to evaluate their impact on the performance of multi-object trackers. Two baseline models were employed: ByteTrack, based on the YOLOX detector, and BoT-SORT, based on YOLOv7. The purpose of Table 1 is to analyze and quantify the performance improvement achieved by our new methods on these two distinct baseline models. This analysis helps in better understanding their roles in multi-object tracking tasks.
Impact of Position Mapping Direction Different directions of relative location mapping reflect different degrees of visual occlusion. The experiment measured the influence of horizontal and vertical RLM on the tracker. Figure 3 illustrates the spatial relationships produced by mapping in each direction, while Table 2 evaluates tracker performance under the different mapping angles. Potential plane targets were marked using the bounding box standardization method described in Sect. 3.3. The results show that vertical mapping is more effective than horizontal mapping, likely because the vertical extent of a target box exceeds its horizontal extent.
Impact of Dynamic Camera Parameters Depending on the different visual angles and tilt angles of the camera device, the mapping results calculated by the RLM model will vary (as shown in the charts in Sect. 3.1). This experiment verified the influence of angles on tracker performance. Assuming a fixed maximum visual angle of 60\(^\circ \) for the camera device, the relationship between the tilt angle of the camera device and the tracker’s performance is shown in Fig. 10, where the horizontal axis represents the tilt angle of the camera device, and the vertical axis represents the MOTA results.
It can be observed that different scenes have different tilt angles for the camera device, causing a change in the relative position of the target in the image. When using the RLM algorithm, specific configuration parameters based on the camera device’s tilt angle need to be considered. In practical applications, when the camera device is stationary, the tilt angle is fixed, achieving optimal RLM mapping results under such conditions.
TRD Multi-Scene Effect Experiment In this section, the RLMTrack algorithm is evaluated on the MOT17 and MOT20 datasets to assess the performance gains provided by the TRD method, together with a comparison of processing speed with and without ReID. According to the results in Table 3, the improvement from the TRD method on MOT17 is slightly higher than on MOT20. Our analysis suggests that the dense distribution of people in MOT20 raises the baseline weight of the image's primary regions during association, so association calculations over most of the image require more computational resources.
4.3 Evaluation results
We compared multi-object tracking methods and results on the MOT17 and MOT20 datasets using the TrackEval evaluation tool provided by MOTChallenge, with the best results highlighted.
MOT17 Dataset Table 4 reports comparative experiments with various SORT-based tracking models on the MOT17 validation set. Our method improves tracking accuracy for several algorithms on MOT17: it performs better on the HOTA and IDF1 metrics while matching ByteTrack's precision and speed on MOTA and FPS, since RLM requires little computation and thus barely affects runtime efficiency. As shown in Fig. 11, overlapping false positive boxes cause ID switches in ByteTrack; with RLM, false positive boxes are mapped farther from the target, effectively reducing ID switches.
MOT20 Dataset Table 5 shows that, compared to several algorithms, our method achieves significant improvements in HOTA, IDF1, and IDs on the MOT20 dataset, with a slight decrease in MOTA relative to ByteTrack. This is because the RLM algorithm, after mapping to the target plane, increases the actual distance between corresponding detection boxes; for distant targets in particular, the relationships between targets are amplified. In addition, low-scoring detection boxes outside the predicted trajectory are discarded during association. Given the high density of MOT20, the discarded boxes reduce DetA and, in turn, the MOTA score. Retaining these low-scoring boxes, however, would increase the likelihood of ID switches, so after weighing the trade-off our algorithm discards them to achieve the best overall tracking performance.
Limitations Association algorithms based on the Kalman filter largely depend on the accuracy of the underlying detector. When detection accuracy is low or the camera shakes strongly, ID switches become likely, a drawback of this family of algorithms. In our experiments, the target region density method effectively reduced ID switches in high-density areas, at the cost of discarding some low-scoring but valid detection boxes. The proposed method nevertheless offers a favorable balance between real-time tracking speed and tracking accuracy.
5 Conclusion
In this paper, we introduced an enhanced multi-object tracker called RLM-Tracking. This tracker utilizes the relative location mapping method to map targets from the original video image to a top-view mapping plane. It incorporates a region density model to optimize target box filtering and standardize bounding box generation, thereby improving the accuracy of multi-object tracking.
This approach not only holds potential applications in the field of multi-object tracking but also offers performance enhancements at a relatively low cost in other related domains. Regarding the dynamic acquisition of camera parameters mentioned in the paper, we will continue our research to make further contributions to the field of multi-object tracking. We hope that our work will drive advancements in this area and lead to more progress in the future.
Data availability
The data that support the findings of this study are openly available in MOTChallenge: The Multiple Object Tracking Benchmark at https://motchallenge.net/.
References
Fu D, Chen D, Bao J, Yang H, Yuan L, Zhang L, Li H, Chen D (2021) Unsupervised Pre-training for Person Re-identification. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 14745–14754. arXiv:2012.03753. https://doi.org/10.1109/CVPR46437.2021.01451
Welch G, Bishop G (1995) An introduction to the Kalman filter. Technical report, University of North Carolina at Chapel Hill
Zhang Y, Wang C, Wang X, Zeng W, Liu W (2021) FairMOT: On the Fairness of Detection and Re-identification in Multiple Object Tracking. Int J Comput Vis 129(11):3069–3087. arXiv:2004.01888. https://doi.org/10.1007/s11263-021-01513-4
Du Y, Song Y, Yang B, Zhao Y (2022) StrongSORT: Make DeepSORT Great Again, 1–19. arXiv:2202.13514
Wan J, Zhang H, Zhang J, Ding Y, Yang Y, Li Y, Li X (2022) DSRRTracker: Dynamic Search Region Refinement for Attention-based Siamese Multi-Object Tracking, 1–25
Zhang Y, Sun P, Jiang Y, Yu D, Weng F, Yuan Z, Luo P, Liu W, Wang X ByteTrack: Multi-Object Tracking by Associating Every Detection Box (2021) arXiv:2110.06864
Bergmann P, Meinhardt T, Leal-Taixe L Tracking without bells and whistles. Proceedings of the IEEE International Conference on Computer Vision 2019-October, 941–951 (2019) arXiv:1903.05625. https://doi.org/10.1109/ICCV.2019.00103
Aharon N, Orfaig R, Bobrovsky B-Z Bot-sort: Robust associations multi-pedestrian tracking (2022) arXiv:2206.14651 [cs.CV]
Chu Q, Ouyang W, Liu B, Zhu F, Yu N DASOT: A unified framework integrating data association and single object tracking for online multi-object tracking. AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 10672–10679 (2020). https://doi.org/10.1609/aaai.v34i07.6694
Saito S, Yang J, Ma Q, Black MJ SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks 1(2), 2886–2897 (2021) arXiv:2104.03313
Reading C, Harakeh A, Chae J, Waslander SL Categorical Depth Distribution Network for Monocular 3D Object Detection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 8551–8560 (2021) arXiv:2103.01100. https://doi.org/10.1109/CVPR46437.2021.00845
Husain SS, Ong EJ, Bober M ACTNET: End-to-End Learning of Feature Activations and Multi-stream Aggregation for Effective Instance Image Retrieval. International Journal of Computer Vision 129(5), 1432–1450 (2021) arXiv:1907.05794. https://doi.org/10.1007/s11263-021-01444-0
Sun S, Akhtar N, Song H, Mian A, Shah M Deep Affinity Network for Multiple Object Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(1), 104–119 (2021) arXiv:1810.11780. https://doi.org/10.1109/TPAMI.2019.2929520
Kervadec C, Jaunet T, Antipov G, Baccouche M, Vuillemot R, Wolf C How Transferable are Reasoning Patterns in VQA? Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 4205–4214 (2021) arXiv:2104.03656. https://doi.org/10.1109/CVPR46437.2021.00419
Ren S, He K, Girshick R, Sun J Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6), 1137–1149 (2017) arXiv:1506.01497. https://doi.org/10.1109/TPAMI.2016.2577031
Redmon J, Divvala S, Girshick R, Farhadi A You only look once: Unified, real-time object detection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2016-Decem, 779–788 (2016) arXiv:1506.02640. https://doi.org/10.1109/CVPR.2016.91
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S End-to-End Object Detection with Transformers. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12346 LNCS, 213–229 (2020) arXiv:2005.12872. https://doi.org/10.1007/978-3-030-58452-8_13
Cai Z, Vasconcelos N Cascade r-cnn: Delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018)
Ge Z, Liu S, Wang F, Li Z, Sun J YOLOX: Exceeding YOLO Series in 2021, 1–7 (2021) arXiv:2107.08430
Zhang H, Li F, Liu S, Zhang L, Su H, Zhu J, Ni LM, Shum H-Y DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, 1–23 (2022) arXiv:2203.03605
Liu Z, Hu H, Lin Y, Yao Z, Xie Z, Wei Y, Ning J, Cao Y, Zhang Z, Dong L, Wei F, Guo B Swin Transformer V2: Scaling Up Capacity and Resolution (2021) arXiv:2111.09883
Cao J, Pang J, Weng X, Khirodkar R, Kitani K Observation-centric sort: Rethinking sort for robust multi-object tracking (2023) arXiv:2203.14360 [cs.CV]
Peng J, Wang C, Wan F, Wu Y, Wang Y, Tai Y, Wang C, Li J, Huang F, Fu Y Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pp. 145–161 (2020). Springer
Zheng L, Tang M, Chen Y, Zhu G, Wang J, Lu H Improving Multiple Object Tracking with Single Object Tracking. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2453–2462 (2021). https://doi.org/10.1109/CVPR46437.2021.00248
Zhou X, Koltun V, Krähenbühl P (2020) Tracking objects as points. Lecture Notes in Computer Science 12349 LNCS, 474–490. arXiv:2004.01177. https://doi.org/10.1007/978-3-030-58548-8_28
Chu Q, Ouyang W, Li H, Wang X, Liu B, Yu N Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism. Iccv, 1–10 (2017)
Zhou Z, Xing J, Zhang M, Hu W Online Multi-Target Tracking with Tensor-Based High-Order Graph Matching. Proceedings - International Conference on Pattern Recognition 2018-August, 1809–1814 (2018). https://doi.org/10.1109/ICPR.2018.8545450
Zhu J, Yang H, Liu N, Kim M, Zhang W, Yang MH Online Multi-Object Tracking with Dual Matching Attention Networks. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11209 LNCS, 379–396 (2018) arXiv:1902.00749. https://doi.org/10.1007/978-3-030-01228-1_23
Kim C, Li F, Ciptadi A, Rehg JM Multiple hypothesis tracking revisited. Proceedings of the IEEE International Conference on Computer Vision 2015 Inter, 4696–4704 (2015). https://doi.org/10.1109/ICCV.2015.533
He Y, Wei X, Hong X, Ke W, Gong Y (2022) Identity-Quantity Harmonic Multi-Object Tracking 31:2201–2215
Bewley A, Ge Z, Ott L, Ramos F, Upcroft B (2016) Simple online and realtime tracking. Proceedings - International Conference on Image Processing, ICIP 2016, 3464–3468. arXiv:1602.00763. https://doi.org/10.1109/ICIP.2016.7533003
Wojke N, Bewley A, Paulus D (2017) Simple online and realtime tracking with a deep association metric. Proceedings - International Conference on Image Processing, ICIP 2017, 3645–3649. arXiv:1703.07402. https://doi.org/10.1109/ICIP.2017.8296962
Li G, Chen X, Li M, Li W, Li S, Guo G, Wang H (2022) One-shot multi-object tracking using CNN-based networks with spatial-channel attention mechanism. Opt Laser Technol 153(April):108267. https://doi.org/10.1016/j.optlastec.2022.108267
Zhou X, Yin T, Koltun V, Krähenbühl P Global tracking transformers (2022) arXiv:2203.13250 [cs.CV]
Liu J, Zheng S, Xu G, Lin M (2021) Cross-domain sentiment aware word embeddings for review sentiment analysis. Int J Mach Learn Cybern 12:343–354
Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards Real-Time Multi-Object Tracking. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12356 LNCS, 107–122 (2020) arXiv:1909.12605. https://doi.org/10.1007/978-3-030-58621-8_7
Felzenszwalb P, McAllester D, Ramanan D A discriminatively trained, multiscale, deformable part model. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008). Ieee
Zeng F, Dong B, Zhang Y, Wang T, Zhang X, Wei Y Motr: End-to-end multiple-object tracking with transformer. In: European Conference on Computer Vision, pp. 659–675 (2022). Springer
Liang C, Zhang Z, Zhou X, Li B, Zhu S, Hu W (2022) Rethinking the Competition between Detection and ReID in Multi-Object Tracking. IEEE Trans Image Process 31:1–1. https://doi.org/10.1109/tip.2022.3165376
Xu, Y., Ban, Y., Delorme, G., Gan, C., Rus, D., Alameda-Pineda, X.: Transcenter: Transformers with dense queries for multiple-object tracking (2021)
Sun P, Cao J, Jiang Y, Zhang R, Xie E, Yuan Z, Wang C, Luo P TransTrack: Multiple Object Tracking with Transformer (2020) arXiv:2012.15460
Yu E, Li Z, Han S, Wang H RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation. IEEE Transactions on Multimedia (2022) arXiv:2105.04322. https://doi.org/10.1109/TMM.2022.3150169
Wu J, Cao J, Song L, Wang Y, Yang M, Yuan J Track to detect and segment: An online multi-object tracker. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition i, 12347–12356 (2021) arXiv:2103.08808. https://doi.org/10.1109/CVPR46437.2021.01217
Milan A, Leal-Taixé L, Reid I, Roth S, Schindler K Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016)
Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixé, L.: Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003 (2020)
Aharon N, Orfaig R, Bobrovsky B-Z Bot-sort: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651 (2022)
Ge Z, Liu S, Wang F, Li Z, Sun J Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021)
Ristani E, Solera F, Zou R, Cucchiara R, Tomasi C Performance measures and a data set for multi-target, multi-camera tracking. In: European Conference on Computer Vision, pp. 17–35 (2016). Springer
Luiten J, Osep A, Dendorfer P, Torr P, Geiger A, Leal-Taixé L, Leibe B (2021) Hota: A higher order metric for evaluating multi-object tracking. Int J Comput Vision 129(2):548–578
Luiten J, Hoffhues A (2020) TrackEval. https://github.com/JonathonLuiten/TrackEval
Paszke A, Gross S, Massa F, et al (2019) PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32
Bradski G The opencv library. Dr. Dobb’s Journal: Software Tools for the Professional Programmer 25(11), 120–123 (2000)
Thakur A, Papakipos Z, Clauss C, Hollinger C, Boivin V, Lowe B, Schoentgen M, Bouckenooghe R abhiTronix/vidgear: VidGear V0.2.5. https://doi.org/10.5281/zenodo.6046843
Zheng L, Tang M, Chen Y, Zhu G, Wang J, Lu H Improving multiple object tracking with single object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2453–2462 (2021)
Xu Y, Ban Y, Delorme G, Gan C, Rus D, Alameda-Pineda X Transcenter: Transformers with dense queries for multiple-object tracking. arXiv preprint arXiv:2103.15145 (2021)
Li, W., Xiong, Y., Yang, S., Xu, M., Wang, Y., Xia, W.: Semi-tcl: Semi-supervised track contrastive representation learning. arXiv preprint arXiv:2107.02396 (2021)
Zhang Y, Wang C, Wang X, Zeng W, Liu W (2021) Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int J Comput Vision 129(11):3069–3087
Yu E, Li Z, Han S, Wang H Relationtrack: Relation-aware multiple object tracking with decoupled representation. arXiv preprint arXiv:2105.04322 (2021)
Tokmakov P, Li J, Burgard W, Gaidon A Learning to track with object permanence. arXiv preprint arXiv:2103.14258 (2021)
Liang C, Zhang Z, Zhou X, Li B, Lu Y, Hu W (2021) One more check: Making "fake background" be tracked again. arXiv preprint arXiv:2104.09441
Wang, Q., Zheng, Y., Pan, P., Xu, Y.: Multiple object tracking with correlation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3876–3886 (2021)
Chu P, Wang J, You Q, Ling H, Liu Z Transmot: Spatial-temporal graph transformer for multiple object tracking. arXiv preprint arXiv:2104.00194 (2021)
Yang F, Chang X, Sakti S, Wu Y, Nakamura S (2021) Remot: A model-agnostic refinement for multiple object tracking. Image Vis Comput 106:104091
Stadler D, Beyerer J Modelling ambiguous assignments for multi-person tracking in crowds. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 133–142 (2022)
Cao J, Weng X, Khirodkar R, Pang J, Kitani K Observation-centric sort: Rethinking sort for robust multi-object tracking. arXiv preprint arXiv:2203.14360 (2022)
Du Y, Song Y, Yang B, Zhao Y Strongsort: Make deepsort great again. arXiv preprint arXiv:2202.13514 (2022)
Wang Y-H, Hsieh J-W, Chen P-Y, Chang M-C, So HH, Li X Smiletrack: Similarity learning for occlusion-aware multiple object tracking (2023) arXiv:2211.08824 [cs.CV]
Ren, K., Hu, C. & Xi, H. Rlm-tracking: online multi-pedestrian tracking supported by relative location mapping. Int. J. Mach. Learn. & Cyber. 15, 2881–2897 (2024). https://doi.org/10.1007/s13042-023-02070-7