1 Introduction

Multi-object tracking (MOT) is a crucial task in computer vision and has, in recent years, found widespread application in areas such as autonomous driving, robot navigation, and public video analysis. The dominant paradigms for multi-object tracking are tracking by detection (TBD) and joint detection and tracking (JDT). The TBD paradigm first detects targets with an object detector and then associates detections with trajectories through techniques such as target re-identification (Re-ID) [1] or Kalman filtering [2] to obtain complete target paths. The JDT paradigm, on the other hand, performs detection and tracking simultaneously in a single stage. TBD has the advantage of leveraging the latest object detectors, ensuring that high-performance detection models can be used for tracking. The strength of JDT lies in enabling end-to-end tracking models with faster inference.

However, regardless of the paradigm used, the object detector plays a crucial role, as its accuracy directly affects multi-object tracking performance. Occlusion is a common cause of degraded detector accuracy: it occurs frequently in real-world scenarios, producing incomplete bounding boxes and thus reducing tracking accuracy. Moreover, variations in target size, appearance, and other attributes can cause detected boxes to deviate from the actual target dimensions, introducing further errors. Enhancing the accuracy and robustness of multi-object tracking therefore remains a core research problem.

In multi-object tracking scenarios, the relative positions of targets in the camera image do not reflect their true distance relationships, owing to the shooting angle. Specifically, the same pixel separation corresponds to a larger real-world distance between distant targets than between near ones. In addition, even when targets move at constant speed, far and near targets appear to move at different speeds in the image. Our experiments found that this discrepancy between image positions and actual positions increases errors in tracking algorithms: target paths that never cross in the real world can falsely appear to cross in the image, causing both visual and computational errors.

Real-time performance is a crucial requirement for multi-object tracking, especially for many downstream tasks that need rapid and accurate identification and tracking of targets. These applications typically require the tracking system to efficiently process large-scale data streams, adapt to changing environments, and provide instant decision support. In addition, rational allocation of computing resources also directly affects the efficiency of multi-object tracking systems. For example, in a video, more resources can be allocated to regions where targets are present or dense, while relatively fewer resources to other regions.

We propose two overarching improvements. First, we introduce a ground-based relative position mapping strategy that projects targets onto a latent plane which accurately reflects inter-target distances and motion relationships. Taking a reference point on the detection box, the mapping assigns each target coordinates on this latent plane, and these coordinates replace the original image coordinates for tracking. The fundamental idea is to use the ground as a stable reference: projecting targets onto a virtual plane improves the capture of relative positional relationships between targets and, consequently, elevates tracking accuracy.

Second, we develop a method that leverages a regional density model to allocate computational resources according to the target density within each region. Areas with high target density receive a larger share of resources for precise multi-object tracking, while areas with low target density receive proportionally fewer resources to avoid unnecessary computational overhead.

In this paper, by analyzing the reasons for tracking errors generated by previous trackers, we propose a simple and effective solution to enhance multi-object tracking, with the following main contributions:

1. The proposed target relative location mapping model strengthens the restoration of the actual position relationships between targets during tracking, effectively reducing the probability of ID switching errors in multi-object tracking tasks caused by factors such as target size, posture, and occlusion.

2. The proposed target region density model quantifies the target density in different regions of the video image, allowing the tracker to adaptively calculate an appropriate threshold for low-scoring detection boxes based on region density. This reduces interference from low-scoring detection boxes in low-density target areas and also lowers computational costs.

3. Using the target relative location mapping model to project targets in the video image onto a latent plane, we employ a standardized bounding box approach for the first time to uniformly label the actual positions of targets in the latent plane. This method can reduce the occurrence of occlusion phenomena due to inconsistent target box sizes and enhance the tracker’s perception of target movement speed and position changes, thereby improving tracking accuracy (Fig. 1).

Fig. 1 Overview diagram of relative location mapping tracking method

2 Related work

The current research in the field of multi-object tracking primarily consists of two main approaches: the Tracking by Detection (TBD) paradigm [3,4,5,6,7,8] and the joint detection and tracking (JDT) paradigm [9,10,11,12,13,14]. These two categories of research methods have different emphases and focuses.

2.1 Tracking by detection paradigm

The Tracking by Detection (TBD) paradigm is one of the most widely applied approaches in current multi-object tracking. It decomposes the multi-object tracking task into independent detection and tracking models. Typically, the detection phase uses state-of-the-art or specifically tailored object detection models such as Faster R-CNN [15], YOLO [16], and Transformer-based detectors [17], along with their latest optimized versions such as Cascade R-CNN [18], YOLOX [19], DINO [20], and Swin Transformer [21]. Object detection, as one of the fundamental research areas in computer vision, offers a wide range of models with outstanding detection speed and accuracy. Researchers following the TBD paradigm can therefore focus on target association, thereby improving the accuracy of multi-object tracking.

Based on how video data are processed over time, the TBD paradigm can be categorized into online tracking and offline tracking. Online tracking [22,23,24,25,26,27,28] is a real-time approach that extends target trajectories using only the current and previous video frames. In contrast, offline tracking [29, 30] processes all frames, or batches of frames, of a recorded video to solve the tracking problem. Since multi-object tracking is primarily applied in practical engineering settings such as autonomous driving, surveillance, and robotics, online tracking has the broader range of applications, and processing frames as they arrive usually yields higher processing efficiency.

In early research, SORT [31] used a Kalman filter to predict the future position of each target and the Hungarian algorithm to associate predictions with detections. While simple and effective, this method tends to lose targets and make mistakes under occlusion. To enhance association, DeepSORT [32] introduced deep appearance features to improve matching and reduce the impact of occlusion on tracking results. The IQHAT framework proposed in [30] allows multiple target identities to correspond to multiple targets and performs well under occlusion in crowded scenes. StrongSORT [4], an upgraded version of DeepSORT, introduces an appearance-free link model to connect short and complete trajectories and uses Gaussian-smoothed interpolation to compensate for missing detections, achieving better tracking performance. The method proposed in [3] balances detection and association weights, suggesting that shallow features yield commonalities of the same target more readily than deep features. Furthermore, the research in [6] handles low-score detections cautiously, checking whether they belong to occluded targets, which significantly improves tracking accuracy.
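To make the SORT-style association step concrete, the following minimal sketch pairs Kalman-predicted track boxes with detections via an IoU cost and the Hungarian algorithm; the box format, gating threshold, and function names are illustrative assumptions rather than any particular tracker's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def associate(predicted_boxes, detections, iou_threshold=0.3):
    """Match Kalman-predicted track boxes to detections with the Hungarian algorithm."""
    if not predicted_boxes or not detections:
        return [], list(range(len(predicted_boxes))), list(range(len(detections)))
    cost = np.array([[1.0 - iou(p, d) for d in detections] for p in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)  # minimises the total (1 - IoU) cost
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - iou_threshold]
    matched_p = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(predicted_boxes)) if i not in matched_p]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```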

2.2 Joint detection and tracking paradigm

With the continuous maturation of multi-task learning techniques, some studies [33,34,35] have begun to explore the more challenging paradigm of joint detection and tracking, which couples target estimation with identity recognition and moves beyond the traditional two-step pipeline. These studies integrate re-identification (Re-ID) branches into the backbone network to obtain features for each target. Early work often took a direct approach: target regions were cropped after detection and fed into pre-trained Re-ID models, with deep neural networks computing the feature representations. However, this significantly increased computational cost, since more parameters were needed to account for target variations and obtain clearer intra-class features. To address this issue, subsequent research explored more efficient feature extraction, aiming to reduce computational cost while maintaining recognition performance. Examples include sharing the features of anchor- or point-based detectors in the detection backbone [36], which reduces redundant computation of target image features. To cope with occlusion, some studies [37] detect only the visible parts of targets or split targets into several parts for sequential detection and comparison. Others make multiple predictions for a single detection box, and point-based detectors [25] alleviate the problems of heavily overlapping target boxes and of low-score boxes being discarded by non-maximum suppression (NMS).

In addition to traditional CNN approaches, breakthroughs in natural language processing (NLP) have encouraged the exploration of transformer-based single-model architectures in the image domain. The JDE method in [36] combines target detection and ID association in a single model with shared parameters, improving the efficiency of multi-target detection while learning association features, which to some extent reduces the impact of occlusion. MOTR [38] extends DETR and models temporal relationships in video sequences to improve tracking accuracy. CSTrack [39] analyzes the performance problems caused by overlooking the differences between detection and association when both are handled by a single model, and proposes the interactive network REN and the scale-aware attention network SAAN. CenterTrack [25] replaces target bounding boxes with center-point heatmaps, conditioning on the previous frame's RGB image and heatmap to track point trajectories; this reduces the complexity of detecting and tracking target boxes and achieves a good balance between efficiency and accuracy. The research in [5, 40,41,42] uses attention mechanisms to strengthen the connection between frames and improve association efficiency.

However, compared with the two-step detection-and-tracking paradigm, the joint paradigm often shows a drop in overall performance. The main reason is that adding a Re-ID branch to the backbone network to maintain target identity information consumes considerable computational and storage resources; these resources maintain potentially unnecessary identity information, introducing redundancy and hurting performance. Notably, the anchor-based design of one-shot trackers is ill-suited to extracting detection features and Re-ID features simultaneously, because the two tasks use different feature types and dimensions; this introduces additional complexity, increases the computational burden, and may degrade performance. The study in [43] analyzes why single-model multi-object trackers are less effective than dual-model structures and attributes the gap to the different feature requirements of detection and association: detection requires large inter-class distinctiveness and small intra-class distinctiveness, whereas association requires large intra-class distinctiveness to separate identities within a class. A single model therefore struggles to satisfy both requirements simultaneously.

2.3 This paper

This paper adopts the tracking by detection paradigm and employs a series of methods to address occlusion, thereby enhancing the performance and robustness of multi-object tracking systems. Firstly, a positional mapping operation is applied to the target boxes produced by detectors in different scenes, projecting the targets in the video image onto a ground-based mapping plane. This step takes the imaging geometry into account and helps alleviate tracking issues caused by occlusion, camera angle, and related factors. For detection, the YOLOX model [19], trained with additional data, is used, while drawing on the ByteTrack [6] strategy of associating low-scoring bounding boxes so that their information is not discarded, thereby improving target association accuracy.

Furthermore, we redesign the generation of standardized target bounding boxes on the mapping plane and introduce a target region density model to quantify areas prone to occlusion. By adaptively adjusting the size of the standardized bounding boxes and the threshold for low-scoring detection boxes according to the occlusion probability, the system adapts to various occlusion scenarios and achieves higher tracking accuracy. The region density model is also used to adjust the Kalman gain coefficient in real time, further strengthening tracking stability.

Experimental results demonstrate that these improvement methods effectively reduce the probability of identity switches and target losses during the target tracking process in crowded scenes. They provide new and effective approaches to address occlusion issues, enhancing the performance and robustness of multi-object tracking systems. These contributions are valuable for addressing complex multi-object tracking scenarios in practical applications such as surveillance, autonomous driving, and robot navigation.

3 Relative location mapping

Fig. 2 Schematic diagram of relative location mapping (RLM)

In video images, cameras typically capture pedestrians on roads from certain angles, with the camera usually aimed at crowded or high-traffic areas so as to cover the scene as comprehensively as possible. However, this shooting setup can cause target pedestrians to overlap and occlude one another in the image. To mitigate these effects, it is crucial to project the positions of targets in the video image onto a ground-based potential plane, on which positions are better aligned with the actual positions of targets in the scene. As illustrated in Fig. 2, this mapping operation lets us better understand the relative positions and motion directions of targets. The potential plane generated by the mapping helps alleviate issues caused by occlusion and enhances the accuracy and stability of the multi-object tracking system.

3.1 Relative location mapping model

We represent the position of a target in the image by the coordinates of the midpoint of the bottom edge of its bounding box, as illustrated in the vertical view of Fig. 3. The ratio of the vertical distance between the target position and the lower image border to the overall image height corresponds to the ratio between the target imaging angle \(\alpha _v\) and the camera's vertical viewing angle \(\beta _v\). Equation 1 expresses this relationship between the target imaging angle and the image coordinate, denoted by \(\mathcal {G}\). As the target moves across the video image, the corresponding real-world locations change nonlinearly; we interpret these differences as deformations introduced by the imaging process.

Fig. 3 Vertical mapping and horizontal mapping schematic

$$\begin{aligned} \alpha _v = \mathcal {G}(y) \end{aligned}$$
(1)

The variable y represents the distance between the target’s position on the video image and the bottom of the image. As y changes, the imaging angle \(\alpha _v\) undergoes nonlinear variations. The formula for the vertical mapping function \(\mathcal {G}(y)\) is as follows:

$$\begin{aligned} \mathcal {G}_v (y)=\left\{ \begin{array}{lr} \vspace{1em} \dfrac{\beta _v }{2}-\arctan [(1-\dfrac{2y}{H}) \tan \dfrac{\beta _v }{2}], &{} y<\dfrac{H}{2} \\ \dfrac{\beta _v }{2}+\arctan [(\dfrac{2y}{H}-1) \tan \dfrac{\beta _v }{2}], &{} y \geqslant \dfrac{H}{2} \\ \end{array} \right. \end{aligned}$$
(2)

In Eq. 2, H represents the total height of the video image, and \(\beta _v\) denotes the vertical viewing angle of the camera device. The camera device’s viewing angle is often fixed over a segment of monitoring video based on focal length and factory configurations. Alternatively, we can obtain an empirical viewing angle through methods such as deep learning algorithms.

From the formula, it can be observed that, within the camera's vertical viewing angle, the target imaging angle depends on the maximum vertical viewing angle of the camera and on the image height. This is shown in Fig. 4, where the horizontal axis is the target's vertical position y in the image (image height set to 1080 here) and the vertical axis is the angle \(\alpha _v\); the curves depict the relationship between target position and target imaging angle for different maximum vertical viewing angles \(\beta _v\). The larger the camera's viewing angle, the more pronounced the nonlinear relationship between the target's image position and its real position.
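As a minimal sketch, Eq. 2 can be transcribed directly as follows; angles are assumed to be in radians, and the function and parameter names are illustrative.

```python
import math


def imaging_angle_v(y, H, beta_v):
    """Vertical imaging angle alpha_v of Eq. 2 for a target whose bottom-centre
    pixel lies y pixels above the lower image border.

    H      -- image height in pixels
    beta_v -- vertical viewing angle of the camera, in radians
    """
    half = beta_v / 2.0
    if y < H / 2.0:
        return half - math.atan((1.0 - 2.0 * y / H) * math.tan(half))
    return half + math.atan((2.0 * y / H - 1.0) * math.tan(half))
```

At y = 0 this yields 0, at y = H it yields \(\beta _v\), and the growth in between is nonlinear, matching the curves in Fig. 4.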

Fig. 4 The relationship between y and \(\alpha _v \)

In terms of horizontal mapping, \(\alpha _h\) is calculated in a similar manner, as illustrated on the right side of Fig. 3. The angle size is related to the width W of the image and the maximum horizontal viewing angle \(\beta _h\) of the camera device. The formula is as follows:

$$\begin{aligned} \mathcal {G}_h (x)=\left\{ \begin{array}{lr} \vspace{1em} \dfrac{\beta _h }{2}-\arctan [(1-\dfrac{2x}{W}) \tan \dfrac{\beta _h }{2}], &{} x<\dfrac{W}{2} \\ \dfrac{\beta _h }{2}+\arctan [(\dfrac{2x}{W}-1) \tan \dfrac{\beta _h }{2}], &{} x \geqslant \dfrac{W}{2} \\ \end{array} \right. \end{aligned}$$
(3)

Here, x denotes the horizontal coordinate of the target position in the image.

The target mapping coefficient is the ratio between the target's position in the image and its position on the mapping plane. Owing to perspective, the relative positions in the image are a deformed version of the actual locations, so the same target imaging angle corresponds to different image and actual positions. As shown in Fig. 3, the ratios of \(x'\) and \(y'\) to x and y are the target mapping coefficients \(\varphi _h, \varphi _v\), as expressed in Eq. 4. These coefficients directly reflect the impact of image deformation on the relative positions of targets in actual space.

$$\begin{aligned} \begin{aligned} \vspace{1em} \varphi _v = \dfrac{y^{'}}{y} = \mathcal {F}_v (\alpha _v, \beta _v, \gamma )\\ \varphi _h = \dfrac{x^{'}}{x} = \mathcal {F}_h (\alpha _v, \beta _v, \gamma ) \end{aligned} \end{aligned}$$
(4)

Equation 4 represents the mapping coefficients for the target in the horizontal and vertical directions. It can be observed that these coefficients are related to the target imaging angle, the maximum viewing angle of the camera device, and the tilt angle.

As per the illustration in Fig. 3, the formula for calculating \(y'\) in the vertical direction is as follows:

$$\begin{aligned} \begin{aligned} \theta&= 90 - \gamma - \dfrac{\beta _v }{2} \end{aligned} \end{aligned}$$
(5)

\(\theta \) represents the angle between the lower edge of the camera device’s field of view and the vertical line. Assuming the camera device is at a height h above the ground, \(y'\) is given by:

$$\begin{aligned} \begin{aligned} y^{'}&= h* [\tan (\alpha _v + \theta )-\tan \theta ] \end{aligned} \end{aligned}$$
(6)

The formula for calculating y is as follows:

$$\begin{aligned} y=\left\{ \begin{array}{lr} \vspace{1em} \dfrac{h}{\cos \theta } *[\tan \dfrac{\beta _v }{2} -\tan (\dfrac{\beta _v }{2} - \alpha _v ) ], &{} \alpha _v < \dfrac{\beta _v }{2} \\ \dfrac{h}{\cos \theta } *[\tan \dfrac{\beta _v }{2} +\tan (\alpha _v - \dfrac{\beta _v }{2}) ], &{} \alpha _v \geqslant \dfrac{\beta _v }{2} \\ \end{array} \right. \end{aligned}$$
(7)

The formula for calculating \(\varphi _v\) is obtained as follows:

$$\begin{aligned} \varphi _v =\left\{ \begin{array}{lr} \vspace{1em} \dfrac{\cos \theta * [ \tan (\alpha _v + \theta )-\tan \theta ]}{\tan \dfrac{\beta _v }{2} - \tan (\dfrac{\beta _v }{2} - \alpha _v )}, &{} \alpha _v < \dfrac{\beta _v }{2} \\ \dfrac{\cos \theta * [ \tan (\alpha _v + \theta )-\tan \theta ]}{\tan \dfrac{\beta _v }{2} + \tan (\alpha _v - \dfrac{\beta _v }{2})}, &{} \alpha _v \geqslant \dfrac{\beta _v }{2} \\ \end{array} \right. \end{aligned}$$
(8)

The above equation represents the deformation ratio coefficient of the target’s vertical coordinate in the video image to its actual position in the vertical direction. This coefficient reflects the vertical deformation ratio of the video image at a specific point.

Fig. 5 Relationship between the camera-ground angle \(\gamma \) and the vertical mapping coefficient \(\varphi _v \)

Figure 5 shows the relationship between the vertical mapping coefficient \(\varphi _v\) and the tilt angle of the camera device. As the tilt angle decreases, the curve of the vertical mapping coefficient grows more steeply, which directly reflects the deformation characteristics of the video image.

Regarding the horizontal mapping coefficient, the schematic on the right side of Fig. 3 shows that horizontal deformation increases with vertical distance in the image. The horizontal mapping coefficient \(\varphi _h\) of Eq. 4 depends only on the target imaging angle, the maximum vertical viewing angle, and the tilt angle of the camera device. It can be expressed by the following formula:

$$\begin{aligned} \begin{aligned} \varphi _h = \dfrac{x^{'}}{x} = \dfrac{\cos \left|\dfrac{\beta _v }{2} -\alpha _v \right|* \cos \theta }{ \cos \dfrac{\beta _v }{2}*\cos (\alpha _v +\theta ) } \end{aligned} \end{aligned}$$
(9)

\(\theta \) is obtained from Eq. 5, and \(y'\) is given by Eq. 6.
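The two coefficients can be computed together, as in the sketch below, which transcribes Eqs. 5, 8, and 9; angles are assumed to be in radians, \(\alpha _v\) is assumed strictly positive, and the function name is illustrative.

```python
import math


def mapping_coefficients(alpha_v, beta_v, gamma):
    """Vertical and horizontal mapping coefficients phi_v, phi_h (Eqs. 8 and 9).

    alpha_v -- vertical imaging angle of the target (radians, > 0)
    beta_v  -- vertical viewing angle of the camera (radians)
    gamma   -- camera-ground tilt angle (radians)
    """
    theta = math.pi / 2.0 - gamma - beta_v / 2.0  # Eq. 5
    half = beta_v / 2.0
    numerator = math.cos(theta) * (math.tan(alpha_v + theta) - math.tan(theta))
    if alpha_v < half:                            # two branches of Eq. 8
        phi_v = numerator / (math.tan(half) - math.tan(half - alpha_v))
    else:
        phi_v = numerator / (math.tan(half) + math.tan(alpha_v - half))
    phi_h = (math.cos(abs(half - alpha_v)) * math.cos(theta)) / (
        math.cos(half) * math.cos(alpha_v + theta))  # Eq. 9
    return phi_v, phi_h
```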

Fig. 6 Schematic diagram of the relationship between horizontal mapping coefficient and vertical mapping coefficient

Figure 6 compares \(\varphi _v\) and \(\varphi _h\) for \(\beta _v = 120^{\circ }\) and \(\gamma = 20^{\circ }\). The mapping coefficients increase with the target imaging angle, and the increase in the horizontal direction is greater than in the vertical direction. This observation, however, only holds when \(\beta _v\) is sufficiently large; as \(\beta _v\) decreases, the growth of the horizontal mapping coefficient gradually approaches, and eventually falls below, that of the vertical coefficient.

At this point, the mapping coordinates expressing the relative relationship between the target’s video image position and its actual position are as follows:

$$\begin{aligned} \begin{aligned} \mathcal {R}_h(x) =&\dfrac{W}{2}(\varphi _{max} - \varphi _h ) + \mathcal {P}_{video}[x] * \varphi _h \\ \mathcal {R}_v(y) =&\mathcal {P}_{video}[y] * \varphi _v \end{aligned} \end{aligned}$$
(10)

Here, \(\mathcal {P}_{video}[x]\) and \(\mathcal {P}_{video}[y]\) denote the target's pixel coordinates in the video image. This completes the mapping from the video image to the actual ground plane.
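Putting the pieces together, a sketch of Eq. 10 using the two helpers above might look as follows; \(\varphi _{max}\) is treated here as a caller-supplied normalisation constant (for instance, the largest horizontal coefficient observed in the frame), which is an assumption on our part.

```python
def map_to_plane(x, y, W, H, beta_v, gamma, phi_max):
    """Project an image position (x, y) onto the latent ground plane (Eq. 10).

    x, y    -- target pixel coordinates (y measured from the bottom image border, cf. Eq. 2)
    phi_max -- normalisation constant of Eq. 10 (assumed: largest phi_h in the frame)
    Uses imaging_angle_v (Eq. 2) and mapping_coefficients (Eqs. 8-9) sketched above.
    """
    alpha_v = imaging_angle_v(y, H, beta_v)
    phi_v, phi_h = mapping_coefficients(alpha_v, beta_v, gamma)
    x_mapped = W / 2.0 * (phi_max - phi_h) + x * phi_h  # R_h(x)
    y_mapped = y * phi_v                                # R_v(y)
    return x_mapped, y_mapped
```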

3.2 Target region density

In different scenarios, targets appear in different parts of the image. Relative to the entire image, we are usually more interested in the specific regions where targets are present, and regions with fewer targets are less likely to suffer occlusion. To exploit this, we introduce a technique called target region density (TRD), similar in spirit to attention mechanisms in object detection. The method is particularly beneficial for multi-object tracking because it lets the tracking system quickly perceive potential occlusion and interference areas. Using the target region density, we can down-weight detection box selection in sparse regions and concentrate the tracking system's attention on regions where targets exist, which helps the tracker quickly capture tracking points that might otherwise be overlooked and thus improves accuracy.

Fig. 7 Illustration of target density matrix in different scenarios

Furthermore, the target region density allows us to adjust the standardized bounding boxes more consistently to support tracking and helps the tracker capture tracking points that might otherwise be missed, enhancing the overall accuracy of the tracking process. Importantly, its parameters are updated in real time, providing the flexibility to adapt bounding boxes dynamically to different scenes and requirements. The method therefore makes multi-object tracking systems more intelligent, accurate, and adaptable to diverse scenarios.

We divide the entire video image into a 3\(\times \)3 grid of nine regions; as visualized in Fig. 7, the resulting density map clearly shows the densest areas of the image. The density of each region is computed from the weighted portions of the target detection boxes that fall within it and is then normalized with respect to the densest region. The densities are accumulated over the positions, sizes, and scores of the detection boxes, with the bottom part of a box weighted more heavily than the upper part when a target spans multiple regions. The formula for calculating target region density is as follows:

$$\begin{aligned} W= & {} \left[ \begin{array}{ccc} \rho _1 &{} \rho _2 &{} \rho _3 \\ \rho _4 &{} \rho _5 &{} \rho _6 \\ \rho _7 &{} \rho _8 &{} \rho _9 \\ \end{array} \right] , Q=\left[ \begin{array}{ccc} 1 &{} 1 &{} 1 \\ 1 &{} 1 &{} 1 \\ 1 &{} 1 &{} 1 \\ \end{array} \right] \end{aligned}$$
(11)
$$\begin{aligned} \rho _{max}= & {} \max (\rho _n), \,\,\,\,\, \rho _n \in W \end{aligned}$$
(12)
$$\begin{aligned} \rho _n= & {} \sum _{i=1}^n (Bbox_{i} \times Score_{i} \times \varpi ) \end{aligned}$$
(13)
$$\begin{aligned} A_{\rho }= & {} \prod _{i=1}^m[W * (\frac{\rho _i}{\rho _{max}} * Q)] \end{aligned}$$
(14)

In Eq. 13, \(Bbox_i\) denotes the fraction of the i-th original bounding box that falls within the target region, \(Score_i\) is the score of that bounding box, \(\varpi \) is the bottom-weighting coefficient applied when accumulating the box's contribution, and n is the number of original bounding boxes intersecting the region. In Eq. 14, m is the number of target regions, \(\rho _i\) is the computed density of each region, \(\rho _{max}\) is the maximum region density, and Q is the mask matrix; multiplying by the region matrix W yields the normalized density matrix \(A_\rho \).
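A simplified sketch of Eqs. 11–14 is given below. The bottom-weighting value and the exact overlap bookkeeping are assumptions, and the product in Eq. 14 is reduced to a single element-wise normalisation by the densest cell.

```python
import numpy as np


def region_density(boxes, scores, img_w, img_h, bottom_weight=1.5):
    """3x3 target region density matrix A_rho (Eqs. 11-14), normalised by the
    densest cell. `boxes` are (x1, y1, x2, y2) detections in image coordinates
    (y grows downwards), `scores` their confidences, and `bottom_weight` is an
    assumed value for the coefficient that favours the lower half of a box."""
    density = np.zeros((3, 3), dtype=np.float64)
    cell_w, cell_h = img_w / 3.0, img_h / 3.0
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        area = max((x2 - x1) * (y2 - y1), 1e-9)
        y_mid = (y1 + y2) / 2.0  # split the box into an upper and a lower half
        for r in range(3):
            for c in range(3):
                rx1, ry1 = c * cell_w, r * cell_h
                rx2, ry2 = rx1 + cell_w, ry1 + cell_h
                ix = max(0.0, min(x2, rx2) - max(x1, rx1))
                top = max(0.0, min(y_mid, ry2) - max(y1, ry1))
                bottom = max(0.0, min(y2, ry2) - max(y_mid, ry1))
                # Bbox_i x Score_i x bottom weighting, accumulated per cell (Eq. 13)
                density[r, c] += score * ix * (top + bottom_weight * bottom) / area
    rho_max = density.max()
    return density / rho_max if rho_max > 0 else density  # normalisation of Eq. 14
```

The resulting matrix plays the role of \(A_\rho \) in Sect. 3.4, where it scales the low-score threshold \(\tau _{low}\) per region.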

3.3 Standardized bounding boxes

We have obtained the distribution of ground-based key points of the targets in the original video image, forming a map of their actual spatial relationships. In the original image, the depth of a target is typically conveyed by the size or area of its bounding box. In real-world scenarios, however, differences in physical shape, behavior, and carried items can make different targets produce similar-looking boxes, which can lead to unexpected identity switches, for example when a distant adult and a nearby child have bounding boxes of similar size.

Fig. 8 Comparison of annotated target positions for the same frames without RLM and after applying RLM with standardized bounding boxes

RLM can authentically reflect the physical spatial relationships of targets in the actual environment, independent of the area of the target bounding box. Therefore, after mapping to the new target position, we can use bounding boxes with a fixed width to represent the target’s location. However, it is essential to maintain the aspect ratio of the original bounding boxes in the image to increase the distinction between different targets. As shown in Fig. 8, the mapped standardized bounding boxes project the targets from the original video image onto a horizontal plane coordinate system. At the same time, they replace the original bounding boxes of different sizes with bounding boxes of the same width. This mapping operation helps better represent the target’s position, reduce identity switches, and enhance the accuracy of the multi-object tracking system.

The size of the mapped target standardized bounding box is determined by the mapping coefficients \(\varphi \), the region density \(A_{\rho }\), the aspect ratio \(\gamma \) of the original detected target box, and the video frame rate f. Our goal is to achieve optimal intra-class recognition performance while better avoiding identity switching issues caused by overlapping bounding boxes. Therefore, we position the bounding box at a balanced location between maximum scale and minimum overlap to achieve optimal performance.

$$\begin{aligned} \mathcal {L}_{arg}= & {} \frac{\sum _{i=2}^n(\mathcal {L}_i-\mathcal {L}_{i-1})}{n-1} \end{aligned}$$
(15)
$$\begin{aligned} W_i= & {} \mathcal {L}_{arg} * (\frac{ A_{\varphi i}}{A_{\rho i} * f c} ), \,\,\,\,A_{\rho i }>0,c\in \mathbb {R} \end{aligned}$$
(16)
$$\begin{aligned} S_{bbox}= & {} \overline{W} ^2 \gamma \end{aligned}$$
(17)

By continuously tracking image density, we can iteratively optimize the size of the standardized bounding boxes to improve tracking effectiveness. Equations 16 and 17 show how the density matrix is used to compute the width and area of the standardized bounding boxes. In these formulas, \(A_{\varphi i}\) is the mean matrix of the target mapping coefficients in the recognition region, \(\mathcal {L}_{arg}\) is the average spacing between targets in that region, \(\overline{W}\) is the average standardized bounding box width, f is the video frame rate, and c is a frame rate control constant. The formulas indicate that a higher video frame rate allows a smaller bounding box area.
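A sketch of Eqs. 15–17 follows; it assumes the mapped target positions of a region are supplied as a one-dimensional array for the spacing term, and the constant c is a caller-chosen value.

```python
import numpy as np


def standardized_box_size(mapped_positions, phi_mean, rho_region, fps, aspect_ratio, c=0.1):
    """Width and area of the standardized bounding box for one region (Eqs. 15-17).

    mapped_positions -- coordinates of the region's targets on the latent plane
    phi_mean         -- mean mapping coefficient of the region (A_phi)
    rho_region       -- normalised density of the region (entry of A_rho), > 0
    fps              -- video frame rate f
    aspect_ratio     -- aspect ratio of the original detection box
    c                -- frame rate control constant (assumed value)
    """
    pos = np.sort(np.asarray(mapped_positions, dtype=np.float64))
    l_arg = float(np.mean(np.diff(pos))) if pos.size > 1 else 1.0  # Eq. 15: average spacing
    width = l_arg * phi_mean / (rho_region * fps * c)              # Eq. 16
    area = width ** 2 * aspect_ratio                               # Eq. 17, single-region width in place of the mean
    return width, area
```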

3.4 Relative location mapping tracking algorithm

Based on the aforementioned models, the relative location mapping tracking algorithm maps the target positions in the surveillance video images to a potential plane and then computes the association relationships between the target points on that plane. The detailed implementation is elaborated below (Fig. 9).

Algorithm 3.1 Pseudo-code of RLMTrack

The algorithm presented in Algorithm 3.1 outlines the entire process of the tracker using this model for video image tracking tasks. This algorithm draws inspiration from previous work [3] that introduced the concept of low-scoring bounding boxes. Unlike previous methods, it employs a series of stages, including detection box filtering, relative location mapping, initial tracker association, secondary tracker association, and tracker output. The following provides a detailed explanation of these stages:

Detection Box Filtering    Firstly, the detector performs frame-by-frame detection on the video images. Then, it selects high-scoring target bounding boxes with scores greater than \(\tau _{high}\). These high-scoring bounding boxes anchor the target position coordinates to the bottom center of the bounding box.

Relative Location Mapping    Next, the tracker calculates the target region density matrix \(A_\rho \) based on the density of targets. After obtaining the target region density matrix, the algorithm filters out low-scoring target bounding boxes with scores greater than \(A_\rho * \tau _{low}\) based on the density of different regions. These bounding boxes will participate in subsequent tracking associations.

Initial Tracker Association    At this point, the positions of high-scoring and low-scoring bounding boxes are transformed to the mapping plane using the RLM model, yielding relative position coordinates \(\mathcal {R}_{high}, \mathcal {R}_{low}\). Subsequently, complete standardized bounding boxes \(\mathcal {D}_{high}, \mathcal {D}_{low}\) are generated from the relative position coordinates on the mapping plane. This step completes the relative location mapping transformation.

Secondary Tracker Association    The parameters of the transformed target bounding boxes and their region density values are fed into the Kalman filtering algorithm to calculate the predicted position of the next frame’s target in the mapping plane. The tracker performs association operations on the target positions for each frame, first associating high-scoring target bounding boxes and then supplementing the tracking task with low-scoring bounding boxes to obtain \(\mathcal {T}'\).

Tracker Output    Finally, the associated results need to undergo the reverse operation of relative location mapping to obtain the final set of tracked target bounding boxes \(\mathcal {T}\), completing the tracking task.
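As a concrete illustration of the first two stages, the following sketch splits the detections into the high-score set and the density-gated low-score set used in the two association rounds; the thresholds match the defaults in Sect. 4.1, while the function name and the grid lookup are our own assumptions.

```python
def filter_detections(detections, a_rho, img_w, img_h, tau_high=0.6, tau_low=0.3):
    """Split detections into high- and low-score sets for the two association
    rounds; the low-score threshold is scaled by the density of the 3x3 region
    containing the box's bottom-centre anchor point."""
    high, low = [], []
    for box, score in detections:
        x1, y1, x2, y2 = box
        cx, cy = (x1 + x2) / 2.0, y2              # bottom-centre anchor point
        r = min(int(cy / img_h * 3), 2)           # row of the 3x3 density grid
        c = min(int(cx / img_w * 3), 2)           # column of the 3x3 density grid
        if score > tau_high:
            high.append((box, score))
        elif score > a_rho[r, c] * tau_low:       # density-adaptive low threshold
            low.append((box, score))
    return high, low
```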

This algorithm framework effectively addresses occlusion issues in multi-object tracking. Through the introduction of the relative location mapping model and optimization of the target region density model, the accuracy and robustness of the tracking system are improved.

4 Experiments

4.1 Configuration

Datasets and Backbone    Commonly used datasets in the multi-object tracking domain include MOT17 [44] and MOT20 [45], which provide both training and testing data. However, since a separate validation set is not available, we adopt the approach of [46] and split the training set into two halves: the first half is used for training, and the second half is used to validate the algorithm's tracking performance.

The MOT17 dataset consists of 7 training videos and 7 testing videos, with video lengths ranging from 7 to 90 s and frame rates between 14 and 30 frames per second; most videos have a resolution of 1920\(\times \)1080. MOT20 is a dataset for multi-person tracking in crowded and complex scenes, with an average of about 170 people per frame. For improved detection performance, we train on a mixture of the CrowdHuman, CityPersons, and ETHZ datasets under the private detection protocol. To achieve strong tracking results in a consistent validation environment, we employ the YOLOX [47] object detection framework as the backbone; it offers outstanding detection accuracy and speed, meeting the real-time demands of multi-object tracking.

Evaluation Metrics    For validation evaluation, we utilize the main evaluation measures provided by MOTChallenge, including Multiple-Object Tracking Accuracy (MOTA), IDF1 [48] score, Higher-Order Tracking Accuracy (HOTA) [49], false positives (FP), detection accuracy (DetA), and other dimensions. The TrackEval [50] tool is used to verify the evaluation metrics. These evaluation metrics contribute to a comprehensive assessment of the performance of our multi-object tracking algorithm.

Implementation Details    In our experiments, we use 8 NVIDIA Tesla V100S GPUs for parallel training, and testing is performed on a single card. The experimental code is implemented with the PyTorch [51] framework.

By default, we set the threshold \(\tau _{high}\) to 0.6 and the base value of \(\tau _{low}\) to 0.3, which is then scaled according to the region density. Lost trajectories are kept for 30 frames so that reappearing targets can be re-associated.

The RLM tracking algorithm requires preliminary estimation of camera variables. Typically, in target recognition work under surveillance environments, a fixed-angle monitoring camera system is used to obtain stable video shots. From Eq. 9, it is clear that the camera’s field of view and tilt angle affect the mapping relationship of targets in the image. In the experiment, we manually set the camera parameters to adapt to the current video scene. This helps to obtain more accurate target mapping positions.

$$\begin{aligned} A= & {} \left[ \begin{array}{cc} a_{11} &{} a_{12} \\ a_{21} &{} a_{22} \\ \end{array} \right] ,\quad B = \left[ \begin{array}{c} b_{1} \\ b_{2} \\ \end{array} \right] \end{aligned}$$
(18)
$$\begin{aligned}{}[x',y']^T= & {} A\cdot [x,y]^T + B \end{aligned}$$
(19)
$$\begin{aligned} M_{k-1\vert k}= & {} [A_{k-1\vert k}\quad B_{k-1\vert k}] = \left[ \begin{array}{ccc} a_{11} &{} a_{12} &{} b_{1} \\ a_{21} &{} a_{22} &{} b_{2} \\ \end{array} \right] \end{aligned}$$
(20)

Preprocessing    The MOT17 and MOT20 datasets contain scenes with camera motion and jitter. Although the camera's field of view remains unchanged, slight oscillations in its tilt angle can occur, leading to irregular variations in the trajectories of moving objects in the video; camera shake also causes changes between image frames, reducing tracking accuracy. To preserve the continuity of object motion as much as possible, we apply real-time affine transformation to the videos using the OpenCV [52] and vidgear [53] libraries, so that the original object trajectories are retained to the greatest extent. The affine transformation is given in Eq. 19, where matrix A represents the rotation and scaling of the image and matrix B the translation. The matrix \(M_{k-1\vert k}\) in Eq. 20 denotes the transformation from the \((k-1)\)-th frame to the k-th frame of the video.
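A sketch of how the frame-to-frame matrix \(M_{k-1\vert k}\) can be estimated with standard OpenCV calls (sparse feature tracking followed by a partial affine fit) is shown below; the feature parameters are illustrative, and the paper's actual stabilisation pipeline via the vidgear library may differ.

```python
import cv2
import numpy as np


def stabilize_to_previous(prev_frame, curr_frame):
    """Estimate the affine matrix M = [A | B] between consecutive frames
    (Eqs. 18-20) and warp the current frame back onto the previous view.
    Feature-tracking parameters below are illustrative choices."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

    # Track sparse corner features from the previous frame into the current one
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                       qualityLevel=0.01, minDistance=30)
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None)
    good_prev = pts_prev[status.flatten() == 1]
    good_curr = pts_curr[status.flatten() == 1]

    # 2x3 matrix [A | B]: linear block A (rotation/scale) and translation B
    M, _ = cv2.estimateAffinePartial2D(good_curr, good_prev)
    if M is None:  # fall back to the identity transform if estimation fails
        M = np.hstack([np.eye(2), np.zeros((2, 1))])

    h, w = curr_frame.shape[:2]
    return cv2.warpAffine(curr_frame, M, (w, h)), M
```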

4.2 Ablation experiments

The main objective of the ablation experiments is to validate the combination of the proposed model algorithm with various trackers and demonstrate the performance improvement and underlying reasons of the RLM algorithm in mapping video images in different directions. Additionally, the experiments investigate the performance enhancement of the TRD model across various scene distributions in the MOT20 dataset, especially in scenarios with an increased number of targets. These experiments aim to validate the effectiveness and robustness of the proposed approach, providing a better understanding of its application and performance in diverse contexts.

Table 1 Baseline ablation experiments on the MOT17 dataset: the effectiveness of the RLM and standardized bounding box methods is validated with ByteTrack and BoT-SORT as baseline models

Baseline Ablation Experiments    In this section of ablation experiments, we systematically tested the proposed RLM algorithm and the TRD model to evaluate their impact on the performance of multi-object trackers. Two baseline models were employed: ByteTrack, based on the YOLOX detector, and BoT-SORT, based on YOLOv7. The purpose of Table 1 is to analyze and quantify the performance improvement achieved by our new methods on these two distinct baseline models. This analysis helps in better understanding their roles in multi-object tracking tasks.

Table 2 Impact of RLM on tracker performance in horizontal and vertical mapping
Fig. 9 Comparing the impact of horizontal and vertical mapping on tracking performance across MOT17 datasets

Impact of Position Mapping Direction    Different directions of relative location mapping reflect differing degrees of visual occlusion from different angles. This experiment verifies the influence of horizontal and vertical RLM on the tracker. Figure 3 illustrates the spatial relationships mapped in the two directions, while Table 2 evaluates the improvement in tracker performance under the different mapping directions. Targets on the potential plane were labeled with the standardized bounding box method described in Sect. 3.3. According to the results, the mapping effect in the vertical direction is more significant than in the horizontal direction, possibly because the vertical extent of a target box is larger than its horizontal extent.

Fig. 10 Investigating the relationship between changes in camera pitch angles (horizontal axis) under different scenes in the MOT17 dataset and the resulting variations in the MOTA values (vertical axis). The red dashed line represents the MOTA values obtained using the baseline model in that particular scene

Impact of Dynamic Camera Parameters    Depending on the different visual angles and tilt angles of the camera device, the mapping results calculated by the RLM model will vary (as shown in the charts in Sect. 3.1). This experiment verified the influence of angles on tracker performance. Assuming a fixed maximum visual angle of 60\(^\circ \) for the camera device, the relationship between the tilt angle of the camera device and the tracker’s performance is shown in Fig. 10, where the horizontal axis represents the tilt angle of the camera device, and the vertical axis represents the MOTA results.

It can be observed that different scenes have different tilt angles for the camera device, causing a change in the relative position of the target in the image. When using the RLM algorithm, specific configuration parameters based on the camera device’s tilt angle need to be considered. In practical applications, when the camera device is stationary, the tilt angle is fixed, achieving optimal RLM mapping results under such conditions.

Table 3 Comparative evaluation of the TRD method on the MOT17 and MOT20 datasets with/without the ReID processing step

TRD Multi-Scene Effect Experiment    In this section, the RLMTrack algorithm is evaluated on the MOT17 and MOT20 datasets to assess the degree of performance enhancement provided by the TRD method on each dataset. A comparative analysis is also conducted to contrast processing speed with and without ReID. According to the experimental results in Table 3, the improvement of the TRD method on MOT17 is slightly higher than on MOT20. Our analysis suggests that the dense distribution of individuals in MOT20 increases the baseline weight of the main regions of the image during association, so that most regions of the image still require a high level of computational resources for association.

4.3 Evaluation results

Fig. 11 Comparison between ID switches caused by overlapping false positive bounding boxes and those corrected using RLM

We compared the methods and results of multi-object tracking tasks on the MOT17 and MOT20 datasets using the evaluation tool TrackEval provided by MOTChallenge and marked the best results.

Table 4 Comparison of evaluation metrics for different methods on MOT17 dataset

MOT17 Dataset    In Table 4, we compare various SORT-style tracking methods on the MOT17 validation set. Our method improves the tracking accuracy of several algorithms on MOT17: it performs better on the HOTA and IDF1 metrics, while achieving precision and speed comparable to ByteTrack on the MOTA and FPS metrics. This is because RLM adds little computation and thus barely affects the model's runtime efficiency. As shown in Fig. 11, overlapping false positive boxes lead to ID switches in ByteTrack; with RLM, false positive boxes are mapped further away from the target, effectively reducing the occurrence of ID switches.

Table 5 Comparison of evaluation metrics for different methods on MOT20 dataset

MOT20 Dataset    In Table 5, it can be observed that, compared with several algorithms, our method shows significant improvements in the HOTA, IDF1, and IDs metrics on the MOT20 dataset, but a slight decrease in MOTA compared with ByteTrack. This is because the RLM algorithm, after mapping to the target plane, increases the actual distance between corresponding detection boxes; for distant targets in particular, the relationships between targets are amplified. In addition, low-scoring detection boxes outside the predicted trajectories are discarded during association. Because MOT20 is very dense, the discarded detection boxes reduce DetA and consequently the MOTA score. However, retaining these low-scoring boxes would increase the likelihood of ID switches; after weighing both effects, our algorithm discards them to achieve better overall tracking performance.

Limitations    Association algorithms based on the Kalman filter largely rely on the accuracy of the underlying detector. When detection accuracy is low or the video exhibits strong jitter, ID switches are prone to occur, which is a drawback of this class of methods. In our experiments, the target region density method effectively reduced ID switches in high-density areas; the trade-off is that some low-scoring but valid detection boxes are discarded. The method proposed in this paper nevertheless offers a favorable balance between real-time tracking speed and tracking accuracy.

5 Conclusion

In this paper, we introduced an enhanced multi-object tracker called RLM-Tracking. This tracker utilizes the relative location mapping method to map targets from the original video image to a top-view mapping plane. It incorporates a region density model to optimize target box filtering and standardize bounding box generation, thereby improving the accuracy of multi-object tracking.

This approach not only holds potential for the field of multi-object tracking but also offers performance enhancements at relatively low cost in related domains. Regarding the dynamic acquisition of camera parameters mentioned in the paper, we will continue this line of research to make further contributions to multi-object tracking. We hope our work will drive further progress in this area.