A unified multi-view multi-person tracking framework

Despite significant developments in 3D multi-view multi-person (3D MM) tracking, current frameworks target either footprint tracking or pose tracking, but not both. Frameworks designed for the former cannot be used for the latter, because they directly obtain 3D positions on the ground plane via a homography projection, which is inapplicable to 3D poses above the ground. In contrast, frameworks designed for pose tracking generally isolate multi-view and multi-frame associations and may not be sufficiently robust for footprint tracking, which utilizes fewer key points than pose tracking and therefore offers weaker multi-view association cues in a single frame. This study presents a unified multi-view multi-person tracking framework to bridge the gap between footprint tracking and pose tracking. Without additional modifications, the framework can adopt monocular 2D bounding boxes and 2D poses as its input to produce robust 3D trajectories for multiple persons. Importantly, multi-frame and multi-view information are jointly employed to improve association and triangulation. Our framework is shown to provide state-of-the-art performance on the Campus and Shelf datasets for 3D pose tracking, with comparable results on the WILDTRACK and MMPTRACK datasets for 3D footprint tracking.


Introduction
A wide range of real-life applications (e.g., human-computer interaction) demand 3D person trajectories (i.e., footprint or pose trajectories). To obtain high-quality 3D person trajectories, multi-view camera systems have been developed to overcome challenges (e.g., occlusions) encountered in monocular systems. By leveraging multi-view information, 3D Multi-view Multi-Person Tracking (3D MM-Tracking) can recover partial 3D information that is not observed in a monocular view. Under the popular tracking-by-detection paradigm, 3D MM-Tracking can be approached by first obtaining camera parameters and monocular 2D features (e.g., bounding boxes or poses), and then associating multi-view 2D features to generate 3D tracklets. In this work, we focus on how to associate 2D features and generate 3D tracklets.
The cameras are synchronized and work together to identify and track people as they move around in the environment. Depending on the application, the desired 3D person trajectory could be a 3D footprint trajectory generated from multi-view 2D bounding boxes, or a 3D pose sequence obtained from multi-view 2D poses. Although numerous works [1][2][3][4][5][6][7][8] have contributed to 3D MM-Tracking, existing frameworks are designed separately, either for 3D footprint tracking using 2D boxes or for 3D pose tracking using 2D poses. Consequently, each family of methods retains its own limitations, and their respective merits have not been shared.
On one hand, frameworks designed for tracking footprints [2,3,9,10] appear to weakly connect multi-frame and multi-view associations to improve tracking robustness. However, these methods generally assume that individuals are on flat ground and that their footprints can be directly computed from a monocular 2D bounding box with a homography projection, which is inapplicable to 3D poses above the ground and limits their applicability. On the other hand, frameworks designed for tracking poses [4][5][6][7] generally isolate multi-view and multi-frame associations. Because of the 2D-to-3D ambiguity and the complexity of human articulation, they could encounter difficulties in associating multi-view 2D features correctly in a single frame. Moreover, performing multi-view association on 2D bounding boxes with these methods might be more difficult than on 2D poses, because 2D bounding boxes have fewer key points than 2D poses, which weakens the multi-view association cues in a single frame [11].
To integrate the strengths and mitigate the limitations of the aforementioned methods, we propose a Unified Multi-view Multi-person Tracking framework (cf. Figure 1). Creating a unified framework is not simply a matter of combining existing methods; we also bring new insights to enhance the robustness and efficiency of our framework.
We suppose that multi-view and multi-frame associations are highly correlated: multi-view geometric constraints can exclude false detections and improve multi-frame association in a monocular view, whereas multi-frame association in each view can compensate for the effect of noise and outliers that hamper multi-view association. Therefore, we attempt to jointly utilize multi-frame and multi-view information in our framework. In detail, we traverse the entire video with sliding windows to perform online processing. In each sliding window, we first connect 2D positions to 2D tracklets, and then compute the cross-view consistency between multi-view 2D tracklets using our normalized epipolar distance, which is independent of the projection variance. Using the consistency distance, multi-view 2D tracklets are associated to clusters using our Propagable Distance-based Non-parametric Clustering (PDNC), which can propagate a calculable distance to compensate for an incalculable distance under spatiotemporal constraints. Subsequently, we obtain the 3D positions by applying our Collaborative Multi-frame Multi-view Triangulation (CMMT), which consolidates multi-frame multi-view information to calculate 3D positions and reject outliers. Finally, we link 3D tracklets between sliding windows. Regardless of the use of 2D poses or 2D bounding boxes, our proposal can generate high-quality 3D person trajectories.

Table 1 Notations used in this study.
$x_{t,i}$: 2D position with index $i$ at frame $t$. Its camera ID is $c_i$ and its 2D coordinate is $(x_{t,i}, y_{t,i})$. Its horizontal and vertical scales are $w_{t,i}$ and $h_{t,i}$.
$l_{c_j}(x_{t,i})$: Epipolar line of $x_{t,i}$ in camera $c_j$.
$d_{cross}(x_{t,i}, x_{t,j})$: Consistency distance between cross-view 2D positions.
$T_{k,i}$: 2D tracklet with index $i$, cropped by the sliding window anchored at keyframe $k$. Its active frames form a set $\Psi_i$ and its camera ID is $c_i$. If it is assigned to cluster $p$ in multi-view association, it is further denoted as $T_{k,p,i}$.
$\nu$: Window size of the sliding window.
$\delta$: Step size of the sliding window.
$S(T_{k,i}, T_{k,j})$: Set of distances between $T_{k,i}$ and $T_{k,j}$ within the sliding window anchored at keyframe $k$.
$\Omega_k$: Set of clusters at keyframe $k$.
$\omega_{k,i}$: Cluster at keyframe $k$ with index $i$. Multi-view 2D tracklets belonging to the same person are assigned to the same cluster.
$M$: Assignment matrix, which is a Boolean matrix. When row $i$ is assigned to column $j$, $M_{i,j} = 1$.
For better readability, we summarize the notations used throughout this study in Table 1.
In summary, we present the following contributions:
• We propose a Unified 3D MM-Tracking framework that integrates the strengths and mitigates the limitations of pose and footprint tracking. The framework also offers real-time performance on realistically sized problems.
• We introduce a normalized epipolar distance that, for the first time, makes cross-view consistency independent of the projected 3D-to-2D variance.

In general, appearance and geometric consistency are two important assumptions used for multi-object tracking (MOT). The previous appearance of an identical object should be similar to its current appearance (i.e., appearance consistency), and its previous location and shape added to its estimated motion should approximate its current location and shape (i.e., geometric consistency). While appearance-based MOT methods [14][15][16] have achieved promising performance, recent appearance-free MOT solutions [17,18] prove that using only geometric features can also provide robust tracking results on multiple difficult MOT datasets [19]. In this work, we recommend using the appearance-free approach to achieve fast online processing, but we also illustrate that adding an appearance feature can improve the tracking performance on some datasets.

3D MM-Tracking
We list two types of 3D MM-Tracking architectures related to our work in Figure 2. We mainly focus on how to associate multi-view 2D features and generate 3D tracklets. Therefore, we suppose that multi-view 2D observations (e.g., 2D bounding boxes or poses) are given and will not be discussed.
Type (1) related works [2,3,9,10] focus on obtaining 3D footprints. They often require all persons to stay on a planar ground with zero height. Without using multi-view stereo pairs, 3D footprints are directly generated by referring to the bottom point of the 2D bounding box in a monocular view (Figure 3). In the final step, 3D tracklets mapped from different views are clustered and merged in an offline manner. However, such an offline solution could be problematic: if identity switches are encountered in a pre-obtained 2D tracklet, then such a 2D tracklet cannot be matched with others; therefore, it will be discarded entirely. Our method uses the sliding window to handle this problem. In each sliding window, the effects of incorrect 2D tracklets tend to be mitigated, which means that, instead of rejecting an entire 2D tracklet, our method can locate and reject incorrect parts but keep the correct parts (sec. 3.3). Moreover, when we aim to obtain 3D trajectories of body poses or 3D footprints on a stair, Type (1) methods may not work properly. Despite this defect, Type (1) methods provide an important insight: utilizing 2D tracklets rather than single-frame 2D positions could form more robust features to verify cross-view consistency.

Fig. 3 Limitation of generating the 3D position with the homography projection. $H_{c_i}$ is the homography projection matrix from the camera $c_i$ to the 2D ground plane. The 3D position above the ground plane cannot be obtained with the homography projection to the 2D ground plane, which was applied in previous works [2,3,9,10]. Panel annotations (Camera 0, Camera 1, ground plane): the projection of the 2D foot position is the 3D position on the plane ground; the projection of the 2D foot position is not the 3D position above the plane ground.

Type (2) related works [4][5][6][7] delve into the problem of tracking 3D pose trajectories more than 3D footprints, and most of them are online methods. To some extent, associating arbitrary positions (e.g., poses) is fundamentally different from associating footprints on the flat ground: it requires that multi-view stereo pairs are observable so that the 3D position can be calculated with triangulation. To obtain multi-view stereo pairs, they perform a single-frame multi-view association without temporal constraints. At every single frame, 3D positions are independently generated from single-frame 2D multi-view stereo pairs. Finally, the multi-frame association is employed to connect single-frame 3D positions to 3D trajectories. However, due to 2D-to-3D ambiguity and complicated human articulation, correctly associating multi-view 2D features in a single frame could be difficult.
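The homography-based footprint computation used by Type (1) methods can be illustrated with a minimal sketch. The box layout `(x_min, y_min, x_max, y_max)`, the helper names and the example matrix `H` are assumptions made for illustration; they are not taken from the cited works.

```python
def foot_point(box):
    """Bottom-centre of a 2D box given as (x_min, y_min, x_max, y_max)."""
    x_min, _, x_max, y_max = box
    return ((x_min + x_max) / 2.0, y_max)


def apply_homography(H, point):
    """Map an image point to the ground plane with a 3x3 homography H."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity")
    return (u / w, v / w)


# Hypothetical homography: 0.01 ground units per pixel plus an offset.
H = [[0.01, 0.0, -1.0],
     [0.0, 0.01, -2.0],
     [0.0, 0.0, 1.0]]
box = (100.0, 50.0, 140.0, 250.0)
footprint = apply_homography(H, foot_point(box))
print(footprint)
```

The mapping is only valid for image points whose 3D counterparts lie on the ground plane, which is why the same recipe cannot recover joints above the ground.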
Taking the merits of Types (1) and (2), we introduce an online framework that combines the strengths and mitigates the limitations of the two approaches (see Table 2). We inherit the idea of constructing 3D positions from paired cross-view positions using triangulation, but modifications are made by collaboratively enforcing multi-frame multi-view association, 3D position calculation, and outlier rejection. Our framework helps to enhance the robustness of 3D MM-Tracking with 2D bounding boxes and 2D poses as inputs.
Notably, we skip the discussion of 3D MM-Tracking methods (e.g., [8]) that model the entire pipeline with neural networks. In their settings, the weights of the neural network encode the 3D structure of the reference views and often cannot generalize to new scenes, which have different camera settings and 3D space. These methods may need to retrain their models to fit the new scene. Moreover, it is challenging to apply those methods to a large-scale space in the wild, since the entire observation space needs to be embedded in the neural network features. We aim to enable better generalization in our unified framework.

Method
Figure 4 illustrates the overall architecture of our framework. We divide our framework into three phases: Collaborative Multi-frame Multi-view Association (sec. 3.1), Collaborative Multi-frame Multi-view Triangulation (sec. 3.2), and Link Sliding Windows (sec. 3.3). The details of each phase are introduced in the following parts.

Collaborative Multi-frame Multi-view Association
Presuming that multi-view videos are synchronized, the projective trajectories of a person in each view should be consistent in terms of positions and motions. Hence, multi-view 2D tracklets belonging to the same person can be matched by referring to the cross-view consistency.
We begin with a simple case to explain how to determine multi-view consistency. In a single frame, the consistency distance between cross-view 2D positions is obtained by calculating the distance between one 2D position and the epipolar line constructed from the 2D position in another view. As introduced in previous works [5,11,20], such a calculation is given by Eq. (1), where $d_{cross}$ denotes the consistency distance between multi-view 2D positions $x_{t,i}$ and $x_{t,j}$ at frame $t$, and $d(x_{t,i}, l_{c_i}(x_{t,j}))$ is the epipolar distance from the 2D position $x_{t,i}$ to the epipolar line $l_{c_i}(x_{t,j})$. Furthermore, we denote $c_i$ as the camera ID of $x_{t,i}$ and $F_{c_j,c_i}$ as the fundamental matrix that maps a point from camera $c_j$ to $c_i$. However, this formula does not account for the fact that the 2D projective scale of the same 3D distance varies as a function of the object-to-camera distance. Because of the detection bias, the multi-view 2D positions may not match perfectly. Thus, their consistency distance is large for a small object-to-camera distance, and vice versa. Therefore, using Eq. (1) leads to two limitations. First, multi-view consistency cannot be properly determined in a large viewing space, because the 2D projective scale divergence is notable. Second, the hyperparameter for multi-view association needs to be adjusted after the cameras have been relocated. Although the previous method [20] applied the mean and standard deviation of a batch of cross-view distances for normalization, their normalization seems to reduce only the inner-batch variance rather than the 3D-to-2D projected variance caused by the object-to-camera distance, which could increase the difficulty of selecting a constant epipolar-distance threshold for the cross-view association.

Fig. 6 Illustration of Distance Initialization in Propagable Distance-based Non-parametric Clustering (PDNC). We streamline the cross-view association as a non-parametric clustering problem. From Eq. (4), the cross-view consistency distance between tracklets 0 and 1 is set to inf because they are in the same camera view and overlap in the temporal domain. The distance between tracklets 2 and 3 is incalculable since there is no temporal overlap between them. However, our PDNC can propagate a calculable distance to compensate for the incalculable distance under spatiotemporal constraints. Panel titles: conventional clustering may fail with an incomplete distance matrix; our PDNC constructs a complete distance matrix with ∅.
To solve this problem, we normalize the epipolar distance of Eq. (1) and present the modification given in Eq. (2) (cf. Figure 5), where $w$ and $h$ are the horizontal and vertical scales of the bounding boxes, respectively. When 2D bounding boxes are utilized, we select their center point to calculate the person-to-person distance. When 2D poses are given, we generate 2D bounding boxes that encompass the 2D poses to obtain their scales. We also apply Eq. (2) to each joint point and take the average value to represent the person-to-person distance. Unlike traditional approaches [3][4][5][6][7], our epipolar distance is normalized so that it is independent of the 3D-to-2D projected variance caused by the object-to-camera distance.
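To make the idea of a scale-normalized epipolar distance concrete, the following is a minimal sketch rather than the exact formula of this work. It assumes a symmetric point-to-epipolar-line distance in which each term is divided by the mean box scale $(w + h)/2$; the function names, the choice of normalizer and the example fundamental matrix are all assumptions.

```python
import math


def epipolar_line(F, point):
    """Homogeneous epipolar line l = F (u, v, 1)^T for a 3x3 matrix F."""
    u, v = point
    return tuple(F[r][0] * u + F[r][1] * v + F[r][2] for r in range(3))


def point_to_line(point, line):
    """Perpendicular pixel distance from a point to the line a*u + b*v + c = 0."""
    a, b, c = line
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * point[0] + b * point[1] + c) / norm


def transpose(F):
    return [[F[c][r] for c in range(3)] for r in range(3)]


def normalized_epipolar_distance(x_i, box_i, x_j, box_j, F_ji):
    """x_*: (u, v) position; box_*: (w, h) scales.
    F_ji maps a point of camera j to its epipolar line in camera i.
    ASSUMPTION: each term is divided by the mean box scale (w + h) / 2."""
    d_i = point_to_line(x_i, epipolar_line(F_ji, x_j)) / (sum(box_i) / 2.0)
    d_j = point_to_line(x_j, epipolar_line(transpose(F_ji), x_i)) / (sum(box_j) / 2.0)
    return 0.5 * (d_i + d_j)


# Hypothetical rectified stereo pair: epipolar lines are horizontal image rows.
F = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
near = normalized_epipolar_distance((100.0, 200.0), (40.0, 80.0),
                                    (300.0, 206.0), (40.0, 80.0), F)
far = normalized_epipolar_distance((100.0, 200.0), (10.0, 20.0),
                                   (300.0, 206.0), (10.0, 20.0), F)
print(near, far)
```

In this toy setting the same six-pixel offset counts for less when the person is large in the image (close to the camera) and for more when the person is small, which is the behaviour a single threshold needs.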
We have now introduced the consistency distance for multi-view 2D positions in a single frame. Most related works [4,5,7,8] independently associate multi-view information in a single frame, which could be suboptimal: different persons may have similar cross-view consistencies at a single frame, and the correct assignment may not always be established. We form a collaborative multi-frame multi-view association to improve the robustness of cross-view association. By jointly measuring the cross-view consistency in multiple frames, the cross-view inconsistency could be increased for different persons but decreased for identical persons. Consequently, a robust association can be achieved.
We are not the first to jointly perform multi-frame multi-view association. Existing solutions [21,22] jointly optimized multi-frame multi-view association via 4D graphs formed with multi-frame multi-view information. However, their computational complexity increases dramatically when more frames are involved in the 4D graphs. More specifically, since every two cross-frame cross-view detections can form an edge, the number of edges in their association graph follows the combination formula over all detections and thus grows quadratically with the number of detections. For videos with high recording rates, the inclusion of more frames enables the acquisition of more distinct motion information for a better multi-view association. Here, we consider an alternative formulation that ensures the efficient use of spatiotemporal information from arbitrary frames: by calculating the cross-view distance between 2D tracklets that involve multi-frame information, the computation required to jointly optimize multi-frame multi-view association is no longer tied to the number of frames, but only to the number of multi-view 2D tracklets, thus improving its real-world applicability. Although we utilize information from multiple frames, the number of edges in our association graph is therefore reduced.

Algorithm 1 Propagable Distance-based Non-parametric Clustering (PDNC)
Inputs:
a) 2D tracklets cropped by the sliding window anchored at keyframe $k$. A 2D tracklet can be represented as $T_{k,i}$, where $i$ is its index.
b) Epipolar distance between multi-view 2D tracklets, indexed by the tracklet indices $i$ and $j$.
Outputs:
Final clustering set at keyframe $k$, as $\Omega_k$.
1: Initialize: Assign each 2D tracklet to a cluster $\omega_{k,i} \leftarrow T_{k,i}$, where $\omega_{k,i} \in \Omega_k$ and $i = 1, \dots, n$.
…
for $\omega_{k,q} \in \Omega_k$ and $p \neq q$ do
7: … ▷ Conventional Complete-linkage Clustering approach
8: …
end for
11: end while

In our framework, we run multiple single-camera MOT trackers simultaneously (e.g., SORT [23]) to obtain 2D tracklets for each camera view. To achieve parallel single-camera tracking, we utilize sliding windows to pass through the obtained 2D tracklets. Within a sliding window whose center is anchored at keyframe $k$ (cf. Figure 4), the distance between cross-view 2D tracklets becomes a set of normalized epipolar distances. Supposing we have obtained 2D tracklets $T_{k,i}$ and $T_{k,j}$, which are from cameras $c_i$ and $c_j$, respectively, we represent such a distance set with $S(T_{k,i}, T_{k,j})$ and formulate it in Eq. (3), where $\Psi_i$ and $\Psi_j$ denote the frames covered by $T_{k,i}$ and $T_{k,j}$, respectively. Then, we design the distance between $T_{k,i}$ and $T_{k,j}$ based on three conditions. When $T_{k,i}$ and $T_{k,j}$ have temporal overlap and their camera IDs are different, we take the mean of $S(T_{k,i}, T_{k,j})$. When there is no temporal overlap, regardless of whether their camera IDs are different or identical, we assign ∅ to the distance. Here, ∅ serves as a special symbol that will trigger the merge conditions in our PDNC (see Figure 6 and Algorithm 1). When temporal overlap exists and the camera IDs are the same, we set an infinitely large value and conjecture that these 2D tracklets should be assigned to different persons. The corresponding formula is given in Eq. (4). With the aforementioned cross-view consistency distance, associating cross-view 2D tracklets can be formulated as a clustering problem that aims to optimize global criteria.
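The distance set and the three-case tracklet distance described in the prose can be written compactly as follows. This is a reconstruction from the verbal description, not a quotation of the original equations: the symbol $D$ for the tracklet distance is an assumed name, $x_{t,i}$ is taken to be the position of $T_{k,i}$ at frame $t$, and $\varnothing$ denotes the special "incalculable" symbol (as opposed to the empty set $\emptyset$).

```latex
S(T_{k,i}, T_{k,j}) = \left\{\, d_{cross}(x_{t,i}, x_{t,j}) \;\middle|\; t \in \Psi_i \cap \Psi_j \,\right\}

D(T_{k,i}, T_{k,j}) =
\begin{cases}
\operatorname{mean}\bigl(S(T_{k,i}, T_{k,j})\bigr), & \Psi_i \cap \Psi_j \neq \emptyset \ \text{and}\ c_i \neq c_j, \\
\varnothing, & \Psi_i \cap \Psi_j = \emptyset, \\
\infty, & \Psi_i \cap \Psi_j \neq \emptyset \ \text{and}\ c_i = c_j .
\end{cases}
```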
Although clustering methods have been widely applied in tracking tasks, we should be aware of three issues in associating cross-view 2D tracklets. First, predefining the number of clusters or even specifying the maximum size of clusters (e.g., [3]) should be avoided; otherwise, the framework may degenerate if more observed persons are included. We therefore apply non-parametric clustering methods [24] to automatically learn the number of clusters from data. Second, most non-parametric clustering methods contain abstract hyperparameters that need to be interpreted; thus, it is difficult to use them in real applications. The hyperparameter of Complete-linkage Clustering is a distance threshold that has an intuitive physical definition, so we take Complete-linkage Clustering as a candidate. Third, in contrast to single-frame multi-view association [5,20], we face the challenges posed by multi-frame multi-view association: when multi-view 2D tracklets have no overlapping frames, their distance cannot be directly calculated; therefore, conventional Complete-linkage Clustering, which requires the distances between all pairs of elements, cannot be applied in this scenario. To tackle this issue, we propose a novel Propagable Distance-based Non-parametric Clustering (PDNC) and describe its details in Algorithm 1. In PDNC, we substitute distance updating in conventional Complete-linkage Clustering with a dynamic updating mechanism. Utilizing our distance definition in Eq. (4), PDNC propagates the calculable distance to compensate for the incalculable distance under spatiotemporal constraints.
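One way to picture the propagation is the sketch below. It is an illustrative reading, not the authors' Algorithm 1: it runs complete-linkage merging with a distance threshold and, when two clusters merge, fills an incalculable (∅) distance to a third cluster with the calculable one from the other merged cluster. The update rule in `merge_distance`, the function names and the toy distances are assumptions.

```python
import math

EMPTY = None  # stands for the special symbol ∅ (incalculable distance)


def merge_distance(d_a, d_b):
    """Distance from the merged cluster (a ∪ b) to a third cluster.
    ASSUMPTION: complete linkage (max) when both are calculable; otherwise
    the calculable value is propagated in place of ∅."""
    if d_a is EMPTY:
        return d_b
    if d_b is EMPTY:
        return d_a
    return max(d_a, d_b)


def pdnc(n, dist, threshold):
    """n tracklets; dist maps (i, j) with i < j to a float, math.inf or EMPTY.
    Returns clusters as sorted lists of tracklet indices."""
    clusters = {i: [i] for i in range(n)}
    d = {frozenset(k): v for k, v in dist.items()}
    while True:
        candidates = [(v, tuple(sorted(k))) for k, v in d.items()
                      if v is not EMPTY and v < threshold]
        if not candidates:
            break
        _, (p, q) = min(candidates)          # closest mergeable pair
        clusters[p].extend(clusters.pop(q))  # merge q into p
        for r in clusters:
            if r != p:
                d[frozenset((p, r))] = merge_distance(
                    d.get(frozenset((p, r)), EMPTY),
                    d.get(frozenset((q, r)), EMPTY))
        d = {k: v for k, v in d.items() if q not in k}
    return sorted(sorted(c) for c in clusters.values())


# Toy case in the spirit of Fig. 6: tracklets 0 and 1 overlap in the same
# camera (inf); tracklets 2 and 3 never overlap in time (∅).
dist = {(0, 1): math.inf, (0, 2): 0.05, (0, 3): 0.08,
        (1, 2): 0.9, (1, 3): 0.8, (2, 3): EMPTY}
result = pdnc(4, dist, threshold=0.3)
print(result)
```

In this toy run, tracklets 2 and 3 end up in the same cluster through tracklet 0 even though their mutual distance was never calculable, while the infinite distance keeps tracklet 1 apart.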

Collaborative Multi-frame Multi-view Triangulation
After we obtain the multi-view relationship, 3D positions are inferred from multi-view 2D positions by applying linear algebraic triangulation [25]. Suppose that we have a pair of cross-view 2D positions that can be represented in homogeneous coordinates as $\mathbf{x}_i = (x_i, y_i, 1)^\top$ and $\mathbf{x}_j = (x_j, y_j, 1)^\top$, where $x$ and $y$ are the position values in the image coordinate system. Their corresponding perspective projection matrices are $P_{c_i}$ and $P_{c_j}$, respectively. In general, the 3D position $X$ is estimated by solving the following equations:

$$
\begin{bmatrix}
x_i P_{c_i,3} - P_{c_i,1} \\
y_i P_{c_i,3} - P_{c_i,2} \\
x_j P_{c_j,3} - P_{c_j,1} \\
y_j P_{c_j,3} - P_{c_j,2}
\end{bmatrix} X = 0, \qquad (5)
$$

where $P_{c_i,n}$ is the $n$-th row of $P_{c_i}$ [25].
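The linear system above can be solved in several ways. The sketch below uses only the standard library and an inhomogeneous variant: it fixes the last homogeneous coordinate of $X$ to 1 and solves the resulting normal equations by Gaussian elimination. This is an assumption made to keep the example dependency-free; an SVD-based homogeneous solution is the more common choice in practice. The helper names and the two toy cameras are hypothetical.

```python
def project(P, X):
    """Project a 3D point X = (X, Y, Z) with a 3x4 matrix P to a pixel (x, y)."""
    Xh = (X[0], X[1], X[2], 1.0)
    u, v, w = (sum(P[r][c] * Xh[c] for c in range(4)) for r in range(3))
    return (u / w, v / w)


def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    M = [list(A[r]) + [b[r]] for r in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate camera configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[r][3] / M[r][r] for r in range(3)]


def triangulate(observations):
    """observations: list of ((x, y), P). Stacks the rows x*P_3 - P_1 and
    y*P_3 - P_2, sets the homogeneous coordinate to 1, and solves the
    least-squares problem through its normal equations."""
    rows = []
    for (x, y), P in observations:
        rows.append([x * P[2][c] - P[0][c] for c in range(4)])
        rows.append([y * P[2][c] - P[1][c] for c in range(4)])
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Atb = [sum(r[i] * -r[3] for r in rows) for i in range(3)]
    return solve3(AtA, Atb)


# Two hypothetical cameras: identical orientation, shifted along the x axis.
P0 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P1 = [[1.0, 0.0, 0.0, -1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
X_true = (0.5, 0.2, 4.0)
X_hat = triangulate([(project(P0, X_true), P0), (project(P1, X_true), P1)])
print(X_hat)
```

With noise-free projections the recovered point matches the input up to floating-point error; more than two views can be passed in the same list.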
If all 2D positions are treated equally in the triangulation approach, then the estimation errors of 2D positions and incorrect cross-view associations could impair the accuracy of the 3D positions. Toward the goal of reducing estimation errors, Iskakov et al. first proposed to apply Random Sample Consensus (RANSAC) [26] to remove outliers, and further recommended a volumetric triangulation approach to jointly employ inlier and outlier information to generate improved 3D positions [27]. Due to its simplicity, similar solutions have been applied in other works. Nonetheless, such a solution only considers the information in a single frame, while the spatiotemporal information of nearby frames is ignored. Moreover, RANSAC can only handle a moderate percentage of outliers without the cost blowing up [28], but we may have a high rate of outliers caused by association failures and estimation errors.
We assume that the spatiotemporal information of nearby frames can be used to obtain the correct 3D positions. Consequently, a novel Collaborative Multi-frame Multi-view Triangulation (CMMT) is proposed to integrate the spatiotemporal information into the 3D position calculation. The details of CMMT are shown in Figure 7 and explained in Algorithm 2. Algorithm 1 is applied for the association, whereas Algorithm 2 is designed for outlier rejection and 3D position calculation. Although both of them jointly utilize multi-frame multi-view information and apply non-parametric clustering, their functions are fundamentally different.
In CMMT, we adopt a simple approach, interpolation, to encode multi-frame information into interpolated 2D tracklets. Supposing that the interpolation window size is $\phi$ and the 1D interpolation function is $f(\cdot)$ (e.g., linear interpolation), we can generate the interpolated 2D position $x^{infill}_i$ using the observed 2D positions $x^{obs}_i$, as formulated in Eq. (6). Applying Eq. (6) to each element of the 2D tracklet yields an interpolated 2D tracklet, and a set of 3D candidate positions can then be generated by applying Eq. (5) to the multi-view stereo pairs. Nonetheless, the 3D positions generated from interpolations could either compensate for the missing observations as inliers or contaminate existing observations as outliers. Thus, we apply the conventional Complete-linkage Clustering [29] to identify the pattern of 3D candidate positions, as a set of clusters $\Omega = \{\omega_1, \dots, \omega_p\}$. Then, we regard the largest cluster $\omega_{largest}$ as the inlier set; meanwhile, the others are treated as outliers. The final 3D position is obtained from the centroid of the inlier set, as in Eq. (7), where $\kappa$ denotes the cutting threshold for the clustering.

Algorithm 2 Collaborative Multi-frame Multi-view Triangulation (CMMT)
Inputs:
2D tracklets cropped by the sliding window anchored at keyframe $k$ and assigned to cluster $p$ by using Algorithm 1. A 2D tracklet is represented as $T_{k,p,i}$, where $i$ is its index. We have $T_{k,p,i} \in \omega_{k,p}$ and $\omega_{k,p} \in \Omega_k$.
Outputs:
3D tracklets anchored at keyframe $k$ and assigned to cluster $p$. A 3D tracklet is denoted as $U_{k,p}$.
1: Initialize: Copy each 2D tracklet to the observed 2D tracklets.
2: Apply the interpolation function $f(\cdot)$ (e.g., linear interpolation) to the observed 2D tracklets to get interpolated 2D tracklets. ▷ Multi-frame information is implicitly involved through interpolation
3: Merge observed and interpolated 2D tracklets with identical indices.
…
a) Apply Eq. (5) to multi-view stereo pairs to yield a set of 3D candidate positions at one frame, as $\Pi$.
b) …
c) Select the largest cluster at each frame and calculate its mean value for the 3D position $X_{k,p}$ with Eq. (7).
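The outlier-rejection step of CMMT can be sketched as follows: cluster the 3D candidate positions of one frame with complete-linkage clustering cut at $\kappa$, keep the largest cluster, and return its centroid. This is a minimal illustration under assumptions; the function names and the toy candidates are hypothetical, and the naive pairwise search is chosen for readability, not speed.

```python
import math


def complete_linkage(points, kappa):
    """Agglomerative complete-linkage clustering of 3D points.
    Merging stops once the closest pair of clusters is farther than kappa."""
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Complete linkage: the largest pairwise distance between two clusters.
        return max(math.dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > kappa:
            break
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


def robust_position(candidates, kappa):
    """Centroid of the largest cluster; all other clusters count as outliers."""
    inliers = max(complete_linkage(candidates, kappa), key=len)
    n = len(inliers)
    return tuple(sum(p[axis] for p in inliers) / n for axis in range(3))


# Five hypothetical candidates, only two of which agree (40% inliers).
candidates = [(1.0, 1.0, 1.0), (1.05, 1.0, 1.0),
              (3.0, 0.0, 0.0), (-2.0, 4.0, 1.0), (0.0, 0.0, 5.0)]
position = robust_position(candidates, kappa=0.2)
print(position)
```

Because the decision rests on the largest cluster rather than on a majority, the toy example still recovers the consistent pair although inliers are below 50%, provided the outliers do not agree with each other.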
Our CMMT still works when the percentage of inliers is less than 50%, which may be difficult for other outlier rejection methods. In terms of computational complexity, our PDNC and CMMT are modified from complete-linkage clustering, which has a time complexity of $O(n^2)$, where $n$ is the number of elements to be clustered. In MM-Tracking, let the numbers of cameras and persons be $N_c$ and $N_p$, respectively. If each camera can capture all persons, we have $n = N_p N_c$, so the time complexity of our PDNC and CMMT can be approximated by $O(N_p^2 N_c^2)$. However, in our test datasets, each camera only captures a few persons; moreover, since we assume that observations in the same camera view should not be clustered, we skip the distance calculation and clustering process for observations in the same camera view. Therefore, the actual complexity can generally be less than $O(N_p^2 N_c^2)$. In practice, the number of cameras and people observed may not be too large, so the overall computational cost could be acceptable.

Link Sliding Windows
As illustrated in Figure 4, our previous processes have generated short-term 3D tracklets within a sliding window. In this subsection, we describe how to link short-term 3D tracklets into the final result with our online track-to-track data association.
In conventional track-to-track data association works [4][5][6][7], the cross-camera relationship is unknown and there exist multiple redundant 3D tracklets for identical persons. Thus, those methods are mainly used to obtain the cross-camera relationship with a clustering approach. In our previous processes, however, the cross-camera relationship has been identified in each sliding window. Applying the conventional track-to-track methods would discard the obtained cross-camera relationship and render our previous efforts useless.
To fully leverage the cross-camera relationship obtained in each sliding window, we propose an online track-to-track data association. The short-term 3D tracklets in each sliding window are treated as nodes in our online track-to-track data association. We apply linear assignment to associate these nodes across sliding windows in 3D space. Tracklets between two adjacent sliding windows should be connected if they are compatible. Based on the track management of SORT [23], we form a track-to-track management to decide how and when to initialize, update and terminate a long-term tracklet that is connected from short-term tracklets.
More specifically, for two 3D tracklets $U_{k,i}$ and $U_{k+1,j}$, with centers anchored at keyframes $k$ and $k+1$, respectively, we set their distance as the average Euclidean distance between their overlapping parts. Each entry of the distance matrix of 3D tracklets between two adjacent sliding windows is this average, in which $|\Psi_i \cap \Psi_j|$ denotes the number of overlapping frames between $U_{k,i}$ and $U_{k+1,j}$, and $X_{t,m}$ and $X_{t,n}$ are the 3D points in $U_{k,i}$ and $U_{k+1,j}$, respectively.

Fig. 8 Our sliding-window approach can utilize the noisy 2D tracklets more efficiently. Previous methods [2,3,9,10], which take the entire 2D tracklets for the cross-view association, may not be able to utilize 2D tracklets if ID switches are encountered. Our method uses the sliding windows to alleviate this problem. Panel annotations: "Due to ID switch, none of them can be matched"; "Sliding Window 1"; "Sliding Window 2"; "Tracklets 1 and 2 are matched".
Following the most fundamental strategy of single-view MOT [23,30], we formulate the 3D tracklet connection problem as a linear assignment problem. The relationship between the 3D tracklets of two adjacent sliding windows is modeled as a bipartite graph, which can be further represented as a Boolean matrix $M$. When row $i$ is assigned to column $j$, we have $M_{i,j} = 1$. Each row is assigned to at most one column, and each column is assigned to at most one row.
Thus, the optimal assignment $M^*$ is obtained by minimizing the total cost over all valid assignments. Referring to $M^*$, the matched 3D tracklets of sliding window $k+1$ are linked to the corresponding 3D tracklets of sliding window $k$, whereas the unmatched 3D tracklets of sliding window $k+1$ are regarded as new tracklets. After passing through all sliding windows, the short-term 3D tracklets are linked into our final results.
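The linking step can be sketched as follows, under several assumptions: tracklets are stored as dictionaries from frame index to 3D point, pairs without overlap or farther than a hypothetical gate `max_distance` are forbidden, and the optimal assignment is found by brute force over permutations. The brute-force search stands in for a proper linear assignment solver (such as the Hungarian algorithm) and is only suitable for a handful of tracklets.

```python
import math
from itertools import permutations


def tracklet_distance(track_a, track_b):
    """Average Euclidean distance over the overlapping frames of two 3D
    tracklets (dict: frame -> (X, Y, Z)); None if they never overlap."""
    shared = track_a.keys() & track_b.keys()
    if not shared:
        return None
    return sum(math.dist(track_a[t], track_b[t]) for t in shared) / len(shared)


def link_windows(prev_tracks, next_tracks, max_distance):
    """Return (matches, new): matches is a list of (i, j) index pairs linking
    prev_tracks[i] to next_tracks[j]; new lists unmatched next indices."""
    n, m = len(prev_tracks), len(next_tracks)
    size = max(n, m)
    big = 1e6  # cost of a forbidden or dummy pairing
    cost = [[big] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            d = tracklet_distance(prev_tracks[i], next_tracks[j])
            if d is not None and d <= max_distance:
                cost[i][j] = d
    # Brute-force minimum-cost one-to-one assignment (illustration only).
    best = min(permutations(range(size)),
               key=lambda cols: sum(cost[i][cols[i]] for i in range(size)))
    matches = [(i, best[i]) for i in range(n)
               if best[i] < m and cost[i][best[i]] < big]
    matched_next = {j for _, j in matches}
    new = [j for j in range(m) if j not in matched_next]
    return matches, new


# Two hypothetical windows that overlap on frames 12 to 14.
prev_tracks = [{t: (0.1 * t, 0.0, 0.0) for t in range(10, 15)},
               {t: (0.1 * t, 5.0, 0.0) for t in range(10, 15)}]
next_tracks = [{t: (0.1 * t, 5.02, 0.0) for t in range(12, 17)},
               {t: (0.1 * t, 0.01, 0.0) for t in range(12, 17)},
               {t: (0.1 * t, 20.0, 0.0) for t in range(12, 17)}]
matches, new = link_windows(prev_tracks, next_tracks, max_distance=0.5)
print(matches, new)
```

Matched pairs inherit the identity of the earlier window, and the remaining tracklets of the later window start new identities, mirroring the rule stated above.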
Besides enabling online processing, another merit of using the sliding window for 3D MM-Tracking is illustrated in Figure 8. Previous methods [2,3,9,10], which take the entire 2D tracklets for the cross-view association, may discard a 2D tracklet if ID switches are encountered. However, methods that reject 2D tracklets generally have poor tracking capability in areas captured by a limited number of cameras. To alleviate this problem, our method uses the sliding window: 2D tracking errors are limited to one sliding window and do not affect the others. The demand for online processing and the effective use of tracking features have both been realized in our framework by adopting sliding windows in 3D MM-Tracking.

Evaluation Datasets
We performed experiments on four public datasets. Among them, the Campus and Shelf [12,13] datasets focus on 3D pose tracking, whereas the WILDTRACK [31] and MMPTRACK [32] datasets focus on 3D footprint tracking.
The Campus and Shelf [12,13] datasets provide calibrated camera parameters and videos for multi-camera multi-person 3D pose tracking. While the Campus dataset consists of three cameras and three persons, the Shelf dataset includes five cameras and four persons. To infer the 3D pose trajectories above the ground plane, the methods used in previous works [2,3,9,10] cannot be applied, since their 3D positions are computed using the homography projection matrix from the camera view to the ground plane.
The WILDTRACK [31] dataset captures the 3D footprints of 313 individuals on a planar ground using seven calibrated cameras. Due to the large number of individuals, their projections in each camera view are heavily occluded, and it is therefore challenging to associate their cross-view observations in a single frame. Our framework jointly leverages multi-frame multi-view information to track robustly on this dataset. In our experiments, we followed the setup of LMGP [33] and used the first 360 frames for training and the rest for testing. We also utilize the same detections as LMGP [33].
The MMPTRACK [32] dataset is the largest publicly available multi-view multi-person tracking dataset, which was introduced in the ICCV 2021 Multi-camera Multiple People Tracking Challenge.In total, more than 9.6 hours of videos were collected, covering 28 subjects in five representative environments: Retail, Lobby, Industry, Cafe, and Office.Depending on the environment, the number of cameras varies from four to six.The dataset is divided into training, validation, and test sets.While the annotations of the training and validation sets are given, the ground truth of the test set is hidden from the public for a fair comparison.In our evaluation, we submitted our tracking results to the official evaluation server to obtain the performance.

Evaluation Metrics
In our experiments, we jointly evaluate the tracking performance and the 3D pose estimation performance. We follow the conventional setting and employ CLEAR metrics (e.g., MOTA) [38] and Identity metrics (e.g., IDF1) [39] to evaluate the tracking results from various perspectives. In detail, IDs (i.e., the number of ID switches) counts identity jumps, IDF1 (i.e., the ID F1 score) measures identity matching performance, and MOTA (i.e., Multi-Object Tracking Accuracy) combines false positives, missed targets, and ID switches. Besides, MT (i.e., the number of mostly tracked trajectories) counts the trajectories whose target was tracked for more than 80% of its length, whereas ML (i.e., the number of mostly lost trajectories) counts the trajectories whose target was tracked for less than 20% of its length. In addition, FP and FN represent the ratio of False Positives and False Negatives, respectively. Among them, the MOTA score is the dominant metric used to measure the overall tracking performance. To evaluate the 3D pose estimation performance, we follow common practice and apply the PCP (i.e., Percentage of Correctly estimated Parts), which is described in the Campus and Shelf [12,13] datasets.
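For readers who want the arithmetic behind these scores, the sketch below follows the standard definitions of the CLEAR and identity metrics: MOTA = 1 - (FN + FP + IDs) / GT, and IDF1 = 2 IDTP / (2 IDTP + IDFP + IDFN). The function names are hypothetical, and the counts are assumed to have been accumulated over all frames by an evaluation toolkit; the 80% and 20% thresholds are the ones stated above.

```python
def mota(num_fn: int, num_fp: int, num_id_switches: int, num_gt: int) -> float:
    """Multi-Object Tracking Accuracy from counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (num_fn + num_fp + num_id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """ID F1 score from identity-level true/false positives and false negatives."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def trajectory_category(tracked_frames: int, total_frames: int) -> str:
    """Label a ground-truth trajectory as mostly tracked (MT), mostly lost (ML),
    or partially tracked (PT), using the 80% / 20% thresholds."""
    ratio = tracked_frames / total_frames
    if ratio > 0.8:
        return "MT"
    if ratio < 0.2:
        return "ML"
    return "PT"


# Illustrative counts only; these are not results from our experiments.
example_mota = mota(num_fn=10, num_fp=5, num_id_switches=5, num_gt=200)
example_idf1 = idf1(idtp=90, idfp=10, idfn=10)
```

Since ID switches enter MOTA with the same weight as a single missed target, MOTA is dominated by detection quality, which is why IDF1 is reported alongside it.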

Experiment Setup
Since detection and tracking are separated in our framework, we describe their settings separately. For detection, we choose the off-the-shelf object detector YOLOX [40], which we train and run to infer bounding boxes on our experimental datasets. We follow the default settings of YOLOX and run it on a single-GPU machine (NVIDIA A100). For tracking, we set the hyperparameters of Table 1 as ν = 30, δ = 20, λ = 0.3, ϕ = 7, κ = 0.2. In addition, although our suggested framework is appearance-free, we also perform an ablation study to investigate the effect of using the appearance feature. To obtain the appearance feature, we utilize the model of Strong-ReID [41] on a single-GPU machine (NVIDIA A100). We then follow the strategy of DeepSORT [30] to fuse the geometry and appearance features in the single-view and multi-view data association.
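A DeepSORT-style fusion combines a geometric cost and an appearance cost through a weighted sum and admits a pair only if both costs pass their gates. The sketch below illustrates that idea; the weight `weight`, the two gate values, and the function names are hypothetical and are not the settings used in our experiments.

```python
import math
from typing import Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Appearance cost between two Re-ID embeddings (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def fused_cost(geometry_cost: float, appearance_cost: float, weight: float,
               geometry_gate: float, appearance_gate: float) -> Optional[float]:
    """Weighted sum of both costs, or None when either gate rejects the pair."""
    if geometry_cost > geometry_gate or appearance_cost > appearance_gate:
        return None
    return weight * geometry_cost + (1.0 - weight) * appearance_cost


# Illustrative values only.
accepted = fused_cost(0.2, 0.4, weight=0.5, geometry_gate=0.3, appearance_gate=0.5)
rejected = fused_cost(0.4, 0.1, weight=0.5, geometry_gate=0.3, appearance_gate=0.5)
```

Setting `weight` to 1 recovers a purely geometric, appearance-free association.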

Experiment Details
To build a unified 3D MM-Tracking framework that can perform robust 3D pose and footprint tracking, we explore the following aspects of our framework.

Can our framework generate satisfactory 3D footprints with cross-view 2D bounding boxes?
Given that generating 2D bounding boxes (e.g., with YOLOX [40]) can be faster and easier than generating 2D poses (e.g., with PoseMachine [47]) for multiple persons, using 2D bounding boxes to generate 3D footprints is highly efficient in some applications. We investigated 3D footprint tracking with the WILDTRACK [31] and MMPTRACK [32] datasets.
For the WILDTRACK [31] dataset, we show the quantitative and qualitative results in Table 3 and Figure 10, respectively. Our experiments empirically demonstrate that our framework can overcome adverse conditions (e.g., occlusions) and obtain promising results in large-scale outdoor scenes. Although we obtain the second-best IDF1 and MOTA scores compared with the state-of-the-art method, our framework offers online processing on realistically sized problems. Besides, we notice that using the appearance features improves data association: the IDF1 score increases from 94.7% to 97.5%, and the number of ID switches (IDs) decreases from 11 to 5.
For the MMPTRACK dataset, the quantitative results are tabulated in Table 4, and the qualitative results are illustrated in Figure 9. By the ICCV MMP-Tracking challenge deadline, our solution ranked fourth, with a MOTA of 93% and an IDF1 of 75%. At that time, we did not have PDNC and CMMT, and instead applied conventional Complete-linkage Clustering for cross-view 2D tracklet association and RANSAC for triangulation outlier rejection. After the challenge deadline, we kept pushing our framework forward by adding PDNC and CMMT. By submitting our new results to the evaluation system, we improved the MOTA to 95% and the IDF1 to 84%. Overall, we achieved performance comparable to the top-ranking methods. For the dominant metric MOTA (cf. [34,35]), our results reached 95%, which is 1% lower than the two top-ranking results but is still promising considering the notable crowding and occlusion in this challenging dataset. Besides, the top-ranking solutions had access to the unlabeled test set to boost their detection performance with semi-supervised learning. Our performance could also be improved by a similar approach, but this is outside the scope of building a unified framework.
For the metric IDF1 (cf. [34,35]), our performance reached a score of 84%, which falls short of the top-ranking results. However, 3D MM-Tracking is conducted with a complicated system, and a significant number of factors can be varied to trade off training cost, model generalization, tracking performance, and speed. Unlike other competitors, we did not utilize the appearance feature in our challenge solution. The challenge dataset contains only a few persons, and their appearances are easily distinguished by a Re-ID model (see Figure 9). Thus, it is reasonable that competitors who utilized the appearance cue reduced ID switches and obtained better IDF1 scores than ours. Moreover, despite its good performance, the cost of training the Re-ID model is notable, and its generalization should be considered. Because of potential domain gaps, the Re-ID model may need its weights tuned to fit the target dataset [48][49][50]; otherwise, directly applying the Re-ID model may lead to unnecessary degradation of the tracking performance. Without the Re-ID model, we avoid the additional training cost and speed up inference, but we may also sacrifice performance in terms of IDF1.

Does PDNC improve the cross-camera data association?
We argued that it is challenging to apply conventional clustering methods to associate cross-view 2D tracklets, because the distance between cross-view 2D tracklets that do not overlap in time cannot be calculated. To solve this problem, we proposed our PDNC. Table 5 shows that using our PDNC achieves better tracking performance than using complete-linkage clustering. Furthermore, in our Appendix, we use examples to illustrate the importance of correctly handling cross-view 2D tracklets that do not overlap in time.
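To make the difficulty concrete, the sketch below runs a threshold-cut agglomerative clustering over a distance table in which some pairs are incalculable (stored as None). It is not the PDNC update rule: the linkage used here, the maximum over the calculable pairs only, is an assumption chosen purely for illustration, and the tracklet names and distances are made up.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional

Distances = Dict[FrozenSet[str], Optional[float]]


def cluster_distance(dist: Distances, left: List[str], right: List[str]) -> Optional[float]:
    """Largest calculable pairwise distance, or None if no pair is calculable."""
    known = []
    for a in left:
        for b in right:
            d = dist.get(frozenset((a, b)))
            if d is not None:
                known.append(d)
    return max(known) if known else None


def agglomerate(items: List[str], dist: Distances, threshold: float) -> List[List[str]]:
    """Repeatedly merge the closest pair of clusters whose distance is below threshold."""
    clusters = [[item] for item in items]
    while True:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            d = cluster_distance(dist, clusters[x], clusters[y])
            if d is not None and d < threshold and (best is None or d < best[0]):
                best = (d, x, y)
        if best is None:
            return clusters
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]


# Tracklets a and c never overlap in time, so their distance is incalculable.
example_dist: Distances = {
    frozenset(("a", "b")): 0.1, frozenset(("b", "c")): 0.2,
    frozenset(("a", "c")): None, frozenset(("a", "d")): 0.9,
    frozenset(("b", "d")): 0.8, frozenset(("c", "d")): None,
}
result = agglomerate(["a", "b", "c", "d"], example_dist, threshold=0.5)
```

In this toy case, a and c end up in the same cluster only because both are close to b; a method that treated the missing distance as infinite, or as zero, would behave differently.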

Can our framework generate satisfactory 3D pose trajectories with cross-view 2D poses?
We selected the Campus and Shelf datasets [12,13] to evaluate our framework in 3D multi-person pose tracking. We follow the same evaluation protocol as in previous works [8,20,22,46] to perform a comparison. The 2D detected poses provided by VoxelPose [46] are employed. PCP [12,13] and MOTA [35] are used together as evaluation metrics.
The quantitative evaluation results are shown in Table 6, and the qualitative evaluation results are illustrated in Figure 11. Table 6 shows that our solution outperforms others in terms of the average PCP. We achieved average PCP scores of 97.0% and 97.7% on the Campus and Shelf datasets, respectively, which are state-of-the-art for these two datasets in terms of 3D pose estimation. From the tracking perspective, our method also reaches the best MOTA score among the reported works. Note that, for the Campus and Shelf datasets, using the appearance feature slightly improves our tracking performance but does not affect the 3D pose estimation, because we already achieve a promising tracking performance with our appearance-free framework.
Table 6 Comparison with the state-of-the-art methods on the Campus and the Shelf datasets [12,13]. The evaluation metrics are PCP [12,13] and MOTA [35]. The higher the PCP and MOTA scores are, the better the performance.
The samples visualized in Figure 11 indicate that 3D poses are correctly reconstructed even in heavily occluded scenarios. These results reveal that our framework can generate satisfactory 3D pose trajectories with cross-view 2D poses. Meanwhile, we also proved the effectiveness of

Does CMMT provide a robust 3D triangulation?
On the Campus and Shelf datasets, we compared our CMMT to two common approaches: (1) solely applying single-frame triangulation and (2) applying single-frame triangulation with RANSAC. For a fair study, we applied the same multi-view associated 2D tracklets to exclude the effect of the preceding cross-view association.
We present the results in Table 7. Because of potential errors in the detection and cross-view association, single-frame triangulation alone can be affected by outliers, and its PCP score for 3D positions is lower than that obtained with outlier rejection by RANSAC or by our CMMT. The existing RANSAC approach considers only single-frame information for outlier rejection, whereas our CMMT determines the inliers and outliers by consolidating multi-frame multi-view information. The results indicate that our CMMT achieves the best PCP score for 3D positions and can be used as a robust 3D triangulation method.
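As a reference point for baseline (1), the sketch below performs a plain single-frame least-squares triangulation from viewing rays. It is not CMMT and includes no RANSAC; it assumes that each 2D observation has already been back-projected into a ray (camera center plus direction) using the calibrated camera parameters, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def _normalize(v: Sequence[float]) -> Tuple[float, ...]:
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def _det3(m: List[List[float]]) -> float:
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def triangulate_rays(origins, directions) -> Tuple[float, ...]:
    """Point x minimizing the summed squared distance to all rays.

    Solves sum_i (I - d_i d_i^T) x = sum_i (I - d_i d_i^T) c_i by Cramer's rule.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(origins, directions):
        d = _normalize(d)
        for r in range(3):
            for s in range(3):
                m = (1.0 if r == s else 0.0) - d[r] * d[s]
                A[r][s] += m
                b[r] += m * c[s]
    det = _det3(A)
    if abs(det) < 1e-12:
        raise ValueError("rays are (nearly) parallel; the point is not determined")
    solution = []
    for col in range(3):
        Ac = [row[:] for row in A]
        for r in range(3):
            Ac[r][col] = b[r]
        solution.append(_det3(Ac) / det)
    return tuple(solution)


def ray_distance(point, origin, direction) -> float:
    """Distance from a 3D point to a ray's supporting line (a simple residual)."""
    d = _normalize(direction)
    v = [p - o for p, o in zip(point, origin)]
    t = sum(vi * di for vi, di in zip(v, d))
    return math.sqrt(sum((vi - t * di) ** 2 for vi, di in zip(v, d)))


# Three noise-free rays that all pass through the same target.
target = (1.0, 2.0, 3.0)
centers = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
rays = [tuple(t - c for t, c in zip(target, center)) for center in centers]
estimate = triangulate_rays(centers, rays)
```

With a wrongly associated observation, the residual returned by `ray_distance` grows for every view, which is why a single frame often cannot tell which view is the outlier.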

How do the dominant hyperparameters affect the model performance?
Since sliding windows are used to traverse videos in 3D MM-Tracking, the hyperparameters of the sliding window can affect the tracking result and speed. Besides, the threshold for associating multi-view 2D tracklets plays a key role in our framework. Therefore, we conducted ablation studies to explore the effect of varying ν (i.e., the window size of the sliding window), δ (i.e., the step size of the sliding window), and λ (i.e., the threshold for associating multi-view 2D tracklets). In the ablation studies (see Table 8), we vary the values of the hyperparameters within a reasonable range based on their intuitive physical meanings.
In Table 8, the inference speed increases as the size and step of the sliding window increase, and decreases as they decrease. Increasing the size and step of the sliding window eliminates redundant computation in our Collaborative Multi-frame Multi-view Association, but it increases the delay of our near-online processing. When the window size equals the length of the entire video, our framework becomes an offline method; when the window size decreases to one frame, it becomes an online method. We made a trade-off for the size and step of the sliding window by selecting ν = 50 and δ = 30. As discussed previously, we determine the value of λ by referring to the possible ratio between epipolar distances and a person's projection scale. If the detection were perfect, the epipolar distance would be zero. In practice, however, detections are biased, and we assume that the bias stays within 0.3 of the person's projection scale. Our results indicate that setting λ to 0.1 or 0.6 does not significantly affect the association performance.

What is the speed of this framework?
Although it is common to report the average speed of 3D MM-Tracking over an entire dataset, we suggest that this may not provide sufficient knowledge to apply a 3D MM-Tracking system to real applications, because the number of cameras and persons can differ across application scenarios. Therefore, we systematically analyzed the speed of our tracking system (w/o appearance) under different numbers of cameras and persons.
The results are shown in Figure 12. When tracking two persons with two cameras, our framework achieves a speed of 1,643 FPS. The speed decreases to 186 FPS when tracking two persons with six cameras and to 1,286 FPS when tracking eight persons with two cameras, indicating that the speed of our framework drops when more cameras and persons are included. Nonetheless, since most of the data use four cameras for 3D MM-Tracking, our framework can still be applied to real-time processing.

What are the limitations of this framework?
First, to achieve real-time processing, our recommended framework does not utilize the appearance feature. Although this removes the additional training cost of the Re-ID model and increases the inference speed of the whole framework, the number of ID switches can be higher than when the appearance feature is used. Nonetheless, as shown in Tables 3 and 6, the appearance feature can easily be added to our framework to improve the tracking performance at the cost of inference speed. We suggest weighing these trade-offs and using the Re-ID model according to the needs of the application.
Second, we selected a typical sample, shown in Figure 13, to explain another limitation of our framework. Unlike previous works [2,3,9], in which the 3D position on the ground plane can be directly generated from a monocular 2D position, our framework uses multi-view stereo pairs of 2D positions to generate the corresponding 3D position. When multi-view stereo pairs cannot be formed, our framework fails to produce 3D positions. To mitigate this problem, we state two requirements for the camera setup. First, the overlapping fields of view of the cameras should cover the entire observation space, so that a person is simultaneously captured by at least two cameras regardless of his/her location. Second, the baselines between cameras should have sufficient lengths and angles. For multi-view stereo triangulation, it is challenging to obtain accurate 3D positions from multi-view stereo pairs if the corresponding cameras are very close to each other [51]. If these requirements are satisfied, the performance of our framework could be further improved.
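One simple way to screen a camera layout against the second requirement is to measure the angle between the two viewing rays that meet at a location of interest: the smaller this angle, the more sensitive the triangulated depth is to 2D noise. The sketch below is only an illustration of that check; the function name and the example coordinates are hypothetical.

```python
import math
from typing import Sequence


def triangulation_angle_deg(cam_a: Sequence[float], cam_b: Sequence[float],
                            point: Sequence[float]) -> float:
    """Angle in degrees between the rays from two camera centers to a 3D point."""
    ray_a = [p - c for p, c in zip(point, cam_a)]
    ray_b = [p - c for p, c in zip(point, cam_b)]
    norm_a = math.sqrt(sum(v * v for v in ray_a))
    norm_b = math.sqrt(sum(v * v for v in ray_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("the point coincides with a camera center")
    cos_angle = sum(a * b for a, b in zip(ray_a, ray_b)) / (norm_a * norm_b)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


# A wide baseline seen from nearby versus a short baseline seen from far away.
wide = triangulation_angle_deg((-5.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 0.0, 5.0))
narrow = triangulation_angle_deg((-0.5, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.0, 10.0))
```

In the second layout the rays are nearly parallel, which corresponds to the failure case where two cameras are very close to each other.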

Conclusion
In this work, we proposed a Unified 3D Multi-view Multi-person Tracking framework, which could be useful in a wide range of downstream applications, such as public spaces, retail stores, and office buildings. Our framework has clear benefits: flexibility, simplicity, and efficiency. First, our framework offers the maximum flexibility to utilize different
Fig. 13 Failure case analysis on the MMPTRACK dataset [32]. Our method cannot handle two corner cases. (1) When only one camera captures the target, the corresponding 3D position cannot be calculated since a multi-view stereo pair is unavailable. To solve this issue, we should add more cameras to observe those areas. (2) When two cameras are very close to each other, the triangulation bias increases significantly. To ameliorate this issue, we can relocate the cameras to increase the length and angle of the baseline.
Fig. 14 Comparison of the clustering details between our PDNC and conventional hierarchical clustering methods. In this example, we use videos from four cameras that captured two people, and the goal is to cluster the 2D tracklets based on their epipolar distance. We compare five possible strategies, from (a) to (e), in terms of non-parametric clustering. Only our PDNC generates the correct clustering results by properly handling the incalculable distance.

Fig. 1 Illustration of our proposed framework. Monocular 2D bounding boxes and 2D poses can be applied to yield 3D trajectories for multiple persons.

Fig. 4 Architecture of our proposed framework. We utilize sliding windows to traverse the entire video. In each sliding window, Multiple-frame Single-view Multi-person Tracking is performed on each camera view to associate 2D positions into 2D tracklets. Then, multi-view 2D tracklets are associated into clusters using our Propagable Distance-based Non-parametric Clustering (PDNC). Subsequently, we obtain the 3D positions by applying our Collaborative Multi-frame Multi-view Triangulation (CMMT). Finally, we associate short-term 3D tracklets into long-term tracklets based on the Euclidean distance of their overlapping parts.
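The last step of the pipeline, linking short-term 3D tracklets through their overlapping parts, can be illustrated with the distance below. This is a minimal sketch: representing a tracklet as a mapping from frame index to 3D position and averaging the per-frame distances are assumptions made for this illustration, and the function name is hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Tracklet3D = Dict[int, Tuple[float, float, float]]


def overlap_distance(a: Tracklet3D, b: Tracklet3D) -> Optional[float]:
    """Mean Euclidean distance over the frames shared by two 3D tracklets.

    Returns None when the tracklets share no frame, in which case this
    distance gives no evidence for or against linking them.
    """
    shared = sorted(set(a) & set(b))
    if not shared:
        return None
    return sum(math.dist(a[f], b[f]) for f in shared) / len(shared)


# Two short-term tracklets from consecutive windows that share frame 2.
previous = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0), 2: (2.0, 0.0, 0.0)}
current = {2: (2.0, 0.1, 0.0), 3: (3.0, 0.0, 0.0)}
later = {5: (5.0, 0.0, 0.0)}
link_cost = overlap_distance(previous, current)
```

Because consecutive sliding windows overlap, tracklets of the same person from neighboring windows normally share frames, so the distance is defined for the pairs that matter.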

Fig. 5 Illustration of our normalized epipolar geometry distance. We demonstrate an example by associating the bounding box center. Here, w and h denote the bounding box width and height, respectively; p is the bounding box center coordinate; l is the epipolar line; and d is the Euclidean distance from the target point to the given epipolar line. The epipolar distance is normalized so that it is insensitive to the variation in 3D-to-2D projected scale caused by relative distance (see the highlighted parts).
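The quantities in this caption can be put together as in the sketch below. The point-to-line distance d is standard epipolar geometry; dividing d by the bounding-box diagonal is only an assumed stand-in for the person's projection scale, and the fundamental matrix shown corresponds to an idealized rectified camera pair chosen so the numbers are easy to follow. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def epipolar_line(F: Sequence[Sequence[float]], p: Tuple[float, float]) -> Tuple[float, float, float]:
    """Epipolar line l = F [x, y, 1]^T in the second view for point p in the first."""
    x, y = p
    a, b, c = (row[0] * x + row[1] * y + row[2] for row in F)
    return a, b, c


def point_line_distance(line: Tuple[float, float, float], q: Tuple[float, float]) -> float:
    """Euclidean distance d from point q to the line a*x + b*y + c = 0."""
    a, b, c = line
    return abs(a * q[0] + b * q[1] + c) / math.hypot(a, b)


def normalized_epipolar_distance(F, p, q, w: float, h: float) -> float:
    """d divided by the box diagonal of the target detection (assumed scale)."""
    d = point_line_distance(epipolar_line(F, p), q)
    return d / math.hypot(w, h)


# Idealized rectified pair: the epipolar line of (x, y) is the row y' = y.
F_rectified = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
value = normalized_epipolar_distance(F_rectified, p=(100.0, 200.0), q=(150.0, 206.0), w=30.0, h=40.0)
is_candidate = value < 0.3  # compared against an association threshold such as lambda
```

Here the raw distance is 6 pixels and the diagonal is 50 pixels, so the normalized value stays the same if the person appears twice as large and the pixel error doubles accordingly.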

N f 2 2 Np 2 ,
Nc where $N_f$, $N_c$, and $N_p$ represent the numbers of frames, cameras, and persons, respectively.
distances from the new cluster $\omega_{k,p}$ to all other clusters in the current clustering set $\Omega_k$ by 6:

The multi-view relationship remains the same: $T^{\mathrm{merge}}_{k,p,i} \in \omega_{k,p}$ and $\omega_{k,p} \in \Omega_k$.
4: for $\omega_{k,p}$ in $\Omega_k$ do
5: for multi-view stereo pairs formed in each frame of $T^{\mathrm{merge}}_{k,p,i}$, where $T^{\mathrm{merge}}_{k,p,i} \in \omega_{k,p}$ do
6: Complete-linkage Clustering on $\Pi$ with cutting threshold $\kappa$.
$X_{k,p}$ to 3D tracklet $U_{k,p}$.

Fig. 9 Qualitative results on the test set of the MMPTRACK dataset [32]. 2D bounding boxes are applied to locate persons in each camera view, and circles are utilized to indicate the 3D locations of persons in bird's-eye-view coordinates. Each person is denoted by a unique color based on the tracking results.

Fig. 10 Qualitative results on the test set of the WILDTRACK dataset [31]. 2D bounding boxes are applied to locate persons in each camera view, and circles are utilized to indicate the 3D locations of persons in bird's-eye-view coordinates. Each person is denoted by a unique color based on the tracking results.
Panel annotations (clustering example): a tracklet joins cluster 0 when its distance (0.1, 0.2, and 0.3 in the three annotated cases) is the smallest one and below the cutting threshold 0.5.

Two types of 3D MM-Tracking architectures that are related to our work. Type (1) [2,3,9,10] generates 2D tracklets for each view, and the corresponding 3D positions are directly obtained with homography projection, which requires 3D positions on the ground plane. Then, it merges 3D tracklets projected from each view in an offline manner. Type (2) [4][5][6][7] first applies single-frame multi-view association and triangulation to obtain 3D positions, and then links 3D positions into 3D tracklets in an online manner.
• [...]cally design a PDNC to collaboratively associate multi-frame multi-view 2D tracklets. When cross-camera 2D tracklets do not have overlapping frames, their distance cannot be directly calculated; therefore, conventional clustering methods cannot be applied to associate them. Our PDNC tackles this issue, and as a general clustering method, PDNC can also be applied to other research fields.
• We present a novel yet simple CMMT, which can aggregate multi-frame multi-view information to calculate 3D positions and reject outliers. Our results show that it is more robust than the conventional methods compared in our experiments.
• We propose a robust online track-to-track association to enable online processing of our framework and reduce the effect of track ID switches in tracklets of each monocular view.
• Our framework achieves state-of-the-art performance on the Shelf and Campus datasets [12,13]; meanwhile, it obtains comparable results on the WILDTRACK dataset and the ICCV 2021 Multi-camera Multiple People Tracking Benchmark x. The effectiveness of our framework has been verified.

2 Related Works
2.1 2D Single-view Multi-object Tracking
2D Single-view Multi-object Tracking (MOT) is the basis for 3D MM-Tracking. Although it seems less complicated than 3D MM-Tracking, 2D single-view MOT is still challenging due to the large number of objects that need to be tracked, the occlusions between objects, and the changing appearance of objects over time.
x https://competitions.codalab.org/competitions/33729

Table 3 Comparison with the state-of-the-art methods on the WILDTRACK dataset [31]. MOTA is used as the dominant evaluation metric. The definitions of MOTA and IDF1 are provided in refs. [34,35]. The data rendered in Bold and Underlined indicate the best and second-best results, respectively.
Method IDF1 ↑ MOTA ↑ MT ↑ ML ↓ FP ↓ FN ↓ IDs ↓

Table 4 Evaluation results on the MMPTRACK dataset [32]. MOTA is used as the dominant evaluation metric. The definitions of MOTA and IDF1 are provided in refs. [34,35]. The data rendered in Bold indicate the best results.

Table 5 Ablation studies for our PDNC on the WILDTRACK dataset (w/o appearance). MOTA is used as the dominant evaluation metric. The definitions of MOTA and IDF1 are provided in refs. [34,35]. The data rendered in Bold and Underlined indicate the best and second-best results, respectively.

o appearance 98.2% 94.6% 98
The data rendered in Bold indicates the best result.

Table 7 Ablation studies for our CMMT on the Campus and Shelf datasets [12,13]. The evaluation metric is PCP [12,13]. The larger the PCP score is, the better the performance. The data rendered in Bold indicate the best results. Using the same multi-view associated 2D tracklets, we demonstrated that our CMMT is better than using the combination of triangulation and RANSAC.
monocular features. Without additional modifications, our framework can adopt monocular 2D bounding boxes and 2D poses as inputs to produce 3D trajectories for multiple persons. It is also applicable to arbitrary 3D positions in either planar or nonplanar environments. Second, although the structure of our framework looks complicated, its hyperparameters are more intuitive than those of related works, thereby ensuring simpler utilization. Finally, we verified its effectiveness by achieving state-of-the-art performance on the Campus and Shelf datasets for 3D pose tracking, and comparable results on the WILDTRACK and MMPTRACK datasets for 3D footprint tracking. Since we have not yet applied existing useful tricks and modules in our framework, it could serve as a prototype and be extended into a more powerful 3D multi-view multi-person tracking system in future work.
Fig. 13 panel annotations: Two cameras are very close to each other. We can relocate them to obtain better 3D positions. This person is observed only in one camera and therefore cannot be located by our method.
Fig. 12 The speed of our framework (w/o appearance features). The speed of our framework decreases when more cameras and persons are included.