A Study of Using Synthetic Data for Effective Association Knowledge Learning

Association, aiming to link bounding boxes of the same identity in a video sequence, is a central component in multi-object tracking (MOT). To train association modules, e.g., parametric networks, real video data are usually used. However, annotating person tracks in consecutive video frames is expensive, and such real data, due to their inflexibility, offer us limited opportunities to evaluate system performance w.r.t. changing tracking scenarios. In this paper, we study whether 3D synthetic data can replace real-world videos for association training. Specifically, we introduce a large-scale synthetic data engine named MOTX, where the motion characteristics of cameras and objects are manually configured to be similar to those of real-world datasets. We show that, compared with real data, association knowledge obtained from synthetic data can achieve very similar performance on real-world test sets without domain adaptation techniques. We credit this intriguing observation to two factors. First and foremost, 3D engines can well simulate motion factors such as camera movement, camera view, and object movement, so the simulated videos can provide association modules with effective motion features. Second, the experimental results show that the appearance domain gap hardly harms the learning of association knowledge. In addition, the strong customization ability of MOTX allows us to quantitatively assess the impact of motion factors on MOT, which brings new insights to the community.


Introduction
Multi-object tracking (MOT) is a compound system composed of several functional components, e.g., detection, visual representations, and association. Association is at the final stage of the MOT pipeline and is usually viewed as the core problem, aiming to connect bounding boxes with existing tracklets [1,2] . The association module makes inferences according to appearance features (e.g., re-identification features), motion features (e.g., location and size of bounding boxes), or both of them.
In the community, what many solutions to association have in common is that they are trained with real-world video data [3,4] . However, there are several potential problems with this practice. First, annotating trajectories in video frames requires expensive labor costs. This potentially limits the scale of MOT training data. Second, privacy and ethics issues constrain the usage of real-world data in human-centered tasks, e.g., multiple pedestrian tracking.
In this paper, we investigate how to use synthetic data in MOT, so as to avoid the concerns listed above. We build a 3D simulation engine, MOTX, for generating videos with multiple targets, rich annotations, and controllable visual factors. Such data offer an inexpensive way to acquire large-scale data with accurate labels. With MOTX, we aim to answer two interesting questions.
First, does the association knowledge learned from synthetic data work in real-world videos? A common weakness of synthetic data is its distribution difference from real-world data, especially in image style. In "appearance-centered" tasks (e.g., re-identification and segmentation), to avoid failure in real-world test environments, models trained on synthetic data require additional training techniques, such as fine-tuning or domain adaptation on the real data [5−8] . However, association learning is different from appearance learning regarding data requirements. According to existing works [1,2,9] , motion cues play an essential role in association. While photo-realistic appearance is hard for the engine to simulate, motion cues, such as occlusion, may be less difficult. Some sample results of appearance simulation and association scenario simulation are shown in Fig. 1.
Second, how do motion factors affect association knowledge learning? Existing datasets are mostly from the real world, such as MOT15. While these data benefit model training, their fixed nature offers us limited opportunities to understand how the system reacts to changing visual factors. For example, how does pedestrian density in the training set affect model accuracy? Can a model trained with static cameras be well deployed under moving-camera systems? In this paper, taking advantage of the strong customization ability of MOTX, we make some initial investigations into these interesting directions. In response, this paper makes a two-fold contribution. First and most importantly, we show that on several state-of-the-art association networks, association knowledge learned from synthetic data can be well adapted to real-world scenarios without a performance drop. Specifically, we synthesize datasets using MOTX by manually setting key parameters (e.g., camera view) to be close to real-world training sets. Then, when recent association networks are trained on such synthetic videos, they achieve similar or sometimes even better tracking accuracy compared with real-data training. Our ablation studies on appearance and motion features suggest: 1) the appearance discrepancy between synthetic data and real-world data hardly harms association knowledge learning; 2) 3D engines can simulate motion cues in association scenarios well. These findings can explain the competitiveness of synthetic data and imply that MOT benefits more from using synthetic data than "appearance-centered" tasks do. To our knowledge, this is one of the earliest studies of the role of synthetic data in MOT.
Second, we perform empirical studies on how object-related and camera-related factors affect the learning of association knowledge. Specifically, we investigate two groups of factors: 1) pedestrian-related factors, such as density and moving speed; and 2) camera-related factors, including the camera view and camera moving state. With the proposed MOTX engine, motion factors are abstracted as system parameters, so we can readily simulate different scenarios by simply changing these parameters, e.g., setting the object velocity to 1 m/s. Our results shed light on the relationship between factors in training and testing data and MOT system performance.

Related work
Association methods in MOT. There are mainly two types of association: human-designed policies and parametric association modules. The former is usually seen in MOT works focusing on improving detection and appearance embedding [9−15] . They compute similarities between bounding boxes and objects according to predefined metrics. The most commonly used metrics are the intersection over union (IoU) score and the cosine similarity score between deep Re-ID features. A bipartite matching algorithm (e.g., the Hungarian algorithm [16] ) then associates the bounding boxes with objects. The Kalman filter [17] can also predict motion and smooth trajectories.
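As a concrete illustration of such a human-designed policy, the sketch below scores tracked boxes against new detections by IoU and picks the highest-scoring bipartite matching. It is a minimal stand-in: real trackers use the Hungarian algorithm for efficiency, whereas this sketch uses exhaustive search, which is only feasible for the tiny inputs shown here.

```python
from itertools import permutations

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections):
    """Exhaustive bipartite matching maximizing total IoU (a toy
    stand-in for the Hungarian algorithm; assumes there are at
    least as many detections as tracks)."""
    best, best_score = None, -1.0
    for perm in permutations(range(len(detections)), len(tracks)):
        score = sum(iou(t, detections[j]) for t, j in zip(tracks, perm))
        if score > best_score:
            best, best_score = list(perm), score
    return best  # best[i] = index of the detection matched to track i
```

In practice, matches with IoU below a threshold would additionally be rejected so that unmatched detections can spawn new tracklets.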
The latter uses neural networks to formulate the association stage. For example, DeepMOT [2] proposes a long short-term memory (LSTM) method to approximate the Hungarian matching algorithm [16] . MPNTracker [1] formulates sequences as graphs and designs a differentiable message passing network to predict the score for each box link between frames. Li et al. [9] and Papakis et al. [18] use a graph neural network to model appearance and motion (geometric) features and produce similarities between tracklets and detections. These parametric association modules are trained based on appearance and motion features. In this paper, we observe that parametric association modules trained with synthetic videos can be successfully deployed in real-world test sets without domain adaptation.
Learning from synthetic data for real-world applications. Synthetic datasets have been used in image classification [19,20] , object detection [21−23] , multi-object tracking [21,22,24] , semantic and instance segmentation [21,22,25] , pose estimation [24,26] and navigation [27] . Commonly used simulation platforms include Unity and Unreal. In this area, domain adaptation is mostly used. For example, Bąk et al. [5] use the cycle generative adversarial network to convert synthetic images into the real-world style. In comparison, much fewer works achieve good performance without domain adaptation. Unlike the common practices using synthetic data to learn appearance features, this paper investigates the possibility of using synthetic data to focus on association module training in MOT.

Fig. 1 Simulated appearance VS. simulated association scenarios. (a) Simulated appearance usually has an image-style discrepancy with the real-world appearance. For many appearance-centered tasks such as re-identification, such an appearance domain gap compromises models that are trained on synthetic data and tested on real data. (b) In comparison, we show that synthetic data are as effective as real data in training association models. This suggests that association scenarios (e.g., trajectories and occlusions) have a small domain gap between the synthetic and the real.
Domain gap beyond appearance. While the domain gap caused by the image appearance is the most studied, there are some works studying other factors that lead to distribution differences between domains. Recently, Meta-sim [28] optimizes the probability grammar for scene content generation. Yao et al. [8] study the content-level domain gap in the vehicle re-identification task and show the feasibility of reducing the gap by editing synthetic data. This paper will identify and discuss factors beyond appearance (i.e., motion factors) that influence association learning in MOT.

MOTX engine
MOTX is a 3D rendering engine that receives a set of controllable factors related to objects, cameras, and others as inputs, and it outputs a 2D video together with ground truth annotations (Fig. 2). We build MOTX based on the Unity [29] game engine. Section 3.1 introduces controllable factors. Section 3.2 describes annotation acquisition.

Controllable factors
Object-related factors. Currently, MOTX focuses on tracking pedestrians. We collect 1 200 pedestrian 3D models with distinct appearances from the PersonX engine [30] . Controllable factors include pedestrian density, speed, and action ∈ {walk, run}. Density refers to the number of pedestrians inside the viewing frustum. Each pedestrian takes an action with a random speed drawn from a given speed distribution. The walking routes are randomly generated.
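The per-pedestrian sampling described above can be sketched as follows. The Gaussian speed model, the default speed values, and all parameter names are illustrative assumptions, not the engine's actual interface.

```python
import random

def sample_pedestrians(n, speed_mean=1.4, speed_std=0.3, seed=0):
    """Sample per-pedestrian controllable factors: an action drawn
    from {walk, run} and a speed (m/s) drawn from a given speed
    distribution, mirroring the MOTX description.  The Gaussian
    model and the ~1.4 m/s mean (typical walking speed) are
    illustrative assumptions."""
    rng = random.Random(seed)
    peds = []
    for i in range(n):
        peds.append({
            "id": i,
            "action": rng.choice(["walk", "run"]),
            "speed_mps": max(0.0, rng.gauss(speed_mean, speed_std)),
        })
    return peds
```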
Camera-related factors. The viewing pose, spatial location, running path, and speed of the camera can be flexibly adjusted. In this paper, we mainly evaluate two commonly encountered camera views, the surveillance view (static camera, overlooking view) and the vehicle-mounted view (moving camera, near-horizontal view).
Other factors. MOTX supports changing other visual factors that influence the final rendering, including scenes, resolution, and lighting (light direction, light intensity, light color, etc.). If not specified, all videos are recorded at a resolution of 1 024×768.

Annotation acquisition
Bounding box annotation. We transform the 3D locations of person models in the scene into 2D locations in the camera view. By calculating the top, bottom, left, and right vertices of people, we can obtain accurate bounding boxes for the holistic body. For occluded or partially visible persons, the engine can tell the occlusion relations, and we accordingly annotate the bounding boxes of visible parts as well.
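The projection step can be sketched with a simple pinhole camera model: project each 3D vertex of the person model into the image and take the extremes. The intrinsics below (focal lengths and a principal point at the center of a 1 024×768 frame) are illustrative values, not MOTX's actual calibration.

```python
def project_point(p, fx=1000.0, fy=1000.0, cx=512.0, cy=384.0):
    """Pinhole projection of a camera-space point (X, Y, Z), Z > 0.
    The intrinsics are illustrative; cx/cy match the center of the
    default 1 024x768 frame."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)

def bbox_2d(vertices_3d):
    """Holistic 2D bounding box (left, top, right, bottom) from the
    projected vertices of a person model, as described for MOTX
    annotation."""
    pts = [project_point(v) for v in vertices_3d]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))
```

Occlusion-aware boxes of visible parts would additionally require the engine's depth or visibility information, which is not modeled in this sketch.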
Identity annotation. Identity labels are directly given by the engine. This avoids the re-labeling problem when a person leaves and re-enters the field of view, which is a common annotation mistake in real datasets.
In practice, we build assets in the Unity engine and provide user interfaces to configure the predefined status of controllable factors and trajectories for persons and cameras in a 3D scene. The Unity engine will render the video according to our configurations automatically. An example of rendering videos in MOTX is shown in Fig. 3.

Association knowledge
A multiple-object tracker is usually composed of a detector, an appearance model, and an association model. In this work, we argue that it is possible to learn the association model with synthetic videos generated by the MOTX engine, while the learned association knowledge remains applicable to real-world data without domain adaptation. As preliminaries, we give a definition of association knowledge and briefly review how existing methods learn it.

Definition of association knowledge
Given a set of detected bounding boxes D_t and a set of tracked objects O_t at frame t, the assignment between the i-th bounding box d_i in D_t and the j-th object o_j in O_t is noted as a_ij, where a_ij ∈ {0, 1}. a_ij = 1 denotes that d_i is associated with o_j; otherwise, d_i belongs to other tracklets. The association module in an MOT system usually aims to optimize the assignment matrix A_t at frame t: a_ij is the (i, j) entry of A_t and is the association score between d_i and o_j. If Σ_i a_ij = 0, none of the bounding boxes in D_t should be connected to o_j. Similarly, Σ_j a_ij = 0 indicates that the bounding box d_i does not belong to any objects in O_t. In this case, d_i can be a new object, or the object ID that d_i belongs to is missing from the currently tracked objects. We define association knowledge as a metric function K(F_a, F_m) that takes appearance features F_a, motion features F_m, or both of them as input and outputs the similarity score, where F_a is the joint set of appearance features from both detections and existing tracklets, and F_m is similar but contains motion features. In practice, appearance features are widely represented by Re-ID features, while motion features usually contain geometric information such as the locations and the sizes of bounding boxes [1,9] .
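A minimal sketch of reading such a binary assignment matrix, with rows indexing detections and columns indexing tracked objects, under the interpretation given above:

```python
def check_assignment(A):
    """Validate a binary assignment matrix A: rows index detected
    boxes, columns index tracked objects.  Each detection joins at
    most one object and vice versa; an all-zero row marks a
    potential new object, an all-zero column an object currently
    missing from the detections."""
    n_cols = len(A[0])
    ok = all(sum(row) <= 1 for row in A) and \
         all(sum(row[j] for row in A) <= 1 for j in range(n_cols))
    new_detections = [i for i, row in enumerate(A) if sum(row) == 0]
    missing_objects = [j for j in range(n_cols)
                       if sum(row[j] for row in A) == 0]
    return ok, new_detections, missing_objects
```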

In early literature, association knowledge is commonly modeled with human-designed policies. For instance, a simple policy is to consider only motion cues, ignoring appearance cues. Specifically, the bounding boxes belonging to the same object ID in two adjacent frames should be closer than those belonging to different object IDs. Based on this observation, we use the IoU of the bounding boxes as the association score (Fig. 4(a)). Another simple yet effective human-designed policy is to use the cosine similarity between Re-ID features as the association score. Similarly, the cosine similarity of features belonging to the same ID has a larger value than that computed from Re-ID features extracted from different identities (Fig. 4(b)).
Human-designed policies are sub-optimal, as it is difficult for them to take full advantage of both appearance and motion cues. Beyond human-designed policies, more recent arts [1,2,9,18] attempt to learn association knowledge directly from data with a parametric model. As illustrated in Fig. 4(c), both appearance and motion features are taken as input by the association model, and the model learns its parameters by applying stochastic gradient descent (SGD) on a labelled dataset. During inference, the model outputs predictions with a single forward pass. The most prevalent choice of parametric model is the graph neural network (GNN) [31] . In Section 6, we show, through empirical experiments, that it is possible to learn association knowledge from synthetic data.
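A toy parametric scorer illustrating the idea: a single logistic unit combining an appearance cue (cosine similarity of Re-ID features) and a motion cue (IoU). The actual methods cited above use GNNs trained with SGD; the logistic form and the weights below are illustrative, not learned.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def k_theta(feat_det, feat_trk, iou_score, theta=(3.0, 3.0, -2.0)):
    """Toy parametric association score: a logistic model over an
    appearance cue and a motion cue.  theta = (w_a, w_m, b) would
    be learned by SGD in a real system; here it is fixed for
    illustration."""
    w_a, w_m, b = theta
    s = w_a * cosine(feat_det, feat_trk) + w_m * iou_score + b
    return 1.0 / (1.0 + math.exp(-s))  # similarity score in (0, 1)
```

A pair with matching appearance and high IoU scores near 1, while dissimilar, non-overlapping pairs score low, which is the behavior an association model is trained to produce.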

Experiment setup
Comparison pipeline. This paper aims to compare synthetic data and real data on their effectiveness when they are used to learn association knowledge. The experimental setup is briefly illustrated in Fig. 5.
Benchmark methods. For a comprehensive comparison, we select several typical association methods, including both parametric association models and human-designed association policies. We pay more attention to parametric models as they show superior performance. Details are described as follows. MPNTracker [1] formulates MOT with the classical network flow; a type of GNN named message passing network (MPN) is proposed to predict linkages based on a graph built with appearance features and motion cues. DeepMOT [2] proposes a deep Hungarian net (DHN) as an association module to approximate the Hungarian matching algorithm. GN-MOT [9] builds an appearance graph and a motion graph for two conjunctive frames; two graph networks then compute the similarities between nodes to achieve association. SORT [32] is a human-crafted association policy that only employs motion cues: observations are associated with tracklets in a hierarchical manner by comparing IoU distances. We mainly tune its key hyperparameter, the IoU threshold, on the training set and use it on the test set. StrongSORT [33] proposes an appearance-free link model to associate short tracklets into complete trajectories. We use synthetic data to train this link model and then directly deploy it in the real-world testing environment.
Evaluation metric. For evaluation, we employ the widely used CLEAR MOT metrics [34] . The main metrics include MOTA (MOT accuracy), IDF1 (ID F1-measure), IDSwR (identity switch rate), MT (mostly tracked target percentage), and ML (mostly lost target percentage). Among them, IDF1 and IDSwR are the most relevant ones for evaluating association accuracy.
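For reference, the headline CLEAR metrics reduce to simple formulas over per-sequence error counts; a minimal sketch:

```python
def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, where GT is the total
    number of ground-truth boxes, per the CLEAR MOT metrics."""
    return 1.0 - (fn + fp + idsw) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), the F1 score over
    identity-matched detections."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)
```

For example, 10 misses, 10 false positives, and 5 identity switches over 100 ground-truth boxes give MOTA = 0.75; the identity-level counts behind IDF1 come from a global trajectory matching, which is omitted here.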

Evaluation on benchmark datasets
In this section, we show that the association knowledge learned from synthetic data works well on real-world test sets. Specifically, we use the test sets of MOT-15/16/17/20 [3,4,35] . For real-data training, we use the corresponding train split of the target set, e.g., train on the MOT16 train set and test on the MOT16 test set. For synthetic-data training, we build a single synthetic training set and evaluate on all test sets. We name this synthetic dataset MOTX-S. MOTX-S is synthesized using the MOTX engine and consists of 22 videos in total. Videos are generated by roughly simulating the scene dynamics (camera moving state, camera view, person density, person velocity, etc.) of videos in the MOT15-17 datasets. As shown in Section 3.1, the resulting synthetic videos yield consistently good results even when the parameters of some scene dynamics vary in a relatively large range.
Association knowledge from the synthetic data is effective. Results are shown in Table 1. The major observation is that each association method trained on synthetic data achieves performance similar to that trained on real-world training data in terms of all metrics. Note that when training MPNTracker, MOTX-S shows an advantage over MOT15. Specifically, MOTX-S improves IDF1, MT, and ML by 0.5%, 3.1%, and 0.7%, respectively. This suggests that the association scenarios in MOTX give better supervision for association knowledge learning than MOT15. For all comparisons, we do not observe a noticeable performance drop when training on MOTX-S. In most cases, the performance gap between MOTX and the real-world data is less than 1% for all evaluation indexes.
In the experiment of StrongSORT, where the learned association model is purely motion-dependent and appearance-free, it achieves slightly better testing accuracy when using MOTX-S than real-world videos. On the one hand, the above observations suggest that the association knowledge learned from synthetic data can achieve similar performance compared with that trained on real-world data.
On the other hand, such competitiveness of synthetic data cannot be seen in "appearance-centered" tasks if the deep system is learned only from synthetic data. Because of the superior performance and run-time efficiency of MPNTracker, the following experiments are conducted on it.

Association domain gap exists. We train MPNTracker on the training sets of MOT15, MOT17, and their combination, respectively. Testing results on the MOT17 test set are shown in Table 2. Both MOT15 and the combined set are worse than using MOT17 alone. Specifically, MOT15 gets 3% lower IDF1 and about 25% more ID switches.
A similar degeneration trend can also be found when deploying the association knowledge from MOT17 in the MOT15 domain. This suggests that there is a domain gap between association scenarios in MOT15 and MOT17.

Table 1 Comparing synthetic data (MOTX-S) and real data in association knowledge learning on real-world test sets. The numbers in bold denote that association knowledge learned from synthetic data is superior or equal to that learned from real data, while underlined numbers mean that the performance gaps are less than 1.0.

Appearance domain adaptation is not necessary. We attempt to reduce the appearance domain gap between synthetic data and real-world data by converting the appearance of detections in MOTX-S into the real-world style using the similarity preserving generative adversarial network (SPGAN) [36] . SPGAN is trained on data provided by VisDA2020, which has both Unity-based synthetic persons and real-world persons. The results in Table 3 show that MOTX-S is still competitive without domain adaptation on appearance.

Ablation study on appearance and motion features
It is worthwhile to investigate why the competitive results in Table 1 can be achieved by only using synthetic data with a considerable domain gap in image style. We conduct an ablation study on the input of the association model. Specifically, we eliminate the effect of appearance features F_a or motion features F_m in (2) by replacing them with dummy vectors 1 = (1, · · · , 1)^T. Videos {2, 10, 13} in MOT17 are set aside as the validation set, and the rest of the videos in MOT17 make up the training set. We repeat each training on MPNTracker five times and report the means. We also perform hypothesis testing to validate the statistical significance of the results. The results are shown in Fig. 6.
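The dummy-vector replacement can be sketched as follows; the helper name and the example feature values are hypothetical, introduced only to illustrate how one cue is neutralized before scoring.

```python
def ablate(features, use=True):
    """Return the feature vector unchanged, or replace it with the
    dummy all-ones vector 1 = (1, ..., 1)^T to remove its cue, as
    in the ablation study."""
    return list(features) if use else [1.0] * len(features)

# Example: drop the appearance cue ("w/o A") before scoring,
# keeping only motion features (illustrative values).
f_a = [0.2, -0.5, 0.9]                # Re-ID appearance features
f_m = [100.0, 200.0, 32.0, 64.0]      # e.g., box location and size
inputs = (ablate(f_a, use=False), ablate(f_m, use=True))
```

Because the dummy vector is constant across all detections and tracklets, the replaced cue carries no discriminative information, isolating the contribution of the remaining cue.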
Effectiveness of appearance features and motion features. The tracking performance degenerates when we eliminate either appearance features or motion features. It shows that both appearance features and motion features contribute to association knowledge learning. When training on both appearance & motion features, MOTX-S achieves similar performance on the MOT17 validation set. This is consistent with the conclusion in Section 6.1.
Synthetic VS. real on motion features. When only motion features are used (w/o A), MOTX-S shows a considerable advantage over real data. In detail, the ID switch count for MOTX-S is only half of that for real data, and the IDF1 score leads by over 6%. This performance gap is not observed in the "A+M" and "w/o M" experiments. This phenomenon suggests that motion scenarios generated with MOTX simulate real-world association scenarios well.
Synthetic VS. real on appearance features. Intuitively, it is highly possible that the domain gap of the appearance feature harms association learning. This is because appearance models are trained on real-world Re-ID datasets, but in training association models, they are used to extract features of synthetic person images. Moreover, the final test set consists of real-world videos. However, we do not observe the expected performance drop due to the appearance domain gap. See the "w/o M" results in Fig. 6: with appearance cues only, models trained on real data and on synthetic data perform almost equally, with similar ID switches and IDF1. This suggests a somewhat surprising finding: the appearance domain gap hardly harms the learning of association knowledge.
Quantitative analysis of using MOTX-S as a supplement. We consider using MOTX-S to augment real-world training data. Quantitative experiments are conducted to analyze the impact of appearance features and motion features provided by MOTX-S on association knowledge learning. We gradually add videos from MOTX-S into the MOT17 training set. The results on the MOT17 validation set are shown in Fig. 7. We have two observations. The major one is that both appearance and motion features from MOTX-S are effective in augmenting real-world features; the IDF1 scores are improved by 1.4% and 6.5%, respectively. Second, when only motion features are available to learn association, using MOTX-S as a supplement can boost the testing tracking accuracy. We conclude that MOTX excels at providing motion-related association domain knowledge.
Discussion. The above findings suggest that it is not necessary to perform additional appearance adaptation when we deploy the learned association knowledge on real-world test sets. Also, we observe that our synthetic data show stronger competitiveness in Fig. 6 than in Table 1. The possible reason is that the training set and the test set in the MOTChallenge benchmark overlap in association scenarios, i.e., videos in the test set are collected at the same locations as the training set, where the camera-related and pedestrian-related factors are very close.

Investigation of controllable factors
Another major advantage of synthetic data is that we can control multiple factors in generating videos. Therefore, it is possible to conduct a thorough investigation into how these controllable factors impact an association algorithm with the help of synthetic data. In this section, we mainly study the influence of four factors, i.e., camera view, camera moving state, pedestrian speed, and pedestrian density. For each factor, we design a group of contrast experiments using different training sets and test sets. A summary of the used datasets is illustrated in Table 4. A principle of these experiments is that we train an identical association model (here we use MPNTracker) with different customized synthetic data (e.g., camera view high VS. low) and test on different real data (both camera view high VS. low). The results are shown in Figs. 8 and 9.
Dataset notation. For clarification, datasets are notated in the format prefix-middle-suffix. The prefix can be "S" or "R", representing synthetic data or real data. The middle word is the controllable factor to be studied, e.g., "Cam" indicates the camera. The suffix is the value of the controllable factor. For instance, "S-Cam-H" represents a dataset consisting of synthetic videos with high camera views.
Camera view. The association models are trained on S-Cam-H, S-Cam-L, and their compound version {S-Cam-H, S-Cam-L}, respectively. Then the trained association models are tested on real-world videos with high camera views (R-Cam-H) or low camera views (R-Cam-L). According to Figs. 8(a)-8(d), the major observation is that association knowledge learning is sensitive to camera view. Specifically, when testing on R-Cam-H, the association model trained on S-Cam-H achieves an ID switch rate and IDF1 score close to those of the model trained on the compound data. However, the accuracy of using only S-Cam-L decreases noticeably in this case, as shown in Figs. 8(a) and 8(b). We observe a similar trend when testing on videos with low camera views in Figs. 8(c) and 8(d). This suggests that the knowledge learned from high camera views cannot be successfully deployed in a low-camera-view test environment, and vice versa. In other words, there is an obvious association domain gap between high-camera-view scenarios and low-camera-view scenarios.
Camera moving state. We learn association knowledge from static cameras (S-Cam-S) and moving cameras (S-Cam-M). The results imply that the association knowledge learned from moving cameras has stronger compatibility than that learned from static cameras.
Pedestrian speed. Association models are trained on S-Speed-n, n ∈ {1, 2, 4, 6}, which means the pedestrian speed is n m/s. The test sets are R-Speed-L and R-Speed-H. In detail, the frame rate of videos in R-Speed-H ranges from 7 fps to 10 fps, which means that the moving speed of the same identity between two conjunctive frames is almost 3-4 times that in R-Speed-L, where the video frame rate is higher.

Table 4 Notations for the four groups of data used to study motion factors. The prefix "S" and "R" represent synthetic data and real data, respectively. The suffix "H", "L", "S", and "M" stand for high, low, static, and moving. Notations "n.s.", **, and *** have the same meaning as those in Fig. 6.

Pedestrian density. Figs. 9(e)-9(h) show results for testing on real videos with different pedestrian densities. According to the official statistics of MOTChallenge, the average pedestrian density of all videos in our built R-Density-L is less than 10. However, for R-Density-L, the optimal density in the training set is 40 according to Figs. 9(g) and 9(h). This suggests that a gap in pedestrian density does not cause a gap in tracking performance if the testing environment has low pedestrian density. For example, S-Density-60 is better than S-Density-10 when testing on R-Density-L. However, association knowledge gained from low-density videos is not very effective in high-density environments (Figs. 9(e) and 9(f)).
Discussion. Our synthetic dataset is manually configured in MOTX by setting the motion-related parameters to roughly match the real training videos. The above experiments also confirm that this manual configuration process is stable. For example, when the pedestrian speed is set between 1 and 2 m/s, the IDS scores remain stable. The same observation also holds for other factors such as pedestrian density. Therefore, in practice, we advise giving the best possible manual estimation of the motion parameters of the testing environment. Relatively small errors are well tolerated, but large errors (e.g., the camera speed is estimated to be 1 m/s while the camera is actually static) should be avoided.

Tuning human-designed policy w.r.t. scenes
Our synthetic dataset can also benefit the hyper-parameter search for a given scene. We take the SORT [32] algorithm as an example. When R-Speed-H is the testing scenario, we synthesize a dataset according to the roughly estimated motion factors in R-Speed-H and search for the IoU threshold hyper-parameter. As shown in Table 5, compared with using MOT15 for the hyper-parameter search, the IoU threshold searched from the synthetic data is closer to that searched from the fully labeled R-Speed-H.

Table 5 Hyper-parameter tuning for the SORT algorithm. We report the best hyper-parameter for the given dataset.
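Such a search can be sketched as a simple grid search over candidate thresholds; the scoring callback and the grid values below are placeholders, not the paper's actual protocol.

```python
def search_iou_threshold(eval_fn, grid=None):
    """Grid-search the SORT IoU threshold on a tuning set (e.g., a
    synthetic dataset mimicking the target scene).  eval_fn(thr)
    should return a tracking score such as IDF1, higher is better.
    The default grid is illustrative."""
    if grid is None:
        grid = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1 .. 0.9
    return max(grid, key=eval_fn)
```

In practice, eval_fn would run the tracker on the tuning videos at each threshold and report a validation metric; the selected threshold is then carried over to the real test scene.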

Conclusions
This paper studies the role of synthetic data in multi-object tracking. Thanks to the proposed MOTX engine, we make two contributions. First, we show that association knowledge obtained from synthetic data can be directly deployed in real-world environments without domain adaptation, even though an image-style discrepancy between synthetic data and real-world data exists. Second, with the help of the MOTX engine, we thoroughly investigate how association knowledge reacts to changes in camera-related and pedestrian-related motion factors. Experimental results lead to intriguing findings that give new insights into understanding the impact of data in association knowledge learning.

Declarations
Competing interests. The authors have no competing interests to declare that are relevant to the content of this article.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.