Abstract
Multi-object tracking (MOT) is an important problem in computer vision with a wide range of applications. Formulating MOT as multi-task learning of object detection and re-identification (re-ID) in a single network is appealing because it allows joint optimization of the two tasks and is computationally efficient. However, we find that the two tasks tend to compete with each other, and this competition needs to be carefully addressed. In particular, previous works usually treat re-ID as a secondary task whose accuracy is heavily affected by the primary detection task. As a result, the network is biased toward the primary detection task, which is not fair to the re-ID task. To solve the problem, we present a simple yet effective approach termed FairMOT, based on the anchor-free object detection architecture CenterNet. Note that it is not a naive combination of CenterNet and re-ID. Instead, we present a set of detailed designs that thorough empirical studies show are critical to achieving good tracking results. The resulting approach achieves high accuracy for both detection and tracking. The approach outperforms the state-of-the-art methods by a large margin on several public datasets. The source code and pre-trained models are released at https://github.com/ifzhang/FairMOT.
Acknowledgements
This work was supported in part by NSFC (Nos. 61733007 and 61876212) and the MSRA Collaborative Research Fund. We thank all the anonymous reviewers for their valuable suggestions.
Additional information
Communicated by Bumsub Ham.
Cite this article
Zhang, Y., Wang, C., Wang, X. et al. FairMOT: On the Fairness of Detection and Re-identification in Multiple Object Tracking. Int J Comput Vis 129, 3069–3087 (2021). https://doi.org/10.1007/s11263-021-01513-4
DOI: https://doi.org/10.1007/s11263-021-01513-4