
Norm-Aware Embedding for Efficient Person Search and Tracking


Person detection and re-identification (re-ID) are two well-defined support tasks for practical applications such as person search and multiple person tracking. Person search aims to find and locate all instances that share the identity of a query person in a set of panoramic gallery images. Similarly, multiple person tracking, especially in the tracking-by-detection pipeline, requires detecting and associating all persons who appear in consecutive video frames. One major challenge shared by the two tasks comes from the contradictory goals of detection and re-identification, i.e., person detection focuses on the commonness of all persons, while person re-ID focuses on the differences among identities. Therefore, it is crucial to reconcile the two support tasks in a joint model. To this end, we present a novel approach called Norm-Aware Embedding, which disentangles the person embedding into norm and angle for detection and re-ID respectively, allowing for both effective and efficient multi-task training. We further extend the proposal-level person embedding to the pixel level, where its discrimination ability is less affected by misalignment. Our Norm-Aware Embedding achieves remarkable performance on both person search and multiple person tracking benchmarks, with the merit of being easy to train and resource-friendly.
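The norm/angle split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sigmoid mapping from norm to detection score, and the `scale` and `shift` parameters are all assumptions made for illustration. The sketch only shows the general idea that the length of an embedding can carry a person/background confidence while its direction carries identity.

```python
import numpy as np

def split_norm_angle(embeddings, eps=1e-12):
    """Split each row vector into its L2 norm and its unit-length direction."""
    norms = np.linalg.norm(embeddings, axis=1)
    directions = embeddings / np.maximum(norms, eps)[:, None]
    return norms, directions

def detection_scores(norms, scale=1.0, shift=0.0):
    """Map norms to (0, 1) with a sigmoid; scale and shift are hypothetical."""
    return 1.0 / (1.0 + np.exp(-(scale * norms - shift)))

def reid_similarity(query_direction, gallery_directions):
    """Cosine similarity between a unit-length query and unit-length gallery rows."""
    return gallery_directions @ query_direction

# Toy embeddings: rows 0 and 1 point the same way but differ in length,
# row 2 is as long as row 0 but orthogonal to it.
embeddings = np.array([[3.0, 4.0], [0.3, 0.4], [-4.0, 3.0]])
norms, directions = split_norm_angle(embeddings)
scores = detection_scores(norms, scale=1.0, shift=2.5)
similarities = reid_similarity(directions[0], directions)
```

In this toy setting, rows 0 and 1 are indistinguishable by angle (same identity direction) yet receive different detection scores, while rows 0 and 2 receive the same detection score yet have zero cosine similarity. This is the sense in which the two quantities can be trained for different goals without competing for the same degrees of freedom.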




  2. Code will be updated at this site.





This work was partially supported by the National Science Fund of China (Grant No. U1713208), Funds for International Co-operation and Exchange of the National Natural Science Foundation of China (Grant No. 61861136011), “111” Program B13022, Natural Science Foundation of Jiangsu Province, China (Grant No. BK20181299), and National Key Research and Development Program of China (Grant No. 2017YFC0820601).


Correspondence to Shanshan Zhang or Jian Yang.

Additional information

Communicated by Ivan Laptev.



About this article


Cite this article

Chen, D., Zhang, S., Yang, J. et al. Norm-Aware Embedding for Efficient Person Search and Tracking. Int J Comput Vis 129, 3154–3168 (2021).



  • Person search
  • Pedestrian detection
  • Person re-identification
  • Multiple object tracking