Temporal Coherence or Temporal Motion: Which Is More Critical for Video-Based Person Re-identification?

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12353)


Video-based person re-identification aims to match pedestrians across consecutive video sequences. While a rich line of work focuses solely on extracting motion features from pedestrian videos, we show in this paper that temporal coherence plays a more critical role. To distill the temporal coherence part of the video representation from frame representations, we propose a simple yet effective Adversarial Feature Augmentation (AFA) method, which highlights the temporal coherence features by introducing adversarially augmented temporal motion noise. Specifically, we disentangle the video representation into temporal coherence and temporal motion parts and randomly change the scale of the temporal motion features to serve as adversarial noise. The proposed AFA method is a general lightweight component that can be readily incorporated into various methods at negligible cost. We conduct extensive experiments on three challenging datasets, MARS, iLIDS-VID, and DukeMTMC-VideoReID, and the experimental results verify our argument and demonstrate the effectiveness of the proposed method.
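The abstract's disentangle-then-rescale idea can be sketched in a few lines. The paper's exact decomposition is not given here, so the following is a hypothetical minimal sketch that assumes the temporal-coherence part of a video feature is its temporal mean and the motion part is each frame's residual from that mean; the residual is then rescaled by a random factor to act as adversarial temporal-motion noise. The function name `afa_augment` and the mean/residual split are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def afa_augment(frame_feats, scale_range=(0.0, 2.0), rng=None):
    """Hypothetical sketch of Adversarial Feature Augmentation (AFA).

    frame_feats: array of shape (T, D), per-frame features of one video.
    Assumes the temporal-coherence part is the temporal mean and the
    motion part is the per-frame residual; the residual is rescaled by
    a random factor drawn from scale_range as adversarial motion noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    coherence = frame_feats.mean(axis=0, keepdims=True)  # (1, D) coherence part
    motion = frame_feats - coherence                     # (T, D) motion residuals
    alpha = rng.uniform(*scale_range)                    # random motion scale
    return coherence + alpha * motion
```

Note that because the motion residuals sum to zero over time, this augmentation perturbs the per-frame features while leaving the video-level (temporally averaged) representation unchanged, which is consistent with emphasizing the temporal-coherence part.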


Keywords: Video-based person re-identification · Temporal coherence · Feature augmentation · Adversarial learning



This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 61822603, Grant U1813218, Grant U1713214, and Grant 61672306, in part by Beijing Natural Science Foundation under Grant No. L172051, in part by Beijing Academy of Artificial Intelligence (BAAI), in part by a grant from the Institute for Guo Qiang, Tsinghua University, in part by the Shenzhen Fundamental Research Fund (Subject Arrangement) under Grant JCYJ20170412170602564, and in part by Tsinghua University Initiative Scientific Research Program.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Automation, Tsinghua University, Beijing, China
  2. State Key Lab of Intelligent Technologies and Systems, Beijing, China
  3. Beijing National Research Center for Information Science and Technology, Beijing, China
  4. Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
