Abstract
Vehicle re-identification is an important problem that has become increasingly desirable with the rapid expansion of applications in video surveillance and intelligent transportation. Recalling the identification process of human vision, we observe a natural hierarchical dependency when humans identify different vehicles. Specifically, humans always first determine a vehicle's coarse-grained category, i.e., the car model/type. Then, within the branch of the predicted car model/type, they identify the specific vehicle by relying on subtle visual cues at the fine-grained level, e.g., customized paintings and windshield stickers. Inspired by this coarse-to-fine hierarchical process, we propose an end-to-end RNN-based Hierarchical Attention (RNN-HA) classification model for vehicle re-identification. RNN-HA consists of three mutually coupled modules: the first module generates image representations for vehicle images, the second hierarchical module models the aforementioned hierarchical dependency, and the last attention module focuses on capturing the subtle visual information that distinguishes specific vehicles from each other. Comprehensive experiments on two vehicle re-identification benchmark datasets, VeRi and VehicleID, demonstrate that the proposed model achieves superior performance over state-of-the-art methods.
The first two authors contributed equally to this work. This research was supported by NSFC (National Natural Science Foundation of China) under 61772256 and the program A for Outstanding Ph.D. candidate of Nanjing University (201702A010). Liu’s participation was in part supported by ARC DECRA Fellowship (DE170101259).
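To make the three-module design described in the abstract concrete, the following is a minimal PyTorch sketch of a coarse-to-fine pipeline in the spirit of RNN-HA. It is an illustration only: the backbone, the use of a GRU cell, the soft-attention formulation, and all names and layer sizes (`RNNHASketch`, `feat_dim`, `hidden_dim`, the label counts) are assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a coarse-to-fine hierarchical-attention pipeline,
# assuming: a CNN backbone producing a spatial feature grid, a GRU cell
# unrolled over two levels (coarse model/type, then fine identity), and
# soft attention conditioned on the coarse-level state. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RNNHASketch(nn.Module):
    def __init__(self, num_models, num_ids, feat_dim=512, hidden_dim=512):
        super().__init__()
        # Module 1 (assumed): a tiny conv stack standing in for the
        # image-representation module of the paper.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(7),
        )
        # Module 2 (assumed): a GRU cell modeling the coarse-to-fine
        # dependency across the two hierarchical levels.
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        # Module 3 (assumed): soft attention over spatial regions.
        self.att = nn.Linear(hidden_dim + feat_dim, 1)
        # One classifier per level of the hierarchy.
        self.model_head = nn.Linear(hidden_dim, num_models)
        self.id_head = nn.Linear(hidden_dim, num_ids)

    def forward(self, images):
        fmap = self.backbone(images)                 # (B, C, 7, 7)
        B, C, H, W = fmap.shape
        regions = fmap.flatten(2).transpose(1, 2)    # (B, HW, C)
        pooled = regions.mean(dim=1)                 # (B, C)

        # Step 1: coarse level -- predict the car model/type.
        h0 = torch.zeros(B, self.rnn.hidden_size, device=images.device)
        h = self.rnn(pooled, h0)
        model_logits = self.model_head(h)

        # Step 2: fine level -- attend to subtle regions given the
        # coarse-level state, then predict the specific vehicle identity.
        h_exp = h.unsqueeze(1).expand(-1, H * W, -1)             # (B, HW, hid)
        scores = self.att(torch.cat([h_exp, regions], dim=-1))   # (B, HW, 1)
        alpha = F.softmax(scores, dim=1)
        attended = (alpha * regions).sum(dim=1)                  # (B, C)
        h = self.rnn(attended, h)
        id_logits = self.id_head(h)
        return model_logits, id_logits


# Hypothetical label counts, for illustration only.
model = RNNHASketch(num_models=250, num_ids=10000)
model_logits, id_logits = model(torch.randn(2, 3, 224, 224))
```

Since the abstract frames re-identification as classification, one natural training signal is a cross-entropy loss on each head, with the fine-level features reused for retrieval at test time; both choices are assumptions of this sketch rather than details taken from the paper.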
Cite this paper
Wei, X.-S., Zhang, C.-L., Liu, L., Shen, C., Wu, J.: Coarse-to-Fine: A RNN-Based Hierarchical Attention Model for Vehicle Re-identification. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds.) Computer Vision – ACCV 2018. LNCS, vol. 11362. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20890-5_37