Online object tracking based on BLSTM-RNN with contextual-sequential labeling

Original Research

Abstract

The significance of object context for appearance modeling has been verified in various tracking-by-detection approaches. Unfortunately, representing the target’s contextual relationships only within the spatial domain is restrictive and has severely limited their utility with high-level classification strategies. By investigating the capability of learning long-term dependencies from sequential data, in this paper we propose a novel appearance model that transforms the target’s contextual dependency into a semantic sequential representation. This representation can be effectively utilized by a recurrent neural network with bidirectional long short-term memory cells (BLSTM-RNN) for online tracking-by-learning. Based on the trained BLSTM-RNN model, a search mechanism driven by the labeling score is proposed to improve tracking robustness. Using the appearance variation implied by the labeling, together with a heuristic strategy for model updating, the proposed tracking method is demonstrated to outperform most state-of-the-art trackers on challenging benchmark videos.
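To make the idea of bidirectional sequence labeling concrete, the following is a minimal, generic sketch of a BLSTM forward pass that assigns label probabilities to every step of a feature sequence. It is not the model described in this paper: the weights are random and untrained, the input sequence is a random stand-in for the contextual-sequential representation, and all names (`make_lstm`, `lstm_pass`, `blstm_label`, `N_IN`, `N_HID`, `N_LABELS`) are hypothetical choices made for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm(n_in, n_hid, rng):
    """Random weights for one LSTM direction: 4 gates, each n_hid x (n_in + n_hid + 1)."""
    cols = n_in + n_hid + 1  # input, previous hidden state, bias
    return [[[rng.uniform(-0.1, 0.1) for _ in range(cols)]
             for _ in range(n_hid)] for _ in range(4)]

def lstm_pass(weights, seq, n_hid):
    """Run one LSTM direction over seq and return the hidden state at each step."""
    h, c, outputs = [0.0] * n_hid, [0.0] * n_hid, []
    for x in seq:
        z = list(x) + h + [1.0]  # concatenate input, previous state, bias term
        pre = [[sum(w * v for w, v in zip(row, z)) for row in gate] for gate in weights]
        i = [sigmoid(a) for a in pre[0]]    # input gate
        f = [sigmoid(a) for a in pre[1]]    # forget gate
        o = [sigmoid(a) for a in pre[2]]    # output gate
        g = [math.tanh(a) for a in pre[3]]  # candidate cell update
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        outputs.append(h)
    return outputs

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def blstm_label(seq, fwd, bwd, out_w, n_hid):
    """Per-step label probabilities from a bidirectional LSTM."""
    hf = lstm_pass(fwd, seq, n_hid)
    hb = lstm_pass(bwd, seq[::-1], n_hid)[::-1]  # re-align backward states with time
    probs = []
    for a, b in zip(hf, hb):
        z = a + b + [1.0]  # forward state, backward state, bias term
        probs.append(softmax([sum(w * v for w, v in zip(row, z)) for row in out_w]))
    return probs

# Hypothetical sizes and untrained random weights, for illustration only.
rng = random.Random(0)
N_IN, N_HID, N_LABELS, T = 4, 3, 2, 5
fwd = make_lstm(N_IN, N_HID, rng)
bwd = make_lstm(N_IN, N_HID, rng)
out_w = [[rng.uniform(-0.1, 0.1) for _ in range(2 * N_HID + 1)]
         for _ in range(N_LABELS)]
seq = [[rng.random() for _ in range(N_IN)] for _ in range(T)]
probs = blstm_label(seq, fwd, bwd, out_w, N_HID)
```

Because each step's label distribution is computed from both the forward and the backward hidden states, it depends on the whole sequence rather than only on the steps before it. This is the property that makes a bidirectional network a natural fit for labeling a sequence whose elements are mutually dependent.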

Keywords

Visual tracking · Tracking-by-detection · RNN · LSTM · Sequence labeling

Notes

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61571363) and the National High Technology Research and Development Program of China (Grant No. 2015AA016402).

Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  1. School of Computer Science, Northwestern Polytechnical University, Xi’an, China
