Online Multi-modal Person Search in Videos

Xia, Jiangyue; Rao, Anyi; Huang, Qingqiu; Xu, Linning; Wen, Jiangtao; Lin, Dahua

doi:10.1007/978-3-030-58610-2_11

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12357)

Included in the following conference series:

European Conference on Computer Vision

Abstract

The task of searching for specific people in videos has growing potential in real-world applications such as video organization and editing. Most existing approaches work offline: identities can be inferred only after the entire video has been examined. This precludes their use in online services or in applications that require real-time responses. In this paper, we propose an online person search framework that recognizes people in a video on the fly. At its heart, the framework maintains a multi-modal memory bank as the basis for person recognition and updates it dynamically with a policy obtained by reinforcement learning. Our experiments on a large movie dataset show that the proposed method is effective, not only achieving remarkable improvements over online schemes but also outperforming offline methods.
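
To make the online setting concrete, the following is a minimal sketch of a memory bank that labels each incoming instance immediately and then decides whether to store it. It is not the authors' implementation. The class `MemoryBank`, the modality names `face` and `audio`, the cosine-similarity scoring, and the fixed `update_threshold` are all assumptions made for illustration. In particular, the fixed threshold only stands in for the update policy that the paper obtains by reinforcement learning.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class MemoryBank:
    """Hypothetical per-identity store of features, keyed by modality."""

    def __init__(self, update_threshold=0.9):
        # identity -> modality -> list of stored feature vectors
        self.slots = defaultdict(lambda: defaultdict(list))
        # Assumption: a fixed threshold replaces the learned update policy.
        self.update_threshold = update_threshold

    def enroll(self, identity, modality, feature):
        """Store one feature vector for an identity and modality."""
        self.slots[identity][modality].append(list(feature))

    def score(self, identity, instance):
        """Mean, over modalities present on both sides, of the best match."""
        sims = []
        for modality, feature in instance.items():
            stored = self.slots[identity].get(modality)
            if stored:
                sims.append(max(cosine(feature, s) for s in stored))
        return sum(sims) / len(sims) if sims else 0.0

    def recognize(self, instance):
        """Label one instance right away, then update the bank if confident."""
        if not self.slots:
            return None, 0.0
        best = max(self.slots, key=lambda ident: self.score(ident, instance))
        confidence = self.score(best, instance)
        if confidence >= self.update_threshold:
            for modality, feature in instance.items():
                self.enroll(best, modality, feature)
        return best, confidence


# Usage: enroll one face feature per identity, then process a stream.
bank = MemoryBank(update_threshold=0.9)
bank.enroll("alice", "face", [1.0, 0.0])
bank.enroll("bob", "face", [0.0, 1.0])

# A confident face match also writes the co-occurring audio feature.
label, conf = bank.recognize({"face": [0.9, 0.1], "audio": [0.2, 0.8]})

# A later audio-only instance can then be matched through the stored audio.
label2, conf2 = bank.recognize({"audio": [0.2, 0.8]})
```

In this toy run the first instance is matched by face alone, and because the match is confident its audio feature is written to the same identity. The second, audio-only instance can then be recognized without any face, which is the kind of cross-modal benefit a dynamically updated memory is meant to provide.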

Acknowledgment

This work is partially supported by the SenseTime Collaborative Grant on Large-scale Multi-modality Analysis (CUHK Agreement No. TS1610626 & No. TS1712093), the General Research Fund (GRF) of Hong Kong (No. 14203518 & No. 14205719), and Innovation and Technology Support Program (ITSP) Tier 2, ITS/431/18F.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, China
Jiangyue Xia & Jiangtao Wen
CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, Hong Kong, China
Anyi Rao, Qingqiu Huang, Linning Xu & Dahua Lin

Corresponding author

Correspondence to Anyi Rao.

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

About this paper

Cite this paper

Xia, J., Rao, A., Huang, Q., Xu, L., Wen, J., Lin, D. (2020). Online Multi-modal Person Search in Videos. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12357. Springer, Cham. https://doi.org/10.1007/978-3-030-58610-2_11

DOI: https://doi.org/10.1007/978-3-030-58610-2_11
Published: 07 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58609-6
Online ISBN: 978-3-030-58610-2
eBook Packages: Computer Science, Computer Science (R0)
