Frame Aggregation and Multi-modal Fusion Framework for Video-Based Person Recognition

Li, Fangtao; Wang, Wenzhe; Liu, Zihe; Wang, Haoran; Yan, Chenghao; Wu, Bin

doi:10.1007/978-3-030-67832-6_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12572)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

Video-based person recognition is challenging because people in videos are often occluded or blurred and because shooting angles vary. Previous research has mostly focused on person recognition in still images, ignoring the similarity and continuity between video frames. To tackle these challenges, we propose a novel Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition, which aggregates face features and combines them with multi-modal information to identify persons in videos. For frame aggregation, we propose a novel trainable layer based on NetVLAD (named AttentionVLAD), which takes an arbitrary number of features as input and computes a fixed-length aggregated feature based on feature quality. We show that introducing an attention mechanism into NetVLAD effectively decreases the impact of low-quality frames. For the multi-modal information of videos, we propose a Multi-Layer Multi-Modal Attention (MLMA) module that learns the correlations among modalities by adaptively updating a correlation Gram matrix. Experimental results on the iQIYI-VID-2019 dataset show that our framework outperforms other state-of-the-art methods.
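
The aggregation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' AttentionVLAD layer, whose exact formulation is not given on this page. It assumes a NetVLAD-style soft assignment of each frame feature to K cluster centers, plus a hypothetical per-frame quality score (a sigmoid of a linear projection) that scales each frame's contribution. The names `attention_vlad_sketch`, `assign_w`, and `quality_w` are illustrative assumptions, and all parameters are fixed by hand here rather than trained.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def _l2_normalize(v, eps=1e-12):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / (norm + eps) for a in v]

def attention_vlad_sketch(frames, centers, assign_w, quality_w):
    """Aggregate N frame features (each of length D) into one vector of length K*D.

    Assumed (hypothetical) parameters, not taken from the paper:
      centers   -- K cluster centers, each of length D
      assign_w  -- K weight vectors producing soft-assignment logits
      quality_w -- one weight vector producing a per-frame quality logit
    """
    K, D = len(centers), len(centers[0])
    residual_sums = [[0.0] * D for _ in range(K)]
    for x in frames:
        # NetVLAD-style soft assignment of this frame to each cluster.
        assign = _softmax([_dot(w, x) for w in assign_w])
        # Assumed attention term: a scalar quality score in (0, 1).
        quality = 1.0 / (1.0 + math.exp(-_dot(quality_w, x)))
        for k in range(K):
            weight = quality * assign[k]
            for j in range(D):
                residual_sums[k][j] += weight * (x[j] - centers[k][j])
    # Normalize each cluster's residual, flatten, then normalize the whole vector.
    flat = [a for row in residual_sums for a in _l2_normalize(row)]
    return _l2_normalize(flat)

# Toy parameters: K = 2 clusters, D = 2 feature dimensions.
centers = [[0.0, 0.0], [1.0, 1.0]]
assign_w = [[1.0, 0.0], [0.0, 1.0]]
quality_w = [0.5, -0.5]

short_clip = [[0.2, 0.1], [0.9, 1.1]]
long_clip = short_clip + [[0.4, 0.6], [1.2, 0.8], [0.1, 0.3]]

short_vec = attention_vlad_sketch(short_clip, centers, assign_w, quality_w)
long_vec = attention_vlad_sketch(long_clip, centers, assign_w, quality_w)
print(len(short_vec), len(long_vec))  # both have length K * D = 4
```

In this sketch the output length depends only on K and D, so clips with different numbers of frames map to vectors of the same size, and a frame with a low quality score contributes proportionally less to every cluster's residual sum.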

This work is supported by the National Key Research and Development Program of China (2018YFC0831500), the National Natural Science Foundation of China under Grant No. 61972047, the NSFC-General Technology Basic Research Joint Funds under Grant U1936220 and the Fundamental Research Funds for the Central Universities (2019XD-D01).

Author information

Authors and Affiliations

Beijing University of Posts and Telecommunications, Beijing, China
Fangtao Li, Wenzhe Wang, Zihe Liu, Haoran Wang, Chenghao Yan & Bin Wu

Corresponding author

Correspondence to Bin Wu.

Editor information

Editors and Affiliations

Charles University, Prague, Czech Republic
Jakub Lokoč
Charles University, Prague, Czech Republic
Tomáš Skopal
Klagenfurt University, Klagenfurt, Austria
Klaus Schoeffmann
CERTH-ITI, Thessaloniki, Greece
Vasileios Mezaris
Renmin University of China, Beijing, China
Xirong Li
CERTH-ITI, Thessaloniki, Greece
Stefanos Vrochidis
Queen Mary University of London, London, UK
Ioannis Patras

About this paper

Cite this paper

Li, F., Wang, W., Liu, Z., Wang, H., Yan, C., Wu, B. (2021). Frame Aggregation and Multi-modal Fusion Framework for Video-Based Person Recognition. In: Lokoč, J., et al. MultiMedia Modeling. MMM 2021. Lecture Notes in Computer Science, vol 12572. Springer, Cham. https://doi.org/10.1007/978-3-030-67832-6_7

DOI: https://doi.org/10.1007/978-3-030-67832-6_7
Published: 21 January 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67831-9
Online ISBN: 978-3-030-67832-6
eBook Packages: Computer Science, Computer Science (R0)
