
Frame Aggregation and Multi-modal Fusion Framework for Video-Based Person Recognition

  • Conference paper
  • Conference: MultiMedia Modeling (MMM 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12572)

Abstract

Video-based person recognition is challenging due to occlusion, blur, and variations in shooting angle. Previous research has mostly focused on person recognition in still images, ignoring the similarity and continuity between video frames. To tackle these challenges, we propose a novel Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition, which aggregates face features and combines them with multi-modal information to identify persons in videos. For frame aggregation, we propose a novel trainable layer based on NetVLAD, named AttentionVLAD, which takes an arbitrary number of features as input and computes a fixed-length aggregated feature weighted by feature quality. We show that introducing an attention mechanism into NetVLAD effectively reduces the impact of low-quality frames. For the multi-modal information in videos, we propose a Multi-Layer Multi-Modal Attention (MLMA) module that learns cross-modal correlations by adaptively updating a correlation Gram matrix. Experimental results on the iQIYI-VID-2019 dataset show that our framework outperforms other state-of-the-art methods.

This work is supported by the National Key Research and Development Program of China (2018YFC0831500), the National Natural Science Foundation of China under Grant No. 61972047, the NSFC-General Technology Basic Research Joint Funds under Grant U1936220 and the Fundamental Research Funds for the Central Universities (2019XD-D01).
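To make the aggregation idea concrete, here is a minimal sketch of an attention-weighted NetVLAD layer in the spirit of AttentionVLAD: a learned per-frame score down-weights low-quality frames before the usual VLAD residual pooling, so any number of frame features collapses into one fixed-length descriptor. The layer sizes, the single-linear-layer attention, and the normalization scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionVLAD(nn.Module):
    """Sketch of an attention-weighted NetVLAD aggregation layer.

    Takes a variable number N of D-dim frame features and returns a
    fixed-length K*D descriptor; a learned per-frame attention score
    down-weights low-quality frames before VLAD residual pooling.
    Dimensions and sub-layer choices are assumptions for illustration.
    """
    def __init__(self, feature_dim: int = 512, num_clusters: int = 8):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, feature_dim))
        self.assign = nn.Linear(feature_dim, num_clusters)  # soft cluster assignment
        self.attention = nn.Linear(feature_dim, 1)          # per-frame quality score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, D) frame features; N may vary per video
        alpha = torch.softmax(self.attention(x), dim=0)         # (N, 1) frame weights
        soft = torch.softmax(self.assign(x), dim=1)             # (N, K) assignments
        residuals = x.unsqueeze(1) - self.centers.unsqueeze(0)  # (N, K, D)
        weights = (alpha * soft).unsqueeze(-1)                  # (N, K, 1)
        vlad = (weights * residuals).sum(dim=0)                 # (K, D)
        vlad = F.normalize(vlad, dim=1)                         # intra-normalization
        return F.normalize(vlad.flatten(), dim=0)               # (K*D,) descriptor

# Example: aggregate 37 frames of 512-dim face features into one 4096-dim vector
feats = torch.randn(37, 512)
print(AttentionVLAD()(feats).shape)  # torch.Size([4096])
```

Because the attention weights multiply the soft assignments, a frame the network scores as low quality contributes little to every cluster's residual sum, which is the stated motivation for adding attention to NetVLAD.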
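Similarly, a hedged sketch of the Gram-matrix idea behind MLMA: each modality embedding is re-weighted by its correlation with the others, and stacking blocks recomputes the Gram matrix layer by layer ("adaptively updating" it). The block name, the residual update, and the modality set (e.g. face, audio, body, text) are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLMABlock(nn.Module):
    """Sketch of one layer of Gram-matrix-driven multi-modal attention.
    Structure and update rule are illustrative assumptions."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (M, D) -- one embedding per modality (e.g. face, audio, body, text)
        z = F.normalize(self.proj(feats), dim=1)
        gram = z @ z.t()                   # (M, M) cross-modal correlation Gram matrix
        attn = torch.softmax(gram, dim=1)  # row-normalized correlation weights
        return feats + attn @ feats        # re-weight each modality by the others

# Stacking blocks recomputes the Gram matrix at every layer:
mlma = nn.Sequential(MLMABlock(), MLMABlock())
fused = mlma(torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```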


References

  1. Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: CVPR, pp. 5297–5307 (2016)

  2. Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2010)

  3. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019)

  4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)

  6. Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: a database for studying face recognition in unconstrained environments (2008)

  7. Li, C., et al.: Deep speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304 (2017)

  8. Lin, R., Xiao, J., Fan, J.: NeXtVLAD: an efficient neural network to aggregate frame-level features for large-scale video classification. In: ECCV Workshops (2018)

  9. Liu, Y., et al.: iQIYI-VID: a large dataset for multi-modal person identification. arXiv preprint arXiv:1811.07548 (2018)

  10. Liu, Y., et al.: iQIYI celebrity video identification challenge. In: ACM MM, pp. 2516–2520 (2019)

  11. Liu, Z., Shen, Y., Lakshminarasimhan, V.B., Liang, P.P., Zadeh, A.B., Morency, L.P.: Efficient low-rank multimodal fusion with modality-specific factors. In: ACL, pp. 2247–2256 (2018)

  12. Mai, S., Hu, H., Xing, S.: Modality to modality translation: an adversarial representation learning and graph fusion network for multimodal fusion. In: AAAI, pp. 164–172 (2020)

  13. Song, G., Leng, B., Liu, Y., Hetang, C., Cai, S.: Region-based quality estimation network for large-scale person re-identification. In: AAAI (2018)

  14. Tang, P., Wang, X., Shi, B., Bai, X., Liu, W., Tu, Z.: Deep FisherNet for object classification. arXiv preprint arXiv:1608.00182 (2016)

  15. Tsai, Y.H.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L.P., Salakhutdinov, R.: Multimodal transformer for unaligned multimodal language sequences. In: ACL, pp. 6558–6569 (2019)

  16. Whitelam, C., et al.: IARPA Janus benchmark-B face dataset. In: CVPR, pp. 90–98 (2017)

  17. Xuan, H., Zhang, Z., Chen, S., Yang, J., Yan, Y.: Cross-modal attention network for temporal inconsistent audio-visual event localization. In: AAAI, pp. 279–286 (2020)

  18. Zadeh, A., Chen, M., Poria, S., Cambria, E., Morency, L.P.: Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250 (2017)

  19. Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. In: ICML, pp. 7354–7363 (2019)

  20. Zheng, Z., Yang, X., Yu, Z., Zheng, L., Yang, Y., Kautz, J.: Joint discriminative and generative learning for person re-identification. In: CVPR, pp. 2138–2147 (2019)

  21. Zhong, Y., Arandjelović, R., Zisserman, A.: GhostVLAD for set-based face recognition. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11362, pp. 35–50. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20890-5_3


Author information


Corresponding author

Correspondence to Bin Wu.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Li, F., Wang, W., Liu, Z., Wang, H., Yan, C., Wu, B. (2021). Frame Aggregation and Multi-modal Fusion Framework for Video-Based Person Recognition. In: Lokoč, J., et al. (eds.) MultiMedia Modeling. MMM 2021. Lecture Notes in Computer Science, vol 12572. Springer, Cham. https://doi.org/10.1007/978-3-030-67832-6_7


  • DOI: https://doi.org/10.1007/978-3-030-67832-6_7


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67831-9

  • Online ISBN: 978-3-030-67832-6

  • eBook Packages: Computer Science (R0)
