
Joint embeddings with multimodal cues for video-text retrieval

  • Niluthpol C. Mithun
  • Juncheng Li
  • Florian Metze
  • Amit K. Roy-Chowdhury
Regular Paper

Abstract

In multimedia applications, a joint representation that carries information from multiple modalities can be highly valuable for downstream tasks. In this paper, we study how to effectively exploit the multimodal cues available in videos to learn joint representations for cross-modal video-text retrieval. Existing hand-labeled video-text datasets are small relative to the enormous diversity of the visual world, which makes it difficult to build a robust video-text retrieval system based on deep neural network models. To address this, we propose a framework that simultaneously utilizes multimodal visual cues via a “mixture of experts” approach for retrieval. We conduct extensive experiments verifying that our system boosts retrieval performance over the state of the art. In addition, we propose a modified pairwise ranking loss for training the embedding and study the effect of various loss functions. Experiments on two benchmark datasets show that our approach yields significant gains compared to the state of the art.
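The modified pairwise ranking loss mentioned above builds on the standard bidirectional max-margin ranking objective commonly used for joint visual-text embeddings. The snippet below is a minimal PyTorch sketch of that generic objective with an optional emphasis on the hardest negative in each batch; the function name, margin value, and hard-negative switch are illustrative assumptions and do not reproduce the authors' exact formulation or their mixture-of-experts fusion.

```python
import torch
import torch.nn.functional as F


def pairwise_ranking_loss(video_emb, text_emb, margin=0.2, hard_negatives=True):
    """Bidirectional max-margin ranking loss over a batch of (video, text) pairs.

    video_emb, text_emb: (B, D) tensors; row i of each forms a matching pair.
    Generic sketch only, not the authors' exact loss.
    """
    # Cosine similarity matrix: scores[i, j] = sim(video_i, text_j)
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    scores = v @ t.t()                              # (B, B)
    positives = scores.diag().view(-1, 1)           # matched-pair similarities

    # Margin violations for every non-matching pair, in both directions
    cost_txt = (margin + scores - positives).clamp(min=0)      # video -> text
    cost_vid = (margin + scores - positives.t()).clamp(min=0)  # text -> video

    # Zero out the diagonal so matching pairs contribute no cost
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0.0)
    cost_vid = cost_vid.masked_fill(mask, 0.0)

    if hard_negatives:
        # Keep only the hardest negative per query (per row / per column)
        return cost_txt.max(dim=1)[0].mean() + cost_vid.max(dim=0)[0].mean()
    return cost_txt.mean() + cost_vid.mean()


# Example usage with random embeddings (batch of 8, 512-d joint space)
videos = torch.randn(8, 512)
captions = torch.randn(8, 512)
loss = pairwise_ranking_loss(videos, captions)
```

In a mixture-of-experts setup such as the one described in the abstract, the video embedding would typically be replaced by per-expert embeddings (e.g., appearance, motion, audio cues), each trained against the text embedding and fused at retrieval time.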

Keywords

Video-text retrieval · Joint embedding · Multimodal cues

Acknowledgements

This work was partially supported by NSF grants 33384, IIS-1746031, CNS-1544969, ACI-1548562, and ACI-1445606. J. Li was supported by the Bosch Graduate Fellowship to CMU LTI. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. University of California, Riverside, USA
  2. Carnegie Mellon University, Pittsburgh, USA