
Enhancing multimodal deep representation learning by fixed model reuse

Published in Multimedia Tools and Applications

Abstract

Inconsistent distributions and representations of different modalities, such as image, text, and audio, cause the “media gap”, which poses a great challenge to processing such heterogeneous data. Current state-of-the-art multimodal approaches focus mainly on the data provided by the target task, neglecting the extra information available from different but related tasks. In this paper, we explore a multimodal representation learning architecture that leverages embedding representations trained from such extra information. Specifically, we integrate the approach of fixed model reuse into our architecture, which incorporates helpful information from existing models/features into a new model. Based on the proposed architecture, we study two tasks: multilingual OCR and long-text-based image retrieval. Multilingual OCR is a difficult task that deals with multiple languages on the same page; we exploit the textual embedding layer of an existing text-generating model to improve its accuracy. For long-text-based image retrieval, a cross-modal task, we leverage the intermediate visual embedding layer of an off-the-shelf image-captioning model to enhance retrieval. The experimental results validate the effectiveness of the proposed architecture in narrowing the “media gap” and show observable improvements on both tasks: it outperforms state-of-the-art approaches by 4.2% in accuracy on the multilingual OCR task, and it improves the median rank of retrieval results from 9 to 6 on the long-text-based image retrieval task.
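As a rough illustration of the fixed-model-reuse idea, the minimal PyTorch sketch below freezes a module reused from an existing model and concatenates its embedding with a trainable target-task branch. The module names, dimensionalities, and fusion-by-concatenation choice are assumptions made for illustration, not the exact architecture described in the paper.

```python
# Minimal sketch of fixed model reuse (illustrative only): a frozen module
# reused from an existing model provides extra embeddings that are fused
# with a trainable target-task branch. Names and dimensions are assumed.
import torch
import torch.nn as nn

class FixedModelReuseNet(nn.Module):
    def __init__(self, reused: nn.Module, reused_dim: int,
                 task_dim: int, num_classes: int):
        super().__init__()
        self.reused = reused
        for p in self.reused.parameters():
            p.requires_grad = False          # keep the reused model fixed
        self.task_branch = nn.Sequential(    # trainable target-task encoder
            nn.Linear(task_dim, 256), nn.ReLU())
        self.classifier = nn.Linear(reused_dim + 256, num_classes)

    def forward(self, task_x, reused_x):
        with torch.no_grad():                # no gradients into the reused part
            extra = self.reused(reused_x)    # (batch, reused_dim) embedding
        fused = torch.cat([self.task_branch(task_x), extra], dim=-1)
        return self.classifier(fused)

# Hypothetical usage: reuse a pre-trained encoder producing 512-d embeddings
# while training a 10-class target task on 300-d inputs.
# net = FixedModelReuseNet(pretrained_encoder, 512, task_dim=300, num_classes=10)
```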



Notes

  1. https://github.com/yashk2810/Image-Captioning

  2. https://pypi.org/project/jieba/
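For context on note 2, jieba is a Chinese word-segmentation library; the sketch below shows the kind of tokenization step it could perform on long text before retrieval. The sample sentence is invented for demonstration and is not taken from the paper's data.

```python
# Hypothetical illustration of note 2: segmenting Chinese long text with
# jieba before feeding it to a text encoder. The input sentence is a
# made-up example ("multimodal deep representation learning").
import jieba

tokens = jieba.lcut("多模态深度表示学习")
print(tokens)  # e.g. ['多模态', '深度', '表示', '学习'] (actual split may vary)
```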


Acknowledgements

This work is supported by the National Social Science Foundation of China (Grant No. 15BGL048), the Hubei Province Science and Technology Support Project (Grant No. 2015BAA072), the Hubei Provincial Natural Science Foundation of China (Grant No. 2017CFA012), and the Fundamental Research Funds for the Central Universities (WUT: 2017II39GX).

Author information

Corresponding author

Correspondence to Zhongwei Xie.



About this article


Cite this article

Xie, Z., Li, L., Zhong, X. et al. Enhancing multimodal deep representation learning by fixed model reuse. Multimed Tools Appl 78, 30769–30791 (2019). https://doi.org/10.1007/s11042-018-6556-6

