Cross-Modal Event Retrieval: A Dataset and a Baseline Using Deep Semantic Learning

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11165)


In this paper, we propose to learn a Deep Semantic Space (DSS) for cross-modal event retrieval by exploiting deep learning models to jointly extract semantic features from images and textual articles. More specifically, a VGG network is used to transfer deep semantic knowledge from a large-scale image dataset to the target image dataset. Simultaneously, a fully-connected network is designed to model semantic representations from textual features (e.g., TF-IDF, LDA). The resulting deep semantic representations for images and text are then mapped into a high-level semantic space, in which the distance between data samples can be measured directly for cross-modal event retrieval. We also collect a dataset, the Wiki-Flickr event dataset, for cross-modal event retrieval, in which the data are only weakly aligned, unlike the image-text pairs in existing cross-modal retrieval datasets. Extensive experiments on both the Pascal Sentence dataset and our Wiki-Flickr event dataset show that DSS outperforms the state-of-the-art approaches.
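As a rough illustration of retrieval in a shared semantic space, the sketch below projects image and text features with separate heads and ranks gallery items by similarity to a query. The abstract does not specify layer sizes, the training loss, or the distance measure, so the linear heads, the softmax output over event classes, and the cosine similarity used here are all assumptions, and the names `project` and `cosine_rank` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def project(features, weights, bias):
    """Map modality-specific features into a shared semantic space.

    Assumption: the shared space is a vector of event-class
    probabilities produced by a single linear layer plus softmax.
    """
    return softmax(features @ weights + bias)


def cosine_rank(query, gallery):
    """Return gallery indices sorted from most to least similar."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims), sims


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_classes = 3

    # Stand-ins for VGG image features (dim 8) and TF-IDF/LDA text features (dim 5).
    image_feats = rng.normal(size=(4, 8))
    text_feats = rng.normal(size=(1, 5))

    # Untrained random heads, used only to show the data flow.
    w_img, b_img = rng.normal(size=(8, n_classes)), np.zeros(n_classes)
    w_txt, b_txt = rng.normal(size=(5, n_classes)), np.zeros(n_classes)

    image_sem = project(image_feats, w_img, b_img)
    text_sem = project(text_feats, w_txt, b_txt)

    # Text-to-image retrieval: rank all images for the single text query.
    order, sims = cosine_rank(text_sem[0], image_sem)
    print("ranking:", order)
```

Because both modalities land in the same space, the same ranking function serves image-to-text retrieval by swapping the roles of query and gallery.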


Cross-modal event retrieval, Deep learning, Common space



The authors would like to thank Zehang Lin and Feitao Huang for data collection. This work is supported by the National Natural Science Foundation of China (No. 61703109, No. 91748107, No. U1611461), the Guangdong Innovative Research Team Program (No. 2014ZT05G157), the Science and Technology Program of Guangdong Province, China (No. 2016A010101012), the CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, 100190, Beijing, China (No. CASNDST201703), and an internal grant from City University of Hong Kong (Project No. 9610367).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
  2. School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
  3. Department of Computer Science, City University of Hong Kong, Hong Kong, China
