An Adversarial Learning and Canonical Correlation Analysis Based Cross-Modal Retrieval Model

  • Thi-Hong Vuong
  • Thanh-Huyen Pham
  • Tri-Thanh Nguyen
  • Quang-Thuy Ha
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11431)


The key to cross-modal retrieval is finding a maximally correlated subspace among multiple datasets. This paper introduces a novel Adversarial Learning and Canonical Correlation Analysis based Cross-Modal Retrieval (ALCCA-CMR) model. For each modality, the ALCCA phase finds an effective common subspace and computes similarity through a canonical correlation analysis embedding for cross-modal retrieval. We demonstrate an application of the ALCCA-CMR model to a dataset with two modalities. Experimental results on real music data show that the proposed method is effective in comparison with existing methods.
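
The retrieval step that the abstract describes, projecting two modalities into a maximally correlated subspace and ranking by similarity there, can be illustrated with plain linear CCA. The following is a minimal sketch under stated assumptions, not the ALCCA-CMR implementation: the adversarial learning stage is omitted because the abstract does not specify it, the data are synthetic, and the names `fit_cca`, `rank_candidates`, and the ridge term `reg` are hypothetical choices made for this illustration.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T

def fit_cca(X, Y, k, reg=1e-3):
    """Fit linear CCA on paired rows of X and Y; keep the top-k components."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularised covariance estimates (reg is an assumed ridge term).
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # (regularised) canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    return mx, my, Kx @ U[:, :k], Ky @ Vt[:k].T, s[:k]

def rank_candidates(query, candidates):
    """Return candidate indices sorted by decreasing cosine similarity."""
    q = query / np.linalg.norm(query)
    C = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(C @ q))

# Synthetic paired data: two views generated from a shared latent factor.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 4))
X = Z @ rng.normal(size=(4, 10)) + 0.01 * rng.normal(size=(200, 10))
Y = Z @ rng.normal(size=(4, 8)) + 0.01 * rng.normal(size=(200, 8))

mx, my, Wx, Wy, corr = fit_cca(X, Y, k=4)
Px, Py = (X - mx) @ Wx, (Y - my) @ Wy  # both views in the common subspace

# Query with each item of view X and check whether its pair in view Y ranks first.
top1 = np.array([rank_candidates(Px[i], Py)[0] for i in range(len(Px))])
accuracy = float(np.mean(top1 == np.arange(len(Px))))
print("canonical correlations:", np.round(corr, 3))
print("top-1 retrieval accuracy on the toy pairs:", accuracy)
```

The sketch evaluates on the same pairs it was fitted on, which is only meaningful as a sanity check; a real experiment would hold out query and candidate items.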


Cross-modal retrieval · Adversarial learning · Canonical correlation analysis


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Thi-Hong Vuong (1)
  • Thanh-Huyen Pham (1, 2)
  • Tri-Thanh Nguyen (1)
  • Quang-Thuy Ha (1)
  1. Vietnam National University, Hanoi (VNU), VNU-University of Engineering and Technology (UET), Hanoi, Vietnam
  2. Ha Long University, Quang Ninh, Vietnam
