Visual-Textual Late Semantic Fusion Using Deep Neural Network for Document Categorization

  • Cheng Wang
  • Haojin Yang
  • Christoph Meinel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9489)


Multi-modality fusion has recently drawn much attention due to the rapid growth of multimedia data. A document that consists of multiple modalities, i.e. image, text, and video, can be better understood by machines if the information from the different modalities is semantically combined. In this paper, we propose to fuse image and text information with a deep neural network (DNN)-based approach. By jointly fusing visual and textual features and taking the correlation between image and text into account, fusion features can be learned to represent a document. We investigated the fusion features on document categorization and found that DNN-based fusion outperforms mainstream algorithms, including K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naive Bayes (NB), and a 3-layer Neural Network (3L-NN), in both early and late fusion strategies.


Keywords: Categorization · Semantic feature · Late fusion · Deep neural network
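The late-fusion strategy compared in the abstract can be illustrated with a minimal sketch: each modality produces its own class-probability scores, and the scores are combined only at decision time. This is not the authors' implementation; the weighted-average rule, the `alpha` weight, and the toy logits below are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    """Convert per-class logits to probabilities, row-wise."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def late_fusion(image_logits, text_logits, alpha=0.5):
    """Late fusion: average the per-modality class probabilities.

    alpha weights the image modality; (1 - alpha) weights text.
    """
    p_img = softmax(image_logits)
    p_txt = softmax(text_logits)
    return alpha * p_img + (1 - alpha) * p_txt

# Toy example: 2 documents, 3 classes (logits are made up).
img = np.array([[2.0, 0.5, 0.1],
                [0.2, 1.5, 0.3]])
txt = np.array([[1.8, 0.2, 0.4],
                [0.1, 0.3, 2.0]])

fused = late_fusion(img, txt, alpha=0.6)
pred = fused.argmax(axis=1)  # fused class decision per document
```

Because fusion happens on probability scores rather than raw features, each modality can use its own feature extractor and classifier; the DNN-based approach in the paper instead learns the combination jointly rather than fixing a hand-set weight.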



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
