Multimedia Tools and Applications, Volume 78, Issue 1, pp 437–456

Multimedia integrated annotation based on common space learning

  • Feng Tian
  • Xianmei Liu
  • Zhuoxuan Liu
  • Ning Sun
  • Mei Wang
  • Haochang Wang
  • Fengquan Zhang


Multimedia automatic annotation, which assigns text labels to multimedia objects, has been widely studied. However, existing methods usually focus on modeling two types of media data or their pairwise correlation. In fact, heterogeneous media are complementary to each other, and optimizing them simultaneously can further improve accuracy. In this paper, a novel common space learning (CSL) algorithm for multimedia integrated annotation is presented, by which heterogeneous media data are projected into a unified space and multimedia annotation is transformed into a nearest neighbor search in that space. Optimizing these heterogeneous media simultaneously makes them complementary to each other and aligned in the common space. We formulate the proposed CSL as an optimization problem with two main requirements. First, media objects of different types with similar labels should be close in the common space. Second, media similarity in the original space and in the common space should be consistent. We solve the optimization problem in a sparse, semi-supervised learning framework, so that more unlabeled data can be integrated into the learning process, which boosts the performance of space learning. In addition, we propose an iterative optimization algorithm to solve the problem. Since the projected samples in the common space share the same representation, labels for a new media object are assigned by a simple nearest neighbor voting mechanism. To the best of our knowledge, our method is the first attempt at multimedia integrated annotation. Experiments on data sets with up to four media types (image, sound, video and 3D model) show the effectiveness of our proposed approach compared with state-of-the-art methods.
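The final annotation step described above (nearest neighbor voting over projected samples that share one representation) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the common-space coordinates and tags below are hypothetical, and the learned projection that produces them is assumed to have been applied already.

```python
import numpy as np
from collections import Counter

def annotate_by_voting(query_common, labeled_common, labels, k=3):
    """Assign a label to a query already projected into the common space
    by majority vote among its k nearest labeled neighbors there."""
    # Euclidean distances are meaningful because all media types share
    # the same representation after projection into the common space.
    dists = np.linalg.norm(labeled_common - query_common, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy common-space coordinates for labeled samples of mixed media types
# (e.g. images, sounds, 3D models) — purely illustrative values.
labeled = np.array([[0.10, 0.20], [0.15, 0.25], [0.90, 0.80], [0.85, 0.90]])
tags = ["cat", "cat", "car", "car"]

# A new media object, projected into the same space, gets the tag of
# the majority of its nearest neighbors.
print(annotate_by_voting(np.array([0.12, 0.22]), labeled, tags))  # → cat
```

Because every media type is mapped to the same coordinates, this single voting routine annotates images, sounds, videos and 3D models alike, which is what makes the integrated (rather than pairwise) formulation attractive.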


Multimedia annotation · Automatic annotation · Common space learning



Special thanks go to the collaborators in the Lab for Media Search of the National University of Singapore for their instructive advice and useful suggestions on this work. This work is supported by the Natural Science Foundation of China (Nos. 61502094, 61402099, 61402016), the Natural Science Foundation of Heilongjiang Province of China (Nos. F2016002, F2015020) and the Beijing Natural Science Foundation (No. 4154067).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. School of Computer and Information Technology, Northeast Petroleum University, Daqing, China
  2. School of Computing, National University of Singapore, Singapore
  3. School of Computer Science, North China University of Technology, Beijing, China