Multiple kernel visual-auditory representation learning for retrieval

Multimedia Tools and Applications

Abstract

Cross-media data representation, which focuses on the semantic understanding of multimedia data across different modalities, is an increasingly active topic in web media data analysis. Its most challenging issues are how to find the underlying content-level correlations in the data and how to use those correlations in the representation model. Most traditional work on web media data analysis relies on single-modality data sources, such as Flickr images or YouTube videos, leaving cross-media data representation and semantic understanding wide open. In this paper, we propose a multiple kernel visual-auditory representation learning approach, which learns cross-media correlations from the visual and auditory feature spaces with multiple kernel strategies. We also define a cross-media distance measure for image-audio retrieval in the mutual subspace of co-occurrence. Experimental results on a collected image-audio database are encouraging and show that the approach is effective from multiple perspectives.
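The abstract does not spell out the formulation, so the following is only a minimal sketch of the general idea, under stated assumptions: each modality gets a convex combination of base kernels (the weights are fixed by hand here, not learned), the two combined kernels are correlated with a commonly used regularized kernel CCA, and retrieval uses a normalized Euclidean distance in the shared subspace. The function names, the synthetic "image" and "audio" features, and the parameter values are all hypothetical and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def combine_kernels(kernels, weights):
    """Convex combination of base kernels (assumption: fixed weights, not learned)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * K for wi, K in zip(w, kernels))

def center_kernel(K):
    """Center a kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_cca(Kx, Ky, reg, dim):
    """Regularized kernel CCA: solves
    (Kx + reg*I)^-1 Ky (Ky + reg*I)^-1 Kx alpha = rho^2 alpha."""
    n = Kx.shape[0]
    I = np.eye(n)
    M = np.linalg.solve(Kx + reg * I, Ky) @ np.linalg.solve(Ky + reg * I, Kx)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)[:dim]          # largest correlations first
    rho = np.sqrt(np.clip(vals.real[order], 1e-12, 1.0))
    alpha = vecs[:, order].real
    beta = np.linalg.solve(Ky + reg * I, Kx @ alpha) / rho
    return alpha, beta, rho

def cross_media_distance(P, Q):
    """Pairwise distance between projected samples of two modalities.
    Columns are scaled to unit variance and rows to unit length first,
    so the result behaves like a cosine-style distance (an assumption)."""
    P = P / (P.std(axis=0) + 1e-12)
    Q = Q / (Q.std(axis=0) + 1e-12)
    P = P / (np.linalg.norm(P, axis=1, keepdims=True) + 1e-12)
    Q = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-12)
    return np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)

# Synthetic paired data: both "modalities" depend on a shared latent variable z.
rng = np.random.default_rng(0)
n = 40
z = rng.normal(size=(n, 1))
img = np.hstack([z, z ** 2]) + 0.05 * rng.normal(size=(n, 2))
aud = np.hstack([np.sin(z), z]) + 0.05 * rng.normal(size=(n, 2))

# One linear and one RBF base kernel per modality, equally weighted.
K_img = center_kernel(combine_kernels([img @ img.T, rbf_kernel(img, img, 0.5)], [0.5, 0.5]))
K_aud = center_kernel(combine_kernels([aud @ aud.T, rbf_kernel(aud, aud, 0.5)], [0.5, 0.5]))

alpha, beta, rho = kernel_cca(K_img, K_aud, reg=0.1, dim=2)
P_img = K_img @ alpha          # images projected into the shared subspace
P_aud = K_aud @ beta           # audio clips projected into the shared subspace

D = cross_media_distance(P_img, P_aud)
ranking = np.argsort(D[0])     # audio clips ranked for image query 0
```

In a real system the kernel weights would be learned rather than fixed, and unseen queries would need their kernel rows centered against the training set; both are left out to keep the sketch short.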

Acknowledgments

This research is supported by the National Natural Science Foundation of China (No. 61003127, No. 61373109, No. 61440016) and the China Scholarship Council (201508420248).

Author information

Corresponding author

Correspondence to Hong Zhang.

About this article

Cite this article

Zhang, H., Zhang, W., Liu, W. et al. Multiple kernel visual-auditory representation learning for retrieval. Multimed Tools Appl 75, 9169–9184 (2016). https://doi.org/10.1007/s11042-016-3294-5

  • DOI: https://doi.org/10.1007/s11042-016-3294-5
