Multiple kernel visual-auditory representation learning for retrieval

Zhang, Hong; Zhang, Wenping; Liu, Wenhe; Xu, Xin; Fan, Hehe

doi:10.1007/s11042-016-3294-5

Published: 23 February 2016

Volume 75, pages 9169–9184, (2016)

Multimedia Tools and Applications

Hong Zhang¹,², Wenping Zhang¹, Wenhe Liu³, Xin Xu¹ & Hehe Fan⁴

Abstract

Cross-media data representation, which focuses on understanding the semantics of multimedia data across different modalities, is an emerging topic in web media data analysis. Its most challenging issues are how to find the underlying content-level correlations in the data and how to use those correlations in the representation model. Most traditional work on web media data analysis relies on single-modality data sources, such as Flickr images or YouTube videos, leaving cross-media data representation and semantic understanding largely open. In this paper, we propose a multiple kernel visual-auditory representation learning approach that learns cross-media correlations from visual and auditory feature spaces using multiple kernel strategies. We also define a cross-media distance measure for image-audio retrieval in the mutual co-occurrence subspace. Experimental results on a collected image-audio database are encouraging and show that the approach is effective from multiple perspectives.
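
The abstract does not say which base kernels or combination rule the approach uses. As a minimal sketch of the general multiple-kernel idea only, the snippet below assumes Gaussian (RBF) base kernels combined with fixed convex weights. The function names, bandwidths, weights, and toy feature vectors are all hypothetical and are not taken from the paper. It covers a single feature space and does not attempt the cross-media correlation learning or the subspace distance measure.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def combined_kernel(x, y, gammas, weights):
    """Convex combination of RBF base kernels.

    Assumption: weights are fixed by hand here; a multiple kernel
    learning method would normally learn them from data.
    """
    if len(gammas) != len(weights):
        raise ValueError("one weight per base kernel is required")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * rbf_kernel(x, y, g) for g, w in zip(gammas, weights))


def gram_matrix(samples, gammas, weights):
    """Gram matrix of the combined kernel over a list of feature vectors."""
    return [[combined_kernel(a, b, gammas, weights) for b in samples]
            for a in samples]


# Hypothetical toy features standing in for visual or auditory descriptors.
features = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
K = gram_matrix(features, gammas=[0.5, 2.0], weights=[0.7, 0.3])
```

Because a non-negative weighted sum of valid kernels is itself a valid kernel, the resulting Gram matrix can be passed to any kernel method in place of a single-kernel matrix.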

Acknowledgments

This research is supported by the National Natural Science Foundation of China (No. 61003127, No. 61373109, No. 61440016) and the China Scholarship Council (201508420248).

Author information

Authors and Affiliations

College of Computer Science & Technology, Wuhan University of Science & Technology, Wuhan, 430081, China
Hong Zhang, Wenping Zhang & Xin Xu
Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan, China
Hong Zhang
The Centre for Quantum Computation & Intelligent Systems, the University of Technology, Sydney (UTS), Sydney, Australia
Wenhe Liu
Baidu, Beijing, China
Hehe Fan

Corresponding author

Correspondence to Hong Zhang.

About this article

Cite this article

Zhang, H., Zhang, W., Liu, W. et al. Multiple kernel visual-auditory representation learning for retrieval. Multimed Tools Appl 75, 9169–9184 (2016). https://doi.org/10.1007/s11042-016-3294-5

Received: 04 October 2015
Revised: 03 December 2015
Accepted: 21 January 2016
Published: 23 February 2016
Issue Date: August 2016
DOI: https://doi.org/10.1007/s11042-016-3294-5
