Binary feature representation learning for scene retrieval in micro-video

  • Jie Guo
  • Xiushan Nie
  • Muwei Jian
  • Yilong Yin


Micro-video has become a popular form of social media, and scene retrieval is a useful application for it. At present, little research focuses on scene retrieval in micro-video, and a large gap remains between scene features and semantics. To extract features with better semantics, we propose a combinational fusion method that combines a multi-layer neural network with a supervised hash learning method. The multi-layer neural network fuses multiple modalities through a nonlinear transformation, and the supervised hash learning method then maps the fused feature through a linear projection to a binary code that preserves semantics and similarity. We evaluate the proposed method on a real-world micro-video dataset crawled from Vine. The experimental results show that it outperforms both standalone multi-modal fusion methods and standalone hash learning methods.
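
The two-stage pipeline described in the abstract (nonlinear fusion, then linear projection to a binary code) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the three modalities (visual, acoustic, textual) and their dimensions, the tanh activation, the layer sizes, the 16-bit code length, and all function names. The weights are random rather than learned, so the sketch shows only the shape of the computation, not the authors' supervised training procedure.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Random Gaussian weights; a stand-in for parameters that would be learned."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def fuse(modalities, weights, biases):
    """Concatenate modality features and pass them through tanh layers (nonlinear fusion)."""
    x = [v for feat in modalities for v in feat]
    for W, b in zip(weights, biases):
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
             for row, bi in zip(W, b)]
    return x


def to_binary_code(fused, P):
    """Linear projection followed by sign thresholding yields a +1/-1 code."""
    return [1 if sum(p * f for p, f in zip(row, fused)) >= 0 else -1 for row in P]


def hamming(a, b):
    """Number of positions in which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
DIMS = (8, 4, 4)          # assumed visual, acoustic, textual feature sizes
HIDDEN, BITS = 6, 16      # assumed hidden width and code length

weights = [random_matrix(HIDDEN, sum(DIMS), rng), random_matrix(HIDDEN, HIDDEN, rng)]
biases = [[0.0] * HIDDEN, [0.0] * HIDDEN]
P = random_matrix(BITS, HIDDEN, rng)   # hashing projection (untrained here)


def encode(modalities):
    return to_binary_code(fuse(modalities, weights, biases), P)


# Toy database of five "micro-videos", each with three modality vectors.
videos = [[[rng.gauss(0, 1) for _ in range(d)] for d in DIMS] for _ in range(5)]
codes = [encode(v) for v in videos]

# Retrieval: rank database items by Hamming distance to the query code.
query = codes[0]
ranking = sorted(range(len(codes)), key=lambda i: hamming(query, codes[i]))
```

In a supervised setting, the fusion weights and the projection `P` would be fitted using scene labels so that videos of the same scene receive nearby codes; the Hamming-distance ranking step would stay the same.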


Keywords: Scene retrieval; Micro-video; Multi-layer neural network; Supervised hash learning



This work is supported by the National Natural Science Foundation of China (61671274, 61573219, 61876098), China Postdoctoral Science Foundation (2016M592190), Shandong Provincial Key Research and Development Plan (2017CXGC1504), Shandong Provincial High College Science and Technology Plan (J17KB161) and the Fostering Project of Dominant Discipline and Talent Team of Shandong Province Higher Education Institutions.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Computer Science and Technology, Shandong University, Jinan, China
  2. Shandong University of Finance and Economics, Jinan, China
  3. School of Software, Shandong University, Jinan, China
