
User-centric multimodal feature extraction for personalized retrieval of tumblr posts

Multimedia Tools and Applications


Tumblr is one of the most popular microblogging services worldwide, on which users share posts consisting of text and images. This paper proposes a user-centric method of multimodal feature extraction for the personalized retrieval of Tumblr posts. To implement personalized retrieval, we formulate each user’s preferences as a triplet loss by using Likes as metadata together with the text- and image-related features of posts. Furthermore, we develop a personalized multimodal variational autoencoder (PMVAE) by introducing the triplet loss into the multimodal variational autoencoder (MVAE), which is among the most effective methods of multimodal feature extraction. Previously proposed variants of MVAE can project multiple kinds of features into a single latent feature space. However, because these latent features do not reflect each user’s preferences, the retrieval performance of the previous methods is limited. In contrast, our PMVAE can extract relationships between the text- and image-related features of posts by considering class-related information that represents whether a user prefers a given post. As a result, we obtain user-centric multimodal features that separate, in the latent feature space, posts that a user prefers from posts that the user does not prefer. Because user-centric multimodal features have high discriminating power, the personalized retrieval of posts desired by each user becomes feasible by using them in retrieval algorithms such as k-nearest neighbors (k-NN) and Annoy, a technique for approximate nearest neighbor search. We conduct experiments using 10 users and 150,947 content items to verify the performance of k-NN and Annoy. The results show that our PMVAE achieves a higher normalized discounted cumulative gain (nDCG) than existing methods. The nDCG reaches 0.253 when using term frequency-inverse document frequency (TF-IDF) based text features and our end-to-end image features.
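The three ingredients named in the abstract (a triplet loss over liked and non-liked posts, nearest-neighbor retrieval in the latent space, and nDCG as the evaluation measure) can be illustrated with a minimal Python sketch. This is not the authors' PMVAE: the function names, the squared Euclidean distance, the margin of 1.0, the binary relevance labels, and the toy 2-D latent vectors are all assumptions made for illustration, and exact k-NN stands in for Annoy.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form): zero once the liked post is
    closer to the anchor than the non-liked post by at least `margin`."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

def knn_rank(query, candidates):
    """Return candidate indices sorted from nearest to farthest from the query."""
    return sorted(range(len(candidates)),
                  key=lambda i: squared_distance(query, candidates[i]))

def ndcg_at_k(relevances, k):
    """nDCG@k for a ranked list of relevance labels, using a log2 discount."""
    def dcg(rels):
        return sum(r / math.log2(rank + 2) for rank, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Toy 2-D latent features, made up for illustration only.
query = [0.0, 0.0]  # latent feature of a post the user liked
posts = [[0.1, 0.0], [2.0, 2.0], [0.0, 0.3], [1.5, -1.0]]
liked = [1, 0, 1, 0]  # 1 = the user liked the post, 0 = did not

ranking = knn_rank(query, posts)               # nearest posts first
ranked_relevance = [liked[i] for i in ranking]
loss = triplet_loss(query, posts[0], posts[1])  # well-separated triplet
score = ndcg_at_k(ranked_relevance, k=4)
```

In this toy setting the liked posts already lie closest to the query, so the triplet loss for the chosen triplet is zero and the ranking is ideal. Swapping the positive and negative arguments gives a large loss, which is the signal a training procedure would use to pull liked posts together and push non-liked posts away.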


We thank Saad Anis, PhD, from Edanz Group for editing a draft of this manuscript. This work was partly supported by JSPS KAKENHI Grant Number JP21K17861 and MIC/SCOPE #181601001.

Author information

Corresponding author

Correspondence to Kazuma Ohtomo.


About this article

Cite this article

Ohtomo, K., Harakawa, R., Ogawa, T. et al. User-centric multimodal feature extraction for personalized retrieval of tumblr posts. Multimed Tools Appl 81, 2979–3003 (2022).
