Exploring demographic information in social media for product recommendation

  • Regular Paper
  • Knowledge and Information Systems


On many e-commerce Web sites, product recommendation is essential for improving user experience and boosting sales. Most existing product recommender systems rely on consumers’ historical transaction records or Web-site browsing history to predict online users’ preferences. As such, they are constrained by the limited information available on specific e-commerce Web sites. With the prolific use of social media platforms, it has become possible to extract product demographics from online product reviews and from social networks built from microblogs. Moreover, users’ public profiles on social media often reveal demographic attributes such as age, gender, and education. In this paper, we propose to leverage the demographic information of both products and users extracted from social media for product recommendation. Specifically, we frame recommendation as a learning-to-rank problem that takes as input features derived from both product and user demographics. An ensemble method based on gradient-boosting regression trees is extended to make it suitable for our recommendation task. We have conducted extensive experiments to obtain both quantitative and qualitative evaluation results, as well as a user study to gauge the performance of our proposed recommender system in a real-world deployment. All the results show that our system generates recommendations that match users’ preferences better than competitive baselines.

  7. Given an attribute, we collect all the unique values filled in by users in our data collection and keep only the values with high population. We then manually group similar values. Furthermore, we discretize attribute values based on the customer segmentation [11] (chapter five) used in marketing, and we ensure balanced distribution probabilities over the different values across the discretization intervals.

  8. This will make \(\phi ^{(u,a)}\) no longer a valid probability distribution. But as will be shown later, it does not affect the construction of demographic feature vectors.

  9. For example, we can sum the corresponding demographic-based probabilities for each attribute: user 1 is assigned a value of 2.52, computed as \(1\times 1 + 0.9 \times 1 + 0.7\times 0.8 + 0.3\times 0.2\), while user 2 is similarly assigned a value of 1.44.

  10. We distinguish normal users from spam users using the following three conditions: (1) a normal user should have a balanced number of tweets and retweets; (2) a normal user should not include any keywords relating to products or brands in her nickname or profile description; (3) a normal user should not publish many tweets containing keywords relating to products or brands.

  11. To be more specific, the values of y must be given during training, whereas at test time we obtain the values of y from the output predicted by the learnt ranking function f. An item with a larger value of y is ranked in a higher position, i.e., it is considered more important for recommendation.

  12. On Sina Weibo, all the tweets from a user can be publicly seen by other registered users. The judges log into their own Weibo accounts and check the validity of each candidate query–product pair online. Each user’s public profile is also checked, and spam users are removed. The workload for each judge is about 5–7 times the number of qd pairs in Table 2, i.e., only 1/7–1/5 of the originally detected qd pairs are finally kept as training data.

  13. RankLib might assign equal scores to items during ranking. In this case, we further sort the items with equal scores by their sales volume.

  14. For the listwise approach, each training instance is an ordered list. However, the relative order between non-relevant products cannot be obtained from our training data.

  15. The balanced interleaving method reflects the intuition that the results of the two rankings A and B should be interleaved into a single ranking I in a balanced way. This ensures that the top k results in I always contain the top \(k_a\) results from A and the top \(k_b\) results from B, where \(k_a\) and \(k_b\) differ by at most 1.
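
As a rough illustration of note 15, the sketch below follows the commonly described balanced interleaving procedure (see, e.g., [33]). It is not the evaluation code used in the paper: the function name `balanced_interleave`, the `a_first` flag (which would normally be drawn at random for each query), and the toy product IDs are all assumptions made for this example.

```python
from typing import List


def balanced_interleave(a: List[str], b: List[str], a_first: bool) -> List[str]:
    """Merge rankings a and b so that every prefix of the result covers
    the top k_a items of a and the top k_b items of b with |k_a - k_b| <= 1."""
    interleaved: List[str] = []
    seen = set()
    ka = kb = 0  # number of items already consumed from a and from b
    while ka < len(a) and kb < len(b):
        # Take from a when it lags behind b, or on a tie when a has priority.
        if ka < kb or (ka == kb and a_first):
            item = a[ka]
            ka += 1
        else:
            item = b[kb]
            kb += 1
        # An item contributed by both rankings is shown only once.
        if item not in seen:
            seen.add(item)
            interleaved.append(item)
    return interleaved


# Hypothetical rankings produced by two recommenders for the same user.
ranking_a = ["a", "b", "c", "d"]
ranking_b = ["b", "e", "a", "f"]

print(balanced_interleave(ranking_a, ranking_b, a_first=True))
# ['a', 'b', 'e', 'c', 'd']
print(balanced_interleave(ranking_a, ranking_b, a_first=False))
# ['b', 'a', 'e', 'c', 'f']
```

In the first call, the top three interleaved items (a, b, e) contain the top two items of both input rankings, which is the balance property the note describes.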

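Notes 11 and 13 together describe how the final order is produced: items are ranked by the predicted value of y, and items with equal scores are further sorted by sales volume. The sketch below shows one way to express that ordering. It does not call RankLib; the `Candidate` type, its field names, and the values are hypothetical, and it assumes ties are broken in favor of the higher sales volume, which the notes do not state explicitly.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    product_id: str
    score: float       # predicted y from a learnt ranking function (made-up values here)
    sales_volume: int  # used only to break ties (made-up values here)


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Order by descending score; break ties by descending sales volume (assumed direction)."""
    return sorted(candidates, key=lambda c: (-c.score, -c.sales_volume))


candidates = [
    Candidate("p1", 0.8, 10),
    Candidate("p2", 0.8, 50),
    Candidate("p3", 0.9, 1),
]
print([c.product_id for c in rank_candidates(candidates)])
# ['p3', 'p2', 'p1']
```

Here p3 is ranked first because of its higher score, and p2 precedes p1 only because the two scores are equal and p2 has the larger sales volume.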

  1. Wang J, Zhang Y (2013) Opportunity model for e-commerce recommendation: right product; right time. In: Ser. SIGIR ’13

  2. von Reischach F, Michahelles F, Schmidt A (2009) The design space of ubiquitous product recommendation systems. In: Ser. MUM ’09

  3. Giering M (2008) Retail sales prediction and item recommendations using customer demographics at store level. SIGKDD Explor Newsl 10(2):84–89

  4. Xiao B, Benbasat I (2007) E-commerce product recommendation agents: use, characteristics, and impact. MIS Q 31:137–209

  5. Linden G, Smith B, York J (2003) Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Comput 7(1):76–80

  6. Hollerit B, Kröll M, Strohmaier M (2013) Towards linking buyers and sellers: detecting commercial intent on Twitter. In: Ser. WWW ’13 companion

  7. Zhao X-W, Guo Y, He Y, Jiang H, Wu Y, Li X (2014) We know what you want to buy: a demographic-based system for product recommendation on microblogs. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, ser. KDD ’14, 2014, pp 1935–1944

  8. Baker M, Hart S (2007) The marketing book, 6th edn. Routledge, London

  9. Sridhar G (2007) Consumer involvement in product choice–a demographic analysis. XIMB J Manag 3:131–148

  10. Zeithaml VA (1985) The new demographics and market fragmentation. J Mark 49:64–75

  11. Tsiptsis K, Chorianopoulos A (2010) Data mining techniques in CRM: inside customer segmentation. Wiley, London

  12. Dong Y, Yang Y, Tang J, Yang Y, Chawla N-V (2014) Inferring user demographics and social strategies in mobile social networks. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, ser. KDD ’14, 2014, pp 15–24

  13. Mislove A, Viswanath B, Gummadi K-P, Druschel P (2010) You are who you know: inferring user profiles in online social networks. In: Ser. WSDM ’10

  14. Bi B, Shokouhi M, Kosinski M, Graepel T (2013) Inferring the demographics of search users: social data meets search queries. In: Ser. WWW ’13

  15. Zou B, Zhou G, Zhu Q (2014) Negation focus identification with contextual discourse information. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (vol 1: long papers). Association for Computational Linguistics, Baltimore, Maryland, pp 522–530

  16. (2012) US demographic and business summary data. Product guide

  17. Zhai C, Lafferty JD (2004) A study of smoothing methods for language models applied to information retrieval. ACM Trans Inf Syst 22(2):179–214

  18. Pang B, Lee L (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Ser. ACL ’04

  19. Liu T-Y (2009) Learning to rank for information retrieval. Found Trends Inf Retr 3(3):225–331

  20. Turney P-D (2002) Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the 40th annual meeting on Association for Computational Linguistics, ser. ACL ’02, 2002, pp 417–424

  21. Ganjisaffar Y, Caruana R, Lopes C-V (2011) Bagging gradient-boosted trees for high precision, low variance ranking models. In: Proceedings of the 34th international ACM SIGIR conference on research and development in information retrieval, ser. SIGIR ’11, 2011, pp 85–94

  22. Zhang H, Riedl E, Petrushin V-A, Pal S, Spoelstra J (2012) Committee based prediction system for recommendation: KDD cup 2011, track2. In: Proceedings of KDD cup 2011 competition, San Diego, CA, USA, 2011, pp 215–229

  23. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth and Brooks, Monterey

  24. Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38(4):367–378

  25. Friedman JH (2000) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232

  26. Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  27. Ho TK, Hull JJ, Srihari SN (1994) Decision combination in multiple classifier systems. IEEE Trans Pattern Anal Mach Intell 16(1):66–75

  28. Joachims T (2006) Training linear SVMs in linear time. In: Ser. KDD ’06

  29. Freund Y, Iyer R, Schapire R-E, Singer Y (2003) An efficient boosting algorithm for combining preferences. J Mach Learn Res 4:933–969

  30. Cao Z, Qin T, Liu T-Y, Tsai M-F, Li H (2007) Learning to rank: from pairwise approach to listwise approach. In: Ser. ICML ’07

  31. Xu J, Li H (2007) AdaRank: a boosting algorithm for information retrieval. In: Ser. SIGIR ’07

  32. Weng J, Lim E-P, Jiang J, He Q (2010) TwitterRank: finding topic-sensitive influential twitterers. In: WSDM

  33. Chapelle O, Joachims T, Radlinski F, Yue Y (2012) Large-scale validation and analysis of interleaved search evaluation. ACM Trans Inf Syst 30(1):6:1–6:41

  34. Sarwar B, Karypis G, Konstan J, Riedl J (2001) Item-based collaborative filtering recommendation algorithms. In: Ser. WWW ’01

  35. Adomavicius G, Tuzhilin A (2005) Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans Knowl Data Eng 17(6):734–749

  36. Symeonidis P, Tiakas E, Manolopoulos Y (2011) Product recommendation and rating prediction based on multi-modal social networks. In: Ser. RecSys ’11

  37. Korfiatis N, Poulos M (2013) Using online consumer reviews as a source for demographic recommendations: a case study using online travel reviews. Expert Syst Appl 40(14):5507–5515

  38. Qiu L, Benbasat I (2010) A study of demographic embodiments of product recommendation agents in electronic commerce. Int J Hum Comput Stud 68(10):669–688

  39. Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135

  40. Liu Y, Huang J, An A, Yu X (2007) ARSA: a sentiment-aware model for predicting sales performance using blogs. In: SIGIR

  41. McGlohon M, Glance NS, Reiter Z (2010) Star quality: aggregating reviews to rank products and merchants. In: ICWSM

  42. Ganu G, Kakodkar Y, Marian A (2013) Improving the quality of predictions using textual information in online user reviews. Inf Syst 38(1):1–15

  43. Zhang Y, Lai G, Zhang M, Zhang Y, Liu Y, Ma S (2014) Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In: SIGIR

  44. Zhang Y, Zhang H, Zhang M, Liu Y, Ma S (2014) Do users rate or review? Boost phrase-level sentiment labeling with review-level sentiment classification. In: SIGIR

  45. Pazzani M-J (1999) A framework for collaborative, content-based and demographic filtering. Artif Intell Rev 13(5–6):393–408

  46. Seroussi Y, Bohnert F, Zukerman I (2011) Personalised rating prediction for new users using latent factor models. In: ACM HH

  47. Dai HK, Zhao L, Nie Z, Wen J-R, Wang L, Li Y (2006) Detecting online commercial intention (OCI). In: WWW ’06


The authors thank the anonymous reviewers for their valuable and constructive comments. The work was partially supported by the National Natural Science Foundation of China under Grant Nos. 61502502 and 61573026, the pilot project under the Baidu open cloud service platform under Grant No. 4333150064, and the National Key Basic Research Program (973 Program) of China under Grant No. 2014CB340403. Xin Zhao was also partially supported by the 2015 HTC Young Scholar Program.

Author information

Corresponding author

Correspondence to Wayne Xin Zhao.

About this article

Cite this article

Zhao, W.X., Li, S., He, Y. et al. Exploring demographic information in social media for product recommendation. Knowl Inf Syst 49, 61–89 (2016).
