Data Mining and Knowledge Discovery, Volume 31, Issue 2, pp 502–547

Modeling user interests from web browsing activities

  • Fabio Gasparetti
Article

Abstract

Browsing sessions are rich in elements useful for building profiles of user interests, but HTML pages also include noisy data such as advertisements, navigation menus and privacy notes. Moreover, some pages cover several different topics, making it difficult to identify the ones most relevant to the user. For these reasons, they are often ignored by personalized search and recommender systems. We propose a novel approach for recognizing valuable text descriptions of current user information needs, namely cues, based on data mined from browsing interactions over the web. The approach combines page clustering techniques based on Document Object Model (DOM) representations to acquire evidence about relevant correlations between text contents. This evidence is exploited to better filter out irrelevant information and to facilitate the construction of interest profiles. A comparative framework demonstrates the accuracy of the extracted cues in the personalized search task, where results are re-ranked according to the last browsed resources.
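The re-ranking step mentioned in the abstract can be illustrated with a minimal sketch. It is not the method evaluated in the article: it assumes cues are already available as short text strings, and it swaps in a plain bag-of-words cosine similarity as the scoring function. The names `tokenize`, `cosine` and `rerank`, and the sample cues and results, are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; a simple stand-in for real text processing."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(results, cues):
    """Order result snippets by similarity to the cue profile.

    Assumption: the profile is just the summed term frequencies of all cues.
    Ties keep their original order because Python's sort is stable.
    """
    profile = Counter(t for cue in cues for t in tokenize(cue))
    return sorted(
        results,
        key=lambda r: cosine(Counter(tokenize(r)), profile),
        reverse=True,
    )


# Hypothetical cues extracted from the most recent browsing session.
cues = ["jaguar big cat habitat", "rainforest wildlife conservation"]
# Hypothetical result snippets in their original search-engine order.
results = [
    "Jaguar cars: new models and prices",
    "Jaguar habitat in the Amazon rainforest",
    "Stock market news today",
]
for snippet in rerank(results, cues):
    print(snippet)
```

In this toy run the wildlife snippet shares three terms with the cue profile, so it moves ahead of the car snippet, which shares only one. A real system would need the noise filtering the abstract describes before any such scoring is meaningful.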

Keywords

Information needs, User modeling, Clustering, Web browsing


Copyright information

© The Author(s) 2016

Authors and Affiliations

  1. Roma Tre University, Rome, Italy
