Understanding User Behavior Through Log Data and Analysis

  • Susan Dumais
  • Robin Jeffries
  • Daniel M. Russell
  • Diane Tang
  • Jaime Teevan


HCI researchers are increasingly collecting rich behavioral traces of user interactions with online systems in situ at a scale not previously possible. These logs can be used to characterize user interactions with existing systems and compare different designs. Large-scale log studies give rise to new challenges in experimental design, data collection and interpretation, and ethics. The chapter discusses how to address these challenges using search engine logs, but the methods are applicable to other types of log data.





Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Susan Dumais (1, corresponding author)
  • Robin Jeffries (2)
  • Daniel M. Russell (2)
  • Diane Tang (2)
  • Jaime Teevan (1)

  1. Microsoft Research, One Microsoft Way, Redmond, USA
  2. Google, Inc., Mountain View, USA
