Information Retrieval, Volume 14, Issue 3, pp 257–289

Query reformulation mining: models, patterns, and applications

  • Paolo Boldi
  • Francesco Bonchi
  • Carlos Castillo
  • Sebastiano Vigna
Web Mining for Search

Abstract

Understanding query reformulation patterns is a key step towards next-generation web search engines: it makes it possible to build systems that understand and possibly predict user intent, provide the needed assistance at the right time, and thus help users locate information more effectively and improve their web-search experience. As a step in this direction, we build a model for classifying user query reformulations into broad classes (generalization, specialization, error correction or parallel move) that achieves 92% accuracy. We then apply the model to automatically label two very large query logs sampled from different geographic areas, containing a total of approximately 17 million query reformulations. We study the resulting reformulation patterns, confirming some results from previous studies performed on smaller, manually annotated datasets, and discovering new reformulation patterns, including connections between reformulation types and topical categories. We annotate two large query-flow graphs with reformulation-type information and run several graph-characterization experiments on them, extracting new insights about the relationships between the different query reformulation types. Finally, we study query recommendations based on short random walks on the query-flow graphs. Our experiments show that these methods can match, and often improve on, the precision of recommendations based on query-click graphs, without requiring users’ clicks. They also show that taking the transition-type labels on edges into account is important for obtaining good-quality recommendations.

Keywords

Query log mining, Query flow graph, Session segmentation, Query recommendation

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Paolo Boldi (1)
  • Francesco Bonchi (2)
  • Carlos Castillo (2)
  • Sebastiano Vigna (1)

  1. DSI, Università degli studi di Milano, Milan, Italy
  2. Yahoo! Research, Barcelona, Spain
