Automatic Search Engine Performance Evaluation with the Wisdom of Crowds

  • Rongwei Cen
  • Yiqun Liu
  • Min Zhang
  • Liyun Ru
  • Shaoping Ma
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5839)


Relevance evaluation is an important topic in Web search engine research. Traditional evaluation methods rely on a large amount of human effort, which makes the process extremely time-consuming in practice. Based on an analysis of large-scale user query logs and click-through data, we propose a performance evaluation method that fully automatically generates large-scale Web search topics and answer sets under the Cranfield framework. These query-to-answer pairs are used directly in relevance evaluation with several widely adopted precision/recall-related retrieval performance metrics. Beyond single search engine log analysis, we propose user behavior models over multiple search engines' click-through logs to reduce potential bias among different search engines. Experimental results show that the evaluation results are similar to those obtained by traditional human annotation, and that our method avoids the bias and subjectivity of manual expert judgments.
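The general idea of turning click-through logs into query-to-answer pairs and scoring a ranked result list against them can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not give the selection rule, so the click-share rule, the thresholds `min_clicks` and `min_share`, the function names, and the sample log are all assumptions chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_answer_sets(click_log, min_clicks=3, min_share=0.5):
    """Derive {query: set_of_answer_urls} from (query, clicked_url) pairs.

    Assumed rule (hypothetical, not from the paper): a URL counts as an
    answer when it receives at least `min_share` of a query's clicks, and
    queries with fewer than `min_clicks` clicks in total are skipped.
    """
    clicks = defaultdict(Counter)
    for query, url in click_log:
        clicks[query][url] += 1

    answers = {}
    for query, counter in clicks.items():
        total = sum(counter.values())
        if total < min_clicks:
            continue  # too little evidence for this query
        chosen = {url for url, n in counter.items() if n / total >= min_share}
        if chosen:
            answers[query] = chosen
    return answers


def reciprocal_rank(ranked_urls, answer_set):
    """Return 1/rank of the first answer in the ranking, or 0.0 if none."""
    for rank, url in enumerate(ranked_urls, start=1):
        if url in answer_set:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked_urls, answer_set, k=10):
    """Return the fraction of the top-k results that are in the answer set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for url in ranked_urls[:k] if url in answer_set)
    return hits / k


# Hypothetical click log: (query, clicked URL) pairs.
sample_log = [
    ("tsinghua", "a.example"),
    ("tsinghua", "a.example"),
    ("tsinghua", "b.example"),
    ("rare query", "c.example"),
]
sample_answers = build_answer_sets(sample_log)
sample_ranking = ["b.example", "a.example"]
sample_rr = reciprocal_rank(sample_ranking, sample_answers["tsinghua"])
sample_p2 = precision_at_k(sample_ranking, sample_answers["tsinghua"], k=2)
```

Averaging such per-query scores over all generated topics would give one figure per search engine. How answers are actually selected, and how bias across engines is reduced, is described in the paper itself rather than in this sketch.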


Keywords: performance evaluation; click-through data analysis; Web search engine; the wisdom of crowds





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Rongwei Cen (1)
  • Yiqun Liu (1)
  • Min Zhang (1)
  • Liyun Ru (1)
  • Shaoping Ma (1)

  1. State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China
