Evaluating Web Search Result Summaries

  • Shao Fen Liang
  • Siobhan Devlin
  • John Tait
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3936)


The aim of our research is to produce and assess short summaries that aid users’ relevance judgements, for example on a search engine result page. In this paper we present a new metric for measuring summary quality based on representativeness and judgeability, and compare the summary quality of our system with that of Google. We discuss the basis for constructing our evaluation methodology in contrast to previous relevant open evaluations, arguing that the elements which make up an evaluation methodology (the tasks, data and metrics) are interdependent, and that the way in which they are combined is critical to the effectiveness of the methodology. The paper discusses the relationship between these three factors as implemented in our own work, as well as in SUMMAC, MUC and DUC.


Keywords: Evaluation Methodology, Relevance Judgement, Text Summarisation, Summary Quality, Judgeability, Task


References


  1. Afantenos, S., Karkaletsis, V., Stamatopoulos, P.: Summarization from medical documents: a survey. Artificial Intelligence in Medicine 33(2), 157–177 (2005)
  2. Berger, A., Mittal, V.O.: Query-relevant summarisation using FAQs. In: ACL, pp. 294–301 (2000)
  3. Borko, H., Bernier, C.L.: Abstracting Concepts and Methods. Academic Press, San Diego (1975)
  4. Chinchor, N., Hirschman, L., Lewis, D.D.: Evaluating message understanding systems: an analysis of the third Message Understanding Conference. Computational Linguistics 19(3), 409–449 (1993)
  5. Chinchor, N.: MUC-3 evaluation metrics. In: Proceedings of the Third Message Understanding Conference, pp. 17–24 (1991)
  6. Harman, D., Over, P.: The effects of human variation in DUC summarization evaluation. In: Proceedings of the ACL 2004 Workshop Text Summarization Branches Out, Barcelona, Spain, pp. 10–17 (July 2004)
  7. Liang, S.F., Devlin, S., Tait, J.: Poster: Using query term order for result summarisation. In: SIGIR 2005, Brazil, pp. 629–630 (2005)
  8. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Proceedings of the Workshop Text Summarization Branches Out, Barcelona, Spain, July 25–26 (2004)
  9. Mani, I., Firmin, T., Sundheim, B.: The TIPSTER SUMMAC text summarization evaluation. In: Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway, pp. 77–85 (1999)
  10. Mani, I.: Automatic Summarization. John Benjamins, Amsterdam (2001)
  11. Pagano, R.R.: Understanding Statistics in the Behavioural Sciences. Wadsworth/Thomson Learning (2001)
  12. Sparck Jones, K., Galliers, J.R.: Evaluating Natural Language Processing Systems: An Analysis and Review. Springer, New York (1996)
  13. TIPSTER Text Phase III 18-month workshop notes, Fairfax, VA (May 1998)
  14. Voorhees, E.M.: Variations in relevance judgements and the measurement of retrieval effectiveness. Information Processing & Management 36(5), 697–716 (2000)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Shao Fen Liang (1)
  • Siobhan Devlin (1)
  • John Tait (1)

  1. The University of Sunderland, School of Computing and Technology, Sunderland, UK
