
Evaluating Question Answering System Performance

  • Ellen M. Voorhees
Part of the Text, Speech and Language Technology book series (TLTB, volume 32)

The TREC question answering (QA) track was the first large-scale evaluation of open-domain question answering systems. In addition to successfully fostering research on the QA task, the track has been used to investigate appropriate evaluation methodologies for question answering systems. This chapter reviews the TREC QA track, emphasizing the issues associated with evaluating question answering systems.
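
The principal measure in the early TREC QA evaluations was mean reciprocal rank (one of the keywords below): each question contributes the reciprocal of the rank at which the first correct answer appears in the system's ranked answer list, or zero if no returned answer is correct. The following is a minimal sketch of that computation, not the official TREC scorer; the is_correct judgment function is a hypothetical stand-in for the human assessors who judged the returned answer strings.

```python
# Minimal sketch of mean reciprocal rank (MRR), assuming each system
# returns a short ranked list of candidate answers per question.

def reciprocal_rank(ranked_answers, is_correct):
    """Return 1/rank of the first correct answer, or 0.0 if none is correct."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if is_correct(answer):
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(per_question_answers, is_correct):
    """Average the per-question reciprocal ranks across the question set."""
    scores = [reciprocal_rank(answers, is_correct) for answers in per_question_answers]
    return sum(scores) / len(scores) if scores else 0.0

# Toy correctness judgment; in TREC, human assessors judged answer strings.
runs = [
    ["wrong", "right", "wrong"],  # first correct answer at rank 2 -> 0.5
    ["right"],                    # rank 1 -> 1.0
    ["wrong", "wrong"],           # no correct answer -> 0.0
]
print(mean_reciprocal_rank(runs, lambda a: a == "right"))  # 0.5
```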

Keywords

Question Answering · Test Collection · Relevance Judgment · Question Answering System · Reciprocal Rank

Copyright information

© Springer 2008

Authors and Affiliations

  • Ellen M. Voorhees
  1. National Institute of Standards and Technology, Gaithersburg, USA
