Evaluating Question Answering System Performance

Chapter in: Advances in Open Domain Question Answering

Part of the book series: Text, Speech and Language Technology (TLTB, volume 32)

The TREC question answering (QA) track was the first large-scale evaluation of open-domain question answering systems. In addition to successfully fostering research on the QA task, the track has also been used to investigate appropriate evaluation methodologies for question answering systems. This chapter reviews the TREC QA track, emphasizing the issues associated with evaluating question answering systems.




Copyright information

© 2008 Springer

Cite this chapter

Voorhees, E.M. (2008). Evaluating Question Answering System Performance. In: Strzalkowski, T., Harabagiu, S.M. (eds) Advances in Open Domain Question Answering. Text, Speech and Language Technology, vol 32. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-4746-6_13
