Evaluating Question Answering System Performance

Chapter in: Advances in Open Domain Question Answering

Part of the book series: Text, Speech and Language Technology (TLTB, volume 32)

The TREC question answering (QA) track was the first large-scale evaluation of open-domain question answering systems. In addition to successfully fostering research on the QA task, the track has also been used to investigate appropriate evaluation methodologies for question answering systems. This chapter reviews the TREC QA track, emphasizing the issues associated with evaluating question answering systems.




Copyright information

© 2008 Springer

Cite this chapter

Voorhees, E.M. (2008). Evaluating Question Answering System Performance. In: Strzalkowski, T., Harabagiu, S.M. (eds) Advances in Open Domain Question Answering. Text, Speech and Language Technology, vol 32. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-4746-6_13
