The TREC question answering (QA) track was the first large-scale evaluation of open-domain question answering systems. In addition to successfully fostering research on the QA task, the track has also been used to investigate appropriate evaluation methodologies for question answering systems. This chapter reviews the TREC QA track, emphasizing the issues associated with evaluating question answering systems.
© 2008 Springer
About this chapter
Cite this chapter
Voorhees, E.M. (2008). Evaluating Question Answering System Performance. In: Strzalkowski, T., Harabagiu, S.M. (eds) Advances in Open Domain Question Answering. Text, Speech and Language Technology, vol 32. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-4746-6_13
DOI: https://doi.org/10.1007/978-1-4020-4746-6_13
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-4744-2
Online ISBN: 978-1-4020-4746-6
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)