Evaluating Semantic Evaluations: How RTE Measures Up

  • Sam Bayer
  • John Burger
  • Lisa Ferro
  • John Henderson
  • Lynette Hirschman
  • Alex Yeh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3944)

Abstract

In this paper, we discuss paradigms for evaluating open-domain semantic interpretation as they apply to the PASCAL Recognizing Textual Entailment (RTE) evaluation (Dagan et al. 2005). We focus on three aspects critical to a successful evaluation: creation of large quantities of reasonably good training data, analysis of inter-annotator agreement, and joint analysis of test item difficulty and test-taker proficiency (Rasch analysis). We found that although RTE does not correspond to a “real” or naturally occurring language processing task, it nonetheless provides clear and simple metrics, a tolerable cost of corpus development, good annotator reliability (with the potential to exploit the remaining variability), and the possibility of finding noisy but plentiful training material.
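As background for the two measurement notions named in the abstract (standard formulations, not necessarily the exact variants used in the paper): inter-annotator agreement is commonly summarized with a chance-corrected coefficient such as Cohen's kappa, and Rasch analysis in its dichotomous form models the probability that test taker n answers item i correctly in terms of a proficiency parameter θ_n and an item difficulty parameter δ_i:

\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}

Here p_o is the observed agreement rate and p_e the agreement expected by chance; in the Rasch model, a correct response becomes more likely as proficiency exceeds item difficulty, which is what permits item difficulty and test-taker proficiency to be estimated jointly from the same response matrix.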

Keywords

Test Item · Question Answering · Statistical Machine Translation · Word Error Rate · Semantic Evaluation

References

  1. Aberdeen, J., Condon, S., Doran, C., Harper, L., Oshika, B., Phillips, J.: Evaluation of speech-to-speech translation systems (2005) (unpublished manuscript)
  2. Aberdeen, J., Hirschman, L., Walker, M.: Evaluation for DARPA Communicator spoken dialogue systems. In: Proceedings of the 2nd Conference on Language Resources and Evaluation (2000)
  3. Bayer, S., Burger, J., Ferro, L., Henderson, J., Yeh, A.: MITRE’s submissions to the EU Pascal RTE challenge. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)
  4. Bayer, S., Burger, J., Greiff, W., Wellner, B.: The MITRE logical form generation system. In: Proceedings of Senseval-3: The Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pp. 69–72 (2004)
  5. Bond, T.G., Fox, C.M.: Applying the Rasch Model: Fundamental Measurement in the Human Sciences. University of Toledo Press (2001)
  6. Bos, J., Markert, K.: Combining shallow and deep NLP methods for recognizing textual entailment. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)
  7. Brachman, R.: (AA)AI: More than the sum of its parts. AAAI Presidential Address, presented at AAAI 2005 (2005)
  8. Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation. Computational Linguistics 19 (1993)
  9. Burger, J., Ferro, L.: Generating an entailment corpus from news headlines. In: ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, MI (2005)
  10. Dagan, I., Glickman, O., Magnini, B.: The PASCAL recognizing textual entailment challenge. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)
  11. Damianos, L., Wohlever, S., Kozierok, R., Ponte, J.: MiTAP for real users, real data, real problems. In: Proceedings of the Conference on Human Factors of Computing Systems, Fort Lauderdale, FL (2003)
  12. Deshmukh, N., Duncan, R., Ganapathiraju, A., Picone, J.: Benchmarking human performance for continuous speech recognition. In: Proceedings of the Fourth International Conference on Spoken Language Processing, Philadelphia, Pennsylvania, USA, pp. 2486–2489 (1996)
  13. Dolan, B., Brockett, C., Quirk, C.: Microsoft Research paraphrase corpus (2005), http://research.microsoft.com/research/nlp/msr_paraphrase.htm
  14. Grishman, R., Sundheim, B.: Design of the MUC-6 evaluation. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD. NIST. Morgan Kaufmann, San Francisco (1995)
  15. Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge (1997)
  16. Henderson, J., Morgan, W.: Paris: an automated MT evaluation metric toolkit; and a survey of metric performance on the segment ranking task. Technical report, MITRE (2005) (to appear)
  17. Hirschman, L.: The evolution of evaluation: Lessons from the message understanding conferences. Computer Speech and Language 12, 281–305 (1998)
  18. Hirschman, L.: Language understanding evaluations: Lessons learned from MUC and ATIS. In: Proceedings of LREC 1998, Granada (1998)
  19. Hirschman, L., Bates, M., Dahl, D., Fisher, W.M., Garafolo, J., Pallet, D.S., Hunicke-Smith, K., Price, P., Rudnicky, A., Tzoukermann, E.: Multisite data collection and evaluation in spoken language understanding. In: Proceedings of the DARPA Workshop on Human Language Technology, Princeton, NJ, pp. 19–24 (1993)
  20. Hirschman, L., Light, M., Breck, E., Burger, J.D.: Deep Read: A reading comprehension system. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (1999)
  21. Hirschman, L., Yeh, A., Blaschke, C., Valencia, A.: Overview of BioCreAtIvE: Critical assessment of information extraction for biology. BMC Bioinformatics 6(suppl. 1) (2005)
  22. Joachims, T.: Learning to Classify Text Using Support Vector Machines. Kluwer, Dordrecht (2002)
  23. Lange, R., Moran, J., Greiff, W., Ferro, L.: A probabilistic Rasch analysis of question answering evaluations. In: Proceedings of HLT-NAACL 2004, pp. 65–72 (2004)
  24. Light, M., Mann, G.S., Riloff, E., Breck, E.: Analyses for elucidating current question answering technology. Natural Language Engineering 7, 325–342 (2001)
  25. Morgan, A., Hirschman, L., Colosimo, M., Yeh, A., Colombe, J.: Gene name identification and normalization using a model organism database. Journal of Biomedical Informatics 37, 396–410 (2004)
  26. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29 (2003)
  27. Papineni, K., Roukos, S., Ward, T., Henderson, J., Reeder, F.: Corpus-based comprehensive and diagnostic MT evaluation: Initial Arabic, Chinese, French, and Spanish results. In: Proceedings of the 2002 Conference on Human Language Technology, San Diego, CA, pp. 124–127 (2002)
  28. Sundheim, B.: Overview of results of the MUC-6 evaluation. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD. NIST. Morgan Kaufmann, San Francisco (1995)
  29. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. In: Proceedings of ICML 2000, 17th International Conference on Machine Learning (2000)
  30. Walker, M., Aberdeen, J., Boland, J., Bratt, E., Garofolo, J., Hirschman, L., Le, A., Lee, S., Narayanan, S., Papineni, K., Pellom, B., Polifroni, J., Potamianos, A., Prabhu, P., Rudnicky, A., Sanders, G., Seneff, S., Stallard, D., Whittaker, S.: DARPA Communicator dialog travel planning systems: The June 2000 data collection. In: Proceedings of Eurospeech 2001, Aalborg, Denmark (2001)
  31. Wellner, B., Ferro, L., Greiff, W., Hirschman, L.: Reading comprehension tests for computer-based understanding evaluation. Natural Language Engineering (2005) (to appear)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Sam Bayer 1
  • John Burger 1
  • Lisa Ferro 1
  • John Henderson 1
  • Lynette Hirschman 1
  • Alex Yeh 1

  1. The MITRE Corporation, Bedford, USA
