Evaluation of machine learning-based information extraction algorithms: criticisms and recommendations
We survey the evaluation methodology adopted in information extraction (IE), as defined in a few different efforts applying machine learning (ML) to IE. We identify a number of critical issues that hamper comparison of the results obtained by different researchers. Some of these issues are common to other NLP-related tasks: e.g., the difficulty of exactly identifying the effects on performance of the data (sample selection and sample size), of the domain theory (features selected), and of algorithm parameter settings. Some issues are specific to IE: how leniently to assess inexact identification of filler boundaries, the possibility of multiple fillers for a slot, and how the counting is performed. We argue that, when specifying an IE task, these issues should be explicitly addressed, and a number of methodological characteristics should be clearly defined. To empirically verify the practical impact of the issues mentioned above, we perform a survey of the results of different algorithms when applied to a few standard datasets. The survey shows a serious lack of consensus on these issues, which makes it difficult to draw firm conclusions on a comparative evaluation of the algorithms. Our aim is to elaborate a clear and detailed experimental methodology and propose it to the IE community. Widespread agreement on this proposal should lead to future IE comparative evaluations that are fair and reliable. To demonstrate the way the methodology is to be applied we have organized and run a comparative evaluation of ML-based IE systems (the Pascal Challenge on ML-based IE) where the principles described in this article are put into practice. In this article we describe the proposed methodology and its motivations. The Pascal evaluation is then described and its results presented.
Keywords: Evaluation methodology, Information extraction, Machine learning
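The abstract's point about how leniently inexact filler boundaries are assessed can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the scoring scheme proposed in the article or used by any existing scorer: fillers are represented as hypothetical `(start, end)` token spans, "lenient" is taken to mean any overlap with a gold span, and each gold span may be matched at most once. The function names `overlaps` and `score` are invented for this example.

```python
from typing import Callable, List, Tuple

# Assumption: a filler is a (start, end) token span, end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def score(predicted: List[Span], gold: List[Span],
          lenient: bool = False) -> Tuple[float, float, float]:
    """Compute (precision, recall, F1) for one slot.

    Exact mode counts a prediction as correct only if both boundaries
    match a gold span; lenient mode (an assumed definition) accepts
    any overlap. Each gold span can be matched only once.
    """
    match: Callable[[Span, Span], bool] = (
        overlaps if lenient else (lambda p, g: p == g)
    )
    unmatched = list(gold)
    true_positives = 0
    for p in predicted:
        for g in unmatched:
            if match(p, g):
                unmatched.remove(g)  # safe: we break right after removing
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: the second prediction overshoots the gold span by one token.
gold_spans = [(3, 5), (10, 12)]
predicted_spans = [(3, 5), (10, 13)]

exact_result = score(predicted_spans, gold_spans, lenient=False)
lenient_result = score(predicted_spans, gold_spans, lenient=True)
print("exact:  ", exact_result)
print("lenient:", lenient_result)
```

On this toy input the same system output scores 0.5 on every measure under exact matching and 1.0 under the assumed lenient criterion, which shows why results reported under different matching conventions cannot be compared directly.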
F. Ciravegna, C. Giuliano, N. Ireson, A. Lavelli and L. Romano were supported by the IST-Dot.Kom project (http://www.dot-kom.org), sponsored by the European Commission as part of the Framework V (grant IST-2001-34038). N. Kushmerick was supported by grant 101/F.01/C015 from Science Foundation Ireland and grant N00014-03-1-0274 from the US Office of Naval Research. We would like to thank Leon Peshkin for kindly providing us with his own corrected version of the Seminar Announcement collection and Scott Wen-Tau Yih for his own tagged version of the Job Posting collection. We would also like to thank Hai Long Chieu, Leon Peshkin, and Scott Wen-Tau Yih for answering our questions concerning the settings of their experiments. We are also indebted to the anonymous reviewers of this article for their valuable comments.