Abstract
In this paper, we present IdentiFinder™, a hidden Markov model that learns to recognize and classify names, dates, times, and numerical quantities. We have evaluated the model on English text (based on data from the Sixth and Seventh Message Understanding Conferences [MUC-6, MUC-7] and broadcast news), on Spanish text (based on data distributed through the First Multilingual Entity Task [MET-1]), and on speech input (based on broadcast news). To quantify performance on data available to the community, we report results here only on standard materials, namely MUC-6 and MET-1. Results have been consistently better than those reported for any other learning algorithm. IdentiFinder's performance is competitive with approaches based on handcrafted rules on mixed-case text and superior on text where case information is not available. We also present a controlled experiment showing the effect of training set size on performance, demonstrating that as little as 100,000 words of training data is adequate to reach performance around 90% on newswire. Although we present our understanding of why this algorithm performs so well on this class of problems, we believe that significant improvement in performance may still be possible.
References
Aberdeen, J., Burger, J., Day, D., Hirschman, L., Robinson, P., & Vilain, M. (1995). MITRE: Description of the Alembic system used for MUC-6. Proceedings of the Sixth Message Understanding Conference (MUC-6) (pp. 141–155). Columbia, Maryland: Morgan Kaufmann Publishers, Inc.
Appelt, D.E., Hobbs, J.R., Bear, J., Israel, D., Kameyama, M., Kehler, A., Martin, D., Myers, K., & Tyson, M. (1995). SRI International FASTUS system MUC-6 test results and analysis. Proceedings of the Sixth Message Understanding Conference (MUC-6) (pp. 237–248). Columbia, Maryland: Morgan Kaufmann Publishers, Inc.
Bennett, S.W., Aone, C., & Lovell, C. (1997). Learning to tag multilingual texts through observation. Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (pp. 109–116). Providence, Rhode Island: Morgan Kaufmann Publishers, Inc.
Borthwick, A., Sterling, J., Agichtein, E., & Grishman, R. (1998). Description of the MENE named entity system as used in MUC-7. Proceedings of the Seventh Message Understanding Conference (MUC-7). Fairfax, Virginia: Morgan Kaufmann Publishers, Inc.
Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4), 543–565.
Chinchor, N. (1995). Statistical significance of MUC-6 results. Proceedings of the Sixth Message Understanding Conference (MUC-6) (pp. 39–43). Columbia, Maryland: Morgan Kaufmann Publishers, Inc.
Chinchor, N. (1998). MUC-7 named entity task definition dry run version, version 3.5, 17 September 1997. Proceedings of the Seventh Message Understanding Conference (MUC-7) (to appear). Fairfax, Virginia: Morgan Kaufmann Publishers, Inc. URL: ftp://online.muc.saic.com/NE/training/guidelines/NE.task.def.3.5.ps.
Church, K. (1988). A stochastic parts program and noun phrase parser for unrestricted text. Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas.
Krupka, G. (1995). SRA: Description of the SRA system as used for MUC-6. Proceedings of the Sixth Message Understanding Conference (MUC-6) (pp. 221–235). Columbia, Maryland: Morgan Kaufmann Publishers, Inc.
Merchant, R., Okurowski, M., & Chinchor, N. (1996). The multilingual entity task overview. Proceedings of the Tipster Text Program Phase II (pp. 445–447). Vienna, Virginia: Morgan Kaufmann Publishers, Inc.
Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.
Sundheim, B., & Chinchor, N. (1995). Named entity task definition (version 2.1). Proceedings of the Sixth Message Understanding Conference (MUC-6) (pp. 319–332). Columbia, Maryland: Morgan Kaufmann Publishers, Inc.
Viterbi, A.J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13(2), 260–269.
Weischedel, R. (1995). BBN: Description of the PLUM system as used for MUC-6. Proceedings of the Sixth Message Understanding Conference (MUC-6) (pp. 55–69). Columbia, Maryland: Morgan Kaufmann Publishers, Inc.
Weischedel, R., Meteer, M., Schwartz, R., Ramshaw, L., & Palmucci, J. (1993). Coping with ambiguity and unknown words through probabilistic methods. Computational Linguistics, 19(2), 359–382.
Cite this article
Bikel, D.M., Schwartz, R. & Weischedel, R.M. An Algorithm that Learns What's in a Name. Machine Learning 34, 211–231 (1999). https://doi.org/10.1023/A:1007558221122
DOI: https://doi.org/10.1023/A:1007558221122