Abstract
It seems widely agreed that Information Extraction (IE) is now a tested language technology that has reached precision and recall values that put it in about the same position as Information Retrieval and Machine Translation, both of which are widely used commercially. There is also a clear range of practical applications that would be eased by the sort of template-style data that IE provides. The problem for wider deployment of the technology is adaptability: the ability to customize IE rapidly to new domains.
In this paper we discuss some methods that have been tried to ease this problem and to achieve something more rapid than the benchmark figure of one month, which was roughly what ARPA teams in IE needed to adapt an existing system by hand to a new domain of corpora and templates. An important distinction in discussing the issue is the degree to which a user can be assumed to know what is wanted and to have pre-existing templates ready to hand, as opposed to a user who has only a vague idea of what is needed from a corpus.
We shall discuss attempts to derive templates directly from corpora, and to derive knowledge structures and lexicons directly from corpora, including the recent LE project ECRAN, which attempted to tune existing lexicons to new corpora. An important issue is how far established methods in Information Retrieval for tuning to a user’s needs, with feedback at an interface, can be transferred to IE.
The authors are grateful for discussion and contributions from Hamish Cunningham, Robert Gaizauskas, Louise Guthrie and Evelyne Viegas. All errors are our own, of course.
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wilks, Y., Catizone, R. (1999). Can We Make Information Extraction More Adaptive? In: Pazienza, M.T. (ed.) Information Extraction. SCIE 1999. Lecture Notes in Computer Science, vol 1714. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48089-7_1
DOI: https://doi.org/10.1007/3-540-48089-7_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66625-7
Online ISBN: 978-3-540-48089-1
eBook Packages: Springer Book Archive