Abstract
Term extraction is an important problem in natural language processing. In this paper, we propose a language-independent, statistical, corpus-based term extraction algorithm. In previous approaches, evaluation has been subjective, at best relying on a lexicographer’s judgement. We evaluate our term extractor in two ways. First, we assess its predictiveness on an unseen corpus using perplexity. Second, we measure its precision and recall by comparing the Chinese words in a segmented corpus with the words extracted by our system.
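The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the add-one-smoothed unigram model, and the toy data are all assumptions made for illustration, since the abstract does not say which language model underlies the perplexity figure.

```python
import math
from collections import Counter


def precision_recall(extracted, gold):
    """Compare extracted terms against gold-standard words (both treated as sets)."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model on held-out tokens.

    Assumption: the vocabulary is the union of training and test tokens,
    a simplification that keeps every test token's probability non-zero.
    """
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = len(train_tokens) + len(vocab)
    log_prob = sum(math.log2((counts[t] + 1) / total) for t in test_tokens)
    return 2 ** (-log_prob / len(test_tokens))


# Hypothetical toy data: four extracted terms, three gold-standard words.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
ppl = unigram_perplexity(["a", "a", "b"], ["a", "b"])
```

Lower perplexity on the unseen corpus indicates that the tokenization induced by the extracted terms predicts new text better, which is what makes it usable as an objective measure alongside precision and recall.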
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pantel, P., Lin, D. (2001). A Statistical Corpus-Based Term Extractor. In: Stroulia, E., Matwin, S. (eds) Advances in Artificial Intelligence. Canadian AI 2001. Lecture Notes in Computer Science, vol 2056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45153-6_4
DOI: https://doi.org/10.1007/3-540-45153-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42144-3
Online ISBN: 978-3-540-45153-2