A Statistical Corpus-Based Term Extractor

Pantel, Patrick; Lin, Dekang

doi:10.1007/3-540-45153-6_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2056)

Included in the following conference series:

Conference of the Canadian Society for Computational Studies of Intelligence

Abstract

Term extraction is an important problem in natural language processing. In this paper, we propose a language-independent, statistical, corpus-based term extraction algorithm. In previous approaches, evaluation has been subjective, at best relying on a lexicographer’s judgement. We evaluate our term extractor in two ways. First, we assess its predictiveness on an unseen corpus using perplexity. Second, we measure its precision and recall by comparing the Chinese words in a segmented corpus with the words extracted by our system.
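
As a rough illustration of the two evaluation measures named in the abstract, the following minimal Python sketch computes precision and recall of a set of extracted words against a gold-standard word set, and the perplexity of a held-out sequence from per-token probabilities. The function names, the toy word sets, and the probabilities are assumptions made for illustration only; they are not taken from the paper, and the paper's own evaluation procedure may differ in its details.

```python
import math


def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted terms against a gold word set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def perplexity(token_probs):
    """Perplexity of a held-out sequence, given the model probability of each token."""
    log_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log_sum / len(token_probs))


# Toy data, invented for illustration (not from the paper).
gold_words = {"北京", "大学", "学生", "图书馆"}            # words in a segmented corpus
extracted_words = {"北京", "大学", "学生", "书馆", "的学"}  # hypothetical extractor output

p, r = precision_recall(extracted_words, gold_words)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.60 recall=0.75

# A model that assigns probability 0.25 to every held-out token has perplexity 4.
print(f"perplexity={perplexity([0.25, 0.25, 0.25, 0.25]):.1f}")  # perplexity=4.0
```

Lower perplexity on unseen text indicates a more predictive model, while precision and recall measure agreement with the reference segmentation.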

Author information

Authors and Affiliations

Department of Computing Science, University of Alberta, Edmonton, Alberta, T6G 2H1, Canada
Patrick Pantel & Dekang Lin

Editor information

Editors and Affiliations

Department of Computer Science, University of Alberta, Edmonton, AB, Canada, T6G 2E8
Eleni Stroulia
School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, Canada, K1N 6N5
Stan Matwin

About this paper

Cite this paper

Pantel, P., Lin, D. (2001). A Statistical Corpus-Based Term Extractor. In: Stroulia, E., Matwin, S. (eds) Advances in Artificial Intelligence. Canadian AI 2001. Lecture Notes in Computer Science, vol 2056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45153-6_4

DOI: https://doi.org/10.1007/3-540-45153-6_4
Published: 16 May 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42144-3
Online ISBN: 978-3-540-45153-2
eBook Packages: Springer Book Archive
