
A Statistical Corpus-Based Term Extractor

  • Conference paper
Advances in Artificial Intelligence (Canadian AI 2001)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2056)

Abstract

Term extraction is an important problem in natural language processing. In this paper, we propose a language-independent statistical corpus-based term extraction algorithm. In previous approaches, evaluation has been subjective, at best relying on a lexicographer’s judgement. We evaluate our term extractor in two ways. First, we assess its predictiveness on an unseen corpus using perplexity. Second, we measure its precision and recall by comparing the Chinese words in a segmented corpus with the words extracted by our system.
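
The two evaluation measures named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `precision_recall` and `unigram_perplexity` are hypothetical, terms are compared as plain sets, and the perplexity uses a unigram model with add-one smoothing chosen only for brevity.

```python
import math
from collections import Counter


def precision_recall(extracted, gold):
    """Precision and recall of extracted terms against a gold word set.

    Both arguments are iterables of strings; duplicates are ignored.
    """
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model on held-out tokens.

    Assumption: add-one smoothing over the combined vocabulary, used here
    only to keep the example short.
    """
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = len(train_tokens) + len(vocab)
    log_prob = sum(math.log2((counts[t] + 1) / total) for t in test_tokens)
    return 2 ** (-log_prob / len(test_tokens))


# Toy usage with made-up tokens.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
ppl = unigram_perplexity(["a", "a", "b", "b"], ["a", "b"])
```

Lower perplexity on an unseen corpus means the model predicts that corpus better, while precision and recall measure how closely the extracted items match the words of a hand-segmented corpus.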




Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pantel, P., Lin, D. (2001). A Statistical Corpus-Based Term Extractor. In: Stroulia, E., Matwin, S. (eds) Advances in Artificial Intelligence. Canadian AI 2001. Lecture Notes in Computer Science, vol 2056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45153-6_4


  • DOI: https://doi.org/10.1007/3-540-45153-6_4


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42144-3

  • Online ISBN: 978-3-540-45153-2

  • eBook Packages: Springer Book Archive
