Fextor: A Feature Extraction Framework for Natural Language Processing: A Case Study in Word Sense Disambiguation, Relation Recognition and Anaphora Resolution

  • Chapter
Computational Linguistics

Abstract

Feature extraction from text corpora is an important step in Natural Language Processing (NLP), especially for Machine Learning (ML) techniques. Many NLP tasks share common steps, e.g. the low-level act of reading a corpus and obtaining text windows from it. Some high-level processing steps might also be shared, e.g. testing for morpho-syntactic constraints between words. An integrated feature extraction framework removes wasteful redundancy and helps in rapid prototyping.

In this paper we present a flexible feature extraction framework called Fextor. We describe our assumptions about the feature extraction process and provide a general overview of the software architecture. This is accompanied by examples of applications in very different NLP tasks: word sense disambiguation, recognition of inter-chunk syntactic relations, recognition of semantic relations between named entities, and anaphora resolution.
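To make the shared low-level step concrete, the following is a minimal sketch of obtaining text windows from a tokenised sentence and turning each window into bag-of-words features. It is not Fextor's API: the names `Window`, `text_windows` and `window_features`, the window size, and the `ctx=` feature prefix are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Window:
    """A target token with its left and right context (hypothetical type)."""
    target: str
    left: List[str]
    right: List[str]


def text_windows(tokens: List[str], size: int) -> List[Window]:
    """Return one window per token, with up to `size` tokens on each side."""
    windows = []
    for i, token in enumerate(tokens):
        left = tokens[max(0, i - size):i]      # clipped at the sentence start
        right = tokens[i + 1:i + 1 + size]     # clipped at the sentence end
        windows.append(Window(target=token, left=left, right=right))
    return windows


def window_features(window: Window) -> Dict[str, int]:
    """Count lower-cased context words; the target itself is excluded."""
    features: Dict[str, int] = {}
    for word in window.left + window.right:
        key = "ctx=" + word.lower()
        features[key] = features.get(key, 0) + 1
    return features


if __name__ == "__main__":
    sentence = ["The", "bank", "of", "the", "river"]
    for w in text_windows(sentence, size=2):
        print(w.target, window_features(w))
```

Once windows are available as plain data like this, different tasks (for instance word sense disambiguation or anaphora resolution) could reuse the same windowing code and differ only in the feature functions applied to each window.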

This work was financed by the National Centre for Research and Development (NCBiR) project SP/I/1/77065/10.

Author information

Corresponding author

Correspondence to Bartosz Broda.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Broda, B., Kędzia, P., Marcińczuk, M., Radziszewski, A., Ramocki, R., Wardyński, A. (2013). Fextor: A Feature Extraction Framework for Natural Language Processing: A Case Study in Word Sense Disambiguation, Relation Recognition and Anaphora Resolution. In: Przepiórkowski, A., Piasecki, M., Jassem, K., Fuglewicz, P. (eds) Computational Linguistics. Studies in Computational Intelligence, vol 458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34399-5_3

  • DOI: https://doi.org/10.1007/978-3-642-34399-5_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34398-8

  • Online ISBN: 978-3-642-34399-5

  • eBook Packages: Engineering, Engineering (R0)