Fextor: A Feature Extraction Framework for Natural Language Processing: A Case Study in Word Sense Disambiguation, Relation Recognition and Anaphora Resolution

Broda, Bartosz; Kędzia, Paweł; Marcińczuk, Michał; Radziszewski, Adam; Ramocki, Radosław; Wardyński, Adam

doi:10.1007/978-3-642-34399-5_3

Part of the book series: Studies in Computational Intelligence (SCI, volume 458)

Abstract

Feature extraction from text corpora is an important step in Natural Language Processing (NLP), especially for Machine Learning (ML) techniques. Many NLP tasks share common steps, e.g. the low-level act of reading a corpus and obtaining text windows from it. Some high-level processing steps may also be shared, e.g. testing for morpho-syntactic constraints between words. An integrated feature extraction framework removes this wasteful redundancy and helps in rapid prototyping.

In this paper we present a flexible feature extraction framework called Fextor. We describe the assumptions about the feature extraction process and provide a general overview of the software architecture. This is accompanied by examples of applications in very different NLP tasks: word sense disambiguation, recognition of inter-chunk syntactic relations, recognition of semantic relations between named entities, and anaphora resolution.
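
As an illustration of the shared low-level step mentioned above (obtaining text windows from a corpus), a minimal Python sketch follows. It is not Fextor's API: the function names `text_windows` and `window_features`, the `<pad>` symbol, and the feature key format are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterator, List, Tuple

PAD = "<pad>"  # hypothetical padding symbol for positions outside the sentence


def text_windows(
    tokens: List[str], left: int, right: int
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (focus token, context window) pairs for every token position."""
    padded = [PAD] * left + tokens + [PAD] * right
    for i, focus in enumerate(tokens):
        # In `padded`, the focus token sits at index i + left,
        # so the slice below covers `left` tokens before and `right` after it.
        window = padded[i : i + left + right + 1]
        yield focus, window


def window_features(window: List[str], left: int) -> Dict[str, str]:
    """Map a window to positional features keyed by offset from the focus."""
    return {
        f"w[{j - left}]": tok.lower()
        for j, tok in enumerate(window)
        if j != left  # skip the focus token itself
    }


sentence = "Feature extraction is an important step".split()
examples = [
    (focus, window_features(win, 1)) for focus, win in text_windows(sentence, 1, 1)
]
```

Each element of `examples` pairs a focus token with its neighbouring lowercased word forms, which is the kind of representation a downstream classifier for any of the tasks listed above could consume.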

This work was financed by the National Centre for Research and Development (NCBiR) project SP/I/1/77065/10.


Author information

Authors and Affiliations

Institute of Informatics, Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, 50-370, Wrocław, Poland
Bartosz Broda, Paweł Kędzia, Michał Marcińczuk, Adam Radziszewski, Radosław Ramocki & Adam Wardyński

Corresponding author

Correspondence to Bartosz Broda.

Editor information

Editors and Affiliations

Institute of Computer Science, Polish Academy of Sciences, ul. Ordona 21, Warsaw, 01-237, Poland
Adam Przepiórkowski
Institute of Informatics, Wroclaw University of Technology, ul. Wybrzeże Wyspiańskiego 27, Wroclaw, 50-370, Poland
Maciej Piasecki
Faculty of Mathematics and Computer Science, Adam Mickiewicz University, ul. Umultowska 87, Poznań, 61-614, Poland
Krzysztof Jassem
TiP Sp. z o. o., Francuska 35/37, Katowice, 40-027, Poland
Piotr Fuglewicz

About this chapter

Cite this chapter

Broda, B., Kędzia, P., Marcińczuk, M., Radziszewski, A., Ramocki, R., Wardyński, A. (2013). Fextor: A Feature Extraction Framework for Natural Language Processing: A Case Study in Word Sense Disambiguation, Relation Recognition and Anaphora Resolution. In: Przepiórkowski, A., Piasecki, M., Jassem, K., Fuglewicz, P. (eds) Computational Linguistics. Studies in Computational Intelligence, vol 458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34399-5_3

DOI: https://doi.org/10.1007/978-3-642-34399-5_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34398-8
Online ISBN: 978-3-642-34399-5
eBook Packages: Engineering, Engineering (R0)