Large-Scale Experiments with NP Chunking of Polish

Radziszewski, Adam; Pawlaczek, Adam

doi:10.1007/978-3-642-32790-2_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7499)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

Published experiments with shallow parsing for Slavic languages have been characterised by the small size of the corpora used. The publication of the National Corpus of Polish (NCP) opened a new opportunity: to test several chunking algorithms on the 1-million-token manually annotated subcorpus of the NCP. We test three Machine Learning techniques: Decision Tree induction, Memory-Based Learning and Conditional Random Fields. We also investigate the influence of tagging errors on the overall chunker performance, which turns out to be quite substantial.
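
As an illustration of how NP chunking can be handed to the kinds of classifiers named above, the sketch below casts it as per-token labelling with IOB tags and scores predictions at the chunk level. The abstract does not state which chunk encoding or evaluation measure was used, so the IOB scheme, the exact-match F-measure, the function names and the example sentence are all assumptions made for illustration only; the chunk boundaries shown do not follow the NCP annotation guidelines.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def spans_to_iob(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping NP spans as one IOB tag per token (assumed scheme)."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B-NP"
        for i in range(start + 1, end):
            tags[i] = "I-NP"
    return tags


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Decode IOB tags back into spans; a stray I-NP opens a new chunk."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        # Close the open chunk on anything other than a continuation tag.
        if start is not None and tag != "I-NP":
            spans.append((start, i))
            start = None
        # Open a chunk on B-NP, or on I-NP when no chunk is open.
        if tag == "B-NP" or (tag == "I-NP" and start is None):
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def chunk_f1(gold: List[Span], pred: List[Span]) -> float:
    """Exact-match chunk-level F-measure (hypothetical evaluation helper)."""
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_set)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence: "Mały kot śpi na starej kanapie"
tokens = ["Mały", "kot", "śpi", "na", "starej", "kanapie"]
gold_spans = [(0, 2), (4, 6)]
gold_tags = spans_to_iob(len(tokens), gold_spans)

# A prediction that misses the adjective of the second chunk.
pred_tags = ["B-NP", "I-NP", "O", "O", "O", "B-NP"]
score = chunk_f1(gold_spans, iob_to_spans(pred_tags))
```

Under this framing, any token-level classifier can act as a chunker once it predicts one tag per token. Scoring whole chunks rather than tags means a single mislabelled token can invalidate an entire chunk, which is one way errors in earlier processing stages could propagate into chunker performance.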

This work was financed by the National Centre for Research and Development (NCBiR) project SP/I/1/77065/10.

Author information

Authors and Affiliations

Institute of Informatics, Wrocław University of Technology, Poland
Adam Radziszewski & Adam Pawlaczek

Editor information

Editors and Affiliations

Faculty of Informatics, Department of Computer Graphics and Design, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Department of Information Technologies, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Aleš Horák, Ivan Kopeček & Karel Pala

Cite this paper

Radziszewski, A., Pawlaczek, A. (2012). Large-Scale Experiments with NP Chunking of Polish. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2012. Lecture Notes in Computer Science, vol 7499. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32790-2_17

DOI: https://doi.org/10.1007/978-3-642-32790-2_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32789-6
Online ISBN: 978-3-642-32790-2
eBook Packages: Computer Science, Computer Science (R0)
