Abstract
In this work, we present two modules for an open-source Python library for the analysis of the Italian language. The modules are a POS tagger based on the Averaged Perceptron Tagger and a lemmatizer based on the vast collection of linguistic data held by the Department of Politics and Communication Science of the University of Salerno. While the Averaged Perceptron Tagger algorithm is mostly applied to English by well-known Python libraries such as NLTK and spaCy, the lemmatizer is an entirely original module that relies on a large electronic dictionary annotated with syntactic, morphological, and semantic tags. We present our approach and a preliminary experiment in which we compare the results of our modules with those of TreeTagger, another widely used POS tagger and lemmatizer.
A. Maisto edited Sects. 1, 2, 3, 4, 5; W. Balzano collaborated in the project.
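The abstract describes a lemmatizer that relies on an electronic dictionary and works alongside a POS tagger. As a rough illustration of that idea only, the sketch below looks up a lemma from a (word form, POS tag) pair, which is one way a tag can separate homographs such as the noun "porta" (door) from the verb form "porta" (carries). The dictionary entries, the tag names, and the `lemmatize` function are all hypothetical and are not taken from the authors' library or resource.

```python
# Hypothetical miniature dictionary: (inflected form, coarse POS tag) -> lemma.
# Entries and tag names are illustrative assumptions, not the authors' data.
TOY_DICTIONARY = {
    ("porta", "NOUN"): "porta",
    ("porta", "VERB"): "portare",
    ("case", "NOUN"): "casa",
    ("belle", "ADJ"): "bello",
}


def lemmatize(tagged_tokens, dictionary=TOY_DICTIONARY):
    """Return (form, tag, lemma) triples for a list of (form, tag) pairs.

    Pairs missing from the dictionary fall back to the lowercased form.
    """
    result = []
    for form, tag in tagged_tokens:
        key = (form.lower(), tag)
        lemma = dictionary.get(key, form.lower())
        result.append((form, tag, lemma))
    return result


# The same surface form receives a different lemma depending on its tag.
tagged = [("Porta", "VERB"), ("belle", "ADJ"), ("case", "NOUN"), ("porta", "NOUN")]
for form, tag, lemma in lemmatize(tagged):
    print(form, tag, lemma)
```

In this sketch the tagger output is supplied by hand; in a real pipeline the tags would come from the POS tagger, so tagging errors would propagate to the lemma lookup.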
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Maisto, A., Balzano, W. (2021). Building a Pos Tagger and Lemmatizer for the Italian Language. In: Barolli, L., Woungang, I., Enokido, T. (eds) Advanced Information Networking and Applications. AINA 2021. Lecture Notes in Networks and Systems, vol 227. Springer, Cham. https://doi.org/10.1007/978-3-030-75078-7_7
DOI: https://doi.org/10.1007/978-3-030-75078-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-75077-0
Online ISBN: 978-3-030-75078-7
eBook Packages: Intelligent Technologies and Robotics (R0)