The Tanl Lemmatizer Enriched with a Sequence of Cascading Filters

  • Giuseppe Attardi
  • Stefano Dei Rossi
  • Maria Simi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7689)


We have extended an existing lemmatizer, which relies on a lexicon of about 1.2 million forms in which lemmas are indexed by rich PoS tags, with a sequence of cascading filters, each in charge of a specific issue related to out-of-dictionary words. The last two filters resolve semantic ambiguities between words of the same syntactic category by querying external resources: an enriched index built on the Italian Wikipedia and the Google index.
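The general shape of such a design can be illustrated with a minimal Python sketch. It is not the Tanl implementation: the toy lexicon, the tag names `S` and `V`, and the three filters (`lowercase_filter`, `clitic_filter`, `identity_filter`) are hypothetical stand-ins chosen only to show a lexicon lookup keyed by form and PoS tag followed by a cascade that is consulted when the form is out of dictionary. The two semantic filters that query external resources are left out, since the abstract does not describe how they are queried.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy lexicon (assumption): (form, PoS tag) -> lemma.
# The lexicon described in the abstract holds about 1.2 million forms.
LEXICON: Dict[Tuple[str, str], str] = {
    ("case", "S"): "casa",
    ("porta", "S"): "porta",
    ("porta", "V"): "portare",
}

# A filter takes a form and its PoS tag and returns a lemma, or None to pass.
Filter = Callable[[str, str], Optional[str]]


def lowercase_filter(form: str, pos: str) -> Optional[str]:
    """Hypothetical filter: retry the lookup with the lowercased form."""
    return LEXICON.get((form.lower(), pos))


def clitic_filter(form: str, pos: str) -> Optional[str]:
    """Hypothetical filter: strip an enclitic pronoun from a verb and retry."""
    if pos != "V":
        return None
    for clitic in ("lo", "la", "li", "le"):
        if form.endswith(clitic):
            lemma = LEXICON.get((form[: -len(clitic)], pos))
            if lemma is not None:
                return lemma
    return None


def identity_filter(form: str, pos: str) -> Optional[str]:
    """Last resort: use the lowercased form itself as the lemma."""
    return form.lower()


# Filters are tried in order; the first one that returns a lemma wins.
CASCADE: List[Filter] = [lowercase_filter, clitic_filter, identity_filter]


def lemmatize(form: str, pos: str) -> str:
    """Look the form up in the lexicon, falling back to the filter cascade."""
    lemma = LEXICON.get((form, pos))
    if lemma is not None:
        return lemma
    for apply_filter in CASCADE:
        lemma = apply_filter(form, pos)
        if lemma is not None:
            return lemma
    return form
```

Indexing by the pair of form and PoS tag is what lets the same surface form map to different lemmas: in this sketch `lemmatize("porta", "S")` and `lemmatize("porta", "V")` take different lexicon entries, while `lemmatize("portalo", "V")` misses the lexicon and is resolved by the second filter in the cascade.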


Keywords: Lemmatization; Lexicon; Part-of-Speech tagging; Deep Search





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Giuseppe Attardi (1)
  • Stefano Dei Rossi (1)
  • Maria Simi (1)
  1. Dipartimento di Informatica, Università di Pisa, Pisa, Italy
