Spanish Treebank Annotation of Informal Non-standard Web Text

  • Mariona Taulé
  • M. Antonia Martí
  • Ann Bies
  • Montserrat Nofre
  • Aina Garí
  • Zhiyi Song
  • Stephanie Strassel
  • Joe Ellis
Conference paper

DOI: 10.1007/978-3-319-24800-4_2

Part of the Lecture Notes in Computer Science book series (LNCS, volume 9396)
Cite this paper as:
Taulé M. et al. (2015) Spanish Treebank Annotation of Informal Non-standard Web Text. In: Daniel F., Diaz O. (eds) Current Trends in Web Engineering. ICWE 2015. Lecture Notes in Computer Science, vol 9396. Springer, Cham

Abstract

This paper presents the Latin American Spanish Discussion Forum Treebank (LAS-DisFo). The corpus consists of 50,291 words and 2,846 sentences that are part-of-speech tagged, lemmatized, and syntactically annotated with constituents and functions. We describe how the corpus was built, the methodology followed for its annotation, and the annotation scheme and criteria applied to the most problematic phenomena commonly encountered in this kind of informal, unedited web text. This is the first available Latin American Spanish corpus of non-standard language that has been morphologically and syntactically annotated. It is a valuable linguistic resource that can be used for training and evaluating parsers and PoS taggers.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Mariona Taulé (2)
  • M. Antonia Martí (2)
  • Ann Bies (1)
  • Montserrat Nofre (2)
  • Aina Garí (2)
  • Zhiyi Song (1)
  • Stephanie Strassel (1)
  • Joe Ellis (1)

  1. Linguistic Data Consortium, University of Pennsylvania, Philadelphia, USA
  2. CLiC, University of Barcelona, Barcelona, Spain
