1 Introduction

Leading open-source Natural Language Processing (NLP) libraries using neural network structures, such as UDPipe (Straka 2018), spaCy (Montani et al 2022), and Stanza (Qi et al 2020), rely on Universal Dependencies (UD) treebanks for regular training. The UD project, committed to maintaining a consistent annotation scheme among the broadest array of human languages, currently offers about 200 treebanks for over 100 languages (Nivre et al 2016).

The primary UD treebank for Spanish, known as UD Spanish AnCora (Alonso and Zeman 2016), serves as the standard resource in the field and stems from the Spanish portion of the AnCora corpus (Taulé et al 2008). This corpus, originally compiled from standard written data from Spanish newspapers, was subsequently converted to conform to the UD guidelines and was incorporated into the CoNLL 2009 shared task. The AnCora corpus includes “225,000 words from the EFE Spanish news agency and 200,000 from the Spanish version of El Periódico newspaper”, and an additional 75,000 words from the LexEsp corpus (Sebastián-Gallés 2000), which comprises a diverse range of European Spanish texts from various genres, such as press articles and scientific papers. Owing to its frequent usage in numerous international evaluation campaigns for an array of syntactic and semantic NLP tasks, AnCora is regarded as a canonical resource.

In addition to AnCora, the Google Stanford Dependencies (GSD) (McDonald et al 2013) and the Parallel Universal Dependencies (PUD) treebanks are key resources in the field of NLP. GSD, a multilingual treebank originally including Spanish, was among the UD precursors, although Spanish didn’t receive as extensive manual attention as some other languages. The PUD treebanks, developed for the CoNLL 2017 shared task, consist of 1,000 sentences per language, drawn from news and Wikipedia sources, and have been translated into various languages. The first 750 sentences were originally in English, and the remaining 250 in German, French, Italian, or Spanish. The translated and morphologically annotated data was then converted to UD v2 guidelines by the UD community.

The NLP libraries spaCy, UDPipe, and Stanza have consistently demonstrated high accuracy, exceeding 98% for tasks such as lemmatization and PoS tagging, as shown in Sect. 2. While these metrics are impressive, it is important to note that they are primarily based on models trained on standard written language. This context raises questions about the libraries’ performance when applied to geographic linguistic varieties, such as dialects, or to specialized domains like rural Spanish. Advances in neural networks have certainly improved language model precision, but the open datasets used for training, such as the AnCora corpus or web-scraped data, are often limited to standard written forms. This lack of comprehensive data fails to capture the full range of geographic and other linguistic varieties, inevitably affecting the performance of these models.

To address these limitations, we embarked on a project to construct a gold standard for PoS tagging of spoken European rural Spanish. The foundation of this work is the data derived from the Corpus Oral y Sonoro del Español Rural (COSER), the “Audible corpus of spoken rural Spanish” (Fernández-Ordóñez 2005–present), which is the most extensive collection of spoken Spanish. The primary objectives of this project were twofold: to assess the accuracy of current taggers when dealing with non-standard data, and to identify specific characteristics of the COSER corpus, viewed as a collection of rural dialects, that could potentially result in annotation errors or biases.

The motivation behind this work is anchored in supporting NLP technology for low-resource languages and dialects, of which the spoken rural Spanish dialects form an integral part. Spanish, being one of the most spoken languages globally (534 million speakers), presents a rich tapestry of linguistic diversity which, unfortunately, is not fully captured in mainstream NLP models. This is primarily due to the focus of these models on standard, widely spoken language varieties, often overlooking less common dialects and language forms. Our study also raises awareness of interlinguistic and intralinguistic diversity as a shared heritage, promoting an inclusive view of language diversity.

As such, the proposed work aligns with the broader objective of expanding the diversity of languages covered in contemporary NLP research and applications. This aligns with an approach that not only benefits linguists and their communities but also fosters inclusiveness and more direct interactions among experts of different languages. This cross-pollination of ideas, expertise, and approaches can lead to more insightful linguistic descriptions and boost cross-lingual performance, a point that is particularly salient in the context of languages with high intralinguistic diversity, like Spanish.

In the following sections of this paper, we first provide a comprehensive review of the work done on morphosyntactic tagging of standard Spanish, before moving on to the latest advancements in the spoken domain (Sect. 2). Section 3 delineates the methodology used to extract a geographically representative sample from the COSER corpus, detailing the manual and automated processes employed to assign accurate PoS tags. This section also summarizes our validation efforts for these morphosyntactic tags, which relied on both manual review and the innovative use of games with a purpose (GWAPs). These approaches significantly facilitated the final development of the gold standard dataset featured in this paper.

Section 4 showcases the accuracy results of leading-edge NLP libraries such as UDPipe, spaCy, and Stanza NLP in tasks related to lemmatization and PoS tagging. This analysis encompasses both fine-grained (FEATS) and coarse-grained (UPOS) labels when applied to transcriptions of spoken data. Furthermore, we identify and dissect the most prevalent errors and biases that emerged during the manual correction of the tags. Section 5 presents the fine-tuning results of the spaCy transformer model in the context of PoS tagging of spoken Spanish. Lastly, in the concluding section, we summarize the main findings of our research, discuss their implications, and contemplate future directions for this line of inquiry. In doing so, we aim to provide a holistic perspective on our endeavor to establish a gold standard for PoS tagging of spoken European rural Spanish and its potential ripple effect across the wider field of NLP.

2 Literature overview

2.1 PoS tagging Spanish

The landscape of Spanish PoS tagging has evolved remarkably over the decades, beginning with GRAMPAL’s rule-based approach (Moreno and Goñi 1995) and moving to the advanced neural networks leveraged by contemporary NLP libraries like UDPipe (Straka and Straková 2017; Straka et al 2019), spaCy (Montani et al 2022), and Stanza NLP (Qi et al 2020). Initially, Freeling introduced morphological analysis and PoS tagging with over 97% precision (Carreras et al 2004). However, as shown in Table 1, a comparative study by Parra Escartín and Martínez Alonso (2015) benchmarked TreeTagger (Schmid 1994), IULA TreeTagger (Martínez et al 2010), Freeling (Padró and Stanilovsky 2012), and the IXA PoS tagger (Agerri et al 2014), showing accuracies ranging from 85% to 88%. Notably, advancements in machine learning have significantly boosted performance, with UDPipe achieving up to 99.05% accuracy in UPOS and 98.70% in FEATS by 2019 (Straka et al 2019), following its integration of advanced techniques like ELMo, Flair, and BERT embeddings. Similarly, spaCy and Stanza NLP have reported accuracies close to 99%, underscoring the effectiveness of Convolutional Neural Network (CNN), transformer, and Bidirectional Long Short-Term Memory (Bi-LSTM) architectures (Montani et al 2022; Qi et al 2020).

Table 1 Key developments in Spanish PoS tagging with accuracy metrics

It is important to highlight that this evolution is deeply intertwined with the development and influence of the AnCora treebank (Taulé et al 2008), stemming from early foundational work such as the GRAMPAL tagger (Moreno and Goñi 1995), and subsequent integrations with emerging NLP technologies like the Freeling suite (Carreras et al 2004; Padró et al 2010). The collaboration between these early projects and AnCora’s transition to the UD framework (Alonso and Zeman 2016) led to the development of the robust Spanish language models we see today in neural network-based NLP applications like UDPipe (Straka 2018), spaCy (Montani et al 2022), and Stanza NLP (Qi et al 2020).

2.2 PoS tagging spoken Spanish

The evolution of Part-of-Speech (PoS) tagging for spoken Spanish has been marked by significant efforts to adapt preexisting methods for the complexities of spoken language. The adaptation of GRAMPAL for the Spanish subcorpus of C-ORAL-ROM was one of the first attempts to address the PoS tagging of spoken Spanish, achieving a precision of 95.3% (Moreno Sandoval and Guirao 2006). This endeavor highlighted the need for tagsets that cater specifically to the particularities of spoken language, including multi-word expressions and tokenization challenges.

Subsequent projects have advanced the field by adapting the Freeling library. For instance, the Spanish oral learner corpus CORELE (Campillos-Llanos 2016) utilized Freeling, emphasizing the addition of learner-specific tags and the manual correction of tags to better accommodate the unique linguistic features found in learner language. Similarly, the COSER corpus (De Benito Moreno et al 2016; Fernández-Ordóñez 2005–present) implemented Freeling with rule-based grammar adjustments to handle the dialectal and morphosyntactic variations of rural Spanish speech. The adaptation process for COSER involved refining Freeling’s tokenizer and disambiguation modules to better align with the oral and dialectal peculiarities present in the corpus. Along the same lines, the Sociolinguistic Speech Corpus of Chilean Spanish (COSCACH) project (Sadowsky 2022) expanded Freeling to cover Chilean Spanish. Despite these efforts to adapt Freeling for spoken Spanish corpora, comprehensive documentation and accuracy assessments are scarce, making it difficult to evaluate or replicate these adaptations. Furthermore, the adaptation of Freeling for the COSCACH project is the only one made available as open source.

The most recent progress in the field was achieved through the deployment of maximum-entropy-based PoS taggers, devised by the Stanford NLP group, tailored for both oral and written varieties of Baja California Spanish. This innovative approach involved two versions of the taggers: a standard MaxEnt tagger and an enhanced version incorporating distributional similarities (MaxEnt + DS). Both models were initially trained using the AnCora corpus and subsequently evaluated on the Corpus del Habla de Baja California (CHBC), where they demonstrated impressive accuracies of 96.8% and 97%, respectively (Rico-Sulayes et al 2017). However, it’s important to note that the CHBC comprises both written and spoken texts, making it a mixed corpus rather than exclusively spoken.

3 Corpus processing and Gold Standard setting

3.1 Transcriptions processing

To perform a rigorous evaluation of the accuracy of cutting-edge PoS taggers applied to spoken Spanish, we strategically chose the COSER corpus. This robust data source encompasses more than 4 million linguistic tokens. Utilizing this substantial repository, we formulated a gold-standard corpus, termed COSER-PoS, to provide a reliable benchmark for systematic analysis and comparison. This required a geographically balanced sample representing all known Spanish varieties present in our source corpus. For this, interviews were grouped into different regions based on distinctive dialectal features, disregarding administrative boundaries. From each region, a random selection of 500 to 600 conversation turns was made. It’s important to note that in bilingual areas such as Catalonia and the Basque Country, the corpus solely includes the Spanish language, although with potential loanwords from other regional languages.

Our goal was to obtain comprehensive speech turns from each participant, without any markers that might disrupt the natural language flow and potentially affect tagging. As such, conversational markers transcribed in the COSER corpus and indicated within brackets were removed. For instance, cross-conversations were eliminated, as these were typically extraneous to the primary conversation and could interfere with the tagging process. One marker we did retain was “NP”, used in place of personal names; these markers primarily serve to conceal the real names of informants for privacy reasons.

Furthermore, we removed markers relating to pronunciation, vocal emissions, pauses, silence, external noises, gestures, and recording intelligibility. These were initially used in the COSER corpus to provide additional information about the conversation’s context, but they were not essential for our analysis.

Certain elements that naturally occur in the speaker’s communication and were not enclosed in brackets were retained. These included incomplete or repeated words, as well as foreign terms. By preserving these elements, we were able to maintain the authenticity of the spoken language, as they reflect the idiosyncrasies and unique characteristics of the speaker’s individual communication style and cultural influences.

The COSER transcription framework mandates that vocalic, consonantal, or syllabic segments suppressed within a word or due to syntactic phonetics are not transcribed. Conversely, vocalic fusion between words within a syntactic chain is denoted with (’) (Fernández-Ordóñez 2005–present). Initially, to ensure comprehensive tokenization, we dissected these occurrences into individual components (e.g., pa’l was divided into three tokens: pa, ’, l), aligning with the framework’s guidelines for the treatment of elisions and fusions. However, we ultimately decided to remove the single quotation mark. While it initially served a purpose during tokenization, its use was not systematic in the raw dataset. In certain instances, it was found adjacent to the initial word but did not serve to separate any components. It merely indicated the elision of sounds, which was already evident from the transcription itself.

Subsequently, these turns underwent automatic processing using spaCy’s (Montani et al 2022) large model and the spacy-conll module to segment the turns into sentences and to obtain a lemmatized and tagged base dataset in the CONLL-U format used for UD treebanks. The decision to use the spaCy library was motivated by the ease of use of its state-of-the-art pipeline and by its high accuracy (see subsection 2.1).
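As an illustration, a minimal sketch of this step, assuming spaCy’s es_core_news_lg model together with the conll_formatter component provided by spacy-conll (file names and turn handling are placeholders, not the project’s actual scripts), could look as follows:

```python
# Illustrative sketch of the automatic pre-annotation step: cleaned conversation
# turns are segmented, lemmatized, and tagged with spaCy's large Spanish model,
# then serialized as CONLL-U via the spacy-conll formatter.
import spacy
import spacy_conll  # noqa: F401  # registers the "conll_formatter" factory

nlp = spacy.load("es_core_news_lg")
nlp.add_pipe("conll_formatter", last=True)

with open("coser_turns.txt", encoding="utf-8") as f:
    turns = [line.strip() for line in f if line.strip()]

with open("coser_base.conllu", "w", encoding="utf-8") as out:
    for turn_id, turn in enumerate(turns, start=1):
        doc = nlp(turn)  # tokenization, sentence splitting, lemmas, UPOS, FEATS
        out.write(f"# turn_id = {turn_id}\n")
        out.write(doc._.conll_str + "\n")
```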

The resulting CONLL-U file was enriched with metadata, including the original turn ID, the data collection location, and the time range from which each sentence was extracted. Notably, the XPOS field was repurposed to capture the original FreeLing tag assigned by the COSER team, as discussed in subsection 2.2. The COSER FreeLing tags, based on the adapted EAGLES model, were not considered a definitive gold standard for our research due to several critical limitations. Primarily, the EAGLES model, as customized for COSER, lacks public availability and sufficient documentation. During prior experiments, attempts were made to convert the EAGLES scheme to UD; however, this process required a meticulous manual analysis to establish accurate correspondences. A notable challenge encountered was that 56 tags did not align perfectly with the EAGLES framework, raising concerns about potential misinterpretations or adjustments that might only apply to a limited number of examples, rather than being systematically applicable. Additionally, the treatment of multi-token words, such as verbs with clitics and multi-word expressions, particularly compound proper names, differed significantly between the EAGLES and UD models. While EAGLES tends to tokenize such compound names as a single unit, UD advocates for treating each component of the compound word separately, further complicating the conversion process.

Despite these challenges, when estimating the differences in transitioning from EAGLES to UD, we observed that our manually verified gold standard diverges from the original EAGLES annotation, especially in categories unique to oral speech, such as interjections (INTJ) and incomplete words. Overall, the correspondence between the EAGLES annotations and our UD-adapted approach is approximately 90%.

Following the initial segmentation and processing steps, our methodology employed the spaCy library’s large model and the spacy-conll module to automatically process the data, aiming to create a preliminary, clean document formatted in CONLL-U for UD treebanks. This step was crucial for structuring the data in a manner that allowed for efficient examination and verification of all linguistic tags, leveraging spaCy’s state-of-the-art capabilities for ease of use and accuracy in linguistic processing.

However, it’s important to emphasize that this automated tagging and formatting process was not the final step in our data preparation. Recognizing the limitations of solely relying on automatic processing for achieving the desired level of accuracy and consistency, we subsequently implemented a thorough manual review and correction phase (see Subsection 3.2).

In light of the inconsistent application of the correspondence between phonological and orthographic spelling in the original dataset, we opted to preserve this relationship to allow for the potential recreation of an orthographic version of the dataset. To facilitate this, an “Ortho” value was introduced to the MISC field to document the corresponding orthographic spelling of words transcribed phonologically. This led to the manual and semi-automatic addition of 3,913 orthographic words, supplementing the 1,206 initially recovered from COSER’s XML files. This decision, while not explicitly reiterated in previous chapters, is rooted in the transcription conventions employed by COSER, which are particularly relevant considering the corpus’s representation of non-standard speech.

The organization of clitics presented a significant challenge due to the intricacies of their representation in language. Under the UD guidelines, clitics are considered separate syntactic words. This implies that they should be separated from the host word during tokenization and assigned their line in the annotation. Consequently, managing clitics in our dataset required a largely manual process to ensure they were appropriately identified and separated. In contrast, the inclusion of multi-word expressions (MWEs) was facilitated by semi-automatic methods, benefiting from the standardized approach UD offers for handling such constructs.
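The hand-built fragment below, which is illustrative rather than an excerpt from COSER-PoS, shows how both conventions surface in the annotation: a fused verb+clitic form is kept as a multi-word token range line and split into its syntactic words, as in UD Spanish treebanks, and the phonological form pa carries its orthographic counterpart in the MISC field; the conllu library can round-trip such annotations:

```python
# Hand-built illustration (not a corpus excerpt): clitic splitting via a
# multi-word token line and an Ortho value documenting orthographic spelling.
from conllu import parse

fragment = """\
# turn_id = 0000
# text = Dile que venga pa casa
1-2\tDile\t_\t_\t_\t_\t_\t_\t_\t_
1\tDi\tdecir\tVERB\t_\tMood=Imp|Number=Sing|Person=2|VerbForm=Fin\t_\t_\t_\t_
2\tle\tél\tPRON\t_\tCase=Dat|Number=Sing|Person=3|PronType=Prs\t_\t_\t_\t_
3\tque\tque\tSCONJ\t_\t_\t_\t_\t_\t_\t_
4\tvenga\tvenir\tVERB\t_\tMood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t_\t_\t_\t_
5\tpa\tpara\tADP\t_\t_\t_\t_\t_\tOrtho=para
6\tcasa\tcasa\tNOUN\t_\tGender=Fem|Number=Sing\t_\t_\t_\t_

"""

sentence = parse(fragment)[0]
for token in sentence:
    if token["misc"]:  # e.g. the orthographic form recovered or added manually
        print(token["form"], "->", token["misc"].get("Ortho"))
```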

In conclusion, the final COSER-PoS dataset utilized for the experiments discussed in this paper consists of 13,219 sentences, derived from 1,760 conversational turns, and encompasses 196,372 tokens. This dataset demonstrates considerable geographic diversity, having been compiled from 16 distinct dialectal regions across Spain, as detailed in Table 2. Although each region contributed between 500 and 600 turns, the size of these turns varied according to the number of words spoken by the individual participants. Consequently, upon segmentation into sentences and subsequent tokenization, certain regions emerged with a higher quantity of data. The number of sentences and tokens in each region, therefore, depends on the fluency and verbosity of the speakers from that region rather than a predefined distribution.

Table 2 Distribution of sentences and tokens by region in the COSER-PoS dataset

3.2 PoS tagging validation approaches

After selecting a sample from the COSER corpus and transforming it into the CONLL-U format, we initiated a meticulous manual validation process for lemmas, UPOS, and FEATS tags. This thorough examination culminated in the formation of an initial reference corpus intended to fulfill several key objectives.

First, it enabled us to carry out a preliminary evaluation of the taggers’ performance, furnishing an early benchmark to gauge their efficacy. Second, it provided a platform to investigate if certain regional dialects exhibited a closer alignment with the standard language. In particular, we sought to discern if these dialects demonstrated superior performance due to their resemblance with the written language, which had been utilized as the training set for the taggers.

The first investigation revealed that the accuracy of UPOS tagging for spoken Spanish using the spaCy, Stanza NLP, and UDPipe libraries ranged from 0.90 to 0.93. The Transformer-based spaCy model and Stanza NLP demonstrated superior performance, achieving high accuracy rates. There was no significant difference detected amongst the various regional variations analyzed (Bonilla et al 2022).

Analyzing the performance of the Transformer-based spaCy model in terms of grammatical categories, the ones most challenging for automatic tagging, and consequently registering the lowest F1 scores, were as follows in ascending order: interjections (INTJ, F1=0.53), proper nouns (PROPN, F1=0.69), auxiliary verbs (AUX, F1=0.75), adjectives (ADJ, F1=0.84), and subordinating conjunctions (SCONJ, F1=0.87). A manual analysis of these categories revealed that the decreased performance of auxiliaries, adjectives, and subordinate conjunctions was primarily due to the ambiguous roles that some grammatical categories can play, and the unclear boundaries between them. This was particularly evident in the case of polyfunctional words such as como (“how”) and cuando (“when”), where a manual review was instrumental in determining whether they were used as adverbs or subordinate conjunctions.

In the case of auxiliary verbs, a particular challenge arose when we manually reassigned some of the tags of the verbs ser and estar (“to be”) to VERB (F1=0.91) when they acted as the main linking verb in sentences. Conversely, we assigned AUX tags to verbs when they functioned as helper verbs in verbal expressions. However, the broad rules suggested by the UD guidelines, built into the models trained on the AnCora corpus, negatively impacted accuracy.

We faced difficulties distinguishing the boundary of passive participles as verbs or adjectives. While dialectal transcription may potentially affect accuracy, the task of automatically differentiating words with strictly verbal functions from those acquiring adjective values through lexicalization proved to be challenging.

Interjections registered the lowest performance. This result is attributable to the guidelines set by AnCora, derived from UD, which stipulate that words used in exclamations retain their primary grammatical categories if they typically belong to another part of speech. For instance, Dios (“God”) is still categorized as a NOUN even in exclamatory contexts. For multi-word interjections, each token retains its primary category, but they are collectively assigned a separate tag to group them as a Multi-Word Expression (MWE), thereby increasing the complexity of classification. It is worth noting that the X tag, employed by UD for forms to which a grammatical category cannot be assigned, recorded an F1 score of 0. This result is understandable given that the models have been trained on standard written language, where incomplete words are not customary.

In the following research stage, we incorporated the aforementioned reference dataset into a Games with a Purpose (GWAPs) platform named Juegos del Español (Bonilla et al 2022; Segundo Díaz et al 2023a, b; Bonilla et al 2023) and implemented an inter-annotator agreement measure based on the previous work of Millour and Fort (2018).

A total of 121 participants were recruited for the study through two main methods. Initially, an email invitation was sent to potential participants, who were then required to register with a valid email address and provide consent through an informed consent form. This initial process also involved completing a demographic questionnaire that covered various socio-geographic details, including gender, age, education, linguistic background, and places of residence and upbringing. To complement the email-based recruitment, ten gaming workshops were organized, targeting students at the Humboldt University of Berlin with at least a B1 proficiency in Spanish. These workshops were integrated into Spanish Language and Linguistics seminars during the Winter Semester of 2022–2023, focusing on linguistic variation in Spanish. Students engaged in these workshops were specifically tasked with correcting PoS tags related to Andalusian and Canarian Spanish, aligning with seminar themes on Latin American Spanish.

Participants verified 5,976 tokens, averaging about 49.3 tags per participant (Bonilla et al 2023). Among these verified tokens, 4,139 were unique. The overall accuracy of the participants’ verification efforts reached 0.80. The tags NOUN and PROPN achieved the highest scores, ranging from 0.93 to 0.94. Other high-performing tags encompassed ADP, CCONJ, DET, ADV, INTJ, and NUM, with F1 scores spanning from 0.83 to 0.89. On the other hand, AUX recorded a low F1 score of 0.61, which can be traced back to the broad definition of auxiliaries in the UD guidelines explained above and to insufficient clarification during the game’s tutorial sessions, both of which confused participants. By excluding corrections from AUX to VERB, we observed an improvement in the overall accuracy score from 0.80 to 0.84. Concurrently, the F1 score for AUX rose from 0.61 to 0.84. These adjustments offered a more precise reflection of participants’ performance.

Interestingly, the results also revealed that the SCONJ tag had a comparatively low F1 score of 0.46, relative to other tags. In terms of frequency, SCONJ was most often mislabeled as CCONJ (23/47 instances), followed by PRON (15/47 instances). The technical differentiation between SCONJ and CCONJ, both denoting various types of conjunctions, may be unfamiliar to participants without a background in linguistics or language studies. The word que (“that”), most often functioning as a SCONJ to introduce a subordinate clause, caused the most verification difficulties and was often mislabelled as PRON.

Of the 5,976 verified tokens, only 5% (299 tokens) were fully validated using the inter-annotator agreement measure, with an average of 3.25 verifications conducted per token. Among these validated tags, 92.3% (276 out of 299) served as confirmations, while 6% (18 out of 299) involved the verbs ser and estar (“to be”) being labelled as auxiliaries instead of verbs when functioning as copulas. Excluding these ser and estar corrections, the inter-annotator accuracy reaches 0.98, which highlights the value of incorporating inter-annotator agreement measures, proven successful in terms of accuracy. However, the challenge lies in effectively engaging and retaining the players, as a large number of participants is crucial for validating the vast number of PoS tags within the COSER-PoS dataset. We estimated that, to validate all the PoS tags of COSER-PoS (excluding punctuation), 69,670 participants would be necessary to achieve a comparable degree of validation for the larger tag set, using a scaling technique based on the validation ratio.
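A back-of-the-envelope reconstruction of that scaling estimate is sketched below; the exact number of non-punctuation tokens is our assumption, back-derived from the figures reported above:

```python
# Rough reconstruction of the participant estimate, assuming the validation
# ratio scales linearly: the participants-per-validated-token ratio observed
# in the pilot is extrapolated to the full, punctuation-free tag set.
participants = 121          # players recruited in the GWAP study
validated_tokens = 299      # tokens fully validated via inter-annotator agreement
target_tokens = 172_000     # approx. COSER-PoS tokens excluding punctuation (assumed)

needed = target_tokens * participants / validated_tokens
print(f"{needed:,.0f} participants needed")  # on the order of the 69,670 reported
```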

The findings of our previous studies, including challenges in morpho-syntactic tagging and the potential of crowdsourcing validation through games with a purpose (GWAPs), lay a robust foundation for future research. By learning from these experiences and continuously refining our methodologies, we can enhance the accuracy of automatic tagging, contributing to advancements in computational linguistics and the understanding of linguistic variations in spoken Spanish.

3.3 COSER-PoS Gold Standard particularities

Our exploration into the morpho-syntactic tagging of Spanish dialects has yielded several key insights that guided our future research. Foremost among these was the clear necessity for our tagging practices to closely follow the UD guidelines to enhance accuracy. To this end, we have executed a side-by-side comparison with the AnCora and GSD treebanks to better align our tagging method with UD.

We also took into account the most common corrections suggested by the players in the Games with a Purpose (GWAPs). These contributions were instrumental in identifying further errors, for example in the first version of the Gold Standard, where some ADJ tokens were still tagged as VERB and vice versa. Additionally, in our quest to provide a more harmonized and accurate benchmark for parsing Spanish, we have instituted several key changes to our Gold Standard. These adjustments aim at better alignment with the AnCora and UD guidelines while ensuring that specific tags or groupings are in place to cater to the spoken-domain characteristics of our transcriptions. This subsection details these adjustments and their implications.

Regarding the coarse-grained PoS tags or UPOS, a significant revision was implemented. We assigned the DET label to all possessive adjectives, including some postposed possessives common in spoken Spanish, as seen in la casa mía (“my house”). This decision was made to be consistent with AnCora, because the possessive is functional for the syntactic construction: it limits or gives context to the noun. However, the classification of possessives like “my,” “mine,” and “my own” as either determiners (DET), pronouns (PRON), or adjectives (ADJ) presents an intriguing linguistic dilemma. This debate centers on their functional role in sentences: whether they primarily modify nouns, indicating possession, or stand in for nouns themselves. In the UD framework, these words are classified as determiners because they specify or clarify the noun they accompany, rather than substituting for it, aligning with the DET role of narrowing down the reference of a noun. However, some grammatical analyses prefer to see them as adjectives due to their descriptive modifying function and their agreement in number (and sometimes gender) with the nouns they modify, mirroring typical adjective behavior.

Ordinal numbers, conventionally conceived as ADJ in UD, initially received a different treatment in our benchmark: they were marked as ADV when found in a VERB context. The reasoning was that ordinal numbers can function in different contexts, as adverbs (ADV) when modifying a verb (VERB) or as adjectives (ADJ) when modifying a noun (NOUN). Despite these differing roles, they were ultimately tagged ADJ in both situations, following the UD guidelines and anticipating the future parsing layer, where the dependency relation distinguishes between adjectival modifiers (amod) and adverbial modifiers (advmod).

As for auxiliary verbs, we followed UD guidelines and marked ser and estar as AUX consistently. Verbs that were semi-auxiliaries, typically found in verbal periphrasis, were changed to VERB. The only verbs occasionally used as auxiliaries apart from ser and estar are poder (“can”), haber (“have/be”), deber (“must/should”), saber (“know”), querer (“want”), and soler (“used to”).

In the context of interjections (INTJ), we have primarily adhered to the pre-existing categorizations, with a few notable exceptions. For example, bueno (“good”) is typically classified either as INTJ or, particularly when it precedes a noun, as ADJ. Similarly, claro (“of course” or “clear”) is conventionally identified as either ADJ or INTJ. This dual categorization also applies to words such as vale (“ok” or “cost”), vamos (“let’s go”), vaya (“(you) go”), and venga (“come on”). These terms have been categorized as either VERB or INTJ, contingent upon the specific context.

Our Gold Standard also incorporates several unique features that further differentiate it from benchmarks designed for standard Spanish. These additions are geared towards capturing the idiosyncrasies of spoken language and accommodating their presence in our data.

A feature of particular interest is the Degree=Dim tag, designated for diminutive terms, a prominent linguistic device in colloquial Spanish. Employed to convey endearment, diminution, or a lesser degree of an adjective or noun’s root concept, diminutives frequently serve as indicators of a speaker’s regional origin due to their variable usage across different geographical areas. Take, for instance, the diminutive morphemes -ito, -ita, which are the most widely used and the only ones documented in AnCora. However, our dataset has captured other morphemes such as -ica, -ico (e.g., juerguecica “little party”), -illa, -illo (e.g., calderilla “little boiler”), and -ín, -ino, -ina (e.g., estampadín “little print”). These examples demonstrate that our dataset offers a comprehensive view, capturing not only the general characteristics of spoken Spanish but also its regional variations, thus enhancing its linguistic richness and detail.

We also retain foreign words documented in the original COSER transcripts and mark them with the Foreign=Yes feature. This feature is particularly useful in capturing the diversity of Spanish in contact situations, where speakers may use words from other languages. Examples from our dataset include budells (“intestines”) from Catalan or cornos (“horns”) from Galician.

Another unique feature is the disambiguation of the PronType into Interrogative (Int) or Exclamative (Exc). The PronType=Int tag is used for interrogative pronouns, while the PronType=Exc tag is used for exclamatory pronouns. These tags help to better capture the tone and intent of spoken language, which often involves more exclamations and questions than written language.
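The toy token entries below, which are illustrative rather than quoted from the corpus, show how these benchmark-specific values surface in the UPOS and FEATS columns:

```python
# Illustrative token-level annotations (not corpus excerpts) for the
# benchmark-specific feature values described above.
examples = [
    {"form": "juerguecica", "upos": "NOUN", "feats": "Degree=Dim|Gender=Fem|Number=Sing"},
    {"form": "budells",     "upos": "NOUN", "feats": "Foreign=Yes"},   # Catalan loan
    {"form": "dónde",       "upos": "ADV",  "feats": "PronType=Int"},  # e.g. ¿Dónde vives?
    {"form": "cómo",        "upos": "ADV",  "feats": "PronType=Exc"},  # e.g. ¡Cómo llueve!
]
for token in examples:
    print(f'{token["form"]:12} {token["upos"]:5} {token["feats"]}')
```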

Our dataset sheds light on distinctive verb conjugations frequently encountered in the spoken domain, such as in reported and conversational speech. These forms, although less prevalent in written language, are crucial for understanding the linguistic dynamics of everyday conversation. For instance, the second-person singular imperative form, epitomized by the word quita (translated as “remove” or “go away”), is commonly used to deliver direct, informal commands, thus embodying the spontaneity of spoken communication.

Moreover, COSER-PoS provides insights into a variety of verb conjugations related to the second-person plural pronoun vosotros, which is less prevalent in standard-driven domains. Included among these is the conditional plural form (e.g., pasearíais, translating to “you would walk”), which is conventionally used to posit hypothetical situations. Furthermore, the dataset includes the future indicative form for the second person plural, as exemplified by verbs such as mandaréis (“you will send”), iréis (“you will go”), probaréis (“you will try”), and habréis (“you will have”), thereby encapsulating expressions of future actions or intentions.

The dataset also reveals peculiar instances of the second person plural preterite indicative form, with examples such as estuvisteis (“you were”) and tuvisteis (“you had”), which discuss completed past actions. Additionally, the second person plural imperfect indicative form is exemplified by words like teníais (“you had”), hacíais (“you were doing”), and organizabais (“you were organizing”), capturing ongoing or habitual past actions.

Notably, we also find the second person plural imperfect subjunctive form, as in quisierais (“you would want”). This form, often employed to articulate hypothetical desires or polite requests, underscores the conversational subtleties embedded within the corpus.

Lastly, the dataset elucidates the second person singular imperfect subjunctive form, with examples such as hubieras (“you had”), vieras (“you saw”), trabajaras (“you worked”), querías (“you wanted”), and fueres (“you were”). These linguistic peculiarities remain undocumented in pre-existing treebanks, underscoring the unique contributions of our COSER-PoS.

Through our rigorous exploration and subsequent modifications, the COSER-PoS Gold Standard has emerged as a comprehensive tool that accurately captures the nuances of the spoken Spanish language. Our sustained commitment to align with UD guidelines, coupled with the invaluable feedback from the Games with a Purpose (GWAPs), has ensured a high degree of accuracy and consistency. The adjustments we have implemented not only enhance alignment with AnCora and UD guidelines but also cater to the unique characteristics of spoken language prevalent in our data. Moreover, the unique features we have incorporated, such as the Degree=Dim tag for diminutive terms and the Foreign=Yes feature for foreign words, illustrate our commitment to reflect the rich diversity and complexity of spoken Spanish.

3.4 Spanish UD treebanks harmonization

Emerging from a research stay with a primary objective of improving COSER-PoS and adapting it to align with the UD guidelines, we identified the need for a more comprehensive revision of the existing treebanks for Spanish. This realization sparked an ongoing, collaborative endeavor with Professor Daniel Zeman from Charles University to harmonize all Spanish treebanks. Some of these modifications have already been integrated into the latest UD release, version 2.12, which was published on May 15, 2023. These refinements include the reclassification of donde (“where”) and cuando (“when”), allowing them to be identified as either SCONJ or ADV, but not PRON, with their interrogative forms, dónde and cuándo, exclusively recognized as ADV. The existential hay (“there is/are”) and necessitative hay que (“have to”) have been reassessed, transitioning from AUX to VERB. Furthermore, the role of cuyo (“whose”) has been refined as a relative determiner. The classifications for qué (“what”) and cuál (“which”) have been updated to be exclusively interrogative, and que (“that”) and cual (“which”) are now identified solely as relative terms.

4 State-of-the-art models accuracy estimation

To evaluate the performance of Part-of-Speech (PoS) taggers on written data, in a manner that is comparable with our gold standard for spoken Spanish, we used the Spanish test sets from the UD version 2.12 (Zeman et al 2023), which include the AnCora, GSD, and PUD datasets.

These datasets (Table 3) are divided into training, development (dev), and testing sets, following the common practice in machine learning. The training set, which accounts for 80% of the data, is used to train the models, helping them to learn the underlying patterns. The development set, constituting 10% of the data, is employed to tune model parameters and to guide the selection of the best model. Lastly, the test set, also comprising 10% of the data, is utilized to evaluate the final model’s performance in a “real-world” scenario, measuring how well the model generalizes to unseen data. Notably, the PUD dataset only includes a test set, and therefore, it will solely be utilized for evaluation purposes, without contributing to the training process.

Consequently, for the state-of-the-art models’ accuracy estimation, we used the test set from our benchmark, comprising 10% of the total sentences. These sentences were randomly selected while ensuring that 10% of the sentences from each regional dialect were included, in order to maintain sufficient dialectal representation within the data.
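A sketch of how such a region-stratified selection can be drawn is given below; it assumes that each sentence’s CONLL-U metadata records the dialectal region it was sampled from (the metadata key and file names are placeholders):

```python
# Sketch of the region-stratified 10% test selection, assuming each sentence's
# metadata carries the dialectal region it was sampled from.
import random
from collections import defaultdict
from conllu import parse_incr

random.seed(42)  # illustrative seed for reproducibility

by_region = defaultdict(list)
with open("coser_pos.conllu", encoding="utf-8") as f:
    for sent in parse_incr(f):
        by_region[sent.metadata.get("region", "unknown")].append(sent)

test = []
for region, sents in by_region.items():
    k = max(1, round(0.10 * len(sents)))  # 10% of each region's sentences
    test.extend(random.sample(sents, k))

with open("coser_pos_test.conllu", "w", encoding="utf-8") as out:
    for sent in test:
        out.write(sent.serialize())
```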

Our benchmark comprises two distinct versions: COSER-PoS, which retains the phonological transcriptions as they were originally extracted from the corpus, and COSER-PoS_o, an orthographic version of the benchmark. The objective behind maintaining these two versions was to discern the influence of spelling on the accuracy of the PoS taggers.

Table 3 Datasets used for experiments

In our evaluation, presented in Table 4, we provide a detailed comparison of the performance of pre-trained models on the Spanish test datasets, focusing on three integral components of the processing pipelines utilized by spaCy, Stanza NLP, and UDPipe: lemmatization, Universal POS (UPOS) tagging, and fine-grained tagging, often referred to as morphological analysis and abbreviated as FEATS in this context. The models under examination included spaCy’s small (sm), medium (md), transformer (trf), and large (lg) versions, along with Stanza NLP and UDPipe 1.2.0.
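The comparison follows the evaluation logic sketched below, in which the gold tokenization is fed directly to the pipeline to side-step alignment issues; this is a simplification of the actual protocol, and the model name, file paths, and feature-string comparison are assumptions:

```python
# Simplified accuracy computation for one spaCy model against a gold CONLL-U
# test file: gold tokens are passed to the pipeline directly, and lemma, UPOS,
# and FEATS are compared per token.
import spacy
from spacy.tokens import Doc
from conllu import parse_incr

nlp = spacy.load("es_dep_news_trf")   # or es_core_news_sm / md / lg

def feats_str(feats):
    # Render a CONLL-U FEATS dict roughly the way spaCy's MorphAnalysis prints it.
    return "|".join(f"{k}={v}" for k, v in sorted((feats or {}).items()))

totals = {"lemma": 0, "upos": 0, "feats": 0}
n = 0
with open("coser_pos_test.conllu", encoding="utf-8") as f:
    for sent in parse_incr(f):
        gold = [t for t in sent if isinstance(t["id"], int)]  # skip MWT range lines
        doc = nlp(Doc(nlp.vocab, words=[t["form"] for t in gold]))
        for tok, g in zip(doc, gold):
            n += 1
            totals["lemma"] += tok.lemma_ == g["lemma"]
            totals["upos"] += tok.pos_ == g["upos"]
            totals["feats"] += str(tok.morph) == feats_str(g["feats"])

for metric, correct in totals.items():
    print(f"{metric}: {correct / n:.3f}")
```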

Table 4 Comparison of pre-trained models lemma, UPOS, and FEATS accuracy for Spanish test datasets

4.1 Lemmatization

For lemmatization accuracy, the models consistently performed best on the AnCora dataset, as expected, with all models reaching an accuracy of 0.98 or 0.99. In contrast, the models performed slightly worse on the COSER-PoS dataset, with accuracies ranging from 0.93 to 0.96. However, when these models were evaluated on the COSER-PoS_o dataset, the orthographic version of COSER-PoS, the lemmatization accuracies improved slightly, ranging from 0.95 to 0.97. This highlights the influence of spelling on the performance of the models, demonstrating that they are more attuned to standard orthographic forms found in written Spanish.

Despite this overall satisfactory performance, the models encounter certain issues in the realm of lemmatization, as it depends on a limited lexicon. For instance, the noun cara (“face”) is incorrectly lemmatized as the adjective caro (“expensive”). Similarly, libro (“book”) is mistakenly lemmatized as librar (“to free/release”), and baño (“bath”) as bañar (“to bathe”). These lemmatization errors suggest that the pipelines’ lemmatizer seems to operate independently from the PoS tagging. To illustrate, the models may correctly tag libro as a noun, but incorrectly lemmatize it as if it were the verb librar.

These problems emphasize the need for training models on a more diverse and comprehensive lexicon. Such an approach would enable them to better handle the complexity and variation inherent in the language, and improve the accuracy of both PoS tagging and lemmatization tasks.

Finally, the models delivered the least effective performance on the PUD dataset, with accuracies hovering around 0.78. The GSD dataset showed mixed results, with accuracies fluctuating between 0.89 and 0.90. Upon a more detailed analysis of PUD and GSD treebanks, it was found that lemma annotation and PoS tagging showed signs of automation, resulting in lemmas or tags that were not entirely correct. Thus, they warrant future scrutiny and potential correction and harmonization with AnCora. In summary, these results illustrate the strengths and weaknesses of the evaluated models when processing different forms of Spanish, with the highest accuracies achieved on written standard Spanish datasets and a decrease in performance on phonologically transcribed spoken Spanish.

4.2 PoS tagging

Upon examining the UPOS tagging results across the pre-trained models, it is clear that the performances exhibit a similar pattern to those observed in the lemmatization task. The models exhibited their best performance on the AnCora dataset. Here, the accuracy ranged from 0.98 to 0.99, reflecting an extremely high level of accuracy in UPOS tagging, in full agreement with what is reported by the libraries and summarized in Table 1.

The performance on the COSER-PoS and COSER-PoS_o datasets was slightly lower but still robust. For these datasets, the accuracy fluctuated between 0.93 and 0.96, showing that the models are still quite effective at UPOS tagging on spoken Spanish data. It is important to note that the orthographic version of the COSER-PoS dataset, COSER-PoS_o, showed slightly better results than the phonologically transcribed version. This indicates, once again, that the spelling does have some influence on the performance of the taggers. The models delivered the least effective performance on the PUD dataset, with accuracy hovering around 0.91. The GSD dataset showed mixed results, with accuracy between 0.91 and 0.92.

As previously mentioned in Sect. 3.2, our initial investigation found that the UPOS tagging accuracy for spoken Spanish, using the spaCy, Stanza NLP, and UDPipe libraries, ranged from 0.90 to 0.93. The Transformer-based spaCy model and Stanza NLP stood out as the most effective performers, achieving the highest accuracy rates. Moreover, there was no significant disparity observed across the various regional variations analyzed.

In contrast, the current study observed a noticeable improvement in the UPOS tagging accuracy for the COSER-PoS dataset, with scores ranging from 0.93 to 0.96 across most models. This three-percentage point improvement is a significant uptick. It suggests that the additional harmonization of COSER-PoS with UD guidelines contributed positively to the models’ UPOS tagging performance.

Analyzing the individual performances of the models, we can see some interesting patterns. The models that outperformed others in terms of UPOS tagging accuracy across all datasets are Stanza NLP and the Transformer-based spaCy model. On the other hand, spaCy sm, spaCy md, spaCy lg, and UDPipe 1.2.0 showed relatively lower performance in UPOS tagging across all datasets, with accuracy generally below 0.94 when dealing with spoken Spanish.

Table 5 provides a comprehensive comparison of the F1 scores per UPOS tag for two of the best Spanish pre-trained models, Stanza NLP and the Transformer-based spaCy model, against COSER-PoS and COSER-PoS_o. We present the F1 score as it considers both precision and recall to provide a balanced measure of a model’s performance.
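Per-tag precision, recall, and F1 can be obtained directly from the aligned gold and predicted UPOS sequences, for instance with scikit-learn; the label sequences below are placeholders standing in for the alignment produced in the previous sketch:

```python
# Per-UPOS-tag precision, recall, and F1 from aligned label sequences
# (placeholder data; in practice the sequences come from the token-level
# alignment used in the accuracy sketch above).
from sklearn.metrics import classification_report

gold_upos = ["NOUN", "AUX", "VERB", "INTJ", "SCONJ", "ADJ"]
pred_upos = ["NOUN", "VERB", "VERB", "INTJ", "CCONJ", "ADJ"]

print(classification_report(gold_upos, pred_upos, zero_division=0))
```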

Table 5 Best pre-trained models F1 scores per UPOS tag

In comparison to the first study by Bonilla et al (2022), where the dataset was not as harmonized with the UD guidelines, there are some noteworthy findings.

The performance of both models on the AUX tag has seen a marked improvement, with Stanza NLP achieving an F1 score of 0.84 and the Transformer-based spaCy model reaching 0.82 and 0.83 on COSER-PoS and COSER-PoS_o respectively. This signifies considerable progress since the first study, where the highest score for AUX was 0.75. This advancement could potentially be attributed to the alignment of the UD guidelines for auxiliaries.

The INTJ tag, another problematic category identified in the initial investigation, has also seen significant enhancements in performance. Stanza NLP now achieves an F1 score of 0.78 and 0.81 on COSER-PoS and COSER-PoS_o respectively, while the Transformer-based spaCy model attains scores of 0.83 and 0.85, a considerable leap from the initial F1 score of 0.53. A noteworthy shift can also be observed in the performance of the models on the PROPN tag. While the Transformer-based spaCy model’s performance has slightly declined compared to the first investigation, Stanza NLP now achieves a strong F1 score of 0.9 and 0.91 on COSER-PoS and COSER-PoS_o respectively, indicating improved proficiency in handling proper nouns.

The SCONJ tag maintains a consistent performance between the two studies, with F1 scores hovering around 0.89 to 0.91 for both models across both datasets. This suggests that the models continue to grapple with the challenges presented by subordinate conjunctions, likely due to their inherent ambiguity.

In comparing the performance of the Transformer-based spaCy model and Stanza NLP across COSER-PoS and COSER-PoS_o datasets, it’s interesting to note the role of orthographic transcription. In several instances, both models demonstrate slightly enhanced F1 scores on COSER-PoS_o compared to COSER-PoS. For example, the scores for the INTJ, ADJ, and PROPN tags for both models are higher on COSER-PoS_o. These subtle improvements demonstrate once again that standard orthographic transcription contributes to better model performance. However, it’s also essential to consider that these improvements are incremental, and the overall performance remains consistent across both datasets, suggesting the robustness of these models in dealing with variations in orthographic transcription.

Analyzing the differences in the performance of Stanza NLP and the Transformer-based spaCy model reveals the potential impact of different neural network architectures on the models’ capacity to classify specific categories. For instance, the Transformer-based spaCy model outperforms Stanza NLP on the ADJ tag, while the latter demonstrates superior performance on the PROPN tag. These variations might stem from the inherent strengths and weaknesses of the models’ respective architectures, and how they interact with the specific characteristics of different UPOS tags. It suggests that certain architectures may be more suited to tagging tasks, or that they may respond differently to the challenges presented by different grammatical categories. However, this hypothesis warrants further investigation and extends beyond the scope of this current paper. Future research could aim to elucidate this fascinating interplay between neural network architecture and UPOS tagging performance.

4.3 Morphology

Regarding morphology, for the AnCora dataset, all models show high FEATS accuracy ranging from 0.97 to 0.99. On the other hand, the PUD dataset, which is a translation of English news texts, consistently yields the lowest FEATS accuracy across all models, with scores ranging from 0.66 to 0.69. This could be attributed to the syntactic and morphological disparities between English and Spanish, leading to complex or ambiguous morphological structures in the translated text.

The FEATS accuracy for the COSER-PoS and COSER-PoS_o datasets exhibits a similar trend across all models. Specifically, the Transformer-based spaCy model and Stanza NLP present slightly higher FEATS accuracy for COSER-PoS_o (0.92) compared to COSER-PoS (0.91 and 0.90 respectively). Nonetheless, the FEATS accuracy for these two datasets is generally lower than AnCora, which could be attributed to the unique characteristics of spoken Spanish in the COSER datasets, such as the presence of diminutives and specific verb conjugations, not typically found in the news domain.

While a complete analysis of each FEAT would extend beyond the scope of this work, it is worth highlighting specific areas where both models exhibited challenges. One such area is the disambiguation of pronouns, specifically those with complex morphological features that carry multiple functional roles in a sentence.

Consider the pronoun se, which serves as an accusative-dative case marker, indicates the third person, and is also reflexive. This pronoun is notorious for its multiplicity of functions in Spanish, as it can be used reflexively, reciprocally, impersonally, passively, and in various other pronominal verb constructions. Accurately predicting its role in a sentence requires deep syntactic and semantic understanding, which the models might be struggling with. Similarly, the pronoun nos, in the accusative case, denotes the first person plural. Finally, os, which also appears in dialectal uses, denotes the second person plural in the dative case and can likewise introduce ambiguity. These pronouns can function as direct or indirect objects, or form part of pronominal verbs, adding a layer of complexity to the sentence structure.

Moreover, the unique features used in the COSER datasets also impact the models’ performance. The Degree feature, which is used to mark diminutive forms, introduces additional morphological variation that the models do not recognize, as it was not present in the AnCora data on which they were trained. Given the frequent use of diminutives in spoken Spanish and their morphological variability, the models are not prepared to accurately predict this feature, thereby affecting their FEATS accuracy.

In conclusion, the disparity in FEATS accuracy between the AnCora and COSER-PoS datasets is potentially due to the distinctive linguistic nuances of spoken Spanish, intricacies of pronoun disambiguation, and the employment of certain morphological features atypical in written language. The multifaceted usage of pronouns such as se, nos, and os, each carrying complex morphological characteristics and playing multiple functional roles, poses substantial challenges for the models. In addition, these models encounter difficulties with accurately identifying and tagging verb participles that concurrently serve as adjectives, leading to potential inaccuracies in FEATS prediction. The COSER-PoS datasets introduce the Degree feature, marking diminutive forms and thereby increasing morphological diversity. As these models lack prior exposure to such variations during their AnCora training phase, they struggle to accurately predict these features, hence affecting their overall FEATS accuracy.

5 Fine-tuning a lemmatizer, PoS tagger, and morphologizer for spoken Spanish

5.1 Training setup

For this task, we selected the spaCy training pipeline with GPU transformers, driven by its ease of use and by its performance, equivalent to that of Stanza NLP in our preceding evaluations. The default settings, available in the spaCy training configuration, were adopted and are detailed below.

The configuration settings outline the components of the NLP pipeline, encompassing the transformer, the tagger, the morphologizer, and the trainable lemmatizer. The batch size is set at 128. The transformer model employs the BETO (Cañete et al 2023) transformer architecture with fast tokenization. BETO is a BERT model that has been trained on a substantial Spanish corpus, which includes Wikipedia and the Spanish segments of diverse corpora such as ParaCrawl, EUBookshop, and OpenSubtitles, among others.

The transformer model also integrates a span getter, applying a window of 128 tokens and a stride of 96. The tagger, morphologizer, and lemmatizer components utilize a Tagger architecture (spacy.Tagger.v2) with a TransformerListener for the tok2vec model. The pooling method for tok2vec is reduce_mean.

In the training phase, the gradient accumulation is fixed at 3. The optimizer deploys the Adam algorithm, with a learning rate schedule defined by a linear warm-up for the initial 250 steps, succeeded by a linear decay spanning 20,000 total steps. The initial learning rate is set at 5e-5. The batcher strategy adheres to batch_by_padded, with a size of 2000 and a buffer of 256, discarding any oversize batches.
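A minimal sketch of how such a run can be launched is given below; it assumes that the CONLL-U splits are already prepared, that config.cfg holds the default GPU transformer settings just described, and that BETO is pulled from the Hugging Face hub under its usual identifier (all paths are placeholders, not the authors’ exact scripts):

```python
# Illustrative launch of the fine-tuning run: the CONLL-U splits are converted
# to spaCy's binary format and the transformer pipeline is trained with the
# default settings described above.
import pathlib
import subprocess

pathlib.Path("corpus").mkdir(exist_ok=True)

# Convert the training and development splits to the .spacy format
for split in ("train", "dev"):
    subprocess.run(
        ["python", "-m", "spacy", "convert",
         f"coser_pos_{split}.conllu", "corpus/", "--converter", "conllu"],
        check=True,
    )

# Fine-tune on GPU, pointing the transformer at BETO (assumed identifier)
subprocess.run(
    ["python", "-m", "spacy", "train", "config.cfg",
     "--output", "models/coser_pos_trf",
     "--paths.train", "corpus/coser_pos_train.spacy",
     "--paths.dev", "corpus/coser_pos_dev.spacy",
     "--components.transformer.model.name", "dccuchile/bert-base-spanish-wwm-cased",
     "--gpu-id", "0"],
    check=True,
)
```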

In general, this configuration represents a pipeline specifically tailored to the task of processing spoken Spanish. It combines leading architectures with sensible default parameters, aimed at achieving strong performance on this distinctive NLP task.

5.2 Datasets

For our training setup, we employed various datasets based on their respective training and development divisions (see Table 3). The training process in spaCy operates primarily with these two subsets of the data, derived from treebanks. For a comparative analysis of the training performance, we once again employed the Spanish treebanks available in version 2.12 of UD (Zeman et al 2023), which include AnCora and GSD. However, we did not use PUD as it lacks a training dataset. For our experiments, we utilized 80% of the COSER-PoS dataset as a training set, and the remaining 10%, not used in the previous test set, served as the development set. A similar process was followed with the orthographically-transcribed version of the COSER-PoS dataset, referred to as COSER-PoS_o.

Moving beyond individual datasets, we further explored the impact of dataset concatenation on the model’s performance. We maintained the correspondence between the subsets when concatenating (train with train, dev with dev), ensuring the consistency of the data structure. Our experiments involved the following combinations: (a) COSER-PoS and its orthographically-transcribed counterpart, COSER-PoS_o; (b) AnCora concatenated with COSER-PoS, and separately with COSER-PoS_o; (c) AnCora combined with both COSER-PoS and COSER-PoS_o.
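At the file level the concatenation itself is straightforward, pairing train with train and dev with dev, as in the sketch below (file names are placeholders):

```python
# Sketch of the treebank concatenation, keeping the train/dev correspondence.
# File names are placeholders for the UD 2.12 and COSER-PoS splits.
from pathlib import Path

combinations = {
    "coser_coser_o": ["coser_pos_{split}.conllu", "coser_pos_o_{split}.conllu"],
    "ancora_coser":  ["es_ancora-ud-{split}.conllu", "coser_pos_{split}.conllu"],
    "ancora_both":   ["es_ancora-ud-{split}.conllu", "coser_pos_{split}.conllu",
                      "coser_pos_o_{split}.conllu"],
}

for name, patterns in combinations.items():
    for split in ("train", "dev"):
        text = "".join(Path(p.format(split=split)).read_text(encoding="utf-8")
                       for p in patterns)
        Path(f"{name}_{split}.conllu").write_text(text, encoding="utf-8")
```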

It is important to note that we did not include the GSD dataset in these combinations. As demonstrated in the previous section, the GSD dataset’s performance in terms of accuracy was lower compared to other datasets. Preliminary tests also suggested that including GSD could potentially lower the overall accuracy. Consequently, we decided to omit it from these concatenated datasets. The performance results of the training using each treebank and the described combinations will be discussed in the subsequent sections.

5.3 Fine-tuning results

The results presented in Table 6 correspond to the best model obtained from the spaCy training pipeline for each dataset. During training, the pipeline automatically evaluates the model on the development (dev) set at regular intervals, and the model with the highest overall score is selected as the best one. The scores for lemmatization (LEMMA), Universal POS tags (UPOS), and morphological features (FEATS) are computed from this evaluation.
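For reference, spaCy writes two checkpoints per run, model-last and model-best, the latter being the one with the highest overall dev score. Its dev-set metrics can be read back from the pipeline metadata, as in the following sketch; the output path is an assumption.

```python
# Sketch: reading the dev-set scores of the selected checkpoint from its
# meta.json. The output path is an illustrative assumption.
import json
from pathlib import Path

meta = json.loads(Path("output/model-best/meta.json").read_text(encoding="utf-8"))
performance = meta.get("performance", {})
print({k: performance.get(k) for k in ("lemma_acc", "pos_acc", "morph_acc")})
```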

Table 6 Training spaCy transformers pipeline results

Looking at the individual datasets, the best models trained on COSER-PoS_o and COSER-PoS outperformed the others, with LEMMA scores of 97.36% and 97.13% and UPOS scores of 97.19% and 97.12%, respectively. Interestingly, the COSER models outperformed AnCora, albeit by a small margin.

The model trained on the GSD treebank showed the lowest performance across all metrics, with LEMMA, UPOS, and FEATS scores of 95.98%, 94.92%, and 94.97%, respectively, considerably lower than those of the other models.

Among the concatenated datasets, the combination of COSER-PoS and COSER-PoS_o yielded the highest LEMMA and UPOS scores, at 97.35% and 97.11%, respectively. In terms of FEATS, however, the model trained on AnCora combined with COSER-PoS and the model trained on AnCora alone performed slightly better, with scores of 96.25% and 96.35%, respectively. The model trained on all three datasets (AnCora, COSER-PoS, and COSER-PoS_o) also performed well, with a LEMMA score of 97.16%, a UPOS score of 96.9%, and a FEATS score of 96.24%.

While the differences in scores across datasets are small, they are still informative. The gap between the highest and lowest LEMMA scores is just 1.38 percentage points, and the corresponding gaps for UPOS and FEATS are 2.27 and 1.38 percentage points, respectively.

Furthermore, the GSD model's lower performance across all metrics reaffirms our decision to exclude it from the concatenated datasets: its LEMMA and UPOS scores were roughly 1.38 and 2.27 percentage points below those of the best-performing models for these metrics, and its FEATS score was about 1.38 percentage points behind.

These results suggest that the spaCy transformer pipeline can be fine-tuned effectively with appropriate dataset combinations, yielding high-performing models for processing spoken Spanish that integrate samples from different domains. This supports our approach and underlines the importance of careful dataset selection and combination for model training.

5.4 Re-trained models accuracy estimation

In line with the methodologies outlined in Sect. 3, we assessed the re-trained models on the test datasets of the UD 2.12 Spanish treebanks (Zeman et al 2023), along with the COSER-PoS and COSER-PoS_o test datasets. The goal of this evaluation was to measure the accuracy of these models across different treebanks and against our benchmark.
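The evaluation can be reproduced along the lines of the following sketch, which loads a re-trained pipeline and scores it against one of the held-out test sets using the metrics reported in Tables 7 and 8; the paths are illustrative assumptions.

```python
# Sketch of evaluating a re-trained pipeline on a held-out test set, reporting
# lemma, UPOS, and FEATS accuracy. Paths are illustrative assumptions.
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.load("output/model-best")
gold_docs = DocBin().from_disk("corpus/coser_pos-test.spacy").get_docs(nlp.vocab)

# Pair each gold-annotated doc with a fresh, unannotated copy of its text;
# nlp.evaluate() runs the pipeline on the copies and scores them against gold.
examples = [Example(nlp.make_doc(doc.text), doc) for doc in gold_docs]
scores = nlp.evaluate(examples)
print({k: scores[k] for k in ("lemma_acc", "pos_acc", "morph_acc")})
```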

Table 7 Re-trained models lemma, UPOS, and FEATS accuracy for Spanish test datasets

The results detailed in Table 7 show that models perform best when tested on the dataset they were trained on. For instance, the AnCora model, consistent with the spaCy and Stanza NLP results in Table 1, scores highest on the AnCora test set, achieving 0.99 in all metrics (LEMMA, UPOS, and FEATS). The same pattern holds for the COSER-PoS and COSER-PoS_o models, which achieve an accuracy of 0.98 in LEMMA and UPOS and 0.97 in FEATS on their respective test sets. This is consistent with the general observation that models perform best on data that mirrors their training data.

However, the main point of interest in this evaluation is not the models' performance on their own test sets but their performance on the other test datasets. Here, the COSER-PoS and COSER-PoS_o models display superior LEMMA, UPOS, and FEATS accuracy when tested on different datasets, demonstrating strong generalization abilities.

For instance, the LEMMA accuracy of the COSER-PoS model on the AnCora, GSD, and PUD test datasets is 0.93, 0.85, and 0.77, respectively. COSER-PoS_o shows a similar drop in LEMMA accuracy on these test sets, at 0.94, 0.85, and 0.77, respectively. Despite this decrease, these models still surpass the previous best UPOS tagging results for spoken Spanish, indicating their effectiveness.

Interestingly, a closer look at the COSER-PoS and COSER-PoS_o models shows that the COSER-PoS model handles both phonological and orthographic transcriptions well, maintaining an accuracy of 0.98 for LEMMA and UPOS and 0.97 for FEATS when tested on the COSER-PoS_o dataset. In contrast, the COSER-PoS_o model shows a slight decrease when tested on the COSER-PoS dataset, with accuracies of 0.97 for LEMMA, UPOS, and FEATS. This difference suggests that the COSER-PoS model may adapt better to different types of data, a finding with potential implications for future research and application.

Table 8 Re-trained concatenated models lemma, UPOS, and FEATS accuracy for Spanish test datasets

Turning to the results in Table 8, models trained on concatenated datasets show strong performance. This suggests that leveraging multiple datasets for training enhances the models' capabilities, allowing them to deliver consistently high accuracy across diverse testing scenarios, and underscores the value of dataset concatenation as a viable approach for improving model performance in this context.

For the spoken Spanish datasets, COSER-PoS and COSER-PoS_o, strong performance is observed across all metrics. The model trained on the combination of COSER-PoS and COSER-PoS_o reaches an accuracy of 0.98 for both LEMMA and UPOS and 0.97 for FEATS on both the COSER-PoS and COSER-PoS_o test datasets, in line with the high scores observed for the models trained on individual datasets.

Introducing the AnCora dataset into the mix does not compromise performance. The model trained on AnCora combined with COSER-PoS achieves high accuracy (0.98 for LEMMA and UPOS, and 0.97 for FEATS) on the COSER-PoS test dataset. Likewise, the model trained on AnCora combined with COSER-PoS_o achieves 0.97 across all metrics on the COSER-PoS dataset and slightly better results (0.98 for LEMMA and UPOS, and 0.97 for FEATS) on the COSER-PoS_o dataset.

A notable finding emerges when models are trained on all three datasets (AnCora, COSER-PoS, and COSER-PoS_o). These models show similar performance, achieving an accuracy of 0.98 for LEMMA and UPOS and 0.97 for FEATS on the COSER-PoS dataset, and an even higher LEMMA accuracy of 0.99 on the COSER-PoS_o dataset.

As discussed earlier in Sect. 2.2, the initial PoS tagging attempt for the Spanish subcorpus of C-ORAL-ROM using GRAMPAL achieved a precision of 95.3% (Moreno Sandoval and Guirao 2006). More recently, Rico-Sulayes et al (2017) applied two maximum-entropy-based PoS taggers to the Corpus del Habla de Baja California (CHBC), achieving accuracies of 96.8% and 97%, although their training was also based on AnCora.

Nevertheless, our re-trained models, particularly those using the domain-specific COSER-PoS and COSER-PoS_o datasets, outperform these previous attempts, establishing a new standard for UPOS tagging accuracy in spoken Spanish at 98%. These models also achieve a FEATS tagging accuracy of 97%.

This improved performance not only illustrates the benefits of fine-tuning models on domain-specific datasets but also highlights their potential as tools for NLP tasks involving spoken Spanish, such as lemmatization and morphological feature tagging, thereby setting a new state of the art in this field.

To close this section, we offer a detailed comparison of performance improvements by grammatical category in Table 9. We assess the two best-performing models, i.e., those with the highest UPOS accuracy: the first was trained on the combination of COSER-PoS and COSER-PoS_o, and the second additionally includes the AnCora dataset. These fine-tuned models are compared with the results of the earlier evaluations of the transformer-based spaCy model and the Stanza NLP model, both trained exclusively on the AnCora corpus (see Table 4).

Table 9 reports the F1 scores per UPOS tag for these two fine-tuned models. The left side of the table shows results when testing on COSER-PoS, and the right side when testing on COSER-PoS_o. The improvement across various categories is noteworthy, especially when compared with the scores of the spaCy and Stanza models shown in Table 4.
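Per-tag scores of this kind can be computed along the lines of the following sketch, which re-tags the gold tokenization with the fine-tuned pipeline and compares gold and predicted UPOS labels with scikit-learn; the paths are illustrative assumptions.

```python
# Sketch of computing per-tag F1 scores as in Table 9: the gold tokenization
# is re-tagged by the fine-tuned pipeline and gold vs. predicted UPOS labels
# are compared with scikit-learn. Paths are illustrative assumptions.
import spacy
from spacy.tokens import Doc, DocBin
from sklearn.metrics import classification_report

nlp = spacy.load("output/model-best")
gold_docs = DocBin().from_disk("corpus/coser_pos-test.spacy").get_docs(nlp.vocab)

gold_tags, pred_tags = [], []
for gold in gold_docs:
    # Feed the gold tokenization through the pipeline so tokens align 1:1.
    pred = nlp(Doc(nlp.vocab, words=[t.text for t in gold]))
    gold_tags.extend(t.pos_ for t in gold)
    pred_tags.extend(t.pos_ for t in pred)

print(classification_report(gold_tags, pred_tags, digits=2))  # per-UPOS P/R/F1
```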

Table 9 Best re-trained models F1 scores per UPOS tag

A significant improvement is observed for the X tag, which in our case covers incomplete and uncategorized words. The fine-tuned models performed notably better than their counterparts from the previous evaluation, with F1 scores improving from 0 to 0.61 and 0.77 when testing on COSER-PoS, and to 0.8 for both models when testing on COSER-PoS_o. The ADJ (adjective) category also improves, with F1 scores rising from 0.80 and 0.86 (spaCy trf and Stanza NLP, respectively) to 0.89 and 0.91 on COSER-PoS, and remaining constant at 0.89 for both models on COSER-PoS_o (Table 9).

The AUX category also improves, with F1 scores increasing from 0.84 and 0.82 to 0.97 and 0.96 for the models tested on COSER-PoS, and remaining at 0.97 for both models on COSER-PoS_o. INTJ shows a similar gain, with scores increasing from 0.78 and 0.83 to 0.97 for all models in Table 9. Other categories, such as PROPN, DET, VERB, ADV, NOUN, NUM, ADP, CCONJ, and PUNCT, also show consistently high scores across all models, highlighting the strong performance of our models on these tags.

Interestingly, the results suggest that there is no substantial difference in the models' performance between the phonological and orthographic transcriptions for most categories. The F1 scores for the majority of UPOS tags remain relatively stable, indicating that the models generalize well from both types of transcription.

However, some minor differences can be seen. For instance, the X tag performs slightly better when tagging COSER-PoS_o (0.8 for both models) than when tagging COSER-PoS (0.61 and 0.77). Conversely, for the SCONJ (subordinating conjunction) tag, the scores are marginally lower on COSER-PoS_o (0.94 and 0.95) than on COSER-PoS (0.95 and 0.96).

Notably, adding the AnCora dataset to the COSER-PoS and COSER-PoS_o training sets leaves performance essentially unchanged, suggesting that concatenation with AnCora does not harm the results and even slightly improves the scores in some categories. In conclusion, the fine-tuned models, particularly those trained on the combined COSER-PoS, COSER-PoS_o, and AnCora datasets, outperform the spaCy transformer-based and Stanza NLP models trained solely on AnCora across a wide range of UPOS tags.

6 Conclusions

The primary aim of this study was to rigorously evaluate the accuracy of state-of-the-art PoS taggers when applied to spoken Spanish. For this purpose, we created a gold-standard corpus, termed COSER-PoS, composed of 13,219 sentences, derived from 1,760 conversational turns, and encompassing 196,372 tokens. This corpus represents considerable geographic diversity, having been compiled from 16 distinct dialectal regions across Spain.

The findings show that models achieve their highest accuracy when tested on the dataset they were trained on. The COSER-PoS and COSER-PoS_o models nonetheless demonstrated superior performance on the spoken Spanish datasets, surpassing the previous state of the art in UPOS tagging for spoken Spanish with an accuracy of 98%, together with 98% in lemmatization and 97% in FEATS tagging.

At the level of individual categories, the largest gains were observed for the X tag, which in our case covers incomplete and uncategorized words: its F1 scores rose from 0 to 0.61 and 0.77 on COSER-PoS and to 0.8 on COSER-PoS_o. The ADJ category also improved, from 0.80 and 0.86 (spaCy trf and Stanza NLP, respectively) to 0.89 and 0.91 on COSER-PoS, remaining at 0.89 on COSER-PoS_o. At the same time, performance differed little between the phonological and orthographic transcriptions for most categories, indicating that the models generalize well from both types of transcription.

In conclusion, this study has demonstrated that the accuracy of PoS tagging for spoken Spanish can be improved by fine-tuning models on datasets composed of transcriptions of spoken Spanish. The fine-tuned models, particularly those trained on the combined COSER-PoS, COSER-PoS_o, and AnCora datasets, outperform the spaCy transformer-based and Stanza NLP models trained solely on AnCora across a wide range of UPOS tags. This research sets a new standard in UPOS tagging accuracy for spoken Spanish and provides valuable direction for future work in this field.