Introduction

Part-of-speech tagging (POS tagging henceforth) is the process of assigning a sequence of tags to a sequence of words in order to mark word classes. Traditional methodologies involve rule-based and statistical POS taggers and, more recently, machine learning algorithms. Rule-based POS taggers (Brill, 1992, 1994; Sadredini et al., 2018) rely on a set of deterministic transformation rules, such as the association of a word with a POS. The ruleset is often coupled with a set of constraints the tagger must follow, e.g., an article cannot be followed by another article. Rule-based approaches are especially suitable for building multilingual and non-English taggers (Garg et al., 2012; Megyesi, 1998; Rashel et al., 2014), which often cannot benefit from annotated corpora: any additional language requires a specific set of rules, yet neither data nor training is needed. While powerful enough to achieve high accuracy on benchmark datasets, rule-based taggers show inherent limitations in uncontrolled experimental environments, due to their lack of contextual understanding and their rigidity towards unexpected cases. Statistical POS taggers work by finding the sequence of POS tags that most likely fits the input sentences by means of hidden Markov models (Brants, 2000; Carlberger & Kann, 1999; Cutting et al., 1992) or entropy maximisation (Ratnaparkhi, 1996; Toutanova & Manning, 2000). POS taggers built upon machine learning algorithms, such as SVMs (Giménez & Màrquez, 2004) and neural networks (Schmid, 1994), are very powerful; however, many machine learning algorithms are not interpretable, which means that it is not possible to understand what motivated the POS tagger’s choices.

The usefulness of POS tagging lies in the automation, and thus the speeding up, of searches for specific word classes (e.g. nouns, pronouns, verbs, adverbs) and combinations of them (e.g. a noun followed by a preposition). In order to offer automatic POS search through the 32 film dialogues collected for the anglophone section of the Pavia Corpus of Film Dialogue (PCFD henceforth), we chose to conduct a pilot study on the dialogues of the film Thelma & Louise (Ridley Scott, 1991), which at the time was the latest film to be added to the corpus. The POS tagger CLAWS4 was selected among the available software (e.g. Penn Treebank, Stanford POS tagger, CLL-Tagger) since it was freely accessible through its online interface and shared with other reference corpora of English such as the BNC and the corpora available through english-corpora.org (e.g. COCA and COHA), and therefore convenient for the sake of comparability in future studies of film dialogue and spoken language.

Since the PCFD is a corpus of orthographically transcribed film dialogues, we expected the tagging of the texts to be somewhat problematic. POS tagging relies exclusively on word-class assignment based on morpho-syntactic criteria, whereas the nature of film dialogue requires the pragmatic dimension to be taken into account. This is because film dialogue closely resembles spoken language (Bednarek, 2010; Forchini, 2012; Quaglio, 2009; Valdeón, 2009), as it represents a kind of language that is “written to be spoken as if not written” (Gregory, 1967: 191-192). Film dialogue is therefore dotted with linguistic features with a predominantly pragmatic function, such as vocatives, general extenders, thanking and apologising routines, greetings and leave-takings, interjections, etc. (cf. Bednarek, 2010; Bonsignori et al., 2011; Formentelli, 2014; Freddi, 2011, 2012; Forchini, 2012; Pavesi, 2009; Quaglio, 2009; Rodríguez Martín, 2010; Zanotti, 2014). Furthermore, the syntax of spoken language is often characterised by ellipsis, hesitations, false starts and other disfluency phenomena related to on-line linguistic production in real time (Bortfeld et al., 2001; Lickley, 2001; Shriberg, 1994), which are in turn imitated to some extent by audiovisual dialogue with the aim of conveying a sense of naturalness and spontaneity.

As a consequence of the peculiarities of spoken language and the reproduction of its distinctive features in audiovisual dialogue, automatic POS tagging becomes more prone to error the further the language the tagger is presented with drifts away from written language (see 3 below). POS taggers are mainly trained on written language, given its wider availability compared to suitable training corpora of orthographically transcribed spoken language (Nivre et al., 1996). This affects the probabilistic decision-making of taggers (see 2.1 below), since they are driven by probability estimates derived from written language which may not be representative of spoken language (Nivre et al., 1996). Finally, since POS taggers are not provided with rules that allow them to discern between non-pragmatic (thus propositional) and pragmatic uses of certain expressions, an issue is bound to arise whenever essentially pragmatic items are treated like any other expression contributing propositional content. For example, a POS tagger will not recognise a form of address in the words honey or darling, which will be tagged as a noun and an adjective respectively; this, however, does not make it possible to distinguish between the occurrences in which honey denotes the fluid produced by bees and those in which it is used as an endearment term, often as a vocative, thus pragmatically. Similarly, the tagger will guess which word class words such as hi, hey, yes and no should be assigned to on the grounds of probabilistic rules (probably adverbs, see 2.1 below), ignoring their essentially pragmatic function as greetings and response forms, which cannot fit any of the traditional word classes.

In this paper we will stress that, when tagging transcriptions of spoken language, it is impossible to keep the levels of grammatical and pragmatic analysis separate. The two are continuously integrated (Traugott & Trousdale, 2013), and in light of recent studies on pragmatic marking stressing the difference between expressions contributing propositional meaning and those performing pragmatic functions (cf. Aijmer & Simon-Vandenbergen, 2011; Aijmer, 2013; Brinton, 2008; Fedriani & Sansò, 2017; Waltereit & Detges, 2007), it would not be advisable to have taggers which treat the propositional and pragmatic contributions of an expression as the same thing.

The aim and principle underlying the building of our post-processing Python script is the belief that being able to automatically distinguish between mainly-propositional and mainly-pragmatic expressions is an essential resource in spoken text analysis. To the best of our knowledge, this represents one of the few attempts to integrate the automatic tagging of pragmatic categories into corpus annotation (cf. Zago, 2016), considering that pragmatic annotation of corpora generally goes no further than tagging speech acts, tone movements, attention management and speech events (cf. the SPICE corpus, MICASE, project MAVIR).

The article is structured as follows: section "Research Questions and Methodology" outlines the research questions and methodology and provides a description of the tagging software CLAWS4; section "CLAWS4 Applied to Film Dialogue" deals with the application of CLAWS4 to the dialogues of Thelma & Louise (Ridley Scott, 1991); section "Discussion of Tagging Errors and the Definition of New Rules" discusses the tagging errors in detail and defines the new tagging rules fed into the Python script for data post-processing; section "Accuracy Rates of the Post-processing" briefly displays the improvement in the tagging and the related accuracy rates; finally, conclusions are drawn in section "Conclusion".

Research Questions and Methodology

The pilot study on the dialogues of Thelma & Louise (Ridley Scott, 1991) aims to verify how well automatic POS tagging can perform on film dialogue, with a view to tagging the entire PCFD. As already discussed in 1, we expect some tagging errors to occur, due both to the probabilistic methods adopted by the tagger in assigning the word-class tags and to the specificities of film dialogue, which features a considerable number of predominantly pragmatic expressions that are not appropriately captured by traditional word classes. The research questions we will answer are the following:

  1. What is the accuracy rate of automatic CLAWS POS-tagging on film dialogue?

  2. What are the most frequent tagging errors?

  3. How does the methodology of tagging film dialogue and/or the tagging procedures need to change in order to obtain satisfactory accuracy rates?

  4. What kind of improvement does integrating the morpho-syntactic and pragmatic levels of analysis bring to automatic tagging of spoken language?

We will answer the first and second questions by providing a qualitative analysis of the output produced by the POS tagger CLAWS4 (see 2.1 below) on the dialogue of Thelma & Louise (Ridley Scott, 1991), followed by a calculation of the error rate and the estimated accuracy rate (see 5 below). The most frequent tagging errors will be highlighted and divided into two categories: parsing errors and tag-assignment errors (see 2.2). The answer to the third research question relies on the identification of tag-assignment errors and addresses the need to widen the number and type of categories of analysis (and therefore of tags) available, in order to integrate pragmatic categories such as pragmatic markers, address forms, response forms, etc., into the automatic tagging. The fourth question will be answered by analysing how a Python script that post-processes the output of CLAWS4 performs on film dialogue.

The Python script for the post-processing of the data output provided by CLAWS4 was designed to correct some of the consistent errors produced by the POS tagger, in order to improve the accuracy of the results. The script was fed new rules so as to both correct parsing errors and implement the tagging of the categories of pragmatic markers, response forms, conversational formulas and general extenders. It also applies a rule for the disambiguation of subject you from object you (see 4). The definition of the new tagging rules relies both on the literature on the topics touched upon by each tagging error (e.g. conversational formulas, address forms, etc.) and on the graphic cues available in the orthographic transcriptions of the film dialogues in the PCFD. The possibility of automatically tagging pragmatic categories is directly related to the method with which the dialogues in the PCFD were transcribed. Since punctuation is used in the transcription conventions as a way to represent pauses in speech, it was possible to instruct the Python script to rely on it in order to identify sentence-peripheral (as in (1) below) as well as intra-sentential (as in (2) below) pragmatic expressions such as address forms, response forms, interjections, etc.

(1)

THELMA: Honey, you better hurry up!

DARRYL: God damn it, Thelma.

[…]

THELMA: Okay, I will too, then.

(2)

LOUISE: No, Thelma, we don’t need the lantern.

[…]

WAITRESS: Oh, hell, I told you fifty times, yeah, I could identify them.

(Thelma & Louise, Ridley Scott 1991)

After applying the Python script to the output of CLAWS4, a manual analysis was carried out on the new post-processed output in order to calculate the accuracy rates of the automatic tagging of film dialogue. This analysis also shows in what ways the integration of morpho-syntax and pragmatics improves the automatic tagging of film dialogue and, possibly, of other instances of spoken language accessible in written form.

CLAWS4

CLAWS (Constituent-Likelihood Automatic Word-Tagging System) is a part-of-speech tagger developed by the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University (UK). Its first version was developed in the 1980s, and its latest version, CLAWS4, was used to tag the British National Corpus (BNC) as well as the corpora accessible through Mark Davies’s BYU interface. The current tagset used by CLAWS is C8, which features over 160 tags; the present study, however, uses C7 as made available through the web interface of CLAWS (see Appendix A below and http://ucrel.lancs.ac.uk/claws8tags.pdf for the C8 tagset).

The tagging system operated by CLAWS consists of five different stages applied successively (Garside, 1987: 33), some of which use lexical and morphological sample lists in order to help the software assign the tags:

  (1) Pre-editing: the text is prepared for the tagging process. This stage is carried out partly manually, partly automatically, and involves the verticalisation of the text, so as to have a separate line for each word or punctuation mark in the corpus.

  (2) Tag assignment: each word in the text is assigned one or more tags regardless of the context of occurrence. This stage is performed by a dedicated piece of software called WORDTAG. In order to assign a tag, the software uses a lexicon of ca. 7200 words and indicates up to six possible tags for a single word, listed in decreasing likelihood. At this stage, the words are assigned all the tags listed in their entry of the lexicon. WORDTAG also uses a suffix list consisting of about 720 word endings with their associated tags. When the software fails to identify a word in the lexicon, the suffix list is searched for the longest word ending matching the word that needs to be tagged. For example, if the word does not match the -able ending (which would be tagged as an adjective), the -ble ending (thus a noun or verb) becomes the most probable; if the word does not match the -ble ending either, the -le ending (thus a noun or a verb) is tested as a possible interpretation and tag assignment (a minimal sketch of this fallback appears after this list). Exceptions to suffix assignment are accounted for in the lexicon (see cable and enable in the case of the -able suffix). The tag-assignment process based on the identification of the suffix is typically employed for 7-12% of the words in the text and is claimed to be carried out successfully for most of them. Hyphenated and prefixed words are assigned the tag of the remaining word once the prefix (e.g. co-, hyper-, mis-, over-, etc.) is detached from the word.

  (3) Idiom-tagging: the dedicated program IDIOMTAG identifies word patterns and narrows down the number of tags that can be assigned to a word found in a particular context. The software searches a list of about 150 phrases and modifies the tags accordingly. IDIOMTAG improves the tagging of separate orthographic items which function syntactically as a single unit (e.g. as well as) by performing what are labelled “ditto tags”, whereby a single grammatical tag is assigned to all the items in the identified unit. In the analysis of as well as, for example, IDIOMTAG would assign the tag CC (i.e. conjunction) to each of the three elements the expression consists of.

  (4) Tag disambiguation: the dedicated program CHAINPROBS deals with the cases in which more than one tag was assigned to a word (roughly 35% of the words in the corpus). By considering the context of occurrence, CHAINPROBS calculates the probability of each tag in the specific context and chooses the one with the highest probability. The probability information is derived from a sample of the Brown Corpus which was previously manually tagged and checked.

  (5) Post-editing: this is a manual stage in which the tags assigned by the software are checked and corrected if necessary.
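To make the suffix-based fallback of stage (2) concrete, the following Python fragment gives a minimal sketch of the mechanism described above. The lexicon and suffix list are tiny illustrative stand-ins for the actual CLAWS resources, and the function is our reconstruction rather than WORDTAG's code:

# Minimal sketch of WORDTAG's lexicon-plus-suffix fallback.
# LEXICON and SUFFIXES are tiny illustrative stand-ins.
LEXICON = {"cable": ["NN1"], "enable": ["VV0"]}  # exceptions to suffix rules
SUFFIXES = [
    ("able", ["JJ"]),          # -able: adjective
    ("ble", ["NN1", "VV0"]),   # -ble: noun or verb
    ("le", ["NN1", "VV0"]),    # -le: noun or verb
]

def candidate_tags(word):
    """Return candidate tags in decreasing likelihood."""
    if word in LEXICON:                  # the lexicon takes priority
        return LEXICON[word]
    for suffix, tags in SUFFIXES:        # longest matching ending first
        if word.endswith(suffix):
            return tags
    return ["NN1"]                       # default guess for unknown words

print(candidate_tags("workable"))  # ['JJ'] via the -able ending
print(candidate_tags("cable"))     # ['NN1'] via the lexicon exception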

The level of accuracy achieved by CLAWS ranges between 96% and 97% according to UCREL (http://ucrel.lancs.ac.uk/claws/). This percentage may vary according to the text type. The error rate based on a 50,000-word test sample corresponds to 1.14% on average in written texts and 1.17% in spoken texts. CLAWS has been shown to be more prone to error when tagging some specific word classes, namely adjectives, interrogative pronouns (e.g. when, why, how), proper nouns, possessives, base forms of lexical verbs, and past participles (see Table 1 in http://ucrel.lancs.ac.uk/bnc2/bnc2error.htm for a detailed account of the error percentages).


UCREL has recently devised a post-processor for CLAWS which uses the knowledge about the tagger’s most frequent errors to improve the accuracy of the resulting tagging. The post-processor, however, is available neither with the online tagger nor in the licensed versions of CLAWS4.

CLAWS4 Applied to Film Dialogue

In the present pilot study on tagging the Pavia Corpus of Film Dialogue, the online version of CLAWS4 was used to tag the dialogue of the film Thelma & Louise (Ridley Scott, 1991). The online tagger allows users to choose the tagset and the layout of the output. The data were converted to plain-text format (.txt), the metadata about characters’ names and scene details were excluded in order to avoid parsing errors, and the C7 tagset (see Appendix A below) and the horizontal layout were chosen, so that the tagged output looked as in the excerpt below:

Excerpt 1

Sonny_NP1 ,_, you_PPY wan_VV0 na_TO hit_VVI me_PPIO1 up_RP with_IW some_DD

syrup_NN1 ?_?

Thank_VV0 you_PPY ,_, darling_NN1 ._.

Maam_VV0 ?_?

Yeah_UH ,_, Ill_NP1 be_VBI right_JJ with_IW you_PPY ._.

Here_RL you_PPY go_VV0 ,_, you_PPY got_VVD the_AT sausage_NN1 ,_, pancakes_NN2

for_IF you_PPY ,_, and_CC theres_NN2 your_APPGE syrup_NN1 ._.

He_PPHS1 s_VBZ over_RP there_RL ._.

Uh-huh_UH ._.

As can be observed, the output contains the bare text of the dialogue without any information about the characters, conversational turns or other metalinguistic clues generally included in the PCFD data. The assigned POS tag is attached to the word it refers to, following the underscore ‘_’. The tags and their corresponding parts of speech are listed in full in Appendix A.
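For readers who want to manipulate this format, the fragment below shows how a horizontal CLAWS line can be split back into (word, tag) pairs; this is a convenience sketch of ours, not part of the tagger or of the post-processing script:

def parse_claws_line(line):
    """Split a horizontal CLAWS line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("_")  # the tag follows the last '_'
        pairs.append((word, tag))
    return pairs

print(parse_claws_line("He_PPHS1 s_VBZ over_RP there_RL ._."))
# [('He', 'PPHS1'), ('s', 'VBZ'), ('over', 'RP'), ('there', 'RL'), ('.', '.')]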

The output created by CLAWS4 on the dialogue of Thelma & Louise (Ridley Scott, 1991) was manually checked by two researchers in order to verify whether the assigned tags were appropriate, and thus to calculate a percentage of accuracy as well as an error margin for the performance of the POS tagger on film dialogue. This first qualitative analysis of the output automatically produced by CLAWS4 revealed a number of errors which occurred consistently throughout the POS-tagged text. These lowered the accuracy of the software to roughly 89.3% (an error margin of 10.73%; see below). Two main categories of errors can be distinguished: first, parsing errors due to the text format or to problems of context-bound word-class assignment; second, functional errors in which the POS tags assigned to expressions whose function is mainly pragmatic do not correspond to that function (e.g. the discourse marker well is tagged as an adverb). The most frequent errors are reported below, together with examples extracted from the dialogue of Thelma & Louise (Ridley Scott, 1991):

  • Forms of address such as ma’am are tagged variously as VV0 (i.e. a verb in its base form), as a singular noun (NN1), as happens with darling in the excerpt below, or as an adjective (JJ), as happens with hon in the excerpt below. Although tagging these forms as nouns represents the most sensible option, address forms do not work as prototypical nouns, given their vocative intersubjective function.

  • Excerpt 1:

  • Thank_VV0 you_PPY, darling_NN1.

  • Maam_VV0 ?_?

  • Yeah_UH, Ill_NP1 be_VBI right_JJ with_IW you_PPY ._.

  • […]

  • Hon_JJ ?_?

As will be shown in 4 below, the issue of assigning a wrong word-class tag to address forms will be solved by creating a new tagging category dedicated to address forms. This category groups together expressions with an addressing function, independently of the word classes identified for the items making up the address form.

  • The tagger fails to recognise contracted modal or auxiliary verbs attached to personal pronouns and nouns, as in I’ll in excerpt 1 above, which is tagged as a proper noun. This might be due to the fact that apostrophes, although present in the input, are not processed by the tagger. Other examples can be observed in excerpt 2 below, in which you’d is tagged as the base form of a verb, they’d as a singular noun, and Jimmy’s as a plural proper noun:

  • Excerpt 2

  • Youd_VV0 almost_RR think_VV0 theyd_NN1 want_VV0 to_TO forget_VVI about_II it_PPH1 for_IF the_AT weekend_NNT1.

  • […]

  • Wonder_VV0 if_CSW Jimmys_NP2 got_VVD back_RP ._.

  • Some adverbs are analysed as adjectives, as happens with right in excerpt 1 above.

  • Conversational formulas (e.g. please, thanks) and some pragmatic markers are analysed as adverbs (RR) or nouns (NN), as in excerpt 3 below:

  • Excerpt 3

  • Regular_JJ ,_, please_RR.

  • […]

  • Well_RR ,_, wait_VV0 now_RT.

  • […]

  • Uhm_NN1 ,_, no_UH ,_, thanks_NN2 ._.

  • The genitive ‘s is not recognised and is therefore tagged as a plural -s, as in excerpt 4 below:

  • Excerpt 4

  • For_IF Christs_NP2 sake_NN1

  • Response forms (Biber et al., 1999) such as yes, no, okay, sure, yeah are tagged as interjections (UH) or adverbs (RR), as in excerpt 5 below:

  • Excerpt 5

  • Well_RR ,_, yes_UH ,_, I_PPIS1 will_VM ,_, operator_NN1 ._.

  • […]

  • No_UH ,_, you_PPY wont_JJ ._.

  • […]

  • Sure_RR ,_, Thelma_NP1 ._.

  • […]

  • Yeah_UH ,_, I_PPIS1 could_VM identify_VVI them_PPHO2

Nonetheless, the literature on the matter claims that response forms function neither as interjections, since in their prototypical forms such as yes and no they do not communicate the speaker’s emotion, nor as adverbs, as they do not “qualify” another element in the sentence (Aijmer, 2002; Biber et al., 1999). Response forms are rather classified as response signals or reaction signals (Aijmer, 2002) or as inserts with a back-channelling function (Biber et al., 1999), in both cases representing a category of their own.

  • Other parsing errors include the cluster I got to, which is tagged as pronoun + past tense verb (VVD) + TO, whereas it should be tagged as pronoun + past participle (VVN) + TO (see excerpt 6 below), and presentatives, which are tagged as plural nouns (NN2) (see excerpt 6 below).

  • Excerpt 6

  • I_PPIS1 got_VVD ta_TO get_VVI to_TO work_VVI !_!

  • […] and_CC theres_NN2 your_APPGE syrup_NN1 ._.

Applying CLAWS4 to film dialogue has revealed a series of tagging errors which lower the accuracy rate claimed by UCREL. Calculating an estimate of tagging errors for the script of Thelma & Louise (Ridley Scott, 1991) based on a 1000-word sample analysis, the error margin goes up to 10.73%, which differs both from the figure indicated on the official web page of CLAWS applied to the BNC for spoken texts, namely 1.17%, and from the overall maximum error margin of 3%. The text format and the use of the online software may have affected the accuracy rate. Many of the tagging errors could probably have been avoided if the software had been able to detach contracted verbs from the pronouns (e.g. Im, theyd, thats, etc.). In the next section, some of the tagging errors posing methodological issues will be discussed and new tagging rules will be proposed. These rules were implemented by a Python script which corrects the tagging output of CLAWS by post-processing it.

Discussion of Tagging Errors and the Definition of New Rules

Among the seven consistent tagging errors presented above, some posed theoretical challenges due to the essentially pragmatic function of the expressions involved and the difficulty of fitting them into a ‘traditional’ word class defined on the grounds of morpho-syntactic criteria. This is the case of pragmatic markers such as well, interjections such as oh, ah, and response forms such as yes, no, okay, yeah, sure. Other tagging errors only need a specific rule to help CLAWS4 disambiguate and assign the correct tag. This section deals with the formulation of the new rules to be fed to the post-processing script with the aim of increasing the accuracy of the automatic tagging of film dialogue.

The results of the automatic tagging carried out by CLAWS4 on film dialogue revealed the need to implement in the tagging system a number of pragmatic categories that are frequent in dialogue. Five pragmatic categories were identified to be automatically tagged by the script: besides interjections, which are already tagged with a high level of accuracy by CLAWS4, the categories of forms of address, response forms, pragmatic markers, conversational formulas and general extenders were introduced. The categories were selected in a bottom-up fashion, following the manual analysis of the most frequent expressions occurring in the pilot POS tagging of Thelma & Louise (Ridley Scott, 1991). Since pragmatic categories often contain complex multi-word expressions which at times correspond to entire sentences (e.g. you know), we chose, where applicable, to keep the POS tags indicating the word classes of the elements of the complex expression and then add the label of the pragmatic category the expression belongs to. Below is a discussion of the rules formulated for the Python script. The first five rules concern the tagging of the aforementioned pragmatic categories, whereas rules 6–10 aim to correct some parsing errors made by CLAWS4 and assign the proper word class to the items in the expression. Finally, rule 11 adds the possibility to automatically distinguish between subject you and object you.

The Rules

Rule 1–Forms of address: address forms such as honey, darling, dear, Miss, Mrs, Sir, etc. will be tagged as AF (address forms) (cf. Brown & Ford, 1961; Brown & Gilman, 1960). Vocatives have been shown to be particularly frequent and crucial in film dialogue, even more so than in spontaneous spoken language (Bubel, 2006; Formentelli, 2014), in that they help to build the characters’ identities and relationships as well as encourage viewers’ participation (Pavesi, 2012; Zago, 2015). We therefore deemed a new category useful in order to signal which words have an addressing function, charged with attitudinal values (Braun, 1988), and to differentiate them from their non-addressing counterparts with a propositional contribution, as in (1) below. This rule deals with lexical forms of address and excludes proper names used as vocatives.

(1)

a. Honey, are you okay?

b. Can I have some honey?

In (1a), honey will now be tagged AF, whereas in (1b) honey will be tagged as a common noun (NN1). The same applies to other address forms such as darling, dear and so on, which need to be kept separate from contexts in which they function as adjectives (see (2) below).

(2)

My dear/darling grandma is now 97.

The script was provided with a list of possible address forms and given parameters in order to distinguish them from adjectival or nominal uses. The parameters according to which the script decides which tag to assign to a possible address form are mainly syntactic, though also based on spelling and punctuation: the script was instructed to assign the tag AF if the word appeared in the list of possible address forms and was found either at the beginning of a line followed by a comma or at the end of a line preceded by a comma and followed by other punctuation (a sketch of this rule is given below). This rule works because the dialogues in the PCFD are manually transcribed and punctuation is used to reproduce pauses, and therefore text segmentation. In texts without this kind of information, address forms would prove more difficult to identify.
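The following Python fragment sketches how such a rule can be expressed over CLAWS-style ‘word_TAG’ tokens; the address-form list is abridged and the function is our reconstruction, not the actual script:

# Sketch of Rule 1: retag sentence-peripheral address forms as AF.
ADDRESS_FORMS = {"honey", "darling", "dear", "miss", "mrs", "sir", "maam", "hon"}

def retag_address_forms(tagged_line):
    tokens = tagged_line.split()              # tokens look like 'word_TAG'
    for i, token in enumerate(tokens):
        word = token.rpartition("_")[0]
        if word.lower() not in ADDRESS_FORMS:
            continue
        # line-initial word followed by a comma
        initial = i == 0 and len(tokens) > 1 and tokens[1].startswith(",")
        # line-final word preceded by a comma, followed only by punctuation
        final = (i >= 1 and tokens[i - 1].startswith(",")
                 and all(t[0] in ".!?" for t in tokens[i + 1:]))
        if initial or final:
            tokens[i] = word + "_AF"
    return " ".join(tokens)

print(retag_address_forms("Thank_VV0 you_PPY ,_, darling_NN1 ._."))
# Thank_VV0 you_PPY ,_, darling_AF ._.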

Rule 2–Response forms (tag RF): this category includes all the response forms to which a traditional word class cannot be assigned (Aijmer, 2002; Biber et al., 1999). The script is provided with a list of response forms, which includes yes, no, yep, nope, yup, nah, yeah, okay and echo responses such as I do, she did, weren’t they?, etc. In order to distinguish between echo responses and tag questions, the script was instructed to consider these expressions as response forms only when they constitute a stand-alone utterance. The clues used by the script are syntactic position and punctuation (see the sketch below).
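A simplified sketch of the rule (our reconstruction; echo responses are omitted for brevity) treats a listed form as a response form when it is flanked only by punctuation or line boundaries:

# Sketch of Rule 2: retag stand-alone response forms as RF.
RESPONSE_FORMS = {"yes", "no", "yep", "nope", "yup", "nah", "yeah", "okay"}

def retag_response_forms(tokens):
    """tokens are CLAWS-style 'word_TAG' strings for one line."""
    def is_punct(tok):
        return tok[0] in ",.!?"
    for i, tok in enumerate(tokens):
        word = tok.rpartition("_")[0]
        prev_ok = i == 0 or is_punct(tokens[i - 1])              # boundary before
        next_ok = i == len(tokens) - 1 or is_punct(tokens[i + 1])  # boundary after
        if word.lower() in RESPONSE_FORMS and prev_ok and next_ok:
            tokens[i] = word + "_RF"
    return tokens

print(retag_response_forms("Yeah_UH ,_, I_PPIS1 will_VM ,_, operator_NN1 ._.".split()))
# ['Yeah_RF', ',_,', 'I_PPIS1', 'will_VM', ',_,', 'operator_NN1', '._.']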

Rule 3–Pragmatic markers (tag PM): this category gathers linguistic expressions whose predominant functions are dialogic organisation (e.g. structuring and turn-taking), the management of the relationship between the interlocutors, and the expression of the speaker’s stance and intentions. The list of English pragmatic markers was obtained from the literature on the matter (cf. Beeching, 2016; Carter & McCarthy, 2006; Furiassi, 2021, among others; Başol & Kartal, 2019; Chaume, 2004; Forchini, 2010, for audiovisual dialogue) and includes the expressions well, just, you know, sort of, I mean, I think, anyway(s), so yeah, so, though, you see, really, okay then. These expressions were kept separate from their literal, non-pragmatic uses thanks to syntactic clues such as position in the sentence and punctuation. Since the Python script does not distinguish between specific functions of pragmatic expressions, the generic label pragmatic markers was chosen over discourse markers, considering that the former term is taken to encompass a wide variety of linguistic items with pragmatic function, including discourse markers (for a detailed discussion of the labels of pragmatic categories see Aijmer, 2013; Aijmer & Simon-Vandenbergen, 2011; Brinton, 2008; Fedriani & Sansò, 2017; Waltereit & Detges, 2007).

Rule 4–Conversational formulas (tag CF): also called conversational routines (Aijmer, 1996; Coulmas, 1981; Firth, 1972), these are expressions typical of conversation with little informational value compared to their paramount procedural and interpersonal role aimed at social cohesion. They are called ‘routines’ or ‘formulas’ given their nature as pre-fabricated linguistic expressions with a generally accepted use which takes into account appropriateness in context (e.g. nice to meet you). The list of conversational formulas for the Python script draws on the literature by Coulmas (1981) and Firth (1972) on spoken English and by Bonsignori et al. (2011) on film dialogue. It includes the following expressions of greeting, leave-taking and good wishes:

Hi, hello, hey, oi, yo, aye aye, hiya

How are you? (Are) you okay? ((are) you) Alright?

Nice to meet you, how do you do, pleasure to meet you, my pleasure, come on, please, if you don’t mind

(Good)bye, see you (later), take care, cheers, farewell

Thanks, thank you (very/so much), good to see you, so long, long time no see, good luck, welcome (back), (you’re) welcome, any time, of course, (good) evening, (good) morning, (good) afternoon.

Rule 5–General extenders (tag EXT): general extenders are related to the concept of vagueness in language and designate the expressions occurring at the end of a list which suggest that the list could go on without all of its members being specified (e.g. and so on, and stuff like that), relying on the fact that the interlocutor can and will fill in the gaps (Crystal & Davy, 1975; Overstreet, 1999). General extenders are relied on in film dialogue in order to convey orality and naturalness (Zanotti, 2014). The list of general extenders for this rule includes: or something, and everything, and things (like that), and stuff (like that), such and such, and so on (and so forth), et cetera. The only syntactic criterion used for the identification of general extenders is their position in the right periphery of the utterance, as in the sketch below.
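A minimal sketch of such a right-periphery match, assuming CLAWS-style ‘word_TAG’ tokens and an abridged extender list, could look as follows:

# Sketch of Rule 5: retag a line-final general extender as EXT.
GENERAL_EXTENDERS = [              # longer variants listed first
    "or something", "and everything", "and things like that", "and things",
    "and stuff like that", "and stuff", "and so on and so forth", "and so on",
]

def tag_general_extender(tokens):
    # strip trailing punctuation tokens before matching
    end = len(tokens)
    while end > 0 and tokens[end - 1][0] in ",.!?":
        end -= 1
    words = [t.rpartition("_")[0].lower() for t in tokens[:end]]
    for ext in GENERAL_EXTENDERS:
        ext_words = ext.split()
        if words[-len(ext_words):] == ext_words:
            for j in range(end - len(ext_words), end):
                word = tokens[j].rpartition("_")[0]
                tokens[j] = word + "_EXT"
            break
    return tokens

print(tag_general_extender("I_PPIS1 told_VVD him_PPHO1 and_CC everything_PN1 ._.".split()))
# ['I_PPIS1', 'told_VVD', 'him_PPHO1', 'and_EXT', 'everything_EXT', '._.']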

Rule 6–Contracted verbs: every time a personal pronoun from the list provided to the script (i.e. I, you, she, he, it, we, they) is recognised and is followed by a string belonging to the list of possible verb contractions (i.e. d for would and had, m for am, re for are, ll for will, s for is and has, and ve for have), the pronoun is tagged with the corresponding personal-pronoun label (see Appendix A) and the contracted verb is tagged as a modal, auxiliary or copula according to the context (e.g. you_PPYS 're_VBR gon_VVGK na_TO). In contexts in which disambiguation is needed, as in I’ll vs. ill, the script was instructed to consider a capital I to be the personal pronoun, also given that the adjective ill never occurs sentence-initially (thus capitalised) in the corpus. A simplified sketch of the rule follows.
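The following sketch illustrates one way the splitting could work. The tag choices follow the paper's examples, but the mapping is abridged: context-dependent cases ('d as would vs. had, 's as is vs. has) are collapsed to a single tag, and ambiguous fused forms such as possessive its or past-tense were would need extra handling in a real script:

# Sketch of Rule 6: split a fused pronoun+contraction and tag both parts.
PRONOUN_TAGS = {"i": "PPIS1", "you": "PPYS", "she": "PPHS1", "he": "PPHS1",
                "it": "PPH1", "we": "PPIS2", "they": "PPHS2"}
CONTRACTION_TAGS = {"ll": "VM", "d": "VM", "m": "VBM", "re": "VBR",
                    "s": "VBZ", "ve": "VH0"}

def split_contraction(token):
    """Return e.g. 'I_PPIS1 ll_VM' for 'Ill', or None if no match."""
    for pron, ptag in PRONOUN_TAGS.items():
        if pron == "i" and not token.startswith("I"):
            continue  # lowercase 'ill' is the adjective, not I + 'll'
        for contr, ctag in CONTRACTION_TAGS.items():
            if token.lower() == pron + contr:
                head = token[:len(pron)]
                return head + "_" + ptag + " " + contr + "_" + ctag
    return None

print(split_contraction("Ill"))    # I_PPIS1 ll_VM
print(split_contraction("theyd"))  # they_PPHS2 d_VM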

Rule 7–Saxon genitive (tag GE): the script is fed a list of the most common proper names in English and instructed to tag as genitive any recognised proper name immediately followed by an -s and a noun (NN) or a combination of adjective and noun (JJ + NN). This automatically excludes the occurrences in which -s represents a contracted form of is, since is would be expected to be followed by a gerund (e.g. is doing), an adjective not followed by a noun (is beautiful), a complex noun phrase consisting of an article + (optional) adjective + noun (is a beautiful girl), or a past participle (e.g. is said to be). The rule for tagging possessive -s was also extended to indefinite pronouns such as somebody, anybody, nobody, everybody, and the corresponding forms in -one, such as someone, anyone, etc. A sketch of the core pattern follows.
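The fragment below is a plausible reconstruction of the lookahead pattern, with an abridged name list; it is not the authors' code:

# Sketch of Rule 7: a proper name fused with -s is tagged as genitive
# when what follows opens a noun phrase (NN, or JJ followed by NN).
PROPER_NAMES = {"christ", "jimmy", "thelma", "louise", "darryl"}  # abridged

def retag_genitive(tokens):
    for i in range(len(tokens) - 1):
        word = tokens[i].rpartition("_")[0]
        base = word[:-1].lower()
        next_tag = tokens[i + 1].rpartition("_")[2]
        noun_next = next_tag.startswith("NN")
        adj_then_noun = (next_tag.startswith("JJ") and i + 2 < len(tokens)
                         and tokens[i + 2].rpartition("_")[2].startswith("NN"))
        if word.lower().endswith("s") and base in PROPER_NAMES \
                and (noun_next or adj_then_noun):
            tokens[i] = word[:-1] + "_NP1 s_GE"   # detach the genitive -s
    return tokens

print(retag_genitive("For_IF Christs_NP2 sake_NN1".split()))
# ['For_IF', 'Christ_NP1 s_GE', 'sake_NN1']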

Rule 8–Presentatives: the script is instructed to tag as presentative (PRES) all the instances of there followed by any form and tense of the verb to be, e.g. there is, there was, there had been, etc.

Other rules:

Rule 9–Contracted negation (tag XX): when the script recognises that the string nt is attached to a verb (e.g. aren’t, don’t, doesn’t, hadn’t, etc.), it analyses the unit as the verb plus not (see (3) below). The negated future auxiliary won’t is not included here because it is tagged correctly by CLAWS4.

(3)

I don’t know.

I_PPIS1 do_VD0 not_XX know_VV0.

Rule 10–I got to: the expression omits the auxiliary have or had (or the contracted forms 've and 'd); therefore, got needs to be analysed as a past participle. The string I got to, as well as its fused form gotta, is tagged as I (PPIS1) + past participle got (VVN) + to_TO, with the verb following to tagged as an infinitive (VVI).

Rule 11–Distinguishing between subject you and object you: this rule resolves the ambiguity of the second person pronoun you, used for both the subject and the object function, by indicating in the tag which function the pronoun is performing. Subject you is indicated by PPYS and object you by PPYO. The script looks at the position of you in the sentence and at the co-text in order to establish which function the pronoun is performing in a specific context, taking into account variation due to negation and word order in questions. A toy version of the rule is sketched below.
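As an illustration, the fragment below is our own reconstruction covering only the simplest pattern, namely you immediately followed by a verb being tagged as a subject; the negation and question-order cases mentioned above are left out:

# Toy sketch of Rule 11: you directly before a verb is a subject (PPYS).
VERB_PREFIXES = ("VV", "VB", "VD", "VH", "VM")

def disambiguate_you(tokens):
    for i, tok in enumerate(tokens):
        word, _, tag = tok.rpartition("_")
        next_tag = tokens[i + 1].rpartition("_")[2] if i + 1 < len(tokens) else ""
        if word.lower() == "you" and tag == "PPY" and next_tag.startswith(VERB_PREFIXES):
            tokens[i] = word + "_PPYS"
    return tokens

print(disambiguate_you("Here_RL you_PPY go_VV0 ,_, you_PPY got_VVD it_PPH1 ._.".split()))
# ['Here_RL', 'you_PPYS', 'go_VV0', ',_,', 'you_PPYS', 'got_VVD', 'it_PPH1', '._.']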

Accuracy Rates of the Post-processing

Once the new rules were applied to the data, a qualitative analysis of the tagging of the dialogue of Thelma & Louise (Ridley Scott, 1991) displayed a visible improvement in the accuracy of the labelling. On average, CLAWS4 produced an error margin of 10.73% on film dialogue, whereas thanks to the post-processing it is now reduced to 3.16%, which translates into an accuracy of a little less than 97%. Table 1 below shows how the text is tagged by CLAWS4 and how the script changes the tags in post-processing.

Table 1 Pre- and post-processed data (excerpt from Thelma & Louise)

CLAWS4

Sonny_NP1 ,_, you_PPY wan_VV0 na_TO hit_VVI me_PPIO1 up_RP with_IW some_DD syrup_NN1 ?_?

Thank_VV0 you_PPY ,_, darling_NN1 ._.

Maam_VV0 ?_?

Yeah_UH ,_, Ill_NP1 be_VBI right_JJ with_IW you_PPY ._.

Here_RL you_PPY go_VV0 ,_, you_PPY got_VVD the_AT sausage_NN1 ,_, pancakes_NN2 for_IF you_PPY ,_, and_CC theres_NN2 your_APPGE syrup_NN1 ._.

He_PPHS1 s_VBZ over_RP there_RL ._.

Uh-huh_UH ._.

Decaf_VV0 or_CC regular_JJ ?_?

Regular_JJ ,_, please_RR ._.

You_PPY girls_NN2 are_VBR kind_RR21 of_RR22 young_JJ to_TO be_VBI smoking_VVG

,_, do_VD0 nt_XX you_PPY think_VVI ?_?

It_PPH1 ruins_VVZ your_APPGE sex_NN1 drive_NN1 ._.

Ill_NP1 get_VV0 it_PPH1 ,_, I_PPIS1 got_VVD it_PPH1 !_!

Hello_UH ?_?

Hey_UH !_!

PYTHON SCRIPT

Sonny_NP1 ,_, you_PPYS wan_VV0 na_TO hit_VVI me_PPIO1 up_RP with_IW some_DD syrup_NN1 ?_?

Thank_CF you_CF ,_, darling_AF ._.

Maam_AF ?_?

Yeah_RF ,_, I_PPIS1 ll_VM be_VBI right_JJ with_IW you_PPY ._.

Here_RL you_PPYS go_VV0 ,_, you_PPYS got_VVD the_AT sausage_NN1 ,_, pancakes_NN2 for_IF you_PPY ,_, and_CC there_PRES is_VBZ your_APPGE syrup_NN1 ._.

He_PPHS1 s_VBZ over_RP there_RL ._.

Uh-huh_UH ._.

Decaf_VV0 or_CC regular_JJ ?_?

Regular_JJ ,_, please_CF ._.

You_PPY girls_NN2 are_VBR kind_RR21 of_RR22 young_JJ to_TO be_VBI smoking_VVG ,_, do_VD0 nt_XX you_PPYS think_VVI ?_?

It_PPH1 ruins_VVZ your_APPGE sex_NN1 drive_NN1 ._.

I_PPIS1 ll_VM get_VV0 it_PPH1 ,_, I_PPIS1 got_VVD it_PPH1 !_!

Hello_CF ?_?

Hey_CF !_!

Comparing the two versions makes it possible to identify the expressions whose tags were modified by the script: thank you is now tagged as a conversational formula (CF); darling and maam are tagged as address forms (AF) rather than as a noun (NN1) and a verb (VV0); yeah is now a response form (RF) instead of an interjection (UH); the contracted form I’ll, originally tagged as a proper noun (NP1), is now correctly parsed into the first person pronoun I and the verb will; a similar operation is applied to there’s, originally tagged as a plural noun (NN2), which is now parsed into the presentative there followed by the copula is; and so on.

The improved accuracy of the automatic tagging was achieved thanks to the integration of grammatical and pragmatic categories of analysis, which made it possible to distinguish items that mainly contribute to the propositional content from items with a predominant pragmatic function. By doing so, not only is the tagging of dialogues more reliable, but it also allows the automatic retrieval of pragmatic items from a text.

Conclusion

This study has dealt with the issues and solutions related to a pilot study on tagging a corpus of film dialogue (namely the PCFD). The application of the software CLAWS4 and the manual check of the tagged output highlighted a number of parsing and word-class assignment errors. In order to solve consistent tagging errors, we developed a Python script which corrects the wrong tags in post-processing. The tagging error analysis also yielded grounds for methodological discussion: it was noted how some linguistic expressions could not fit the word-class category they were assigned to by the tagger because they performed pragmatic functions. Applying a tagging software to a corpus of film dialogue has proven particularly fruitful in this sense, since the dialogues contain frequent pragmatic features typical of conversation, such as pragmatic markers, interjections, conversational formulas, etc. Noting how linguistic expressions contributing propositional content and those performing pragmatic functions are generally integrated in the dialogue analysed for the pilot, a likewise integrated approach to tagging was proposed, whereby the script was instructed to tag both word classes and pragmatic categories and to show them in the same output. This setting makes it possible to automatically distinguish between propositional and pragmatic uses of the same expression (e.g. well, I mean, sure, etc.) and represents a valuable shortcut for corpus-based pragmatic studies of film dialogue as well as spoken language, which generally require manual annotation of pragmatic features.

At the beginning of the paper, four research questions were formulated which will be answered below:

Q1: What is the accuracy rate of automatic CLAWS POS-tagging on film dialogue?

Following a qualitative analysis of the performance, the accuracy rate of the POS tagger on the dialogue of the film Thelma & Louise (Ridley Scott, 1991) corresponds to 89.27%. This differs from the accuracy percentages presented on the website of the CLAWS tagger, namely 98.86% for written language and 98.83% for spoken language.

Q2: What are the most frequent tagging errors?

The most frequent tagging errors belong to two categories: the first gathers mislabelling and parsing errors due to the text format or to context-bound disambiguation, such as the failure to recognise contracted verbs, auxiliaries and genitive -s; the second concerns the issues related to labelling linguistic items with a predominantly pragmatic function, which are treated by the POS tagger as prototypical lexical items (with the exception of interjections). These are pragmatic markers (e.g. you know, well), address forms (e.g. darling, honey), conversational formulas (e.g. thank you, see you later), general extenders (e.g. and stuff like that) and response forms (e.g. yes, yeah, nope).

Q3: How does the methodology of tagging film dialogue and/or the tagging procedures need to change in order to obtain satisfactory accuracy rates?

Besides fixing the software errors related to input reading, we found that improved results could be obtained by designing ad hoc tagging categories for predominantly pragmatic expressions. The Python script was instructed to tag the five categories of pragmatic markers, address forms, conversational formulas, general extenders and response forms. For complex pragmatic expressions, such as conversational formulas, which generally consist of two or more words, the software was instructed to keep both the POS tags and the pragmatic tag. This makes it possible to search for the single words making up a complex pragmatic expression by their word class: for example, see in see you later can be retrieved by searching both for verbs and for conversational formulas.

Q4: What kind of improvement does integrating the morpho-syntactic and pragmatic levels of analysis bring to the automatic tagging of spoken language?

Three levels of improvement can be observed: the first concerns the possibility of automatically distinguishing between mainly-propositional and mainly-pragmatic expressions in spoken text analysis; the second concerns the accuracy with which the tagging is carried out in a text genre in which the pragmatic dimension is paramount and thus cannot be reduced exclusively to morpho-syntactic criteria; the third concerns the theoretical underpinning of such an integration, as it supports and fosters the idea that grammar and pragmatics are continuously integrated and cannot be kept separate in any analysis of spoken language.