Introduction

Part-of-speech tagging (POS tagging henceforth) is the process of assigning a sequence of tags to a sequence of words in order to mark word classes. Traditional methodologies involve rule-based and statistical POS taggers and, more recently, machine learning algorithms. Rule-based POS taggers (Brill, 1992, 1994; Sadredini et al., 2018) rely on a set of deterministic transformation rules, such as the association of a word with a POS. The ruleset is often coupled with a set of constraints the tagger must follow, e.g., an article cannot be followed by another article. Rule-based approaches are especially suitable for building multilingual and non-English taggers (Garg et al., 2012; Megyesi, 1998; Rashel et al., 2014), which often cannot benefit from annotated corpora: any additional language requires a specific set of rules, yet neither data nor training is needed. While powerful enough to achieve high accuracy on benchmark datasets, rule-based taggers show inherent limitations in uncontrolled experimental environments, due to their lack of contextual understanding and their rigidity towards unexpected cases. Statistical POS taggers work by finding the sequence of POS tags that most likely fits the input sentences by means of hidden Markov models (Brants, 2000; Carlberger & Kann, 1999; Cutting et al., 1992) or entropy maximisation (Ratnaparkhi, 1996; Toutanova & Manning, 2000). POS taggers built upon machine learning algorithms, such as SVMs (Giménez & Màrquez, 2004) and neural networks (Schmid, 1994), are very powerful; however, many machine learning algorithms are not interpretable, which means that it is not possible to understand what motivated the POS tagger’s choices.

The usefulness of POS tagging lies in the automation, and thus the speeding up, of searches for specific word classes (e.g. nouns, pronouns, verbs, adverbs) and combinations of them (e.g. a noun followed by a preposition). In order to offer automatic POS search through the 32 film dialogues collected for the anglophone section of the Pavia Corpus of Film Dialogue (PCFD henceforth), we chose to conduct a pilot study on the dialogues of the film Thelma & Louise (Ridley Scott, 1991), which at the time was the latest film to be added to the corpus. The POS tagger CLAWS4 was selected among the available software (e.g. Penn Treebank, Stanford POS tagger, CLL-Tagger) since it was freely accessible through its online interface and shared with other reference corpora of English such as the BNC and the corpora available through english-corpora.org (e.g. COCA and COHA), and therefore convenient for the sake of comparability in future studies of film dialogue and spoken language.

Since the PCFD is a corpus of orthographically transcribed film dialogues, we expected the tagging of the texts to be somewhat problematic. POS tagging relies exclusively on word-class assignment based on morpho-syntactic criteria, whereas the nature of film dialogue requires the pragmatic dimension to be taken into account. This is because film dialogue closely resembles spoken language (Bednarek, 2010; Forchini, 2012; Quaglio, 2009; Valdeón, 2009), as it represents a kind of language that is “written to be spoken as if not written” (Gregory, 1967: 191-192). Film dialogue is therefore dotted with linguistic features with a predominantly pragmatic function, such as vocatives, general extenders, thanking and apologising routines, greetings and leave-takings, interjections, etc. (cf. Bednarek, 2010; Bonsignori et al., 2011; Formentelli, 2014; Freddi, 2011, 2012; Forchini, 2012; Pavesi, 2009; Quaglio, 2009; Rodríguez Martín, 2010; Zanotti, 2014). Furthermore, the syntax of spoken language is often characterised by ellipsis, hesitations, false starts and other disfluency phenomena related to on-line linguistic production in real time (Bortfeld et al., 2001; Lickley, 2001; Shriberg, 1994), which are in turn imitated to some extent by audiovisual dialogue with the aim of conveying a sense of naturalness and spontaneity.

As a consequence of the peculiarities of spoken language and the reproduction of its distinctive features in audiovisual dialogue, automatic POS tagging becomes more prone to error the further the language the tagger is presented with drifts away from written language (see 3 below). POS taggers are mainly trained on written language, given its wider availability compared to suitable training corpora of orthographically transcribed spoken language (Nivre et al., 1996). This affects the probabilistic decision-making of taggers (see 2.1 below), since they are driven by probability estimates derived from written language which may not be representative of spoken language (Nivre et al., 1996). Finally, since POS taggers are not provided with rules that allow them to discern between non-pragmatic (thus propositional) and pragmatic uses of certain expressions, an issue is bound to arise whenever essentially pragmatic items are treated like any other expression contributing propositional content. For example, a POS tagger will not recognise a form of address in the words honey or darling, which will be tagged as a noun and an adjective respectively; this, however, does not make it possible to distinguish between the occurrences in which honey denotes the fluid produced by bees and those in which it is used as an endearment term, often as a vocative, thus pragmatically. Similarly, the tagger will guess which word class words such as hi, hey, yes and no should be assigned to on the grounds of probabilistic rules (probably adverbs, see 2.1 below), ignoring their essentially pragmatic function as greetings and response forms, which cannot fit any of the traditional word classes.

In this paper we will stress that, when tagging transcriptions of spoken language, it is impossible to keep the levels of grammatical and pragmatic analysis separate. The two are continuously integrated (Traugott & Trousdale, 2013), and in light of recent studies on pragmatic marking stressing the difference between expressions contributing propositional meaning and those performing pragmatic functions (cf. Aijmer & Simon-Vandenbergen, 2011; Aijmer, 2013; Brinton, 2008; Fedriani & Sansò, 2017; Waltereit & Detges, 2007), it would not be advisable to have taggers which treat the propositional and pragmatic contributions of an expression as the same thing.

The aim and principle underlying the building of our post-processing Python script is the belief that being able to automatically distinguish between mainly-propositional and mainly-pragmatic expressions is an essential resource in spoken text analysis. To the best of our knowledge, this represents one of the few attempts to integrate the automatic tagging of pragmatic categories into corpus annotation (cf. Zago, 2016), considering that pragmatic annotation of corpora generally goes no further than tagging speech acts, tone movements, attention management and speech events (cf. the SPICE corpus, MICASE, project MAVIR).

The article is structured as follows: section "Research Questions and Methodology" outlines the research questions and methodology and provides a description of the tagging software CLAWS4; section "CLAWS4 Applied to Film Dialogue" deals with the application of CLAWS4 to the dialogues of Thelma & Louise (Ridley Scott, 1991); section "Discussion of Tagging Errors and the Definition of New Rules" discusses the tagging errors in detail and defines the new tagging rules fed into the Python script for data post-processing; section "Accuracy Rates of the Post-processing" briefly displays the improvement in the tagging and the related accuracy rates; finally, conclusions are drawn in section "Conclusion".

Research Questions and Methodology

The pilot study on the dialogues of Thelma & Louise (Ridley Scott, 1991) aims to verify how well automatic POS tagging can perform on film dialogue, with a view to tagging the entire PCFD. As already discussed in 1, we expect some tagging errors to occur, due both to the probabilistic methods adopted by the tagger in assigning the word-class tags and to the specificities of film dialogue, which features a considerable number of predominantly pragmatic expressions that are not appropriately captured by traditional word classes. The research questions we will answer are the following:

  1. What is the accuracy rate of automatic CLAWS POS-tagging on film dialogue?

  2. What are the most frequent tagging errors?

  3. How does the methodology of tagging film dialogue and/or the tagging procedures need to change in order to obtain satisfactory accuracy rates?

  4. What kind of improvement does integrating the morpho-syntactic and pragmatic levels of analysis bring to automatic tagging of spoken language?

We will answer the first and second questions by providing a qualitative analysis of the output produced by the POS tagger CLAWS4 (see 2.1 below) on the dialogue of Thelma & Louise (Ridley Scott, 1991), followed by a calculation of the error rate and the estimated accuracy rate (see 5 below). The most frequent tagging errors will be highlighted and divided into two categories: parsing errors and tag-assignment errors (see 2.2). The answer to the third research question relies on the identification of tag-assignment errors and addresses the need to widen the number and type of categories of analysis (and therefore of tags) available, in order to integrate pragmatic categories such as pragmatic markers, address forms, response forms, etc., into the automatic tagging. The fourth question will be answered by analysing how a Python script that post-processes the output of CLAWS4 performs on film dialogue.

The Python script for the post-processing of the data output provided by CLAWS4 was designed to correct some of the consistent errors produced by the POS tagger, in order to improve the accuracy of the results. The script was fed new rules so as to both correct parsing errors and implement the tagging of the categories of pragmatic markers, response forms, conversational formulas and general extenders. It also applies a rule for the disambiguation of subject you from object you (see 4). The definition of the new tagging rules relies both on the literature on the topics touched upon by each tagging error (e.g. conversational formulas, address forms, etc.) and on the graphic cues available in the orthographic transcriptions of the film dialogues in the PCFD. The possibility of automatically tagging pragmatic categories is directly related to the method with which the dialogues in the PCFD were transcribed. Since punctuation is used in the transcription conventions as a way to represent pauses in speech, it was possible to instruct the Python script to rely on it in order to identify sentence-peripheral (as in (1) below) as well as intra-sentential (as in (2) below) pragmatic expressions such as address forms, response forms, interjections, etc.

(1)

THELMA: Honey, you better hurry up!

DARRYL: God damn it, Thelma.

[…]

THELMA: Okay, I will too, then.

(2)

LOUISE: No, Thelma, we don’t need the lantern.

[…]

WAITRESS: Oh, hell, I told you fifty times, yeah, I could identify them.

(Thelma & Louise, Ridley Scott 1991)

After applying the Python script to the output of CLAWS4, a manual analysis was carried out on the new post-processed output in order to calculate the accuracy rates of the automatic tagging of film dialogue. This analysis also shows in what ways the integration of morpho-syntax and pragmatics improves the automatic tagging of film dialogue and, possibly, of other instances of spoken language accessible in written form.

CLAWS4

CLAWS (Constituent-Likelihood Automatic Word-Tagging System) is a part-of-speech tagger developed by the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University (UK). Its first version was developed in the 1980s, and its latest version, CLAWS4, was used to tag the British National Corpus (BNC) as well as the corpora accessible through Mark Davies’s BYU interface. The current tagset used by CLAWS is C8, which features over 160 tags; the present study, however, uses C7 as made available through the web interface of CLAWS (see Appendix A below and http://ucrel.lancs.ac.uk/claws8tags.pdf for the C8 tagset).

The tagging system operated by CLAWS consists of five different stages applied successively (Garside, 1987: 33), some of which use lexical and morphological sample lists in order to help the software assign the tags:

  (1) Pre-editing: the text is prepared for the tagging process. This stage is carried out partly manually, partly automatically, and involves the verticalisation of the text, so as to have a separate line for each word or punctuation mark in the corpus.

  (2) Tag assignment: each word in the text is assigned one or more tags regardless of the context of occurrence. This stage is performed by a dedicated piece of software called WORDTAG. In order to assign a tag, the software uses a lexicon of ca. 7200 words and indicates up to six possible tags for a single word, listed in decreasing likelihood. At this stage, the words are assigned all the tags listed in their entry of the lexicon. WORDTAG also uses a suffix list consisting of about 720 word endings with their associated tags. When the software fails to identify a word in the lexicon, the suffix list is searched for the longest word ending matching the word that needs to be tagged. For example, if the word does not match the -able ending (which would be tagged as an adjective), the -ble ending (thus a noun or verb) becomes the most probable; if the word does not match the -ble ending either, the -le ending (thus a noun or a verb) is tested as a possible interpretation and tag assignment (a minimal sketch of this fallback appears after this list). Exceptions to suffix assignment are accounted for in the lexicon (see cable and enable in the case of the -able suffix). The tag-assignment process based on the identification of the suffix is typically employed for 7-12% of the words in the text and is claimed to be carried out successfully for most of them. Hyphenated and prefixed words are assigned the tag of the remaining word once the prefix (e.g. co-, hyper-, mis-, over-, etc.) is detached from the word.

  (3) Idiom-tagging: the dedicated program IDIOMTAG identifies word patterns and narrows down the number of tags that can be assigned to a word found in a particular context. The software searches a list of about 150 phrases and modifies the tags accordingly. IDIOMTAG improves the tagging of separate orthographic items which function syntactically as a single unit (e.g. as well as) by performing what are labelled “ditto tags”, whereby a single grammatical tag is assigned to all the items in the identified unit. In the analysis of as well as, for example, IDIOMTAG would assign the tag CC (i.e. conjunction) to each of the three elements the expression consists of.

  (4) Tag disambiguation: the dedicated program CHAINPROBS deals with the cases in which more than one tag was assigned to a word (roughly 35% of the words in the corpus). By considering the context of occurrence, CHAINPROBS calculates the probability of each tag in the specific context and chooses the one with the highest probability. The probability information is derived from a sample of the Brown Corpus which was previously manually tagged and checked.

  (5) Post-editing: this is a manual stage in which the tags assigned by the software are checked and corrected if necessary.
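To make the suffix-based fallback of stage (2) concrete, the following Python fragment gives a minimal sketch of the mechanism described above. The lexicon and suffix list are tiny illustrative stand-ins for the actual CLAWS resources, and the function is our reconstruction rather than WORDTAG's code:

# Minimal sketch of WORDTAG's lexicon-plus-suffix fallback.
# LEXICON and SUFFIXES are tiny illustrative stand-ins.
LEXICON = {"cable": ["NN1"], "enable": ["VV0"]}  # exceptions to suffix rules
SUFFIXES = [
    ("able", ["JJ"]),          # -able: adjective
    ("ble", ["NN1", "VV0"]),   # -ble: noun or verb
    ("le", ["NN1", "VV0"]),    # -le: noun or verb
]

def candidate_tags(word):
    """Return candidate tags in decreasing likelihood."""
    if word in LEXICON:                  # the lexicon takes priority
        return LEXICON[word]
    for suffix, tags in SUFFIXES:        # longest matching ending first
        if word.endswith(suffix):
            return tags
    return ["NN1"]                       # default guess for unknown words

print(candidate_tags("workable"))  # ['JJ'] via the -able ending
print(candidate_tags("cable"))     # ['NN1'] via the lexicon exception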

The level of accuracy achieved by CLAWS ranges between 96% and 97% according to UCREL (http://ucrel.lancs.ac.uk/claws/). This percentage may vary according to the text type. The error rate based on a 50,000-word test sample corresponds to 1.14% on average in written texts and 1.17% in spoken texts. CLAWS has been shown to be more prone to error when tagging some specific word classes, namely adjectives, interrogative pronouns (e.g. when, why, how), proper nouns, possessives, base forms of lexical verbs, and past participles (see Table 1 in http://ucrel.lancs.ac.uk/bnc2/bnc2error.htm for a detailed account of the error percentages).


UCREL has recently devised a post-processor for CLAWS which uses the knowledge about the tagger’s most frequent errors to improve the accuracy of the resulting tagging. The post-processor, however, is available neither with the online tagger nor in the licensed versions of CLAWS4.

CLAWS4 Applied to Film Dialogue

In the present pilot study on tagging the Pavia Corpus of Film Dialogue, the online version of CLAWS4 was used to tag the dialogue of the film Thelma & Louise (Ridley Scott, 1991). The online tagger allows users to choose the tagset and the layout of the output. The data were converted to plain-text format (.txt), the metadata about characters’ names and scene details were excluded in order to avoid parsing errors, and the C7 tagset (see Appendix A below) and the horizontal layout were chosen, so that the tagged output looked as in the excerpt below:

Excerpt 1

Sonny_NP1 ,_, you_PPY wan_VV0 na_TO hit_VVI me_PPIO1 up_RP with_IW some_DD

syrup_NN1 ?_?

Thank_VV0 you_PPY ,_, darling_NN1 ._.

Maam_VV0 ?_?

Yeah_UH ,_, Ill_NP1 be_VBI right_JJ with_IW you_PPY ._.

Here_RL you_PPY go_VV0 ,_, you_PPY got_VVD the_AT sausage_NN1 ,_, pancakes_NN2

for_IF you_PPY ,_, and_CC theres_NN2 your_APPGE syrup_NN1 ._.

He_PPHS1 s_VBZ over_RP there_RL ._.

Uh-huh_UH ._.

As can be observed, the output contains the bare text of the dialogue without any information about the characters, conversational turns or other metalinguistic clues generally included in the PCFD data. The assigned POS tag is attached to the word it refers to, following the underscore ‘_’. The tags and their corresponding parts of speech are listed in full in Appendix A.
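For readers who want to manipulate this format, the fragment below shows how a horizontal CLAWS line can be split back into (word, tag) pairs; this is a convenience sketch of ours, not part of the tagger or of the post-processing script:

def parse_claws_line(line):
    """Split a horizontal CLAWS line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("_")  # the tag follows the last '_'
        pairs.append((word, tag))
    return pairs

print(parse_claws_line("He_PPHS1 s_VBZ over_RP there_RL ._."))
# [('He', 'PPHS1'), ('s', 'VBZ'), ('over', 'RP'), ('there', 'RL'), ('.', '.')]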

The output created by CLAWS4 on the dialogue of Thelma & Louise (Ridley Scott, 1991) was manually checked by two researchers in order to verify whether the assigned tags were appropriate, and thus to calculate a percentage of accuracy as well as an error margin for the performance of the POS tagger on film dialogue. This first qualitative analysis of the output automatically produced by CLAWS4 revealed a number of errors which occurred consistently throughout the POS-tagged text. These lowered the accuracy of the software to roughly 89.3% (an error margin of 10.73%; see below). Two main categories of errors can be distinguished: first, parsing errors due to the text format or to problems of context-bound word-class assignment; second, functional errors in which the POS tags assigned to expressions whose function is mainly pragmatic do not correspond to that function (e.g. the discourse marker well is tagged as an adverb). The most frequent errors are reported below, together with examples extracted from the dialogue of Thelma & Louise (Ridley Scott, 1991):

  • Forms of address such as ma’am are tagged variously as VV0 (i.e. a verb in its base form), as a singular noun (NN1), as happens with darling in the excerpt below, or as an adjective (JJ), as happens with hon in the excerpt below. Although tagging these forms as nouns represents the most sensible option, address forms do not work as prototypical nouns, given their vocative intersubjective function.

  • Excerpt 1:

  • Thank_VV0 you_PPY, darling_NN1.

  • Maam_VV0 ?_?

  • Yeah_UH, Ill_NP1 be_VBI right_JJ with_IW you_PPY ._.

  • […]

  • Hon_JJ ?_?

As will be shown in 4 below, the issue of assigning a wrong word-class tag to address forms will be solved by creating a new tagging category dedicated to address forms. This category groups together expressions with an addressing function, independently of the word classes identified for the items making up the address form.

  • The tagger fails to recognise contracted modal or auxiliary verbs attached to personal pronouns and nouns, as in I’ll in excerpt 1 above, which is tagged as a proper noun. This might be due to the fact that apostrophes, although present in the input, are not processed by the tagger. Other examples can be observed in excerpt 2 below, in which you’d is tagged as the base form of a verb, they’d as a singular noun, and Jimmy’s as a plural proper noun:

  • Excerpt 2

  • Youd_VV0 almost_RR think_VV0 theyd_NN1 want_VV0 to_TO forget_VVI about_II it_PPH1 for_IF the_AT weekend_NNT1.

  • […]

  • Wonder_VV0 if_CSW Jimmys_NP2 got_VVD back_RP ._.

  • Some adverbs are analysed as adjectives, as happens with right in excerpt 1 above.

  • Conversational formulas (e.g. please, thanks) and some pragmatic markers are analysed as adverbs (RR) or nouns (NN), as in excerpt 3 below:

  • Excerpt 3

  • Regular_JJ ,_, please_RR.

  • […]

  • Well_RR ,_, wait_VV0 now_RT.

  • […]

  • Uhm_NN1 ,_, no_UH ,_, thanks_NN2 ._.

  • The genitive ‘s is not recognised and is therefore tagged as a plural -s, as in excerpt 4 below:

  • Excerpt 4

  • For_IF Christs_NP2 sake_NN1

  • Response forms (Biber et al., 1999) such as yes, no, okay, sure, yeah are tagged as interjections (UH) or adverbs (RR), as in excerpt 5 below:

  • Excerpt 5

  • Well_RR ,_, yes_UH ,_, I_PPIS1 will_VM ,_, operator_NN1 ._.

  • […]

  • No_UH ,_, you_PPY wont_JJ ._.

  • […]

  • Sure_RR ,_, Thelma_NP1 ._.

  • […]

  • Yeah_UH ,_, I_PPIS1 could_VM identify_VVI them_PPHO2

Nonetheless, the literature on the matter claims that response forms function neither as interjections, since in their prototypical forms such as yes and no they do not communicate the speaker’s emotion, nor as adverbs, as they do not “qualify” another element in the sentence (Aijmer, 2002; Biber et al., 1999). Response forms are rather classified as response signals or reaction signals (Aijmer, 2002) or as inserts with a back-channelling function (Biber et al., 1999), in both cases representing a category of their own.

  • Other parsing errors include the cluster I got to, which is tagged as pronoun + past tense verb (VVD) + TO, whereas it should be tagged as pronoun + past participle (VVN) + TO (see excerpt 6 below), and presentatives, which are tagged as plural nouns (NN2) (see excerpt 6 below).

  • Excerpt 6

  • I_PPIS1 got_VVD ta_TO get_VVI to_TO work_VVI !_!

  • […] and_CC theres_NN2 your_APPGE syrup_NN1 ._.

Applying CLAWS4 to film dialogue has revealed a series of tagging errors which lower the accuracy rate claimed by UCREL. Calculating an estimate of tagging errors for the script of Thelma & Louise (Ridley Scott, 1991) based on a 1000-word sample analysis, the error margin goes up to 10.73%, which differs both from the figure indicated on the official web page of CLAWS applied to the BNC for spoken texts, namely 1.17%, and from the overall maximum error margin of 3%. The text format and the use of the online software may have affected the accuracy rate. Many of the tagging errors could probably have been avoided if the software had been able to detach contracted verbs from the pronouns (e.g. Im, theyd, thats, etc.). In the next section, some of the tagging errors posing methodological issues will be discussed and new tagging rules will be proposed. These rules were implemented by a Python script which corrects the tagging output of CLAWS by post-processing it.

Discussion of Tagging Errors and the Definition of New Rules

Among the seven consistent tagging errors presented above, some posed theoretical challenges due to the essentially pragmatic function of the expressions involved and the difficulty of fitting them into a ‘traditional’ word class defined on the grounds of morpho-syntactic criteria. This is the case of pragmatic markers such as well, interjections such as oh, ah, and response forms such as yes, no, okay, yeah, sure. Other tagging errors only need a specific rule to help CLAWS4 disambiguate and assign the correct tag. This section deals with the formulation of the new rules to be fed to the post-processing script with the aim of increasing the accuracy of the automatic tagging of film dialogue.

The results of the automatic tagging carried out by CLAWS4 on film dialogue revealed the need to implement in the tagging system a number of pragmatic categories that are frequent in dialogue. Five pragmatic categories were identified to be automatically tagged by the script: besides interjections, which are already tagged with a high level of accuracy by CLAWS4, the categories of forms of address, response forms, pragmatic markers, conversational formulas and general extenders were introduced. The categories were selected in a bottom-up fashion, following the manual analysis of the most frequent expressions occurring in the pilot POS tagging of Thelma & Louise (Ridley Scott, 1991). Since pragmatic categories often contain complex multi-word expressions which at times correspond to entire sentences (e.g. you know), we chose, where applicable, to keep the POS tags indicating the word classes of the elements of the complex expression and then add the label of the pragmatic category the expression belongs to. Below is a discussion of the rules formulated for the Python script. The first five rules concern the tagging of the aforementioned pragmatic categories, whereas rules 6–10 aim to correct some parsing errors made by CLAWS4 and assign the proper word class to the items in the expression. Finally, rule 11 adds the possibility to automatically distinguish between subject you and object you.

The Rules

Rule 1–Forms of address: address forms such as honey, darling, dear, Miss, Mrs, Sir, etc. will be tagged as AF (address forms) (cf. Brown & Ford, 1961; Brown & Gilman, 1960). Vocatives have been shown to be particularly frequent and crucial in film dialogue, even more so than in spontaneous spoken language (Bubel, 2006; Formentelli, 2014), in that they help to build the characters’ identities and relationships as well as encourage viewers’ participation (Pavesi, 2012; Zago, 2015). We therefore deemed a new category useful in order to signal which words have an addressing function, charged with attitudinal values (Braun, 1988), and to differentiate them from their non-addressing counterparts with a propositional contribution, as in (1) below. This rule deals with lexical forms of address and excludes proper names used as vocatives.

(1)

a. Honey, are you okay?

b. Can I have some honey?

In (1a), honey will now be tagged AF, whereas in (1b) honey will be tagged as a common noun (NN1). The same applies to other address forms such as darling, dear and so on, which need to be kept separate from contexts in which they function as adjectives (see (2) below).

(2)

My dear/darling grandma is now 97.

The script was provided with a list of possible address forms and given parameters in order to distinguish them from adjectival or nominal uses. The parameters according to which the script decides which tag to assign to a possible address form are mainly syntactic, though also based on spelling and punctuation: the script was instructed to assign the tag AF if the word appeared in the list of possible address forms and was found either at the beginning of a line followed by a comma or at the end of a line preceded by a comma and followed by other punctuation (a sketch of this rule is given below). This rule works because the dialogues in the PCFD are manually transcribed and punctuation is used to reproduce pauses, and therefore text segmentation. In texts without this kind of information, address forms would prove more difficult to identify.
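The following Python fragment sketches how such a rule can be expressed over CLAWS-style ‘word_TAG’ tokens; the address-form list is abridged and the function is our reconstruction, not the actual script:

# Sketch of Rule 1: retag sentence-peripheral address forms as AF.
ADDRESS_FORMS = {"honey", "darling", "dear", "miss", "mrs", "sir", "maam", "hon"}

def retag_address_forms(tagged_line):
    tokens = tagged_line.split()              # tokens look like 'word_TAG'
    for i, token in enumerate(tokens):
        word = token.rpartition("_")[0]
        if word.lower() not in ADDRESS_FORMS:
            continue
        # line-initial word followed by a comma
        initial = i == 0 and len(tokens) > 1 and tokens[1].startswith(",")
        # line-final word preceded by a comma, followed only by punctuation
        final = (i >= 1 and tokens[i - 1].startswith(",")
                 and all(t[0] in ".!?" for t in tokens[i + 1:]))
        if initial or final:
            tokens[i] = word + "_AF"
    return " ".join(tokens)

print(retag_address_forms("Thank_VV0 you_PPY ,_, darling_NN1 ._."))
# Thank_VV0 you_PPY ,_, darling_AF ._.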

Rule 2–Response forms (tag RF): this category includes all the response forms to which a traditional word class cannot be assigned (Aijmer, 2002; Biber et al., 1999). The script is provided with a list of response forms, which includes yes, no, yep, nope, yup, nah, yeah, okay and echo responses such as I do, she did, weren’t they?, etc. In order to distinguish between echo responses and tag questions, the script was instructed to consider these expressions as response forms only when they constitute a stand-alone utterance. The clues used by the script are syntactic position and punctuation (see the sketch below).
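A simplified sketch of the rule (our reconstruction; echo responses are omitted for brevity) treats a listed form as a response form when it is flanked only by punctuation or line boundaries:

# Sketch of Rule 2: retag stand-alone response forms as RF.
RESPONSE_FORMS = {"yes", "no", "yep", "nope", "yup", "nah", "yeah", "okay"}

def retag_response_forms(tokens):
    """tokens are CLAWS-style 'word_TAG' strings for one line."""
    def is_punct(tok):
        return tok[0] in ",.!?"
    for i, tok in enumerate(tokens):
        word = tok.rpartition("_")[0]
        prev_ok = i == 0 or is_punct(tokens[i - 1])              # boundary before
        next_ok = i == len(tokens) - 1 or is_punct(tokens[i + 1])  # boundary after
        if word.lower() in RESPONSE_FORMS and prev_ok and next_ok:
            tokens[i] = word + "_RF"
    return tokens

print(retag_response_forms("Yeah_UH ,_, I_PPIS1 will_VM ,_, operator_NN1 ._.".split()))
# ['Yeah_RF', ',_,', 'I_PPIS1', 'will_VM', ',_,', 'operator_NN1', '._.']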

Rule 3–Pragmatic markers (tag PM): this category gathers linguistic expressions whose predominant functions are dialogic organisation (e.g. structuring and turn-taking), the management of the relationship between the interlocutors, and the expression of the speaker’s stance and intentions. The list of English pragmatic markers was obtained from the literature on the matter (cf. Beeching, 2016; Carter & McCarthy, 2006; Furiassi, 2021, among others; Başol & Kartal, 2019; Chaume, 2004; Forchini, 2010, for audiovisual dialogue) and includes the expressions well, just, you know, sort of, I mean, I think, anyway(s), so yeah, so, though, you see, really, okay then. These expressions were kept separate from their literal, non-pragmatic uses thanks to syntactic clues such as position in the sentence and punctuation. Since the Python script does not distinguish between specific functions of pragmatic expressions, the generic label pragmatic markers was chosen over discourse markers, considering that the former term is taken to encompass a wide variety of linguistic items with pragmatic function, including discourse markers (for a detailed discussion of the labels of pragmatic categories see Aijmer, 2013; Aijmer & Simon-Vandenbergen, 2011; Brinton, 2008; Fedriani & Sansò, 2017; Waltereit & Detges, 2007).

Rule 4–Conversational formulas (tag CF): also called conversational routines (Aijmer, 1996; Coulmas, 1981; Firth, 1972), these are expressions typical of conversation with little informational value compared to their paramount procedural and interpersonal role aimed at social cohesion. They are called ‘routines’ or ‘formulas’ given their nature as pre-fabricated linguistic expressions with a generally accepted use which takes into account appropriateness in context (e.g. nice to meet you). The list of conversational formulas for the Python script draws on the literature by Coulmas (1981) and Firth (1972) on spoken English and by Bonsignori et al. (2011) on film dialogue. It includes the following expressions of greeting, leave-taking and good wishes:

Hi, hello, hey, oi, yo, aye aye, hiya

How are you? (Are) you okay? ((are) you) Alright?

Nice to meet you, how do you do, pleasure to meet you, my pleasure, come on, please, if you don’t mind

(Good)bye, see you (later), take care, cheers, farewell

Thanks, thank you (very/so much), good to see you, so long, long time no see, good luck, welcome (back), (you’re) welcome, any time, of course, (good) evening, (good) morning, (good) afternoon.

Rule 5–General extenders (tag EXT): general extenders are related to the concept of vagueness in language and designate the expressions occurring at the end of a list which suggest that the list could go on without all of its members being specified (e.g. and so on, and stuff like that), relying on the fact that the interlocutor can and will fill in the gaps (Crystal & Davy, 1975; Overstreet, 1999). General extenders are relied on in film dialogue in order to convey orality and naturalness (Zanotti, 2014). The list of general extenders for this rule includes: or something, and everything, and things (like that), and stuff (like that), such and such, and so on (and so forth), et cetera. The only syntactic criterion used for the identification of general extenders is their position in the right periphery of the utterance, as in the sketch below.
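A minimal sketch of such a right-periphery match, assuming CLAWS-style ‘word_TAG’ tokens and an abridged extender list, could look as follows:

# Sketch of Rule 5: retag a line-final general extender as EXT.
GENERAL_EXTENDERS = [              # longer variants listed first
    "or something", "and everything", "and things like that", "and things",
    "and stuff like that", "and stuff", "and so on and so forth", "and so on",
]

def tag_general_extender(tokens):
    # strip trailing punctuation tokens before matching
    end = len(tokens)
    while end > 0 and tokens[end - 1][0] in ",.!?":
        end -= 1
    words = [t.rpartition("_")[0].lower() for t in tokens[:end]]
    for ext in GENERAL_EXTENDERS:
        ext_words = ext.split()
        if words[-len(ext_words):] == ext_words:
            for j in range(end - len(ext_words), end):
                word = tokens[j].rpartition("_")[0]
                tokens[j] = word + "_EXT"
            break
    return tokens

print(tag_general_extender("I_PPIS1 told_VVD him_PPHO1 and_CC everything_PN1 ._.".split()))
# ['I_PPIS1', 'told_VVD', 'him_PPHO1', 'and_EXT', 'everything_EXT', '._.']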

Rule 6–Contracted verbs: every time a personal pronoun from the list provided to the script (i.e. I, you, she, he, it, we, they) is recognised and is followed by a string belonging to the list of possible verb contractions (i.e. d for would and had, m for am, re for are, ll for will, s for is and has, and ve for have), the pronoun is tagged with the corresponding personal-pronoun label (see Appendix A) and the contracted verb is tagged as a modal, auxiliary or copula according to the context (e.g. you_PPYS 're_VBR gon_VVGK na_TO). In contexts in which disambiguation is needed, as in I’ll vs. ill, the script was instructed to consider a capital I to be the personal pronoun, also given that the adjective ill never occurs sentence-initially (thus capitalised) in the corpus. A simplified sketch of the rule follows.
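The following sketch illustrates one way the splitting could work. The tag choices follow the paper's examples, but the mapping is abridged: context-dependent cases ('d as would vs. had, 's as is vs. has) are collapsed to a single tag, and ambiguous fused forms such as possessive its or past-tense were would need extra handling in a real script:

# Sketch of Rule 6: split a fused pronoun+contraction and tag both parts.
PRONOUN_TAGS = {"i": "PPIS1", "you": "PPYS", "she": "PPHS1", "he": "PPHS1",
                "it": "PPH1", "we": "PPIS2", "they": "PPHS2"}
CONTRACTION_TAGS = {"ll": "VM", "d": "VM", "m": "VBM", "re": "VBR",
                    "s": "VBZ", "ve": "VH0"}

def split_contraction(token):
    """Return e.g. 'I_PPIS1 ll_VM' for 'Ill', or None if no match."""
    for pron, ptag in PRONOUN_TAGS.items():
        if pron == "i" and not token.startswith("I"):
            continue  # lowercase 'ill' is the adjective, not I + 'll'
        for contr, ctag in CONTRACTION_TAGS.items():
            if token.lower() == pron + contr:
                head = token[:len(pron)]
                return head + "_" + ptag + " " + contr + "_" + ctag
    return None

print(split_contraction("Ill"))    # I_PPIS1 ll_VM
print(split_contraction("theyd"))  # they_PPHS2 d_VM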

Rule 7–Saxon genitive (tag GE): the script is fed a list of the most common proper names in English and instructed to tag as genitive any recognised proper name immediately followed by an -s and a noun (NN) or a combination of adjective and noun (JJ + NN). This automatically excludes the occurrences in which -s represents a contracted form of is, since is would be expected to be followed by a gerund (e.g. is doing), an adjective not followed by a noun (is beautiful), a complex noun phrase consisting of an article + (optional) adjective + noun (is a beautiful girl), or a past participle (e.g. is said to be). The rule for tagging possessive -s was also extended to indefinite pronouns such as somebody, anybody, nobody, everybody, and the corresponding forms in -one, such as someone, anyone, etc. A sketch of the core pattern follows.
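The fragment below is a plausible reconstruction of the lookahead pattern, with an abridged name list; it is not the authors' code:

# Sketch of Rule 7: a proper name fused with -s is tagged as genitive
# when what follows opens a noun phrase (NN, or JJ followed by NN).
PROPER_NAMES = {"christ", "jimmy", "thelma", "louise", "darryl"}  # abridged

def retag_genitive(tokens):
    for i in range(len(tokens) - 1):
        word = tokens[i].rpartition("_")[0]
        base = word[:-1].lower()
        next_tag = tokens[i + 1].rpartition("_")[2]
        noun_next = next_tag.startswith("NN")
        adj_then_noun = (next_tag.startswith("JJ") and i + 2 < len(tokens)
                         and tokens[i + 2].rpartition("_")[2].startswith("NN"))
        if word.lower().endswith("s") and base in PROPER_NAMES \
                and (noun_next or adj_then_noun):
            tokens[i] = word[:-1] + "_NP1 s_GE"   # detach the genitive -s
    return tokens

print(retag_genitive("For_IF Christs_NP2 sake_NN1".split()))
# ['For_IF', 'Christ_NP1 s_GE', 'sake_NN1']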

Rule 8–Presentatives: the script is instructed to tag as presentative (PRES) all the instances of there followed by any form and tense of the verb to be, e.g. there is, there was, there had been, etc.

Other rules:

Rule 9–Contracted negation (tag XX): when the script recognises that the string nt is attached to a verb (e.g. aren’t, don’t, doesn’t, hadn’t, etc.), it analyses the unit as the verb plus not (see (3) below). The negated future auxiliary won’t is not included here because it is tagged correctly by CLAWS4.

(3)

I don’t know.

I_PPIS1 do_VD0 not_XX know_VV0.

Rule 10–I got to: the expression omits the auxiliary have or had (or the contracted forms 've and 'd); therefore, got needs to be analysed as a past participle. The string I got to, as well as its fused form gotta, is tagged as I (PPIS1) + past participle got (VVN) + to_TO, with the verb following to tagged as an infinitive (VVI).

Rule 11–Distinguishing between subject you and object you: this rule resolves the ambiguity of the second person pronoun you, used for both the subject and the object function, by indicating in the tag which function the pronoun is performing. Subject you is indicated by PPYS and object you by PPYO. The script looks at the position of you in the sentence and at the co-text in order to establish which function the pronoun is performing in a specific context, taking into account variation due to negation and word order in questions. A toy version of the rule is sketched below.
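As an illustration, the fragment below is our own reconstruction covering only the simplest pattern, namely you immediately followed by a verb being tagged as a subject; the negation and question-order cases mentioned above are left out:

# Toy sketch of Rule 11: you directly before a verb is a subject (PPYS).
VERB_PREFIXES = ("VV", "VB", "VD", "VH", "VM")

def disambiguate_you(tokens):
    for i, tok in enumerate(tokens):
        word, _, tag = tok.rpartition("_")
        next_tag = tokens[i + 1].rpartition("_")[2] if i + 1 < len(tokens) else ""
        if word.lower() == "you" and tag == "PPY" and next_tag.startswith(VERB_PREFIXES):
            tokens[i] = word + "_PPYS"
    return tokens

print(disambiguate_you("Here_RL you_PPY go_VV0 ,_, you_PPY got_VVD it_PPH1 ._.".split()))
# ['Here_RL', 'you_PPYS', 'go_VV0', ',_,', 'you_PPYS', 'got_VVD', 'it_PPH1', '._.']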

Accuracy Rates of the Post-processing

Once the new rules were applied to the data, a qualitative analysis of the tagging of the dialogue of Thelma & Louise (Ridley Scott, 1991) displayed a visible improvement in the accuracy of the labelling. On average, CLAWS4 produced an error margin of 10.73% on film dialogue, whereas thanks to the post-processing it is now reduced to 3.16%, which translates into an accuracy of a little less than 97%. Table 1 below shows how the text is tagged by CLAWS4 and how the script changes the tags in post-processing.

Table 1 Pre- and post-processed data (excerpt from Thelma & Louise)

CLAWS4

Sonny_NP1 ,_, you_PPY wan_VV0 na_TO hit_VVI me_PPIO1 up_RP with_IW some_DD syrup_NN1 ?_?

Thank_VV0 you_PPY ,_, darling_NN1 ._.

Maam_VV0 ?_?

Yeah_UH ,_, Ill_NP1 be_VBI right_JJ with_IW you_PPY ._.

Here_RL you_PPY go_VV0 ,_, you_PPY got_VVD the_AT sausage_NN1 ,_, pancakes_NN2 for_IF you_PPY ,_, and_CC theres_NN2 your_APPGE syrup_NN1 ._.

He_PPHS1 s_VBZ over_RP there_RL ._.

Uh-huh_UH ._.

Decaf_VV0 or_CC regular_JJ ?_?

Regular_JJ ,_, please_RR ._.

You_PPY girls_NN2 are_VBR kind_RR21 of_RR22 young_JJ to_TO be_VBI smoking_VVG

,_, do_VD0 nt_XX you_PPY think_VVI ?_?

It_PPH1 ruins_VVZ your_APPGE sex_NN1 drive_NN1 ._.

Ill_NP1 get_VV0 it_PPH1 ,_, I_PPIS1 got_VVD it_PPH1 !_!

Hello_UH ?_?

Hey_UH !_!

PYTHON SCRIPT

Sonny_NP1 ,_, you_PPYS wan_VV0 na_TO hit_VVI me_PPIO1 up_RP with_IW some_DD syrup_NN1 ?_?

Thank_CF you_CF ,_, darling_AF ._.

Maam_AF ?_?

Yeah_RF ,_, I_PPIS1 ll_VM be_VBI right_JJ with_IW you_PPY ._.

Here_RL you_PPYS go_VV0 ,_, you_PPYS got_VVD the_AT sausage_NN1 ,_, pancakes_NN2 for_IF you_PPY ,_, and_CC there_PRES is_VBZ your_APPGE syrup_NN1 ._.

He_PPHS1 s_VBZ over_RP there_RL ._.

Uh-huh_UH ._.

Decaf_VV0 or_CC regular_JJ ?_?

Regular_JJ ,_, please_CF ._.

You_PPY girls_NN2 are_VBR kind_RR21 of_RR22 young_JJ to_TO be_VBI smoking_VVG ,_, do_VD0 nt_XX you_PPYS think_VVI ?_?

It_PPH1 ruins_VVZ your_APPGE sex_NN1 drive_NN1 ._.

I_PPIS1 ll_VM get_VV0 it_PPH1 ,_, I_PPIS1 got_VVD it_PPH1 !_!

Hello_CF ?_?

Hey_CF !_!

Comparing the two versions makes it possible to identify the expressions whose tags were modified by the script: thank you is now tagged as a conversational formula (CF); darling and maam are tagged as address forms (AF) rather than as a noun (NN1) and a verb (VV0); yeah is now a response form (RF) instead of an interjection (UH); the contracted form I’ll, originally tagged as a proper noun (NP1), is now correctly parsed into the first person pronoun I and the verb will; a similar operation is applied to there’s, originally tagged as a plural noun (NN2), which is now parsed into the presentative there followed by the copula is; and so on.

The improved accuracy of the automatic tagging was achieved thanks to the integration of grammatical and pragmatic categories of analysis, which made it possible to distinguish items that mainly contribute to the propositional content from items with a predominant pragmatic function. By doing so, not only is the tagging of dialogues more reliable, but it also allows the automatic retrieval of pragmatic items from a text.

Conclusion

This study has dealt with the issues and solutions related to a pilot study on tagging a corpus of film dialogue (namely the PCFD). The application of the software CLAWS4 and the manual check of the tagged output highlighted a number of parsing and word-class assignment errors. In order to solve consistent tagging errors, we developed a Python script which corrects the wrong tags in post-processing. The tagging error analysis also yielded grounds for methodological discussion: it was noted how some linguistic expressions could not fit the word-class category they were assigned to by the tagger because they performed pragmatic functions. Applying a tagging software to a corpus of film dialogue has proven particularly fruitful in this sense, since the dialogues contain frequent pragmatic features typical of conversation, such as pragmatic markers, interjections, conversational formulas, etc. Noting how linguistic expressions contributing propositional content and those performing pragmatic functions are generally integrated in the dialogue analysed for the pilot, a likewise integrated approach to tagging was proposed, whereby the script was instructed to tag both word classes and pragmatic categories and to show them in the same output. This setting makes it possible to automatically distinguish between propositional and pragmatic uses of the same expression (e.g. well, I mean, sure, etc.) and represents a valuable shortcut for corpus-based pragmatic studies of film dialogue as well as spoken language, which generally require manual annotation of pragmatic features.

At the beginning of the paper, four research questions were formulated which will be answered below:

Q1: What is the accuracy rate of automatic CLAWS POS-tagging on film dialogue?

Following a qualitative analysis of the performance, the accuracy rate of the POS tagger on the dialogue of the film Thelma & Louise (Ridley Scott, 1991) corresponds to 89.27%. This differs from the accuracy percentages presented on the website of the CLAWS tagger, namely 98.86% for written language and 98.83% for spoken language.

Q2: What are the most frequent tagging errors?

The most frequent tagging errors belong to two categories: the first gathers mislabelling and parsing errors due to the text format or to context-bound disambiguation, such as the failure to recognise contracted verbs, auxiliaries and genitive -s; the second concerns the issues related to labelling linguistic items with a predominantly pragmatic function, which are treated by the POS tagger as prototypical lexical items (with the exception of interjections). These are pragmatic markers (e.g. you know, well), address forms (e.g. darling, honey), conversational formulas (e.g. thank you, see you later), general extenders (e.g. and stuff like that) and response forms (e.g. yes, yeah, nope).

Q3: How does the methodology of tagging film dialogue and/or the tagging procedures need to change in order to obtain satisfactory accuracy rates?

Besides fixing the software errors related to input reading, we found that improved results could be obtained by designing ad hoc tagging categories for predominantly pragmatic expressions. The Python script was instructed to tag the five categories of pragmatic markers, address forms, conversational formulas, general extenders and response forms. For complex pragmatic expressions, such as conversational formulas, which generally consist of two or more words, the software was instructed to keep both the POS tags and the pragmatic tag. This makes it possible to search for the single words making up a complex pragmatic expression by their word class: for example, see in see you later can be retrieved by searching both for verbs and for conversational formulas.

Q4: What kind of improvement does integrating the morpho-syntactic and pragmatic levels of analysis bring to the automatic tagging of spoken language?

Three levels of improvement can be observed: the first concerns the possibility of automatically distinguishing between mainly-propositional and mainly-pragmatic expressions in spoken text analysis; the second concerns the accuracy with which the tagging is carried out in a text genre in which the pragmatic dimension is paramount and thus cannot be reduced exclusively to morpho-syntactic criteria; the third concerns the theoretical underpinning of such an integration, as it supports and fosters the idea that grammar and pragmatics are continuously integrated and cannot be kept separate in any analysis of spoken language.