1 Introduction

This article investigates theme detection in character speech of dramatic texts. It does not aim to contribute to drama(-historical) research, but introduces two methodological innovations to computational literary studies: We propose a procedure for the systematic comparison of text-analytic methods even if a gold standard does not exist or is difficult to create. We employ this procedure for the comparison of several implemented systems that assign some notion of ‘theme’ to a text segment. Henceforth, we use the term ‘theme’ to describe the content of character speech and the term ‘topic’ exclusively for the outcome of topic modeling. ‘Theme’ represents the goal, and ‘topic’ the outcome of one specific method (that may or may not achieve this goal).

The lack of reference corpora with annotated categories that are of relevance for literary studies is a methodological bottleneck for the development of text processing tools for literary texts. Existing and publicly available corpora allow researchers to test and/or train their tools, and to assess their quality. The existence of such corpora in other domains and with non-literary categories has been the driving force behind the progress made in computational linguistics over the past years. Creating reference corpora for literary phenomena, however, is a challenging task, for multiple reasons.Footnote 1 One reason is the nature of literary texts, as they are often deliberately ambiguous and open for different interpretations. In addition, literary categories are hard to define exactly,Footnote 2 which is often only uncovered during annotation.Footnote 3 In this article, we would like to offer a different perspective on the issue: Knowledge that has been established in literary studies might not be defined as exact as in natural science, but with the right procedure it can still be employed for method development.

The procedure we propose can be outlined as follows: We first formulate a set of postulates about dramatic texts. These postulates combine a structural element of the dramatic text with a non-interpretive hypothesis about the thematic content of this text element. For instance, we postulate that continuous copresence of characters leads to similar thematic content of their dramatic speech. The procedure does not require these postulates to be true for every drama. Though, we do maintain that they are plausible assumptions for a large number of dramatic texts and that results obtained by tuning or training on these postulates would be suitable for poetological or drama historical statements.

Dramatic texts are well suited for this analysis, because they are strongly structured by nature (compared to, e.g., prose texts) and this structure is encodable in a machine-readable fashion. The speech of characters is unfiltered by a narrator, the entire text is segmented into acts and scenes. Unsurprisingly, the dramatic structure has always played a role in the literary analysis of dramas. Many theoretical predictions are rooted in the drama structure that makes both the formulation of general postulates and their operationalization straightforward.

Most work in the past has been either analyzing the dramatic structure or the content of dramatic language. For social network analysis, one typically extracts co-presence information (which characters share the stage how often?) from the texts.Footnote 4 On top of the relational information encoded in networks, Algee-Hewitt uses betweenness centrality to make interesting observations.Footnote 5 Genre analysis, as for instance done by Schöch,Footnote 6 on the other hand, ignores most of the structure and just takes the text as a whole: All characters’ voices are merged into one.

There are only a few works that actually combine structure and content information: Nalisnick/Baird have employed sentiment analysis on the character speech to determine the turning point in the relation between characters (e.g., between Hamlet and his mother).Footnote 7 Bullard/Ovesdotter Alm describe corpus-linguistic investigations into two Scandinavian authors and determine how certain character classes are represented linguistically.Footnote 8 Karsdorp et al. focus on one specific character relation and describe a system to determine the most probable romantic partner for characters.Footnote 9

This paper is structured in the following way: In Sect. 2, we discuss conceptually how we are going to compare systems on a task for which a gold standard does not exist. Section 3 introduces the theme detection systems. In Sect. 4 we discuss four postulates about dramatic texts, against which we compare the systems. Section 5 outlines the experimental setup and introduces formal and notational basics. In Sect. 6, each postulate is covered by one subsection, discussing its operationalization and experimental results. Section 7 concludes.

2 Comparing Systems without a Gold Standard

In order to compare the performance of different text analysis systems, one would typically presuppose the existence of a gold standard, against which the systems can be tested. While having a gold standard would be ideal, creating one is difficult in this case for two reasons that are related with the target phenomenon ‘theme’.

Defining a list of themes as annotation categories would in principle allow us to annotate text segments (e.g., character utterances) with such categories. Cast as a supervised classification problem, it would be straightforward to train a classifier that detects and distinguishes these themes. During the application of this classifier, however, we would only be able to detect the topics as they are defined in the gold standard. We are then not able to discover new themes in the corpus, in addition to the usual dependency on the training data. The requirement of seeing negative data during training (i.e., texts that do not refer to a given theme) makes even binary classification attempts (training one classifier for one theme) difficult, as the absence of a theme may be even harder to annotate than its presence. All this is also related to another challenge: Themes in literary texts are to some extent interpretative, in particular if the theme is deliberately hidden or its detection depends on special historical knowledge. Except for very broad and general themes, it is therefore unlikely that any kind of gold standard will be created in the near future.

The core idea in this procedure is to formulate a number of postulates that are derived from existing scholarly knowledge about dramatic texts. These postulates are assumed to be true in general without having to be true for every single text. Such postulates do need to be operationalized at some point, and structural properties of both the systems in comparison and the postulates make some more suited than others: Postulates that state relations between one variable and another can be operationalized as correlations in a straightforward manner, for instance. Once this is done, the different theme detection systems can be applied onto a corpus and their output be compared with the postulates. However, the interpretation of such results cannot be expected to be trivial, as we are comparing multiple systems with multiple postulates.

Conceptually, this is similar to two existing concepts from natural language processing (NLP): In the paradigm of distant supervision, an un-annotated corpus is annotated automatically, based on structured data that is in some way linked to the corpus. In Husby/Barbosa, training data is created by linking blog articles to Wikipedia domains via freebase entities,Footnote 10 which are in turn used as topic labels for the blog articles. After the data has been established, standard supervised classification algorithms can be applied. The second concept is extrinsic evaluation. NLP tools are often used in pipelines: The output of a part of speech tagger is fed as input to a dependency parser (the downstream task). If there was no gold standard for part of speech tags, one could evaluate a part of speech tagger by comparing the performance of a dependency parser with and without the input of the part of speech tagger (assuming we have an evaluation method for the dependency parser).

If formalized in an appropriate way, existing literary knowledge can actually be re-used as a ‘distant gold standard’. Systems could be optimized to reproduce decisions and distinctions made by literary scholars.

3 Theme Detection Systems

We will be using two kinds of systems in this article: Dictionary-based systems use a dictionary to detect words in text that belong to certain keywords in this dictionary. Dictionary-based systems require the existence or creation of a dictionary, which pre-determines the themes one is able to find in texts. While defined themes might still appear in unexpected places, it is impossible to detect completely new themes. The findings, however, can be interpreted in a straightforward way, as resulting numbers represent the (relative) amount of words belonging to a certain keyword in a text segment.

Latent Dirichlet Allocation (LDA, referred to here as „topic modeling“, TM) is an unsupervised statistical method that aims at uncovering textual patterns by inspecting co-occurrence of words, typically in large corpora. Using topic models, one can discover new and unexpected themes in a corpus. However, because documents in topic models are represented as distributions over distributions over words (see below), they are difficult to interpret. As other researchers have discovered,Footnote 11 many of the resulting topics do not represent themes at all, but formal or structural properties of the documents. In addition, one has to specify the number of topics in advance.

To give an overview, Table 1 shows the systems and system variants in comparison. Word classes shows the set of parts of speech that have been taken into account for the analysis. The dictionary-based systems only work with content words (nouns, verbs, adjectives) and ignore function words. Avg. size shows the average size (and standard deviation) for keywords (dictionary methods) or training documents (topic modeling). Total size gives the total number of entries (in the dictionaries) or number of training documents (for topic modeling).

Table 1 Overview of the theme detection systems. A detailed description of the columns can be found in the text

3.1 Dictionary-based Approaches

A dictionary in this context is defined as a set of keywords with associated entries. The entries are semantically related to the keyword, but not necessarily synonymous. The keyword “theater”, for instance, would contain entries like “stage”, “actress”, “playing” etc. The dictionaries cover different word classes, historic, stylistic and spelling variations. They contain all entries as lemmas.

We will compare three different dictionary-based systems. The major difference between the systems lays in the nature and source of the entries. All systems assign a score for each keyword in the dictionary, based on the number of words that belong to this keyword. This score is normalized for the segment length, and, if word set sizes are differing largely, by dictionary size.

Custom Dictionary

In the first variant of this method, we use a dictionary that has been collected by a domain expert (one of the authors of this paper) for the specific purpose of plays. As described in Willand/Reiter,Footnote 12 we manually created a dictionary for each of the five keywords that we consider exceptionally relevant for the analysis of dramatic texts from our research period (1730 to 1930): “family”, “war”, “love”, “reason” and “religion”. Each of the sets contains 60 to 115 words, which sum up to 444 entries in total. Time-span related meaning variation has been taken into account. The creation of the sets was done in an iterative manner: Starting from the word “love” we looked for words in it’s immediate context with the character speech that were then checked for their contexts.Footnote 13

An illustrative example (taken from ibid.) on the use of these dictionaries is shown in Fig. 1. Each corner in the ‘spider web’ represents an entry in the dictionary field and each node represents the extent to which a character uses words that are grouped under this keyword. As Fig. 1 shows, a custom dictionary is capable of giving information about the thematic distribution of character speech in the dimensions defined in the dictionary. Apparently, the lovers mostly talk about love and the parents mostly talk about family (with respect to the five thematic fields). The crucial point of this method is that one has to have an idea of relevant themes before doing any analysis. This allows domain experts to incorporate their previous knowledge but has the downside that it limits what we can find to what we have defined before.

Fig. 1
figure 1

Romeo’s and Juliet’s speech (left) and Juliet’s parent’s speech (right) analyzed using a custom dictionary

Enriched Custom Dictionary

As second variant of the custom dictionary approach, we use the previously built custom dictionary as a set of seed words and add words that are distributionally similar. This similarity is measured as distance over word vectors created with the tool word2vec,Footnote 14 integrated as an R package wordVectors.Footnote 15 The word vectors have been trained on a corpus of German literary fiction.Footnote 16 Although a different literary genre (prose), we selected this corpus for its size and because it covers the historic period in question. The enriched dictionaries contain the same words as before, but additionally all words that have a (cosine) similarity of \(0.5\) or higher. This results in between 136 and 212 entries per keyword. In total this dictionary contains 893 entries.

GermaNet Dictionary

As a third, technically similar option, we test the use of GermaNet 11Footnote 17 as a dictionary source. GermaNet is a concept network, primarily used for (computational) linguistic purposes (and developed by computational linguists). A concept in GermaNet may or may not be lexicalized: Systematic gaps in the hierarchy have been filled with concepts, even if no German word exists for this concept. The relations between concepts are lexical relations like hypernymy/hyponymy or various forms of meronymy. The hypernymy/hyponymy-relations can be used to form a concept tree based on inheritance. While there is a technical top node in this tree, the concepts directly below this node are grouped into so-called semantic fields (see GermaNet Section in the Appendix).

Due to the concept hierarchy, one could in principle operate on different granularity levels. As the other systems are using a limited (and rather small) number of dimensions, we chose to use the semantic field classification, and extracted all realized words that belong to these fields (from anywhere in the hierarchy). This results in a dictionary consisting of 54 keywords, each of which containing 85 to 18,070 lemmas. In total this dictionary contains 142,815 entries.

3.2 Topic Modeling

Topic modeling has been introduced by Blei et al.Footnote 18 The core idea is to model topics as distributions over words (i.e., a word has a certain probability to be in a topic), and to model documents as distributions over topics. A document is then a combination of various topics, each assigned a probability. When training a topic model, one needs to fix the number of topics the model should consider.

Within digital humanities, topic models are widely used as a means to discover thematic structure of text corpora. Underwood/Goldstone use topic models to study the history of the Publications of the Modern Language Association (a scientific journal);Footnote 19 Rhody studies figurative language in poetry,Footnote 20 Quamen/Hjartarson investigate the personal correspondence of two Canadian writers,Footnote 21 and Schöch studies dramatic genre (and sub-genre) using topic modelsFootnote 22 (this list is not exhaustive).

For our experiments, we employ topic models in two variants, each with a low and high number of topics (20 vs. 100 topics). In the first variant, we train the topic models on full text level: Each dramatic text is taken as a single document, containing the voices of all characters as one (stage directions and speaker names are removed). In the second variant, we consider the speech of every character as a new document. The entire speech of Emilia from Emilia Galotti, for instance, is put together in a new document, containing only her speech.

4 Postulates about Dramatic Texts

In this section, we substantiate our understanding of what we call ‘postulates about dramatic texts’, focusing on their motivation. Their operationalization will be discussed in Sect. 6.

4.1 Copresence

Following from the intuition that conversations on stage share a common theme (or set of themes), we expect that two copresent characters (which are characters with shared stage presence) share themes in their speech. This has never been subject to drama-historical postulates or poetological claims but seems plausible for our setting. In addition, it is in line with previous observations we made in former work on the translation of Shakespeare’s Romeo and Juliet:Footnote 23 Using the custom dictionary-approach to compare Shakespeare’s characters with those from German plays, we found that in dramatic genres with small ensembles—like the Bürgerliche Trauerspiel—the character speech of the most dominant characters is similar in terms of thematic distribution. That seems to be a result of ensemble size. The smaller it is, the greater the probability that characters spend time together on stage. We also found that copresence and themes of speech seems to be linked in single pairs of characters. Striking examples are in this case butlers and servants, which are often peripheral or satellite characters: Their only network connection is the one to their lord.Footnote 24 In the examples we give, this connection is usually not the only one, but it is the strongest. E.g., in Schillers Kabale und Liebe, the distribution of themes in the speech of Präsident von Walter and his butler is very similar. The same applies for Julieta and the nurse in Shakespeare’s Romeo and Juliet and even for Hettore Gonzaga and Marinelli in Emilia Galotti.

4.2 Scene Boundaries

The postulate about scene boundaries leads back to poetological controversies about the function of scenes that started with the classical unities of action, time and place. The latter was not introduced by Aristotle, but by J. C. Scaliger and adopted by the French doctrine classique in the 16th and 17th century. The neo-Aristotelian ideal of what Volker Klotz calls “geschlossenes Drama”Footnote 25 postulates a causal development of plot and the liaison des scènes.Footnote 26 This means that two adjacent scenes are connected by at least one character that remains present. This is supposed to ensure the unity of action. Gottsched—an assertor of the French tradition—describes it as follows:

Schlüßlich muß ich erinnern, daß die Auftritte oder Scenen in einer Handlung [meint “Akt”] allezeit mit einander verbunden seyn müssen: damit die Bühne nicht eher gantz ledig werde, bis die gantze Handlung aus ist. Es muß also aus der vorigen Scene immer eine Person da bleiben, wenn eine neue kommt, oder eine abgeht: damit die gantze Handlung einen Zusammenhang habe. Die Alten sowohl, als Corneille und Racine haben dieses fleißig beobachtet: aber in unsern deutschen Stücken geht es gemeiniglich sehr bunt durcheinander; und daher kommt es grosentheils, daß die Fabeln nicht recht zusammen hängen.

[Finally I must note that the scenes in an act must always be connected with each other: so that the stage does not become completely empty until the whole action is over. There must always be one person from the previous scene there when a new one comes, or one leaves: so that the whole act has a connection. The old ones as well as Corneille and Racine have observed this diligently: but in our German plays it is generally very colorfully mixed up; and that’s why it comes largely that the fables don't really hang together.]Footnote 27

Even though Gottsched hints at the contemporarily widespread touring companies—that he wanted to be replaced by institutionalized theaters –, he also foresees a drama-historical development that would start two decades later. It basically breaks with the French doctrine classique and sets Shakespeare as the new poetological landmark. This development—which peaks in Sturm und Drang poetics—leads to our postulate about scene boundaries. It combines the structural unit ‘scene’ with the themes of character speech in a scene. If many characters remain on stage, the themes prevalent in the character speech remain similar, so that as Gottsched says, “die Fabeln […] zusammen hängen [the threads are connected].” In contrast, if there is a major exchange of characters between the scenes, the themes spoken in those scenes are likely to change as well.

4.3 Tragic or Happy Ends: Die or Marry

In her 2005 Novel Tenor of Love, the Canadian feminist poet, author and university professor Mary di Michele writes: “Somebody always has to die onstage, die or marry; that’s the only difference between a comedy and a tragedy as far as the world knows.”Footnote 28 In fact, there are a lot of differences between tragedies and comedies and most of them—just as the function of scenes—are object to manifold controversies for more than 2,000 years now. Some of them are plot-related and point to the divergent endings, other differences are bound to the dramatic characters. Their historicity, moral qualities, social status and decorum of speech are the best known. Among others Scaliger, Opitz, Gottsched and Lessing discussed the question, whether middle-class characters have tragic qualities or solely higher ranked characters as kings do.Footnote 29

This postulate brings into focus the ending of the drama. The expectation here is that the sketched differences in events (death or marriage) are reflected in different themes appearing in the character speech. Commenting or describing someone’s wedding implies the appearance of wedding-related themes, while the death of a character can be expected to be discussed in different themes.

4.4 Dynamic and Static Protagonists

Characters in drama can be considered as either dynamic or static.Footnote 30 In antique tragedy, e.g., actors wore masks (persona) that expressed a fixed, stable character property, like a specific emotion. This mask represented the character’s set of values and the way they would deal with what fate intended for him.Footnote 31 A change of mind would neither correspond to the anthropological model of the Greek nor to their poetological understanding of dealing with the dramatic conflict. During the course of drama history, the idea of a static character got more and more replaced by the attempt to form characters in line with modern individuals. For German tragedy, this individualization of characters is usually stated to become a broader phenomenon with the rise of the Bürgerliche Trauerspiel in the second half of the 18th century:Footnote 32 Characters react to the dramatic conflict by changing their mind and values.

This postulate aims at the protagonist of the play and the idea that the thematic design of a dynamic character’s speech is more diverse than the speech of a static character. Since our corpus covers one century starting 1730, we expect it to contain mostly dynamic protagonists and aim for automatically determining their dynamicism.

5 Experimental Setup

5.1 Formalization

Each theme detection method produces a vector for a given input text. For the dictionary-based methods, this is the (normalized) number of words in the respective set. The methods using topic modeling return a vector of probabilities with one element for each topic. We make no assumptions on interpretability of the theme vectors, nor do we inspect the topics manually. Comparing system predictions on texts or text segments typically involves a notion of distance between the vectors. For all experiments, we employ Euclidean distance to measure distance between vectors and Pearson correlationFootnote 33 to measure correlations.Footnote 34

A key component in the formalization of the postulates is the stage presence matrix.Footnote 35 This binary matrix represents scenes in columns and characters in rows. If a character has stage presence during a scene, the cell is marked with a 1 (one). All other cells contain 0 (zero). In our analysis, we equate ‘being present on stage’ with ‘speaking on stage’. This is a simplification, as mere presence on stage is not encoded in a machine-readable fashion in our corpus. Interestingly, this distinction can already to be found in Gryphius’s plays, e.g., in Carolus Stuardus, written about 1650: Not all characters present in a scene do actually speak, and so Gryphius differentiates in the Dramatis Personae of his plays between characters that enter stage ‘as speakers’ and those that enter ‘as mute’ characters.Footnote 36

$$M = \left[ {\begin{array}{*{20}c} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{array} } \right]\;$$
(1)

To give an example, we assume the stage presence matrix shown in Eq. 1. Character 1 is therefore present on stage in scenes one and three. Individual columns or rows can be extracted: The vector for Scene 2 is \(\left[ {0,1,1,0\left] { = M} \right[,2} \right]\) (second column), and the vector for character 3 is \(\left[ {0,1,0\left] { = M} \right[3,} \right]\) (third row).

$$C_{i} = \{ x|M\left[ {x,i} \right] = 1\}$$
(2)

Based on this matrix, we can define the set of characters in a scene (\(C_{i}\)) as shown in Eq. 2: All characters that have a 1 in the respective column. \(C\) will be the set of all characters in the play. We will use both the matrix and set representation for operationalizing the postulates.

5.2 Corpus

All texts are taken from the TextGrid repositoryFootnote 37 and have been written or translated into German between 1730 and 1930. We selected texts that can be classified (mostly) uncontroversially as either being a tragedy or a comedy. A table that lists the entire corpus can be found in the Appendix.

5.3 Preprocessing

The texts and their structure have been converted into stand-off annotations in the UIMA pipeline framework.Footnote 38 Stage directions and character speech have been processed separately, to ensure that inserted stage directions do not break the syntactic structure of the surrounding sentences. This has been achieved with the DramaNLP package.Footnote 39 The texts have been lemmatized and part of speech-tagged, using the Stanford part of speech-taggerFootnote 40 and the Mate lemmatizerFootnote 41, integrated into the DKpro core packageFootnote 42. Speaker attributions have been made using manually specified mappings in difficult cases and string overlap heuristics in other cases.

After preprocessing, the texts have been exported into CSV files and are analyzed with RStudio, using the R package DramaAnalysis.Footnote 43

6 Experiments

In this section, we will describe in detail the operationalization of the postulates introduced above and the results, separately for each postulate.

6.1 Copresence

Postulate: Copresence of characters leads to similar thematic profiles of their speech.

Operationalization

Given the scene presence matrix \(M\) for a play, we first create a copresence matrix by multiplying the presence matrix with its transpose: \(M_{c} = MM^{T}\). This (quadratic, symmetric) matrix contains the characters as rows and columns, and the number of scenes in which two characters are co-present as cell values. As a next step, this matrix is converted to a distance matrix by inverting and normalizing with the total number of scenes (\(s\), equivalent to the number of columns in the stage presence matrix): \(M_{dist} = 1 - \left( {M_{c} /s} \right)\).

This distance matrix contains values between 0 and 1. Pairs of characters that appear in more scenes together have lower values (lower distance) than characters that appear rarely together. Character pairs that are not co-present at all have the maximal distance of \(1\), character pairs that are co-present in all scenes have a distance of \(0\). Characters that speak less than 1,000 words are excluded from this analysis. As an example, Table 2 shows the distance matrix for Emilia Galotti. Emilia and Orsina do not share a single scene, which is marked by the maximal value in the corresponding cell. Emilia and Appiani share one single scene, which gives them a distance score of 0.98.

Table 2 Distance matrix for Emilia Galotti

We calculate the correlation between distances based on co-presence and Euclidean distances between theme vectors. A positive correlation indicates that co-presence and shared topics in character speech go hand in hand, a negative correlation indicates the opposite. A value of zero indicates no correlation between the two. The correlation metric reflects how strong this postulate is realized in texts, and there is no way of formulating an exact expectation. High correlations are unrealistic to expect, as exceptions to the postulate come to mind relatively quick. Our discussion focuses on the range between 0.1 and 0.5.

Results

Figure 2 shows the percentage of dramas that fulfill the copresence postulate, for various correlation thresholds (on the x-axis). For instance, just over 75 % of the plays have a positive correlation (strictly \(> 0\)) when using the GermaNet dictionary method.

Fig. 2
figure 2

Percentage of plays that fulfill the copresence postulate, given the correlation threshold and theme detection method

In general, we see similar curves for all methods, although the difference between the best and worst method is up to 10–15 percentage points (for low correlations). For correlations below \(0.5\), dictionary methods achieve the better performances. For higher correlations, the custom dictionary method becomes one of the worst, while enriched and GermaNet are still achieving top performance rates. The four topic modeling variants are outperformed in most settings. It is a surprising result that methods that cover only a fixed set of themes (custom and enriched dictionary) do perform on a similar level as methods that are in principle covering all themes (GermaNet dictionary and topic modeling).

The fact that all methods achieve low correlation scores across the board suggests that this concept (of some copresence leading to similar themes) is hard to grasp for automatic methods. One reason is that characters are usually co-present with more than one other character throughout the play—the entire text a character speaks is actually a mixture of contributions to thematically more or less connected situations (scenes), in which the content is also influenced by other present characters. Another difficulty is that copresence is not limited to two present characters—it’s not unusual that three or more characters interact on stage simultaneously. This polyphony of character speeches influences the speech of each character. None of the methods is designed to cope with these problems.

6.2 Scene boundaries

Postulate: The more characters are exchanged at scene boundaries, the more different the two scenes are thematically.

Operationalization

To our knowledge, two possible ways of quantifying the change of stage personnel have been published: Marcus proposes the use of the Hamming distance on a binary stage presence matrix.Footnote 44 Given two symbol sequences of equal length, the Hamming distance is the number of positions in which the two sequences are different. Applied to two scene vectors extracted from a scene presence matrix, this is the number of characters that leave or enter the stage at a scene boundary.

Hamming distance can be scaled to the interval between zero and one in two ways: Dividing by the total number of characters (\(\left| C \right|\)) allows comparison across dramatic texts, as a play with a large dramatis personae can exchange more characters than one with a small cast (Eq. 3). Trilcke et al. propose the drama change rate as a new measure to quantify this change.Footnote 45 Drama change rate is almost equivalent to the Hamming distance, but it uses the number of characters in the two adjacent scenes to scale (Eq. 4), not the entire cast. In all variants, the measure is minimal if all characters remain on stage. Scaled Hamming distance is only maximal if all characters in the play are affected, e.g., by one half leaving and the other half entering the stage. Drama change rate is maximal if all characters that are present in one of the two scenes are exchanged, irrespective of their proportion at the full dramatis personae.

$$H_{i} = \frac{{\left| {C_{i} \vartriangle C_{i + 1} } \right|}}{\left| C \right|}$$
(3)
$$r_{i} = \frac{{\left| {C_{i} \vartriangle C_{i + 1} } \right|}}{{\left| {C_{i} \cup C_{i + 1} } \right|}}$$
(4)

Draxler describes the scenic difference (szenische Differenz) between two neighboring scenes as the total number of characters in the drama minus the number of characters present in both scenes (Eq. 5).Footnote 46 The resulting number is maximal (i.e., equal to the number of characters in the play), if not a single character is present in both scenes (the intersection becomes zero). It is minimal if all characters in the entire play remain on stage.Footnote 47 The scenic difference can be scaled by dividing by the total number of characters in the play.

$$s_{i} = \left| C \right| - \left| {C_{i} \cap C_{i + 1} } \right|$$
(5)

The difference between the measures can first be characterized as one of scope: Scaled Hamming distance directly expresses how many percent of the dramatis personae are exchanged at this boundary. Drama change rate is a local measure, ignoring whether the change concerns 90 or 10 percent of the dramatis personae. Scenic difference gives a number that is relative to the total number of characters in the dramatis personae. Only if all characters in the play are present in either one of the scenes (\(C = C_{1} \cup C_{2}\)) or the number of remaining characters is zero (\(C_{1} \cap C_{2} = \emptyset\)), the measures produce the same number. The difference between the measures grows with the number of characters in the dramatis personae, and with the stability of the stage personnel.

It is noteworthy that Draxler’s scenic difference sometimes produces counter-intuitive results: Consider two scene boundaries as an example. A single character is entering the stage at the first, while a group of characters is entering at the second. The intersection of characters across the boundaries will remain the same at both boundaries, leading to the same scenic difference—although the group entering at the second boundary might be as large as the entire set of characters already on stage.

Figure 3 shows plotted Hamming distance/change rate/scenic difference (dashed lines) and thematic distance values (solid line) for the Lessing play Emilia Galotti and the Schiller play Jungfrau von Orleans. The vertical lines depict act boundaries. Broadly speaking, the measures make similar movements at most scene boundaries. Act boundaries show higher rates. Scenic difference and Hamming distance values are generally in a smaller interval, as they express change rates relative to the entire set of characters.

Fig. 3
figure 3

Drama change rate, scenic differences and distances calculated based on a custom dictionary for two plays. Vertical lines represent act or prolog boundaries

The difference between Hamming distance and drama change rate (that results from different scaling methods) becomes apparent between scenes 17–20 in Die Jungfrau von Orleans (bottom plot). Drama change rate shows a continuous maximal line (as all characters are exchanged at the boundaries), while the scaled Hamming distance differentiates between 15 and 20 % of the characters being exchanged.

According to the postulate, we expect a correlation between the thematic distance (solid line) and one of the change rates (dashed lines). We do not expect a perfect correlation (this becomes obvious in the examples). In some cases, the same set of themes is continued across scene boundaries (at least according to this theme detection method) even if characters are exchanged (e.g., across the first act boundary in Emilia Galotti). In some parts, however, we see a similar movement (e.g., the peak in the last act of the Jungfrau von Orleans).

Technically, we calculate Pearson correlations between one of the measures at each scene boundary, and the Euclidean distance between the scene before and after the boundary. Scene boundaries that are also act boundaries are not removed, as we expect them to behave in line with the postulate.

Results

Figure 4 presents the results in a similar way as for the previous postulate. Again, the x-axis shows the correlation threshold and the y-axis the number of texts for which we find a positively correlation according to the postulate. The left plot shows the positive correlations with the drama change rate, the right plot the positive correlations with scenic difference.

Fig. 4
figure 4

Percentage of plays that fulfill the scene boundary postulate, given the correlation threshold and theme detection method

The most striking observation here is that the results are better differentiated when correlated with the scenic difference. In terms of method ranking, however, the results are similar: The custom dictionary method clearly performs worst, with scores over 10 pp. behind the system ranked second last. To a great extent, this is probably related to the limited size of only five discriminatory topics. That they might be relevant for a lot of plays does not implicate, that they are the only themes discussed in plays. Themes like death or nature are simply not recognised by the custom dictionaries. Enriched dictionary performs a bit better, but is still outperformed by the others, at least up to a correlation coefficient of 0.4. The different variants of topic modeling perform all similar—depending on the correlation level, one or the other variant is slightly in the lead. A varying number of dimensions does not make a big difference, although topic models with more dimensions are slightly in the lead.

The GermaNet dictionary method performs best if correlated against the scenic difference. One possible explanation could be that the ‘counter-intuitive’ scores this measure sometimes produces (see above) are actually in line with the postulate. Characters entering the stage might not make a big difference in terms of discussed themes.

A further possible reason for generally low correlation scores might be the diverging understandings of the function of scenes in drama history. While boundaries between acts in almost all plays mark a change of time or place, boundaries between scenes are known to be ambiguous.Footnote 48 That is mainly because the sketched French and German post-Gottsched tradition in poetics and playwriting differ in terms and concepts. In the French tradition coming from Scaliger, a play is structured by scène and acte. The German tradition since 1750—which leads back to Shakespeare—distinguishes Auftritt, Szene and Akt.Footnote 49 The French acte corresponds to the German Szene and Akt by the mentioned time/place-change.Footnote 50 In our understanding, a scene is not necessarily a plot-related category. It is less likely to be plot-related in the French scène, where entries (Auftritte) or exits of only one character cause scene boundaries. But it is more likely to be plot-related in Shakespeare’s tradition, where scene boundaries demark a change of the complete setting and the new location might imply other themes, even though the same characters are on stage.

6.3 Tragic or happy ends: Die or marry

Postulate: Tragedies and comedies are distinguishable based on the theme of the ending.

Operationalization

This postulate focuses on the ending. We therefore extract the ending from each text by considering only the last act, the last scene, or the last two scenes (determining the actual ending of a literary text is surprisingly difficultFootnote 51). We consider the entire text in this ending, and disregard individual characters. Using the different methods, we generate theme vectors for each ending, and cluster the resulting vectors hierarchically, using group-average linkage. We analyze the resulting cluster purity at multiple stages in the hierarchical clustering process, which results in differing number of clusters.

Figure 5 shows a hierarchical clustering of several plays as an example, based on the last act. Tragedies are marked with a T, comedies are marked with a C. In the bottom part, three of the four tragedies are grouped together nicely. In the top part, a comedy and a tragedy have been judged as similar. The vertical lines in the plot indicate possible stages to extract the actual groupings. Extracting the clusters at the short-dashed line yields two clusters, extracting at the dotted line yields three and the long-dashed line four clusters.

$$P\left( {C_{i} } \right) = \frac{{\mathop {{\text{max}}}\limits_{j} p_{i}^{j} }}{{\left| {C_{i} } \right|}}$$
(6)
Fig. 5
figure 5

Hierarchical clustering of the endings of five plays (four tragedies, one comedy). Vertical lines show groupings that can be extracted

Cluster purity is calculated as shown in Eq. 6, with \(C_{i}\) being a cluster, and the numerator the number of instances of the majority class in the cluster. In essence, cluster purity is the percentage the majority class in each cluster has. If one would use the clustering to assign class labels to each instance, cluster purity expresses the classification accuracy. Since our corpus contains more tragedies than comedies, a majority baseline would achieve a classification accuracy of \(54.72\%\) (which is equivalent with the cluster purity when putting all endings in a single cluster). In the example in Fig. 5, the clusters at the short-dashed line (left) have an average cluster purity of \(0.75\), the two other stages an average purity of \(1\).

The reasoning behind looking at multiple numbers of clusters is that both tragedy and comedy are broad, generic genres that combine many sub genres. A system might be better at distinguishing more fine-grained classes (as the one in Fig. 5), which would still be indicative of its performance for theme detection.

Results

Figure 6 shows the results for the different systems and ending variants. In each plot, we see the number of clusters on the x-axis, and the achieved cluster purity on the y-axis.

Fig. 6
figure 6

Cluster purity for hierarchical clustering based on complete linkage and Euclidean distance

When considering the last act as the ending, two groups of systems can be discerned: The systems based on topic models, and the dictionary-based systems. The former outperform the latter clearly. For the distinction between the genres, the low-dimensional topic model (TM) trained on entire texts performs best. A jump in performance can be observed when distinguishing between six genres: The topic models trained on entire texts (both variants) achieve a cluster purity of about 90 %. Increasing the number of groups further has only minor influence on this score.

The picture changes slightly when considering only the last scene as the ending. The low dimensional TM trained on entire texts achieves high purity early on. Interestingly, the higher dimensional TM achieves a similar performance only for five and more genres. Character-based topic modeling is generally lower, but in a similar range.

If we look at the last two scenes understood as the ending, we observe different results. The gap between the best and the second-best system has been substantial (over ten percentage points) in the previously discussed plots. This gap does not exist in this setting. The topic models perform similarly as the dictionary-based systems. Only for more than five to seven genres, a gap between topic modeling and dictionary-based methods can be discerned.

One conclusion to draw here is that the ‘ending’ of a dramatic text is better represented by either the last scene or last act, but not multiple scenes. A possible explanation could be that the second last scene leads to more heterogeneous themes or introduces ‘non-ending material’. More material might counterbalance this effect, leading again to increased performance when considering the entire last act. But this might not be related to the dramatic ending in particular. The additional text could be distinctive for the dramatic genre without any thematic connection to the prototypical drama endings. In terms of system choice, it is clear that topic modeling performs much better than the other methods.

6.4 Dynamic and static protagonists

Postulate: Themes in the main characters speech are changing over the course of the play.

Operationalization

To test against this postulate, we focus on the protagonist of the play. Moretti points out that identifying the actual protagonist is far from trivial—no uniform definition exist, and more complex definitions are difficult to operationalize.Footnote 52 We therefore opt for a very simple heuristic, and select the most ‘talkative’ character, i.e., the character that speaks the most words per play.

In the three plays depicted in Fig. 7, Sara and Johanna are clearly the protagonists in Lessing’s Miß Sara Sampson and Schiller’s Die Jungfrau von Orleans—and in both plays, they are the characters with the most spoken words. There are, however, cases in which human readers do not identify the most talkative character as the main one. In Lessing’s Emilia Galotti, the character speaking the most is Marinelli, chamberlain of the prince. The prince himself, Emilia’s father, or Emilia, who would all be candidates for protagonists, speak, however, less then Marinelli. In line with our attempt of working with the vagueness engrained in literary works, we rely on the fact that this heuristic works in many other cases and identifies—if not the antagonist or at least one of the group of antagonists—a very present and therefore important character.

Fig. 7
figure 7

Speech distribution in three plays

Given the main character in this fashion, we extract this character’s speech from the first, second, fourth and fifth act. As the key turning point in the character development is supposed to take place during Act 3, we leave out this act. Plays that do not contain five acts are excluded from this analysis.

The segmented character speech is then processed by the theme detection methods, which results in a theme vector for each act (for each main character). We then calculate Euclidean distance between the vectors and compare distances before/after and across the change. A distance before the change would, e.g., be the distance between first and second act, a distance across the change one between the first and fourth act. The expectation here is that for dynamic characters, the distance across the third act should be higher than before or after the third act.

Results

The results of this experiment are shown in Fig. 8. Each plot shows the (normalized) distance values between one act pair from within the first or second half (x-axis) and one act pair across the third act (y-axis). Each symbol represents one drama and its act pair distances. The straight lines are linear trend lines, set in such a way that the gap between the line and the corresponding points is minimal.

Fig. 8
figure 8

Comparison of thematic distances between act pairs

A system that is able to fully represent the postulate would be shown as many points appearing in the top-left triangle, to the left or top of the diagonal. This would be caused by distances across the third act being larger than distances before/after the third act. The corresponding trend line would be relatively steep and cross the plot boundary at the top.

It is relatively clear that this is in general not the case—the only systems with (relatively) steep trend lines are the GermaNet-based dictionary system and the high-dimensional character-based TM when comparing distance between act pairs 1, 4 and 4, 5 (bottom left plot). All other systems in all other settings produce gentle trend lines. In general, the methods are seemingly not able to grasp the character development appropriately.

We see two possible reasons for this bad performance: The postulated dynamics of characters might not be present on this granularity level. Developments within acts remain hidden in our analysis but could be perceived much stronger by audience or readers. This might be the case, if a character’s change of mental attitude is expressed in opposing behavior: If an ‘I will not do this’ in the exposition becomes an ‘I will do this’ in the final act, theme identification methods don’t stand a chance. Secondly, most of the poetological claims involving character development focus on tragedies, and it is unclear whether this postulate is supposed to hold on comedies.

7 Conclusions

This article systematically explores different ways of measuring the themes that appear in character speech in dramatic texts. The focus of this article lays on the methodology and the development of such methodology, and not on the interpretation or use of results coming from such methods. Development and use of methodology (or tools in this case) are two distinct ways of doing computational literary studies, and it is important not to confuse the two. In order to be able to judge the quality of any method or tool, one has to have some kind of gold standard against which results are compared—even if the gold standard is not realized in a corpus or annotated data set. And if one is using a certain methodology or tool, one needs to be sure that the tool is actually doing what it is supposed to. Evaluating a tool and interpreting its results at the same time is dangerous, as it may lead to results that are not generalizable. As we have presented here, we believe it is possible to evaluate systems based on a not explicitly annotated “gold” standard.

We have compared different methods that assign themes to segments of dramatic character speech. It may not be surprising that there is no clear winner: The notion of theme is sometimes interpretative, the postulates we defined challenging, the methods not adapted to dramatic texts. We can, however, discern some strengths and weaknesses of the methods:

  • Topic modeling can be applied to short text segments without severe loss of performance. Topic models that are trained on individual character’s speech perform in general similarly as models trained on entire texts, sometimes even better. We see no general difference in the scene boundary and ending postulates. Only for the copresence postulate, character-based topic models achieve less performance than models trained on the full text: This is surprising, given the fact that the text segments under investigation here are character speeches.

  • A key difference between the GermaNet dictionary and the other dictionary methods is that the others cover only a fixed set of themes, while GermaNet is much larger and covers (in theory) most of the (modern) German language use. This difference does not lead to huge performance differences, GermaNet dictionary performs similar to the others in the all postulates. The only exception to this trend is the higher performance when correlated with scenic difference in the scene boundary postulate, but this might be more related to scenic difference than to the dictionary.

  • Higher dimensionality of the topic models does not lead to increased performance automatically. The only case in which the higher dimensional TM performs better than its counterpart is for the third postulate on the last two scenes. In this case, the character-based TM performs better with 100 dimensions than with 20. In all other cases, the high dimensional TM is either outperformed or performs very similar to its counterpart.

The second main motivation for this paper was to experiment with a new kind of gold standard. The postulates we defined are thought to be true in general, which does not mean that they are true for every single text. Since many literary statements are of this kind (if made about a set of texts like a genre), a larger number of postulates can be established, which makes the entire procedure even more fruitful. Calculating correlation scores between postulate and predicted result then incorporates the literary vagueness into the evaluation and does not hide it.

However, there are still multiple reasons to be critical about the postulates we discussed in this article: There might be disagreement on their truth, i.e., one might doubt that a postulate holds true even in general. Apart from our observations in previous work and our intuition, there is little evidence for the first postulate. The role of scene boundaries has been discussed in literary studies (cf. Sect. 4, scene boundaries), but there is no guarantee that playwrights adhere to what normative drama poetics tell them to. The fact that a genre ‘tragicomedy’ exists points out that the comedy/tragedy distinction is not as clear cut as it seems to be. While this might sound like the discussed postulates are not trustworthy, we argue that it is not necessary to believe in them fully. Inspecting different correlation levels, for instance, allows comparison to multiple, and potentially subjective, possible truths.

Canonization in literary studies might be another reason for reservations against using postulates as a gold standard. Many theories and theoretically justified predictions are based on a relatively small and non-representative selection, the literary canon. If one applies a postulate developed on the canon to non-canonical texts, it is indeed questionable whether it can be expected to be true, or even ‘truish’. However, a closer look is necessary: Non-canonical texts might not be different from canonical texts in all aspects. As we have discussed in Reiter/Willand, popular plays, which are mostly not canonized, diverge from canonical plays in many structural aspects.Footnote 53 The corpus used in this work, however, consists mostly of canonical plays.

A second possible critique towards the postulates is that they might not be related to themes. We initially believed that the comparisons discussed above include a thematic change, at least to some extent. The results partly suggest that this is might not always be the case; surprisingly, the analysis of character development is such a case. While the exact extent is an open question to all our method testing scenarios, we can still not decide whether the postulate or the operationalization by theme is the crux of the matter. As an outlook for the future, it would be helpful to apply our methods to synthetic texts. These could—to stay with our example—contain thematically ideal dynamic and static characters and thus contribute to the answer to our questions.

Finally, all postulates need to be operationalized before conducting any kind of experiment. Operationalizing a concept always entails decisions that may lead to a misrepresentation of the concept. Some choices might not even be foreseen conceptually, as is best exemplified by the different quantification methods for personnel change for the scene boundary postulate. We believe that the above described operationalizations are reasonable, without saying that they are without alternatives. However, we also believe a scientific discussion about appropriate ways of operationalizing literary concepts is very fruitful, and that experiments as the ones above bring to light weaknesses and strengths of such concepts. The issue of operationalization is also discussed in Reiter/Willand,Footnote 54 since we believe that progress on operationalizing literary concepts is crucial for tool and method development.

In sum, this article makes an argument for a systematic methodological development. The lack of annotated reference corpora is a challenge but can be circumvented by thoughtful and reflected use of insights derived in literary studies. Directly comparing methods to solve a fixed set of problems reveals that there is not a single winner-method: Depending on assumptions about drama structure and the exact goal, different methods need to be considered.