1 Introduction

This paper introduces the Multilingual Emotional Football Corpus (MEmoFC), a new corpus consisting of pairs of football reports, which can be used for the study of affective language. We present the text corpus in three languages, English, Dutch, and German, combined with the matching football game statistics, as a resource for investigating how (affective) perspective can change reporting about an event. To the best of our knowledge, this multilingual corpus is the first one where objective data and textual realizations from multiple affective perspectives are systematically combined.

Sports reportage provided by sports clubs themselves is arguably one of the most interesting registers available for linguistic analyses of affect-laden language from different perspectives. It opens up room for creative language, starting already with the headlines of the match reports (Smith and Montgomery 1989). Additionally, the point of view of the author of a match report is clearly definable from the beginning, as it is either a reaction to a tie (that might still be perceived as a net loss or win by the team) or, depending on the perspective, a loss or a win for the football club. So, it seems reasonable to assume that the different possible outcomes of such a match would also produce different match reports in terms of language and affect. Take for example the following introductory sentences:

  (1) “Peterborough United suffered a 2-1 defeat at Burton Albion in Sky Bet League One action and lost defender Gabi Zakuani to a straight red card during a nightmare spell at the Pirelli Stadium, but what angered all connected with the club happened in the final moments of the encounter.” (PB220815, MEmoFC).

    Compared to:

  (2) “If all League One games at the Pirelli Stadium this season are going to be like this it is going to be an entertaining if nerve jangling season.” (BA220815, MEmoFC).

Both describe the exact same match and events, but the affective nuances are completely different. The match resulted in a loss for the British club Peterborough United, as evident in the first example, whereas it turned out to be a win for Burton Albion in the second example. This results in very different affective states shining through in the corresponding reports: while all the frustration of Peterborough seems to be released in a long first sentence (suffer… a defeat, nightmare spell, anger), the winners’ text is shorter and much more positive (entertaining).

In this paper, we describe how the corpus was collected and preprocessed, we give an overview of properties of the corpus, and we explore it with regard to linguistic differences and similarities related to affect in reports about won, lost, and tied matches in English, German, and Dutch using different tools. In the remainder of this introduction, we position the corpus more broadly in the research field studying the influence of emotion on language, and link it to applications in sentiment analysis and affective natural language generation.

1.1 The psychology of language and emotion

It is a general assumption that a text reflects the affective state of the author. Writing a text involves various cognitive processes, and it is commonly believed that affective states influence these cognitive states, and, hence, that they can have a noticeable effect on the resulting text. This idea has been put forward in psychological theories, such as, for example, Forgas’ Affect Infusion Model (1995), which describes how affective states, while seen as different from cognitive processes, “interact with and inform cognition and judgments by influencing the availability of cognitive constructs used in the constructive processing of information” (Forgas 1995, p. 41). Affect infusion is characterized as “the process whereby affectively loaded information exerts an influence on and becomes incorporated into the judgmental process, entering into the judge’s deliberations and eventually coloring the judgmental outcome” (Forgas 1995, p. 39). In this study, we aim to investigate whether the influence that affective states (due to winning or losing) exert on cognition extends to language production.

A limited number of psychologists have studied the role of affect in language. Perhaps most notably, Forgas and colleagues found that the affective state influences the politeness of requests, with people in a negative state being more polite (Forgas 1999, 2013; Forgas and East 2008; Koch et al. 2013). In addition, Beukeboom and Semin (2006) found that people in a negative state used more concrete language, in terms of the Linguistic Category Model (Semin and Fiedler 1991), while people in a positive mood used relatively more abstract descriptions.

Many of these psychological studies relied on controlled experiments and small amounts of manually annotated data. To facilitate and speed up these kinds of studies, Pennebaker et al. (2001) developed an automatic tool for assessing texts in terms of different psychological and linguistic categories, including terms related to valence and emotions: the Linguistic Inquiry and Word Count (LIWC). LIWC is a bag-of-words technique that counts words belonging to one or more categories in its dictionary and converts those frequencies to percentages of all relevant words in the text. It has several attractive properties: its emotion word categories and the associated word lists have been validated through human evaluation (Tausczik and Pennebaker 2010), it can be used with arbitrary datasets, and it requires no pre-processing of the input texts. As a result, LIWC has been used in a large number of psychological studies (Cohn et al. 2004; Pennebaker and Graybeal 2001; Rude et al. 2004; Stirman and Pennebaker 2001) and NLP studies (e.g., Mihalcea and Strapparava 2009; Nguyen et al. 2011; Strapparava and Mihalcea 2017). For example, in a study on language and depression, Rude et al. (2004) analyzed the language of depressed, formerly-depressed, and never-depressed students and found that, as one would expect, depressed participants used more negatively valenced words, but also, perhaps less expected, used the pronoun “I” more frequently than never- and formerly-depressed students. A similar study was conducted on poems written by suicidal and non-suicidal poets (Stirman and Pennebaker 2001), which confirmed the use of the first person singular as related to negative mood. Text analysis, particularly online, for depression detection has been gaining popularity (see, e.g., Morales et al. 2017, or Losada and Gamallo 2020) with potential applications for mental health, such as early depression detection, treatment, and suicide prevention.
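The counting step of such a dictionary-based approach can be sketched in a few lines of Python. This is only an illustration of the general bag-of-words idea, not LIWC itself: the two categories and their word lists are invented for the example, the tokenizer is a simple regular expression, and percentages are taken over all tokens.

```python
import re
from collections import Counter

# Toy category dictionary, invented for illustration; it is NOT the LIWC lexicon.
TOY_CATEGORIES = {
    "posemo": {"win", "superb", "happy"},
    "negemo": {"defeat", "nightmare", "angry"},
}


def category_percentages(text, categories=TOY_CATEGORIES):
    """Return, per category, the percentage of tokens that belong to it."""
    # Assumption: lowercase alphabetic strings are good enough as tokens here.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in categories}
    counts = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token in words:
                counts[name] += 1  # a token may count for several categories
    return {name: 100.0 * counts[name] / len(tokens) for name in categories}
```

Because the scores are percentages rather than raw counts, texts of different lengths become comparable, which is the property the later analyses of the corpus also rely on.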

While these studies are indicative of a link between affect and language, most of them focus on less ecologically valid settings (such as the laboratory), use questionnaires or focus on disorders like depression. One can ask how such findings translate to the natural settings outside the laboratory. A study directly addressing this question is Baker-Ward et al. (2005), who analyzed spoken reports of young football players after their final match of the season. They found that the players in a positive state (i.e. winners) produced descriptions of the game that were clearer and more cohesive, while the players in a negative state (the losing players) described the game more interpretatively.

Interestingly, these findings connect to an early study conducted by Hastorf and Cantril (1954), which deals with how different perspectives on a football game between Princeton and Dartmouth influenced viewers’ perceptions of the game itself. While Princeton students mostly agreed that the game was played “rough and dirty” by Dartmouth, who ultimately lost the game, and saw more flagrant infractions, the majority of Dartmouth students saw it as “rough and fair” and blamed the roughness on both teams. While this study nicely illustrates how perceptions of events, and, in a way, events themselves may differ according to one’s point of view, the precise language used to describe the match was unfortunately not investigated in this study.

Cialdini et al. (1976), however, did investigate language use in relation to success and failure. In three experiments, they demonstrated how individuals involved themselves in victories of (groups of) other people, without having a direct influence on the victory. For example, they suggested that when students were asked about wins and losses of their own university’s team, they described successful matches with significantly more use of the pronoun we than lost matches. This phenomenon of identifying with winners was coined BIRGing (“Basking In Reflected Glory”). Snyder et al. (1986) showed the opposite effect in behavior for failures and coined it CORFing (“Cutting Off Reflected Failure”). While these tendencies of people to bask and distance themselves have been replicated repeatedly (Downs and Sundar 2011; Wann and Branscombe 1990), whether and how these tendencies emerge in language production has not been systematically explored.

1.2 Natural language generation (NLG) and Natural language processing (NLP)

Psychological studies, as just described, have revealed that affective state can influence language production. However, most of these studies focused on only one specific aspect, such as politeness or abstractness. Moreover, with the exception of the work done with LIWC, all of the studies mentioned approach the influence of affect on language production experimentally. In recent years, there has been a growing interest in more comprehensive studies into emotion and language production, typically using computational approaches. Here, we highlight two: sentiment analysis and affective natural language generation.

Natural language generation (NLG) is the process of converting data into text (Gatt and Krahmer 2018; Reiter and Dale 2000), with applications in, for example, automatic generation of texts for sensitive matters such as neonatal intensive care reports based on medical data (Mahamood and Reiter 2011; Portet et al. 2009), but also automatic generation of photo captions (Chen et al. 2015; Feng and Lapata 2010; Kuznetsova et al. 2012), which can be tailored to the needs of people with visual impairments, or sports commentary (Lee et al. 2014; van der Lee et al. 2017). Bateman and Paris (1989) stress the importance of tailoring machine generated language to the needs of the intended audience. Taking this one step further, Hovy (1990) describes how considering different perspectives on the same event, by taking into account the speaker’s emotional state, rhetorical, and communicative goals, is crucial for generating suitable texts for different addressees. Several companies worldwide already offer automatically generated narratives based on databases, e.g., Automated Insights (USA) or Arria NLG (UK). However, the reality of automatic text generation is that not many NLG systems are able to adapt to the mood of the recipients of the produced text (Mahamood and Reiter 2011) and to convey the mood of the author. While this may not be a problem if simple data-to-text output is the aim of the system, Portet et al.’s (2009) study shows that there are indeed situations that call for a more emotionally informed approach. In general, tailoring automated text to an intended audience, especially with regard to sentiment, still poses a challenge to whose solution MEmoFC can contribute, for example, by enabling the tailoring of reports specifically to the perspective and affective states of fans of specific clubs after specific game outcomes.

Of course, to be able to do this, we need to know how affective state could influence text production, not only concerning the factors studied by psychologists (politeness, abstractness, etc.) but in all aspects of language production. Sentiment analysis can provide valuable clues in this respect. Sentiment analysis, or stance detection, can be characterized as a classification of texts, for example, the labeling of positive versus negative online reviews to capture sentiments and attitudes towards specific topics, brands, and products, which has become a crucial task in recent years (Glorot et al. 2011; Kim 2014; Ravi and Ravi 2015; Socher et al. 2013). Social network sites like Facebook and Twitter have been used to extract opinions and sentiment on a large scale, for example, with a focus on brands or political elections (dos Santos and Gatti 2014; Ghiassi et al. 2013; Isah et al. 2014; Pak and Paroubek 2010; Pang and Lee 2008; Tumasjan et al. 2011).

While most work on stance detection and sentiment analysis has focused on English corpora, there has also been work on other languages, see, e.g., Basile (2013) or Bosco et al. (2013), for work on Italian, more recently, Tsakalidis et al. (2018) for resources in Greek, or on informal and scarce languages (Lo et al. 2017). Increasingly, work is also being done to apply sentiment analysis techniques from English to other, less researched languages automatically, using, for example, machine translation techniques (e.g., Perez-Rosas et al. 2012, or Bautin et al. 2008). These kinds of studies can be informative about which words and phrases are associated with which particular emotional states. Yet, while these approaches are promising, they often still rely on training material in the less-researched languages, for which limited resources are available (at least compared to English).

1.3 The current studies

This paper introduces the MEmoFC corpus, a multilingual, large-scale corpus of soccer reports, which is unique in that it contains pairs of reports for each match, one for each team participating in the match, combined with the original game statistics. In this way, MEmoFC offers controlled (in terms of the source of the events described) yet natural emotionally varied descriptions of the same events. This makes it an attractive resource to study the effect of affect and perspective on language, which, in turn, paves the way for tailoring automatically generated texts to a specific audience.

In this paper, we describe how we constructed and preprocessed the MEmoFC corpus, and we present descriptive statistics for it. MEmoFC can be used to address many different research questions, but to illustrate its potential and evaluate its use as a source for affective science, we perform three example studies:

Example Study 1: Do we see more linguistic indicators of basking behavior in the reports after won matches than after lost ones?

As we have described above, earlier studies have suggested that basking occurs more after winning than after losing (Cialdini et al. 1976). We ask whether this is indeed the case by investigating whether writers in the different languages use the pronoun we more often after winning than after tying or losing.

Example Study 2: Which words and phrases are typical for the different game outcomes, and does this differ per language?

We expect the affective states of the authors to be reflected in their lexical choices and possibly also in other linguistic features such as grammar or punctuation (e.g., Stirman and Pennebaker 2001; Hancock et al. 2007). Here, we ask which words and phrases are actually frequently used for specific game outcomes, and whether this differs between the different languages under investigation.

Example Study 3: Can we classify texts as describing a win or loss (and does this vary per language) and which textual elements of the reports are most indicative of the game outcome?

Assuming that different winning and losing reports express different emotions with different language depending on the game results, we ask whether this knowledge can be used to classify reports; in other words, to what extent can we tell, based on the language, whether a game was won or lost.

The corpus is publicly available for research purposes upon request (https://doi.org/10.34894/07ROT3).

2 Construction of the corpus

2.1 Texts in MEmoFC

The reports in the corpus were manually collected, saved directly from the homepages’ archives, and have not been cleaned (typographic errors, wrong grammar, layout etc.). MEmoFC is multilingual in that it contains reports from three languages: English, German, and Dutch. The linguistic subcorpora are further divided into WIN, LOSS and TIE, which are, in turn, distinguished by league (first/second [+ third for the UK]). There are two metadata tables per language: one explaining the abbreviations for the different football clubs and one that allows the identification of the two participating teams of a match, the file name, outcome (win, loss, tie), the date the match took place, and the date the archive of the respective homepage was accessed. An example excerpt from the English metafile can be found in Table 1. Due to the multitude of participating football clubs, possible influences of individual authors’ writing styles on the language employed for the text are reduced, which makes it possible to draw more general conclusions for the genre from analyses.

Table 1 Example excerpt of the metadata file from the English subcorpus of MEmoFC

In addition to the written reports, MEmoFC also contains the corresponding match statistics (see Sect. 2.2). The original files are saved in UTF-8 encoding and have not been annotated, parsed or PoS-tagged, meaning the texts are exactly how they appeared on the homepages of the clubs right after the matches took place.

With the help of the metadata and the consistently named files as shown in Table 1, the participating clubs and outcomes are easily identifiable, and the matching reports can be aligned and analyzed contrastively. Table 2 illustrates how text excerpts from the corpus are loosely aligned, ending at the same event in the game. Displayed are the two sides of a match that took place on the 26th September 2015 in the British first league. The reports themselves, of course, differ in length and game events described.
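As an illustration of how such an alignment might be automated, the sketch below pairs reports by match. It rests on assumptions that are not stated in the text: judging from the identifiers cited in this paper (e.g., PB220815, BA220815), a file name is taken to consist of a club abbreviation followed by the match date as DDMMYY, and the metadata rows are represented as hypothetical (file name, team, team) tuples. The function names are likewise made up for this example.

```python
import re
from collections import defaultdict
from datetime import date

# Assumed naming pattern: club abbreviation (letters) + DDMMYY, e.g. "PB220815".
FILE_ID = re.compile(r"^([A-Za-z]+)(\d{2})(\d{2})(\d{2})$")


def parse_file_id(file_id):
    """Split a file identifier into (club abbreviation, match date)."""
    match = FILE_ID.match(file_id)
    if match is None:
        raise ValueError(f"unexpected file identifier: {file_id!r}")
    club, day, month, year = match.groups()
    return club, date(2000 + int(year), int(month), int(day))


def align_reports(rows):
    """Pair reports on the same match.

    rows: iterable of (file_id, team_a, team_b) tuples, a hypothetical
    simplification of the metadata table. Matches with a missing
    report (fewer than two files) are skipped.
    """
    matches = defaultdict(list)
    for file_id, team_a, team_b in rows:
        _, match_date = parse_file_id(file_id)
        matches[(match_date, frozenset((team_a, team_b)))].append(file_id)
    return [sorted(ids) for ids in matches.values() if len(ids) == 2]
```

Keying on both the date and the pair of teams keeps reports apart when several matches were played on the same day.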

Table 2 Example excerpts of matched reports from the English subcorpus of MEmoFC

2.2 Game statistics

The statistics for the relevant matches were automatically scraped from Goal.com, a website that provides information and content about football. Finding and mining these statistics was done in three steps. First, Google queries designed to find pages from Goal.com were issued to locate the corresponding statistics for each match in MEmoFC. Next, the data stored on these pages were mined and, finally, converted to an XML format. Each XML file provides data about one football match in MEmoFC. These files contain general-level information as well as more detailed information (see Table 3).

Table 3 Information (general, match events, last game, players, substitutes, and managers, last five games, relative strength, match statistics) about MEmoFC statistics stored in the XML files

2.3 Descriptive statistics of MEmoFC

The corpus covers between 34 and 46 game days in approximately the same time frame (August 2015 until April/May 2016) in all countries (Table 4). Table 5 shows the difference in text and token numbers: UK1, UK2 and UK3 contain more than twice as many reports as GER and NL. Unfortunately, some of the reports were untraceable on the websites, either because they were removed or never written for individual matches. This concerns 64 reports throughout all leagues and languages, which encompasses just 1.18% of the whole corpus. These matches have been marked n.a./not available in the metafiles. Due to the proportion of missing texts being small, these do not cause a significant imbalance in perspective. Hence, we did not treat them as problematic missing data in the exploration of the corpus. Although these missing matches are mentioned in the metafiles, their reports are not counted in Table 5 and Fig. 1. This means that the numbers in Table 5a, b solely result from the texts actually available in MEmoFC, which explains the differences in numbers between wins and losses, as well as the uneven numbers of ties. The corpus now contains 5434 texts, which add up to about 3.5 million tokens, with more than half being part of the English subcorpus, 803,793 in the German, and 507,035 in the Dutch subcorpora. The Dutch match reports are the shortest in all conditions, while English and German reports are generally similar in length (see Table 5b). Overall, game outcome seems to have no influence on text length in any of the languages in MEmoFC.

Table 4 Overview of football season 2015/2016
Table 5 (a) Number of texts (Txt) and words (W) in MEmoFC by League and Country (1–3 in the UK; 1 and 2 in Germany; 1 and 2 in the Netherlands); (b) Average text length and words per sentence (WPS) in MEmoFC by League and Country (1–3 in the UK; 1 and 2 in Germany; 1 and 2 in the Netherlands)
Fig. 1 Distribution of text lengths (words per report) in MEmoFC by language and game outcome

2.4 Parsing and lemmatization

In a next step, the corpus was dependency parsed and lemmatized. For the English and German subcorpora, the Spacy Python library (Honnibal and Johnson 2015) was used. The Dutch subcorpus was lemmatized by Frog (Bosch et al. 2007) because Spacy does not contain a lemmatizer for Dutch. Dutch multiword expressions were automatically conjoined with an underscore by Frog (e.g., zijn_binnen [to be in]). For English and German, phrasal verbs and/or separable prefix verbs were “rejoined” (e.g., climb up or ringen_nieder [wrestle down]). This way it is possible to differentiate, for example, between kick and kick off. The preprocessed files can be found in a separate folder.
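The rejoining step can be pictured with the following minimal sketch. It is not the preprocessing code used for the corpus; it assumes that a sentence has already been parsed into (lemma, dependency label, head index) triples and that verb particles carry the label "prt" (English) or "svp" (German), which to our understanding are the labels used by common Spacy models. The function name is hypothetical.

```python
# Dependency labels assumed to mark verb particles / separable prefixes.
PARTICLE_DEPS = {"prt", "svp"}


def rejoin_particle_verbs(tokens):
    """Attach particles to their head verb with an underscore.

    tokens: list of (lemma, dep, head_index) triples for one sentence,
    where head_index is the position of the governing token.
    Returns the list of lemmas with particles merged into their head.
    """
    # Collect the particles governed by each head position.
    particles = {}
    for lemma, dep, head in tokens:
        if dep in PARTICLE_DEPS:
            particles.setdefault(head, []).append(lemma)

    joined = []
    for index, (lemma, dep, _head) in enumerate(tokens):
        if dep in PARTICLE_DEPS:
            continue  # the particle is emitted as part of its head
        if index in particles:
            lemma = "_".join([lemma] + particles[index])
        joined.append(lemma)
    return joined
```

For a parsed sentence such as "they kick off", this yields the lemmas they and kick_off, so that kick_off and plain kick are counted as different items in later frequency analyses.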

3 Using the MEmoFC

In this section, we will illustrate the potential of the corpus with three exploratory studies, coming from three angles. In the following subsections, we approach the evaluation of the corpus and show its usefulness as an affective linguistic resource with a variety of different techniques in order to demonstrate the diverse ways in which it can be used for research.

3.1 Example study 1: Do we see more linguistic indicators of basking behavior in the reports after won matches than after lost ones?

With regard to language reflecting basking tendencies, the focus was on the use of the first person plural pronoun we in the aligned match reports. Following the suggestion of Cialdini et al. (1976), we hypothesize more uses of first person plural pronouns (1PP) in reports on won matches compared to reports on ties or losses. We ask whether this is indeed the case, and whether this is the same across languages.

While analyzing and comparing different types of pronouns with NLP tools would also be interesting, in particular the distribution of 1PP compared to they (or third person plural pronouns; 3PP), this task proved to be challenging for two reasons. First, in German and Dutch, some pronouns are ambiguous (e.g., Sie in German can be 3rd person plural, formal 2nd person singular and plural, or 3rd person singular female; zij in Dutch can be 3rd person plural or singular), so a deeper syntactic analysis would be required to detect plural pronouns. Second, even after this step, the pronouns’ referents would still be ambiguous: whether the more distant 3PP option is indeed used as a reference to the own team (instead of 1PP) cannot be ensured, since 3PP could refer to a wide range of referents, such as the opponent, the fans, or a specific group of players, none of which bears on distancing and basking behavior. Currently, no coreference resolution tool for Dutch and German is easily available. Furthermore, Named Entity Recognition is less accurate on the reports of MEmoFC, both because they differ from the usual training data (annotated newspaper articles) and because players’ surnames are often missing from the gazetteer lists of NER tools and, hence, not recognized. This issue would have had a substantially negative impact on the accuracy of coreference resolution systems, which is why we opted for a different approach. To answer the question guiding Example Study 1, occurrences of 1PP were counted in the tokenized texts and then divided by the overall number of tokens in the report (to account for the fact that longer reports are more likely to contain more pronouns in general). Afterwards, the results were summarized for the aligned texts in the win, loss, and tie subcorpora in English, Dutch, and German (see Table 6).
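The normalization described above amounts to a simple ratio per report. The sketch below shows one way to compute it; the pronoun lists follow the forms named in the caption of Table 6, with the German inflected forms of unser spelled out as our reading of "& variations". The regular-expression tokenizer and the function names are assumptions made for the example and do not reproduce the tokenization used for the corpus.

```python
import re

# Pronoun forms per language, following the caption of Table 6.
# The German inflections of "unser" are an assumed reading of "& variations".
FIRST_PERSON_PLURAL = {
    "en": {"we", "our", "ours"},
    "nl": {"we", "wij", "ons", "onze"},
    "de": {"wir", "uns", "unser", "unsere", "unserem",
           "unseren", "unserer", "unseres"},
}


def tokenize(text):
    """Very simple tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def pronoun_proportion(text, language):
    """Share of tokens in a report that are first person plural pronouns."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    pronouns = FIRST_PERSON_PLURAL[language]
    return sum(token in pronouns for token in tokens) / len(tokens)
```

For instance, `pronoun_proportion("We won and our fans cheered", "en")` counts two pronouns among six tokens. Averaging such proportions over all reports of one outcome gives figures of the kind summarized per language and outcome.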

Table 6 Proportions of 1PP (EN: We, Our, Ours; NL: We, Wij, Ons, Onze; GER: Wir, Uns, Unser [& variations]) compared to all tokens in Win, Loss, and Tie (Both Perspectives) in English, Dutch, and German

In English and German, we find the expected distribution: there are considerably more occurrences of 1PP in reports about won matches than in losses and ties. For Dutch, however, a reverse trend of more 1PP in loss compared to win is apparent, while the proportion of 1PP in ties is lower than in both loss and win. In the reports on ties, we find the overall lowest proportion of 1PP in English, German, and Dutch, with only minor differences between the languages (highest proportion of 1PP in Dutch). The Dutch preference for 1PP after lost matches could be a cultural peculiarity that is reflected in language, which illustrates the usefulness of taking into account different languages when constructing language resources for the study of affect. Although English, German, and Dutch are Germanic languages and the subcorpora were collected from Western European cultures, there might still be cultural differences traceable in the language use, e.g., in linguistic distancing behavior. For the aligned reports on ties, it can be assumed that the outcome is perceived differently by the involved clubs: while in some cases the perception might be more similar to a win, in other cases ties can be closer to losses, which might decrease the proportion of 1PP. Examples supporting the different perspectives can be found in the following excerpts, among others, from two aligned reports on tied matches:

  (3) “Der 1. FC Nürnberg verliert in der Nachspielzeit zwei wichtige Punkte.” (FCN171015, MEmoFC).

    “1. FC Nürnberg loses two important points during stoppage time.”

  (4) “Der FSV Frankfurt sichert sich einen Punkt in Mittelfranken” (FSV171015, MEmoFC).

    “FSV Frankfurt secures one point in Middle Franconia.”

Examples (3) and (4) show that the involved clubs perceive the tie differently: for the FCN it is a lost match because the club loses points, while the FSV considers the outcome a victory as they secure a point. This means that ties can be perceived as lost or won matches as well, which could also have an influence on the use of 1PP in these reports.

Overall, there are generally more uses of 1PP in German and Dutch reports on won and lost matches compared to English. While the pattern is similar in English and German, there is a different, even opposite trend in Dutch, which could be related to cultural differences and should be taken into account in studies on affect and in automatically produced texts.

3.2 Example study 2: Which words and phrases are typical for the different game outcomes, and does this differ per language?

After exploring the distribution of one particular word (1PP), we now ask which words are associated with winning, losing, and tying in the different languages in general. We perform three kinds of analyses: (1) on word frequencies in general (using TF-IDF and concordances, Subsect. 3.2.1), (2) on LIWC categories, and (3) on specific emotion terms (3.2.2). Names of places, players, teams, or managers were filtered out using named entity recognition with Spacy (https://spacy.io/models/) for English and German, and with Frog (Bosch et al. 2007) for Dutch. In addition to individual words, bi- and tri-grams will be inspected.

3.2.1 TF-IDF and concordance

To extract words and n-grams that are especially representative of the conditions and languages, two approaches were used. First, TF-IDF was calculated for each word in each subcorpus. Table 7 shows the extracted most frequent words after lemmatization. While the word lists in reports on wins, losses, and ties differ in all languages and all lists contain interesting frequent words per outcome (e.g., ecstatic in English/Win, embarrassment in English/Loss, ringen_nieder [wrestle down] in German/Win, entmutigen [discourage] in German/Loss, probleemlos [without problems] in Dutch/Win, Punt [point] in Dutch/Tie), there are also various words on these lists that do not appear to be typical for specific game outcomes. Given the relatively large size of the corpus compared to the small number of categories, TF-IDF may be too sensitive to be conveniently used, and other statistics—such as keyness, which compares two corpora instead of calculating word frequencies in a single corpus or document—appear to be better suited for this analysis.
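For readers unfamiliar with the measure, the following sketch computes a textbook variant of TF-IDF (relative term frequency multiplied by the logarithm of the inverse document frequency). Which exact variant and which document unit were used for Table 7 is not specified here, so treating each outcome subcorpus as one "document" is an assumption of this example, as is the function name.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF scores per document.

    documents: dict mapping a label (e.g. "win") to a list of lemmas.
    Returns a dict mapping each label to {lemma: score}, where
    score = (count / document length) * ln(number of documents / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each lemma occur?
    doc_freq = Counter()
    for lemmas in documents.values():
        doc_freq.update(set(lemmas))

    scores = {}
    for label, lemmas in documents.items():
        counts = Counter(lemmas)
        total = len(lemmas)
        scores[label] = {
            lemma: (count / total) * math.log(n_docs / doc_freq[lemma])
            for lemma, count in counts.items()
        }
    return scores
```

Under this variant, a lemma that occurs in every document (such as goal in three toy subcorpora) receives a score of zero regardless of how often it is used, whereas a lemma confined to a single document receives a positive score even if it occurs only once.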

Table 7 Most relevant unigrams in English, German, and Dutch for Win (W), Loss (L), and Tie (T) extracted with TF-IDF

A keyness analysis looks for keywords that are more likely to appear in a target corpus compared to a reference corpus or, as in this case, for differences across the conditions in the language subcorpora: win, loss, and tie. As the frequency of a word alone is not an indication of how specific a word is for a corpus, we calculate the keyness of a word with the freely available concordance tool AntConc (Anthony 2004).

Keyness, which we calculate with a word’s log-likelihood ratio (Lin and Hovy 2000), is a measure that enables the extraction of keywords based on their probability to appear in the target corpus compared to the reference corpus and, thereby, can identify the words that stand out and define a text most. The log-likelihood ratio of a word is calculated using a contingency table and takes both corpora’s sizes into account (based on Rayson and Garside 2000). First, the expected value E_i is calculated for each corpus i, with N_i being the number of words in corpus i and O_i the observed frequency of the word in corpus i:

$${E}_{i}=\frac{{N}_{i}\sum_{i}{O}_{i}}{\sum_{i}{N}_{i}}$$

The log-likelihood ratio G² is then determined as follows:

$${G}^{2}=-2\ln\lambda =2\sum_{i}{O}_{i}\ln\left(\frac{{O}_{i}}{{E}_{i}}\right)$$

The higher the log-likelihood value is, the larger the word frequency difference between the corpora and, hence, the more representative a word is for a subcorpus. In contrast to TF-IDF, keyness calculated with log-likelihood (e.g., using a Chi-square distribution) is also an indication of statistical significance since it does not only calculate the frequency of a word within a document or corpus but directly compares the frequencies in two corpora. The critical threshold for a log-likelihood value (or keyness) is 3.84 at the level of p < 0.05 and 15.13 at a level of p < 0.0001.
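The two formulas can be checked with a small worked example. The sketch below implements them for one word and two corpora; the function name and the frequencies in the usage note are made up for illustration, and the code is not the implementation used in AntConc.

```python
import math


def log_likelihood(freq_target, size_target, freq_reference, size_reference):
    """Log-likelihood ratio (G2) of one word for a target vs. a reference corpus.

    freq_*: observed frequency of the word in each corpus (O_i)
    size_*: total number of words in each corpus (N_i)
    """
    observed = (freq_target, freq_reference)
    sizes = (size_target, size_reference)
    total_observed = sum(observed)
    total_size = sum(sizes)

    g2 = 0.0
    for o, n in zip(observed, sizes):
        expected = n * total_observed / total_size  # E_i
        if o > 0:  # a zero frequency contributes nothing (0 * ln 0 := 0)
            g2 += o * math.log(o / expected)
    return 2.0 * g2
```

With hypothetical counts of 120 occurrences in a target corpus of 100,000 words and 40 occurrences in a reference corpus of the same size, both expected values are 80 and `log_likelihood(120, 100000, 40, 100000)` gives roughly 41.86, well above the 15.13 threshold. If the word has the same relative frequency in both corpora, the value is 0. Note that the statistic itself is unsigned, so the direction of the difference has to be read off the relative frequencies.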

In our analyses, the 20 most frequent words in the multilingual subcorpora are determined and presented in Table 7, structured according to language, and target condition compared to a reference condition, e.g., win compared to loss (represented as win–loss) or loss compared to tie (loss–tie).

Table 8 illustrates the top 20 words in English. Besides more obvious words like win or victory in win–loss, defeat, lose and loss in loss–win and draw in tie–win, frequent positive (superb, clean, perfect, secure, celebration [win–loss]; winner, opportunity, rescue, chance [tie–loss]) and negative words (condemn, disappointing, cruel, suffer, unable, frustrating [loss–win]; unable, spoil [tie–win and tie–loss]) are also apparent. In comparison, the English loss–win list consists of mostly negative words. In the reports on tied matches, the unique focus appears to be on the points earned and more neutral (both, share, neither, settle, goalless). Additionally, the loss–win list contains a preposition (despite) and an adversative conjunction (but), which likewise occurs unusually frequently in tie compared to win. Upon closer inspection of the context of the occurrences, it becomes apparent that neither is more often used as an adjective than as a conjunction. Thus, the negative game outcome affects not only the lexis but also the cohesive structure of the English texts. In addition, the use of the 1PP is more frequent in reports about won matches, hinting at basking tendencies, in line with 3.1.

Table 8 Keywords across outcomes (Win, Loss, and Tie) compared, respectively, in English after lemmatization

The patterns for German and English are comparable. As expected, we also find German words describing the outcome of the matches (see Table 9; Sieg [victory], Heimsieg [home victory], Auswärtssieg [away victory], siegen [win; win–loss]; Niederlage [defeat], verlieren [lose], unterliegen [be defeated; loss–win]; unentschieden [tied], Remis [draw; tie–loss and tie–win]). However, there are also differences from the English lists. While the number of positive words in the German win–loss comparison is similar to English ([Heim-/Auswärts-] Sieg [home/away victory], hochverdient [highly deserved], gewinnen [win], besiegen [defeat], Erfolg [success], wichtig [important], perfekt [perfect], ungeschlagen [unbeaten], feiern [celebrate], Tabellenspitze [top of the table], endlich [finally]), the keyword list of lost matches in relation to won ones seems less negative. This is mainly due to the relative lack of negative adjectives, the only ones being bitter [bitter] and unglücklich [unlucky], and to common euphemisms for goals conceded, such as kassieren [collect] and einstecken [pocket]. While the emphasis in tie–win/loss is clearly on the fight and the shared points (unentschieden [tied], erkämpfen [fight for & secure], leistungsgerecht [performance-based], Remis [draw], Punkteteilung [sharing of points], torlos [goalless], beid* [grammatical variations of the word both]), the German tie lists also contain positive keywords (zufrieden [satisfied], ungeschlagen [unbeaten], Punktegewinn [winning of points], Chance [chance], gerecht [fair]). Besides such “emotional” words, we again find other types of words in the lists. In contrast to the English list, the German one contains the simple additive conjunction und, which is significantly more frequent in reports about won matches. We again find an adversative preposition (trotz [despite]) and conjunction/adverb (jedoch [nevertheless]) in loss–win, hinting at overall differences in cohesive text structure depending on positive and negative game outcome.

Table 9 Keywords across outcomes (Win, Loss and Tie) compared, respectively, in German after lemmatization

In the Dutch subcorpus, finally, the main keywords distinguishing the conditions again focus on game outcome (see Table 10): win–loss (thuiszege [home victory], zege [victory], gewonnen [won], overwinning [victory], winnen [win]), loss–win (nederlaag [defeat], verliezen [lose]), tie–win/loss (gelijkspel [tied match], gelijk [same], gelijkspelen [tie], remise [draw]). Likewise, there are more positive words (mooi [beautiful], prachtig [magnificent], belangrijk [important], glunderen [shine], eindelijk [finally]) in win–loss, while in loss–win there are mainly negative words (pijnlijk [embarrassing], teleurstellen [disappoint], balverlies [ball loss], slecht [bad], ramp [disaster], leed [sorrow], lijden [suffer]). Again, ties seem to be associated more with neutral words.

Table 10 Keywords across outcomes (Win, Loss and Tie) compared, respectively, in Dutch after lemmatization

3.2.2 Emotion words: LIWC, VAD, and emotion analyses

Having looked in general into which words are used to describe the various game outcomes across the different languages, we will now investigate emotion terms more specifically. One way of doing this is using LIWC, originally developed by Pennebaker et al. (2001). For the exploration of the MEmoFC, the original English dictionary, the German dictionary (Wolf et al. 2008), and the Dutch dictionary (Zijlstra et al. 2004) were used. Here, we focus on categories that are arguably most interesting in terms of sentiment and perspective: pronouns, specifically 1PP due to the studies on self-serving and self-preservation (BIRG/CORF; see above), negations, positive and negative emotion words, words relating to anger, sadness, and anxiety, as well as exclamation marks, which indicate positive emotions (Gilbert and Hutto 2014; Hancock et al. 2007). Means and standard deviations for the different LIWC categories are presented in Table 11, and Fig. 2 illustrates the variance in the categories in the corpus by language and outcome/condition in violin plots.

Table 11 Means and standard deviations for LIWC categories across game outcomes (Win, Loss, and Tie) and languages
Fig. 2 Percentages of LIWC categories (positive and negative emotion words, anger, anxiety, sadness, negation, pronouns, "we", exclamation marks) per total words per text by condition (Win, Loss, and Tie) and language (English, German, and Dutch). Points are raw data points, the line shows central tendencies of the data, the bean is the smoothed density, and the rectangle around the line represents the inference interval

Overall, the LIWC analyses are consistent with the concordance analysis, described above: more positive emotion words are counted in reports about won matches and more negative emotion words, anger, and sadness in reports about lost matches, and this pattern is consistent across languages. Only the level of anxiety does not differ according to game outcome but between languages, with the level overall highest in English and lowest in German. The numbers for tied matches generally fall in between those for wins and losses.

Looking at differences between the languages, we observe more positive words in English texts, whereas these are less frequent in German and, especially, in Dutch texts. Negative emotion words are also most frequent in English and occur less in German and Dutch, where the frequencies are similar. We also find a higher proportion of negations in Dutch, specifically in reports on lost and tied matches. The same pattern (more negations in reports about losses and ties) can be observed in English and German as well, although the differences are smaller.

In recent years, various alternative methods to assess the emotional nature of a text have been developed. For example, one can assess the valence (positive–negative), arousal (calm–excited), and dominance (low–high) of the words in a text. To measure how texts in the MEmoFC differ in terms of these VAD dimensions, and whether this differs across languages, we use normative lexicons of English (Warriner et al. 2013), German (Vo et al. 2009) and Dutch (Moors et al. 2013), which have been developed by relying on a large number of native speakers rating thousands of words on these dimensions (except for German, where dominance is not reported).

The three lexicons differ not only in size (about 14,000 words for English, 4,000 for Dutch, and 3,000 for German), but also in the rating scales that were used during the data collection (1–9 for English, 1–7 for Dutch, and −3 to +3 for German). To obtain more consistent results across languages, we used min–max normalization to rescale all dictionaries between −4 and +4, with 0 indicating neutral valence/arousal/dominance.
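The rescaling step can be sketched as follows. This is a minimal example that assumes the endpoints of each rating scale (rather than the observed minimum and maximum in each lexicon) are used as the bounds, so that the scale midpoint maps to 0; the function name and the dictionary of scale bounds are hypothetical.

```python
def rescale(score, scale_min, scale_max, new_min=-4.0, new_max=4.0):
    """Min-max rescale a rating from [scale_min, scale_max] to [new_min, new_max]."""
    fraction = (score - scale_min) / (scale_max - scale_min)
    return new_min + fraction * (new_max - new_min)


# Rating scales as reported in the text; the dictionary itself is illustrative.
SCALES = {"english": (1, 9), "dutch": (1, 7), "german": (-3, 3)}

# The midpoint of every scale maps to 0 (neutral), the endpoints to -4 and +4.
midpoints = {lang: rescale((lo + hi) / 2, lo, hi) for lang, (lo, hi) in SCALES.items()}
```

If the observed minimum and maximum of each lexicon were used as bounds instead, the neutral point would not necessarily map to 0.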

After the scale adjustment, VAD scores were calculated for each report by summing, per dimension, the scores of all words in the report; these report-level scores were then averaged over all matches. A similar approach has been used in, for example, Gatti et al. (2016), and has been shown to be useful when there is insufficient annotated data for supervised classification (Jurafsky and Martin 2009; Taboada et al. 2006), or when pre-trained sentiment or emotion analysis tools are not available (as is the case for Dutch and German).
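A minimal sketch of this lexicon-based scoring for a single dimension is given below. The mini lexicon and its values are invented for illustration and are not taken from the normative lexicons; the sketch also assumes that words missing from the lexicon are simply skipped and that report sums are not normalized by report length, neither of which is specified above.

```python
def report_score(tokens, lexicon):
    """Sum the lexicon scores of all tokens in one report (unknown words are skipped)."""
    return sum(lexicon[token] for token in tokens if token in lexicon)


def mean_score(reports, lexicon):
    """Average the report-level scores over a collection of reports."""
    scores = [report_score(tokens, lexicon) for tokens in reports]
    return sum(scores) / len(scores)


# Hypothetical valence values on the rescaled -4..+4 range.
valence = {"win": 3.0, "goal": 1.0, "defeat": -2.5}

win_reports = [["a", "win", "goal"], ["win"]]
loss_reports = [["defeat"], ["defeat", "goal"]]

mean_win = mean_score(win_reports, valence)    # (4.0 + 3.0) / 2
mean_loss = mean_score(loss_reports, valence)  # (-2.5 + -1.5) / 2
```

The same two functions can be applied to the arousal and dominance lexicons in turn.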

As can be seen in Table 12, reports of winning matches have a higher, positive valence across all languages compared to losses. In a similar vein, reports on losses have a more negative valence than those on ties, which in turn are more negative than wins. We observe no difference in arousal between reports, while dominance is slightly higher for wins. Ties are consistently ranked between wins and losses.

Table 12 Valence (V), arousal (A), and dominance (D) for aligned reports in English (Win–Loss: 1037 matches compared; Tie: 365 matches compared; Dictionary: 13,915 entries), Dutch (Win–Loss: 468 matches compared; Tie: 135 matches compared; Dictionary: 4299 entries), and German (Win–Loss: 404 matches compared; Tie: 146 matches compared; Dictionary: 2903 entries)

Given that large parts of the dictionaries consist of moderate/neutral words, the differences between conditions might be “flattened”. We therefore created new dictionaries for each dimension, containing only words whose (normalized) valence, arousal, or dominance score is above 2 or below −2, and reran the analysis using only these more extreme values.
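The filtering step amounts to a simple threshold on the rescaled scores. The sketch below assumes strict inequalities, as the wording suggests; the function name and example entries are hypothetical.

```python
def extreme_entries(lexicon, threshold=2.0):
    """Keep only words whose rescaled score is above +threshold or below -threshold."""
    return {word: score for word, score in lexicon.items()
            if score > threshold or score < -threshold}


# Invented valence values on the -4..+4 range.
valence = {"superb": 3.1, "match": 0.2, "cruel": -2.8, "edge": 2.0}
extreme_valence = extreme_entries(valence)
```

Applying the function separately to each dimension yields one reduced dictionary per dimension.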

In Table 13, the differences for the extreme values between the matched reports about winning and losing matches are even bigger, which confirms the trend that winners use more positive and more strongly positive language. Again, more negative affect is expressed in texts about losses than in reports about won matches.

Table 13 Valence (V), arousal, and dominance for extreme values in aligned reports in English (Dictionary entries left: 1750 [valence]; 919 [arousal]; 403 [dominance]), Dutch (Dictionary entries left: 801 [valence]; 317 [arousal]; 145 [dominance]), and German (Dictionary entries left: 748 [valence]; 466 [arousal]; n.a. [dominance])

A final exploratory analysis of the emotion words in the different texts zooms in on discrete emotion terms using the EmoMap technique of Buechel and Hahn (2018), which maps the VAD lexicons onto Ekman’s set of basic emotions (1992). The resulting lexicons are then scaled between 0 and 10, with 0 representing absence of a particular emotion and 10 representing the maximum intensity. Note that 10 is a theoretical maximum, while in fact no word in the resulting dictionary has a value higher than 8.5. Table 14 shows the distribution of emotions across languages and game outcomes. Although the numeric differences are relatively small, the pattern is broadly consistent with the earlier LIWC and VAD analyses, with joy scores being higher for wins, and sadness, fear, and disgust higher for losses.

Table 14 Discrete emotions (joy [J], anger [A], sadness [S], fear [F], and disgust [D]) in aligned reports in English (Win–Loss: 1037 matches compared; 365 Ties compared), Dutch (Win–Loss: 468 matches compared; 135 Ties compared), and German (Win–Loss: 404 matches compared; 146 Ties compared)

3.3 Example study 3: Can we classify texts as describing a win or loss (and does this vary per language) and which textual elements of the reports are most indicative of the game outcome?

The analyses so far suggest that there are systematic differences in words and phrases used for different outcomes, whether we look at personal pronouns (we), LIWC categories, VAD scores, or discrete emotion terms. This raises the question whether we can automatically predict whether a text reports about, say, a win or a loss, taking into account words, but also other textual features. To investigate this, we conducted a multiclass classification task to further explore possible differences in the language used to report on a win, tie, or loss. Our classifier is based on the classification framework of Van der Lee and Van den Bosch (2017). Similar to that framework, a distinction was made between word statistics, syntactic, and content-specific text features (see Table 15). The word statistic features are measures on the word or character level, such as sentence length and word length distributions. Syntactic features are indications of syntactical patterns present in sentences. To find these underlying syntactical patterns, the texts were parsed automatically using Frog (Bosch et al. 2007) for the Dutch soccer reports and the Stanford NLP parser (Klein and Manning 2003) for the English and German soccer reports. Besides the raw part-of-speech n-grams, syntactical feature groups such as function words, descriptive words and punctuation were captured as well.

Table 15 Text features by word statistics, syntax, and content adopted for the classification of MEmoFC texts

The content-specific features used for this study were word uni-grams, bi-grams and tri-grams. These words or phrases could indicate certain topics in the text. Strategic data reduction was applied to the soccer reports for the content-specific features to reduce computational load and simultaneously reduce the chance that the classifier focuses on linguistically irrelevant features (the word Manchester might for instance be associated with a win, but this is not a good linguistic indicator). For this data reduction, words were lemmatized (e.g. goals to goal), stop words were removed (words like the, who and are), and named entities were again removed (e.g. Arsenal, Luuk de Jong). Furthermore, highly infrequent words (words appearing fewer than 10 times in the total corpus) were removed from the content-specific features.
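The data reduction and n-gram extraction described above can be sketched roughly as follows. The stop word list, lemma table, and named-entity set are tiny hypothetical stand-ins for a real lemmatizer and named-entity recognizer, and the order of the steps (reduce tokens, drop infrequent words, then build n-grams) is an assumption.

```python
from collections import Counter

# Hypothetical stand-ins for real linguistic resources.
STOP_WORDS = {"the", "who", "are"}
LEMMAS = {"goals": "goal"}
NAMED_ENTITIES = {"arsenal", "manchester"}


def reduce_tokens(tokens):
    """Lowercase and lemmatize tokens, dropping stop words and named entities."""
    reduced = []
    for token in tokens:
        lowered = token.lower()
        if lowered in STOP_WORDS or lowered in NAMED_ENTITIES:
            continue
        reduced.append(LEMMAS.get(lowered, lowered))
    return reduced


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def content_features(reports, min_count=10, max_n=3):
    """Count uni-, bi-, and tri-grams per report after data reduction."""
    reduced = [reduce_tokens(tokens) for tokens in reports]
    # Corpus-wide word frequencies, used to drop highly infrequent words.
    totals = Counter(token for tokens in reduced for token in tokens)
    features = []
    for tokens in reduced:
        kept = [token for token in tokens if totals[token] >= min_count]
        features.append(Counter(gram for n in range(1, max_n + 1)
                                for gram in ngrams(kept, n)))
    return features


# Toy corpus; min_count is lowered because the example is tiny.
reports = [["The", "goals", "are", "great"], ["Arsenal", "goal", "great"]]
features = content_features(reports, min_count=2)
```

Note that removing words before building n-grams lets n-grams span the removed positions; whether the original pipeline did this is not specified.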

Six machine learning classification algorithms were tested: Linear Support Vector Machines and Naïve Bayes, plus four tree-based algorithms: C4.5, AdaBoost, Random Forest, and XGBoost. Discriminating between wins, ties, and losses was done using either word statistics, syntactic or content-specific features as described above. Subsequently, the features from these three feature groups were combined using two different approaches: a supervector approach and a meta-classifier approach. The supervector approach pools all features together into a single vector to predict the type of report, regardless of the feature category. The meta-classifier approach takes the probabilistic outputs of each feature category and uses them as inputs for a higher-level classifier to predict the type of report, which has the potential to increase classification accuracy if each feature group contains some information that is not captured by the others. The meta-classifier approach has been shown to increase performance in other classification tasks (Malmasi et al. 2015; Van der Lee and Van den Bosch 2017). Furthermore, a baseline was used that always predicts the most frequent report type in the training set.
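To make the difference between the two combination methods and the baseline concrete, the sketch below uses a toy nearest-centroid model as a stand-in for the actual algorithms listed above. All names (`CentroidClassifier`, `fit_meta`, and so on) are hypothetical, and the toy model is not one of the six algorithms tested.

```python
import math
from collections import Counter


def majority_baseline(train_labels):
    """Baseline: always predict the most frequent label of the training set."""
    return Counter(train_labels).most_common(1)[0][0]


def supervector(groups):
    """Supervector approach: concatenate the vectors of all groups for one report."""
    return [value for group in groups for value in group]


class CentroidClassifier:
    """Toy nearest-centroid model standing in for any probabilistic classifier."""

    def fit(self, X, y):
        counts, sums = Counter(y), {}
        for vec, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
        self.labels = sorted(counts)
        self.centroids = {l: [s / counts[l] for s in sums[l]] for l in self.labels}
        return self

    def predict_proba(self, vec):
        # Softmax over negative squared distances to the class centroids.
        scores = [-sum((a - b) ** 2 for a, b in zip(vec, self.centroids[l]))
                  for l in self.labels]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        return [e / sum(exps) for e in exps]

    def predict(self, vec):
        probs = self.predict_proba(vec)
        return self.labels[probs.index(max(probs))]


def fit_meta(train, y):
    """Meta-classifier approach: one model per feature group, plus a model on their outputs."""
    n_groups = len(train[0])
    base = [CentroidClassifier().fit([report[g] for report in train], y)
            for g in range(n_groups)]
    meta_X = [supervector([base[g].predict_proba(report[g]) for g in range(n_groups)])
              for report in train]
    return base, CentroidClassifier().fit(meta_X, y)


def predict_meta(base, meta, report):
    probs = [clf.predict_proba(vec) for clf, vec in zip(base, report)]
    return meta.predict(supervector(probs))


# Toy data: each report has two feature groups (invented values).
train = [([1.0, 0.0], [5.0]), ([0.9, 0.1], [4.0]),
         ([0.0, 1.0], [1.0]), ([0.1, 0.9], [0.0]), ([0.5, 0.5], [2.5])]
labels = ["win", "win", "loss", "loss", "tie"]
base, meta = fit_meta(train, labels)
prediction = predict_meta(base, meta, ([0.95, 0.05], [4.5]))
```

In practice, the inputs of the higher-level classifier would typically come from cross-validated predictions rather than from predictions on the same data the group-level models were trained on; the sketch omits this for brevity.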

The results show that all feature groups perform above baseline in all languages (Tables 16, 17 and 18). The lexical features (stylistic features such as word length and sentence length) classified the report types least well, with the syntactical features (e.g. POS n-grams and punctuation features) performing slightly better. The word-based content-specific features perform the best out of the feature groups, although the results can be improved by combining all three features in a meta-classifier. The best classifier was able to correctly label around 80% (compared to a 39% baseline) of the reports for each language, which confirms that there are clear linguistic differences between the descriptions of wins, ties, and losses for English, German and Dutch.

Table 16 Classification performances of the best algorithm for the different feature groups and combination methods for the English subcorpus
Table 17 Classification performances of the best algorithm for the different feature groups and combination methods for the German subcorpus
Table 18 Classification performances of the best algorithm for the different feature groups and combination methods for the Dutch subcorpus

Interestingly, the most important word features (Tables 19, 20 and 21), as obtained using Gini importance scores for the best performing tree-based algorithm (Breiman et al. 2009), do not show salient differences in word use between win, tie or loss reports. These features are all expressions that occur rarely in the corpus, which suggests that the distinctiveness of wins, losses or ties is based on many features in combination rather than specific ones.
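For readers unfamiliar with Gini importance, the sketch below shows the underlying quantity: the decrease in Gini impurity achieved by a single split, and the normalization of accumulated importances so that they sum to 1. It is a simplified illustration with hypothetical function names; in a full tree ensemble, the decreases are additionally weighted by the share of samples reaching each node and accumulated over all nodes and trees, and the exact procedure of the implementation used here is not specified.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


def normalize(importances):
    """Scale accumulated importances so that they sum to 1."""
    total = sum(importances.values())
    return {feature: (value / total if total else 0.0)
            for feature, value in importances.items()}


# A split that separates the two outcomes perfectly removes all impurity.
parent = ["win", "win", "loss", "loss"]
decrease = impurity_decrease(parent, ["win", "win"], ["loss", "loss"])
```

A feature that is used in only a few splits, each with a small decrease, receives a low normalized score even when it contributes to the overall decision in combination with other features.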

Table 19 The ten most important word features for the English subcorpus with Normalized Gini Importance Scores
Table 20 The ten most important word features for the German subcorpus with Normalized Gini Importance Scores
Table 21 The ten most important word features for the Dutch subcorpus with Normalized Gini Importance Scores

4 Conclusion and future work

This paper presented a new multilingual corpus, MEmoFC, consisting of pairs of reports for soccer matches, taken from the respective websites of the competing teams, combined with game statistics. The corpus can be used for linguistic emotion research and has been constructed to contribute to understanding how the production of written language is influenced by an author’s emotional state or the assumed state of the intended audience of a text (e.g., happy after a win and disappointed after a loss). After describing how the corpus was collected and preprocessed, we illustrated how the corpus can be used in three exploratory studies.

The three studies were each guided by a specific research question. In our first study, we investigated basking behavior as reflected in the use of first-person plural pronouns. As expected, a trend appeared in the English and German subcorpora, indicating an increase of basking after won matches compared to lost or tied matches. However, this was not the case for the Dutch subcorpus. The second study was concerned with the use of specific words and phrases depending on game outcome. In three analyses, we first examined overall frequencies with TF-IDF, which, while already suggesting trends, proved less suitable for the task; we then moved on to keywords of the language and outcome subcorpora, which showed interesting, outcome-specific words that are used in the individual subcorpora, many of which were emotionally colored, although this seemed to be the case to different degrees in the different languages; finally, we zoomed in on VAD and emotion scores, which, while showing the expected patterns according to outcome, also differed in intensity in the three languages. The third study served as a demonstration of the possibility to classify the reports according to win, loss, and tie, which confirms that some linguistic features are more representative of the respective game outcomes and, hence, possibly also emotionality, than others.

While our exploratory analyses demonstrate how the corpus can be fruitfully used to investigate affective language production, there are also some limitations worth mentioning. For example, the fact that the authors of the texts in the corpus are mostly unknown means that possible effects of authorship cannot be studied well using MEmoFC. While the possibility that the individual authors’ writing styles have an impact on the lexical choices and grammatical structures as well cannot be ruled out, we expect that the multitude of different reports coming from many different writers washes out the peculiarities of different writing styles. Although MEmoFC is smaller than other contemporary affective corpora that have been constructed, such as the Amazon corpus (McAuley and Leskovec 2013) or Twitter as a corpus (Pak and Paroubek 2010), its main strength is that it is controlled, combining pairs of descriptions for the same data. It was ensured that only certain leagues and time frames were collected, while also monitoring non-available texts, to keep the reports comparable. This made it difficult to scrape the texts automatically and called for a manual collection of reports, which also limited the scope. In general, we follow Borgman (2015, p. 4) in believing that “having the right data is usually better than having more data”. However, the corpus is expandable to more seasons, other leagues, “neutral” reports from unrelated newspapers or other countries and in other languages. The latter might also include international matches and the respective reportage (e.g., the World Cup or the European Championship), where it would be possible to investigate cultural differences in (affective) language use by examining the perspectives of two countries instead of two clubs. 
An extension of the corpus with more texts could also make it possible for the classifier approach to find more robust individual features, meaning linguistic features that reliably reflect differences between reports describing a win, tie or loss in the three languages, which cannot currently be detected due to the scarcity of recurring bigrams/trigrams.

In its current form, we believe that MEmoFC will contribute to and improve the generation of sports narratives as a starting point for effectively training NLG systems. For example, based on the corpus, a data-to-text generation system that is able to generate multiple reports for a single match has already been developed (van der Lee et al. 2017). In the future, it could be interesting to look at how authors of match reports select which game events to report on based on the statistics collected for the leagues and seasons of MEmoFC since there might be a bias in the selection process due to the outcome of the match or due to cultural differences that are possibly traceable in the languages.

In a next step, we intend to conduct a laboratory study to directly investigate the effects of negative and positive emotions related to success and failure on language production. Similar to Baker-Ward et al.’s (2005) study that investigated the realization of negative and positive emotions in match narratives of children who played in two teams and participated in the same football match, it would be interesting to create a game setting experiment in which the participants themselves produce the reports on the matches. This setting will enable us to eliminate the issues of unknown authorship and uncertainty about the emotional involvement of the author.

MEmoFC is, to the best of our knowledge, the first corpus to include affective narratives about the same events from different perspectives, across different cultures and languages. The controlled selection process of the reports ensures the quality of the corpus, while still adding up to a respectable number of texts. In this paper, we demonstrated its usefulness both for linguists (e.g., to explore cultural differences in emotions and language production) and NLP/NLG researchers (for practical applications of such differences, see, e.g., PASS by Van der Lee et al. 2017). MEmoFC is available on request for research purposes.