The Grievance Dictionary: Understanding Threatening Language Use

This paper introduces the Grievance Dictionary, a psycholinguistic dictionary which can be used to automatically understand language use in the context of grievance-fuelled violence threat assessment. We describe the development of the dictionary, which was informed by suggestions from experienced threat assessment practitioners. These suggestions and subsequent human and computational word list generation resulted in a dictionary of 20,502 words annotated by 2,318 participants. The dictionary was validated by applying it to texts written by violent and non-violent individuals, showing strong evidence for a difference between populations in several dictionary categories. Further classification tasks showed promising performance, but future improvements are still needed. Finally, we provide instructions and suggestions for the use of the Grievance Dictionary by security professionals and (violence) researchers.


Automatic linguistic threat assessment and grievance-fuelled violence
In the automatic linguistic approach to grievance-fuelled violence, particular attention has been paid to the writings of terrorists and (online) extremists (Baele, 2017; Kaati, Shrestha, & Cohen, 2016; Kop et al., 2019). A few studies examined lone-actor terrorist manifestos for various psycholinguistic variables using the LIWC (Baele, 2017). These studies compared lone-actor terrorist writings to the writings of several different populations, such as non-violent activists (e.g. Martin Luther King, Nelson Mandela), standard control writings and emotional writings (i.e., 'baseline' texts expressing low and high emotionality, respectively), and personal blogs. In several studies, lone-actor terrorist manifestos differed from control texts on a number of LIWC variables. For example, they contained higher proportions of negative emotion words including anger (Baele, 2017), lower levels of positive emotion and friendship words, and more power-related words (Baele, 2017). Similar research focused on 'incel' (i.e. involuntary celibate) forums. Jaki et al. (2019) compared 50,000 messages from an incel forum to 50,000 neutral 'control' texts extracted from Wikipedia articles and random English tweets via LIWC software. Incel messages contained more swear words and negative emotion words, such as those expressing anger and anxiety.
Besides the LIWC, several studies of extremism additionally use custom-made 'expert dictionaries'. For these dictionaries, domain experts are consulted to develop word lists that cover the terms used by a specific population. For example, Smith et al. (2020) developed an expert dictionary for ISIS vernacular after consulting 'terrorism and extremism experts from government and the security and defence sectors' (Smith et al., 2020, p. 6). Figea et al. (2016) developed one for racism, aggression, and worries on a white supremacy forum.
Using the LIWC as well as expert dictionaries, other studies go beyond statistical comparisons alone and classify violent versus non-violent texts via machine learning. In a study of a white supremacy forum, all 73 LIWC categories and three expert dictionaries relating to worries, racism and aggression were used as features (Figea et al., 2016). LIWC categories for religion (e.g. 'Muslim', 'church'), see (e.g. 'view', 'saw') and third person pronouns (e.g. 'they', 'them') proved important linguistic characteristics for classifying racist posts. The LIWC category for anger (e.g. 'hate', 'kill') and an expert dictionary category for aggression were important for recognizing both worries and aggression in the posts, achieving accuracy rates between 80% and 93%. In another effort, classification tasks using LIWC output as predictors distinguished between lone-actor terrorist manifestos, texts written by non-violent activists, texts from personal blogs, forum postings on Stormfront (a white supremacy forum), and personal interest forum postings (Kaati, Shrestha, & Sardella, 2016). In the task distinguishing terrorist texts from Stormfront posts, LIWC categories relating to negative emotion (e.g. 'sad', 'angry'), time (e.g. 'before', 'often'), and seeing (e.g. 'appear', 'show') were important features, yielding a classification accuracy of 90%.
In earlier work, the concepts of hate and violence were measured on American and Middle Eastern dark web forums (Abbasi & Chen, 2007). The authors utilised a custom dictionary containing words and phrases from the forums related to violence and hate (the content of the dictionary was not made available). The results indicated that Middle Eastern forums scored higher than American forums in terms of violence. Forums from both regions did not differ in terms of hate. Similarly, Chen (2008) proposed an automated method for analysing affect within two jihadist dark web forums. Up to 909,039 messages were collected from the forums, of which 500 were utilised to manually construct a dictionary for violence, anger, hate and racism. One of the forums, known to be more radical, was indeed found to contain higher levels of violence, anger, hate, and racism than the other (Chen, 2008).
Custom dictionaries created through expert consultation also potentially suffer from a third limitation in addition to the two noted in the introduction. They are often highly domain-dependent and non-transparent regarding the population of experts consulted. By consulting domain experts (e.g., in right-wing extremism or radical Islam), the dictionaries are attuned to a specific type of violence or extremism. The nature of online communication in these populations is that language is community-specific and constantly changing (Farrell et al., 2020; Shrestha et al., 2017). Some fringe communities may also continuously adapt their language use to evade content moderation filters on social media platforms, which automatically delete or flag posts with specific word use (van der Vegt). As a result, dictionaries would have to be continuously updated to capture the appropriate jargon. Furthermore, custom expert dictionaries are referenced in Abbasi & Chen (2007), Chen (2008), Figea et al. (2016) and Smith et al. (2020), but little is said about what the consultation process entailed and why those consulted can be considered experts. In short, readers are expected to trust the judgment of the researchers and experts without having access to the specifications of the tool.

Transparency statement
The approach to developing the Grievance Dictionary was fully pre-registered before data collection: https://osf.io/szvm7. All data and materials are available on the Open Science Framework: https://osf.io/3grd6/. A user guide for the dictionary can be found there too.

Part I: Dictionary development
The dictionary development consisted of five phases. (1) Threat assessment experts suggested dictionary categories. (2) Human subjects generated seed terms for each category. (3) Computational linguistics methods extended the word list. (4) Human annotators rated candidate words on how well they fit their respective category. (5) The internal reliability of each dictionary category was assessed and its correlation with LIWC2015 categories was computed.
Phase 1: Expert survey
An online survey was sent out to experts within the field of threat assessment. Participants were professional contacts of the involved researchers in the fields of threat assessment and terrorism research. Participants were asked the following: Imagine you are tasked with assessing whether a piece of text signals a threat to commit violence against a designated area, individual, or entity. It may be a physical letter or an online message that you are asked to examine. In short, you are trying to judge whether the person who wrote the text will act on their threat. What do you look for in the text to assess its threat level? Please mention all relevant factors that come to mind.
The response to this question was an open text box, with no word limit. Following this, participants could add any other relevant factors that came to mind (again with an open answer response) and were asked about their professional experience in threat assessment (in years) and with linguistic threat assessment (on a 10-point scale, 1 = no experience, 10 = a lot of experience).
In total, 21 responses were gathered. On average the participants had 16 years of experience with threat assessment (SD = 8.84, range: 2-30 years). Overall, the participants indicated they had significant experience with threat assessment based on language, with a mean score of 8.17 (SD = 2.04, on a scale from 1-10).
Based on the survey responses gathered, it became clear that assessing the threat of violence through language relies on a wide variety of factors. In order to adequately measure these factors, they need to be condensed into psycholinguistic categories (similar to those of the LIWC). The lead author categorised the free-text responses. For example, the concepts 'preparation', 'rehearsal', 'developing capacity', 'refining method', and 'developing opportunity' were all coded as a single category relating to 'planning'. In total, this resulted in 79 categories (available on the OSF). The categories could broadly be defined as relating to the content of a communication (e.g., direct threat, violence, relationship), emotional processes (e.g., anger, frustration, desperation), mental health aspects (e.g., psychosis, delusional jealousy, paranoia), communication style (e.g., unusual grammar, politeness, incoherence), and meta-linguistic factors (e.g., number of communications, font, use of graphics). Lastly, the lead author selected those categories that could feasibly be represented as a psycholinguistic word list, with the category serving as an overarching concept (e.g., including 'weaponry' but excluding 'mentioning target', which is too situation-specific). This resulted in a final selection of 22 categories (Table 1).

Phase 2: Seed word generation
Human subjects generated seed words for each category from Phase 1. A total of 13 participants suggested words for the categories in an online survey. Participants were all PhD students at English-speaking universities (full details of the sample are reported in the supplementary materials on OSF). For each category, participants were asked to write down all the words that came to mind, considering the category as an over-arching concept for the words they noted down. This resulted in a total of 1,951 seed words across categories.
Instructions for the word generation task as well as the resulting words for each category are available in the online materials.

Phase 3: Word list extension
Two processes extended the word list. First, WordNet (Fellbaum, 1998) provided semantic associations for each seed word. This tool is a lexical database of English words grouped into 'cognitive synonyms' of meaningfully related words, which were added to the word lists (e.g. 'knife' is supplemented with 'dagger', 'machete', and 'shiv'). All words related to the initial seed words were added to the list of the respective category. Second, we obtained pre-trained word embeddings for each candidate word using GloVe, an unsupervised learning approach trained on a 6 billion word corpus (Pennington et al., 2014). GloVe represents words as real-valued vectors (embeddings) which aim to encode semantic relationships between individual words based on the contexts in which they appear. This means that words which are similar in meaning have vector representations that are close to each other (based on a similarity measure) in the resulting vector space (e.g. a word embedding for 'gun' appears close to 'handgun', 'pistol', 'firearm', etc. in the learned vector space). For the dictionary, each seed word across all categories was supplemented with its ten nearest neighbours in terms of cosine similarity. After removing duplicates obtained through WordNet and the embeddings, the resulting word list across all categories contained 24,322 words. Note that some words may appear in multiple categories (e.g. 'knife' may appear in both the weaponry and murder categories).
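As a rough illustration of the embedding-based extension, the sketch below finds nearest neighbours by cosine similarity over a toy embedding table; the vectors and vocabulary here are invented stand-ins for the pre-trained GloVe vectors, not the actual data used for the dictionary.

```python
import numpy as np

# Toy embedding table standing in for pre-trained GloVe vectors
# (the real extension used vectors trained on a 6-billion-word corpus).
embeddings = {
    "gun":     np.array([0.9, 0.1, 0.0]),
    "pistol":  np.array([0.8, 0.2, 0.1]),
    "firearm": np.array([0.85, 0.15, 0.05]),
    "banana":  np.array([0.0, 0.9, 0.4]),
}

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_neighbours(seed, k=10):
    """Return the k vocabulary words closest to the seed by cosine similarity."""
    sims = [(w, cosine(embeddings[seed], vec))
            for w, vec in embeddings.items() if w != seed]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in sims[:k]]

print(nearest_neighbours("gun", k=2))
```

In the actual development, each seed word's ten nearest neighbours were added to its category's word list before de-duplication.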

Development phase 4a: Word list rating
Human annotators rated all words obtained through Phase 3 for the extent to which they fit within their respective category. An online task was developed in which participants were presented with a category, a word, and the option to select, on a scale, 'how well the word displayed fits into the above category' (0 = does not fit at all, 10 = fits perfectly). They also had the option to select 'I do not know this word'. After reading the instructions and consenting to participate, each participant rated a total of 100 words (i.e., a random sample of 100 word-category pairs, with each word shown for its associated category only). Participants were recruited through the crowdsourcing platform Prolific Academic and remunerated for their time. They were only eligible to participate if their first language was English. Interspersed between normal items, four attention checks were included (e.g. 'This is an attention check. Rate this word with 9 to continue').
In total, the 24,322 words of the extended word list were rated by 2,318 online participants. A total of 238,366 ratings were obtained; each word received at least 7 ratings, with an average of 9.42 ratings per word. All ratings from participants who failed at least one of the attention checks were removed (1.81%). Words for which the majority (50% or more) of raters indicated that they did not know the word were also removed from the dictionary (0.39%). Following this, all dictionary words were stemmed and the ratings averaged per word stem (e.g. the ratings for 'friendship', 'friendly', and 'friends' were combined into a single score for the stem 'friend-'). This resulted in a final list of 20,502 words.
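The stemming-and-averaging step can be sketched as follows; the ratings are invented, and the crude suffix-stripping `stem` function stands in for a real stemming algorithm (e.g. Porter), which the actual pipeline would use.

```python
from collections import defaultdict

# Hypothetical per-word average ratings (scale 0-10) from the crowdsourcing task.
ratings = {"friend": 8.0, "friendly": 7.0, "friendship": 9.0, "enemy": 6.5}

def stem(word):
    # Crude suffix-stripping stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ship", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Group the ratings by word stem and average them into one score per stem.
by_stem = defaultdict(list)
for word, score in ratings.items():
    by_stem[stem(word)].append(score)

stem_ratings = {s: sum(v) / len(v) for s, v in by_stem.items()}
print(stem_ratings)
```

Here 'friend', 'friendly', and 'friendship' collapse onto the stem 'friend', whose score is the mean of their three ratings.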

Development phase 4b: Scoring methods
Starting from the rated word list, several versions of the Grievance Dictionary can be used; three possibilities are discussed here. The first two rely on proportional scoring based on word counts. Following the LIWC, we may wish to retain only words which received a high rating for belonging to a specific category (Pennebaker et al., 2015). The first version retains only those words which received an average rating of 7 or higher, resulting in a dictionary of 3,643 words; this version is used for evaluation and validation in this paper. An alternative second version retains words with a score of 5 or higher, resulting in a dictionary of 7,588 words. In both versions, texts are scored with the same word-count approach as the LIWC. When the dictionary is applied to a text, each word in the dictionary is searched, and a proportion score for the word (i.e. frequency of the word / all words in the text) and for the overall category (i.e. frequency of all words in the category / all words in the text) is reported.
The third approach relies on average scoring, using the ratings assigned to each word through crowdsourcing. This version of the dictionary makes use of all 20,502 words and their associated average rating, assigning each word match in a text the appropriate weight. To measure each category for a text of interest, the average weight of all word matches per category is reported. While the first version using proportional scoring of words with a mean score of 7 and higher is used in this paper, alternative versions are available on the Open Science Framework.
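The two scoring approaches can be sketched with a minimal example; the one-category dictionary, its ratings, and the `score_text` helper are illustrative stand-ins, not the released tool.

```python
# Toy one-category dictionary: word stems with their average crowd ratings.
weapons = {"gun": 9.1, "knife": 8.7, "pistol": 8.9}

def score_text(text, category, threshold=7.0):
    """Return (proportional score, weighted score) for one category."""
    tokens = text.lower().split()
    matches = [category[t] for t in tokens
               if t in category and category[t] >= threshold]
    # Proportional score: matched words / all words in the text (LIWC-style).
    proportion = len(matches) / len(tokens)
    # Weighted score: mean crowd rating of the matched words.
    weighted = sum(matches) / len(matches) if matches else 0.0
    return proportion, weighted

prop, weight = score_text("he bought a gun and a knife", weapons)
```

With 2 category matches among 7 tokens, the proportional score is 2/7, while the weighted score is the mean rating of 'gun' and 'knife'.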

Development phase 5: Psychometric dictionary evaluation
To assess the quality of the dictionary, it is important to examine the internal consistency of each category by measuring whether the words in each category yield a similar score for the respective category. We compute Cronbach's alpha using the proportional occurrence of each word in the 22 categories for a total of 17,583 texts across four corpora (Table 2). Similar to the development of the LIWC2015, we use a varied selection of texts to compute reliability, including texts from deception detection experiments, novels (Lahiri, 2014), movie reviews (Maas et al., 2011), and Reddit posts (Demszky et al., 2020).
When assessing the reliability of psychological tests, a Cronbach's alpha score of 0.70 or higher is typically considered acceptable (Taber, 2018). Cronbach's alpha ranges from 0 to 1 and is based on the covariance between items, where a score of 1 represents perfect covariance, such that the items adequately measure the same underlying concept. As raised in Pennebaker et al. (2015), assessing the reliability of dictionaries is somewhat more complicated. In language, similar concepts are typically not repeated several times; once something has been said, it generally does not need to be said again. In contrast, similar concepts may be assessed repeatedly in psychological test items. Thus, it has been argued that an acceptable alpha score for dictionary categories will be lower than that for a psychological test (Pennebaker et al., 2015).

Table 2. Corpora used to compute category reliability: number of texts (total word count).

Deception detection texts*
Novels (Lahiri, 2014): 3,036 (247,142,420)
IMDB movie reviews (Maas et al., 2011): 50,000 (13,934,687)
Reddit posts (Demszky et al., 2020): 70,000 (1,081,539)

Note. *Hotel reviews (Ott et al., 2011, 2013), descriptions of past and planned activities.

A psychometric evaluation was performed for each version of the dictionary (words with a rating of 7 or higher, words with a rating of 5 or higher, weighted words). The results reported from here onwards concern the dictionary using words with a rating of 7 or higher, because this version performed best (results for the other versions are available on the OSF). The average alpha scores across corpora are reported in Table 3. The highest reliability of 0.41 was achieved for the category 'soldier', followed by 0.40 for 'god'. The lowest score (0.15) was found for the category 'grievance', which possibly shows that this concept is difficult to reliably measure with the current approach. The average reliability across categories was 0.30 (SD = 0.08), close to the average reliability of 0.34 achieved with the LIWC2015.
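The category-level reliability computation can be sketched as below; `cronbach_alpha` is an illustrative helper applied to toy proportional word scores, not the authors' code or data.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_texts, n_items) matrix of item scores.

    Here each 'item' would be one dictionary word and each row one text,
    with cells holding the word's proportional occurrence in that text.
    """
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)          # variance of each item
    total_var = X.sum(axis=1).var(ddof=1)      # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy data: 4 texts x 3 words; rows covary, so alpha should be fairly high.
scores = [[0.01, 0.02, 0.01],
          [0.03, 0.04, 0.03],
          [0.00, 0.01, 0.00],
          [0.05, 0.05, 0.04]]
alpha = cronbach_alpha(scores)
```

Because dictionary words occur sparsely and are rarely repeated within a text, real category alphas sit far below the 0.70 convention, as discussed above.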
The alpha scores for the LIWC2015 ranged between 0.04 and 0.69, whereas ours ranged from 0.20 to 0.39. In addition to internal reliability, we also assessed whether and how the Grievance Dictionary categories correlate with existing LIWC categories. Although high correlations with a gold-standard dictionary may illustrate that the Grievance Dictionary is comparable to the LIWC in terms of psychometric qualities, we do not expect such a pattern, because the Grievance Dictionary categories were designed to supplement LIWC categories, not replace them.
Reported correlations serve to illustrate which other psycholinguistic concepts measured through the LIWC are related to each respective Grievance Dictionary category. The three highest-correlating LIWC categories for each Grievance Dictionary category are depicted in Table 4 (the full list of correlations is available on the OSF). Overall, correlations were low, suggesting that the Grievance Dictionary does not measure precisely the same constructs as the LIWC. Most Grievance Dictionary categories correlated with LIWC categories one might expect to be psychologically related. For example, several Grievance Dictionary categories such as desperation, frustration, hate, jealousy, paranoia, and violence were positively correlated with the LIWC category negative emotion. Frustration, hate, murder, threat, and violence were also positively related to the LIWC's anger category. These results may suggest that some LIWC categories serve as 'umbrella categories' for some in the Grievance Dictionary. That is, the LIWC can provide measures of more general concepts such as negative emotion, whereas the Grievance Dictionary is suited to giving more granular measures of psychological constructs (e.g., frustration, paranoia) which fall under this overarching category.
Note. All correlations were statistically significant at the p < 0.0023 (0.05/22 categories) level.

Part II: Dictionary validation
The dictionary validation reported in this section serves to assess whether and how the Grievance Dictionary can be used to distinguish between different types of writing, for example neutral language and grievance-fuelled communications produced by terrorists or extremists. We first apply the Grievance Dictionary to different datasets to assess its external validity. Then, we test the performance of the dictionary in classification tasks.

External validity
We apply the dictionary to different datasets to test its validity in the context of grievance-fuelled writings (Table 5). Three tests are performed. First, following previous work on violent language use (Kaati, Shrestha, & Cohen, 2016), we make statistical comparisons between manifestos written by violent lone-actor terrorists and large samples of 'control' texts retrieved from online forums and blogs. Second, we perform a comparison between lone-actor terrorist manifestos and texts from the right-wing extremist forum Stormfront. For the lone-actor terrorist samples, we draw 100-word excerpts from 22 manifestos, resulting in a total sample of 4,572 texts. This 'chunking' is performed so that the average word count of the terrorist manifestos is more comparable to that of the neutral writings and Stormfront posts. For both tests, mean dictionary outcome values of the lone-actor terrorist manifestos are compared to the means of the control samples with an independent samples t-test. The control samples are down-sampled through bootstrapping to match the n of the lone-actor manifestos, with outcome measures reported as an average across 100 bootstrap iterations. We report the effect size for the difference by means of Cohen's d, in addition to the Bayes Factor (BF). The Bayes Factor is a measure of the degree to which the data are more likely to occur under the hypothesis that there is a difference in the dictionary categories between samples, compared to the hypothesis that there is no difference (Ortega & Navarrete, 2017; Wagenmakers et al., 2010). For example, a BF above 10 would constitute strong evidence for the alternative hypothesis that there is a difference (Ortega & Navarrete, 2017). The third comparison is between abusive texts directed at politicians and neutral, stream-of-consciousness (SOC) essays. For this comparison a dependent samples t-test is performed, because individual participants produced both types of text.
Again, effect size d and BF are reported for the difference between the two samples (note that this comparison is not based on bootstrapping). All results are reported in Table 6.
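The bootstrapped comparison procedure can be sketched as follows, with simulated category scores standing in for the real manifesto and control data; only Cohen's d is computed here (the t-test and Bayes Factor are omitted), and all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1))
                     / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

# Toy category scores: 'manifestos' score higher than a large 'control' pool.
manifestos = rng.normal(0.08, 0.02, size=200)
controls = rng.normal(0.02, 0.02, size=10_000)

# Down-sample the majority class to n = len(manifestos) and average the
# effect size across bootstrap iterations (the paper used 100 iterations).
ds = [cohens_d(manifestos,
               rng.choice(controls, size=len(manifestos), replace=True))
      for _ in range(100)]
mean_d = float(np.mean(ds))
```

Averaging over bootstrap draws keeps each comparison balanced while still using the full control pool.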
Overall, statistically significant differences were found for the majority of categories in all comparisons. In most cases, the lone-actor texts scored higher on Grievance Dictionary categories than the control texts. In the first comparison with neutral texts from blogs and forums, the lone-actor manifestos scored higher on all categories except 'fixation' (denoted by a negative effect size d). The evidence for a difference between samples was very strong (BF > 10) in all cases. In the second comparison with Stormfront forum posts, the lone-actor manifestos scored proportionally higher (strong evidence with BF > 10) on the categories deadline, hate, honour, jealousy, murder, planning, soldier, threat, violence, and weaponry. In contrast, Stormfront posts scored proportionally higher (BF > 10) on desperation, fixation, impostor, loneliness, relationship, suicide, and surveillance. For the comparison between abusive writing and stream-of-consciousness texts, differences in favour of SOC texts (BF > 10) were found (denoted by negative d) for the categories deadline, desperation, fixation, frustration, god, grievance, hate, jealousy, paranoia, planning, and suicide. However, the abusive texts contained proportionally more references to honour, impostor, murder, surveillance, and violence (positive d).
Note. A positive d denotes a higher score on the category for the lone-actor terrorist manifestos (tests 1 and 2) and abusive texts (test 3). A BF above 10 (in bold in Table 6) constitutes strong evidence for the alternative hypothesis.

Classification
Previous work classified terrorist or extremist texts against neutral 'control' samples using the LIWC. We investigate whether the Grievance Dictionary can achieve similar results, or increase prediction performance when used to supplement the LIWC.

Classification tasks
In four classification tasks, we examine whether the Grievance Dictionary and the LIWC can distinguish between:
1) Texts written by known terrorists vs. non-violent individuals
2) Texts written by known terrorists vs. non-violent extremists
3) Abusive vs. neutral texts (within-subject comparison of non-violent individuals)
4) An explorative cross-sample classification of extremist forum posts vs. non-extremist forum posts, trained on a dataset of texts written by known terrorists vs. non-violent individuals.
All classification tasks are performed using a Naïve Bayes classifier. In Classification Task 1, we classify lone-actor terrorist manifesto excerpts (n = 4,572) versus neutral posts from blogs and forums (n = 680,792). The majority class of neutral posts is down-sampled to the same n as the manifesto sample by means of bootstrapping (100 times), to allow for a balanced classification task. Classification results are reported as an average across the 100 bootstrap tasks. In Classification Task 2, we classify lone-actor terrorist manifesto excerpts (n = 4,572) versus Stormfront posts (n = 461,950). Following the same procedure as in Task 1, the majority class of Stormfront posts is down-sampled 100 times. In Classification Task 3, we classify abusive vs. neutral, stream-of-consciousness writing with data from van der Vegt et al. (2020), using 789 texts per sample. Note that due to the smaller data size in Task 3 we do not perform bootstrapping, and instead use 80% of the sample as training data and 20% as a test set to report performance metrics. In Classification Task 4, we exploratively train the model on lone-actor terrorist manifesto excerpts (n = 4,572) and neutral posts from blogs (n = 680,292), then test the model on Stormfront posts (n = 500) vs. neutral forum posts (n = 500) and report performance metrics for the latter.
This task aims to replicate a potential real-life setting in which models are trained on known previous terrorist cases and then applied to unseen online data which may contain extremist linguistic material relevant to security professionals.
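As a rough sketch of the balanced classification setup, the example below hand-rolls a minimal Gaussian Naïve Bayes over simulated dictionary-category features with an 80/20 train/test split (as in Task 3); all data are invented, and the bootstrapped down-sampling of Tasks 1 and 2 is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy feature matrices: rows are texts, columns are dictionary categories.
# 'Terrorist' texts score higher on every category than 'control' texts.
X_pos = rng.normal(0.08, 0.02, size=(300, 5))
X_neg = rng.normal(0.02, 0.02, size=(300, 5))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 300 + [0] * 300)

# Shuffle and split 80/20, mirroring Task 3's train/test setup.
idx = rng.permutation(len(y))
split = int(0.8 * len(y))
train, test = idx[:split], idx[split:]

def fit(X, y):
    # Per-class feature means, variances, and priors.
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(y))
    return params

def predict(params, X):
    # Pick the class with the highest Gaussian log joint probability.
    scores = []
    for c, (mu, var, prior) in params.items():
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
        scores.append((c, log_lik.sum(axis=1) + np.log(prior)))
    classes, joint = zip(*scores)
    return np.array(classes)[np.argmax(np.vstack(joint), axis=0)]

params = fit(X[train], y[train])
accuracy = float((predict(params, X[test]) == y[test]).mean())
```

Because the simulated classes are well separated on every feature, accuracy is near perfect, which mirrors how strong feature differences drove performance in the real tasks.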

Feature sets
Each classification task is performed using three different feature sets, to test the performance of the Grievance Dictionary, the LIWC, and a combination of the two in classifying the aforementioned datasets. The following feature sets are used:
a) All 22 Grievance Dictionary categories.
b) All psychological and social categories (N = 55) of the LIWC2015. We exclude grammatical categories from the LIWC, such as pronouns and verbs, because we are interested in the predictive ability of psychological concepts only, and grammatical categories do not appear in the Grievance Dictionary either.
c) A combination of the Grievance Dictionary and psycho-social LIWC categories (N = 77).

Results of classification tasks
Performance metrics for the classification tasks are reported in Table 7. Classification Task 1 shows high performance for distinguishing between lone-actor terrorist texts and neutral texts. In terms of accuracy, the best performing feature set was the combination of the Grievance Dictionary and the LIWC (Task 1c). Specificity and recall also show high values, but the precision of the model is low: the best model is correct 43% of the time when it predicts that a text was written by a lone-actor terrorist. Classification Task 2 similarly shows that the Grievance Dictionary and the LIWC together (Task 2c) are best at distinguishing between lone-actor terrorist texts and Stormfront extremist forum posts; here, precision is lower at 20%. Classification Task 3 shows near perfect classification when using both the Grievance Dictionary and the LIWC, with high specificity, precision, and recall. The Grievance Dictionary alone predicts 78% of the cases accurately. In contrast, Classification Task 4 shows how difficult it is to use training data from one sample (lone-actor manifestos and blog posts) when classifying data from another sample (Stormfront and neutral forum posts). For all three feature sets in Task 4, classification accuracy was around chance level, and the other performance metrics were also sub-optimal.

Explaining high classification accuracies
All in all, classification accuracies were high, with some close-to-perfect performances. We therefore examined feature importance for each task in order to discover whether the model was biased towards some features. The five most important features for each task are reported in Table 8. Feature importance rankings are based on a ROC curve analysis, where a cut-off is defined for each feature that maximizes true positive predictions and minimizes false positives; a larger area under the ROC curve implies larger variable importance (Kuhn, 2008). Tables with ROC values for each feature per task are available on the Open Science Framework.
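The ROC-based importance ranking can be sketched as follows; per-feature AUC is computed via the rank-sum (Mann-Whitney) identity, and the simulated 'soldier' and 'weaponry' scores are only loosely modelled on the mean scores reported for those features, with 'fixation' as an invented uninformative baseline.

```python
import numpy as np

def auc(feature, labels):
    """Area under the ROC curve for one feature via the rank-sum identity."""
    order = np.argsort(feature)
    ranks = np.empty(len(feature))
    ranks[order] = np.arange(1, len(feature) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(2)
labels = np.array([1] * 100 + [0] * 100)  # 1 = manifesto, 0 = control
features = {
    "soldier":  np.concatenate([rng.normal(0.11, 0.07, 100),
                                rng.normal(0.01, 0.05, 100)]),
    "weaponry": np.concatenate([rng.normal(0.09, 0.06, 100),
                                rng.normal(0.02, 0.05, 100)]),
    "fixation": rng.normal(0.03, 0.02, 200),  # uninformative stand-in
}

# Rank features by AUC; scores near 0.5 indicate no discriminative power.
importance = sorted(((name, auc(vals, labels)) for name, vals in features.items()),
                    key=lambda p: p[1], reverse=True)
```

Features with large between-group mean differences dominate the ranking, which is exactly the over-reliance pattern discussed below.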
Features with high importance also showed stark differences in mean proportional dictionary scores between datasets. For example, the most important feature in Task 1a, 'soldier', showed a mean score of 0.11 (SD = 0.07) for lone-actor terrorist manifestos, whereas neutral texts and Stormfront posts scored 0.01 (SD = 0.05) and 0.02 (SD = 0.04), respectively. This large difference between datasets will have contributed to the high prediction performance in this and other tasks, in that the classifier learned to rely heavily on these features. This pattern of feature differences also largely replicates the results observed in the aforementioned Bayesian t-tests, where a decisive difference (BF > 10³) was observed for 'soldier', among other variables. The second most important feature, 'weaponry' (BF > 10³), had a mean of 0.09 (SD = 0.06) in lone-actor manifestos, in contrast to 0.02 (SD = 0.05) and 0.03 (SD = 0.05) in neutral texts and Stormfront posts, respectively. A full table of mean feature scores per dataset is available on the Open Science Framework.
The bias in the model towards specific features fortunately resulted in highly accurate classifications in the first three classification tasks, in that the model learned to over-rely on the features that were most apt at distinguishing between the two groups. This over-reliance on specific features may also explain the large drop in accuracy for Task 4, where the model was no longer able to rely on highly discriminative features in the training set to classify the test set (i.e., because the test set was drawn from a different context).

General discussion
In this paper, we introduced the Grievance Dictionary, a psycholinguistic dictionary for grievance-fuelled violence threat assessment. The aim of this work was to develop a dictionary which can specifically measure constructs relevant to threat assessment, and can be used for a wide variety of violence and extremism fuelled by a grievance. Furthermore, we aimed to address the limitations we identified pertaining to existing psycholinguistic dictionaries.

Linguistic differences
Based on the validation results, we saw that the Grievance Dictionary can elucidate differences between threatening and non-threatening language. Differences in Grievance Dictionary categories were found between texts written by lone-actor terrorists, neutral writing, and extremist forum posts, as well as between abusive language and stream-of-consciousness writing. The evidence for these differences was strong. It must be noted that a high score on Grievance Dictionary categories is not exclusive to threatening and violent texts. In our comparison between stream-of-consciousness essays and abusive writing, the former obtained significantly higher scores for categories such as desperation, fixation, and frustration. Therefore, high scores on single dictionary categories should not be interpreted as individual risk factors for violence, as they may also occur in non-violent texts. Instead, the measures should be interpreted jointly to gain an understanding of the content of a grievance-fuelled text, with particular attention paid to the highly 'violent' categories such as murder, violence, threat, and weaponry. Furthermore, the importance of Grievance Dictionary categories for distinguishing between different populations may also be context-dependent. For example, mentions of a (perceived) romantic relationship may positively predict violence in a threat directed at a public figure, while they may negatively predict violence (a 'linguistic protective factor') in an extremist text. Further research will be needed to establish and replicate differential meanings of Grievance Dictionary categories across contexts.

Classification with the Grievance Dictionary
The dictionary categories were also used to classify different types of writing: terrorist manifestos versus extremist forum posts, neutral versus extremist forum posts, and abusive versus neutral writing. The classification accuracy achieved in this study approximated or outperformed previous work in the violence research domain, for example in distinguishing lone-actor terrorist manifestos from Stormfront posts (0.96 here vs. 0.90 previously). It must be noted, however, that precision (the proportion of predicted positives that were true positives) was sub-optimal, so the results need to be interpreted with caution.
In the classification tasks, large statistical differences between datasets led to highly accurate predictions. It can therefore be argued that the proposed Grievance Dictionary categories (in addition to the LIWC) are discriminative and relevant to the domain of grievance-fuelled violent language. Future work will be needed to ascertain whether the Grievance Dictionary achieves acceptable performance on data for which it cannot rely on such strong feature differences (e.g., violent texts written by individuals who intend to actualise their threat vs. similarly violent texts written by those who do not). Furthermore, when train and test sets were drawn from different samples, classification accuracy was markedly reduced, so the Grievance Dictionary does not yet seem suitable for cross-contextual classification. This is problematic because a system used in a real-life security context may need to classify unseen texts that do not necessarily align with its training data. These results suggest that further work is needed before classification using the Grievance and LIWC dictionaries can be applied in practice.
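A classifier of this kind treats each text's per-category proportions as a feature vector. As a minimal sketch (not the paper's actual pipeline), a nearest-centroid rule over such vectors looks like this; the feature values and class labels below are invented for illustration:

```python
import math

# Toy feature vectors of per-category proportions (e.g. violence,
# weaponry, frustration) for labelled training texts. Values are invented.
train = [
    ([0.10, 0.05, 0.08], "extremist"),
    ([0.12, 0.07, 0.06], "extremist"),
    ([0.01, 0.00, 0.02], "neutral"),
    ([0.00, 0.01, 0.01], "neutral"),
]

def centroids(samples):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def classify(vec, cents):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    return min(cents, key=lambda lab: math.dist(vec, cents[lab]))

cents = centroids(train)
label = classify([0.09, 0.06, 0.07], cents)
```

The cross-sample problem described above corresponds to the centroids shifting between domains: a vector that sits near the "extremist" centroid in one corpus may not in another.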
The results of the classification tasks also illustrate how the Grievance Dictionary and LIWC compare on such tasks. Although the LIWC alone achieved high accuracy on the classification tasks in this paper, the Grievance Dictionary sometimes matched or improved prediction performance when both dictionaries were used together. Even though the Grievance Dictionary did not provide a major improvement over the LIWC, it can provide more nuanced and specific psychological measures for grievance-fuelled language, which may be of particular interest to threat assessment practitioners. This benefit of the Grievance Dictionary also holds in cases where other classification features are used, for example bag-of-words models, part-of-speech tags, word embeddings or bidirectional language models (see e.g. Figea et al., 2016; Neuman et al., 2015; van der Vegt et al., 2020). These methods, which do not rely on a dictionary, may sometimes perform better at classification, but are less explainable than a dictionary. In line with the ALGOCARE framework, a transparent system such as the Grievance Dictionary may be more suitable for the security domain in future.

Usage of the Grievance Dictionary
All things considered, the Grievance Dictionary shows promising results in distinguishing between different types of (non-)grievance-fuelled language. The strong evidence for differences in dictionary measures suggests that the categories elicited from experienced threat assessment practitioners hold value in distinguishing violent from non-violent language. However, although classification results were highly accurate on balance, precision was low and cross-sample classifications did not achieve high performance. In summary, the Grievance Dictionary can be used to make (statistical) comparisons between different text samples, or to gain a general picture of language use in a text sample. Although the Grievance Dictionary may achieve high performance in some classification tasks (i.e., where training and test sets are similar and show strong statistical differences between groups), we do not yet recommend using it for cross-domain classification.
In order to improve (cross-domain) classification performance in future, models need to be trained on additional (larger) training samples, and a deeper understanding of domain-specific differences in dictionary categories will need to be gained. Previous work in which prediction of life outcomes from large datasets failed to achieve high performance suggested that a good understanding of a phenomenon (e.g. shown through causal inference, such as the statistical differences observed in this study) does not necessarily translate to accurate prediction (Garip et al., 2020). Accordingly, the Grievance Dictionary may be used to gain a deeper understanding of grievance-fuelled texts, but is not yet suitable for prediction of real-life outcomes.
Besides application in prediction, the Grievance Dictionary may be of practical use for other purposes in the field of threat assessment and violence research. For instance, it may be used to gain a broad understanding of large-scale online social media data on a user or platform level, or to compare an incoming threatening message to a (police) database of existing communications. Furthermore, the tool opens up the possibility of studying grievance-fuelled language in its full range, where Grievance Dictionary categories can be measured over time, for example to linguistically model processes of radicalisation or extremism, or language in response to specific events (Burnap et al., 2014; van der Vegt, Mozes, et al., 2019; Zannettou et al., 2019).
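One way to operationalise such longitudinal measurement is to average a category score per time window. The sketch below assumes posts have already been scored with the dictionary; the field names and values are hypothetical:

```python
from collections import defaultdict

# Hypothetical timestamped posts with a precomputed 'violence' category score.
posts = [
    {"month": "2020-01", "violence": 0.01},
    {"month": "2020-01", "violence": 0.03},
    {"month": "2020-02", "violence": 0.05},
    {"month": "2020-02", "violence": 0.09},
]

def monthly_means(posts, category):
    """Average a dictionary category score within each calendar month."""
    buckets = defaultdict(list)
    for post in posts:
        buckets[post["month"]].append(post[category])
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}

trend = monthly_means(posts, "violence")
```

A rising series of monthly means for violent categories would be one simple linguistic signal worth inspecting further, though, as noted above, never an individual risk factor on its own.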

Limitations and future work
In the current paper, we have endeavoured to use the Grievance Dictionary to make meaningful comparisons between different types of violent and non-violent texts. Nevertheless, an important problem within the field of linguistic threat assessment persists: it is difficult to disentangle whether statistical differences emerged from indicators of violence and non-violence or from differences in topic. It is arguably not very difficult for the human eye or computer software to distinguish between a violent manifesto about attack planning and a blog post about someone's hobby. Of particular importance is performing linguistic comparisons between violent texts written by individuals who enact violent deeds and similarly violent texts written by individuals not planning to act violently. If and when data from known violent individuals are more widely available, it will be of great interest to assess whether and how differences in Grievance Dictionary categories emerge, as well as how classification tasks perform. Another way to remedy this problem is with more experimental research, where both threat actualisers and bluffers produce texts (e.g. Geurts et al., 2016) which can be assessed with the Grievance Dictionary.
Another limitation pertains to the construction of the dictionary. The seed words on which the dictionary categories are based were produced by human annotators who, to our knowledge, do not have violent ideations. It may therefore have been difficult for participants to produce words about attack planning and weaponry, as they likely have little knowledge of these topics. We tried to partially ameliorate this problem by including word candidates obtained through automatic methods. Nevertheless, future improvements to the Grievance Dictionary may include word candidates obtained by means of a data-driven approach. That is, we may extract seed words from texts which are known to have been written by lone-actor terrorists or other violent individuals.
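Such a data-driven seed word step could, for instance, rank words by how much more frequent they are in a corpus of known violent texts than in a control corpus. A smoothed log-ratio sketch, with toy placeholder corpora, might look like this:

```python
import math
from collections import Counter

def seed_candidates(target_texts, control_texts, top_n=3):
    """Rank words by the smoothed log-ratio of their relative frequency
    in the target corpus versus the control corpus."""
    tgt = Counter(w for t in target_texts for w in t.lower().split())
    ctl = Counter(w for t in control_texts for w in t.lower().split())
    n_tgt, n_ctl = sum(tgt.values()), sum(ctl.values())

    def score(word):
        # Add-one smoothing avoids log(0) for words unseen in one corpus.
        return (math.log((tgt[word] + 1) / (n_tgt + 1))
                - math.log((ctl[word] + 1) / (n_ctl + 1)))

    return sorted(tgt, key=score, reverse=True)[:top_n]

cands = seed_candidates(
    ["the attack is planned", "buy the rifle", "attack at dawn"],
    ["the weather is nice", "buy the groceries", "walk at dawn"],
)
```

Function words shared by both corpora score near zero and drop out, while corpus-specific content words surface as candidate seeds, which human annotators could then vet before inclusion.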
Lastly, the assumption that the Grievance Dictionary categories indeed measure the (psychological) constructs they are designed to measure remains to be tested. For example, we do not yet know whether someone who is experiencing jealousy will also use more words from the jealousy category in the dictionary. This limitation holds for many psycholinguistic dictionaries including the LIWC, and highlights the importance of obtaining ground truth emotion datasets (Kleinberg, van der Vegt, & Mozes, 2020). Alternatively, emotions (and potentially other psychological constructs) can be experimentally manipulated prior to text writing in order to ascertain that the true emotional state of the text author is inferred from text (Kleinberg, 2020; Marcusson-Clavertz et al., 2019). Future work on the Grievance Dictionary and other psycholinguistic dictionaries should therefore focus on measuring or even eliciting psychological processes such as frustration, jealousy, and loneliness, and then testing whether these constructs also emerge in language when the Grievance Dictionary is applied.

Conclusion
The purpose of the Grievance Dictionary is to serve as a resource for threat assessment practitioners and researchers aiming to gain a better understanding of grievance-fuelled language use. Initial validation tests show that differences between violent and non-violent texts can indeed be detected and classified using the dictionary. All information regarding the construction and specifications of the dictionary is available to researchers and practitioners, so that the capabilities and limitations of the Grievance Dictionary can be adequately scrutinised. Even though future research will be needed to ascertain the utility of the dictionary in other contexts (such as violent texts from authors with no violent intent), we hope the current work serves as an impetus to gain a better understanding of grievance-fuelled language by automatic means.