1 Introduction

In the pursuit of uncovering investigative leads to understand criminal behaviour and solve crimes, threat assessors and law enforcement officers recognise the importance of scrutinising diverse forms of evidence [4]. One often underestimated form involves analysing the written and spoken words of criminals (ibid.). Recognising the profound influence of the ‘word’ as the “ideological phenomenon par excellence”, which serves various functions [5, pp. 13–14], this study focuses on enhancing the investigative efficacy of linguistic analyses within forensic contexts, rather than solely on their evidential value. This includes exploring the power of words in terrorism and terrorist communications (for more on the legal definitions of terrorism, see, e.g. [6]). Deciphering criminal intent and understanding terrorist communication requires innovative approaches to text analysis and threat assessment, a multidisciplinary process involving a detailed examination of threatening communications to assess their genuineness and potential harm (e.g. [7, 8]). This article builds on the notion of burstiness [9] (detailed in Sect. 2.2) and introduces the concept of ‘conceptual burstiness’ to sociolinguistic profiling, demonstrating how language features can uncover investigative leads and reveal useful characteristics about an author, informing investigative strategies within the context of terrorism. The term ‘investigative leads’ refers to clues or pieces of information derived from analysing the burstiness and categorisation of frequently occurring lemmas and their semantic-field categories within the dataset. Such leads help threat assessors and law enforcement officers understand, for example, terrorists’ agenda, ideological schema, ideology or master identity, violent persona and potential for violence.

Forensic investigators engaged in the examination of authorship dynamics within forensic contexts encounter a plethora of inquiries and methodologies. These inquiries predominantly revolve around questions such as (i) ‘How was the text produced?’, (ii) ‘How many people wrote the text?’, (iii) ‘What is the relationship of a queried text with comparison texts?’ and (iv) ‘What kind of person wrote the text?’ [10, p. 215]. To effectively address inquiries related to the first three questions, analysts tend to differentiate between different types of authors and their roles in the authorship process—such as executive authors (i.e. the wordsmith) and declarative authors (i.e. the one who delivers/speaks out the text or places their name upon the title-page of a text/document to validate it) and who is involved in each (e.g. [11]). Approaches to the relationship between a queried text and comparison text traditionally belong to [12]:

  • The cognitive approach: which emphasises the theory of idiolect and statistical analyses (see e.g. [13, 14]).

  • The forensic stylistic approach: which prioritises identifying distinctive style markers that are observed as habitual choices in linguistic structures and influenced by the language user’s socio-cultural background taking up the sociolinguistic view of style [15, 16].

  • A combination of cognitive and forensic stylistic approaches: which also integrates a functional approach using systemic functional linguistics and corpus linguistics tools [12].

Central to the present exploration is the fourth authorship question: ‘What kind of person wrote the text?’. In response to this query, investigators can harness various forms of expertise and employ two distinct methods in linguistic profiling, contingent upon the insights gleaned from language analysis:

  • Psycholinguistic profiling: This approach delves into identifying the kind of psychological person who wrote a text [10], their ethical and social motives for violence, and their style as attentional patterns emerging in the text [17, 18].

  • Sociolinguistic profiling: Traditionally focused on delineating the social characteristics of the author, this method scrutinises the author’s linguistic proficiency, educational background (inferred from, e.g. language complexity), and demographic indicators (e.g. race, age, sex, dialect, geographic origin, occupation, and religious orientation) [1, 10].

This traditional focus in sociolinguistic profiling follows the early waves of variation studies [19]. The first wave focuses on speech and correlations between social categories (e.g., socioeconomic stratification) and linguistic variables (e.g., regional language differentiation), exemplified by Labovian research in the United States. The second wave considers speaker agency and uses ethnographically determined social categories and cultural norms to explain variation (e.g., using the ‘ng’ variable to create identity and authority). The most recent third wave relatively detaches social meanings from traditional macrosocial categories and focuses more on examining how linguistic practices position speakers, linking features to specific social meanings, focusing on ideology, stance-taking, social identity, value construction, social networking and affiliation. This article falls within the third wave, where:

  • Variation in language use is analysed as part of stance-taking (e.g., violent stance-taking signalled by words like ‘kill’, positions from other voices signalled by saying verbs), correlated with social variables like ideology, race, religion, and social networks. Jihadist texts often use violent and religiously charged language to position the speaker within an ideological framework.

  • Style involves specific design choices and an ideologically grounded perspective [20]. Extremists often use coded language to signal ideological stances while evading incrimination. Repetitions in lexical choices reveal (sub) topics contributing to message cohesion and register (e.g., characterising a discourse of violence) [21].

  • Language choices, symbolic power (e.g., religion), and the relationship between semantic style and social ideology reflect learned predispositions toward out-groups, relying on agents' schemes of perception and appreciation [22]. Bourdieu contends that what circulates on the linguistic market is not language per se, but rather discourses marked by distinctive styles. Terrorist texts use symbolic language to resonate with in-group members while marginalising out-groups.

  • Sociolinguistic structures encompass ideological dimensions that imbue language variation with social meaning, and texts are products shaped by their socio-historic and socio-political context, including ideological concepts used therein [22]. For jihadist and far-right extremist texts, historically and politically loaded terminology reinforces social and ideological agendas.

Sociolinguistic profiling, undertaken in the present article, contributes to the first two steps in a successful investigation of terrorism cases, which usually follow three steps [23]:

  • Find reasons to suspect a terrorist activity (that is happening or likely to happen, as in a negotiation of a violent action as a solution to a sociopolitical issue).

  • Collect evidence (of, e.g., violent intentions, predispositions, schemas, agenda).

  • Evaluate this evidence and decide whether it is useful for trial.

The article showcases not only how a lexical semantic profile of each author offers clues to violent register, encoded ideological motivations and perceptions, agenda, and concern about in-group master identity and interest, but also how some of the emerging patterns map onto the TRAP-18 framework categories ([3], detailed in Sect. 3.2), to explain how such mapping can assist in profiling terrorist threats.

In linguistic profiling, investigators aim to build a speaker’s or writer’s profile by inferring their social characteristics (e.g. ideology, race, age, educational background, extremist persona, threatener argumentative tactics) from linguistic features, and often provide evidence of consistent or comparable features and styles across samples, rather than identifying the exact individual author as in traditional authorship attribution analyses. This helps narrow down the pool of potential suspects and characteristic language use. Historically, according to Shuy [1], linguistic profiling used simple phonetic tests, like word pronunciation, seen in ancient examples such as the Old Testament and the Revolt of the Sicilian Vespers. Contemporary linguistic profiling has evolved from these roots and integrates dialect geography, lexicography, sociolinguistics, historical linguistics, and psycholinguistics. This multidisciplinary approach aids law enforcement in understanding criminal intentions and narrowing down suspects, especially in cases involving hate mail or threats [1]. More relevant to this article, for Shuy [23], a focus on ideologically imbued choices of language such as content words (i.e. lexical items) can provide clues to topics as macro themes talked about in discourse, and to “the cognitive substance of discourse”, i.e. “agendas”, and can provide useful clues to participants’ motivations, predispositions and intentions [23, p. 447]. Additionally, in criminal cases, much attention is placed on the individual words or patterns of words, phrases, and sentences in which “alleged smoking gun evidence” is thought to be found, and this is often “a good place to find evidence of criminal intent” and schemas (i.e. how criminals think about what they talk about) [23, p. 447, emphasis added]. It is important to reiterate here that linguistic (criminal) profiling provides investigative rather than evidential value: patterns in linguistic choices (e.g., commands to “kill”) provide grounds to suspect violent intentions and a predisposition to violence [1], given that a stance taken towards others is an intentional act and authors might write these threats and incitement messages voluntarily. However, intentionality is not a linguistic issue per se, especially when the targets are only suspects: the government agencies’ “task is to determine intentionality and predisposition”, and it is “the language used by the suspects, [interrogators and] police interviewers, and lawyers that frame the issues of intentionality, predisposition, and voluntariness” [1, p. 89]. The discussion of criminal intent is beyond the scope of this article.

The present study enriches the methods used in sociolinguistic profiling of violent extremist texts by introducing a corpus-based approach. This method utilises word frequency and concordance line analyses to extract investigative leads related to extremist violent behaviour, agenda, and ideological background. It introduces the concept of ‘conceptual burstiness’ in lexical choices, elaborated in Sect. 2, elucidating the tendency of certain thematic elements to recur with heightened frequency and coherence within a textual corpus. Burstiness within this framework serves as a sign of extremist discourse, its functions, authors’ background, and underlying violent intentions in terrorist communications.

This focus on repeated themes, concepts, and lexical items resonates with recent research on terrorist threatening communications (e.g. [24, 25]), which has identified properties such as thematic focus, lexical selection, attitudes, rhetorical force, presuppositions, assumptions, values, and characteristic use of linguistic, stylistic, and rhetorical devices [26, 27, 28, 29]. Identifying the ‘appraisal signature’ (i.e. evaluative language style) of a violent extremist criminal [24, 27, 30, 31], characteristic dehumanisation and identity attacks [32], argumentative and justification tactics [17, 30], and discursively constructed victims and purposes of violence [33] all contribute to threatener profiling. Regularities in a criminal’s linguistic (namely, lexical) choices, such as those of Ted Kaczynski, also play a role, and these are the focus of this study.

The article builds on Coulthard and Johnson’s [14] argument for the usefulness of scrutinising repeated concepts, lexical items, and fixed phrases, as demonstrated in the case of Ted Kaczynski in the Unabomber case (discussed in Sect. 3). This study similarly argues that such repetitions and regularities in language choice can yield useful information about a criminal’s ideology, violent inclinations, and concerns such as power relations, grievances, and moral outrage, at least in the author’s cognition. By analysing dissimilarities in language patterns across authors with different ideologies, the present study also supports the contention that contextual differences, including geography and sociocultural background, affect language patterns in forensic texts [12]. Terrorist ideologies, particularly the religious and ethnonationalist ideologies examined in this study, profoundly influence linguistic choices. These choices, when reverse-engineered, can reveal valuable insights about an author’s background.

Empirically grounded, this study draws upon a dataset comprising 20 public statements attributed to notorious figures associated with terrorist organisations or a violent extremist ideology: Osama bin Laden (al-Qaeda), Shekau (Boko Haram), al-Baghdadi (ISIS), and Brenton Tarrant (far-right extremist). These texts epitomise the ideologies underpinning two of the most lethal forms of extremism: jihadism and far-right extremism [34]. Given the transnational nature of cyber-terrorist communications and their propensity to transcend jurisdictional boundaries, this research assumes paramount importance within the realm of international law enforcement practices, cyber-communicated crimes, and threat assessment protocols [23, 35].

2 Criminal Behavioural Profiling, Sociolinguistic Profiling, and Proposed ‘Conceptual Burstiness’ Method

Investigative leads could be extracted from not only words but also multimodal texts, given that threatening aspects of terrorist communication function through (i) words and affect (i.e. emotions) [36], and (ii) ethos and appeal to the audience’s logos and pathos [27, 37], and aim to appeal to all senses through various multimodal channels [38]. Thus, linguistic and multimodal text analysis contributes to uncovering the nuances of terrorist acts, authorship, style, and profiling. The present study is not a multimodal study. It focuses on words in sociolinguistic profiling, given the focus on transcribed public statements and written manifestos.

In contrast to linguistic profiling, criminal behavioural profiling focuses on analysing behavioural characteristics to identify as-yet-unidentified criminals [1]. Originating from the Federal Bureau of Investigation’s Behavioural Science Laboratory, this approach relies on psychology, criminology, and interpretations based on available crime data. While sociolinguistic profiling emphasises language variation and sociolinguistic features, behavioural profiling delves into behavioural traits, such as patterns of behaviour, psychological profiles, and arguably subjective interpretations of crime evidence (ibid). Both methods aim to assist law enforcement agencies in solving crimes, but they differ in their focus and methodologies: sociolinguistic profiling focuses on language analysis, while criminal behavioural profiling emphasises behaviour and psychology.

Linguistic profiling thus helps narrow down the pool of suspects by shedding light on aspects such as political beliefs, social status, power relations, and ethnicity. These insights into the social features of individuals are particularly evident in cases involving hate mail or threat messages.

2.1 Speaker and Writer Sociolinguistic Profiling

The literature [4] highlights the utility of both written and spoken words of criminals in criminal profiling, leading to a focus on speaker profiling and written-text writer profiling within sociolinguistic literature. Speaker sociolinguistic profiling, as discussed in the review by Schilling and Masters [39], involves inferring various attributes of a speaker from their linguistic characteristics. These attributes, including sex, gender, age, sociolect, accent, dialect, and medical conditions, aid in narrowing down potential suspects by analysing linguistic features associated with, for example, regions, social groups, or unusual pathologies. Tools such as aural-perceptual analysis, acoustic–phonetic analysis, and automated analysis are employed in this process, sparking a debate between human analysis and automated methods, with human analysts often preferred for discerning regional and phonetic variation. However, challenges such as deliberate voice and dialect disguise, amplified by emerging AI tools like voice-cloning technology (e.g. ElevenLabs' VoiceLab, Resemble.AI, Speechify, and VoiceCopy), underscore the necessity for rigorous analysis to ensure accurate speaker identification, particularly in contexts like deep-faked kidnapping, where technologies like ChatGPT enable voice spoofing and impersonation [40]. This study focuses on writer sociolinguistic profiling.

The significance of writer sociolinguistic profiling is best exemplified by Shuy’s seminal work in two forensic cases [1]. Firstly, in the Unabomber case, linguistic analysis of handwritten notes and a manifesto aided the FBI in narrowing down suspects, ultimately leading to the identification of Ted Kaczynski, who carried out a series of bombings between 1978 and 1995 targeting universities, airlines, and other locations (hence, the Unabomber). Unlike behavioural profiling, sociolinguistic profiling in this case relied solely on existing language evidence (language accompanying bomb letters and manifesto) rather than behavioural comparisons. The analysis revealed clues about the author’s geographical background, education, age, occupation, and ideological concepts. Although it did not provide an exact identification, linguistic profiling refined the FBI’s profile and contributed to Kaczynski’s arrest and sentence to life in prison (until his death in 2023). Secondly, in the Gary, Indiana Women’s Medical Clinic case, linguistic profiling of bomb threat messages helped identify linguistic clues indicative of the author’s background, including potential ties to a former British colony and gender differences suggesting a female author. Linguistic analysis matched the writing style of the clinic director, leading to her confession. These cases highlight the instrumental role of sociolinguistic profiling in forensic investigations, particularly in cases involving threatening communications. Writer sociolinguistic profiling emphasises the importance of linguistic expertise in intelligence and law enforcement investigations, especially in analysing manifestos and public statements of violent extremists (as in the Unabomber case), an area where further research is needed and to which the present study contributes.

Akin to the Unabomber case, the present article specifically contributes to the sociolinguistic profiling of manifestos and transcribed public statements communicated by violent extremists. Research on sociolinguistic profiling, particularly through the lens of ‘conceptual burstiness’, remains relatively limited, a gap that the present study helps to address.

2.2 Conceptual Burstiness in Terrorism-Related Sociolinguistic Profiling: A Corpus-Methods Assisted Approach

This study contributes to the methods employed in sociolinguistic profiling of violent extremist texts by demonstrating computer-based concordance analysis, supported by word frequency analysis, for extracting investigative leads related to suspected violent extremist behaviour. It suggests the presence of a “burstiness” phenomenon among the most frequently used lexical choices, characterised by their collocational strength and the repetition of their super- and sub-ordinate conceptual categories. The term “conceptual burstiness” is used to denote the tendency of words associated with certain topics to recur and co-occur more frequently than others in texts. This increased recurrence of lexical items with close semantic proximity serves as an indication of the violent extremist discourse and its discursive intentions within terrorist texts. As such, unlike studies that have focused on the temporal distribution of repeatedly used words to account for their burstiness in, for example, a talk [41], this article falls within the line of research that focuses on repeated terms at the lexical level and how these terms collocate with themselves and set up textual cohesion in discourse [42].

Burstiness, as typically conceived in the statistical natural language processing literature, is used by Pierrehumbert [1, 43] “to designate the tendency of topical words to occur repeatedly in bursts, separated by lulls in which they do not occur because different topics are under discussion” (emphasis added). Brookes and McEnery [42, p. 359] characterise burstiness similarly, noting that “once they [i.e. content words] have been used they are likely to be used again but in this case in close proximity to the original mention.” The present article contributes to this line of research by extending the concept of burstiness to the repetition of the same concept (semantic category) through the use of identical or different words in close proximity (within a sentence or paragraph). This article focuses on two discernible patterns of burstiness:

  • The persistent use of a particular lexical item or term repeated in close linguistic context (e.g., within a sentence or paragraph) following its initial mention in discourse.

  • The repetition of a similar superordinate semantic category or concept (e.g., ‘physical violence’, referring to the Roget thesaurus) expressed through the repeated use of the same lemma (e.g., “fight”) or different lexical items/lemmas (e.g., “fights,” “kill,” “cut,” “harvest + neck,” etc.) in close proximity (i.e., within a sentence or paragraph, inciting or threatening violent actions against out-groups).

The focus on the latter pattern adds to the existing literature on burstiness by building on the approaches to ‘meaning extraction’ [44]. These approaches use semantic aggregation as an investigative method to determine the major themes (i.e. topics) in a dataset based on the co-occurrence of high-frequency content words indicative of the discussed topics and agenda. This study reports on these semantic groupings and then examines the two patterns of conceptual burstiness mentioned above in detail to showcase a useful and more comprehensive picture of an author’s lexical semantic profile.
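The two patterns of burstiness listed above can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the mini-lexicon, the stopword list, the function names, and the threshold of two repetitions per paragraph are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: word form -> semantic category (illustrative only;
# the study refers to the Roget thesaurus for its categories).
CONCEPT_LEXICON = {
    "fight": "PHYSICAL_VIOLENCE",
    "fights": "PHYSICAL_VIOLENCE",
    "kill": "PHYSICAL_VIOLENCE",
    "cut": "PHYSICAL_VIOLENCE",
}

# Tiny illustrative stopword list so function words are not counted as bursts.
STOPWORDS = {"they", "us", "and", "we", "them", "some", "others", "the", "a"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def bursts_in_paragraph(paragraph, min_repeats=2):
    """Return (lexical, conceptual) bursts found in one paragraph."""
    tokens = [t for t in tokenize(paragraph) if t not in STOPWORDS]
    word_counts = Counter(tokens)
    concept_counts = Counter(
        CONCEPT_LEXICON[t] for t in tokens if t in CONCEPT_LEXICON
    )
    # Pattern 1: the same lexical item repeated within the paragraph.
    lexical = {w: n for w, n in word_counts.items() if n >= min_repeats}
    # Pattern 2: the same concept repeated, via identical or different words.
    conceptual = {c: n for c, n in concept_counts.items() if n >= min_repeats}
    return lexical, conceptual


sample = "They fight us and we fight them. Some kill, others cut."
lexical, conceptual = bursts_in_paragraph(sample)
print(lexical)      # {'fight': 2}
print(conceptual)   # {'PHYSICAL_VIOLENCE': 4}
```

In this toy paragraph only one word form recurs, yet the concept is realised four times, which is the kind of difference the second pattern is meant to capture.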

Conceptual burstiness, as such, adds to approaches to ‘meaning extraction’ [44] for investigative purposes, in which the high-frequency content words (i.e. lexical items, including adjectives, adverbs, nouns, and regular verbs, as in racial, racially, race, and racialise, respectively) are those most indicative of the writing topics. This study focuses on the semantic feature of burstiness, wherein groups of the most repeated lexemes with close semantic proximity (e.g. kill, die, attack) are categorised under the same semantic fields or concepts (e.g. violence: violent action, or violent cause). See Sect. 3 for more specific categories and elaboration on the analytical procedures. The repetition of these groups of lexemes is taken as a marker of the discursive purpose, affiliation, violent means, agenda, and ideology within terrorist texts, providing valuable insights into the “thought and activity” [41, p. 1] of violent extremists. The increased use of words of particular semantic proximity is taken as a result of an author’s modulation of their style to suit the communication situation and their audiences (e.g. [45]), particularly in the context of terrorism.
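The grouping of repeated lexemes under a shared semantic field can be sketched as follows. The two lookup tables and the function name are hypothetical and hand-built for illustration; in practice the word forms would come from the corpus word list and the fields from the categories elaborated in Sect. 3.

```python
from collections import Counter

# Hypothetical lookup tables (illustrative assumptions, not the study's lists).
FORM_TO_LEMMA = {
    "kill": "KILL", "kills": "KILL", "killed": "KILL", "killing": "KILL",
    "die": "DIE", "died": "DIE",
    "attack": "ATTACK", "attacks": "ATTACK",
}
LEMMA_TO_FIELD = {"KILL": "VIOLENCE", "DIE": "VIOLENCE", "ATTACK": "VIOLENCE"}


def semantic_profile(tokens):
    """Group lemma frequencies under their semantic field, most frequent first."""
    lemma_counts = Counter(
        FORM_TO_LEMMA[t.lower()] for t in tokens if t.lower() in FORM_TO_LEMMA
    )
    profile = {}
    for lemma, count in lemma_counts.most_common():
        profile.setdefault(LEMMA_TO_FIELD[lemma], []).append((lemma, count))
    return profile


tokens = ["killed", "attack", "kills", "die", "killing", "attacks"]
print(semantic_profile(tokens))
# {'VIOLENCE': [('KILL', 3), ('ATTACK', 2), ('DIE', 1)]}
```

Counting at the level of the lemma rather than the word form is a design choice here: it lets different inflections contribute to the same entry before the entries are aggregated into a field.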

By examining the most frequent lexical items in a dataset using corpus methods, this study enables the description of the lexicogrammar and semantics of violence, as well as the deduction of useful information about the authors of texts engaged in illegal activities (e.g., [25]). This method also provides semiotic cues, such as lexical choice-grounds, which facilitate the characterisation of terrorist texts, elucidating the author’s ideological inclinations, affiliations, and motivations, particularly their preoccupation with violent means. The article argues that the discernible burstiness feature can indicate the author’s fixation on violent ideologies, motivations, affiliations, and agendas, including the pursuit of political and economic dominance. In essence, this study aids in exploring characteristics of terrorists’ topics and ideologies, contributing to the sociolinguistic profiling of terrorist texts. The concept of conceptual burstiness employed here parallels Brookes and McEnery’s [42] approach, focusing on the use of repeated words to uncover ideologies, claims to symbolic power and collocations setting up textual cohesion, but extends beyond jihadist discourse to encompass far-right terrorist texts. Unlike previous research outside forensic linguistics, which primarily focused on the burstiness of repeated terms at the lexical level, this study explores conceptual burstiness across jihadist and far-right terrorist texts to extract information for forensic investigation purposes.

The corpus tools and computational techniques used to analyse conceptual burstiness contribute to the toolbox available to forensic analysts, enabling the exploration of patterns of lexical forms and their semantic categories that give rise to conceptual burstiness in terrorist texts. This attention to conceptual burstiness as a discursive marker and a source of intelligence and clues to terrorists’ ideology, agenda, violent means, and affiliation is relatively limited in forensic linguistic research. The study of burstiness and its potential in terrorism-related forensic investigations remains underexplored despite the acknowledged usefulness of burstiness in general text authorship tasks [46], in marking genres, topics and authors, and in information and document retrieval [9, 43]. This study showcases the investigative potential of this corpus method in forensic contexts. A corpus-assisted approach to examining the most frequent lexical items in a dataset is a “powerful tool” to survey the content of the texts and what they are about [47, p. 135; 48] and to reveal the author’s “norms of language use”, which are “largely expressed in recurring collocations of words” and phrases that contribute, inter alia, to text cohesion [48, p. 304].

In addition to the semantic conceptual categories of repeated words, which are detailed in Sect. 3, the study is concerned with close analysis of concordance lines of frequent lemmas. A lemma is a lexeme or dictionary headword that is realised by a word form (e.g. KILL, which can be realised by the word forms kill, kills, killed, killer, and killing, given in lower-case italics). A lemma or any of its forms is referred to as the node (N). A word form or lemma can have different collocational behaviour within a given span (e.g. N−1 = one word to the left of the node, N+1 = one word to the right) [48]. For example, the lemma RACE in ‘our race’ and ‘racial enemy’ collocates within a span of 1:1 (i.e. one word to the left of the node and one word to the right of the node, respectively). Repeated combinations of n words (e.g. repetition of ‘our race’) are referred to as n-grams. This study follows Stubbs’ convention of upper case for a lemma and lower-case italics for its different forms. Unlike Stubbs, who reserves diamond brackets (< … >) for typical collocates of a node, this study uses diamond brackets for a set of semantically close lemmas that give rise to the typical bursty concepts. To distinguish lemmas from a bursty concept, the semantic (sub)category is in bold uppercase, such as VIOLENCE <KILL, ATTACK, DIE, …>. To make visible the repeated (divisive ‘We’ vs. ‘They’) co-occurrences on the syntagmatic axis of the individual concordance lines, a concordance line analysis is carried out to shed light on “the typical lexicogrammatical frames in which [repeated] word occur” [48, p. 316], such as ‘our race’ versus ‘their race’. The next section shows the exact method used in the present study.
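The notions of node, span, and n-gram described above can be made concrete with a short Python sketch. The function names and the toy token list (built only from the phrases ‘our race’, ‘their race’, and ‘racial enemy’ mentioned in the text) are assumptions for illustration and do not reproduce the software used in the study.

```python
from collections import Counter


def collocates(tokens, node_forms, left=1, right=1):
    """Count words found within `left`/`right` positions of any node form."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token in node_forms:
            window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
            counts.update(window)
    return counts


def ngrams(tokens, n=2):
    """Count contiguous sequences of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy token sequence; the forms of the lemma RACE act as the node.
tokens = ["our", "race", "their", "race", "our", "race", "racial", "enemy"]
race_forms = {"race", "racial"}

print(collocates(tokens, race_forms)["our"])   # 3 (span 1:1 around each node)
print(ngrams(tokens, 2)[("our", "race")])      # 2
```

Widening `left` and `right` changes the span, so the same function can be used to compare collocational behaviour at N−1/N+1 with that at larger windows.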

Forensic linguists need computer tools empowered with analytical techniques to assist in identifying and counting linguistic features within corpora, be they small or large [49]. Corpus linguistics (CL) analysis tools support empirical approaches whereby target texts can be analysed to identify patterns of meaning and language structure use [50]. These tools analyse language based on two main principles, empiricism and technology, which enable a valid, reliable, replicable and more objective exploration of language features [50]. One of the most widely used CL tools is AntConc [2], a free software package. This study uses the following AntConc features in particular to illuminate dimensions of discursive features and meaning that are useful to this investigation:

  • A ‘word list’ option: which assists in identifying word frequency.

  • Concordance plot: which serves to identify the position of words in the contour of discourse, intra-textually and inter-textually.

  • Key-word-in-context (KWIC) concordances: which help to analyse the context, i.e. word(s) to the left and/or right of frequent words, including contiguous sequences of words (n-grams and collocates).
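For readers who wish to see what these three features compute, the following Python sketch approximates each of them on a neutral toy sentence. It is a simplified stand-in written for illustration, not AntConc's implementation; the function names, the tokenisation rule, and the rounding of positions are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_list(tokens):
    """Word list: word forms ranked by frequency."""
    return Counter(tokens).most_common()


def kwic(tokens, node, width=2):
    """KWIC lines as (left context, node, right context) tuples."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def plot_positions(tokens, node):
    """Relative positions (0 to 1) of the node, as in a concordance plot."""
    return [round(i / len(tokens), 2) for i, t in enumerate(tokens) if t == node]


text = "The word appears here. Then the word appears again near the end."
tokens = tokenize(text)
print(word_list(tokens)[0])            # ('the', 3)
print(kwic(tokens, "word"))
print(plot_positions(tokens, "word"))  # [0.08, 0.5]
```

The relative positions show where in the text a word clusters, which is the information a concordance plot conveys visually.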

3 Methodology

3.1 Data

This study analysed twenty terrorist public statements, constituting a specialised corpus of 40,000 words [18]. The selection of texts and corpus size aimed to provide a representation of transnational terrorist ideologies and organisations, encompassing far-right terrorism and jihadist-based terrorism. Forensic linguists typically analyse provided data, irrespective of its size, to scrutinise language used as evidence [51]. This study, therefore, does not seek to analyse an author's overall texts diachronically, nor does it aim to investigate a single author’s overall linguistic features throughout their entire writings. It is a showcase of the use of the concept of “conceptual burstiness” in a set of terrorist public statements that were sourced online and are not privately owned data.

The texts span the period following the 9/11 attacks on the USA up until the Christchurch, New Zealand, attacks in 2019, and originate from diverse geographical, socio-cultural, and political contexts, as well as various ideological backgrounds. Adherents of these ideologies are globally recognised as the most lethal actors [34]. The texts were attributed to four terrorists associated with extremist groups (see Table 1 for details) [18, 27]. Firstly, Osama bin Laden (OBL) (al-Qaeda) dedicated himself to a violent, global Salafi-jihadist struggle against the West, primarily targeting the United States. OBL’s eight texts, originally communicated in Arabic between 2001 and 2006, were translated by credible sources like the CIA Foreign Broadcast Information Service reports, with translations verified by the Author, a native Arabic speaker and accredited English-Arabic translator. Secondly, Abubakar Shekau (Boko Haram), a proclaimed Salafi-jihadist Nigerian, responsible for a narrative recruiting violent actors on a large scale, posed a threat to Nigeria’s federal government and neighbouring countries. Shekau’s nine texts, produced in local Nigerian languages between 2012 and 2018, were translated into English by reputable Nigerian media outlets, such as the ‘Sahara Reporters’ and ‘Premium Times’ websites. English translations of OBL’s and Shekau’s texts were utilised. Thirdly, the former Islamic State in Iraq and Syria (ISIS) leader, Abubaker al-Baghdadi, communicated his texts in Arabic between 2016 and 2018, with translations into English made available by ISIS’s al-Hayat Media Centre and ISIS’s English-language magazine, Rumiya. Al-Baghdadi’s narrative targeted ISIS members and appealed to the young, particularly those discontented and resentful against authority in Iraq and Syria. Fourthly, Brenton Tarrant, a far-right, ethno-nationalist white supremacist from Australia, shared the manifesto ‘The Great Replacement’, used in this study. The manifesto, published in English prior to Tarrant’s attacks on two mosques in Christchurch, conveyed threatening messages against, for example, Communists, Antifa, Marxists, and Turks, and incited Christians and European men against, for example, immigrants, Muslims, and democrats.

Table 1 Overview of the dataset

It is worth noting the potential influence that translations of OBL’s, Shekau’s, and al-Baghdadi’s texts might have on the accuracy in conveying socially and culturally shaped lexical and semantic choices, thus influencing the total number of these choices captured in the analysis. To enhance faithfulness to the source texts and accuracy, the Author of this article, an experienced English-Arabic translator, checked the faithfulness of the translations in conveying lexical choices and adjusted where needed to maintain consistency in the use of terms and their sociocultural references. For example, the Arabic words ‘murtad’ and ‘iman’ were not always translated as ‘apostate’ and ‘faith’ respectively; they occasionally appeared in their Arabic transliteration as ‘murtad’ and ‘iman,’ as in al-Baghdadi’s texts. Based on this check, the Arabic transliteration was added to the English translation to make one entry/lemma.

To facilitate computational text analysis, the texts were cleaned before processing: the dataset was spell-checked, contracted forms were written in full and normalised across texts, and website links and translator interpretations were removed. Quotations not embedded in an author’s own words, such as lengthy religious texts in the jihadist texts, were also removed to focus on the terrorists’ words and style. Post-cleaning, the final dataset was normalised to 10,000 words per author (i.e. per sub-corpus). Regarding Tarrant’s text, the following sections from his manifesto were the focus of the analysis: ‘Introduction,’ ‘Addresses to various groups,’ ‘General thoughts and potential strategies,’ and ‘In conclusion.’ This focus was chosen because the burstiness analysis was part of a larger qualitative project aiming at a manageable dataset size while examining the rhetorical strategies and language of incitement and threat messages, which appeared mainly in these sections. The article, nevertheless, acknowledges that the shortening of texts may distort burstiness as it relates to the distribution of words/concepts over a text or across texts and may influence relative frequencies in the analysis. The corpus used in this research is a real-life set of forensic texts. This is in keeping with the key role of a forensic linguist to linguistically profile, compare, describe and/or interpret real-world textual data in investigative cases. The texts are implicated in terrorist/criminal contexts and are suitable for examining meaning, comparing patterns of linguistic choices, and linguistic profiling of their authors. The dataset is taken as a specialised corpus of compelling and typical instances of threatening texts with two main functions (intentions): incitement and communicated threats. The aim is to showcase how a set of terrorist texts may be approached to profile the text producers sociolinguistically, contributing to efforts to understand and counter potential real-world cases.
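The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used in the study: the contraction table, the assumption that translator interpretations appear in square brackets, and the function name `clean_text` are all hypothetical, and spell-checking is left out.

```python
import re

# Hypothetical contraction table; a real project would need a fuller list.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
}

# Links starting with http(s):// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
# Assumption: translator interpretations are marked with square brackets.
TRANSLATOR_NOTE_PATTERN = re.compile(r"\[[^\]]*\]")


def clean_text(raw: str) -> str:
    """Remove links and bracketed notes, expand contractions, tidy spacing."""
    text = raw.replace("’", "'")  # unify apostrophes before matching
    text = URL_PATTERN.sub(" ", text)
    text = TRANSLATOR_NOTE_PATTERN.sub(" ", text)
    for short, full in CONTRACTIONS.items():
        # Case-insensitive match; the expansion itself is lower-case.
        pattern = rf"\b{re.escape(short)}\b"
        text = re.sub(pattern, full, text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()


sample = "We don't stop [translator: unclear] see www.example.com now"
print(clean_text(sample))  # We do not stop see now
```

Keeping each step as a separate, named pattern makes it easier to audit what was removed from a forensic text, which matters because, as noted above, shortening texts may itself affect burstiness.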

3.2 Data Analysis Procedure: Corpus Method-Assisted Approach to Conceptual Burstiness

This analysis explored patterns of lexical features and conceptual categories. To help determine further themes and (sub)categories for the semantic domains of groups of lemmas, the AntConc program (version 3.5.8.0) was used to explore the most frequent lexical items in each sub-corpus. Using AntConc, a list of the most frequent lemmas (e.g. ‘kill’ and ‘fight’, occurring 67 and 45 times, respectively, in Shekau’s texts) was generated. To enable a broad overview of the most frequent words in the sub-corpora, the minimum number of occurrences for a lemma to be included was set to ten. Words with the same occurrence counts, such as bomb* and Mujahid* (19 times each) in the al-Qaeda sub-corpus, or support, destroy, traitor and immigrant (18 times each) in Tarrant’s sub-corpus, were grouped together as words of similar frequency. Counts of lexical items such as ‘messenger’ and ‘prophet,’ ‘America’ and ‘United States,’ ‘brother’ and ‘brethren,’ ‘Allah’ and ‘God/lord,’ or ‘Iman’ (Arabic word used intertextually) and ‘belief’ were added to each other to make one dictionary entry. This merging of counts is justified on the basis that the translated texts referred to the same people, entities, or things in different ways, whereas the source texts, which served as the control here, used consistent lexical items. Differences in equivalent lexical items arose from unfaithfulness or inconsistency in translation, as acknowledged above. Reviewing the frequency list generated by AntConc allowed for spotting these variations. The resulting lists indicate the semantic groups of the 32 most frequent lexical items observed in each sub-corpus, producing a list of 8659 repetitions of 264 lemmas that make up 21.65% of the entire corpus, a very significant part of the textual fabric of the examined dataset.
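The merging of translation variants and the minimum-frequency threshold described above can be illustrated with a short sketch. It is not a reproduction of AntConc's behaviour: the function name `merged_frequency_list` and the variant table are hypothetical, the input is assumed to be already lemmatised, and only single-token variants named in the text are covered (multi-word pairs such as ‘America’ and ‘United States’ would need extra handling).

```python
from collections import Counter

# Illustrative variant table: each key is folded into the entry on the right.
VARIANT_TO_ENTRY = {
    "prophet": "messenger",
    "brethren": "brother",
    "god": "allah",
    "lord": "allah",
    "belief": "iman",
}


def merged_frequency_list(lemmas, variants=VARIANT_TO_ENTRY, min_count=10):
    """Count lemmas, folding variants into one entry, and keep frequent ones."""
    counts = Counter()
    for lemma in lemmas:
        key = lemma.lower()
        counts[variants.get(key, key)] += 1
    # most_common() orders entries from most to least frequent.
    return [(entry, n) for entry, n in counts.most_common() if n >= min_count]


# Toy input, not data from the study.
toy = ["brother"] * 6 + ["brethren"] * 5 + ["kill"] * 12 + ["road"] * 3
print(merged_frequency_list(toy))  # [('kill', 12), ('brother', 11)]
```

In the toy input, neither ‘brother’ nor ‘brethren’ reaches the threshold of ten on its own; only the merged entry does, which is the effect the merging step is meant to capture.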

The focus on the most frequently used lemmas provided a basis for a conceptually and semantically rich analysis, wherein repetitions of certain terms related to particular concepts could highlight specific themes and conceptual categories, as well as their 'burstiness.' This characterisation helps to delineate a terrorist ideology, affiliation, and agenda, providing clues to a key aspect of the symbolic capital used by terrorists to legitimise violence and the repertoire of concepts to which they adhere (e.g. [42]). The newly generated lists of lexical items from each sub-corpus thus became the focal point of the analysis, aiming to establish similarities and differences across authors based on underlying conceptual patterns and their burstiness. The findings offer insights into lexical choices specific to terrorist threatening texts, revealing lexis that encodes concepts from terrorist ideologies. Repeated lexical items with close semantic proximity are grouped under the same semantic fields, with this burstiness serving as both a marker of the discursive purpose of these terms in terrorist texts and an indication of ideology, affiliation, agenda, and violent stance.

The generated lemmas were examined in relation to two main sets of features. Firstly, the lemmas were analysed to determine the extent to which sub-corpora employed similar and/or distinct (ideology-specific) words, (sub)themes, and (sub)concepts, presenting each author’s semantic preferences as examined in concordance lines. In response to this aspect, the normalised frequency of each sub-corpus's list of top frequent lexical items was provided, and frequencies across sub-corpora were compared. The most frequent lemmas found across sub-corpora were then compared. The primary participants were explored by identifying the most frequent names and adjectives (e.g. Muslim*/Islam*, America*) and comparing them across sub-corpora. Next, a top-down approach to semantic categorisation was applied, while (bottom-up) emerging frequent lemmas and concepts were accommodated by using a cover category based on semantic-field proximity. This allowed for describing emerging agendas and concerns like ‘geo- and socio-politics,’ ‘socioeconomics,’ ‘ethnonationalism,’ and ‘environment’, as well as describing sub-categories of the ‘violence and military’ category, namely: ‘physical destruction cause,’ ‘physical or psychological action,’ and ‘words of the military.’

Lexis encoding the recurring themes and ideological concepts: To identify recurring themes/concepts, words evoking specific intertextual worlds (e.g. politics, religion) or semantic subjects/domains they revolve around (e.g. the military) [52, 53] or encoding related semantic fields [42] were grouped together, as explored in their concordance lines. These groups were described manually, providing a semantically rich analysis [53]. The identified themes were then discussed in relation to how they reflect the characteristics of each author.

Given that three authors in the dataset align themselves with a shared religious ideology (Salafi Jihadism), it is crucial to categorise 'religion' as a theme more precisely. To differentiate these terrorists' texts based on subtle religious concepts, religion-themed categories were subdivided into different levels of abstraction, following Brookes and McEnery's [42, p. 362] classification of religion-semantic fields:

  • Adherence: a key element of adherence to faith.

  • Authority: which might be cited to justify argument.

  • Conflict: spiritual or physical struggle, as in ‘fitnah’ (i.e. trial, discord, and/or chaos).

  • Negative: an act perceived as negative in the religion.

  • Positive: an act evaluated as positive within the religion.

  • Spiritual: reference to spiritual or supernatural entity or place.

  • State: reference to a geopolitical entity.

  • Them: reference to non-Muslims.

  • Us: reference to Muslims.

This category list was left open to accommodate any emerging religion categories, aiming to better represent the analysed dataset. The repetition of particular subordinate or superordinate concepts was considered characteristic of the burstiness of these concepts. Burstiness is understood as a function of the discursive purpose of the words in the texts, such as introducing and persuading the audience of a concept, like jihad [42, p. 359].

Lexis encoding aggression, death, and military concepts: These are categorised under the super-ordinate label 'violence and military.' The proposed sub-categories for the lemmas are informed by the Roget 21st Century Thesaurus's index of concepts:

  • Expressions of physical destruction cause (e.g. destroy, kill, bombardment).

  • Expressions of physical (or psychological) action (e.g. invade, fight, incite).

  • Expressions of the military (action, object, organisation, field) (e.g. brigade, defence, battle).

Lexis encoding the area and extent of a violent agenda: These are categorised under the superordinate label 'power.' These lemmas provide insights into the authors’ semantic orientation and evidence of a control agenda. 'Power'-related words are further sub-categorised into four subcategories: geopolitical power, socioeconomic power, ethnonationalist power, and environmental control (as emerged from the dataset itself).

Following [48], example lemmas of semantic proximity that give rise to the typical bursty concepts are written in uppercase between diamond brackets (< … >), while the semantic (sub)category is in bold uppercase, such as: VIOLENCE < KILL, ATTACK, DIE, INVADE >. Accordingly, when referring to a lemma (e.g. ‘kill’) and its various word forms (e.g. kill, killer, killing, killed, etc.) as a whole, the lemma is written in uppercase (KILL).
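As a rough illustration of how frequent lemmas might be grouped under semantic (sub)categories and rendered in the diamond-bracket notation, consider the sketch below. The lookup table is built only from the example lemmas listed for the ‘violence and military’ sub-categories above; the counts, the function names, and the shortened label ‘MILITARY’ are hypothetical, and the categorisation in the study itself was done manually.

```python
from collections import defaultdict

# Illustrative lemma-to-sub-category table taken from the examples in the text.
LEMMA_TO_CATEGORY = {
    "destroy": "PHYSICAL DESTRUCTION CAUSE",
    "kill": "PHYSICAL DESTRUCTION CAUSE",
    "bombardment": "PHYSICAL DESTRUCTION CAUSE",
    "invade": "PHYSICAL OR PSYCHOLOGICAL ACTION",
    "fight": "PHYSICAL OR PSYCHOLOGICAL ACTION",
    "incite": "PHYSICAL OR PSYCHOLOGICAL ACTION",
    "brigade": "MILITARY",
    "defence": "MILITARY",
    "battle": "MILITARY",
}


def group_by_category(lemma_counts, table=LEMMA_TO_CATEGORY):
    """Map {lemma: count} to {category: {LEMMA: count}}, skipping unknowns."""
    grouped = defaultdict(dict)
    for lemma, count in lemma_counts.items():
        category = table.get(lemma.lower())
        if category is not None:
            grouped[category][lemma.upper()] = count
    return dict(grouped)


def format_category(category, lemmas):
    """Render one category as CATEGORY < LEMMA, LEMMA >, most frequent first."""
    ordered = sorted(lemmas, key=lambda name: (-lemmas[name], name))
    return f"{category} < {', '.join(ordered)} >"


# Toy counts, not figures from the study.
toy_counts = {"kill": 5, "destroy": 2, "fight": 3, "tree": 9}
for category, lemmas in group_by_category(toy_counts).items():
    print(format_category(category, lemmas))
```

Lemmas without a table entry (here ‘tree’) are simply left out, which mirrors the idea of an open category list: new entries or categories can be added to the table as they emerge from the data.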

The increased use of speech verbs (and their nominalisation) is also identified as a language feature that embodies intertextual practices of attribution and functions as a 'signpost' [54] for legitimising ideological perceptions and building assumptions of consensus and identity. Such utterances often address an ingroup's 'ideal' audiences and signal the power position of the speaker [55].

To reliably argue for the identified characteristic lexical and conceptual features, the emerging frequent lemmas and concepts are compared with the findings of larger corpora of violent extremist texts (e.g. [42, 56]), serving as “ground truth” data [57, p. 376] that lend reliable investigative value to the identified features in the present study.

To maximise the intelligence yield from the emerging patterns and their usefulness for the practice of threat assessment, the study integrates insights into the threat assessment framework TRAP-18, offering a 'post-diction' lens on traditional risk assessment methods. Since the analysed texts are relevant to past events, the study of patterns within them can yield insights into what could have been gleaned from these patterns had they been examined at the time of their communication. The aim of this 'post-diction' is to inform future predictions. The emerging lexical and conceptual patterns were mapped onto the categories of the Terrorist Radicalization Assessment Protocol (TRAP-18) [3] to obtain insights into how the terrorist text patterns align with TRAP-18 categories (the Author of this article has undergone the 6-credit-hour TRAP-18 online training facilitated by Meloy, the TRAP-18 architect, to attain proficiency in utilising the instrument). Table 2 summarises the categories of focus in this study, given their identification in the analysis.

Table 2 Focus TRAP-18 categories (adapted version from Meloy’s coding sheet, [3])

4 Results and Discussion

The analysis shows that the patterns of frequent lemmas offer useful insights into the characteristics of the text authors’ lexical preferences, encoded concerns and ideological motivations, perceptions, agenda, schema, master identity or ideology (religious or ethno-nationalist), and proximal warning behaviour and distal characteristics, manifested in bursts of conceptual content. Two discernible patterns of burstiness have emerged from our analysis: (i) the repetition of similar superordinate semantic categories expressed through different lexical items/lemmas within a sub-corpus and in close proximity (i.e. in a sentence or paragraph), and (ii) the persistent utilisation of a particular concept expressed through one term repeated in close linguistic context (e.g. in a sentence or a paragraph) following its initial mention in discourse. Regarding the latter pattern, Example 1 below shows the lemma (term) KUFR occurring in bursts, that is, with its occurrences in close proximity to one another, in al-Baghdadi’s text that incites violence against the al-Assad regime and its allied factions on the basis of their religious ideology and category (being ‘kafir’). Examples 2–3 illustrate the former pattern, in which different lemmas, namely KILL (66), SLIT + throats, CUT + throats, and HARVEST (14 in total), share the same concept/semantic field (i.e. physical violence) and appear in the same paragraph, where cut, slit, and harvest appear after the initial threat by Shekau to kill members of the State Security Service (SSS) in Nigeria.

(1) The kufr of this deviant sect did not stop at their committing shirk with Allah in constitutions and legislations, contending with Allah in His rule, and consenting to the kufr of the nations of kufr. Its kufr continued until it became a sect having no religion, like the zanadiqa and Batiniya.

(2) I will be happy to kill those against us every time I encounter them. This is now the main goal of my mission, the mission of Shekau, who is talking to you. Now you will know exactly who I am. Now you will know my madness. You can imagine it, but you will know more about it because, I swear, I am going to slit your throats. I will not be content until I have cut your throats.

(3) Harvest Jonathan’s neck; harvest Kashim’s neck; Allah said cut out Burabura’s neck.
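Pattern (ii), one term recurring in close proximity, can be made concrete with a small sketch applied to Example 1. The operationalisation is an assumption made for illustration only: the article defines proximity in terms of sentences or paragraphs, whereas the hypothetical `find_bursts` below treats consecutive occurrences as one burst when they lie no more than `max_gap` tokens apart.

```python
import re

EXAMPLE_1 = (
    "The kufr of this deviant sect did not stop at their committing shirk "
    "with Allah in constitutions and legislations, contending with Allah in "
    "His rule, and consenting to the kufr of the nations of kufr. Its kufr "
    "continued until it became a sect having no religion, like the zanadiqa "
    "and Batiniya."
)


def token_positions(text, term):
    """Return the token indices at which `term` occurs (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [i for i, token in enumerate(tokens) if token == term.lower()]


def find_bursts(positions, max_gap, min_hits=2):
    """Group sorted positions into runs whose neighbours are <= max_gap apart."""
    bursts, current = [], []
    for position in positions:
        if current and position - current[-1] > max_gap:
            if len(current) >= min_hits:
                bursts.append(current)
            current = []
        current.append(position)
    if len(current) >= min_hits:
        bursts.append(current)
    return bursts


positions = token_positions(EXAMPLE_1, "kufr")
print(positions)                          # [1, 29, 34, 36]
print(find_bursts(positions, max_gap=10))  # [[29, 34, 36]]
```

With a gap of ten tokens, the last three occurrences form a single burst, while the first mention stands alone. Pattern (i) could be approximated in the same way by collecting the positions of every lemma that belongs to one semantic category rather than the positions of a single term.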

This section reports the analysis and categorisation of the most frequently occurring lemmas and their semantic-field categories within the sub-corpora, and discusses how these lemmas and emerging semantic patterns as well as their burstiness serve as indicators of terrorists’ lexical preferences, encoding ideological motivations, perceptions, agenda, a propensity for violence, and the reinforcement of master identity or ideology, according to Shuy’s [58] argument on the usefulness of lexical and semantic analysis in terrorism cases.

4.1 Unveiling Ideology-Specific Lexical Patterns

Before delving into the thematic categorisation of lexical clues, here is an overview of prominent words alongside their normalised frequencies within each sub-corpus. The most frequent content words together account for around 21.65% of the whole dataset. Normalised per 10,000 words per sub-corpus, this share breaks down as follows: 16.29% (OBL), 27.35% (Shekau), 22.83% (al-Baghdadi), and 20.12% (Tarrant). Despite variations in the repetition rates of their top lemmas, all authors exhibit a recurring trend—a pattern suggestive of the salience of co-occurring ideological concepts within their cognitive frameworks.
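The relationship between the per-author shares and the overall figure can be checked with a few lines of arithmetic. The sketch assumes, as stated in the data section, that each sub-corpus contains exactly 10,000 words; the variable names are illustrative.

```python
WORDS_PER_SUBCORPUS = 10_000  # assumption taken from the data description

# Share of each sub-corpus accounted for by its most frequent lemmas (%).
SHARE_PERCENT = {
    "OBL": 16.29,
    "Shekau": 27.35,
    "al-Baghdadi": 22.83,
    "Tarrant": 20.12,
}

# Convert each percentage back into a number of occurrences.
hits = {
    author: round(share / 100 * WORDS_PER_SUBCORPUS)
    for author, share in SHARE_PERCENT.items()
}
total_hits = sum(hits.values())
corpus_size = WORDS_PER_SUBCORPUS * len(SHARE_PERCENT)
overall_percent = 100 * total_hits / corpus_size

print(hits)             # occurrences per sub-corpus
print(total_hits)       # 8659
print(overall_percent)  # about 21.65
```

Under that assumption the four shares correspond to 1,629, 2,735, 2,283 and 2,012 occurrences, which sum to the 8,659 repetitions reported in the procedure section and give roughly 21.65% of the 40,000-word corpus.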

The prominence of specific words varies according to the dataset authors’ ideological orientations. Among religious ideology-oriented terrorists, in addition to increased use of personal pronouns, the term ALLAH emerges as the most frequently utilised lemma, accounting for 1.56%, 2.17%, and 2.03% in OBL, Shekau, and al-Baghdadi sub-corpora, respectively (refer to Fig. 1 below for an example dispersion plot of the lemma ALLAH as found in OBL’s texts). Conversely, for the far-right-oriented terrorist Tarrant, the lemma PEOPLE assumes primacy, comprising 1.06% of the lexical occurrences. The recurrent use of PEOPLE and its associated terms within the Tarrant sub-corpus served to foster textual cohesion by focalising on individuals from diverse ethno-national backgrounds. This lexical cohesion, as put in Canning’s [47, 59] terms, serves to structure Tarrant’s perceptions of social relationships. That is, within these texts, the dichotomy between “our” people and “their” people, alongside notions of European ethnicity, assumes centrality in exacerbating anxieties and justifying acts of violence. For example, collocates include phrases such as “replace our people,” “replace the white people,” “their people,” “voting against the wishes of our people,” and “taking our people’s lands” (emphasis added, and N-1 n-grams underlined).

Fig. 1 Example dispersion plot of the lemma ALLAH as found in OBL’s texts
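For readers unfamiliar with dispersion plots, the idea can be sketched in plain text: a text is divided into equal-width cells and a mark is placed in every cell that contains at least one hit. This is only an illustration of the concept, not the tool or settings used to produce Fig. 1; the function name `dispersion_strip` and the toy token list are hypothetical.

```python
def dispersion_strip(tokens, target, width=50):
    """Return a one-line strip: '|' where `target` occurs, '.' elsewhere."""
    if not tokens:
        return "." * width
    cells = ["."] * width
    for index, token in enumerate(tokens):
        if token.lower() == target.lower():
            # Scale the token index to a cell in the strip.
            cells[index * width // len(tokens)] = "|"
    return "".join(cells)


# Toy token list, not taken from the dataset.
toy_tokens = ["a", "allah", "b", "c", "allah", "d", "e", "f", "g", "allah"]
print(dispersion_strip(toy_tokens, "allah", width=10))  # .|..|....|
```

Clusters of adjacent marks in such a strip correspond to bursts, while evenly spaced marks indicate a term that is spread across the whole text.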

Acknowledging terrorist text producers and audiences as participants in the ‘theatre’ of terrorism [60], the repeated utilisation of the iconic noun “Allah” by authors espousing religious ideologies serves to bolster their performance. Firstly, the recurrence of this lemma reflects the dataset authors’ master identity or affiliation. Cluster analysis reveals that ALLAH is predominantly contextualised within supplication (e.g. “O Allah, make the best among us our leaders”), direct quotes addressing ingroup audiences, and in Islamic formulaic expressions such as “Allah’s willing,” “by the name of Allah,” and “praise is due to Allah.” Concordance lines further demonstrate the utilisation of ALLAH in “interdiscursivity” [61, p. 9], contributing to the establishment of the texts’ authors’ identity. That is, the concordance lines show that ALLAH is used in various ways that connect different discourses or types of communication. This frequent and varied use helps to build and reinforce the authors’ identity, showing how they integrate religious references into their writing to shape their persona and connect with their audience. Secondly, the lemma fulfils the rhetorical function of portraying the authors as adhering to the law of Allah, thereby grounding their arguments in an assumed “working consensus” point [4, 62] of obedience to ‘Allah’, which is emphasised by the number of hits in the dispersion plot analysis (Fig. 1). The repeated use of the lemma exemplifies the ‘burstiness’ of a religious ‘icon,’ which rhetorically amplifies invoked belongingness and author-ingroup audience alignment (e.g. [26, 29]).

The extremist nature of the polarised (i.e. in terms of the ‘us’ versus ‘them’ dichotomy) dataset underscores the necessity to examine the repeatedly referenced primary participants (i.e. social actors). Findings indicate that OBL explicitly mentions MUSLIM (0.80) and AMERICA (0.60), as the opposite pole (Example 4), and constructs them as primary participants. Shekau presents MUSLIM (0.51), NIGERIA (0.32), and CHRISTIAN (0.17) as the primary rival social actors. Al-Baghdadi depicts “ahl al-Sunnah” (0.26), who need ISIS (Example 5), and “ISIS” (0.30) as the ‘legitimate’ Islamic State of ahl al-Sunnah, opposing the rest of the world. Through this construct, al-Baghdadi portrays the world as divided into a camp of FAITH versus a camp of KUFR (i.e. infidelity), with the latter including the CRUSADERS, JEWS, ATHEISTS, POLYTHEISTS, MURTADEEN, MAJUS (Iran), and SHIITE. Tarrant portrays white EUROPEANS (1.05), POLITICIANS (0.21), and IMMIGRANTS (0.18) as the primary participants engaged in a struggle over controlling European countries and their resources, as realised in the repeated use and burstiness of “control” (Example 6). The explicit naming of, and pronominal reference to, rival primary participants imbue the texts with heightened polarisation when immigrants are positioned (with burstiness) to be “removed” no matter how, which reveals clues to the author’s ideological schema (Example 7).

(4) And it is no secret to you that the American thinkers and wise men warned Bush before the war that: 'everything you want for securing America by removing the weapons of mass destruction - assuming they exist - is available to you, and nations around the world are with you in the inspections. And America’s interest does not require that it be plunged into an unjustified war, and you know not its end.

(5) Therefore, know, O ahl al-sunna in Syria, that if you wish to live in honour and dignity, you have no choice but to return to your religion and to waging jihad against your enemy... [What matters] next is to rise once more by opening new fronts and rejecting the treaties of humiliation and disgrace, based on which the factions of apostasy surrendered the territories of ahl al-sunna.

(6) Democracy is mob rule and the mob itself is ruled by our own enemies. The global and corporate ran press controls them, the education system (long since fallen to the long march through the institutions committed by the marxists) controls them, the state (long since heavily lost to its corporate backers) controls them and the anti-white media machine controls them.

(7) The invaders must be removed from European soil, regardless from where they came or when they came. Roma, African, Indian, Turkish, Semitic or other. If they are not of our people, but live in our lands, they must be removed. Where they are removed to is not our concern, or responsibility. ... How they are removed is irrelevant, peacefully, forcefully, happily, violently or diplomatically. They must be removed.

Among the most frequent lemmas identified are words that convey the violent character of the texts and signal justification of violence. The predominant word encoding violence in each sub-corpus is FIGHT (0.41) in OBL’s and (0.23) in al-Baghdadi’s, KILL (0.66) in Shekau’s, and DIE/DEATH (0.32) in Tarrant’s. Additionally, the speech verb SAY and its nominalisation are salient in frequency of use, imbuing the texts of OBL (0.50), Shekau (0.90), and al-Baghdadi (0.34) with an attribution pattern endorsing and justifying violence (as elaborated in Sect. 4.2.4).

Thus far, the top words outlined here introduce the four main categories featured in the lists of the most frequent lemmas in the sub-corpora: ‘religion,’ ‘power,’ ‘violence and military,’ and the attribution pattern of ‘saying’ which signposts and denotes power and violence in various intertextual “guises” [56, p. 271]. The prominence of these categories in the datasets underscores the extremist nature of the texts compared to findings from comprehensive corpus linguistic studies on extremist texts’ lexical categories [56]. Delving into these conceptual categories of lexical clues in subsequent sections bolsters previous research findings and goes beyond them by illustrating how these clues offer evidence of terrorists’ semantic choices, ideological perceptions, and agendas that serve threat assessment and investigative purposes.

4.2 Conceptual (Sub) Categories and Semantic Fields

This section delineates the superordinate conceptual categories of frequently used lemmas to investigate the characteristic ‘burstiness’ of these categories within the datasets. Additionally, it describes the identified subordinate categories of these words. In Brookes and McEnery’s [42, p. 359] terms, it also illustrates the feature of burstiness of the most frequent words in terms of their collocational strength and the repetition of their super- and subordinate conceptual categories. This burstiness feature serves as an indicator of ideological motivations encompassing violence, affiliation, or master identity, alongside a violent attitude and agendas such as political control. Moreover, akin to [42, p. 359], it functions within the discursive purpose of words within the texts, such as introducing and persuading the audience of concepts like jihad. In essence, terrorist communication propels the burstiness of particular word categories, driving the usage of diverse forms that converge within the same “semantic fields” [42, p. 363]. Consequently, the burstiness of the four categories of the most frequent lemmas aids in delineating prominent semantic fields and elucidating the global functions of these lemmas within the datasets. These functions include instigating fear of sources of threat against an ingroup ‘symbolic power’ system, such as religion or political and ethnonationalist systems, and legitimising violent acts or inciting hostility against religiously or politically different ‘others.’ The findings are detailed and compared in the subsequent subsections.

4.2.1 The ‘Religion’ Category: Prevalent in the OBL, Shekau, and Al-Baghdadi Texts

Lemmas with explicit references to the ‘religion’ category exhibit variations across the sub-corpora. The repetition of these lemmas provides clues to the ‘fixation’ (in TRAP-18 terms) on certain ideologies or causes, which can be observed through lexical cohesion, such as the repetition of lemmas like ALLAH and JIHAD (see Table 3 for more examples) in religious ideology-oriented texts. This fixation is accompanied by a deterioration in social and occupational life, as indicated by the focus on divisive concepts like a ‘faith camp’ versus ‘infidelity camp’ – i.e. ‘our people’ versus ‘their people.’ These repetitions also provide clues to the ‘identification’ category, that is, the desire to align with violent personas or jihadist groups against outsiders, as reflected in lexical choices such as the frequent references to “jihad” and “infidels” and in identification with ideologies that serve to reinforce the in-group (including incited) individuals and their ideological commitment to advancing their cause or belief system.

Table 3 Religion-subordinate categories and semantic fields realised in sub-corpora

The identified religious concepts demonstrate similarities in terms of their subcategories or semantic fields, as summarised in Table 3. OBL’s most frequent lemmas related to religion constitute 4.52% of the most frequent words, thus indicating ‘religion’ as a statistically significant and semantically prominent category. Similarly, Shekau’s words revolving around religion also highlight this category as a significant semantic field, comprising 5.45% of the most frequent words. Notably, the highest frequency of reference to ‘religion’ within the most frequent lemmas appears in al-Baghdadi’s sub-corpus, accounting for 7.23% and encompassing the categorisation of people of different faiths within the Muslim world. These results portray the three authors as frequently employing religious concepts to signify their master identity (i.e. a religious ideology) and as an ideological motivation for ‘othering’ and opposition.

Sub-categorising religious concepts into different levels of abstraction (i.e. semantic fields) unveils a range of commonalities and differences in the authors’ semantic orientation or choices, such as (note: the semantic sub/category is in bold uppercase, and repeated lemmas are in bold-less uppercase):

figure a

This commonality in repeated lexical items and semantic categories offers a lens on lexically realised cohesion and moves us closer to accounting for the way in which repeated terms and topics contribute towards what Morley [63] refers to as the ‘rhetorical structure’ of the examined texts and the ‘rhetorical movement of the discourse’. To elaborate, the most frequent religious lemmas in the OBL sub-corpus revolve around eight semantic fields: adherence, canon, authority icons, conflict, positive (acts), state, Them, and Us. This analysis demonstrates that OBL’s texts exhibit burstiness of words from various categories, which signify oppositional texts grounded in a religious ideology defining the delineation between ‘we’ and ‘they’: ‘we’ are Muslims (Us), adherent to Islam (canon), who acknowledge and obey our authorities (e.g. Allah, and Mohammad), engaging in Jihad (conflict) against primary identity groups, namely the crusaders (16) and the Jewish (11) (Them), until the establishment of our state. Regarding Shekau, the repeated lexical items and their counts include, for example: Allah/God (217 times), Islam(ic)/Muslim (51), infidel (44), religion (41), brother/brethren (34), follow(ing/ers) (34), Prophet (18), Christianity/Christians (17), worship (16), Koran (14), repent (10), and faith(ful) (10). Grouping Shekau’s repeated lemmas shows that they revolve around seven sub-categories: adherence, canon, authority, conflict, Them, Us, and bonds (between Muslims, i.e. brethren). Akin to OBL’s texts, these lemmas in Shekau’s texts also indicate oppositional texts driving the burstiness of subcategories rhetorically defining ‘Us’ via ‘our’ religion and authorities, and presenting ‘Us’ in conflict with Christians (Example 8):

(8) We hardly touch anybody except security personnel and Christians and those who have betrayed us. Everyone knows what Christians did to Muslims, not once or twice.

This semantic grouping of repeated religious terms can be traced in Table 3 above. The table shows the ‘Them’ group is repeatedly described as “INFIDEL”. This burstiness in the use of “INFIDEL” necessitates a close analysis of the concordance lines of who are constructed as infidel.

The concordance line analysis of the bursty lemma INFIDEL, with close attention to who is described as infidel, gives us a clearer picture of who is constructed as members of out-groups and thus targets of ‘our’ violence (e.g. “We killed the infidels in the Giwa Barracks”). These include, inter alia, the Tijani sect, the Izala sect, Shiites, and the Nigerian soldiers (Fig. 2). This complementary concordance line analysis, thus, provides useful intelligence for threat assessment and management.

Fig. 2 Exemplary concordance lines from Shekau’s repeated use of the lemma INFIDEL
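Concordance (keyword-in-context) lines of the kind shown in Fig. 2 can be approximated with a short routine. This is a simplified sketch rather than AntConc's implementation: the function name `kwic` is hypothetical, the match is a plain prefix match (roughly equivalent to searching for infidel*), and the sentence below is a toy example rather than corpus data.

```python
def kwic(tokens, target_prefix, context=4):
    """Return (left context, hit, right context) for tokens matching a prefix."""
    lines = []
    prefix = target_prefix.lower()
    for index, token in enumerate(tokens):
        if token.lower().startswith(prefix):
            left = " ".join(tokens[max(0, index - context):index])
            right = " ".join(tokens[index + 1:index + 1 + context])
            lines.append((left, token, right))
    return lines


# Toy sentence, not taken from the dataset.
toy = "we fought the infidels near the town and the infidel army fled".split()
for left, hit, right in kwic(toy, "infidel", context=2):
    print(f"{left:>12} [{hit}] {right}")
```

Reading down the right-hand context of each line is what allows the analyst to see who is being labelled, which is the step that turns a frequency count into usable intelligence about targeted out-groups.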

This binary opposition structure of ‘Us’ versus ‘Them’ is explicitly lexically realised in al-Baghdadi’s use of “against” (39 times) as in the following collocations/N-grams:

  • Against Allah, His Messenger…;

  • Against Islam and its people;

  • Against Islam and the Sunna;

  • Against the Islamic State;

  • Against the Khilafa State in Iraq and Syria;

  • Against the Muslims and the mujahideen in Ninawa.

The same function and burstiness of the same word “against” also repeatedly emerged in Tarrant’s texts (interchangeably with “anti”) (Example 9) and Shekau’s (Example 10):

(9) The media of the world will be used against you, the education system of the rulers will be used against you, the financial power of the worlds corporations will be used against you, the military and legislative might of the UN, the EU and NATO itself will be used against you and even your own, previously corrupted, religious leaders will be used against you.

(10) [I]t is a Jihad war against Christians and Christianity. It is a war against western education, democracy and constitution.

In contrast to OBL and Shekau, al-Baghdadi’s most frequent religious lemmas present a wider set of semantic fields (see Table 3 above). Al-Baghdadi’s texts exhibit the burstiness of words across ten semantic sub-categories, distinguishing his texts with more explicit concepts of the ‘negative’ and ‘Them’ categories. The burstiness of ‘negative’ lemmas such as KUFR (infidelity) (26) (Example 2), RAFIDA (rejectionist) (17), SHIRK (polytheism) (18), NUSAYRIYYA (13), APOSTASY (23), and ‘Them’ lemmas (e.g. UNBELIEVERS, KAFIR/ATHEIST, RAFIDI, MUSHRIK) contributes to the “negative othering” role of the texts [56, p. 264], as in the use of two belief-based negative terms (“kufr” and “shirk”) in close proximity (Example 11). These terms categorise the world based on ISIS’ version of beliefs, marking the ideologically polarising and confrontational nature of al-Baghdadi’s dataset.

(11) The kufr of this deviant sect did not stop at their committing shirk with Allah… (al-Baghdadi)

Besides the othering and polarising functions of the religious lemmas, the authors’ reliance on the ‘religion’ concept indexes the authors’ trans-national identity and presents them to ingroup audiences as insiders and, in Malešević’s [64, p. 271] terms, “legitimate conveyor[s]” of these concepts and ideas. Despite the fact that their use of religious concepts is “almost never popularly absorbed” as presented in their terrorist datasets [64, p. 271], the ‘religion’ category and sub-categories have so far shown what kind of clues, as well as rhetorical functions, the strategic repetition of the most frequent lemmas can offer. While the authors’ use of these concepts can be subject to theological renouncement and widely resisted “(mis)reading” and misinterpretation [56, p. 272; 42], this does not eliminate the usefulness of the terrorists’ increased use of these words in characterising terrorist semantic orientation and ideological perceptions.

Making use of lemmas of particular conceptual categories thus has implications for identity construction [65, p. 176]. The use of lemmas such as INFIDEL and JIHAD in the OBL sub-corpus, for example, not only construes the identity of ‘infidel’ others (e.g. “…a camp of infidelity”) but also “marks off socially and ideologically distinct areas of experience” [66, p. 84], of what al-Qaeda’s members are doing, as an assumed mark of ‘true’ faith (e.g. “…blessed jihad”). Such labelling reflects an ideological decision in terrorist texts, and it also indexes the terrorists’ worldview. Religious lexicons can provide a clue to the premise of confrontation with ‘others’, such as between Christians and Muslims, who are discursively constructed as main social actors, as in: “Christians cheated and killed us [Muslims] to the extent of eating our flesh like cannibals” (Shekau).

In summary, the ‘religion’ category provides evidence of the prominent semantic fields that characterise the semantic orientation of the OBL, Shekau and al-Baghdadi texts. The ‘religion’ (sub)category also offers clues of the authors’ deliberate presentation of competing ideologies and ideological perceptions, and how the terrorists strategically depend on “cultural [religious] framing” of the world in order to “ideological[ly] penetrat[e]” ingroup audiences’ micro-world and mobilise a degree of public support [65, pp. 184–187].

4.2.2 The ‘Power’ Category: A Feature in the Four Authors’ Texts

Analysis of the most frequent words also provides insights into the extent of impact or areas of interest that authors aim to influence. This section categorises words encoding these areas under the ‘power’ category, offering clues to the authors’ semantic orientation and agendas of control and dominance. The ‘power’ words emerged under four domains: geopolitical, socioeconomic, ethnonationalist, and environmental control. While the geopolitical and socioeconomic domains persist across the four sub-corpora, the ethnonationalist and environmental domains exist only in Tarrant’s.

The most frequent lexicons that invoke geopolitical and socioeconomic worlds and evoke related meanings provide clues to the transnational interests of the four authors and their extremist organisations, as well as to the proposed alternatives. These interests offer clues to the global ideologies and agendas of the authors/organisations (which accords with [56]), while the alternatives are contextualised as realisable through violent means. For instance, OBL employs words that denote or delineate specific geographical locations (such as “Iraq,” “America,” “Afghanistan,” “world,” “countries,” and “land”), political spheres (including “government,” “leader*,” “Bush,” “White” as in the White House, “Palestine,” and “Ummah”), within discussions pertaining to the legitimacy of regimes and political systems, as well as concepts of freedom and tyranny. The burstiness of these words, such as “America” (Example 4 mentioned earlier) and “government” (Example 12) emphasises the importance of these topics and their legitimacy in OBL’s discourse.

(12) …a collaborator government like all governments in the region, including the government of Karzai and Mahmoud Abbas.

Additionally, economic themes are addressed, as evidenced in passages referencing Iraq, where oil is portrayed as coveted by America and referred to as “cold booty”, and where the destruction of Iraq’s economy and power is presented as serving the Israeli interest. While Iraq signifies a geopolitical region targeted by America in a war driven by political motives, OBL perceives the conflict as motivated by economic interests and the desire for control over Iraq’s oil resources. Such references provide clues to the author’s ‘moral outrage and personal grievances’ that characterise his communication. In Example 13, lemmas underlined encompass socioeconomics, while those in bold denote geopolitics, thereby intersecting both agendas and showcasing OBL’s endorsement of violent alternatives to democratic resolutions. Geopolitical concerns (signalled in bold) are further underscored in Example 14, where a violent course of action (e.g. “fighting…” underlined) is advocated over ‘invalid’ peaceful, democratic solutions (underlined), particularly in response to perceived illegitimate governments or Jewish/Crusader interventions.

(13) Bush thought that Iraq and its oil are a cold booty.

(14) Voices in Iraq, as there were in Palestine, Egypt, Jordan, Yemen, and other countries before, have called for a peaceful, democratic solution in dealing with the apostate governments or invaders of the Jewish and Crusaders instead of fighting for the cause of Allah.

Boko Haram’s leader, Shekau, often references geopolitics and political figures, as in “Jonathan” (the Nigerian President; 18 times), “Nigeria*” (32), “world” (24), “message” (24), “Maiduguri” (a city in Nigeria) (20), “against” (17), “democracy” (16), “Western” (15), “Kano” (a Nigerian state) (15), “constitution” (14), “Obama” (14), “Leader(s)” (12), and “Sultan” (11). Shekau’s repeated use of words like “Nigeria,” “world,” “Maiduguri,” and “democracy” reflects the group’s regional and global ideology. Notably, he portrays “democracy” (in bold, Example 15) negatively and as the wrong alternative to Sharia law, using terms like “paganism” to discredit it, as in: “Everyone knows that democracy and the constitution is paganism.” Shekau also intertwines socioeconomic themes with geopolitical ones, mentioning “people” (68), “work” (25), “education” (11), “slaves” (11), and “sell” (10), the last in reference to the abducted girls from Chibok (a Nigerian town). This suggests his intent to shape societal and economic dynamics. Shekau’s rhetoric offers violent alternatives to both geopolitical and socioeconomic issues, as seen in Example 15:

(15) I speak in the name of religion, Allah, Islam, the religion of the Holy Koran, and not that of democracy… You, Sultan of Kano, is this the way you practice religion? The religion of democracy, of the constitution, of Western education!

Al-Baghdadi’s lexicons encompass a broad spectrum of geographical regions and socio-political topics spanning the globe. Among his most frequently used geopolitical terms are “lands,” “state,” “Iraq,” “Syria,” “nations,” “leaders,” “rule,” “affairs,” and “Sham.” These lexicons cover regions in the Muslim world (e.g. “Khilafa,” denoting a geopolitical project), the Middle East (e.g. “Sham”), Africa, Asia (e.g. “alSalul,” referring to the Saudi monarchy), Europe, and America. Additionally, his lexicon frequently collocates with terms such as “secularise,” “secularist,” “wealth,” “corruption,” “ministers,” “politics,” “sanctions,” and “tyrannical,” effectively merging geopolitics with socioeconomics. Al-Baghdadi’s lexical choices point towards ISIS's involvement in violence across the globe and underscore its global ideological aspirations.

For Tarrant, the frequent occurrence of specific lemmas, such as REPLACE (35), DIVERSE (23), and IMMIGRATE (18), shapes his semantic orientation and agenda across various domains. These lemmas realise the following thematic areas that reflect Tarrant’s lexicon and concerns with race, geo- and socio-political issues, and economic and environmental matters:

[Figure b: thematic areas realised by Tarrant’s most frequent lemmas]

The repetition of these lemmas underscores the concept of ‘power’ and its associated thematic domains. Particularly, in TRAP-18 terms, Tarrant’s repeated use of words related to race (in bold, Example 16) indicates a ‘fixation’ on racial identity and supremacy. For instance, the term “race” is frequently coupled with words that evoke fear and division, such as “racial replacement” (with the burstiness of “REPLACE”, Example 16), “racial minority,” and “racial diversity”, portraying a narrative of racial conflict and domination where people of other races and individualism are condemned, as in “In this hell the individual is all and the race is worthless” (Tarrant). Similarly, “diversity” is positioned as rejected with burstiness contributing to its importance for the text author (Example 17): “One cannot exist with the other. DIVERSITY IS UNEQUAL, HIERARCHIES ARE CERTAIN.” Moreover, Tarrant emphasises the preservation of ethnonationalist identity by employing terms like “White,” “European,” and “Western,” which contribute to the ethnonationalist semantic field. These words reinforce the notion of racial superiority and justify actions against perceived threats to this identity.

(16) For too long those who have profited most from the importation of cheap labour have gone unpunished. The economic elites who line their pockets with the profit received from our own ethnic replacement. These greed filled bastards expect to replace our people with a race of low intellect, low agency, muddled, muddied masses just so their own wealth and power can increase… They will soon realize there are repercussions to being race traitors.

(17) Meanwhile the “diverse” nations across the world are scenes of endless social, political, religious and ethnic conflict. The United States is one of the most diverse nations on Earth, and they are about an inch away from tearing each other to pieces. Brazil with all its racial diversity is completely fractured as a nation […]. South Africa with all its “diversity” is turning into a bloody backwater as its diversity increases, black on other black, black on white, white on black, black on Indian, doesn’t not matter, its ethnicity vs ethnicity. They all turn on each other in the end.

Furthermore, Tarrant’s discourse demonises immigrants and constructs immigration as a threat (in bold, Example 18), framing immigrants as economic competitors and environmental threats (underlined). The burstiness of terms like “immigrant/immigration” (Example 19) serves to highlight his concerns about the perceived negative impact of immigration on, for example, European workers' rights, social cohesion, namely “ethnic binds”, and environmental sustainability.

(18) Continued immigration into Europe is environmental warfare and ultimately destructive to nature itself…

(19) But it will take take some time, time we do not have due to the crisis of mass immigration. Due to mass immigration we lack the time scale required to enact the civilizational paradigm shift we need to undertake to return to health and prosperity. Mass immigration will disenfranchise us, subvert our nations, destroy our communities, destroy our ethnic binds, destroy our cultures, destroy our peoples.

Overall, Tarrant’s lexical choices reveal a deeply entrenched ethnonationalist worldview, characterised by racial supremacy, anti-immigrant sentiment, and environmental concerns. A final note concerns how the ‘power’ category lexis provides clues to the TRAP-18 categories. In terms of distal characteristics, the thematic focus on geopolitical, socioeconomic, ethnonationalist, and environmental domains suggests a broader ideological framework driving the terrorist-text authors’ actions. Moreover, the targeting of specific groups or populations, such as non-Europeans or immigrants, indicates underlying prejudices or biases that influence the authors’ worldview. Finally, the use of language that emphasises racial identity, cultural preservation, and perceived threats to the in-group across the four authors suggests a desire to maintain power and dominance within certain social hierarchies.

4.2.3 The ‘Violence and Military’ Category: A Feature in the Four Authors’ Texts

Analysis of the most frequent lemmas reveals lexical choices in the semantic domain of violence and conflict, including military terms invoking the world of violence, which provide clues to the authors’ semantic orientation and violent ideology. These lemmas play a role in shaping a common violent, oppositional tone across the datasets (e.g. [67, 68, 69]), which can practically inform content moderators of the nature of communication. The ‘violent’ lemmas are sub-categorised into three subordinate concepts: physical destruction cause; physical or psychological action; and military concept. The three categories appeared in the four authors’ texts. For example, OBL’s dataset includes lemmas such as KILL, FIGHT, and WAR (as in ‘warplane’ and ‘warfare’; Example 20), which respectively encode violent physical actions, causes of destruction, and words of the military. The repeated lemmas in OBL’s sub-corpus appeared in 223 occurrences (2.23% of the corpus). Shekau also showed heightened use of similar words: “kill” (67) and “slaughter” (11), “fight” (45), and “war” (21); and so did al-Baghdadi, as in “kill” (12), “jihad” (46), “fight” (23) and “fear” (21), and “enemy” (37) and “crusade” (27), as well as Tarrant: “die” (32), “destroy” (18) and “kill” (15), “fight” (15) and “radicalise” (19), and “invade” (31), “victory” (25), “attack” (17), “enemy” (23), “support” (18), and “lose” (13).

(20) We gained experience in guerilla warfare and attrition warfare.

Additionally, in reference to the TRAP-18 categories, the frequent use of words like “fight” and “kill” in terrorist communications provides insights into their ‘pathway’ towards violent actions, indicating potential involvement in planning or preparation for attacks. Moreover, these lemmas provide clues to an authorial identity, namely a personal identity, which is characterised by an aggressive and intolerant attitude encoded in violent lexical semantics e.g. [29].

Repetition of the three sub-categories of violence (Table 4) explicitly shapes the violent semantic orientation of the terrorist-text authors. The sub-categories also help to identify the violent ideology of the text authors who propose violent solutions to ‘reality’. Additionally, such examples, as well as references to violent actions and causes, provide clues to the ‘last resort behaviour’ category, that is, clues to a violent action imperative and a sense of desperation or distress. These can be inferred from lexical patterns such as the repeated use of words of VIOLENCE AND MILITARY < JIHAD, DIE, FIGHT, KILL, INVADE… > in texts signalling a perceived need to act urgently alongside ISIS because, as argued by al-Baghdadi, “you have nothing but the Islamic State [and its pathway] to protect your religion” (Example 21). These linguistic cues may indicate a belief that there is no other choice but to engage in violence. A violent solution is prescribed through the repeated use of the lemma JIHAD, mainly by OBL and al-Baghdadi, to invoke a religiously coated violent action concept.

(21) After Allah, you have nothing but the Islamic State to protect your religion, safeguard your authority, and bolster your strength.

Table 4 ‘Violence and military’ sub-categories exemplified as realised in the sub-corpora

That said, while the lemma JIHAD intrinsically refers to a struggle against aggressors declared by the ‘legitimate’ ruler/Caliph of Muslims when there is no peaceful alternative for self-defence [70], terrorists use the lemma to religiously license their acts because they consider the current rulers of Muslim nations to be ‘illegitimate’. Repeated use of JIHAD is the ideologically decided ‘labelling’ of violent actions, which uncovers “competing ideological perceptions” of inter-group conflicts [65, p. 184].

Furthermore, examining the concordance lines of lemmas like KILL reveals collocations with particularly memorable forms of violence in various sub-corpora, thus amplifying the violent nature of the datasets. For instance, “kill” is associated with items such as knives (as seen in Example 21) in the OBL sub-corpus. In the Tarrant dataset, there are many directives to “kill high-profile enemies” through asymmetric, primitive modes of communicating violent messages that provide clues to violent pathways (Example 22) as a tool to “kill” officials constructed in discourse as “anti-white” (Example 23) and immigrants constructed as “invaders” (Example 24). The burstiness of “kill” and “invaders” sheds light on Tarrant’s determination to kill and his negative perception of immigrants. This contextualised use of words underscores the coercive potential inherent in the text authors’ utilisation of “asymmetric violence” [65, p. 189]. For example, Tarrant encourages “ethnic soldiers” to engage in combat, while OBL urges the fight of “soldiers of Allah”, indicating their ideological justifications for violence. Tarrant repeatedly incites violence against immigrants and European officials of non-white ethnicity (e.g. Sadiq Khan, the Mayor of London), as in Example 24.

(21) …killing the Americans with bullets, knives, stones, etc. (OBL).

(22) …TATP packages strapped to drones, an EFP in motorcycle saddlebags, convoy ambush rammings with cement trucks… (Tarrant)

(23) KILL ANGELA MERKEL, KILL ERDOGAN, KILL SADIQ KHAN

(24) This Pakistani Muslim invader now sits as representative for the people of London. Londinium, the very heart of the British isles. What better sign of the white rebirth than the removal of this invader?

In summary, the analysis demonstrates that the four authors frequently employ a set of words belonging to three subcategories within the semantic domain of violence and military. These words offer clues to identifying terrorist communication and paint a picture of asymmetric violence. In terms of the TRAP-18 proximal warning behaviour, the repeated use of violent and military terminology such as “kill,” “fight,” “slaughter,” “war,” “forces,” “bombs,” “invade,” “martyr,” “jihad,” and “cut” indicates a readiness for violent action. Specific instances of threats or calls to action, such as Shekau’s “I swear, I am going to cut your throats,” can demonstrate immediate intent or willingness to engage in violent acts. The use of language that amplifies fear and trauma, such as describing violent actions in detail or emphasising the impact of violence on adversaries, serves as a warning of potential imminent harm. Additionally, the violent lemmas and conceptual subcategories delineate the “fighting words” [71, p. 262] component of the datasets. The findings corroborate previous research [69] observing al-Qaeda’s use of military language and extend this feature to Shekau, al-Baghdadi, and the far-rightist Tarrant. The use of language and linguistic aggression, illuminated by the patterns of these violent lemmas and conceptual categories, is thus understood as ideological, with discourse serving as a platform for articulating the terrorist-text authors' violent ideology. An examination of the incorporated voices in the dataset, as signalled by the category 'say,' identifies sources of authority and the legitimisation of violence.

4.2.4 The ‘Saying’ Verb Category: A Pattern of Attribution Practice and Stance Signposts in Jihadist Texts

The usage of the speech verb “say” is identified as a linguistic feature marking an attribution pattern in the OBL, Shekau, and al-Baghdadi sub-corpora, exhibiting specific patterns and pragmatic roles. The speech verb “say” serves as a signpost to stances and ideological perspectives. It also plays a role in challenging standpoints and asserting power. Speech verbs indicate terrorists’ reliance on the practice of attribution/direct quoting as a means of legitimation, endorsement, and acknowledgment of violence. This practice involves indicating that a quoted text is an incontrovertible fact, distancing from the endorsement of a source’s statement, or adding the flavour of a source’s words to the text e.g. [72]. Additionally, “say” functions as a signpost for gaining insight into the datasets as persuasive and argumentative texts that capitalise on quotes from authoritative sources to construct a ‘reality’ or imply consensus on shared values and identity within society.

While OBL exhibits a rare use of distancing locutions, employing “claim” five times to negatively frame statements by figures like Bush and cast doubt on their credibility, “say” is the most frequently used speech word, appearing 50 times. In the concordance lines, it manifests in both third-person-oriented realisations (e.g. “Allah said” or “says”) and first-person-oriented realisations (e.g. “I say”). The former is utilised with revered figures or anonymous proverbs to legitimise violence by invoking the power of the speaker and reinforcing OBL’s (e.g. inciting) message within Muslim communities to fight the Americans (Example 25). The latter structure, “I say”, positions OBL as a powerful, authoritative voice addressing both ingroup and outgroup audiences, as in Example 26, where the phrase “continuing fighting you” entails ‘past criminal behaviour’ (such as the 9/11 attacks); together with the threat of repeating it, this provides clues to instrumental criminal violence in the author’s past. The ‘criminal violence’ category thus manifests as a history of involvement in extremist activities and in planning for previous attacks.

(25) Praise be to Allah who says: “O Prophet, fight against the disbelievers and the hypocrites and be harsh upon them”.

(26) I say to the American people: Allah willing, we are continuing fighting you.

What characterises this use of ‘saying’ words is that once a threatening stance, a challenge by a speaker (Example 26 above), or a saying by an authority (e.g. Allah says, Example 25 above) is mentioned, that saying serves to colour the entire message with the same authority brush and semantic discourse prosody. Hence, there is no need to repeat the saying or the ‘say’ verb in close proximity, although such repetition sometimes appears, as the dispersion plot of ‘say’ hits in OBL 5 demonstrates. Figure 3 shows the close repetition of ‘say’ in the concluding part of OBL 5:

Fig. 3 Dispersion plot analysis of the saying verb ‘say’ in OBL 5
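
The information that a dispersion plot such as Fig. 3 displays, namely where in a text the hits fall and whether they cluster, can be sketched as follows. This is a rough illustration under stated assumptions: the function name `dispersion` is hypothetical, the token list is a toy placeholder rather than OBL 5, and the per-segment tally is only one simple way of making clustering visible, not the burstiness measure adopted in this study.

```python
def dispersion(tokens, forms, segments=10):
    """Return normalised hit positions (0..1) and hit counts per text segment."""
    forms = {f.lower() for f in forms}
    hits = [i for i, tok in enumerate(tokens) if tok.lower() in forms]
    positions = [i / len(tokens) for i in hits]
    per_segment = [0] * segments
    for i in hits:
        # Integer arithmetic maps token index i to its segment index.
        per_segment[i * segments // len(tokens)] += 1
    return positions, per_segment

# ASSUMPTION: toy 20-token text with hits clustered near the end.
toy_tokens = ["w"] * 20
for i in (2, 15, 18):
    toy_tokens[i] = "say"
toy_tokens[17] = "says"

positions, per_segment = dispersion(toy_tokens, {"say", "says", "said"}, segments=4)
print(per_segment)
```

In this toy case three of the four hits fall in the final quarter of the text, which is the kind of concluding-part clustering that the figure shows for OBL 5.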

In terms of clues to proximal warning behaviour, the study of “say” concordance lines provides clues to ‘pathway’, that is, involvement in research and planning for attacks, through the analysis of speech verb usage, which indicates a strategic approach to communication aimed at legitimising planned violence. Additionally, the context of the saying verbs provides clues to ‘identification’; that is, it closely associates the terrorist-text authors with authoritative figures and religious texts, identifying with previous attackers like Osama bin Laden or incorporating religious doctrines to justify their actions. They exhibit a warrior mentality and frequently deploy military or religious language to reinforce their identities as agents advancing their causes. The authors’ use of speech verbs to legitimise violence may indicate a violent action imperative (i.e. ‘last resort behaviour’), suggesting a perceived urgency or desperation to act. They may feel they have no other choice but to resort to violence, driven by a combination of ideological conviction and external triggers such as perceived threats or losses in love or work.

Shekau also employs “say” (on 90 occasions) and “tell”. “Tell” is utilised in various structures, including “I tell” (Example 27), “you tell,” third-person (“they tell”) (Example 28), and “who told.” Akin to OBL, “Allah” and “I” are presented as powerful actors issuing commands and guidance, while other voices like “you,” “they,” and “who” are portrayed negatively. The “I say” construction positions Shekau as a powerful figure challenging the Nigerian government or making declarations about changing the status quo (Example 29). In addition, “we say” is used to construct Shekau’s religious ideology by merging himself in the inclusive “we” and associating with divine entities like God or Allah. Second-person-oriented realisations are employed to criticise opponents (Example 30), while third-person-oriented realisations are utilised to validate Shekau’s acts or invalidate opponents’ messages, presenting himself as a leader who, like Bush for example, makes a ‘declaration’ about a change in the status quo [73] and who is up to a political and violent challenge against ‘outsiders’ (Example 29). The burstiness in the use of “SAY” (Example 29) with different speakers allows the author to introduce opposing voices. In this example, the statements of an outsider (e.g. Bush) are challenged or refuted, while the author’s own viewpoint is legitimised or showcased as a demonstration of power.

(27) This is why as leader of this sect I tell you to repent.

(28) The government should stop telling all these lies.

(29) Here is what Bush once said, and we will repeat it here. He said all the fights going on in Iraq and Afghanistan are Christian war, crusade - it is a known issue - and that they will crush Afghanistan; today I will say my own. To the people of the world, everybody should know his status, it is either you are with us Mujahedeen or you are with the Christians.

(30) You [Sultan of Kano, and Jonathan] say you are going to catch us? No one is going to catch us.

Al-Baghdadi, despite frequently using the term STATE, predominantly employs SAY (35 times) in his texts. This lemma is primarily used with authoritative sources of quoted assertions, such as “Allah” or “prophet Mohammad,” which is akin to OBL. By quoting religious texts, al-Baghdadi underscores his ideology of violence and his understanding of the relationship between himself as an ISIS leader and his followers, while relying on his “(mis)readings” of signposted religious texts [56, p. 274] (Example 31, inciting unity around ISIS and avoidance of division). The repeated use of “say” functions to build a “regime of truth” (in Foucauldian terms [74, p. 131; 75]) backed by divine authority, thereby presenting al-Baghdadi’s assertions as religious obligations, compliance with which can “transcend social consciousness and social obligations” [76, p. 292] of his followers. The word “say” signposts “manifest intertextuality” components that mark the identity version of the author [61], given that the decontextualised quotes are part of a ‘symbolic’ social system [22, 77], that is, a proclaimed Islamic identity. The tendency of al-Baghdadi to use the ‘say’ verb when quoting religious texts also provides clues to a distal characteristic of communication, namely being ‘framed by an ideology’, where the speech verb usage reflects deeply ingrained beliefs that are used to justify the author’s intent to commit violence (Example 31).

(31) And our Lord forbade us and warned us against differing and becoming divided, saying, “And do not be like the ones who became divided and differed after the clear proofs had come to them. And those will have a great punishment”… (al-Baghdadi)

In conclusion, the frequent use of the speech verb “say” by OBL, Shekau, and al-Baghdadi underscores its significance in discourse characterised by religious ideologies. This supports previous research indicating that the “say” category could characterise discourse of terrorists from religious backgrounds [e.g. 56]. The strategic deployment of “say” serves to enhance the authors’ epistemic status, express ideological perceptions, and legitimise violence. By examining these speech verbs, we gain insights into the authors’ religious identities and their use of shared ideologies to justify violence. Although the raw frequencies of the ‘saying’ category are low and its direct connection to TRAP-18 categories is not straightforward, the nuanced use of ‘say’ in particular contexts, analysed through its concordance lines, provides valuable insights into rhetorical strategies and communication patterns that can be mapped onto TRAP-18 categories and are helpful for understanding threatening behaviours.

5 Conclusion

While terrorist communications and their analysis are complex and contested [56] and require a focus on multiple linguistic features beyond the scope of this study, the present study has presented a novel focus area by introducing the conceptual burstiness analysis, as well as the frequency analysis of concepts, for linguistic profiling purposes. This study has demonstrated the effectiveness of a corpus-method-assisted approach to uncovering conceptual burstiness within terrorist communications, offering valuable insights for forensic linguistic profilers and security professionals. By analysing word frequency and concordance lines, analysts can identify recurrent thematic elements and linguistic patterns that provide clues to the motivations, ideologies, preoccupations, schemas, violent and oppositional stance indicators, and agendas of terrorist individuals or groups. As highlighted by Shuy [1], such linguistic profiling can inform investigative strategies rather than pinpointing a single individual, and the integration of computational methods enhances the ability to extract meaningful information from linguistic traces in an accelerated manner. Focus on lexical and conceptual regularities in language choice has revealed that contextual factors such as geography (that is part of a conflict, ideological contest, or extent of aspired control) and extremist group’s ideological background shape discourse features, leading to observable similarities and differences in terrorists’ communication strategies e.g. [12]. While similarities could be ascribed to personal attributes (i.e. aggressive attitudes), agendas, and norms of terrorist discourse, variations in repertoires of choice are viewed as being adaptive responses to audience and factors such as ideological background and cross-authors’ socio-historic differences (e.g. ethnicity, religion, geography).

The analysis of conceptual burstiness within terrorist texts has also revealed patterns aligning with TRAP-18 categories, enhancing both retrospective and prospective threat assessment methods. By identifying semiotic clues to pathways to radicalisation, fixation, identification with violent groups, and last resort behaviour, linguistic analysis contributes to threat assessment frameworks and provides security professionals with a linguistically grounded lens on terrorist threatening communications. This linguistic profiling also serves to unravel agendas of influence and the extent thereof, which depart from “already existing political cleavages” [65, p. 190], a response to which might be necessary as part of counter-extremism efforts.

A corpus-method-assisted approach to conceptual burstiness for the purpose of linguistically profiling terrorist communication enables investigators to uncover and study patterns that are stable across a language performance yet might not be directly visible to a human observer [48, p. 317], thus accelerating observation and enhancing the rigour and objectivity of investigative processes. The demonstrated examination of lexical choices offers insights into the rhetorical devices used by terrorists to encourage and justify violence, which merits further attention in future studies. The introduced tool, conceptual burstiness analysis, enables an understanding of the strategic repetition of themes related to religion, ethnicity, nationalism, power dynamics, violence, and attribution patterns; on this basis, authorities can develop more effective strategies for countering terrorist propaganda and recruitment efforts. Additionally, the burstiness of violence and military terms holds significant value for the moderation of objectionable online content, offering crucial insights into communication dynamics and facilitating a deeper understanding of aggressive and oppositional tones within datasets. Future research can focus on dispersion plot analysis for faster identification and examination of instances of burstiness of frequent lemmas. By intervening early based on linguistic indicators of radicalisation and extremist messaging, authorities can prevent individuals from becoming involved in terrorist activities and promote alternative narratives that counter extremist ideologies and their investment in (un)shared identity. The importance of early intervention is accentuated by the fact that repeated semantic choices made within the “cultural framing” of the in-group world can serve to “ideological[ly] penetrate” in-group audiences’ macro and micro-world and mobilise a degree of public support [65, pp. 184–187] and alignments “with grassroots micro-universes” of audiences [65, p. 189, 32, 78].
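
To make the suggested dispersion plot analysis more concrete, the following is a minimal sketch of how the positions of a frequent lemma could be located and binned across a text. It is illustrative only and is not the procedure used in this study: the function names (`tokenize`, `dispersion`, `dispersion_row`), the simple regular-expression tokeniser (a stand-in for proper lemmatisation) and the toy input string are all assumptions introduced here.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for real lemmatisation (assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def dispersion(tokens, term, n_segments=10):
    """Return the token offsets of `term` and its count in each equal-width segment."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    counts = [0] * n_segments
    for pos in positions:
        # Map the token offset onto a segment index in 0..n_segments-1.
        seg = min(pos * n_segments // len(tokens), n_segments - 1)
        counts[seg] += 1
    return positions, counts


def dispersion_row(counts):
    """Render one row of a text-only dispersion plot: '|' marks a segment with a hit."""
    return "".join("|" if c else "." for c in counts)


# Toy input (hypothetical, not drawn from the study's dataset).
toy_tokens = tokenize("war peace war war calm calm calm calm calm war")
toy_positions, toy_counts = dispersion(toy_tokens, "war", n_segments=5)
print(toy_positions, toy_counts, dispersion_row(toy_counts))
```

Segments in which the counts cluster, separated by stretches of empty segments, are the places an analyst would inspect first through concordance lines; how such clusters are then scored as bursty depends on the burstiness measure adopted.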

While the author of this article does not perceive Islam or Christianity (in the case of Jihadist and far-rightist authors, respectively) to be a religion that is in any way characteristic of extremism or particularly likely to motivate extremist rhetoric, the study highlights the role of identity work, namely ‘master’ religious and ethnonational identity and aggressive ‘personal’ identity, in shaping the terrorist-text author’s attitude, identity and way of contesting inter-group relationships and affirming/legitimising pro-in-group violence. Identity also shapes lexical selection [79] in violent extremist discourse. The study further underscores the influence of religious and ethno-nationalist ideologies on authors’ writing and presentation of phenomena, highlighting words as markers of broader identity categories. In line with Bourdieu’s [22, p. 169] concept of ideologies being “always doubly determined,” the analysis of violent extremist ideologies emphasises the sociopolitical interests influencing extremist language and suggests that these ideologies owe their extremity and aggression to the “function of sociodicy,” further underlining the importance of sociolinguistic analysis in understanding and countering terrorist communications.

Ultimately, interdisciplinary collaboration between computational linguistics, forensic analysis, and security studies holds promise for advancing counter-terrorism efforts. In this regard, while the present study acknowledges limitations such as its focus on a limited sample of public statements from specific terrorist figures, the emerging conceptual categories and sub-categories, as well as the repeated lexical items, accord with fully-fledged corpus studies of violent extremist texts (e.g. [42, 56]). By leveraging computational methods to analyse linguistic patterns in terrorist communications, researchers and security professionals can stay ahead of evolving threats, identify emerging trends in extremist rhetoric, and develop proactive measures to safeguard global peace and stability. The integration of computerised linguistic analysis techniques, as also demonstrated by [25], can enable the identification of regularities and patterns in datasets, thus enhancing the effectiveness of investigatory and intervention protocols while meeting “the standards of producing rigorous empirical results” in forensic investigations [26, 49].

Future research could expand the scope to include a more diverse range of sources and explore additional semantic categories and variables influencing linguistic patterns in terrorist communications. In particular, we recommend expanding the dataset or incorporating additional linguistic markers that might correlate with the TRAP-18 categories. Researchers could also explore the application of advanced computational techniques, such as machine learning algorithms, to further enhance the analysis of terrorist texts. Additionally, interdisciplinary collaborations between linguists, sociologists, psychologists, computer scientists and security experts could lead to the development of more nuanced models for understanding and predicting terrorist behaviour based on finer-grained linguistic markers. The emergence of technologies like ChatGPT allows criminals to blend various data types, including language style, video, voice, and geolocation data. This emphasises the need for a multidisciplinary approach and for future research to focus on multimodal texts, so that terrorist texts can be accurately scrutinised and attributes inferred from linguistic characteristics and extensive data, in support of peaceful online social interaction and coexistence.