Introduction

The relationship between power and language can be traced back to Aristotle (2012), and it has continued to be a topic of debate for intellectuals such as Foucault (1970), Habermas (1984), and Bourdieu (1991). Scholars generally agree that in the relationship between power and its expression through language, ‘power constitutes and reproduces discourse, while at the same time being reshaped by discourse itself’ (Antonio et al., 2011: 139). Even though it is recognised that discourse itself shapes power relationships, most previous work on language and power has focused on settings with clear, identifiable power structures and extensive contextual information that can be used to map power in communicative interactions. Fairclough (2001) famously discusses the relationship between power and language by distinguishing between power in discourse (i.e., the ways in which language is used to assert power) and power behind discourse (i.e., how social structures and hierarchies shape and influence language use).

In contexts with observable power structures, a wide range of social and communicative settings has been analysed, including education (e.g., Sinclair & Coulthard, 1975; Craig & Pitts, 1990), the workplace (e.g., Darics, 2020; Kacewicz et al., 2014), healthcare (e.g., Ainsworth-Vaughn, 1998), and legal settings such as police interviews and the courtroom (e.g., Conley et al., 1979; Haworth, 2010). Other studies have instead focused on the realisation of power according to classic sociolinguistic variables such as class (e.g., Alexander, 2005; Kubota, 2003), gender (e.g., Gal, 2012; Lakoff, 2003), and age (e.g., Blum-Kulka, 1990). These studies typically set up a binary contrast between powerful and powerless individuals to investigate how inequality is both supported and revealed through language use. The language of these individuals is analysed in its low-level (often lexical) features, such as the use of intensifiers, hedges, tag questions, and hesitations (Conley et al., 1979; Holtgraves & Lasky, 1999), or of plural vs singular personal pronouns (Kacewicz et al., 2014). In all these studies, however, extralinguistic variables such as social status are actively used to “measure” how powerful or powerless a certain individual is in conversation.

But what about situations in which none of this contextual information is given, such as in anonymous online communities? The advent of computer-mediated communication (CMC) has allowed users to interact anonymously on designated platforms (chatrooms, fora, etc.). Crucially, in the online world, people can construct and present their online identity selectively (Hu et al., 2015; Kim et al., 2011) or create entirely new online personas which differ substantially from their offline identities (for an in-depth review of the latest theories of online identity construction, see Huang et al., 2021). This uncertainty in identity construction is especially pronounced in the context of anonymous criminal or dark web interactions, where users take extra measures to ensure their privacy is protected (Marwick, 2013).

The present study aims to contribute to the study of power in anonymous online spaces by building on a new framework of analysis developed by Newsome-Chandler and Grant (2024). The authors inductively developed a list of pragmatic-discursive textual functions that signal power (or expertise) in interactions, called Power Resources. This innovative concept of different linguistic power tools in interaction allows for a nuanced understanding of how power is performed in anonymous online fora. Power Resources were inductively retrieved by manually coding a 160,000-word sample of 3 fora of different genres: a parenting discussion forum, a white nationalist forum, and a dark web child sexual abuse forum (see sections “Language as power in anonymous online communities” and “Data and methods” for more details).

Newsome-Chandler and Grant’s (2024) methodology is innovative and tackles many open questions in the literature. However, Power Resources were extracted from a comparatively small corpus (circa 160,000 words), and their annotation requires considerable time from human coders. The question of how to scale this analysis to automatic or semi-automatic tagging of larger quantities of text remains open. We aim to address this limitation by using corpus linguistics tools and methods to test whether Power Resources can be successfully broken down into lists of scalable features for automatic feature extraction.

As a case study to illustrate our method, we present here an in-depth analysis of one of these pragmatic-discursive functions, Personal Experience. Personal Experience as a power resource is here defined as posts (or parts of posts) in which the writer claims expertise in the topic discussed via direct personal experience. For example, in a parenting discussion forum (see section “Materials”), a poster asks for advice on whether it is best to adopt domestically or internationally. Users of the forum respond with different types of advice, mostly based on their own experiences of adoption, as in example (1).

  (1) We adopted a mixed race (white and Asian) baby in <country>. Feel free to PM me if you’d like more info.

Here, the writer explicitly claims authority in the debate by providing evidence that they experienced a similar situation, thereby positioning themselves as an expert in the matter. We set out to answer the following research questions:

  (1) Can the pragmatic-discursive Power Resource of Personal Experience in anonymous online interactions be systematically “broken down” into retrievable syntactic and/or semantic patterns? In other words, can we retrieve form from function?

  (2) What—if any—are the syntactic-semantic features typical of the Power Resource Personal Experience?

In the next sections we provide some background on the theoretical underpinnings of this work. The remainder of this section outlines the main trends in the analysis of language and power in online settings (section “Language as power in anonymous online communities”) and the linguistic expression of personal experience as a form of expertise (section “Personal Experience as Expertise”). Section “Data and methods” describes the data and the methods used in the study, section “Data analysis and results” presents the corpus analysis and results, and the final section closes the paper with our conclusions and directions for future work.

Language as Power in Anonymous Online Communities

The concept of “linguistic power” is elusive at best, and this is especially true in online contexts where only intralinguistic features are available and no extralinguistic information is given. As mentioned, scholars researching language and power in non-CMC contexts have focused on low-level linguistic features. In these contexts, low-level features work well because they are supplemented by external data that inform the linguistic interaction. Scholars interested in power in online spaces, on the other hand, have variously recognised the need for a more nuanced approach to analysing how power and expertise are expressed and negotiated in language. Power in these contexts is defined as an interactional notion emerging through the exchange (i.e., Fairclough’s power in discourse) rather than reflecting external institutional or social power (i.e., Fairclough’s power behind discourse).

For example, Bolander’s (2013) study of blogs shows that power is exercised via conversational control (turn-taking, speakership, and topic control) and type of argument proposed. McKenzie (2018) looks at discourses of authority and motherhood on Mumsnet using feminist poststructuralism. The author is interested in how users negotiate, resist, and subvert socio-cultural discourses of motherhood, and finds different discursive and linguistic mechanisms through which contributors position themselves in different ways within this debate around identity and motherhood. Paulus et al. (2018) investigate online learning spaces with Conversation Analysis and focus on how learners use agreement, disagreement, and personal experiences. The authors find that these discursive tools are used in online discussions to “perform a variety of functions”, such as supporting and strengthening academic arguments, affiliating or distancing from other users, and claiming expertise and personal involvement in the topic discussed.

Notably, Perkins (2021) experimentally showed for the first time that low-level linguistic features are inadequate to capture power in online contexts, failing to clearly map onto social and institutional power differences in a dataset of emails from a hierarchical institution. Perkins’ work focused only on low-level linguistic features, including part-of-speech tags, intensifiers, pronouns, hedging, politeness, and taboo words. The study showed that it is very difficult to untangle “power” from other social factors and influences in language with low-level features and no contextual information. As already highlighted by Paulus et al. (2018) and others (Graham, 2007; Buttny, 2012), contextual information is fundamental to understanding how power is expressed through language.

Building on these findings, Newsome-Chandler and Grant (2024) developed a new approach to understanding and analysing power performance in anonymous online interactions, rooted in interactional pragmatics (Locher & Graham, 2010). In this framework, power performance is understood as a multifaceted discursive phenomenon consisting of multiple discursive resources that individuals can draw upon to make claims attempting “to persuade the original poster (and other interactants present in the fora) to listen to their point of view and potentially to act in a particular fashion” (Newsome-Chandler & Grant, 2024: 4). In other words, within one interaction users can combine different power resources to elicit different facets of expertise and persuade others to listen to them.

The authors manually annotated a 160,000-word dataset sampled from 3 online anonymous fora. Using insights from discourse analysis and pragmatics, they developed a list of 10 different discursive power resources (henceforth: PwR) that are routinely employed by users to perform power and expertise in interaction. These are: Community Expertise, Community Specific Initialisms, Technological Expertise, Veteran Power, Accredited Expertise, Topic Expertise Through Personal Experience, Broad Topic Expertise, Private Knowledge, Citing a Secondary Source for Authority, and Subject of Law Enforcement/Investigations (for an in-depth discussion of all of these, see Newsome-Chandler & Grant, 2024: 6-10). PwRs are functional and pragmatic strategies that showcase different facets of linguistic power and expertise. The authors point out that the fact that PwRs have been identified across very different datasets (see section “Data and methods”) “highlights the transferability and potential generalisability of our resource model of power in interaction” (Newsome-Chandler & Grant, 2024: 20). However, larger datasets need to be coded with semi-automated methods to both validate the coding set and prove scalability.

The present paper aims to contribute to this question by investigating whether discursive functions of linguistic power can be represented by formal features for automatic detection. We present an in-depth case study to illustrate how corpus linguistics can be used to extract syntactic and/or semantic features typical of power resources. Specifically, we present our analysis of the PwR of Topic Expertise Through Personal Experience, which, as mentioned, refers to claims of direct personal involvement in the topic discussed. Personal Experience was chosen as a first case study for two reasons. Firstly, it is the only category among Newsome-Chandler and Grant’s (2024) Power Resources that has been extensively studied in the literature, allowing us to compare our findings with previous work and providing a better foundation for future work on less-studied resources of power performance. Secondly, Personal Experience is the resource with the most data among the 10 identified in Newsome-Chandler and Grant (2024), making findings more generalisable. These properties make Personal Experience a strong candidate to illustrate our method.

Personal Experience as Expertise

The expression of personal anecdotal experience has been recognised by the literature as a source of expertise. More specifically, it has been widely demonstrated across the social sciences and linguistics that sharing personal experiences in conversational settings gives interlocutors authority, hence providing them with power in the context of the interaction. Personal narratives are an important tool to establish bonds and legitimacy in a social group (Fine, 2013; Polanyi, 1985), and they have been found to be an important tool that professionals use to position themselves as experts and construct their institutional identities (Dyer & Keller-Cohen, 2000). The construction and negotiation of new kinds of identities is also discussed by Jones (2016), who shows how participants in the It Gets Better campaign use different types of personal narratives to gain “textual authority”.

The expression of personal experience has also been linked to a “narrative evidence” function (Hong & Park, 2012) and to “personal expertise” used to engage with others while commenting on health science news articles (Shanahan, 2010). Buttny (2012) further argues that “experience is a resource that we can draw upon” (Buttny, 2012: 604). That is, the use of personal experience is seen as an asset that the individual deploys malleably in different contexts. Constructivist learning theories also acknowledge that personal narratives serve as a valuable foundation for learning interactions, as personal experiences may be regarded by peers as the most trustworthy form of evidence (Leont’ev, 1979). Moreover, research has shown that resorting to anecdotal and personal rather than factual expertise is also more engaging for an “invisible audience” (Lester & Paulus, 2011; Stokoe et al., 2013).

Although most of the research in this area focuses on “offline genres”, with the advent of widespread use of the Internet for user-generated content—the so-called Web 2.0 (Allen, 2013; Han, 2011)—scholars have become increasingly interested in how online communities construct and share personal narratives. For example, Page (2018) uses discourse analysis to develop a framework for analysing large-scale narratives on four social media platforms: Wikipedia, Facebook, Twitter, and YouTube. Stone et al. (2022) conduct experimental research to investigate why young adults share personal experiences on social media, finding social reasons to be the primary motive.

Overall, the literature recognises that personal experience as a discursive function is a tool available to interactants to express authority and/or expertise in a subject matter. Research on online spaces is growing, but the area is still vastly understudied.

But what is personal experience from a linguistic perspective? Interestingly, studies from different areas (such as linguistics, psychology, and media studies) all seem to converge on a “core” group of syntactic and semantic features of personal experience. The main syntactic and semantic features of personal narratives and experiences are the overuse of first-person pronouns and past tense verbs (Bamberg, 1987; Hayes & Casey, 2002; Hudson & Shapiro, 1991). Additional textual markers of narrative building include specifications of the setting of the event (to “set the scene”), linguistic items that enhance cohesiveness, such as pronominalisation and interclausal conjunctions (Hayes & Casey, 2002; Hudson & Shapiro, 1991), and viewpoint markers that depict personal stance, including verbs of cognition, perception, and emotion (Pulvermacher & Lefstein, 2016; Sacks, 1986; Stivers, 2008; van Krieken, 2018).

Personal experience and narratives have been widely investigated in the linguistic and social sciences literature. This breadth of literature, however, is lacking in several respects. Despite recognising experience as expertise, there are, to the best of our knowledge, no studies specifically on anonymous online contexts. Furthermore, the linguistic features of personal experience have so far been discussed only in offline and/or non-anonymous genres. We set out to fill this gap by exploring the linguistic features of personal experience following Newsome-Chandler and Grant’s (2024) coding scheme.

Data and Methods

Materials

As data, we use the sample of anonymous online interactions used in Newsome-Chandler and Grant (2024). Specifically, the corpus used is a sample from 3 multi-million-word corpora of online fora of different genresFootnote 1. The first forum is a well-established clear web parenting discussion forum (henceforth: PD)Footnote 2. The platform is entirely legal and hosts a well-known community that discusses issues and advice regarding, but not limited to, parenthood. The second forum focuses on white nationalist and racist ideologies (henceforth: WN). The site attracts users who are interested in topics such as white supremacy, conspiracy theories, misogyny, and homophobia. Although the website is on the open web, it is centred around ideologies and themes generally rejected by society and public opinion, and includes discussions of criminal activities. The third and last forum is a child sexual abuse forum from the dark web (henceforth: DW). Users on this forum discuss highly illegal topics related to paedophilia, and are hence very careful about their online anonymity.

These 3 online fora were first sampled by the authors by extracting 8 prompting interrogative phrases in opening posts that elicited advice from other users (e.g., “How do I…?”, “How do you…?”, “Do you know…?”, “Why did..?”, etc.). In fact, the authors observed that users responding to explicit requests for advice would try to present themselves as reliable and trustworthy by claiming authority through different pragmatic strategies. From this first subcorpus, a smaller dataset of 160 threads was selected by extracting the 20 most recent threads for each prompting phrase. The dataset was sampled for manual analysis by selecting the first 3 threads for each of these different lengths: 5 to 9 posts, 10 to 19 posts, and 20 to 49 posts. The final corpus used by Newsome-Chandler and Grant amounts to 72 threads and around 160,000 words (Newsome-Chandler & Grant, 2024: 5-6).

From this dataset we manually extracted all threads containing one or more posts tagged as Personal Experience. This resulted in 258 posts across corpora (138 PD, 70 WN, 50 DW), across all 72 threads. The dataset was then divided with a traditional 70:30 train/test split, a common practice for small datasets in machine learning. That is, corpus analyses were performed on threads containing 70% of the posts coded as Personal Experience (181 total posts: 97 PD, 49 WN, 35 DW, over 37 different threads), while the remaining 30% was left as test data for the computational stage of the project (Htait et al., 2024; see the concluding section). Table 1 reports the total number of tokens and words for the data.
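The 70:30 split described above can be sketched in a few lines of Python. Splitting 70% within each forum reproduces the reported per-forum training counts (97 PD, 49 WN, 35 DW). This is an illustrative sketch only: the original split operated on threads of coded posts, and all names below are ours.

```python
import random

def stratified_split(posts_by_forum, train_frac=0.7, seed=42):
    """Shuffle each forum's posts and split them train_frac : (1 - train_frac)."""
    rng = random.Random(seed)
    train, test = [], []
    for forum, posts in posts_by_forum.items():
        shuffled = posts[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_frac)
        train += shuffled[:cut]
        test += shuffled[cut:]
    return train, test

# Post counts per forum from the dataset described above (258 posts in total).
counts = {"PD": 138, "WN": 70, "DW": 50}
posts_by_forum = {f: [(f, i) for i in range(n)] for f, n in counts.items()}
train, test = stratified_split(posts_by_forum)
print(len(train), len(test))  # 181 77
```

Splitting per forum rather than over the pooled 258 posts keeps the genre proportions of the training and test data identical, which matters when the three fora behave differently.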

Table 1 Corpus size of Personal Experience total data

The resulting materials were analysed with an inductive corpus-driven methodology, as we made “minimal a priori assumptions regarding the linguistic features that should be employed for the corpus analysis (…) [C]ooccurrence patterns among words, discovered from the corpus analysis, are the basis for subsequent linguistic descriptions.” (Biber, 2015: 196).

Methods

As mentioned, the overarching aim of the study is to illustrate how to break down functional annotations of PwRs into automatically retrievable linguistic patterns. To do so, we developed an inductive protocol to extract syntactic and/or semantic patterns from manually annotated posts, combining different corpus linguistics methodologies rooted in keyness analysis (Rayson, 2008). More specifically, we combine keyness analysis of Part of Speech (POS) tags and of semantic domains with Corpus Query Language (CQL) queries.

Keyness analysis is a technique widely used in corpus linguistics that identifies items “which occur with unusual frequency in a given text […] by comparison with a reference corpus of some kind” (Scott, 1997: 236). Scholars conduct keyness analysis at different levels of linguistic analysis. While the most common is certainly key words, the analysis of key part-of-speech tags and semantic domains is also attested in the literature (Busso & Vignozzi, 2017; Culpeper, 2009; Egbert & Biber, 2023; Rayson, 2008). CQL is a formal language that can be used to conduct searches in different corpus processing software. It is a powerful tool to retrieve complex patterns and structures from a concordancer. In their simplest form, CQL queries consist of attribute-value pairs, structured as [attribute= “value”] (e.g., [lemma= “go”]) (Klyueva et al., 2017).

Individually, these methodologies are widely used in corpus linguistics research. However, this is the first study, to the best of our knowledge, that combines them into a unified protocol for corpus exploration. The protocol includes:

  (1) Keyness analysis of part-of-speech (POS) tags on Sketch Engine (Kilgarriff et al., 2004; tagset: English Penn Treebank, Marcus et al., 1993; keyness score: simple math, Kilgarriff, 2009).

  (2) Keyness analysis of semantic domains on WMatrix 5 (Rayson, 2008). This analysis is performed using the USAS tagger (Rayson et al., 2004), which automatically assigns semantic fields to each word or multiword expression in the data (keyness score used: Log Likelihood).

  (3) CQL queries of patterns based on the results of the previous two keyness analyses.

Together, these three tools allow the researcher to triangulate evidence from morphosyntax (POS tags), semantics-pragmatics (semantic domain analysis), and the interaction between the two in the form of complex retrievable syntactic + semantic patterns (see section “Data analysis and results” below).
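For illustration, the two keyness scores used in steps (1) and (2), Sketch Engine's "simple maths" ratio and the log-likelihood statistic, can be computed as follows. This is a minimal sketch following the published formulas (Kilgarriff, 2009; Rayson, 2008); function names and the example frequencies are ours, not figures from the study.

```python
import math

def simple_math(freq_focus, size_focus, freq_ref, size_ref, n=1.0):
    """Sketch Engine's 'simple maths' keyness: (fpm_focus + n) / (fpm_ref + n),
    where fpm is frequency per million tokens and n is a smoothing constant."""
    fpm_focus = freq_focus / size_focus * 1_000_000
    fpm_ref = freq_ref / size_ref * 1_000_000
    return (fpm_focus + n) / (fpm_ref + n)

def log_likelihood(a, c, b, d):
    """Two-corpus log-likelihood (Rayson): a, b = observed frequencies of an
    item; c, d = sizes of the focus and reference corpora in tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the focus corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# e.g. an item seen 120 times in 27,525 focus tokens vs 60 in 31,117:
score = log_likelihood(120, 27_525, 60, 31_117)
print(score > 6.6)  # True: above the p < 0.01 critical value of 6.6
```

The simple maths score ranks items by how many times more frequent they are in the focus corpus, while log-likelihood tests whether the difference is statistically significant.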

As a preliminary step for all analyses, the data for each of the 3 fora were further subdivided into two subcorpora: a focus corpus, which includes all posts or parts of posts coded as “Personal Experience” (PersExp) in Newsome-Chandler and Grant’s (2024) dataset, and a reference corpus, which includes all the rest of the threads, which we call “Non-Personal Experience” (NonPersExp). In this way, linguistic elements that are statistically more relevant in the PwR are foregrounded with respect to the rest of the data.

For example, in the redacted excerpt in (2) below, from a post in the PD forum (with simplified XML tags), the first paragraph would be put in the focus corpus “PersExp”, since it is explicitly tagged as suchFootnote 3. The post comes from a thread on adoption, specifically discussing how to behave around children adopted by close family, who get overwhelmed by attention very easily. In example (2) we can see the poster explicitly using their experience in the matter as “evidence” to make other users pay attention to their opinion, expressed at the very beginning of the post (i.e., that parents sometimes overestimate the child’s resilience). The rest of the post, which is tagged as another power resource, Community Expertise, directs the original poster to another thread in the same forum. Both the initial advice (which is not tagged) and this last sentence would be categorised in the reference “NonPersExp” subcorpus.

  (2) They may have overestimated the child and their resilience. <personal_experience> (…) I was eager to have family meet my baby but struggled sometimes to assert myself (…)! I still remember crying in my kitchen after a well-intended cousin kept getting in her space (…). It took months to feel like I could just say no. </personal_experience> <community_expertise> It’s similar to the threads you see on the board where new mothers struggle to tell no to relatives wanting hugs and kisses etc. </community_expertise>

The resulting subcorpora are quite balanced, with PersExp amounting to 27,525 tokens (47% of the corpus) and NonPersExp to 31,117 tokens (53% of the corpus).

All keyness analyses (POS tags and semantic domains) were performed comparing the two subcorpora against each other. CQL queries were also conducted separately in the two subcorpora, to compare frequency of occurrence of the retrieved grammatical patterns (see Appendix 1 for the complete list).

For the keyness analysis of semantic domains, only items above the statistical significance threshold for Log Likelihood (i.e., a score of 6.6, corresponding to p < 0.01) were selected. Furthermore, since semantic domains are highly data-dependent, only non-context-dependent semantic domains were used. For example, since PD obviously discusses many issues regarding children, parenthood, and family, all semantic domains concerning children and family relations were discarded. All significant semantic domains were manually checked before discarding.
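As a minimal illustration of this filtering step, the two criteria just described (a log-likelihood score above 6.6 and absence from a manually compiled list of context-dependent domains) can be expressed as a simple filter. The domain names, scores, and exclusion list below are invented placeholders, not results from the study.

```python
# Keep a semantic domain only if its LL score clears the significance
# threshold and it is not flagged as context-dependent.
LL_THRESHOLD = 6.6  # p < 0.01 critical value for log-likelihood, 1 d.f.

domains = {
    "Time: Past": 31.2,
    "Kin": 54.8,            # context-dependent in a parenting forum
    "Thought/Belief": 12.4,
    "Colour": 4.1,          # below the significance threshold
}
context_dependent = {"Kin"}

selected = {name: ll for name, ll in domains.items()
            if ll >= LL_THRESHOLD and name not in context_dependent}
print(sorted(selected))  # ['Thought/Belief', 'Time: Past']
```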

After using keyness analysis to reveal salient grammatical and semantic categories in the data, results from the two analyses were integrated into CQL searches. In this way, we combine key syntactic (i.e., POS) and semantic (i.e., semantic domain) features into retrievable patterns. For example, we established that determiners are a key part of speech and particularisers a key semantic domain in the DW forum, so we can combine this information into one CQL search that looks for particularisers belonging to the POS of determiners: [tag= “DT” & lemma= “a few|some|several|both”]. These features can then be used to automatically detect PwRs in new untagged data.

Data analysis and results

Keyness analysis of POS tags

As described in section “Data and methods”, a keyness analysis of salient POS tags was performed using all posts coded as Personal Experience as the focus corpus and all the rest of the threads as the reference corpus. Table 2 reports the key POS tags for PD, WN, and DW respectively, with results ordered by keyness score (threshold: 1)Footnote 4. The tables are colour-coded as follows:

  • In green: linguistic items which are shared across the 3 datasets.

  • In blue: linguistic items which are shared by 2 out of 3 datasets.

  • In orange: items that are not identically present in the 3 datasets but that refer to the same general linguistic ‘category’ (e.g., coordinating and subordinating conjunctions).

  • In red: linguistic items idiosyncratic to one of the datasets.

Table 2 Key POS tags for the 3 datasets

Table 2 shows that, besides idiosyncratic items for each dataset, there is a high number of shared features across datasets. Features of Personal Experience that are shared across these three very different datasets suggest general trends in how users convey their own lived experience through language: referring to past events (past tense verbs) and situations (existential “there”) that happened to them (possessive pronouns). Parataxis is preferred to hypotaxis (coordinating conjunctions), and actions are modified by adverbs. Figure 1 visualises shared features across the 3 datasets.

Fig. 1
figure 1

Shared and idiosyncratic POS tags across the 3 datasets.

These results almost perfectly align with previous literature investigating features of personal narratives.

As we mentioned in section “Introduction”, personal narratives tend to use first-person pronouns and past tense (Bamberg, 1987), interclausal conjunctions (Hudson & Shapiro, 1991), location, temporal and stance markers (Stivers, 2008; Van Krieken, 2018).

This finding is crucial for two important interrelated reasons. First, the genres we are exploring are very different from any genre previously studied in the scholarship on personal narratives. Furthermore, since the fora are anonymous, there is no way of controlling for variables such as native language or education level. Despite this, we still find features of Personal Experience that largely overlap with the scholarship, strengthening the idea of a “core” set of linguistic strategies for expressing personal experience that transcends genre, native language, and other potential extraneous sociolinguistic variables. Second, this independent result also functions as a validation of Newsome-Chandler and Grant’s (2024) coding scheme.

Keyness analysis of semantic domains

Analysis of key semantic domains was performed on overused domains in the focus corpus above the statistical significance threshold for the Log Likelihood score. Since, as mentioned, semantic domains can be data-dependent, all the relevant domains were manually checked: only semantic domains (1) represented by a sufficient number of non-hapax words and (2) not context-dependent were selected. Figures 2a and b exemplify respectively a viable and a non-viable semantic domain according to these criteria.

Fig. 2
figure 2

a Viable semantic domain, which includes several different words with frequencies > 1 b non-viable semantic domain, which includes only a limited set of lexical items and with very low frequencies

Table 3 reports the domains analysed, which were obtained by setting a working frequency threshold of 3. Figure 3 plots shared and idiosyncratic features across datasetsFootnote 5. Since WMatrix semantic domains can be quite specific (as seen in Table 3), we grouped them under general umbrella categories, such as “Quantity” or “Time”.

Table 3 Analysed key semantic domains for the 3 datasets
Fig. 3
figure 3

Shared and idiosyncratic semantic domains across the 3 datasets. The categories’ names are the original semantic domains from the USAS tagset used in WMatrix. For example, “Grammatical Bin” in WN refers to grammatical items (prepositions, adverbs, conjunctions) that the system was not able to categorise in other groups. “Exclusion” refers to general or abstract terms denoting a level of exclusion, etc.

Semantic keyness analysis expands the findings on key POS tags. In particular, the general pattern that seems to emerge is that shared semantic features of Personal Experience across corpora include references to time, especially the past, adverbs of quantity, and the use of have as an auxiliaryFootnote 6. All three fora also exhibit similar semantic domains that provide an evaluative component of the situation reported (“Thought/Belief”, “Seem”). This finding is again consistent with much of the relevant literature, showing how personal stance is a common feature of personal narratives (Baynham, 2011; Stivers, 2008; Van Krieken, 2018), and with Labov and Waletzky’s (1997) findings on how narratives fulfil both a referential and an evaluative function. That is, not only the syntactic but also the semantic keyness analysis seems to suggest that there are some very recognisable cross-genre linguistic features of Personal Experience. This is encouraging in view of our overarching aim to build automatic feature detectors.

Besides shared features, however, it is also interesting to note the idiosyncrasies of each forum. For example, WN and DW do not share any semantic domains beyond the core ones. This suggests that the two fora express personal experiences with different strategies, and for different reasons. DW and PD, on the other hand, share a number of additional features. This would appear to suggest that the two fora use Personal Experience in a similar way.

Figure 4 helps to visualise similarities and differences in how users employ Personal Experience in the 3 fora. Semantic domains and POS tags were grouped into inductive categories to avoid an overpopulated graph. This means, for example, that all forms of adjectives (base, comparative, superlative, wh-determiner) are grouped under the same umbrella category “Adjectives”. In the same way, all expressions of time (Time:present, Time:past, Time:future, Frequent, etc.) were grouped under the category “Time”.

Fig. 4

Frequencies of grammatical (above) and semantic (below) categories in the three corpora

As can be seen from Fig. 4, each dataset uses Personal Experience features differently. PD and DW both use a high number of adverbs, while WN appears to use them less. On the other hand, WN overuses action verbs (i.e., the semantic domains of “Using”, “Moving”, “Finding”) with respect to PD and DW.

In sum, users seem to give advice and express expertise through Personal Experience using a “core” of cross-genre shared linguistic features, at both the syntactic and the semantic level (e.g., the use of the past tense, and references to time in general). Besides this shared core, there are fuzzy categories of features that are shared at an abstract level but realised differently from dataset to dataset (e.g., the way of providing evaluation of the event), and features that can be considered completely idiosyncratic to a certain dataset (e.g., the use of modal verbs in WN). If we group the features into larger categories and plot them, we can easily visualise variation and similarities across datasets. For example, from the first mosaic plot in Fig. 4 we can see that PD uses far fewer participial forms than DW and WN. Similarly, from the second mosaic plot we can see that DW uses expressions of Quantity much more than the other two corpora.

CQL Queries

Corpus query systems are a powerful and popular tool for linguistics and, more generally, the digital humanities. Different tools are available, with slightly different syntaxes. Among these, CQL (Rychlý, 2007) is one of the most widely used (Culpeper, 2009). CQL is a regular-expression-based “query language used (…) to search for complex grammatical or lexical patterns” (SketchEngine documentation).

In this study, we use CQL queries to combine the key syntactic and semantic features into retrievable syntactic-semantic patterns. For example, knowing that in PD one of the key POS tags is adverbs and that particularisers are semantically relevant, the two can be combined in the pattern: [tag="RB.*" and lemma="just|very|so|really|also|well"], i.e. “all the listed lemmas that are adverbs, in base, comparative, or superlative form”. We also know that personal pronouns, past tense, and verbs of cognition (“Thought and Belief” in Table 2) are key features. Hence, we can combine them in a single search: [tag="PP" and lemma="I|we"] []{0,1} [tag="V.D" and lemma="have|think|feel"], i.e. “patterns that have a personal pronoun as a subject, either I or we, followed by have, think, or feel in the past tense”.
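The logic of such a query can be emulated in a short sketch. The matcher below, including the (word, tag, lemma) token triples, is our own illustration of the idea, not SketchEngine's implementation; real CQL supports a much richer syntax.

```python
import re

# CQL-style matching over a POS-tagged, lemmatised token stream.
# A pattern is a list of (tag_regex, lemma_regex, optional) constraints.
def match_at(tokens, i, pattern, p):
    """Return the end index of a match starting at token i, or None."""
    if p == len(pattern):
        return i
    tag_re, lemma_re, optional = pattern[p]
    if optional:  # try the zero-token reading of []{0,1} first
        end = match_at(tokens, i, pattern, p + 1)
        if end is not None:
            return end
    if i < len(tokens) and re.fullmatch(tag_re, tokens[i][1]) \
            and re.fullmatch(lemma_re, tokens[i][2]):
        return match_at(tokens, i + 1, pattern, p + 1)
    return None

def find_matches(tokens, pattern):
    """All (start, end) token spans matching the pattern."""
    hits = []
    for start in range(len(tokens)):
        end = match_at(tokens, start, pattern, 0)
        if end is not None:
            hits.append((start, end))
    return hits

# [tag="PP" and lemma="I|we"] []{0,1} [tag="V.D" and lemma="have|think|feel"]
pattern = [("PP", "I|we", False),
           (".*", ".*", True),              # []{0,1}: an optional gap token
           ("V.D", "have|think|feel", False)]

tokens = [("I", "PP", "I"), ("really", "RB", "really"),
          ("thought", "VVD", "think"), ("so", "RB", "so")]
print(find_matches(tokens, pattern))  # → [(0, 3)], the span "I really thought"
```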

All retrieved concordances were manually checked and analysed, and frequency lists of the lemmas and POS tags to the right and left of the KWIC were compiled. This helped expand the list of searches by inductively exploring the corpus for frequent repeated patterns.
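The concordance step can be sketched as follows, assuming the same illustrative (word, tag, lemma) token triples and (start, end) hit spans as above; this is a toy rendering of the procedure, not the tool actually used.

```python
from collections import Counter

# For every hit span (start, end) returned by a query, tally the lemma
# immediately to the left and to the right of the KWIC. The data here is
# invented for illustration.
def flank_frequencies(tokens, hits):
    left, right = Counter(), Counter()
    for start, end in hits:
        if start > 0:
            left[tokens[start - 1][2]] += 1
        if end < len(tokens):
            right[tokens[end][2]] += 1
    return left, right

tokens = [("Well", "UH", "well"), ("I", "PP", "I"),
          ("thought", "VVD", "think"), ("so", "RB", "so"), ("too", "RB", "too")]
hits = [(1, 3)]  # the span "I thought"
left, right = flank_frequencies(tokens, hits)
# "well" is the most frequent left collocate, "so" the most frequent right one
```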

Queries were run for all three corpora separately. The list of all CQL searches for the 3 fora is provided in the Appendix, but it is worth noting that it is not meant to be exhaustive. The list provided is intended to give the reader an idea of the type of searches that can be devised following this protocol.

To verify whether the patterns retrieved are indeed typical of the PwR of Personal Experience, the frequencies of the CQL queries were extracted from both the PersExp and the NonPersExp subcorpus, and a one-tailed paired-sample t-test was performed. Frequency data were log-transformed to approximate a normal distribution. Results confirm that the frequency of the CQL queries in PersExp is significantly higher than in NonPersExp (t = 5.44, p < 0.001). This suggests that all the patterns extracted are indeed salient features of the PersExp subcorpus.
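The significance check amounts to a few lines of computation. The sketch below works the one-tailed paired-sample t-test out by hand on log-transformed frequencies; the frequency values are invented for illustration and are not the study's data.

```python
import math
from statistics import mean, stdev

# Paired t-test on log-transformed per-query frequencies: for each CQL
# query we take log(freq in PersExp) - log(freq in NonPersExp) and test
# whether the mean difference is positive.
def paired_t(xs, ys):
    diffs = [math.log(x) - math.log(y) for x, y in zip(xs, ys)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1  # t statistic and its degrees of freedom

pers = [34, 51, 12, 27, 80, 19]   # CQL query frequencies in PersExp
nonpers = [10, 22, 5, 9, 31, 7]   # the same queries in NonPersExp
t, df = paired_t(pers, nonpers)
# a large positive t supports the one-tailed hypothesis that the
# PersExp frequencies are higher
```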

Conclusions and Future Work

This study set out to achieve a two-fold aim. First, we presented an in-depth case study of one of the 10 power resources identified by Newsome-Chandler and Grant (2024): Personal Experience. We described the linguistic features of Personal Experience as a power resource, combining pragmatic annotations with corpus linguistics to retrieve key lexico-grammatical patterns of pragmatic functions in text. Second, we introduced an innovative corpus-driven protocol that combines keyness analysis of POS tags and semantic domains with regex-based corpus searches (CQL).

From a content perspective, the study presented innovative insights into the use of personal experiences as expertise in online anonymous interactions. In particular, by analysing a textual type (anonymous forum threads) which is very far from the typical genres explored in the literature, we still found features that align with and expand on the scholarship. This strongly suggests (1) that there is a “core” of semantic and/or syntactic features of personal experience which seems to be genre-independent; and (2) that our corpus-driven protocol is useful for retrieving syntactic and semantic features of high-level pragmatic annotations (in our case, claims to power and power performance). These features can then be used as cues to automatically retrieve the pragmatic functions in new, untagged data. This methodological contribution is crucial to the intersection of pragmatics and corpus linguistics, as it allows the researcher to scale pragmatic annotation, notoriously a laborious and time-consuming process, to bigger datasets.

Despite its preliminary nature, the study shows that this approach is promising and offers a novel way of combining pragmatics with corpus and computational linguistics. In fact, we have been able to identify consistent sets of retrievable linguistic features of Personal Experience. These are obtained by triangulating evidence from keyness analysis of part-of-speech tags and semantic domains, and from CQL queries that bring together syntactic and semantic features into retrievable patterns.

The reliability of our findings is supported by a vast literature on personal narratives that generally converges on the same type of features, albeit drawn from vastly different genres (section “Data analysis and results”). This, we argue, strongly suggests that we were able to extract “core” characteristics of the linguistic expression of personal experiences and expertise. Although further work is needed, our findings support the idea that this “core” set is mostly genre-independent and could be effectively generalised and applied to different types of anonymous online interactions. Our findings also align with claims in Functional and Usage-Based linguistics on the interrelatedness of linguistic form and function. In fact, a recent area of research in cognitive linguistics has started to see even textual genres as constructions (i.e., structural units of form and meaning, as in Östman & Fried, 2005). Like genre-specific constructions, we can conceptualise our features as “function”-specific constructions, as we showed that the functional category of Personal Experience is consistently expressed through a set of syntactic and semantic units.

Besides the core group of shared features, there are also a number of idiosyncratic features for each forum. This is especially true in the key semantic domains analysis (section “Keyness analysis of semantic domains”), as there are fewer shared semantic categories than POS tags. This is partly due to the simple fact that there are more semantic categories than POS tags, and partly due to polysemy. Although our focus has been on identifying features shared across genres, another fruitful research area might look into genre-specific and idiosyncratic features of personal experiences.

Given that our aim is to create scalable features for computational feature detection, we are currently exploring the use of the corpus results as training material for automatic feature-extraction algorithms to detect PwRs in both the test data and new corpora. Although this phase of the project is still in its preliminary steps, some encouraging results are already being achieved. Htait et al. (2024) detail how machine learning algorithms have been developed from corpus features for automatic feature detection.

As for possible applications of our methodology, the primary purpose of this study is forensic in nature. For example, a natural application of the corpus-driven method presented here would be the analysis of large dark web criminal fora to find powerful users. In their sample, Newsome-Chandler and Grant were already able to identify users who made more numerous and more diversified claims to power and authority than other users (Newsome-Chandler and Grant, 2024: 10 ff). With corpus linguistics and machine learning (Htait et al., 2024), we would be able to replicate that result on a much larger scale, identifying users who present as more or less powerful than others and creating a hierarchy of power performance. However, our study has the potential to be useful beyond forensic applications, as the theoretical and methodological framework developed here can also be fruitfully used to analyse CMC more widely. In fact, our protocol for feature extraction can potentially be used in any situation in which researchers want to extend and generalise a qualitative analysis based on pragmatic or discursive annotations (such as Move Analysis, Appraisal, Critical Discourse Analysis, etc.).