Collective self-understanding: A linguistic style analysis of naturally occurring text data

Cork, Alicia; Everson, Richard; Naserian, Elahe; Levine, Mark; Koschate-Reis, Miriam

doi:10.3758/s13428-022-02027-8


Open access
Published: 28 November 2022

Volume 55, pages 4455–4477, (2023)

Behavior Research Methods


Alicia Cork (ORCID: orcid.org/0000-0003-2892-9615)^1, Richard Everson^2,4, Elahe Naserian^5, Mark Levine^6 & Miriam Koschate-Reis^3,7


Abstract

Understanding what groups stand for is integral to a diverse array of social processes, ranging from understanding political conflicts to organisational behaviour to promoting public health behaviours. Traditionally, researchers rely on self-report methods such as interviews and surveys to assess groups’ collective self-understandings. Here, we demonstrate the value of using naturally occurring online textual data to map the similarities and differences between real-world groups’ collective self-understandings. We use machine learning algorithms to assess similarities between 15 diverse online groups’ linguistic style, and then use multidimensional scaling to map the groups in two-dimensional space (N = 1,779,098 Reddit comments). We then use agglomerative and k-means clustering techniques to assess how the 15 groups cluster, finding there are four behaviourally distinct group types – vocational, collective action (comprising political and ethnic/religious identities), relational and stigmatised groups, with stigmatised groups having a less distinctive behavioural profile than the other group types. Study 2 is a secondary data analysis where we find strong relationships between the coordinates of each group in multidimensional space and the groups’ values. In Study 3, we demonstrate how this approach can be used to track the development of groups’ collective self-understandings over time. Using transgender Reddit data (N = 1,095,620 comments) as a proof-of-concept, we track the gradual politicisation of the transgender group over the past decade. The automaticity of this methodology renders it advantageous for monitoring multiple online groups simultaneously. This approach has implications for both governmental agencies and social researchers more generally. Future research avenues and applications are discussed.


Introduction

Much of our social existence is impacted by the groups to which we belong, and the in- and outgroup distinctions that we make. Understanding what groups stand for is therefore of great importance to a wide range of scholars in the social sciences. For instance, on the societal level, it allows political scientists, sociologists and psychologists better insight into social phenomena such as emerging political movements (e.g., Bednarek-Gilland, 2015; Van Bavel & Packer, 2021), shifts in power structures, national and international conflicts (McCann et al., 2020; Smith, 2021; Thome, 2015), the impact of groups on sustainability (Horcea-Milcu et al., 2019; Udall et al., 2020) and public health behaviours (Cruwys et al., 2020; Wakefield et al., 2019), radicalisation (Hogg, 2021; Smith et al., 2020), and group processes such as cohesion and social belonging (Buhrmester et al., 2018; Healy, 2019). On an organisational level, the alignment between employees’ values and the values held by the company or specific work teams has been shown to affect positive working outcomes including productivity and work satisfaction (Chung, 2017; van Knippenberg & Hogg, 2018). However, social scientists currently have to rely on relatively small-scale survey and interview data that often focus on a single group outside the wider social context, to assess a group’s collective self-understanding. Here, we propose a computational method that uses naturally occurring text data from a range of groups to understand similarities and differences in collective self-understanding.

Collective self-understanding reflects the basic norms and abstract values that a group holds, along with the group’s purpose (e.g., whether the group is a collective action group or organisational group). The collective self-understanding is therefore an abstract understanding of who the group is and what the group stands for. Importantly, it does not comprise the specific attitudes or opinions that the group holds – it can instead be thought of as the general ‘personality’ or ‘essence’ that drives the group’s behaviours.

For the most part, a group’s collective self-understanding is assessed through qualitative interviews or quantitative surveys (e.g., Calderon et al., 2000; Schwartz, 2012). These methods rely on self-reports of opinions, values, and goals by group members. However, introspection is notoriously challenging due to both the abstract nature of norms and values as well as the explicit questions being asked (Boyd et al., 2015). For example, Boyd and colleagues detail the difficulties of assessing values through quantitative surveys that aim to reduce abstract notions such as the value to ‘work hard’ or ‘be a good mother’ into pre-defined categories (e.g., the ten Schwartz values; Schwartz, 1992).

In this paper, we propose a methodological approach that uses linguistic style analysis to examine the direct enactment of a group’s collective self-understanding in naturally occurring text data. Linguistic style refers to how a message is communicated rather than what specifically the content of a message is. Style therefore relates to the types of words that may be used to communicate a message, such as pronouns, adverbs, words with more than six letters or filler words (the full list of word categories used to measure linguistic style can be found in Appendix 1). We argue that the way in which group members communicate with each other reveals significant cues as to the norms and values that the group holds and the type of group it is. Further, we suggest that by using linguistic data, we are able to circumvent issues associated with introspection and pre-defined categories. Instead, this methodology takes a much more abstract approach by disregarding the explicit content of communication (i.e., the specific aims, opinions and policies that are expressed) and instead focuses on the way in which ideas are communicated. By concentrating on the style of communication rather than the content, we argue that we are able to access more subtle cues as to the psychological reality of group members (Pennebaker, 2011). Moreover, in this paper we demonstrate that this approach allows us to compare the similarities and differences between a wide range of overlapping groups such as vocational, political or relational groups. By using naturally occurring linguistic data to study collective self-understanding, we also demonstrate how this approach can enable us to track potential changes or developments in collective self-understanding over extended periods of time.

Linguistic style research

Whilst the majority of linguistic research has focused on the semantics of communication, a more recent trend has shifted towards the analysis of linguistic style due to its ubiquity in all communication (Boyd & Pennebaker, 2017; Pennebaker, 2011; Pennebaker et al., 2003; Tausczik & Pennebaker, 2010). In lay terms, style refers to ‘how’ a message is communicated, as opposed to ‘what’ is being said. A message can be articulated in many different ways whilst still retaining its meaning, and thus the stylistic (non-semantic) words used to convey a particular message are thought to be integral to understanding how individuals construct their own realities (Chung & Pennebaker, 2007; Pennebaker, 2011). Based on this assertion, a plethora of research has identified the link between the way a person communicates and their individual personality (Boyd & Pennebaker, 2017; Mairesse et al., 2007; Tong et al., 2020), individual values (Boyd et al., 2015) and psychopathologies (Junghaenel et al., 2008). In addition to this, researchers have also studied the link between style and demographic factors such as gender (Newman et al., 2008) and age (Löckenhoff et al., 2008).

However, more recent research suggests that linguistic style can be used to understand individuals’ changing social realities and social group memberships. In both Cork et al. (2020) and Koschate et al. (2021), the authors find that linguistic style analysis can be used to understand group processes and identities. For example, Koschate et al. (2021) find that an individual that communicates as a parent uses a different linguistic style than when the same individual communicates as a feminist. Further, the authors also show that individuals change their style based on the group that is psychologically relevant at the time of writing, even when demographics, personality and topic of communication are controlled for. Thus, individuals appear to switch between feminist and parent communication styles based on which of these two identities is psychologically relevant to them at a given point in time. In this way, we can see the direct link between the group that someone belongs to and the way in which they communicate when that group identity is psychologically relevant.

In the present research, we aim to explore whether group-based variation in communication style directly relates to a group’s collective self-understanding. More specifically, by examining the shared communication style of group members, we aim to understand the extent to which various groups differ in their underlying collective self-understanding. This, then, allows us to relate these similarities and differences to group types (Study 1) and values (Study 2), and to examine changes in collective self-understanding over time (Study 3).

Study 1: Linguistic style reflects group type

Building on previous work from Koschate et al. (2021) and Cork et al. (2020), we are looking to understand whether similarities in group linguistic style map onto similarities in the collective self-understanding of the group and what the group stands for. In order to explore this idea, however, it is first necessary to have an understanding of which groups are perceived as having similar collective self-understandings. Previous research has used card-sorting tasks to assess the perceived similarities between different groups. In Deaux et al. (1995), the researchers found that individuals perceive there to be five different types of group. Using groups that are meaningful to individuals, such as ‘democrat’, ‘aunt’ and ‘Asian American’, Deaux and colleagues asked participants to categorise 64 different social identities based on their perceived similarity. Five different types of social identity were observed: (a)vocational identities, political identities, ethnic and religious identities, relational identities and, finally, stigmatised identities. Vocational/avocational identities included groups such as teachers, psychologists, athletes and musicians; political identities included groups such as feminists, political independents, democrats and republicans; ethnic and religious identities included groups such as Jewish, Catholic, New Yorker and Asian American; relational identities included groups such as girlfriend, brother and divorcee; and finally, stigmatised identities included groups such as homeless people, fat people, gay people and people with AIDS (see Footnote 1).

Based on the finding that individuals perceive five different types of group, here we explore whether this finding replicates using group members’ own linguistic style rather than outsiders’ (stereotypical) views of groups. That is, we examine whether a group’s collective self-understanding (in terms of the type of group they are) is reflected in a group linguistic style that shares similarities with other groups of the same type and differs from groups of a different type. In short, we hypothesise that identities from the same group type will use a more similar linguistic style than those from different group types. This hypothesis and the following methodology were preregistered at https://osf.io/jk6na/registrations.

Method

Data collection

To assess our hypothesis, we chose three identities from each of the five group types outlined by Deaux et al. (1995). This allowed us to ensure that our choice of groups was both wide ranging as well as meaningful to individuals (Deaux, 1991). In line with the research of Koschate et al. (2021) and Cork et al. (2020), we used Reddit data to collect the linguistic style behaviour of our chosen identities. We chose to use Reddit data as the Reddit platform hosts forums for a diverse range of social groups. We assessed which of the groups used in Deaux and colleagues’ original 1995 research had suitable Reddit forums from which we could collect data. We chose three diverse identities from each of the five groups where the forum had a high number of users and an active Reddit community – these choices were preregistered at https://osf.io/jk6na/registrations.

Our aim when choosing the identities to include in this analysis was to ensure that we had a broad spread of different social groups. For the vocational category, we therefore chose one white-collar vocation (r/sales), one self-employed vocation (r/entrepreneurs) and one more socially rather than economically focused vocation (r/teachers). These choices were limited by the Reddit forums that were available; for example, r/lawyers and r/doctors were both private forums and thus we could not access the data within these subreddits. For the relational groups, we were limited again by the forums available. There were no active subreddits at the time dedicated to sibling relationships specifically, and so we chose mothers (r/breakingmom), fathers (r/daddit) and a general relationship forum (r/relationships). Here, we actively chose forums that appeared diverse and were not formed explicitly around relationship problems (e.g., r/justnofamily). For the ethnic and religious category, we chose identities that covered both ethnicities and religions. The r/asianamerican subreddit was one of the most active communities that we found dedicated to fostering community amongst individuals of a particular ethnicity. r/islam and r/Christianity were two of the largest active subreddits relating directly to religious groups, although other smaller communities were also available (e.g., r/Jewish). For the political groups, again many options were available. We attempted to find three forums that were politically diverse and had large active communities. We therefore chose one forum that relates directly to a political party (r/Conservative), one collective action group (r/feminist) and one group based on shared ideology rather than a specific political party (r/Libertarian). Finally, our choice of stigmatised identities to include in this analysis was constrained largely by the forums that exist with large active communities. For Deaux and colleagues’ original 1995 analysis, the authors had a few different types of stigmatised identities: those relating to sexuality (e.g., gay or lesbian), those with stigmatised traits (e.g., old, overweight or retired), those with potential substance use issues (e.g., alcohol or smoking), and those who are underprivileged (e.g., homeless, unemployed and welfare recipients). We aimed to mirror these choices but with groups that are currently relevant and active. Thus, we chose one identity from the LGBTQ+ community (r/asktransgender), one community confronting problematic substance use (r/stopdrinking) and one underprivileged community (r/homeless). All data used in this paper can also be found at https://osf.io/jk6na/. More detail about the subreddits from which we collected our data is outlined in Table 1.

Table 1 Information pertaining to the 15 subreddits used in analysis


After receiving ethical approval from the departmental ethics board and pre-registering the methodology and hypotheses, we collected 1 year’s worth of comments from the 15 subreddits listed above. Using Google BigQuery, we collected comments that had been posted to the aforementioned forums between January 2018 and January 2019. We collected the title, text, URL and anonymous user ID of all comments.

Data preparation

Following data collection, we quantified the linguistic data using Linguistic Inquiry and Word Count 2015 software (LIWC; Pennebaker et al., 2015). LIWC uses a bag-of-words language model, so that word order is ignored. It counts the number of words classified into particular linguistic categories, for example affective words, adverbs, future tense words (see Pennebaker et al., 2015, for further detail) and computes a percentage value for each document, reflecting the proportion of a particular feature in a document. Cork et al. (2020) and Koschate et al. (2021) have demonstrated that LIWC is a suitable software for understanding group normative linguistic styles.
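The bag-of-words percentage computation described above can be illustrated with a minimal sketch. This is not the LIWC software itself: the category names and word lists in `TOY_CATEGORIES` are hypothetical stand-ins (the actual LIWC2015 dictionary is proprietary and far larger), and the tokeniser is a simplifying assumption.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary for illustration only; these are NOT the
# LIWC2015 categories or word lists.
TOY_CATEGORIES = {
    "pronoun": {"i", "you", "we", "they", "it"},
    "article": {"a", "an", "the"},
    "negation": {"no", "not", "never"},
}

def category_percentages(text, categories=TOY_CATEGORIES):
    """Return, per category, the percentage of tokens in `text` that match it.

    Word order is ignored (bag-of-words): only counts matter.
    """
    tokens = re.findall(r"[a-z']+", text.lower())  # simplistic tokeniser (assumption)
    if not tokens:
        return {name: 0.0 for name in categories}
    counts = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token in words:
                counts[name] += 1
    return {name: 100.0 * counts[name] / len(tokens) for name in categories}

# 12 tokens: 1 pronoun ("we"), 2 articles ("the" twice), 2 negations ("not", "never")
scores = category_percentages("We did not see the dog, and the dog never saw us.")
```

Because only counts are used, shuffling the words of a comment leaves its scores unchanged, which is the sense in which word order is ignored.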

For our analysis, we were interested in using only the LIWC categories that constitute linguistic style. We define style as the part-of-speech categories that are used widely across different contexts and domains regardless of topic (e.g., pronouns and articles; Schwartz et al., 2013). For this reason, we omitted all LIWC categories that refer to topical or content-based categories such as the ‘family’, ‘power’ and ‘risk’ categories. We also omitted the summary categories provided by the 2015 LIWC software that were an amalgamation of individual word categories such as ‘Clout’ and ‘Authenticity’ (Pennebaker et al., 2015). Resultantly, the textual data from each Reddit post was converted into a vector with 41 stylistic features (see Appendix 1).

Next, we excluded all comments made by self-identifying bots, all comments by authors who had deleted their accounts or had their accounts removed, all posts that had been deleted or removed, all posts that contained only a URL, and all comments with fewer than 50 words, in line with common practice in computational psycholinguistic research using LIWC software (Cork et al., 2020). We also removed all URLs from the remaining comments. Table 2 indicates how many comments remained in our dataset after the data had been cleaned.
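The exclusion rules above might be sketched as follows. The field names (`author`, `body`), the `[deleted]`/`[removed]` markers and the bot heuristic are assumptions made for illustration; the authors' actual cleaning code is in their repository and may differ.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
REMOVED_MARKERS = {"[deleted]", "[removed]"}  # assumed placeholder strings
MIN_WORDS = 50                                # threshold stated in the text

def is_self_identified_bot(comment):
    # Crude, hypothetical heuristic: the comment announces itself as a bot.
    return "i am a bot" in comment["body"].lower()

def clean_comments(comments):
    """Apply the exclusion rules, then strip URLs from the comments that remain."""
    kept = []
    for comment in comments:
        if comment["author"] in REMOVED_MARKERS or comment["body"] in REMOVED_MARKERS:
            continue  # deleted or removed account or post
        if is_self_identified_bot(comment):
            continue
        if len(comment["body"].split()) < MIN_WORDS:
            continue  # also drops URL-only posts, which contain a single token
        body = URL_PATTERN.sub("", comment["body"]).strip()
        kept.append({**comment, "body": body})
    return kept

long_text = " ".join(["word"] * 60)
sample = [
    {"author": "user_1", "body": long_text + " https://example.com/page"},
    {"author": "[deleted]", "body": long_text},
    {"author": "user_2", "body": "[removed]"},
    {"author": "user_3", "body": "https://example.com/only-a-link"},
    {"author": "helper", "body": "I am a bot. " + long_text},
    {"author": "user_4", "body": "Too short to keep."},
]
cleaned = clean_comments(sample)
```

In this toy sample only the first comment survives, with its URL stripped.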

Table 2 Data remaining after excluding low-quality comments


Analysis and results

All analyses in this section were completed using Python 3.0. The code used for this analysis can be found at https://github.com/acork25/Identity-MDS.

Creating a dissimilarity matrix

In order to quantify the similarity between the linguistic styles of groups, we calculated pairwise models using machine learning algorithms. These pairwise models involved using a binary classification task to assess whether it is possible to train models that can learn to differentiate between the linguistic style of two social groups. For this research, we replicated the approach taken by Cork et al. (2020) and used Extremely Randomised Trees (“Extra Trees”) classifiers. Extra Trees classifiers choose the best feature-threshold combination for each split from a small randomly chosen set (Geurts et al., 2006). In this way, the Extra Trees model is less likely to overfit the training data through a more efficient method of reducing variance and bias within the dataset. Furthermore, due to the randomised procedure of splitting the data, Extra Trees are less computationally expensive (Geurts et al., 2006).

Imbalanced class sizes can adversely impact a classifier’s performance, as merely classifying every post as the majority class can still achieve an apparently high accuracy. In order to deal with the imbalanced class sizes of our dataset, we undertook random under-sampling: for each pairwise comparison, both classes were reduced to the number of comments available in the smaller of the two groups. The sample size for each pairwise comparison is listed in Table 3.
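A minimal sketch of random under-sampling for one pairwise comparison is given below. The function name and the fixed seed are illustrative assumptions, not taken from the authors' code.

```python
import random

def undersample_pair(comments_a, comments_b, seed=0):
    """Randomly reduce both groups to the size of the smaller one (n per class)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    n = min(len(comments_a), len(comments_b))
    return rng.sample(comments_a, n), rng.sample(comments_b, n)

group_a = [f"a{i}" for i in range(10)]  # larger toy group
group_b = [f"b{i}" for i in range(3)]   # smaller toy group
sample_a, sample_b = undersample_pair(group_a, group_b)
```

After this step each comparison contains N = 2n comments, which matches the relationship n = N/2 given in the caption of Table 3.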

Table 3 Total sample size (N) included in each pairwise comparison (n = N/2)


In total, 105 pairwise comparisons were completed using the Extra Trees classifiers. We included all 41 linguistic style features in the analysis (see Appendix 1). We divided the data for each pairwise class into a training set and a test set; we used 50% of the data to train the model and 50% of the data to test the model. We chose a 50:50 split as we used AUCs to reflect the extent of similarities/differences between the linguistic styles of groups, rather than just as an indication of classification accuracy. To ensure accurate estimates of similarities/differences, we opted for a larger test set than is common within machine learning research.

As mentioned, we used the resultant AUCs as a measure of dissimilarity between groups. Where two social groups had a more similar linguistic style, the AUC would be closer to 0.5 (= random guess) demonstrating that the classifier was less able to distinguish between the two identities. Conversely, for social groups with particularly distinct stylistic differences, the AUC of the Extra Trees classifier would be closer to 1.0 (= perfect separation). By using the AUC output of the Extra Trees classifier in all 105 pairwise comparisons, we constructed a dissimilarity matrix that illustrated how dissimilar each group was from each other (see Fig. 1).
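The pairwise procedure can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the authors' implementation: the data are synthetic 41-feature vectors, the hyperparameters (`n_estimators=100`) are not reported in the text, and setting the diagonal to 0 is a placeholder choice the text does not specify.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def pairwise_auc(X_a, X_b, seed=0):
    """Train an Extra Trees classifier on a 50:50 split and return the test AUC."""
    X = np.vstack([X_a, X_b])
    y = np.concatenate([np.zeros(len(X_a)), np.ones(len(X_b))])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    clf = ExtraTreesClassifier(n_estimators=100, random_state=seed)  # assumed settings
    clf.fit(X_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

def dissimilarity_matrix(features_by_group, seed=0):
    """Fill a symmetric matrix with the AUC of every pair of groups."""
    names = sorted(features_by_group)
    D = np.zeros((len(names), len(names)))  # diagonal left at 0 (assumption)
    for i, j in combinations(range(len(names)), 2):
        auc = pairwise_auc(features_by_group[names[i]], features_by_group[names[j]], seed)
        D[i, j] = D[j, i] = auc
    return names, D

# Synthetic stand-ins: group_a and group_b share a distribution, group_c differs.
rng = np.random.default_rng(0)
toy_groups = {
    "group_a": rng.normal(0.0, 1.0, size=(300, 41)),
    "group_b": rng.normal(0.0, 1.0, size=(300, 41)),
    "group_c": rng.normal(1.5, 1.0, size=(300, 41)),
}
names, D = dissimilarity_matrix(toy_groups)
n_pairs = len(list(combinations(range(15), 2)))  # 15 groups give 105 pairs
```

In the toy data, the two groups drawn from the same distribution yield an AUC near chance, while the shifted group is separated almost perfectly, mirroring how the AUC is read as a dissimilarity.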

Multidimensional scaling

The next step was to run multidimensional scaling on the dissimilarity matrix to understand how the similarities between groups can be conceptualised in n-dimensional space. Specifying a Euclidean distance model, we computed and plotted the eigenvalues on a scree plot (Fig. 2), noting that two dimensions best fit the data (Kruskal & Wish, 1978).
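One way to obtain eigenvalues for a scree plot is classical (Torgerson) MDS, sketched below with NumPy. Whether the authors used exactly this variant is an assumption; the sketch is checked on a toy distance matrix built from four points on a rectangle, not on the paper's data.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical MDS: double-centre the squared distances and eigendecompose."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gram matrix of centred points
    eigvals, eigvecs = np.linalg.eigh(B)     # returned in ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending for the scree plot
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    top = np.clip(eigvals[:n_components], 0.0, None)
    coords = eigvecs[:, :n_components] * np.sqrt(top)
    return coords, eigvals

# Toy check: four corners of a 3-by-4 rectangle are exactly two-dimensional.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D_toy = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords, eigvals = classical_mds(D_toy, n_components=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Here only two eigenvalues are non-zero, so a scree plot would level off after the second dimension, which is the pattern used to justify a two-dimensional solution.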

We then plotted the two-dimensional multidimensional scaling (MDS) values on a scatterplot (Fig. 3). From Fig. 3, we can identify that our 15 identities appear to loosely cluster together into the five groups suggested by Deaux et al. (1995). That is, the identities that comprise each group type – vocational, relational, stigmatised, ethnic/religious and political – cluster relatively closely together on the two-dimensional plot. However, contrary to our predictions, the teacher identity and the homeless identity are notably close on the MDS plot. We explore potential reasons for this unanticipated result in the discussion below.

In Study 2, we explore the dimensions of the MDS plot in more detail; however, at this early stage it appears that the Y-axis represents a split on relational or empathetic characteristics, whereas the X-axis represents a split on a possible collective action focus. It is also possible that identities that fall towards the right-hand side of the plot are more advice-oriented, whereas the identities that fall towards the left of the plot are more opinion-oriented. In Study 2, we will assess how the values that these identities hold correspond to their placement on the axes.

Cluster analysis

Hierarchical clustering

In our pre-registered methodology, we outlined a cluster analysis to examine whether similarities and differences in linguistic style differentiate sufficiently between types, in line with the methodology used by Deaux et al. (1995). Although the two-dimensional solution in our data already demonstrates some clustering, we performed hierarchical agglomerative cluster analysis on the dissimilarity matrix outlined in Fig. 1 in line with the pre-registration. Agglomerative clustering starts with each group identity as a singleton cluster and merges clusters successively based on their similarity. Similarity (or distance between clusters) can be calculated in multiple different ways (see Nielsen, 2016). However, in order to test the hypothesis that groups that share a group type are more similar than those that do not, we require a similarity measure that computes within-cluster variance. For this reason, we used Ward’s method, which aims to find the pair of clusters that have the lowest increase in within-cluster variance after merging (Nielsen, 2016). Ward’s method calculates the distance between two clusters by computing the increase in the sum of squares of two clusters when merged.
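Ward's criterion can be illustrated with a short sketch. The helper `ward_merge_cost` computes the increase in the within-cluster sum of squares for a single merge, and SciPy's `linkage` applies the criterion repeatedly. The toy 2-D coordinates are an assumption made for illustration; the authors report clustering the dissimilarity matrix itself.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def ward_merge_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares if the two clusters are merged."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    diff = a.mean(axis=0) - b.mean(axis=0)
    return len(a) * len(b) / (len(a) + len(b)) * float(diff @ diff)

# Two singletons two units apart: merged centroid is (1, 0), so the sum of
# squares rises from 0 to 1 + 1 = 2.
cost = ward_merge_cost([[0.0, 0.0]], [[2.0, 0.0]])

# Toy coordinates forming three well-separated pairs (hypothetical data).
toy_points = np.array(
    [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0], [10.0, 1.0]]
)
Z = linkage(toy_points, method="ward")           # agglomerative merge history
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
```

The linkage matrix `Z` is what a dendrogram would be drawn from; cutting it at a chosen number of clusters gives the group assignments.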

In Fig. 4, we note four clusters rather than the predicted five. Interestingly, we found that the homeless identity clusters with the vocational groups rather than with the stigmatised groups. Further, we also found that the political groups in our analysis cluster closely with the ethnic and religious groups. Nevertheless, we still observe an overlap between the group types found by Deaux et al. (1995) and the group types based on group members’ own behaviour.

K-means clustering

To further explore how the groups cluster, we conducted a k-means cluster analysis. As agglomerative clustering is a greedy procedure, meaning that clusters formed early in the clustering process cannot be separated in later steps, we used k-means clustering to refine the analysis of identity proximities.

To determine the optimal value of k, we calculated the Within-Cluster-Sum of Squared Errors (WSS) for values of k from two to nine. After plotting the WSSs (Fig. 5), we found no clear threshold for the optimal number of clusters (there is no k where the WSS starts to level off). In line with the theory and as suggested by the agglomerative clustering, we chose k as 5 to assess whether the five group types suggested by Deaux et al. (1995) could be found within our data.
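The WSS computation for k from two to nine might look like the sketch below, using scikit-learn's `KMeans`, whose `inertia_` attribute is the within-cluster sum of squared distances. The random 15-row input is a placeholder for the real group-level data, and `n_init=10` and the seed are assumed settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def wss_curve(X, k_values, seed=0):
    """Return {k: within-cluster sum of squares} for each candidate k."""
    curve = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        curve[k] = model.inertia_  # sum of squared distances to nearest centroid
    return curve

# Placeholder data: 15 rows standing in for the 15 groups (synthetic, not the paper's).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(15, 2))
wss = wss_curve(X_toy, range(2, 10))  # k = 2, ..., 9
```

Plotting `wss` against k gives the elbow plot; when the curve falls smoothly with no visible bend, as reported above, the choice of k has to be guided by theory instead.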

Through using k-means cluster analysis (k = 5) on the dissimilarity matrix values in Fig. 1, we find that the five clusters do not map perfectly onto the five group types proposed by Deaux et al. (1995). Instead, we find that the first cluster consists of political and ethnic/religious groups, the second cluster is formed of relational groups, and the third cluster is formed of the vocational groups, including the homeless identity. Finally, the last two stigmatised groups form separate singleton clusters. Looking at the cluster-coloured MDS plot shown in Fig. 6, we can see that whilst there is notable overlap between the results of the agglomerative cluster analysis and the k-means analysis, the stigmatised groups do not cluster together at all. Possible reasons for this are explored in the Discussion.

Discussion

The results from Study 1 tend to suggest that groups with similar collective self-understandings also have more similar linguistic styles. More specifically, we found that when using multidimensional scaling to visually map the linguistic similarities between group identities, we were somewhat able to identify the five types of group proposed in the perception-based research of Deaux et al. (1995); thus, the three social identities from each of our five group types were similar in their linguistic style and tended to be found closer together on the MDS plot. However, when we used agglomerative and k-means cluster analysis to further test the hypothesis that similarities within-group type would be greater than between-group type, the results were more mixed. In both the agglomerative and k-means clustering, we found that the homeless identity clustered with the vocational identities. It was also evident through looking at the MDS plot in Fig. 3 that the homeless identity was particularly close to the teacher identity. Further, in both the agglomerative and the k-means cluster analysis results, the political and religious/ethnic groups clustered together as one. Finally, the results of the k-means cluster analysis suggested that the linguistic style of stigmatised identities is more unique, with both the transgender group and problematic substance use group forming their own singleton clusters. Possible explanations for these findings are discussed below.

As can be seen on the dendrogram in Fig. 4 and the k-means MDS clusters in Fig. 6, the political and ethnic/religious identities appear to form a single cluster. A likely reason for this clustering pertains to the meaning ascribed to particular ethnic and religious labels. To an outsider, or an individual who may not personally identify with the category label (e.g., the participants in Deaux and colleagues’ original 1995 study), ethnic and religious labels may be used as social labels to divide individuals into distinct groups on the basis of perceived physical and behavioural differences (McGarty, 1999). Thus, whilst an outsider may perceive similarities in groups such as ‘ethnicities’ or ‘religions’, this approach fails to comprehend what those group identities actually mean for the individuals and what the group stands for in practice. Herein lies the value of using naturally occurring behavioural data to study groups and group behaviour ‘in the wild’.

As has been argued elsewhere (Young, 2011), races, and by extension ethnicities, are inherently political by the fact that they exist. Races are often defined in antithesis to the majority group; an individual may define someone as an Asian American in order to distinguish their ethnic identity from the superordinate American identity label. Conversely, the ‘White American’ is often perceived as the typical version of an American (Danbold & Huo, 2015; Devos & Banaji, 2005), and thus does not appear to require an explicit label. In this way then, the label ‘Asian American’ is used as a political tool to emphasise the boundary between Asian and White Americans. It therefore follows that the Asian American identity is political in nature, even if this is not recognised by laypeople such as those in Deaux et al.’s (1995) study. This argument is further supported by the lack of existence of a ‘White American’ or ‘European American’ Reddit forum. It is therefore no surprise that the Asian American group appeared closer to the overtly political identities than when only social judgments of outsiders were used to understand similarities between groups. This discrepancy between Deaux et al.’s (1995) results and our own points to the crucial value of using behaviour as a direct enactment of group identities in the real world, as opposed to relying merely on the stereotypical judgement of particular groups.

Interestingly, the closeness of the feminist identity and the Muslim identity in Fig. 3 also points to the value in using linguistic style to understand groups’ collective self-understandings. Despite the ideological differences between these groups, we can see that they both communicate in similar ways. It could be argued that the similarity between religious and political identities can be explained through understanding the fervently agentic nature of both identities (Deaux et al., 1995). More specifically, both religious and political identities are involved in collective activism with the intent of improving society; both identity types aim to create a lens through which to interpret human action, and as a result, a blueprint to improve upon society. In turn, our analysis as to how these identities are enacted goes beyond merely categorising individuals as similar (Deaux et al., 1995; McGarty, 1999; McGarty et al., 2002). By using behavioural analysis to understand who groups are and what they represent, we can directly capture how individuals construct their social realities within group-based environments (Reicher, 2004).

Relatedly, we noted that the homeless identity mapped very closely to the teacher identity and was clustered with the vocational identities in both the agglomerative and the k-means clustering. One explanation for this finding pertains to the purpose of the r/homeless subreddit. For individuals posting in this subreddit, it is possible that their communication is focused more around how to earn a living and survive whilst homeless, rather than on the stigmatised nature of their identities. In this way, the action-based focus around how to make money to survive is more likely to resemble the linguistic style of vocational groups rather than the other stigmatised groups (r/stopdrinking and r/asktransgender).

In addition to this, we can also see from the results of the k-means analysis that the stigmatised identities have less of a distinctive linguistic style than the other group types. The k-means cluster analysis demonstrated that the transgender and problematic alcohol use groups formed their own singleton clusters rather than being clustered together. This is likely due to the greater variation in what it means to be part of a stigmatised group. The AUCs outlined in Fig. 1 are consistent with this explanation. The stigmatised identities have much lower AUCs when compared to all identities (including other stigmatised identities). This suggests that they are not particularly similar to any of the groups included in this analysis; the most similar group to the transgender group is the homeless group (AUC = .79), the most similar group to the problematic alcohol use group is the transgender group (AUC = .83), and the most similar group to the homeless group is sales (AUC = .74). Thus, whilst for two out of three stigmatised groups the closest group is another stigmatised group, these pairs are still not that similar, with AUCs of .79 and .83. As noted earlier, when we chose the different groups to include in our analysis, we attempted to maximise diversity. We noted that in Deaux et al.’s (1995) original analysis, the authors had three different types of stigmatised group: underprivileged, LGBTQ+, and problematic substance use. It is therefore possible that these different types of stigmatised groups each have their own linguistic profile. Future research could include more stigmatised groups to understand how they fit within this framework.
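The nearest-group reading of the AUCs can be illustrated with a minimal Python sketch. It uses only the three pairwise AUCs quoted in the paragraph above and assumes that the closest group is the one with the lowest pairwise AUC, which matches how the three values are reported. The function name `nearest_group` and the dictionary layout are illustrative choices, not part of the original analysis, and every pair not quoted in the text is simply left out rather than guessed.

```python
# Pairwise AUCs quoted in the text. All other pairs from Fig. 1 are
# deliberately omitted here rather than filled in with guesses.
reported_auc = {
    frozenset({"transgender", "homeless"}): 0.79,
    frozenset({"problematic alcohol use", "transgender"}): 0.83,
    frozenset({"homeless", "sales"}): 0.74,
}


def nearest_group(group, pairwise_auc):
    """Return (other_group, auc) for the pair containing `group` with the lowest AUC.

    Assumption: a lower AUC means the two groups are harder to tell
    apart and are therefore treated as more similar.
    """
    candidates = []
    for pair, auc in pairwise_auc.items():
        if group in pair:
            (other,) = pair - {group}  # the second member of the pair
            candidates.append((auc, other))
    if not candidates:
        raise KeyError(f"no reported pair includes {group!r}")
    auc, other = min(candidates)
    return other, auc


stigmatised = ("transgender", "problematic alcohol use", "homeless")
nearest = {g: nearest_group(g, reported_auc) for g in stigmatised}

for group, (other, auc) in nearest.items():
    print(f"{group}: closest reported group is {other} (AUC = {auc:.2f})")
```

Because only three pairs are available here, the result reflects the reported values only; with the full matrix from Fig. 1 the same function would search across all 14 other groups.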

However, one key strength of the research pertains to the diversity of the groups chosen to be included in this analysis. As outlined earlier in the Methods section, we chose three identities from each group type that embodied diverse representations of that type. For example, we included a diverse array of political groups (based on a political party, a collective action movement and an ideology), a diverse array of vocational groups (white-collar, self-employed and socially focused) and a diverse array of stigmatised groups (LGBTQ+, economically underprivileged, and confronting problematic substance use). It is therefore quite interesting that, despite the broad range of topics likely being discussed within these 15 communities, we still observe some clustering in line with our predictions. This strength also points to the value of using linguistic style, rather than linguistic content, to assess groups’ collective self-understandings.

At this point, it bears noting that in this research the context is an online public forum. It is likely that this specific context will have an impact on how individuals choose to communicate and the purpose of their communication. Having said this, whilst this specific context may indeed shape both the purpose and the style of group communication, we suggest that this is a ‘feature’ and not a ‘bug’ of this methodological approach. We do not aim to deny the role of context and purpose, but instead argue that the purpose of communication is intrinsically tied up with the purpose, values and collective self-understanding of the group. That is, if two groups communicate in similar ways because they share a common purpose (e.g., supporting others or trying to mobilise the group for action), we note that this represents the collective self-understanding of the group and is thus central to understanding the similarity between different groups online.

In the next study, we look to map more closely the way a group communicates onto the explicit values that the group holds. Whilst at present we know that the clusters represent group types in line with those perceived by participants in Deaux et al. (1995), our second study uses the linguistic content of groups’ discussions to demonstrate the link between linguistic style and group-based values.

Study 2

In order to validate the idea that group linguistic style maps to similarities in a group’s collective self-understanding, this study ascertains whether the position of each group on the MDS plot (Fig. 2) corresponds directly with the values that each group holds. Conceptually, therefore, we are aiming to explain each group’s position in multidimensional space by using the value-based content of each group’s interactions. Study 2 uses the same data as Study 1 to further understand the results of Study 1; it is therefore a secondary analysis.

Previous research has suggested that individual-level values can be reliably identified using automated text analysis (Boyd et al., 2015), and we aim to extend this finding to assess values at the group level. By comparing naturally occurring written communication, self-reported values and value-laden behaviours, Boyd et al. (2015) find strong support for language-based value-behaviour links. In fact, they note that when values are operationalised by applying linguistic analysis to naturally occurring social media posts, they predict an individual’s future behaviour better than self-reported scales do. Building on this research, Ponizovskiy et al. (2020) developed a simple and easy-to-use value lexicon that is based on Schwartz’s theory of basic values (Schwartz, 1992). Schwartz’s theory of basic values (1992) suggests that there are ten values that ‘form a quasi-circumplex structure based on the inherent conflict or compatibility between their motivational goals’ (Schwartz & Boehnke, 2004, p. 203). The values suggested by Schwartz are: benevolence, universalism, security, conformity, tradition, self-direction, stimulation, hedonism, power, and achievement. Ponizovskiy et al. (2020) find that their vocabulary-based dictionary has high reliability and a pattern of correlations between values that is consistent with the circumplex structure of values proposed by Schwartz (1992). Whilst this specific dictionary-based tool has not yet been used to assess social group values, previous research has demonstrated the benefit of using Schwartz values more generally to understand and conceptualise the values of particular social groups (e.g., Saroglou et al., 2004). In light of this, we suggest that the dictionary-based tool may be well suited to understanding the values of groups online.

We therefore use the value-based dictionary developed by Ponizovskiy et al. (2020) to assess the values of our 15 groups. We then assess whether the values of each group can be used to predict their position on the MDS plot. This analysis will help validate the idea that the way in which a group communicates corresponds with the values that are part of a group’s collective self-understanding.
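The general shape of this analysis can be sketched in Python under several loudly stated assumptions. The word lists in `TOY_LEXICON`, the three toy groups, their comments and their one-dimensional coordinates are all hypothetical placeholders invented for illustration; they are not entries from the Ponizovskiy et al. (2020) dictionary, not Reddit data, and not coordinates from Fig. 2. A plain Pearson correlation stands in for whatever model is used to relate values to MDS position.

```python
import re
from math import sqrt

# HYPOTHETICAL mini-lexicon: placeholder word lists for two of the ten
# Schwartz values, NOT taken from the Ponizovskiy et al. (2020) dictionary.
TOY_LEXICON = {
    "benevolence": {"help", "support", "care"},
    "power": {"control", "win", "authority"},
}


def value_scores(comments, lexicon):
    """Share of all tokens in `comments` that match each value's word list."""
    tokens = [t for c in comments for t in re.findall(r"[a-z']+", c.lower())]
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        value: sum(t in words for t in tokens) / total
        for value, words in lexicon.items()
    }


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# HYPOTHETICAL toy groups: invented comments and invented coordinates on
# a single MDS dimension, used only to show the mechanics.
toy_groups = {
    "group_a": (["We help and support each other"], -1.0),
    "group_b": (["Win control of the vote"], 1.0),
    "group_c": (["Care about the authority question"], 0.0),
}

scores = {name: value_scores(c, TOY_LEXICON) for name, (c, _) in toy_groups.items()}
coords = [coord for _, coord in toy_groups.values()]

# Correlate each value's per-group score with the groups' coordinates.
r_by_value = {
    value: pearson([scores[name][value] for name in toy_groups], coords)
    for value in TOY_LEXICON
}
print(r_by_value)
```

In a real application the lexicon would cover all ten values, each group's score would be computed over its full set of comments, and the relationship would be assessed separately for each MDS dimension across the 15 groups.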