Tracing the history of a discipline

Within the general frame of statistical methods for digital history, the present paper offers a representation of the life cycle of ideas (Tuzzi, 2018a) in the field of social psychology. This is done by conducting analyses of scientific publications in the pivotal North American journal, Journal of Personality and Social Psychology (JPSP).

Disciplines are not only a product of the interactive dynamics among scholars but also shape scholars’ understanding of the world. Scholars contribute to the definition of disciplines and, at the same time, are bound by the theories, techniques, and practices that are considered foundational and mainstream. The effects of this interplay are visible in scientific production: on the one hand, publications contribute to determining a discipline and its development in a given historical moment and context (Livingstone, 2003); on the other hand, they are bound by certain norms that establish what is part of the core and what is excluded. The contents of scientific production have the function of denoting a tradition (Gilbert, 1977; Erikson & Erlandson, 2014), and scientific journals can be seen as a place where the exchange of ideas and dissemination of research take place. They are a valuable source of information to trace the history of scientific debate; that is, the history of ideas in a specific field (Tuzzi, 2018b).

The analysis of scientific production to portray the life cycle of ideas constitutes a systematic way to read the history of a discipline. It offers an alternative to a more traditional historical account of what is known ex-post and constructed ad hoc to convey a consistent and consequential history (e.g., handbooks), which carries the risk of creating ceremonial or presentist histories (Hilgard et al., 1991). However, even when scientific production is analyzed directly through published works, a subjective element remains: the choice of the data source.

Digital history (Cohen et al., 2008) can be ascribed to the field of digital humanities, a domain that consists of a constellation of research fields and arises from the interaction between the humanities and digital technologies (cf. Svensson, 2010). It relies on what Moretti (2005, 2013) calls distant reading. Digital methods and tools for studying the history of a discipline have recently gained ground in the psychological sciences (Fox Lee, 2016).

Scientometrics and beyond

When collections of scientific products are employed as data sources to obtain knowledge about a field’s temporal evolution, history can be represented either through metadata and bibliometric indices or through the chronological analysis of its contents (themes, methods, and fields of application).

Bibliometric analyses (which include measuring the impact of scientific publications based on citations) proved effective to analyze science and its products (scientometrics). These studies generally aim to quantify and measure the main features and performance of publications in order to construct accurate formal representations of their trends for explanatory, evaluative, and administrative purposes (De Bellis, 2009). With particular reference to scientific publications in social psychology, Quiñones-Vidal et al. (2004) analyzed the structure of the Journal of Personality and Social Psychology on the basis of aspects related to the number of published articles, productivity, collaborations, and affiliations. The authors noted, in particular, an increase in references to milestone articles as well as changes in the structure of the articles themselves, which became longer and included more studies and bibliographical references. They also noted that the nationality of the authors was fairly diverse in this journal and included a large share of non-US authors. Haslam and Kashima (2010) examined the growth of social psychology in Asia from 1970 to 2008 in terms of the number of articles that included authors with Asian university affiliations. They highlighted the most frequent Asian countries represented in the various periods considered and noted that Asian social psychology is becoming increasingly autonomous and distinctive. van Leeuwen (2013) analyzed the characteristics (length and number of citations and references) of articles published in social psychology from 1991 to 2010 (collected from accredited databases of scientific articles). The results of the study showed the dominant position of the US in the field of social psychology and claimed an increase in the number of bibliographic references over time, mostly found in the Web of Science database.

Considering the alternative perspective that aims to exploit the analysis of contents, it is possible to read the history of a discipline in terms of the topics covered by its scientific products (contentometrics; Tuzzi, 2018b). Several initial studies have analyzed the contents of scientific publications through traditional qualitative content analysis (Tuzzi, 2003; with regard to psychological sciences: Fisch & Daniel, 1982; Doise, 1980; Christie, 1965; Higbee & Wells, 1972; Higbee et al., 1976; Smoke, 1935; Fried et al., 1973; Mark & Cook, 1976; Diamond & Morton, 1978; Lubek, 1993). More recently, several studies have instead analyzed the content of scientific production in the psychological sciences through quantitative methods ascribable to digital history. Pettit (2016) introduced a reflection on the history of psychology in the age of big data, bringing up the use of Google Books Ngram Viewer as a means of tracing cultural changes over time. Burman (2018) highlighted how publication search engines (with specific reference to PsycINFO) encompass and reflect the history of a discipline. With the same idea and focusing specifically on psychology, Flis and van Eck (2017) highlighted the structure of the psychological literature in a corpus of articles (titles and abstracts) over a period from 1950 to 1999 by observing word co-occurrences. They identified a stable structure in the literature (with the exception of psychoanalysis), which includes two main clusters: one referring to a Western, more applied tradition, and the other to an Eastern one that is experimental and physiological. With a focus on social psychology, Vala et al. (1996) identified the research topics considered mainstream in European social psychology by analyzing the contributions presented at the European Association of (Experimental) Social Psychology General Meeting in 1993. 
The authors highlighted a strong focus on intergroup processes as well as the presence of the social cognition perspective, along with an interest in some applied social topics, such as political processes. Spini et al. (2009) analyzed the keywords from five journals in different periods (1971–1972, 1985–1986, 1992–1993, the second half of 1999–first half of 2001) in order to highlight the effect of the temporal dimension in research products and showed that most empirical studies are conducted on student samples and do not include time or age variables, particularly in European mainstream publications. They also showed a prevalence of experimental studies in the laboratory. Cretchley et al. (2010) mapped the history of the Journal of Cross-Cultural Psychology by highlighting themes and concepts that emerged from analyzing word frequencies and co-occurrences. They showed a prevailing orientation towards experimental psychology, with an initial emphasis on child development and, later, on psychology and personality and increasing attentiveness to cultural characteristics (values, orientation, and acculturation). Harrod et al. (2009) analyzed the evolution of research on groups in Social Psychology Quarterly from 1975 to 2005. The results of their study showed that the scholars’ interest in groups declined from the late 1970s until the 1980s, increased during most of the 1990s, and came to a halt at the end of the 1990s and the beginning of the 2000s. They also noted that the most popular research topics were group structure and the laboratory experiment method. Green et al. (2013) analyzed the lexical similarities of articles in Psychological Review (1894–1903) to bring to light the “genres” (i.e., research traditions not necessarily formalized in schools) of American psychology, finding, together with the traditional schools, other research traditions on the topics of color vision, spatial vision, philosophy/metatheory, and emotion. Rizzoli et al. (2019) mapped the themes, processes, and methods of social psychology as depicted in the European Journal of Social Psychology publications (1971–2016) and highlighted distinctive characteristics of European social psychology according to the journal’s association (European Association of Social Psychology). They found that, although the publications reflected several of the new theoretical proposals of the time-spans considered, they did not fully reflect the variety of perspectives and methods of social psychology and that the “social” is mainly present as social issues.

Contentometrics digital history

In relation to contentometrics digital history, the most commonly used methods rely on the statistical analysis of textual data; namely, identification of distinctive keywords in terms of over- and under-representation in time-spans (Guérin-Pace et al., 2012; Spini et al., 2009), lexical-based measures of similarity (e.g., Green et al., 2013), occurrences and co-occurrences of words (e.g., Cretchley et al., 2010; Flis & van Eck, 2017), (lexical) correspondence analysis (e.g., Rizzoli, 2018; Rizzoli et al., 2019; Vala et al., 1996), topic detection and topic trends (e.g., Rizzoli, 2018; Rizzoli et al., 2019), and indices of a topic’s presence with respect to the use or non-use of certain keywords (e.g., Harrod et al., 2009).

To get an idea of the historical change in topics, the time variable (periods, time-spans, years) is observed in association with their content in these works. Except for topic detection, such studies mainly resort to measures that identify words that are both the most frequent and the most relevant in specific periods to offer an idea of the most discussed and peculiar topics but are not effective in highlighting the role of the marginal (non-frequent) or most-common (yet non-peculiar) ones. Methods for topic detection and topic extraction can partly solve this problem, as after having identified a topic—defined by a set of words that are associated with a high probability (e.g., LDA; Blei et al., 2003; cf. Sbalchiero, 2018) or by a set of words that co-occur (e.g., Reinert method; Reinert, 1983; cf., Sbalchiero, 2018; Rizzoli, 2018; Rizzoli et al., 2019)—it is possible to observe a temporal trend and then observe topics that fall into disuse or become popular. However, the abstract topics that the above-mentioned methods identify (on the basis of recurrent patterns of co-occurring words) typically blur the importance of single words—as they disregard the relevance of individual words for themselves (keywords)—and, because of their simple automatism, hardly correspond to those that a historian of the discipline would identify. In the same vein, there are keywords that can cut across a topic [understood in its statistical sense as the outcome of topic modelling (TM)] and can convey a story on their own (see Sect. “Conclusions” for further notes on the comparison between topic-based methods and our procedure). This criticality can be overcome by observing the temporal trajectory of individual words. For instance, Pettit (2016) showed the role of words through using Google Ngram Viewer. However, the same author warned against using this tool since the frequency of words over time is not a reliable index by itself. 
First, the same word can carry different meanings at different times, which can be disambiguated only by the context of use (e.g., specific journals). Second, the frequency of a single word does not provide a general mapping of a topic and its components. Defining a set of keywords that serve as proxies for themes, methods, and processes can in turn help to overcome these critical issues.

Although various possibilities for achieving a good reading of history have been explored, digital history remains an evolving and as yet unconsolidated field. In this paper, we propose a method that, in an effort to surmount the critical aspects set out so far, detects the life cycle of a set of keywords and, by mapping this set onto theories, approaches, methods, and fields of application, makes it possible to infer the history of a discipline. This contribution offers a glimpse into the history of social psychology from the viewpoint of one of the pivotal North American journals in the field (i.e., the JPSP) and a path for reading a discipline’s history, providing a vehicle for critical reflection (Apfelbaum, 1992; Danziger, 1995; Gergen, 1973) and simultaneously a means for defining its identity (Graumann, 1988). Moreover, it offers an innovative statistical learning procedure to study the history of a discipline in the era of big data (cf. Foster et al., 2021) from an empirical and systematic analysis of its products. In particular, from a functional data analysis (FDA) perspective, this approach assumes that the trajectories drawn over time by the occurrences of keywords in a mainstream journal reflect the relevance of the corresponding ideas (topics, themes) in the scientific debate and, by clustering keywords that portray similar temporal patterns, makes it possible to grasp the important latent dynamics and thus reconstruct the temporal evolution of the discipline as a whole.

Method

Corpus description and preprocessing

As an official journal of the American Psychological Association, JPSP is considered a flagship journal in the area of social psychology (Tesser, 1991, p. 349) and a mainstream source of data for tracing the history of American social psychology. In order to compile our text corpus, all the available titles, keywords, abstracts, and references (authors, year, volume, issue) from JPSP articles were downloaded from the Scopus database and were double-checked on the journal’s website to fill in possible gaps and resolve any inconsistencies (e.g., duplicate or, rarely, triplicate records). The result of this first task was a collection of 10,403 items from a time span of 57 years, from Volume 1, Issue 1 (1965) to Volume 121, Issue 6 (2021), for a total of 121 volumes and 694 issues of the journal. JPSP publishes 12 issues per year; however, the distribution of volumes and issues has been partially uneven over time. Since 1980, there have been two volumes per year; before 1980, there was occasionally more than one volume per year. The number of articles per year is 179 on average.

Items that do not include relevant content for retrieval were excluded (e.g., editorials, mastheads, errata, acknowledgements), and the resulting corpus was reduced to 10,222 articles.

With reference to the pre-processing phase, JPSP consistently uses American English spellings, which were thus maintained. To achieve a reliable wordlist of the corpus, all uppercase letters were replaced with lowercase letters, punctuation marks and blanks were considered separators between words, and numbers were retained. At the end of the tokenization phase, the corpus consisted of N = 118,326 word-tokens (total number of occurrences) and the wordlist included V = 8882 word-types (number of different forms). The basic lexical measures show a good level of redundancy that fully supports a quantitative lexical-based approach (Lebart et al., 1998; Tuzzi, 2003; Bolasco, 2013). The type/token ratio (TTR) was 7.51% and the percentage of hapaxes was 43.54%.
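As an illustration only, the basic lexical measures can be reproduced with a short Python sketch. The function name `lexical_measures` and the toy titles are hypothetical, and the hapax percentage is assumed here to be computed relative to the number of word-types; the analyses reported in this paper were carried out in R.

```python
import re
from collections import Counter

def lexical_measures(texts):
    """Return (N, V, TTR %, hapax % of word-types) for a list of strings."""
    tokens = []
    for text in texts:
        # Lowercase, then treat every character that is not a letter or a
        # digit as a separator (numbers are kept as tokens).
        tokens.extend(t for t in re.split(r"[^a-z0-9]+", text.lower()) if t)
    counts = Counter(tokens)
    n_tokens = len(tokens)        # N: total number of occurrences
    n_types = len(counts)         # V: number of different forms
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return n_tokens, n_types, 100 * n_types / n_tokens, 100 * hapaxes / n_types

# Toy titles (not taken from the corpus).
titles = [
    "Attitude change and social interaction",
    "Social interaction in groups",
    "Attitude change in 2 groups",
]
n, v, ttr, hapax = lexical_measures(titles)
print(n, v, round(ttr, 2), round(hapax, 2))
```

With the corpus figures reported above, the same ratio gives 100 × 8882 / 118,326 ≈ 7.51%.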

In order to retrieve the most meaningful keywords, we decided to enrich and expand the list of word-types by including Multiword Expressions (MWEs), i.e., sequences of words that represent a meaningful compound. MWEs were identified in our titles by means of a specific information retrieval procedure that exploited part of speech (POS) regular expressions (REs), e.g., an adjective followed by a noun produces an MWE, such as in “social interaction” (Pavone, 2018). Further relevant keywords (both single words and MWEs) were identified in the corpus of titles by matching with lists retrieved from two encyclopedias of social psychology (Manstead et al., 1995; Baumeister & Vohs, 2007; cf. Sánchez-Berriel et al., 2018) and from our corpus references (that is, for recent articles, the list of keywords that is available for each article). At the end of this process, we replaced words with stems using the Porter Stemming Algorithm (Porter, 1980) to overcome some of the limitations of an analysis based on word-types (e.g., plural forms for nouns).
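The POS-pattern idea can be sketched as follows. This is a minimal illustration, not the retrieval procedure of Pavone (2018): it assumes that titles have already been POS-tagged by some external tagger, and the tag labels (`ADJ`, `NOUN`, `CONJ`), the two-word patterns, and the function name `pos_pattern_mwes` are all hypothetical choices.

```python
# Hypothetical two-word POS patterns that are allowed to form an MWE.
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}

def pos_pattern_mwes(tagged_title, patterns=PATTERNS):
    """Return the bigrams of a tagged title whose tag pair matches a pattern.

    tagged_title is a list of (word, tag) pairs in reading order.
    """
    mwes = []
    for (w1, t1), (w2, t2) in zip(tagged_title, tagged_title[1:]):
        if (t1, t2) in patterns:
            mwes.append(f"{w1} {w2}")
    return mwes

# Hand-tagged toy title: "social interaction and attitude change".
tagged = [
    ("social", "ADJ"), ("interaction", "NOUN"), ("and", "CONJ"),
    ("attitude", "NOUN"), ("change", "NOUN"),
]
mwes = pos_pattern_mwes(tagged)
print(mwes)
```

Candidate MWEs obtained in this way would still need to be filtered for meaningfulness and then stemmed.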

To transform the information embedded in our texts into textual data, we produced a contingency table that reported the occurrences of each keyword in each year. The term-document matrix (TDM) includes all keywords that occurred, on average, at least once every 5 years, i.e., meaningful stems (e.g., motiv, cognit) and stem sequences (e.g., individu differ, attitud chang) that occur in the corpus of titles at least 11 times, for a total of 1214 keywords (rows) and 57 time-points (columns). The TDM is the result of a manifold information retrieval and selection process and represents the data structure employed in the following statistical analyses.
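A keyword-by-year table of this kind can be assembled as in the following sketch, which is only an illustration under stated assumptions: the input format (one `(year, keywords)` pair per article), the function name `build_tdm`, and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def build_tdm(records, years, min_total):
    """Build a keyword-by-year count table.

    records: iterable of (year, list_of_keywords), one entry per article.
    years: ordered list of time points (the columns).
    min_total: minimum number of corpus occurrences for a keyword to be kept.
    Returns {keyword: [count for each year in `years`]}.
    """
    column = {year: j for j, year in enumerate(years)}
    rows = defaultdict(lambda: [0] * len(years))
    for year, keywords in records:
        for keyword, count in Counter(keywords).items():
            rows[keyword][column[year]] += count
    # Keep only keywords whose row sum reaches the threshold.
    return {kw: counts for kw, counts in rows.items() if sum(counts) >= min_total}

# Toy records (stems are examples from the text; the counts are invented).
records = [
    (1965, ["motiv", "cognit"]),
    (1966, ["motiv"]),
    (1966, ["motiv", "attitud chang"]),
]
tdm = build_tdm(records, years=[1965, 1966], min_total=2)
print(tdm)
```

With 57 time points, a threshold of 11 occurrences corresponds to `57 // 5`, i.e., roughly one occurrence every 5 years.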

Finding patterns: a knowledge-based system

The system that we adopt to produce a digital history of social psychology starts from the basic assumption that each time series of word occurrences (that is, each row of the TDM) portrays the life cycle of that word. Moreover, to reconstruct the micro-history of each word and identify its latent dynamics, we used an FDA approach and, within it, curve clustering. Finally, an exploratory perspective guided us in building a knowledge-based system (KBS), where human learning derives from integrating statistical learning with experts’ knowledge provided by different sources. In the first stage of the KBS, linguistic experts assisted corpus creation. Then, experts of the specific domain being investigated (here, social psychology) assisted in both the interpretation of results and decision-making, potentially enabling the learning process to culminate in a conclusive reading (or alternative readings) of the history (Trevisani & Tuzzi, 2018a, 2018b). Our procedure implements a sort of triangulation (Heath, 2015) or, in one of the possible declinations of the concept, integration between quantitative and qualitative viewpoints: the quantitative (statistical) analysis serves as an exploratory tool which, on the basis of quantitative and replicable results, offers a set of best candidate solutions to respond to the task of the study; the qualitative (expert) understanding verifies whether the output of the quantitative procedure corresponds to the qualitative expectation and, hence, chooses the solution that is most explanatory and meaningful to the specific task. Moreover, statistical analysis and qualitative judgement can alternate: the latter can intervene downstream of a quantitative analysis to evaluate which candidate solutions can be meaningfully interpreted (Sect. “Experts’ Viewpoint”), but the same expert choices can be justified and placed in a coherent frame on the basis of data-driven analyses (Sect. “Exploring the Nested Temporal Structure”). 
The statistical learning process starts from the resulting TDM and consists of five steps: (1) normalizing time trajectories of word (raw) frequencies, with the normalization chosen according to the aspects of life cycles that are considered substantive when comparing words; (2) smoothing time trajectories of word (normalized) frequencies, interpreted as functional data (FD); (3) curve clustering (CC) to group keywords that experienced a similar life cycle and to detect all important dynamics underlying the evolution of word micro-histories; (4) interpreting, by means of expert opinion, the detected dynamics and thus composing one or more readings of the evolution of the knowledge field as a whole; (5) (optional step) exploring a possible nesting between clusterings with different levels of resolution (from more coarse-grained to more fine-grained). The calculations were performed with the support of R (R Core Team, 2022) libraries fda (Ramsay et al., 2020), kml (Genolini et al., 2015), clusterCrit (Desgraupes, 2018), and clusterSim (Walesiak & Dudek, 2020), supplemented by ad hoc R code.

Normalization

A diachronic corpus is typically characterized by the following features: (i) the size of the subcorpora (number of available texts for each time-point and their size in word-tokens) potentially varying greatly over time (Fig. 1) and (ii) the large number of rare events (LNRE) property of textual data, i.e., the presence of a large number of word-types whose probability of occurring is quite low, which implies high variability in popularity (total occurrences) of individual words in the entire corpus, high asymmetry of frequency spectrum by time point, and data sparsity (many zeros and small counts in the TDM) (Fig. 2).

Fig. 1 Subcorpora dimension: for each year, number of texts (dot-line), total number of word-tokens in texts/10, sum of keyword frequencies in TDM/10 (column sum), maximum keyword frequency in TDM

Fig. 2 Keyword occurrences: trajectories by frequency class (VH = very high, H = high, L = low, VL = very low)

As the above considerations indicate, a form of normalization by time point should be regarded as necessary in order to adjust for the uneven size of subcorpora across time and hence regularize the “signal”. A further form of normalization by word might be appropriate in order to adjust for the great disparity in word popularity, thus making it possible to compare word trajectories by timing (synchrony) regardless of height (popularity). Among several alternatives for data normalization, in this study we have chosen a chi square-like transformation, which is a double normalization, i.e., both by column (time) and by row (word) of the TDM. In detail, if \(n_{ij}\) is the raw frequency of word \(i\) at time point \(j\), \(n_{i\cdot}\) is the \(i\)-th row sum, \(n_{\cdot j}\) is the \(j\)-th column sum, and \(n\) is the matrix total of the TDM, the chi-square transformation is given by \(y_{ij} = n_{ij}/(n_{i\cdot}\, n_{\cdot j}/n)\). A well-known effect of the chi-square distance, which informed our normalization, is the emphasis assigned to “categories” characterized by low frequency. The transformed frequency \(y_{ij}\) (relativized with respect to word popularity \(n_{i\cdot}\)) allows low-frequency words to emerge so much that the driving baton in subsequent smoothing and clustering passes from popular to less frequent words. This emphasis is not an issue in our case (as we deal with keywords, every word is important in itself even if it shows a low frequency); rather, it is a strength, since words with low corpus frequency (though greater than the threshold) are, if meaningful, generally peculiar, that is, words that, thanks to their distinctiveness, improve the interpretation and explanatory power of results. More specifically, when the objective is, like ours, to reconstruct a historical evolution, such influence of infrequent and peculiar words naturally leads to a division of the time period into phases and epochs. 
This normalization has, in fact, previously proved effective in grouping words with similar timings of birth/death and presence/absence along the time span considered (Trevisani & Tuzzi, 2018a, 2018b).
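The transformation can be written in a few lines of Python. The sketch below is only an illustration: the function name `chi_square_normalize` and the toy matrix are hypothetical, and it assumes that no row or column of the TDM sums to zero.

```python
def chi_square_normalize(tdm):
    """Apply y_ij = n_ij / (n_i. * n_.j / n) to a matrix of raw counts.

    tdm: list of rows (one per word), each a list of counts per time point.
    Assumes every row sum and every column sum is strictly positive.
    """
    row_sums = [sum(row) for row in tdm]
    col_sums = [sum(col) for col in zip(*tdm)]
    total = sum(row_sums)
    return [
        [n_ij / (row_sums[i] * col_sums[j] / total) for j, n_ij in enumerate(row)]
        for i, row in enumerate(tdm)
    ]

# Toy TDM: a popular word (first row) and a rare word (second row).
toy = [[30, 10],
       [1, 3]]
y = chi_square_normalize(toy)
print(y)
```

In this toy example, the rare word is over-represented at the second time point relative to its expected frequency, so its transformed value there exceeds 1 even though its raw count is small. A value of 1 everywhere would correspond to a table in which words and time points are independent.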

Smoothing

In order to capture the life cycle of words, it is natural to resort to an FDA approach, whereby the trajectory of (normalized) frequencies, \(y_i = \{y_{ij}\}\) for each word \(i\) at time points \(j = t_1, \ldots, t_T\), is viewed as a realization of an underlying continuous function \(x_i(t)\), sufficiently smooth or regular, that represents the word’s temporal evolution. To filter \(x_i(t)\) from the noisy observation \(y_i\), we adopted the basis function approach, where \(x_i(t)\) is expressed as a finite linear combination of real-valued functions \(\phi_k\), called basis functions (Ramsay & Silverman, 2005):

$$x_i(t) = \sum_{k=1}^{K} c_{ik}\,\phi_k(t), \qquad c_{ik} \in \mathbb{R},\quad K < \infty$$

Moreover, we consider B-spline bases, the most popular basis system for building spline functions that are piecewise polynomials joined smoothly. We chose splines since they are general enough to accurately approximate a large variety of smooth functions and yet restrictive enough to benefit from the simplicity of parametric estimation. Thus, they enable the recognition of continuous and regular curves, and hence more easily interpretable shapes, from the extremely bumpy (peak-and-valley) trend that is typical of word trajectories.

We perform optimal smoothing using the roughness penalty (or regularization) approach, whereby, after placing knots (points where the polynomials join) at each data point (a conventional choice for a smoothing spline), the wiggliness of the spline is controlled by penalizing its roughness according to the definition of smoothness desired. Namely, the estimate of \(x_i\) is the function minimizing the penalized residual sum of squares \(\mathrm{PENSSE}(x_i) = \mathrm{SSE}(x_i) + \lambda\, \mathrm{PEN}_r(x_i)\), where \(\mathrm{SSE}(x_i)\) is the residual sum of squares measuring the fit to the data, \(\mathrm{PEN}_r(x)\) is the penalty term measuring the roughness of a function (by the integrated squared \(r\)-th derivative over the observation interval, \(\mathrm{PEN}_r(x) = \int [D^r x(s)]^2\, ds\)), and \(\lambda\) is a smoothing parameter measuring the trade-off between the fit to the data and the roughness of \(x\). As \(\lambda \to 0\), the fitted curve approaches an interpolant to the data; as \(\lambda \to \infty\), the condition \(\mathrm{PEN}_r(x) \to 0\) (zero penalty or “hyper-smooth” \(x\)) means the fitted curve is a spline of order \(r\) (e.g., a straight line for \(r = 2\) and a cubic polynomial for \(r = 4\), the classic and acceleration penalty, respectively).

A major problem is the choice of the optimal amount of smoothing. One of the main automated selection procedures employs generalized cross validation (GCV), which is a faster approximation of cross validation (CV) and has been reported to possibly reduce the tendency of CV to under-smooth in some conditions (Hastie et al., 2009). In detail, \(\mathrm{GCV}(\lambda) = \frac{T}{(T - df(\lambda))^{2}}\, \mathrm{SSE}(\hat{x}_i)\), where \(df(\lambda)\) is the effective degrees of freedom under regularization and monotonically decreases in \(\lambda\), with maximum equal to \(K\) when \(\lambda = 0\). GCV provides a convenient approximation to leave-one-out CV for linear fitting under squared error loss. Further criteria (e.g., AIC, BIC, AICc) are commonly employed in automated smoothing selection, which is still an area of active research, and yet none uniformly outperforms the others (Lee, 2003). Automated methods are generally suitable when the number of time observations is not small (T > 20).

In practice, we select the best model by varying:

  • spline order m from 1 to 8;

  • roughness penalty order r: besides the standard r = m-2, r = 2 for m > 3, r = 1 for m > 2, r = 0 for every m (lower derivatives lead to simple equations and a piecewise constant or linear fit, while higher derivatives lead to rather complex mathematics and a very smooth fit);

  • λ over an appropriate range of values: log10λ from –6 to 9.

The minimum GCV was achieved by setting m = 4 with the \(\mathrm{PEN}_1\) roughness penalty and \(\lambda = 10^{1.5}\), whence \(df(\lambda) = 5.43\) (compare this to the fixed number of basis functions K = T + m − 2 = 59 in an un-penalized fit). Note that the second and third sub-optimal fits are coincident and practically indistinguishable from the first (in ascending order, the minimum GCV = 0.050664/0.050666/0.068789 obtained at m = 4/3/3 with r = 1/1/0 and \(df(\lambda)\) = 5.43/5.32/24.22, respectively).
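The logic of selecting λ by minimum GCV can be illustrated with a deliberately simplified stand-in. The sketch below does not use B-splines or the fda library: it swaps in a discrete (Whittaker-type) smoother that penalizes squared d-th order differences of the fitted values, which plays the role of the roughness penalty on a grid of time points. The function names, the toy trajectory, and the λ grid are hypothetical; only the GCV expression T·SSE/(T − df)² follows the text.

```python
def difference_matrix(T, d):
    """d-th order difference operator as a (T - d) x T list of lists."""
    D = [[1.0 if i == j else 0.0 for j in range(T)] for i in range(T)]
    for _ in range(d):
        D = [[D[i + 1][j] - D[i][j] for j in range(T)] for i in range(len(D) - 1)]
    return D

def invert(A):
    """Gauss-Jordan inverse with partial pivoting (adequate for small T)."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def whittaker_gcv(y, lam, d=1):
    """Fit x = (I + lam * D'D)^(-1) y and return (fit, df, GCV)."""
    T = len(y)
    D = difference_matrix(T, d)
    A = [[(1.0 if i == j else 0.0)
          + lam * sum(D[k][i] * D[k][j] for k in range(len(D)))
          for j in range(T)] for i in range(T)]
    H = invert(A)                                # hat (smoother) matrix
    fit = [sum(H[i][j] * y[j] for j in range(T)) for i in range(T)]
    sse = sum((a - b) ** 2 for a, b in zip(y, fit))
    df = sum(H[i][i] for i in range(T))          # effective degrees of freedom
    return fit, df, T * sse / (T - df) ** 2

def select_lambda(y, log10_grid, d=1):
    """Return (log10 lambda, GCV) at the minimum GCV over the grid."""
    gcv, e = min((whittaker_gcv(y, 10.0 ** e, d)[2], e) for e in log10_grid)
    return e, gcv

# Toy bumpy trajectory (invented values) and a coarse grid of log10(lambda).
y = [0.2, 0.0, 0.9, 0.4, 1.6, 1.1, 2.3, 1.5, 1.9, 0.8, 1.0, 0.1]
best_e, best_gcv = select_lambda(y, [e / 2 for e in range(-4, 9)])
print(best_e, best_gcv)
```

In this stand-in, a very small λ returns the data almost unchanged, while a very large λ with d = 1 shrinks the fit towards a constant with about one effective degree of freedom, mirroring the limiting behaviour described for the penalized spline.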

From a visual inspection, the fitted curves appear to approximate the temporal evolution of words well, without overfitting (Fig. 3).

Fig. 3 A sample of fitted curves according to increasing root mean squared residual. Fit of the optimal smoothing spline to chi square-like normalized data (red line)

Clustering

The exploratory nature of the main task—namely, learning any major latent dynamics—led us to take an unsupervised, and hence distance-based, approach for CC. The alternative, supervised or model-based approach, is typically chosen for confirmatory analyses where findings and arguments are put to trial. Moreover, we need a mostly automated procedure that is fast and relatively easy to use and understand even by non-statisticians of interdisciplinary research groups. A model-based approach is generally more demanding in terms of computing and inferential expertise than a distance-based one.

We applied a k-means algorithm for FD. It is a common choice, especially when combined with the finite basis expansion approach, where the distance between curves is approximated by using the discretely observed evaluation points of the estimated curves \(x_i(t)\). We used the L2 norm or Euclidean distance as it is the most popular metric, though an equally simple alternative would be the L1 or Manhattan distance. Such conventional distances meet our needs since they evaluate a one-to-one mapping of each pair of sequences. In fact, one of our objectives was to compare curve profiles after data transformation. Accordingly, our strategy entails first transforming the data and then seeing what this involves for clustering results by using a distance measure that can approximate the area between two curves as simply as possible. The alternative way of directly choosing a dissimilarity measure that is invariant to specific distortions of the data is not suitable here, as filtering needs to be performed on preprocessed data [see an overview of FD clustering in Jacques and Preda (2014)]. Moreover, for each potential cluster number (k from 2 to an appropriate range maximum), we re-ran the algorithm starting from 20 different initial configurations, thus generating 20 possible partitions.
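A minimal sketch of this step is given below. It is a plain Lloyd-type k-means with squared Euclidean distance on curves that are assumed to be already evaluated on a common grid of time points; it is not the kml implementation, and the function names, the seed, and the toy curves are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean (L2) distance between two evaluated curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(curves, k, rng, n_iter=100):
    """One k-means run from a random start; returns (labels, within-cluster SS)."""
    centers = [list(c) for c in rng.sample(curves, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(x, centers[c])) for x in curves]
        if new_labels == labels:      # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [x for x, l in zip(curves, labels) if l == c]
            if members:               # an emptied cluster keeps its old center
                centers[c] = [sum(v) / len(members) for v in zip(*members)]
    wss = sum(sq_dist(x, centers[l]) for x, l in zip(curves, labels))
    return labels, wss

def kmeans_restarts(curves, k, n_starts=20, seed=0):
    """Re-run k-means from n_starts random initial configurations."""
    rng = random.Random(seed)
    return [kmeans(curves, k, rng) for _ in range(n_starts)]

# Toy curves on 5 time points: three rising and three falling profiles.
curves = [
    [0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.1, 2.1, 2.9, 4.2], [0.1, 0.9, 2.0, 3.1, 3.9],
    [4.0, 3.0, 2.0, 1.0, 0.0], [4.1, 3.2, 1.9, 1.0, 0.1], [3.9, 2.8, 2.1, 0.9, 0.0],
]
runs = kmeans_restarts(curves, k=2)
labels, wss = min(runs, key=lambda run: run[1])
print(labels, wss)
```

Here the 20 runs are compared by within-cluster sum of squares only for the purpose of the example; in the procedure described in this paper, the replicates are instead assessed through a pool of quality criteria and the Rand index.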

At this step of our KBS, clustering validation was performed by using the internal information of the clustering process to evaluate the goodness of a clustering structure without reference to external information (which will be done in the next step). Internal validation can also be used to decide what the most appropriate number of clusters is in a certain application. In our research context, no “natural” or “true” clusters exist in the available data, so there is not even a “true” number of clusters. Since we only have to be reasonably confident that nothing relevant has been left unexplored, the idea is to identify a set of best candidates for cluster number by pooling the ratings from a large number of clustering quality criteria [about 50, see Desgraupes (2018) and Genolini et al. (2015)]. It is well known that one index does not fit all situations; rather, the many existing indices can be grouped into different types, each measuring a different aspect of clustering quality. Ergo, we have gathered a large basket of indices in order not to favor any single criterion, as each is equally valid in principle. These include measures of within-cluster homogeneity (e.g., Ball-Hall, Banfield-Raftery, C-index, Gap, Krzanowski-Lai, Marriott, Scott-Symons), of between-cluster separation (e.g., Rubin, Scott, Ratkowsky-Lance), and of their combination (e.g., Calinski-Harabasz, Davies-Bouldin, Dunn and its generalizations, Gamma, Hartigan, McClain, PBM, Point-Biserial, Ray-Turi, SD, Silhouette, Friedman, Xie-Beni, Tau), as well as measures of similarity between the empirical within-cluster distribution and distributional shapes, such as the Gaussian distribution (e.g., BIC, AIC, and their variants). 
In detail, we form a list of prioritized solutions for cluster number by: first, computing a cluster number ranking for each quality index; then, pooling all rankings to calculate the frequency of being ranked first (top-1), second (top-2), third (top-3), and fourth (top-4) for each cluster number; finally, retrieving an ordered set of best candidates for cluster number from an automated selection that combines the cumulative and marginal frequencies of being in the top four positions for each cluster number. The procedure essentially mimics the qualitative selection based on the “human” visual inspection of both the height and color composition of each bar corresponding to cluster number in the graphical representation (Fig. 4). Some results are recurring in our analyses: the lowest cluster numbers (2/3) are the best rated and the highest ones (of the considered range; here, 25/26) are in general frequently selected in the highest positions. This last finding, on the one hand, may reflect the lack of a defined structure and parsimonious grouping but, on the other, may be a failure due to the standard assumption underlying many quality criteria of normally distributed data and, hence, of compact and convex clusters. However, the range of more interesting solutions concerns intermediate cluster numbers. In the current application, for example, the ordered list of selected candidates is 2, 3, 4, 6, 9, 13 (after excluding 25/26). Hence, the KBS produces the best partitions corresponding to cluster numbers 4, 6, 9, 13 in order to subject them to the scrutiny of experts.
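The pooling of rankings can be sketched as follows. The index scores below are invented toy values, and the ordering rule in `order_candidates` (total top-4 count first, then top-1, top-2, and so on as tie-breakers) is only one hypothetical way of combining cumulative and marginal frequencies, since the exact automated rule is not spelled out here.

```python
from collections import defaultdict

def pool_rankings(scores, higher_is_better, top=4):
    """Count how often each cluster number is ranked 1st, ..., `top`-th.

    scores: {index_name: {cluster_number: value}}.
    higher_is_better: {index_name: True if larger values indicate better clustering}.
    Returns {cluster_number: [count top-1, ..., count top-`top`]}.
    """
    freq = defaultdict(lambda: [0] * top)
    for name, by_k in scores.items():
        ranked = sorted(by_k, key=by_k.get, reverse=higher_is_better[name])
        for position, k in enumerate(ranked[:top]):
            freq[k][position] += 1
    return dict(freq)

def order_candidates(freq):
    """Hypothetical rule: cumulative top-4 count, then marginal counts in order."""
    return sorted(freq, key=lambda k: (sum(freq[k]), *freq[k]), reverse=True)

# Invented scores for three indices and cluster numbers 2 to 6.
scores = {
    "silhouette":        {2: 0.60, 3: 0.50, 4: 0.55, 5: 0.30, 6: 0.45},
    "calinski_harabasz": {2: 120, 3: 150, 4: 140, 5: 90, 6: 100},
    "davies_bouldin":    {2: 0.8, 3: 0.9, 4: 0.7, 5: 1.4, 6: 1.0},
}
higher_is_better = {"silhouette": True, "calinski_harabasz": True, "davies_bouldin": False}

freq = pool_rankings(scores, higher_is_better)
candidates = order_candidates(freq)
print(freq)
print(candidates)
```

The dictionary `freq` corresponds to the heights of the stacked bar segments in a chart such as Fig. 4, one bar per cluster number.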

Fig. 4

Frequency of being ranked first (top-1), second (top-2), third (top-3) and fourth (top-4) for each cluster number

In order to choose the best partition among the 20 replicates for each cluster-number candidate, we used the Rand index. In our pooling-based selection, multiple criteria place the candidate in the top-1 or, in general, in the top positions. However, the score that each criterion assigns to the candidate is associated with the specific partition (out of 20) over which it is maximized. Therefore, once all the distinct partitions resulting from the multiple criteria supporting the candidate have been extracted, the partition that maximizes the average Rand index (Rand, 1971) of agreement with all the others is chosen as the best partition for the cluster-number candidate (namely, the partition that best mediates between all the other compared partitions). Moreover, we use a generalization of the standard Rand index, a “multiple” Rand index (Trevisani, 2018), which provides a measure of concordance between multiple partitions and can be computed at several levels (the overall partition, single clusters, and individual words), thus offering a measure of stability of clustering results at each of these levels [e.g., the multiple Rand index of the best partition for cluster number 6 is 0.95 (Fig. 5), and it is 0.98 for cluster A of the same partition (Fig. 6)]. In particular, the multiple Rand index for single words tells us how consistently a specific word is grouped with (or separated from) other words across different partitions, hence highlighting which words are “regular” members and which are “odd” within a cluster.
In the lists included in the figures, words are ordered by both their popularity and their individual multiple Rand index (both from highest to lowest) to guide interpretation (words are transcribed on cluster graphs up to a maximum of 60 for the sake of readability; however, a special place is invariably dedicated to multi-words, commonly in the rightmost column of the list, as they are generally not among the most popular words but are useful to enhance the interpretability of word groups).
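
As an illustration of the selection of the best partition, the following sketch computes the standard pairwise Rand index (Rand, 1971) and picks the partition with the highest average agreement with all the others. It does not implement the “multiple” Rand index of Trevisani (2018); the function names and the toy partitions are hypothetical.

```python
from itertools import combinations


def rand_index(a, b):
    """Standard Rand index: share of item pairs on which two partitions agree.

    a, b: sequences of cluster labels, one label per word, same length.
    """
    pairs = list(combinations(range(len(a)), 2))
    # A pair agrees if it is together in both partitions or apart in both.
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def best_partition(partitions):
    """Index of the partition with the highest mean Rand index vs. the others."""
    def mean_agreement(i):
        others = [p for j, p in enumerate(partitions) if j != i]
        return sum(rand_index(partitions[i], p) for p in others) / len(others)
    return max(range(len(partitions)), key=mean_agreement)


# Three toy partitions of six words (invented labels, for illustration only).
toy_partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
print(best_partition(toy_partitions))
```

In this toy case the first partition is selected, because it lies “between” the other two, which is the mediating role described above.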

Fig. 5

Clustering cl-6 all six groups: keyword curves and cluster mean curve

Fig. 6

Clustering cl-6 individual clusters: most relevant keywords, keyword curves and cluster mean curve

Experts’ viewpoint

At this point, we submitted the clustering results, namely the overall best partition and the single clusters for each cluster-number candidate (e.g., Figs. 5, 6 for cluster number 6), to subject matter experts. In order to ease the comparison between different groupings (see section below), we may also provide indices of agreement and graphs of set overlaps between different clusterings. Experts interpreted the content and latent meaning of word groups in order to identify topics, methods, and research areas, and hence possibly recognize an evolution of the field as a whole from the temporal phases and dynamics of the groups. While interpreting, experts can formulate new research questions that may lead to further insights. If they come up with several convincing readings, the experts can opt for one or more historical narratives (see section below for a special case).

Exploring the nested temporal structure

The set of best candidates for cluster number generates clusterings ranging from more coarse-grained (lower resolution) to more fine-grained (higher resolution). These competing solutions can be non-conflicting if a nesting relationship (or nearly so) exists between them. Moreover, since the temporal connotation is a key feature of this study, we investigated whether a fine-grained clustering can unfold a more modulated scanning of the time phases and dynamics uncovered by a coarse-grained clustering. From the perspective of analyzing the dynamic behavior of a clustering, we expected a split of the dynamics of each coarse-grained cluster into coherent fine-grained sub-cluster dynamics, plus a possible linking of coarse-grained dynamics through fine-grained clusters of transition. The following procedure is again eminently explorative and consists of subsequent steps of (i) nesting analysis, (ii) temporal reordering (supervised by expert opinion), and (iii) final modulation of coarse-grained clustering dynamics into nested clusters coupled with transitional clusters of fine-grained clustering (Fig. 9).

Let’s assume that the experts find convincing a set of clusterings corresponding to cluster numbers k1 < k2 < … (in our example, the set is 6, 9, 13). Let’s use the short name cl-k for a k-cluster partition henceforth. We take cl-k1 as the coarse-grained clustering, with its clusters chronologically ordered from the cluster of words that have disappeared to the cluster of emerging words in the period examined (Fig. 6), thus identifying the successive basic patterns in the field evolution (‘A’ = “steeply decreasing”, ‘B’ = “stable till mid-’80s then decreasing”, ‘C’ = “culminant period ’80s–’90s”, ‘D’ = “culminant 2000 then stable”, ‘E’ = “stable trend”, and ‘F’ = “increasing from 2000”). For the sake of completeness, we used the “peak chronology”, i.e., we considered the time combined with the height of the highest peak (where the trend culminates) of each cluster to establish the chronological order.
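
A peak chronology of this kind could be sketched as follows. How time and peak height are combined is not detailed in this section, so the tie-breaking rule below (earlier peak first, then higher peak) is an assumption, and the cluster labels, years, and mean curves are invented for illustration.

```python
def peak_chronology(mean_curves, years):
    """Order cluster labels by the year at which their mean curve peaks.

    mean_curves: {label: list of mean frequencies, one per time point}
    years: list of time points aligned with the curves.
    ASSUMPTION: ties on the peak year are broken in favor of the higher peak.
    """
    def key(label):
        curve = mean_curves[label]
        peak = max(range(len(curve)), key=curve.__getitem__)
        return (years[peak], -curve[peak])
    return sorted(mean_curves, key=key)


# Toy cluster mean curves (hypothetical labels and values).
toy_years = [1965, 1975, 1985, 1995, 2005, 2015]
toy_curves = {
    "rising": [0.00, 0.10, 0.10, 0.20, 0.40, 0.70],
    "fading": [0.90, 0.50, 0.20, 0.10, 0.05, 0.00],
    "mid_peak": [0.20, 0.30, 0.60, 0.50, 0.30, 0.20],
}
print(peak_chronology(toy_curves, toy_years))
```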

The initial step (i) consists of reordering the clusters of the fine-grained clusterings by a nesting analysis with the chronology of cl-k1 as reference. This step may need several iterations until the final nesting structure generates a split that reproduces as closely as possible, up to the highest levels of resolution, the reference chronological order. In practice, in the final arrangement, the order of fine-grained clusters produces a gradual changeover from one basic pattern to the subsequent one (Fig. 7). The reading of the nesting structure is twofold: it shows either how a low-resolution clustering splits into a higher-resolution clustering or how a high-resolution clustering composes a lower-resolution clustering. Either way, the information retrieved is the same. A mosaic plot can provide a visualization of either reading. For instance, if we represent the distribution of cl-k1 conditional on cl-k with k > k1 (Fig. 7), we see the split of cl-k1 into cl-k. Nesting of a cl-k cluster is perfect or partial according to whether its rectangle is 100% or less than 100% one color (note that there are k1 different colors in this reading). Clusters with partial nesting may be “transitional”, i.e., they can represent the changeover from one basic pattern to the subsequent one. Moreover, it may happen that the reference chronological order is not exactly reproduced at higher levels of resolution; this can reasonably occur for basic patterns without a clear temporal trend. For example, the distribution of cl-6 conditional on cl-13 (right panel of Fig. 7) shows that nesting is perfect for A, C, E, I, J, M (and nearly so for H, L), while a transition may occur through B. Also note the transversal behavior of pattern ‘E’ = “stable trend” (blue color), which fills non-sequential high-resolution clusters.
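
The conditional distribution underlying such a mosaic plot amounts to a normalized cross-tabulation of the two partitions. The sketch below is a minimal illustration, assuming each word carries one coarse label and one fine label; the function names and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def nesting_table(coarse, fine):
    """Distribution of coarse labels conditional on each fine cluster.

    coarse, fine: label sequences aligned word by word.
    Returns {fine_label: {coarse_label: share}}, with shares summing to 1.
    """
    counts = defaultdict(Counter)
    for c, f in zip(coarse, fine):
        counts[f][c] += 1
    return {
        f: {c: n / sum(row.values()) for c, n in row.items()}
        for f, row in counts.items()
    }


def perfectly_nested(table):
    """Fine clusters whose words all fall into a single coarse pattern."""
    return sorted(f for f, row in table.items() if len(row) == 1)


# Toy labels for eight words (invented, for illustration only).
toy_coarse = ["A", "A", "A", "B", "B", "B", "B", "C"]
toy_fine = ["a", "a", "b", "b", "b", "c", "c", "d"]

toy_table = nesting_table(toy_coarse, toy_fine)
print(toy_table)
print(perfectly_nested(toy_table))
```

Here fine cluster `b` is only partially nested (one third of its words in ‘A’, two thirds in ‘B’), which corresponds to a two-colored rectangle in the mosaic plot.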

Fig. 7

Nesting of fine-grained clusterings into coarse-grained clustering: cl-9 (left) and cl-13 (right) into cl-6 (in reading both conditional distributions cl-6|cl-9 and cl-6|cl-13, patterns ‘A’, ‘B’, ‘C’, ‘D’, ‘E’ and ‘F’ correspond to colors red, yellow, green, turquoise, blue and purple as in Fig. 6)

Next, step (ii) involves the intervention of experts and is an in-depth analysis of the nesting structure for each basic pattern separately in order to examine whether finer groupings provide temporal sub-phases or sub-clusters of themes. This time, mosaic plots represent the distribution of cl-k conditional on cl-k1 to show how cl-k clusters compose cl-k1 basic patterns. As an example, we show the composition of each basic pattern by both cl-9 and cl-13 (Fig. 8; note that there are 13 different colors in this alternative reading). In order, ‘A’ is composed of A and B of both cl-9 and cl-13, which divide words into those that ceased more and less steeply (Appendix 1 and 2); ‘B’ is composed of B/C/D of both cl-9 and cl-13; C and D are almost coincident in cl-9 and cl-13; C characterizes the words of the late ’70s and early ’80s; ‘C’ is composed of E and F of cl-9, E/F/G of cl-13; E (almost coincident) and F of both cl-9 and cl-13 characterize the ’80s and ’90s respectively; ‘D’ is composed of F and G of cl-9, essentially H and I (a bit F and K) of cl-13; ‘E’ is composed of D and H of cl-9, (a bit D,) G and J of cl-13; and ‘F’ splits into G and I of cl-9, K/L/M of cl-13. Transitional finer clusters and transversal coarser clusters may emerge (e.g., cl-13 cluster B is clearly transitional between ‘A’ and ‘B’, while F might be between ‘C’ and ‘D’; pattern ‘E’ is clearly transversal as it fills non-sequential finer clusters).

Fig. 8

Composition of basic patterns by fine-grained clusters

The conclusive step (iii) is the reconstruction of a “historical panorama”, which shows the sequence of clusters of a fine-grained clustering (panorama frames) ordered according to the temporal phases or basic patterns (background scenes) of the coarse-grained clustering (Fig. 9 shows the sequence of cl-13 clusters, each one associated with some reference pattern of cl-6). Transitional clusters are recognized by two criteria: either (i) in the “direct” nesting (of the finer grouping into the coarser one), the finer cluster is nested into more than one basic pattern with a small difference between the portions of nesting (we set this difference to be less than or equal to 15%), or (ii) the “indirect” nesting (as derived from the full nesting structure across all the cluster numbers > k1) indicates a pattern different from the one derived directly (in our example, we find from indirect nesting that B, F, L and K are transitional cl-13 clusters for the ‘A’/‘B’, ‘C’/‘D’ and, the last two, ‘D’/‘F’ reference patterns, respectively).
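
Criterion (i) can be illustrated with a short check on the nesting shares of a single finer cluster. The sketch assumes that the “difference between the portions of nesting” is the gap between the two largest shares; this reading, the function name, and the toy shares are assumptions. Criterion (ii), which requires the full nesting structure, is not covered.

```python
def is_transitional(shares, max_gap=0.15):
    """Direct-nesting check for one finer cluster (criterion (i) only).

    shares: {coarse_pattern: share of the finer cluster nested into it}
    ASSUMPTION: the cluster is transitional when it spans more than one
    pattern and its two largest shares differ by at most max_gap.
    """
    ranked = sorted(shares.values(), reverse=True)
    return len(ranked) > 1 and ranked[0] - ranked[1] <= max_gap


# Toy nesting shares (invented values, for illustration only).
print(is_transitional({"A": 0.55, "B": 0.45}))  # nearly even split
print(is_transitional({"A": 0.90, "B": 0.10}))  # clearly nested into 'A'
print(is_transitional({"C": 1.00}))             # perfectly nested
```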

Fig. 9

Historical ‘panorama’: final modulation of cl-6 basic patterns (panorama ‘scenes’) by cl-13 clusters (panorama ‘frames’ nested or transitional into/between scenes)

Finally, within the associated pattern, the finer clusters can be further reordered: first, according to the “substantive chronology” dictated by expert opinion [conclusive interpretation of step (ii)] and, second, according to the peak chronology dictated by the shape of the curves, so as to respect time evolution more strictly. In our study, for instance, the sequence of cl-13 clusters does not exactly follow the alphabetical order set at the initial stage of the procedure; C and D, both nested into pattern ‘B’ of cl-6, are inverted because the peak is more recent for C. The same occurs for K and L, whereas G shifts as it forms, together with J (U-like trend), the basic pattern ‘E’ of “evergreen” words.

Results

With reference to the number of clusters and the resulting granularity, the solution chosen was a combination of both the statistical learning procedure and the expertise of the scholars involved. However, as we will show below, the choice of a larger or smaller number of clusters (among the best partitions) does not affect the goodness of the result but only its level of granularity (cf. Appendix 3).

Among the best candidates in terms of the cluster numbers suggested by the criteria (Fig. 4), and taking into account the nesting analysis (Sect. “Exploring the Nested Temporal Structure”), partitions with 6, 9, and 13 clusters are good solutions in terms of the readability of the results.

In the coarser solution with six clusters (cl-6; Fig. 5), cluster A (8.6% of the keywords; Fig. 6) includes keywords that were prominent in the early years of the journal and then almost disappeared (pattern ‘A’, namely, steeply decreasing). These include words that fell into disuse and those that mainly pertain to the first studies on social influence. They concern group decision processes, attitude and opinion change, obedience to authority (e.g., group discuss, conform, complianc, attitude chang, opinion chang, authoritarian), neo-behaviorism (e.g., reinforc), cognitive dissonance theory (disson), as well as game theories. The word “negro” also appears in this cluster as, over time, it has been replaced by terms more consistent with a modern conception of politically correct language. The word “subject” experienced a similar decreasing trend since it was dropped after an outcry in the US from students (who were often the “subjects” of studies). The fourth edition of the APA manual (1994) stated that, when human beings are mentioned, it is preferable to use “participant” instead of “subject.” This change has persisted: we expected a resumption in the use of the term, given that the sixth edition of the same manual (2010) again accepted the word “subject” in psychological scientific publications, but this did not occur. With respect to methods, this cluster contains keywords related to interviewing and field experiments (intervie, field experi), which are no longer present in the publications of this journal. This cluster is split into A and B of both the 9-cluster (cl-9; Appendix 1) and 13-cluster (cl-13; Appendix 2) solutions, as shown in Fig. 8. The finer-grained clusters A and B of both solutions divide words with decreasing trends into more (cf. A, nested into pattern ‘A’, in Fig. 9) or less (ibid., cf. B, transitional between ‘A’ and ‘B’, i.e., referred to pattern ‘A/B’) steeply descending patterns.
For example, keywords that are today considered “politically incorrect” scientific terms, as well as those that refer to studies on conditioning, which were highly used and then fell into disuse, are included in cluster A (of both cl-9 and cl-13), which has a steeply decreasing trend. By contrast, words relating to obedience to authority and cognitive dissonance fall into cluster B (of both cl-9 and cl-13), which shows a less pronounced decreasing pattern without a final drop-out.

Cluster B of cl-6 (22.2% of the keywords; Fig. 6) includes keywords with an approximately stable trend until the mid-1980s, when it begins to slowly decrease (pattern ‘B’, in short, stable till mid-’80s then decreasing). The keywords it contains mainly refer to studies on attitude and behavior (behavior, attitud), aggression (aggress), and to the specific set-up of such studies, referring to the effects on participants subjected to stimuli (e.g., exposur, rate, induc) or assigned tasks to be completed. Other keywords concern socio-psychological processes and theories, such as “causal attribution” and “locus of control” (causal attribut, locu of control). It also contains a number of words referring to methods (e.g., variabl, hypothesi, mediat) and the discipline itself (e.g., social psycholog), probably characteristic of the reflections arising from the so-called “crisis of social psychology” (cf. Gergen, 1973; Harré & Secord, 1972; McGuire, 1973). This coarser cluster is split into the finer-grained B, C and D in both cl-9 and cl-13 solutions (Fig. 8). These three clusters have a consistent slight downward pattern but are distinguished by different peaks in popularity. B (of both cl-9 and cl-13), as mentioned above, has a peak in the early years of the journal’s life, D is flatter and, lastly, C has a peak in the late ’70s–early ’80s (Fig. 9). Note that C and D are almost coincident in cl-9 and cl-13 (cf. also Appendix 3).

Cluster C of cl-6 (19.2% of the keywords; Fig. 6) groups keywords related to processes and methods adopted mainly in the ’80s and ’90s (pattern ‘C’, namely, culminant period ’80s–’90s), especially focusing on the structure of the individual attitude, decision making, and personality. For example, in this cluster, there are words referring to memory, decision-making, social judgment, and social comparison (judgment, memori, decis make, social judgment, social comparison). It contains words related to the structure of personality and individual differences (e.g., individu, individu differ, trait). References to methods are mainly linked to self-report measures and measurement scales of constructs (e.g., measur, scale, dimens, self report). In finer-grained solutions, this cluster is split into E and F of cl-9, and into E, F, and G of cl-13 (Fig. 8). It is worth noting that clusters of finer partitions are still characterized by peaks placed in the middle years of the time frame, but keyword popularity relates to narrower periods (cf. Appendix 3). For example, cluster E of cl-9 has a slight peak in the late ’80s (immediately following C of cl-9), whereas F has a peak in the ’90s. These clusters contain keywords that recall theories and processes that were popular in the late ’80s (e.g., self awar, self conscious) and ’90s (e.g., memori, emotion, mood). This confirms that choosing either a coarser or a finer partition (among the best ones) only changes the level of detail of the results. Cluster E of cl-13 shows a peak in the ’80s (cf. E, nested into pattern ‘C’, in Fig. 9) and, similarly to the cluster (of cl-9) described earlier with the same time trend, contains keywords such as “self-awareness” (self awar) or “self-efficacy” (self efficaci). Cluster F of cl-13, similarly to F of cl-9, peaked in the ’90s (ibid., cf. F, transitional between ‘C’ and ‘D’) and includes keywords such as “mood” or “emotion” as well.
Cluster G of cl-13 collects keywords with a stable trend, such as measur, referring to the measurement of psychological attributes, which remains a milestone in mainstream social psychology today (cf. Sijtsma & van der Ark, 2020); that is, it contains “evergreen” words and is therefore subsumed by the nesting procedure under another pattern (see below; ibid., cf. G, nested into pattern ‘E’).

Cluster D of cl-6 (19.3% of the keywords; Fig. 6) groups keywords with an increasing trend culminating in the 2000s and becoming stable afterwards (pattern ‘D’, in short, culminant 2000 then stable). It includes references to social cognition (e.g., social cognit, stereotyp, bia), that is, a perspective (and related processes) that became dominant in social psychology especially in those years (considering the necessary delay of scientific publications). Some keywords refer to key themes in social psychology, such as self and identity, emotions, and intergroup relations, as well as theories including the theory of social identity, peculiar to so-called European social psychology (e.g., self, ident, emot, social identit, intergroup). Moreover, it contains keywords referring to fields of application (e.g., health, work). This cluster is split into clusters F and G of cl-9, and into H and I (also F, K and L but just for a minor part) of cl-13 (Fig. 8). As before, the trend remains similar in the clusters of the finer-grained partitions but with peaks (more or less marked) in more specific and consecutive periods (cf. F, transitional between ‘C’ and ‘D’, H and I, both nested into pattern ‘D’, K and L, both transitional between ‘D’ and ‘F’, shown in Fig. 9).

Cluster E of cl-6 (19.1% of keywords; Fig. 6) includes words that show a nearly stable trend over time (i.e., evergreen, constituting pattern ‘E’). Some of these keywords seem to be general words related to the description (e.g., approach) or conduct of studies (i.e., related to the method of analysis), such as experiment (experi), control (control), prediction (predict), and correlation (correl). Other keywords refer to groups (group, group member) and their dynamics, such as influence (influenc) and conflict (conflict), which is a distinctive topic of social psychology. This cluster is split into D (for a minor part) and H of cl-9 (Fig. 8), which both have a temporal trend that never vanishes, but the former represents keywords slightly more popular in the first years considered (as already described), whilst the latter pertains to the last period. We have already described some keywords of cluster D of cl-9 that remain stable over time but were used a little more in the first part of the time span, such as “effect.” By contrast, cluster H of cl-9 contains keywords with the opposite tendency, such as experiment motiv, relationship, chang. It should always be considered that the decreasing temporal pattern assigned to a keyword can mean either that the topics (methods, themes, application domains) it refers to are less frequently dealt with or that the stylistic choices adopted to address the topics in the scientific literature have changed over time. For example, the keyword “effect” has not fallen into disuse (its presence over time never ceased completely), but it might be less used today when drafting a paper title as it does not represent any novelty. The same reasoning, reversed, applies to keywords with an increasing temporal pattern. With respect to cl-13, cluster E is split into D (for a minor part), G, and J (Fig. 8). The solution that envisages 13 clusters manages to capture interesting extra nuances (cf. Appendix 3): a cluster containing keywords slightly more popular in the early years (cf. D, nested into pattern ‘B’, in Fig. 9) is maintained; beyond that, there is a cluster of evergreen words (without peaks; ibid., cf. G, nested into pattern ‘E’). Moreover, a third cluster with a slightly U-shaped trend (ibid., J, nested into pattern ‘E’) emerged and includes keywords that have come back into fashion. In coarser-grained clusterings, the same keywords are placed into clusters with increasing, decreasing (e.g., in cl-9), or stable trends when the shape of their life cycle is less marked (e.g., in cl-6). Cluster J of cl-13 contains, for example, themes that have returned to be addressed more consistently in recent years, such as inequality, power, racism, and intergroup contact (e.g., power, social class, racial, race, interraci, contact). A further example is given by keywords referring to replicability (replic) and academic success (academ achiev). Given the periods of popularity of these keywords, they might refer to the so-called “crises” that at least twice affected social psychology at different times: the aforementioned general crisis of social psychology (’70s) and the current one (which started in 2010), called the “replication crisis”, which has several assumptions in common with the previous one (cf. Mülberger, 2018; Stam, 2018).

Cluster F of cl-6 (11.6% of the keywords; Fig. 6) includes keywords that experienced an increasing trend: they become popular from the 2000s onwards (pattern ‘F’, in short, increasing from 2000). These keywords recall theories, such as the regulatory focus theory (regulatori), studies on personality (e.g., big five, person trait, narciss), culture (cultur variat, cultur differ), and intergroup processes (e.g., outgroup). Moreover, it contains references to recent popular themes, such as romantic relationships, well-being, life satisfaction (romant relationship, romant partner, well be, life satisfact), and some socially relevant issues, such as sexism and religion (e.g., sexism, religi). This cluster also includes method-related keywords, such as “implicit measures”, “longitudinal studies”, and “meta-analyses” (implicit, longitudin, meta analysi). Cluster F is split mainly into G (for a minor part) and I of cl-9, and into K, L, and M of cl-13 (Fig. 8). As already shown above, increasing the level of granularity yields more detailed results. In cl-9, cluster G mirrors an increasing pattern, but it culminates in the 2000s when it reaches a plateau, whereas cluster I shows a peak in the last years considered and contains keywords already pointed out in the cl-6 context (e.g., big five, narciss, person trait, romant relationship, life satisfact, well be, meta analysi, sexism, cultur variat). Clusters of cl-13 offer further details by identifying three different patterns with increasing trends (cf. Appendix 3): L shares a similar trend and content with cluster G of cl-9 (although the peak is more pronounced at the beginning and then decreases slightly; cf. L, transitional between ‘D’ and ‘F’, in Fig. 9) and differs from K, which has a less pronounced peak and whose keyword frequency does not decrease even in the most recent years (ibid., cf. K, pattern ‘D/F’). Cluster M portrays a clearly increasing pattern since the 2000s (ibid., cf. M, nested into pattern ‘F’), thus clearly highlighting what can be considered “hot topics” (or methods) of the very last years considered in this analysis, e.g., life satisfaction (life satisfact), meta-analysis (meta analysi), narcissism (narciss), and sexism.

Conclusions

This work aimed at reading the history of social psychology as mirrored by the titles of articles published in a mainstream journal. It is grounded in the assumption that the time series of keywords’ occurrences are an effective way to trace the life cycle of the main contents of a journal and can highlight which were the core themes of a discipline in the past and which are today (in terms of methods, theories, and application domains). To represent and analyze the latent dynamics of the micro-histories of a large set of keywords, we used an FDA approach and a KBS that combines statistical learning with experts’ knowledge. The flexibility of the procedure proposed in this article, which exploits both qualitative and quantitative information, represents a strength of the method.

A further aspect worth highlighting is the readability and interpretability of the results obtained. The method achieved interesting and useful results for an overall representation of the history of social psychology. Surely, for careful observers of the historical development of the discipline and for regular readers of the journal, most of these results do not come as a surprise. It is well known that some themes have fallen into disuse, others have grown over time, some have remained constantly present, and others have experienced ups and downs in terms of diffusion and popularity. However, it is worth remembering that, beyond common sense, being able to grasp phenomena, measure them, and effectively plot their diachronic development represents a step forward for the methods of digital history (and it is the true role of statistical learning).

Our process to reconstruct the life cycle of ideas in the field of social psychology led to clusters of keywords that have experienced a similar temporal evolution but could be ascribed to different topics. From this viewpoint, the method is markedly different from more traditional methods for TM, whose primary objective is identifying latent topics by detecting recurrent words in documents. Temporal topic analysis, in turn, extends basic TM to address topic changes in a temporally evolving document corpus. The temporal dimension can either be added in a refinement step (e.g., Griffiths & Steyvers, 2004) or be included in the data representation (e.g., Blei & Lafferty, 2006; Roberts et al., 2016; Wang & McCallum, 2006). Differences between our KBS and TM approaches are noticeable: our approach leads to clusters that represent latent dynamics (the average trend of words sharing a similar temporal pattern) and that can be multi-topic (individual keywords or word subsets of the cluster can be ascribed to different ideas and topics); TM produces (soft) clusters representing topics (sets of words that are semantically consistent with the treatment of a theme because they co-occur in documents) and possibly reconstructs the temporal development of each topic (thus as a whole), either completely ignoring (standard TM) or sacrificing (temporal approach) the individual evolution of single words. In conclusion, our approach shares similar aims with temporal TM to some extent, but is preferable when (key)words are themselves carriers of ideas, and hence their own life cycle is important per se, and, in general, when topic categorization is in the end a superstructure more cumbersome than useful (in our opinion, the implementation of theoretical models still leads to limited results). TM tends to cage words in an architecture that, if not static, changes too clumsily to grasp the individual destinies and changes of words.
In this last regard, keywords that change their meaning or aspect over time, and so cross different topics, would likely be overlooked and missed by TM.

A noteworthy result of applying this method is the possibility of bringing out, as a consequence of expert reading of the findings, historical issues (such as crises in the discipline or debates over the use of terms) beyond the life cycle of themes, methods, and processes (see Appendix 3). This method thus makes it possible to grasp important historical issues, even if they are marginal in the journal, thanks to the process of keyword selection (and the normalization choice), favoring a more comprehensive reading of history.

The most challenging aspect of a statistical procedure that aims to identify clusters is to decide the number of groups (in our case, groups of keywords that experienced a similar temporal development in terms of presence, absence and frequency) without having any a priori information on the “appropriate” (true, best, expected or desirable) cluster number. In our procedure, the process to select the most appropriate number of clusters is highly articulated and based on pooling a plurality of criteria, but the decision was ultimately taken in agreement with the essential intervention of experts. The most interesting element of this work is that it manages to effectively show that, once sensible and well-founded clusters have been found, the choice between coarser and finer clustering solutions affects only the level of detail: finer partitions detail the keywords’ trajectories in more specific (but coherent) trends compared with the more general dynamics from which they derive. The method, with the addition of in-depth work on nested structures, also offers different levels of reading (finer and coarser) with a consistency that is not often found in cluster-based analyses. From a statistical learning point of view, these findings are crucial.