Introduction

If the title of this manuscript caught the attention of a reader, then perhaps he or she has been (or will be) in the position to assess quantitatively the interdisciplinarity of a body of research made of scientific articles, journals, institutions, or even careers. In this quest it is easy to discover that one theoretical definition of Interdisciplinary Research (IDR) has emerged as the closest thing to an official one:

[IDR is] a mode of research by teams or individuals that integrates information, data, techniques, tools, perspectives, concepts, and/or theories from two or more disciplines or bodies of specialised knowledge to advance fundamental understanding or to solve problems whose solutions are beyond the scope of a single discipline or area of research practice (Committee on Facilitating Interdisciplinary Research, 2005).

Historically, the content of this definition has been contrasted with definitions for

  • multidisciplinarity, the factual occurrence of multi-topical academic literature within a common system of knowledge;

  • transdisciplinarity, which concerns the dynamics of transfers of ideas (primarily collaborative, sometimes conflictual) between academic or academic-adjacent authoritative institutions and non-academic actors.

Finally, the three definitions have been grouped under the term ‘cross-disciplinarity’ (Hessels & van Lente, 2008; O’Rourke et al., 2016).

Beyond this quasi-official definition, various interpretations of the meaning of ‘interdisciplinarity’ co-exist in the lexicon of laypeople, stakeholders, and even dedicated scholars (Avila-Robinson et al., 2021; Laursen et al., 2022). The main confusion seems to regard whether interdisciplinarity fundamentally concerns the successful cooperation (or just a feeling of it) among people of complementary academic expertise (Aram, 2004; Andersen, 2016; Zeng et al., 2022), or instead the process of recombination of established scientific ideas into novel discoveries that solve complex problems and unify pre-existing paradigms (Davies et al., 2018; Bloom et al., 2020; Lane et al., 2022; Teplitskiy et al., 2022), which does not necessarily require cooperation.

This study treats the inherently ill-defined task of quantifying IDR by adopting flexible and reproducible mathematical models that properly reflect the plurality of opinions about what constitutes interdisciplinarity. I propose an example of this ‘systemic’ approach in Fig. 1: here I segmented the modelling operations into specific analytical tasks, connecting the steps from the idea of Research Design to the achievement of a set of metrics. If readers, after being inspired by this review, start to envision diagrams similar to (or even better than) Fig. 1 for representing their own models of IDR, then my goal in writing it will have been accomplished.

Fig. 1: An example of a model of measurement of IDR

Extensive reviews on Interdisciplinary Research have been proposed by Porter et al. (2006), Huutoniemi et al. (2010), Wagner et al. (2011), Rousseau et al. (2019), and Glanzel and Debackere (2022). The whole study might be considered most suitable for advanced readers already familiar with at least the most recent of these reviews. In particular, this study tries to integrate them with other recent contributions on the combination of conceptual categories of science, especially the formal analysis of complex taxonomies, the declination of the concept of disciplinary novelty, the distinction between singular and collective units of analysis, and the role of causality in IDR. Nevertheless, this work remains in strong continuity with the spirit of those reviews in recognising a fundamental syllogism of IDR:

  • the expression “Interdisciplinary Research" is a polysemous construct (Sanz-Menendez et al., 2001),

  • and the numerical value of an indicator can represent at most one of the semantic dimensions of a polysemous construct,

  • ergo, ideally IDR should be measured through a coordinated system of indicators, not with a single measure (Rafols, 2019; Marres & de Rijcke, 2020; Stirling, 2023).

In addition to expanding the reviewed literature with ideas for new measures, the model-based ‘systemic’ design focuses on a practitioner-oriented formal analysis of them.

Some additional words can be spent on the rationale for a multi-model approach to measurement. Firstly, it is necessary to recognise that the concept of interdisciplinarity poses a challenge to the epistemology of the categories of Science. Differently from how quantities are treated in canonical metrology, in the measurement of IDR the unit of observation cannot be reduced to the mere ratio between a unit of measurement and the observation of an extensive (or intensive) feature or state of a well-defined phenomenon (Schmidt, 2008; Sugimoto & Weingart, 2015; Boon & Van Baalen, 2018). This feature is shared by all ‘derived categories’ (‘inter-categories’), which exist in-between what is officially represented in the known taxonomy. ‘Inter’-disciplinarity is clearly one such case: to observe integration, one must postulate the alternative condition of segregation; to observe originality, one must agree on what has occurred in the past. In summary, any instance of nonconformity inherently distinguishes itself from something else that is necessarily acknowledged as typical, so interdisciplinarity necessarily deals with those outcomes of a process of classification that are more uncertain (Lüthge, 2020).

Going one step further, those who have measured IDR have sometimes ended up questioning the epistemological soundness of canonical taxonomies of academic disciplines. Consider the so-called challenge of the ‘granularity’ of disciplinary categories (Glanzel et al., 2003, 2009; Boyack et al., 2005; Fanelli & Glanzel, 2013; Klavans & Boyack, 2017; Shu et al., 2022), that is, the attempt to answer the question of the ideal size of a taxonomy of scientific disciplines. This problem is separate from the aforementioned errors of misclassification or ambiguous classification (Zhang et al., 2022) that characterise the uncertainty of categorical classification. Behind the veil of analytical optimisation lies a more fundamental question: what should be the criterion to define a scientific discipline? An attribution by the consensus of dedicated experts? A distinct position in a topology of citations? A semantic classification operated by an automated AI (Artificial Intelligence)? The identification of the dimensions of IDR depends on how one answers this question, and without answering it, it is impossible to calibrate the ideal size of a set of scientific disciplines.

The problem of the disciplinary taxonomy is definitely not the only source of ambiguity in measurement. Equally challenging is the decision regarding the operational definition of IDR, i.e. the linkage between the conceptual dimensions of IDR and the mathematical formulas called “measures" (see Fig. 1). Many valid measures of IDR can be found in the literature. Methodological reviews such as Molas-Gallart et al. (2014), Wang and Schneider (2020), and Zwanenburg et al. (2022) have actually found dozens of mathematical formulas for the same dimensions, to the point that these studies ended up questioning the validity of so many measures in the absence of numerical convergence in results. In addition, consider that in order to determine the uncertainty of results exclusively due to the variation in operational definitions, those authors fixed the taxonomy. Aside from this methodological precaution, in reality taxonomies are not fixed and vary between studies, compounding the inconsistencies.

Pointing out all sources of fickleness in the numbers helps to introduce a further rationale for a systemic approach in the model of measurement of IDR, that is, the normative impact of variable and incoherent results. This issue can be decoupled into two separate, yet related, moments: the evaluation of scientific projects and careers, i.e. for assigning grants, tenures, etc.; and the recognition of a cumulative knowledge about IDR, for example through the claim of discovering essential laws about science itself and its progressive (self)-regularisation (Fortunato et al., 2018).

In the first case, the measurement of IDR occurs within the contingency of a choice in the form: “Given X data, then do A, or do B (or do C, etc.)?" Here the risk of methodological reductionism lies in adopting a model, taking a decision based on it, and then discovering that adopting a slightly different variant would have led to a drastically different decision (Paruolo et al., 2013). Additionally, it seems plausible that some choices in the models of measurement make the final indicator easier to ‘game’ (Rijcke et al., 2016). If I establish an incentive for scholars engaged in IDR, and I also measure their engagement in IDR by the number of times the word “interdisciplinary" occurs on their web-pages, I should not be surprised to then see a proliferation of scholars turning from “physicists" into “interdisciplinary physicists", from “biologists" into “interdisciplinary biologists", etc. A malicious interpretation would attribute the origin of such mass behaviour exclusively to the proverbial greed of the well-seasoned scholars (Edwards & Roy, 2017). However, a true interdisciplinary scientist should not be penalised (or go unheard) for saying a truth. Penalising a legitimate dimension of IDR for the perverse incentives associated with its operational definition is a form of perverse incentive in itself! If a fickle indicator is relevant for the research questions, it should be paired with other robust indicators, allowing the evaluation to spot (and possibly penalise) systemic perversions (Moed & Halevi, 2015) and correct them. All these risks reaffirm the pragmatic convenience of a systemic approach.

As aforementioned, the assessment of IDR can also be expressed as general laws. On this subject, I would discuss one of the most influential articles on the concept of anomalous disciplinarity, “Atypical Combinations and Scientific Impact" by Uzzi et al. (2013). This article avoids engagement with the previous literature on IDR, as if the concept of “Atypical Combinations" were totally separate from the research on indicators of interdisciplinarity (Boyack & Klavans, 2014). It mentions the existence of “interdisciplinary journals" (whatever that means for the authors), but the methodology does not explicitly recognise the concept of disciplinarity in itself. As an early form of complex taxonomy, in this article atypical papers are defined as those showing anomalous and rare patterns in the conditional frequencies of pairs of citations to journals, independently of the attribution of those cited journals to a discipline. This paper became very relevant for research policies because the authors characterised their results in an authoritative way ("science follows a nearly universal pattern"), advancing the important claims that “the highest-impact science is primarily grounded in exceptionally conventional combinations of prior work yet simultaneously features an intrusion of unusual combinations", and that “teams are more likely than solo authors to insert novel combinations into familiar knowledge domains".

By not citing the literature on IDR, the semantics of ‘atypicality’ in Uzzi et al. (2013) ended up conflating the two dimensions of novelty and diversity into a single measure. When a systemic approach was instead applied by Bornmann (2019) and Fontana et al. (2020), it was noticed that the model of measurement proposed by Uzzi et al. (2013) correlated with the Rao-Stirling Index, a typical measure of categorical Diversity, and yet not with a specific alternative model of measurement of Novelty envisioned by Wang et al. (2017). These later studies showed the necessity of testing different measures for the same construct, and of considering differences in semantic dimensions. I would exclude the possibility that Uzzi et al. (2013) overlooked the concept of disciplines because the authors ignored or underappreciated the epistemological merits of disciplinary taxonomies. I suggest a different reason: when room is given for alternative models of measurement, any scientific thesis is challenged by a higher risk of being refuted. Considering the computational complexity involved in most scientometric models of measurement, no one wants to make a lot of noise with their machines for nothing (Hansen & Marinacci, 2016).

However, as is well known to experts in research synthesis, there cannot be a “universal pattern" without the convergence of many models; a pattern can be largely reproducible across samples, but its universal validity would still be challenged by being conditional on a specific operational definition (Flake & Fried, 2020). The issue arises when a study claims a certain causal relation between IDR and another feature while obscuring the potential for methodological (and epistemic) alternatives. Such a study may indeed appear more convincing initially, but discrepancies with future studies can either mislead years of scientific research or erode the epistemic trust within the scientific community over time. These dynamics of over-confidence followed by over-skepticism erupted into a so-called “Replication Crisis" (Baker, 2016; Fanelli, 2018; Saltelli & Funtowicz, 2017). The origin of this crisis has been historically framed as a consequence of the bad incentives in peer-reviewed institutional research, where replication and robustness are overlooked in favour of innovation and progress, with a consequent inflation of the frequency of false positive findings in published research (Nissen et al., 2016; Bartoš et al., 2024). Indeed, any reductionist paradigm of measurement can be ‘hacked’ just by omitting alternative choices in the model of measurement; in fact, more robust forms of multi-model inference have been proposed as a new paradigm for inherently fickle research areas like the social sciences (Steegen et al., 2016; Simonsohn et al., 2020; Cantone & Tomaselli, 2024).

The following sections of the study correspond to the main steps of decision-making in the design of Fig. 1. Section 2 covers the process of stylisation of real events into metadata acting as units of observation; these will later constitute the inputs for the measures. In this section, basic concepts of the design are defined, such as analytical units, disciplinarity, and taxonomies. The discussion about the formation of a taxonomy is enriched with synthetic considerations on methods to deal with the similarity of disciplines. Here I introduce general methods of identification and normalisation of the matrix of similarity. This section is completed with a review of methods of classification of metadata. Section “Operational definition: dimensions and measures” regards the operational definition connecting dimensions and measures. Along with a reasoned review of the most famous and the most promising measures, and the dimensions they define, criticisms are presented and discussed too. Most of this section covers the prominent dimension of Integration, presenting new arguments of criticism against its universal validity. It is demonstrated that specifications of the Generalised Diversity (Leinster & Cobbold, 2012) are actually parametric adjustments of the number of countable categories associated with the unit of analysis. It is questioned under which assumptions Diversity corresponds with the theoretical definition of integration of knowledge. The study then extensively covers the discussion about the linkage between the dimension of Disciplinary Novelty and the concept of statistical divergence, presenting established and novel measures of statistical divergence between the unit of analysis and a benchmark unit. The section is concluded with a review of the dimensions based on the analysis of citational networks (topological dimensions). A final section presents advanced topics inherent to the quantification of IDR: multivariate estimators for collective units of analysis and the role of causal structures in deconfounding the aggregated estimates of IDR for policy-making.

Stylisation of facts

Differently from the state of a physical process, the constructs involved in the measurement of IDR are much more abstract, and there is a necessity to anchor them to some directly observable entities. The ‘stylisation of facts’ regards all the operations aimed at reducing the complexity of the phenomenology of scientific disciplines through the sublimation of the most indicative observable facts into proper statistical units of observation: coherently tractable quantities serving as inputs for the operational definitions of the dimensions of IDR. Among these operations, three co-dependent decisions stand out as a good approximation of the essence of stylisation:

  • The unit of analysis, that is, the evaluand of the scientometric analysis. Here one must consider that scientific articles may show patterns of interdisciplinarity radically different from the interdisciplinarity of their journals, of their authors, etc.

  • The taxonomy of disciplines, that is, the information about how scientific knowledge should be partitioned into compact thematic clusters.

  • The method of classification of the metadata, that is, the linkage between unit of analysis and taxonomy that assigns numerical values to the units of observation.

Since the stylisation connects research questions with the operational definition of IDR, the framing of the unit of observation depends on the needs of the research design, but it is also helpful to consider stylisation in tandem with the evaluation of the operational definition. Conversely, given the multitude of variants, it can be useful to identify essential elements needed to complete a stylisation of facts. The next section presents the formalism for a flexible vectorial representation of the disciplinarity of a body of research that converts the metadata of the body of research into statistical frequencies. This is only one outcome of a stylisation, and it is not universally applicable across all operational definitions. Topological measures (Sect. “Cohesion and other topological dimensions”) may adopt their own network-based mathematics. The advice here is to assess options for stylisation early but maintain the possibility of revising them after other choices for the model are made.

Unit of analysis

The unit of analysis is the object of the assessment of IDR, the ‘body of research’ (Porter et al., 2008). This corporeal metaphor seems pertinent because units of analysis of IDR are usually systems that encapsulate parts of scientific knowledge. A good symbol for any unit of analysis of IDR is \({\textbf{x}}\), where the bold style reminds us that it is a placeholder for a plurality of observations, not a singular number.

There is a relevant distinction between:

  • Elementary units of analysis: also conventionally termed ‘papers’, even if most elementary units of analysis are articles or web-pages that are never actually written on paper.

    The characteristic of a ‘paper’ that makes it ‘elementary’ is that there is no way to take a part of it without losing its meaning. For example, a list of authors, keywords, or references, or a table or a figure is rather meaningless outside the context of the paper.

    Books can be either elementary or collective units: when a chapter of a book makes no sense without reading the previous ones, then even a book is an elementary body of research and not a collection.

  • Collective units of analysis: these are bodies of research that can be partitioned in at least one way such that each part makes sense as a standalone unit. Some typical collective units are:

    • Journals and serials;

    • Careers / Curricula of authors;

    • Institutions, which can be made either of papers or of authors.

    Exotic collectives are:

    • Territories, nationalities, and languages, as collectives of institutions or people;

    • Editors, either as publishing houses or as collections of papers edited by the same person;

    • Research categories, to the extent that even disciplines can be treated as units of analysis.

Bodies of research can be considered either as elementary or collective units, depending on the method of classification of the metadata: authors can be classified through metadata from a singular document (their web page, their title, their Ph.D., etc.) or through the collection of papers in their careers.

Taxonomy

A taxonomy is a set of well-distinct disciplinary i-categories that partition scientific research into thematic clusters:

$$\begin{aligned} {\mathcal {I}}: \{i_1, i_2, \dots i, \dots , i_{k_{\mathcal {I}}}\} \end{aligned}$$
(1)

To recall two distinct categories I will adopt the formalism \(i_1\) and \(i_2\); here it always holds that \(i_1 \ne i_2\). When the formalism allows the possibility of recalling the same i category twice, e.g. paired with itself, then I will adopt i and j, so \(i=j\) is allowed if not explicitly negated.

All \(i \in {\mathcal {I}}\) must share the same taxonomic rank. This degree of generality is the level of granularity of the taxonomy (Shu et al., 2022). The most obvious interpretation of this principle is that the categories of “Mathematics" and “Real analysis" cannot be found in the same set because one encompasses the other. An extension of this principle is that “Physics" and “Botany" cannot stay in the same set because Physics is intuitively more generic than Botany. Practically, there is a plethora of possible taxonomies at different levels of granularity. The approach for identifying the taxonomy can interpret the concept of a disciplinary category through the lens of the historical western canon of traditional classification, from Aristotle to Comte (etc.), or alternatively with an ad hoc criterion. An exotic definition of disciplinary categories could account for specific bodies of research as disciplinary categories, as long as the research design justifies this choice and the taxonomy can be adequately related to the metadata of the analytical unit.

What is a discipline? Canonical disciplines, exotic alternatives, and the role of journals

The canonical approach to defining \({\mathcal {I}}\) tries to reproduce the distinction between academic qualifications in the taxonomy of disciplines of scientific research. Effectively, scientific knowledge truly reproduces divisions that originated in academic departments. However, scientific knowledge started to follow disciplinary paths for the sake of producing standardised professional credentials, facilitating the enrolment and social mobility of students and scholars, not for the benefit of novel research. An approach that reproduces academic partitions too strictly does not necessarily exert positive effects on the facilitation of novel research; in fact, while professors (experienced teachers, mentors) are Professors-in-a-discipline, research personnel are assigned to highly specific projects with extravagant and non-standard names (Stichweh, 1992; Becher, 1994; Bourke & Butler, 1998; Godin, 1998; Jacobs & Frickel, 2009; Hodgson, 2022).

Defining disciplines according to a traditional canon results in a top-down approach of classification of journals provided by a specialised source of classification. The alternative is the bottom-up approach: finding stable empirical patterns and then giving a name to the disciplinary clusters of journals a posteriori of the analysis. Examples of bottom-up approaches are clustering algorithms for community detection run on citational networks (Rafols & Meyer, 2010; Klavans & Boyack, 2017), semantic patterns identified by AI, or even results consolidated through surveys of experts. These bottom-up approaches can identify taxonomies that are more in line with modern traditions of scientific research, but in reality the distinction between top-down and bottom-up approaches is progressively lost once top-down taxonomies are updated and validated through novel bottom-up results. For example, the naming of the clusters in unsupervised methods is still necessarily informed by some form of canon. Finally, traditional top-down disciplines and innovative bottom-up taxonomies can coexist at different levels of granularity.

In any case, when dealing with scientific taxonomies, journals are a recurrent element of classification. Scientists attribute authority to the most read journals in their field, and the publication of an article in a journal is a meaningful operation for the transmission of knowledge from scientific research into established academic notions. Journals are so powerful in defining disciplinarity that one approach to the stylisation of the taxonomy is to nominally dismiss the concept of disciplines and just assume that \({\mathcal {I}}\) is a list of all scientific journals (Uzzi et al., 2013; Wang et al., 2017). Opting for journals extends the granularity of the analysis, reduces the arbitrariness in the definition of the taxonomy, and formally nullifies the error of misclassification, but it makes the analysis potentially uninterpretable.

Identification of the similarity of categories

The notion of similarity between disciplines has its origin in the contrast between taxonomies at different levels of granularity: research areas falling under the same academic discipline are more similar than research conducted in separate disciplines. In addition, research areas and academic disciplines may not perfectly align: a research area might draw inspiration from multiple disciplines, or vice versa. When two disciplines meet in the same research area, their knowledge will be more intertwined, and their disciplinary similarity will increase, too (Shu et al., 2022). This relationship becomes apparent when observing the final effects: some disciplines are classified as “Natural sciences" and others as “Social sciences", etc., and a social science is more similar to another social science than to a natural science. These examples do not exhaust the spectrum of possibilities for identifying similarities between disciplines. Finance and Physics address vastly different phenomena, yet they can still be considered similar because of commonalities in mathematical methods and because of the subsequent professional interest of people with a background in Physics toward jobs in Finance.

Apart from the rule that the similarity between a discipline and itself must be equal to 1, acting as an upper limit for the similarity between two different disciplines, no methodology to identify a

$$\begin{aligned} Z: \{z(i,j), \dots \} \end{aligned}$$

matrix of scores of similarity is definitive (Porter et al., 2006; Leydesdorff & Rafols, 2009; Zhang et al., 2016; Chen et al., 2021; Thijs et al., 2021; Vancraeynest et al., 2024). Nevertheless, there are some remarkable techniques in the literature (Adnani et al., 2020). The most established one determines the similarity of two categories through the construction of a matrix of events \(C_{Z}\) whose rows are the i categories and whose columns are the non-negative occurrences of meaningful events concerning these disciplines. So for each i there is a vector \({\textbf{i}}: \{c_{i,1}, c_{i,2}, \dots \}\). The value of the cosine of two rows counts as an estimate of the z-similarity:

$$\begin{aligned} {\hat{z}}(i_1,i_2) = \text {cos}({\textbf{i}}_1,{\textbf{i}}_2) \end{aligned}$$
(2)

The cosine of vectors enjoys remarkable properties for this task, well documented by Egghe and Rousseau (2003) and Leydesdorff (2005). The cosine is defined in the interval (0, 1), and since the value of the cosine is invariant to a positive rescaling of the vectors, the measure is independent of the scale of values in the matrix. As a consequence, the most straightforward interpretation of the cosine similarity on non-negative values suggests considering it as a non-negative equivalent of the canonical Bravais-Pearson measure of linear correlation between two variables.
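
A minimal sketch of Eq. 2 follows, assuming a small hypothetical non-negative event matrix \(C_{Z}\) (rows are categories, columns are event counts such as citations); the row-wise cosine yields an estimated Z matrix, and the example is only an illustration of the formula, not a prescribed pipeline:

```python
import numpy as np

def cosine_similarity_matrix(C):
    """Estimate Z (Eq. 2): pairwise cosines between the rows of a
    non-negative event matrix C_Z (rows = categories, columns = events)."""
    C = np.asarray(C, dtype=float)
    norms = np.linalg.norm(C, axis=1, keepdims=True)  # Euclidean norm of each row
    unit_rows = C / np.where(norms == 0, 1.0, norms)  # guard against empty rows
    Z_hat = unit_rows @ unit_rows.T                   # cos(i1, i2) for every pair
    np.fill_diagonal(Z_hat, 1.0)                      # z(i, i) = 1 by definition
    return Z_hat

# Hypothetical counts for three categories over four event columns
C_Z = np.array([[120, 30,  0,  5],
                [100, 40,  2, 10],
                [  0,  3, 80, 60]])
print(np.round(cosine_similarity_matrix(C_Z), 3))
```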

Given that the cosine is a relatively consistent estimator of the similarity, controversies may regard the sensitivity of this measure to the choice of the matrix of events (Avila-Robinson et al., 2021). The established method tracks a citational network between disciplines and then adopts the number of citations as the observable non-negative event. Each column of the matrix \(C_{Z}\) represents a distinct count of citations, either by or to a comparable body of research. The standard option would be to construct a square matrix \({\mathcal {I}} \times {\mathcal {I}}\) of citations (Huang et al., 2021; Thijs et al., 2021; Shu et al., 2022). In some cases, citations are not available, but alternative events of similarity are provided, for example in the form of a confusion matrix where c(i,j) is the probability of classifying the i-category as the j-category.Footnote 1 In these cases it is necessary to calibrate a matrix of similarity Z through a normalisation:

$$\begin{aligned} {z(i,j)} = {\left\{ \begin{array}{ll} 1 &{} \text { for }i=j \\ \frac{c(i,j) \cdot k_{{\mathcal {I}}}}{\sum C - \sum c(i,i)} &{} \text { for } i \ne j \end{array}\right. } \end{aligned}$$
(3)
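
A direct transcription of Eq. 3, assuming a hypothetical confusion matrix of classification probabilities; the diagonal is fixed to 1 and the off-diagonal entries are rescaled by \(k_{{\mathcal {I}}}\) over the total off-diagonal mass:

```python
import numpy as np

def similarity_from_confusion(C):
    """Calibrate Z from a confusion-style event matrix as in Eq. 3."""
    C = np.asarray(C, dtype=float)
    k = C.shape[0]                              # k_I, number of categories
    off_diagonal_mass = C.sum() - np.trace(C)   # sum(C) minus the c(i, i) terms
    Z = C * k / off_diagonal_mass               # rescale the off-diagonal entries
    np.fill_diagonal(Z, 1.0)                    # z(i, j) = 1 for i = j
    return Z

# Hypothetical probabilities of classifying category i as category j
C = np.array([[0.90, 0.07, 0.03],
              [0.10, 0.85, 0.05],
              [0.02, 0.08, 0.90]])
print(np.round(similarity_from_confusion(C), 3))
```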

Alternative methods to identify a Z matrix of similarity events may involve adopting scores generated by Large Language Models (LLMs). However, these could lack explainability and reproducibility: especially for unexpected similarities, it can be difficult to discern whether the model has uncovered meaningful but hidden interdisciplinary connections or whether it is just hallucinating an answer (Ray, 2023; Bornmann & Lepori, 2024; Thelwall, 2024).

Complex taxonomies

Complex taxonomies involve the stylisation of the partition of scientific knowledge not as a singular category but as the interaction or combination of two categories. Let a couplet of categories be represented as \((i_1,i_2)\); the taxonomy then takes the form:

$$\begin{aligned} {\mathcal {I}}^2 = \{(i_1,i_1),(i_1,i_2),\dots (i_1,i_{k_{{\mathcal {I}}}}), \dots , (i_2,i_1), \dots \} \end{aligned}$$
(4)

Aside from the computational complexity implied in Eq. 4, complex taxonomies are sometimes an alternative to the identification of similarity (Uzzi et al., 2013; Teplitskiy et al., 2022) and sometimes complementary to it (Rafols, 2014; Wang et al., 2017).
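
A toy illustration of Eq. 4, with hypothetical category labels, makes the quadratic growth of a complex taxonomy explicit:

```python
from itertools import product

# Hypothetical taxonomy of k_I = 3 disciplinary categories
I = ["physics", "biology", "economics"]

# Complex taxonomy I^2 of Eq. 4: every ordered couplet, (i, i) included
I2 = list(product(I, repeat=2))
print(len(I2))   # k_I squared = 9 couplets
print(I2[:3])    # [('physics', 'physics'), ('physics', 'biology'), ('physics', 'economics')]
```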

Methods for disciplinary classifications of metadata

A method of disciplinary classification is a procedure that takes as an input a selection of metadata of the body of research \({\textbf{x}}\) and assigns a non-negative score between \({\textbf{x}}\) and each of the i disciplinary categories. This score is the unit of observation of the model of measurement, and can be represented through the placeholder \(c_i({\textbf{x}})\). In most models it is convenient to transform \(c_i({\textbf{x}})\) into a proportion \(p_i({\textbf{x}})\) through a normalisation on the sum of the values within the same analytical unit:

$$\begin{aligned} {\hat{p}}_i({\textbf{x}}) = \frac{c_i({\textbf{x}})}{\sum \limits _i c_i({\textbf{x}})} \end{aligned}$$
(5)

From this operation it is possible to derive a theoretical distribution:

$$\begin{aligned} \begin{aligned}&P({\textbf{x}}): \{ p_{i_1}({\textbf{x}}), p_{i_2}({\textbf{x}}), \dots , p_i({\textbf{x}}), \dots \} \; \\&\quad \text {such that} \\&\quad \sum \limits _i p_i({\textbf{x}}) = 1 \end{aligned} \end{aligned}$$
(6)

that is mathematically analogous to a distribution of probability for \({\textbf{x}}\) according to the frequentist interpretation: the different disciplinary categories found in relation to \({\textbf{x}}\) coexist as estimates of the theoretical frequencies of how large a portion of \({\textbf{x}}\) is occupied by “information, data, techniques, tools, perspectives, concepts, and/or theories" originated in i. From \(P({\textbf{x}})\) derives the integer \(k({\textbf{x}})\), equal to the number of \(p_i({\textbf{x}}) \mid p_i({\textbf{x}}) > 0\).
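
A minimal sketch of Eqs. 5-6, assuming hypothetical classification scores \(c_i({\textbf{x}})\) over a four-category taxonomy:

```python
import numpy as np

def stylise_counts(counts):
    """Turn non-negative classification scores c_i(x) into the
    distribution P(x) of Eqs. 5-6 and the integer k(x)."""
    c = np.asarray(counts, dtype=float)
    p = c / c.sum()                  # Eq. 5: normalise within the analytical unit
    k = int(np.count_nonzero(p))     # k(x): number of categories with p_i(x) > 0
    return p, k

p_x, k_x = stylise_counts([12, 6, 0, 2])
print(np.round(p_x, 2), k_x)         # [0.6 0.3 0.  0.1] 3
```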

Given an analytical unit and a taxonomy, there are many available metadata that can be used as inputs to a method of classification. All of them have merits and drawbacks, of which I offer a synthesis in Table 1. An extended discussion of them follows in the next subsections.

Table 1 Main methods of classification of metadata

Citational methods: list of references

A common approach to the stylisation of the disciplinarity of \({\textbf{x}}\) involves retrieving the list of papers cited by \({\textbf{x}}\). This list is then stylised further into a \(P({\textbf{x}})\) by assigning disciplinary categories to journals and counting references to the categories. Journals assigned to multiple categories may split their counts or contribute to the sums more than once. For simplicity, \(p_i({\textbf{x}})\) represents the relative frequency of the disciplinary category i in the reference list of \({\textbf{x}}\), not in the text (Porter & Rafols, 2009; Rafols & Meyer, 2010; Leydesdorff & Rafols, 2011; Zhang et al., 2016). The reference list has become one of the most common sources of metadata for IDR because all scientific papers have reference lists.
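
A sketch of this counting convention, with a hypothetical journal-to-category mapping; here multi-assigned journals split their count equally among their categories, which is only one of the options mentioned above:

```python
from collections import Counter

def reference_list_distribution(cited_journals, journal_to_categories):
    """Stylise a reference list into P(x) by mapping each cited journal
    to its disciplinary categories and counting (split) references."""
    counts = Counter()
    for journal in cited_journals:
        categories = journal_to_categories.get(journal, [])
        for category in categories:
            counts[category] += 1.0 / len(categories)  # split multi-assigned journals
    total = sum(counts.values())
    return {category: c / total for category, c in counts.items()}

# Hypothetical mapping and reference list
mapping = {"Phys Rev X": ["physics"],
           "J Chem Phys": ["physics", "chemistry"],
           "Scientometrics": ["information science"]}
refs = ["Phys Rev X", "J Chem Phys", "J Chem Phys", "Scientometrics"]
print(reference_list_distribution(refs, mapping))
# {'physics': 0.5, 'chemistry': 0.25, 'information science': 0.25}
```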

In this method, reference lists are interpreted as a ‘trace’ of the disciplinarity involved in the originating factors of the paper. Glanzel and Debackere (2022) refer to this approach as a “cognitive" approach, while Gates et al. (2019) call it “inspirational". Either case refers to how knowledge reiterates itself as a process of extrapolation of patterns through social actions (Nightingale, 1998; Rousseau et al., 2019; Reijula et al., 2023). In practice, a theoretical definition of interdisciplinarity as cognition implies a process of social recognition of the importance of the referenced ideas and claims: it implies that scientific authors might operate within disciplinary frameworks (e.g. being trained in a specific disciplinary branch of knowledge) yet still recognise the relevance and usefulness of concepts posited outside the typical scope of their own area of research.

This approach is not without flaws. It measures an emergent disciplinary property of a body of research \({\textbf{x}}_0\) through the observation of its antecedent linkages to other bodies of research, let us call them \({\textbf{x}}_1\), \({\textbf{x}}_2\), etc. This procedure reduces the remote origins of the referenced articles to fungible objects within a single disciplinary category. This operation is paradoxical when \({\textbf{x}}_0\) is of the same type as its linkages; in that case, in order to measure the IDR of paper \({\textbf{x}}_0\), one assumes the disciplinarity of the referenced papers \({\textbf{x}}\) through a rule that is different from the one adopted for \({\textbf{x}}_0\). The paradox becomes self-evident when the measured interdisciplinarity of \({\textbf{x}}_0\) is lower than what is measured for \({\textbf{x}}\).

More sophisticated approaches follow the example of Rafols and Meyer (2010) in adopting the method of “refs-of-refs", which operates the stylisation not on the direct references of the paper but on the references of the references (Liu et al., 2012). In practice this is a computationally expensive method, and in theory it only moves the issue towards the higher order of ‘refs-of-refs’: a reference could itself be a ref-of-ref of another ‘ref’, and in those cases the units \({\textbf{x}}\) would still be subject to two different rules when evaluated in different roles (Wang et al., 2017).

Another issue is that the cognitive approach assumes that a list of cited papers correctly tracks the epistemological premises of a paper without the need for any further calibration of \({\hat{p}}\). In practice, not all citations are functionally equivalent. For instance, a paper could position itself against the claims of some of its references. In other cases, citations are only performative acts, and their weight as traces of the disciplinary influence over \({\textbf{x}}\) should be shrunk (Tahamtan & Bornmann, 2018; Lyu et al., 2021; Tahamtan & Bornmann, 2022; Varga, 2022).

Citational methods: diffusion

The specular alternative to adopting reference lists consists in tracking the papers citing \({\textbf{x}}\). In models of Diffusion, \(p_i({\textbf{x}})\) is the relative frequency of the disciplinary categories of the new bodies of research citing \({\textbf{x}}\) (Liu et al., 2012; Abramo et al., 2018; Rousseau et al., 2019; Moschini et al., 2020). In many regards, Diffusion defies the full definition of IDR: it envisions the role of disciplines in the production of new knowledge only in terms of its future effects and not as sources of inspiration (Fig. 2), inverting the causal direction of cognition. A potential benefit of these models is that, paradoxically, Diffusion may work as an antecedent macro-factor of IDR: authors may be inspired by many sources for completing their research ideas, but they could have access to a broader range of cognitive resources because these are diffused across disciplines by a ‘superspreader’ body of research. In this case Diffusion may be a specific dimension of IDR, rather than a proper method of classification. In any case, Diffusion alone does not seem to satisfy any shared definition of IDR (Laursen et al., 2022).

On the technical side, Diffusion is problematic: future citations are not under the control of who publishes a paper, and the list of citing papers is dynamic rather than fixed, changing over time as papers accumulate more citations. In addition, it restricts the manifestation of IDR to cited bodies of research, with the consequence that measures of Diffusion and measures of scientific impact will often be confused. Some papers, maybe most papers, are never cited; not for this do they lack a disciplinarity! This approach suffers the most from issues of conceptual and metrical fickleness, but it suits specific niches of research questions about IDR, mostly regarding the evaluation of time series (Ke et al., 2023; Hou et al., 2024).

Methods of semantic classification and surveys

The reference list is a characteristic part of a paper. Among the other characteristic parts of papers are the title, the abstract, and the keywords; these are all syntheses of the semantic content of a paper. By analogy, authors may have titles (“Professor of..."), institutions and journals may have descriptive summaries, etc. All of these semantic elements act as signals of a form of disciplinary self-classification.

The assumption of the semantic approach to classifying disciplinarity is that automated methods derived from AI (learning algorithms, LLMs, etc.) can retrieve common patterns in the texts and estimate a \(P({\textbf{x}})\) for each i-label in their internal taxonomy. In this context, disciplines are defined as aggregators of scientific ideas and concepts that are particularly relevant for the holistic understanding of scientific texts. This happens at different levels of granularity and with either top-down (supervised) or bottom-up (semi-supervised) approaches to the recognition and classification of disciplinary lexicon (Wang et al., 2013; Nichols, 2014; Silva et al., 2016; Thorleuchter & Van den Poel, 2016; Xu et al., 2018; Bu et al., 2021; Kelly et al., 2021; Kim et al., 2022).

In the causal flow of the processes of cognition and diffusion of scientific knowledge, disciplinary semantics acts in a mediating role (see Fig. 2), and an argument to prefer a semantic approach over citational approaches is that its output is close to normal human intuition: when required to think about an interdisciplinary body of research, laypeople will think of a set of documents discussing different disciplinary topics, not one that cites different sources. After all, semantics is a representation of “what the paper is about".

Fig. 2: Causal flow of how scientific articles diffuse knowledge

The semantic approach has downsides too: there is a trade-off between the expense in terms of time and information needed to train a classification algorithm and the accuracy of the estimation of \(p_i({\textbf{x}})\). In particular, the major issues seem to revolve around the lack of coherence in the calibration of the classification scores across categories of different granularity. A solution may be to renounce developing the classification ‘in house’ and to adopt one of the pre-made classifications provided by authoritative third-party providers (Velez-Estevez et al., 2023). In this case, one renounces control and explainability over the scores in exchange for a standardised, reproducible solution.

In addition to semantic analysis over exotic units of analysis (e.g. personal webpages), I would include in this family of approaches those research designs based on social surveys (Aboelela et al., 2007; Jamali and Nicholas, 2010; Avila-Robinson et al., 2021). Surveys are an expensive method to tailor ad hoc ‘metadata’ for the research design: expert raters are interviewed and they elicit their idiosyncratic classification of the unit of analysis. This method can be convenient for research evaluation on a small set of units of analysis. The classifications of the experts are then synthesised through a method of selection or by averaging the scores. However, the classification resulting from a survey of experts may still be problematic: selection, filtering, and averaging are not equivalent procedures for synthesising the results of a survey into a \(P({\textbf{x}})\). The problem lies in how to process the deviation in classification scores among raters. If rater A says that \({\textbf{x}}\) is “a physicist who worked in Chemical Physics", and rater B says that \({\textbf{x}}\) is “a chemist with an expertise in mathematical models", what can be derived about the stylisation for \({\textbf{x}}\)? Is \({\textbf{x}}\) a chemist who is sometimes erroneously classified as a physicist or a mathematician, or an interdisciplinary chemist? In the first case Chemistry will have a higher proportion in the distribution, and in the second case a lower one. Similar considerations hold about how to process multiple sources of classification scores.

Collaborations

Concluding the overview of classification methods, it is worth mentioning perhaps the most intuitive approach to the classification of the disciplinarity of a body of research: observing instances of collaboration between agents of research (Abramo et al., 2012; Wagner et al., 2015; Abramo et al., 2018; Bu et al., 2018). In Glanzel and Debackere (2022), this approach is referred to as the “organisation" of scientific production, and there is a degree of overlap between the organisational approach to scientific knowledge and the definitions of the so-called Science of Team Science (Stokols et al., 2008; Leahey, 2016), leading to considerable flexibility in the definition of \(p_i({\textbf{x}})\) (Zhang et al., 2018).

This method poses some of the most severe challenges in implementation:

  • The basic unit of observation at the highest granularity of the taxonomy of collaboration is the author, not the paper. So, when the analytical unit is a paper, its disciplinarity is constrained by the low number of authors per paper.

  • For collectives, it is required either to classify a larger number of authors into disciplinary categories, or to accept large taxonomies of authors, which can be relatively uninformative as a disciplinary taxonomy: journals have disciplinary missions, but authors are not bound to pursue stable disciplinary paths.

  • The classification may be based on a recursive logic: let \({\mathcal {A}}_1\) and \({\mathcal {A}}_2\) be the authors of the analytical unit \({\textbf{x}}\). In order to assess a \(P({\textbf{x}})\), it is necessary to know \(P({\mathcal {A}}_1)\) and \(P({\mathcal {A}}_2)\), maybe by enforcing a mono-disciplinary classification on them. Either option relies on a previous classification of the disciplinarity of the bodies of research authored by \({\mathcal {A}}_1\) and \({\mathcal {A}}_2\). But the requirements of this need coincide with the requirements of the original research question, ergo a recursive loop is formed.

  • The ethics of classifying people without relying on institutional information is questionable. Authors may be surveyed about their own opinion, but this option is expensive and unreliable (especially if an author is dead).

The organisational approach to classification is useful for specific research questions. For example, when the unit of analysis is not an author but an institution, it may be sensible to classify a network of collaborations instead of citations or semantics (Qin et al., 1997; Bourke & Butler, 1998; Glänzel et al., 2009; Zuo & Zhao, 2018; Hackett et al., 2021). Alternatively, a stretched definition of disciplinary categories may involve the metadata on the functional role of the author in the production of the paper (i.e. theorist, analyst, data provider, etc.) (Haeussler & Sauermann, 2020). In this case, the role of authors can be crossed with an approximated classification of their disciplines with the aim of inferring the macroscopic role of that discipline in the formation of new scientific research.

Operational definition: dimensions and measures

An operational definition specifies how to measure a construct by describing its dimensions and the procedures that turn the available information about these dimensions into measurements. If the units of observation are the part of the model assessing what can be said about the disciplinarity of a body of research, the operational definition is the part that assesses its interdisciplinarity.

The next sections debate operational definitions for some dimensions of IDR. These do not constitute the entirety of the dimensions attributed to IDR. In addition, the same formula, applied to a different stylisation of \(P({\textbf{x}})\), qualifies as a measurement of a different dimension. For example, in this phase it is left to the reader to decide if the dimension of Cognitive Integration of the references should conceptually coincide with Semantic Integration.Footnote 2 The number of dimensions in the assessment of IDR depends on the extent of the research design. For the scope of a systemic review of IDR, I included a relatively large selection of measures so as to provide sufficient material to recognise common methodological issues encountered in the phase of operational definition of IDR. A synthesis of these measures is presented in Table 2.

Table 2 Main dimensions and measures

Integration

The most canonical definition of IDR of a body of research strongly suggests that interdisciplinary research is capable of integrating diverse disciplinary traditions (Huutoniemi et al., 2010; Alvargonzález, 2011), and, paradigmatically, the measurement of the integration of a differentiated system has always been associated with measuring the structural diversity observed in the patterns of occurrence of its multiple parts (Wagner et al., 2011; Rafols, 2014; Rousseau et al., 2019; Glanzel et al., 2021).

For someone still unfamiliar with this operational definition, the connection between the abstraction of integration and the measure of Diversity of parts-in-a-whole (a system) may seem obscure, but it is worth remembering that the unquestioned ubiquity of this measure across sciences has been called by Mcdonald and Dimmick (2003) “one of the closest things to a unitary concept of human knowledge". These considerations did not come from a vacuum. Anchored to the theory of systems and assisted by a flexible and powerful formalisation of \(p_i({\textbf{x}})\), the concept of integration of parts gained relevance in Ecology, Physics, and Sociology. Those disciplines influenced the definition of IDR by adopting systemic thinking in representing structural features observed on the metadata of bodies of research. The core idea is that by observing a \(P({\textbf{x}})\) it is possible to infer a functional process of cooperation among parts of a unique system, tracked by the well-proportioned co-existence of different types in a statistical distribution (Stichweh, 2000; Ricotta, 2005; Stirling, 2007; Leinster & Cobbold, 2012; Daly et al., 2018; Tahamtan & Bornmann, 2022; Estrada, 2023).

Generalised diversity

Relatively minimal in terms of mathematical sophistication, but sufficiently advanced to spark more than one methodological debate about its adoption, the paradigm of Diversity ended up dominating the literature not only about how to measure Integration, but also about how to define IDR. Alternative operational definitions of IDR have always had to adopt at least one measure of Diversity as a benchmark (Rafols & Meyer, 2010; Leydesdorff & Rafols, 2011; Leydesdorff et al., 2018, 2019; Bornmann et al., 2019; Fontana et al., 2020; Leydesdorff & Ivanova, 2021; Bu et al., 2021; Mutz, 2022). Diversity is a peculiar concept in the context of academic knowledge. Praised as a desirable societal goal by those favouring cosmopolitanism, the roots of such sociological ideas originated in Ecology:Footnote 3 in a diverse ecosystem, different types of agents cohabit in equal proportions (Patil & Taillie, 1982). The rationale for measuring integration through structural diversity is that (eco)systems are not immutable, and they re-construct themselves through the relations of the agents living within them. Therefore, stability in the proportions of the types of agents must be informative about the capacity of species to cohabit rather than going extinct within the ecosystem.

Historically, the first measure of Diversity that reached a state of intellectual dominance was the Rao-Stirling Diversity (Stirling, 2007), a parametrisation of the original index of Quadratic Entropy of Rao (1982), originally proposed for IDR in studies by Porter and Rafols (2009) and Rafols and Meyer (2010). This was an advancement compared to the initial attempts to measure IDR just with \(k({\textbf{x}})\) (Mugabushaka et al., 2016). A classical operational definition, it envisioned the measurement of integration in a compact formula tying together three sub-dimensions:

  • Variety (a.k.a. ecological richness) is the number of categories of the system. Practically it coincides with \(k({\textbf{x}})\).

    Variety is independent of the distribution of values of \(p({\textbf{x}})\) and of the similarity among the i categories, hence it is regarded as the fundamental measure of multi-disciplinarity, since it quantifies nominally the evident ‘multitude’ of disciplines involved in a body of research (Levitt & Thelwall, 2008; Zuo & Zhao, 2018; Moschini et al., 2020).

  • Balance (a.k.a. ecological evenness) is the absence of variability in the proportions of the categories. As Mugabushaka et al. (2016) report, this has often been measured using the repeat rate, that is, the sum of the squares of the proportions:

    $$\begin{aligned} \sum \limits _{i} [p_i({\textbf{x}})]^2 \end{aligned}$$
    (7)

    This statistical formalism has been rediscovered over time by many authors such as Gini, Simpson, Herfindahl, and Hirschman (Rousseau, 2018). Alternative formulas for the dimension of Balance include the canonical Gini index of concentration (which is different from Gini's proposal for repeat rates) and the Coefficient of Variation (Nijssen et al., 1998).

  • Disparity is a dimension derived from the Z matrix of categorical similarities (see Sect. 2.2.2), and specifically from the sub-matrix of Z for those \(i \mid p_i ({\textbf{x}}) > 0\) (a sketch computing all three sub-dimensions follows this list).
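
The sketch below computes the three sub-dimensions from a toy \(P({\textbf{x}})\) and Z matrix; note that summarising Disparity as the mean dissimilarity over the active categories is only one possible convention, assumed here for illustration:

```python
import numpy as np

def sub_dimensions(p, Z):
    """Variety, repeat rate (the inverse face of Balance, Eq. 7), and a
    mean-dissimilarity summary of Disparity for a distribution P(x)."""
    p = np.asarray(p, dtype=float)
    Z = np.asarray(Z, dtype=float)
    active = np.flatnonzero(p > 0)                 # categories with p_i(x) > 0
    variety = active.size                          # k(x)
    repeat_rate = float(np.sum(p ** 2))            # Eq. 7
    if variety < 2:
        return variety, repeat_rate, 0.0           # no pairs, no disparity
    Z_sub = Z[np.ix_(active, active)]
    off_diagonal = ~np.eye(variety, dtype=bool)
    disparity = float(np.mean(1.0 - Z_sub[off_diagonal]))
    return variety, repeat_rate, disparity

p = [0.5, 0.3, 0.2, 0.0]
Z = [[1.0, 0.6, 0.1, 0.2],
     [0.6, 1.0, 0.3, 0.2],
     [0.1, 0.3, 1.0, 0.4],
     [0.2, 0.2, 0.4, 1.0]]
print(sub_dimensions(p, Z))   # (3, 0.38, about 0.667)
```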

It has been debated how these three components should be combined into an index of diversity. In theory, nothing prevents establishing that diversity is equal to the mathematical product of the three factors (Leydesdorff, 2018; Mutz, 2022; Zhang et al., 2022), but the original formalism of the Rao-Stirling index is the following:

$$\begin{aligned} \underset{\text {RS}}{\Delta }({\textbf{x}}) = \sum \limits _{(i,j)} \left[ p_i({\textbf{x}}) \cdot p_j({\textbf{x}})\right] ^{\beta } \cdot \left[ 1 - z(i,j)\right] ^{\alpha } \end{aligned}$$
(8)

Stirling introduced \(\alpha\) and \(\beta\) as hyper-parameters of the index such that, when both are set equal to 1, the formalism converges to the aforementioned Rao's Quadratic Entropy. In addition, for \(\alpha = 0\) and \(\beta = 1\) it collapses into the complement-to-one (one minus it) of the Repeat index (Eq. 7). Finally, for \(\alpha \rightarrow 0\) and \(\beta \rightarrow 0\) it converges to Variety \(k({\textbf{x}})\). In Eq. 8 the factor of Disparity \(1 - z(i,j)\) acts as a sort of general prior on the expected value of \(p_i({\textbf{x}}) \cdot p_j({\textbf{x}})\), unconditional on \({\textbf{x}}\). High similarity will always penalise the contribution of the couplet (i, j) to the value of \(\underset{\text {RS}}{\Delta }({\textbf{x}})\).
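
A direct implementation of Eq. 8 on a toy \(P({\textbf{x}})\) and Z; with \(\alpha = \beta = 1\) it returns Rao's Quadratic Entropy:

```python
import numpy as np

def rao_stirling(p, Z, alpha=1.0, beta=1.0):
    """Rao-Stirling Diversity (Eq. 8): sum over couplets of distinct
    categories of (p_i * p_j)^beta * (1 - z(i, j))^alpha."""
    p = np.asarray(p, dtype=float)
    Z = np.asarray(Z, dtype=float)
    total = 0.0
    for i in range(len(p)):
        for j in range(len(p)):
            if i == j:
                continue                     # only couplets of distinct categories
            total += (p[i] * p[j]) ** beta * (1.0 - Z[i, j]) ** alpha
    return total

p = [0.5, 0.3, 0.2]
Z = [[1.0, 0.6, 0.1],
     [0.6, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
print(round(rao_stirling(p, Z), 3))          # 0.384 with alpha = beta = 1
```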

More recently, following the influential work of Jost (2006) and Ricotta and Szeidl (2006) on the generalisation of the concept of categorical similarity in measures of Entropy, Leinster and Cobbold (2012) proposed a generalisation of Diversity with only two parameters: Sensitivity (q) and Disparity (the Z mentioned in Sect. 2.2.2). Generalised Diversity is the following:

$$\begin{aligned} \underset{(q,Z)}{\Delta }({\textbf{x}}) = \left( \sum \limits _i p_i({\textbf{x}}) \Bigl [\sum \limits _j p_j({\textbf{x}}) \cdot z(i,j)\Bigr ]^{q-1}\right) ^{\frac{1}{1-q}} \end{aligned}$$
(9)

Equation 9 is defined for all real values of q except \(q = 1\); its special parametrisations are most easily read in the naive case in which Z is the identity matrix. The limit of \(\underset{(q,Z)}{\Delta }({\textbf{x}})\) for \(q \rightarrow 1\) exists and, incidentally, it is then equal to the exponential of the Shannon formula of entropy. At increasing q the measure becomes more insensitive to the presence of small non-zero \(p_i({\textbf{x}})\). This property can be intuitively understood by noticing that, letting q diverge to infinity, the limit of \(\underset{(q,Z)}{\Delta }({\textbf{x}})\) converges towards the inverse of \(\max [p_i({\textbf{x}})]\). This last quantity, that is, the highest proportion in the distribution \(P({\textbf{x}})\), is also called the Index of Dominance, since it just indicates how frequent the most common type of element in the system is (Berger & Parker, 1970; Leinster & Cobbold, 2012). Aside from its use in Ecology, Dominance is a rather important measure in Statistics, since its complement is a canonical measure of the uncertainty of classification in any unsupervised task (Senge et al., 2014). On the contrary, when q is negative, the Diversity is over-amplified by the presence of trivial quantities of the least frequent types. Finally, for \(q = 0\), Eq. 9 collapses into \(k({\textbf{x}})\), i.e. Variety, which assigns a marginal value of 1 to each new non-zero discipline found in the distribution.
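
A sketch of Eq. 9 follows, assuming the naive identity Z, so that the special values of q discussed above (Variety at q = 0, the exponential of Shannon entropy at q → 1, the inverse of Dominance at large q) can be verified numerically:

```python
import numpy as np

def generalised_diversity(p, Z, q):
    """Leinster-Cobbold Generalised Diversity (Eq. 9), including the q -> 1 limit."""
    p = np.asarray(p, dtype=float)
    Z = np.asarray(Z, dtype=float)
    keep = p > 0                                 # zero proportions do not contribute
    p, Z = p[keep], Z[np.ix_(keep, keep)]
    ordinariness = Z @ p                         # sum_j p_j * z(i, j) for each i
    if np.isclose(q, 1.0):                       # limit case: exponential form
        return float(np.exp(-np.sum(p * np.log(ordinariness))))
    return float(np.sum(p * ordinariness ** (q - 1)) ** (1.0 / (1.0 - q)))

p = [0.5, 0.3, 0.2, 0.0]
Z_naive = np.eye(4)                              # naive identity similarity
for q in (0, 1, 2, 50):
    print(q, round(generalised_diversity(p, Z_naive, q), 3))
# q=0 -> 3.0 (Variety), q=2 -> 1/sum(p_i^2) = 2.632, q=50 -> ~2.03 (towards 1/max p_i)
```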

A remarkable parametrisation of \(\underset{(q,Z)}{\Delta }({\textbf{x}})\) is \(q = 2\), since the following equation holds:

$$\begin{aligned} \underset{q=2}{\Delta }({\textbf{x}}) = \frac{1}{1-\underset{\text {RS}}{\Delta }({\textbf{x}})} \quad \text {for } \alpha = \beta = 1 \end{aligned}$$
(10)

This quantity is also referred to in the literature as Effective Diversity (Mugabushaka et al., 2016; Daly et al., 2018; Rousseau et al., 2019; Leydesdorff & Ivanova, 2021), but the adjective “effective" does not concern any form of causal effect. It mostly implies a mathematical property that Jost (2006) and Leinster and Cobbold (2012) call “Modularity", and Okamura (2020) calls the “Nesting Principle"Footnote 4:

  1.

    if and only if \({\textbf{x}}_1\) and \({\textbf{x}}_2\) are fully separated, such that it holds that

    $$\begin{aligned} \begin{aligned}&p_i({\textbf{x}}_1)> 0 \rightarrow p_i({\textbf{x}}_2) = 0 \text {, and} \\&p_i({\textbf{x}}_2)> 0 \rightarrow p_i({\textbf{x}}_1) = 0 \text {, and} \\&z(i_1,i_2) = 0 \text { for } p_{i_1}> 0 \text {, and for } p_{i_2} > 0 \end{aligned} \end{aligned}$$
    (11)
  2.

    then, if there exists an \({\textbf{x}}_3\) such that

    $$\begin{aligned} \begin{aligned}&{\textbf{x}}_3 = {\textbf{x}}_1 \cup {\textbf{x}}_2 \\&\text {i.e.} \\&p_i({\textbf{x}}_3) = \frac{p_i({\textbf{x}}_1) + p_i({\textbf{x}}_2)}{2} \end{aligned} \end{aligned}$$
    (12)
  3.

    then it holds

    $$\begin{aligned} \underset{q=2}{\Delta }({\textbf{x}}_3) = \underset{q=2}{\Delta }({\textbf{x}}_1) + \underset{q=2}{\Delta }({\textbf{x}}_2) \end{aligned}$$
    (13)

This property is useful for understanding the advanced estimation of Diversity in collective units (Sect. 4.1).

Alternatives to diversity: generalised entropy

Beyond the Rao-Stirling Diversity and Generalised Diversity there are alternative measures (Wang & Schneider, 2020; Zwanenburg et al., 2022); most of these can still be related to, or derived as special cases of, the Shannon formulation of the concept of Entropy, that is:

$$\begin{aligned} \sum _i \left[ p_i({\textbf{x}}) \cdot \log _2 \left( \frac{1}{p_i({\textbf{x}})}\right) \right] \end{aligned}$$
(14)

which differs from the Repeat Rate in that the second occurrence of \(p_i\) is inverted and then 2-logged. \(\log _2 \left( \frac{1}{p_i({\textbf{x}})} \right)\) is sometimes called the “surprise” of the i-th category, and it equals the information content of the statistical event \(p_i({\textbf{x}}) > 0\). The interpretation of the value of surprise is rather intuitive: the lower the frequency of a non-zero \(p_i({\textbf{x}})\), the harder it is to explain why it is not 0 instead. This hardship translates quite mechanically into the complexity of the procedures needed to justify such a non-zero frequency, and the base 2 of the logarithm acts as a scale parameter relating the value of the entropy to binary procedures of specification of the disciplinary content of the body of research (Fanelli, 2019).Footnote 5
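As a minimal illustration on three invented proportions, the per-category surprise and the resulting Shannon entropy of Eq. 14 can be computed as follows; the last line also shows the link with the q = 1 limit of Eq. 9 in the naive case where Z is the identity.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])          # toy disciplinary proportions
surprise = np.log2(1.0 / p)            # "surprise" of each i-category
shannon = np.sum(p * surprise)         # Eq. 14, in bits

print(np.round(surprise, 3))           # rarer categories are more surprising
print(round(shannon, 3))
print(round(float(np.exp(np.sum(p * np.log(1.0 / p)))), 3))
# exp(Shannon in nats) equals the q -> 1 limit of Eq. 9 when Z is the identity
```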

The literature is full of mathematical studies demonstrating how (and why) Generalised Diversity is a special case of Rényi's general formulation of Generalised Entropy (Rényi, 1961; Hill, 1973; Keylock, 2005; Jost, 2006; Ricotta & Szeidl, 2006; Grabchak et al., 2017). Mutz (2022) tailored a specific framework that reproduces the complete parametrisation of the sub-dimensions of Integration within Generalised Entropy. As he claims, the sensible rationale for adopting entropy as a measure instead of \(\underset{q=2}{\Delta }\) is that for mono-disciplinary objects it naturally converges to 0, and then moves towards \(\infty\) as the distribution includes new disciplinary categories, while the minimum Effective Diversity is 1. Nevertheless, Generalised Entropy lacks the Modularity property (see Eq. 13), making it a weaker candidate for advanced analyses on collective units of analysis (see Sect. 4.1). A more articulated adoption of entropy measures is proposed in Leydesdorff and Ivanova (2021) for measuring the exotic IDR dimension of “Synergy” within a body of research.

The relationship between the measurement of complexity and the definition of Integration lies in the necessity of internal coherence for a paper to pass a complex peer review. Aside from the remarks of Alan Sokal and his “squared” disciples, it is hard to get formal random nonsense published in a distinguished outlet. In this regard, I would distinguish, when assessing IDR, between stylisations based on published papers (e.g. assessing the IDR of authors through their publications) and those based on other sources of disciplinary classification (e.g. surveys). I would not admit the measurement of entropy and diversity on the latter as a valid operational definition of Integration, because the entropy of responses may only reflect confusion in the classification in terms of access to information, e.g. authors may be famous for some research activities only, without being known for extensive side research interests.

Alternatives to diversity: Gini

Leydesdorff (2018) and Leydesdorff et al. (2019) raise the following problem: Stirling (2007); Porter and Rafols (2009), and Rafols and Meyer (2010) conceptualised Integration of a system as a property of three interacting sub-dimensions (Variety, Balance, Disparity), but mathematically Rao-Stirling cannot be composed knowing these three sub-dimensions separately (Jost, 2010). Leydesdorff-Wagner-Bornmann (LWB) then propose a multiplicative aggregation of three sub-dimensions:

$$\begin{aligned} \underset{LWB}{\Delta }({\textbf{x}}) = \frac{k({\textbf{x}})}{k_{{\mathcal {I}}}} \cdot [1 - {\mathcal {G}}({\textbf{x}})] \cdot \frac{k({\textbf{x}})^2 - \sum {Z_{\textbf{x}}}}{k({\textbf{x}}) [k({\textbf{x}}) - 1]} \end{aligned}$$
(15)

where

  • \(k({\textbf{x}})\) is the Variety of the object and \(k_{{\mathcal {I}}}\) is the cardinality of the taxonomy.

  • \({\mathcal {G}}\) stands for the Gini index of Concentration (Eliazar, 2024):

    $$\begin{aligned} {\mathcal {G}}({\textbf{x}}) = \frac{\sum \limits _{i} \sum \limits _{j} \left| p_i({\textbf{x}}) - p_j({\textbf{x}}) \right| }{2 \cdot n^2 \cdot \bar{p}_i({\textbf{x}})} \end{aligned}$$
    (16)
  • \(Z_{\textbf{x}}\) is the sub-matrix of Z for \(p_i({\textbf{x}}) > 0\).

ideally with the implicit clause that categories with \(p_i({\textbf{x}}) = 0\) are not considered in the sample average

$$\begin{aligned} \bar{p}_i({\textbf{x}}):= \frac{\sum \limits _{p>0} p_i({\textbf{x}})}{k({\textbf{x}})} \end{aligned}$$
(17)
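The following is a minimal sketch of Eq. 15; the taxonomy cardinality and the toy inputs are invented for illustration, and the Gini index follows Eq. 16 restricted to the non-zero categories, as per the clause of Eq. 17.

```python
import numpy as np

def gini(p):
    """Gini concentration (Eq. 16) over the non-zero categories only (Eq. 17)."""
    p = p[p > 0]
    n = len(p)
    return np.abs(p[:, None] - p[None, :]).sum() / (2 * n**2 * p.mean())

def lwb_diversity(p, Z, k_taxonomy):
    """Multiplicative aggregation of Variety, Balance and Disparity (Eq. 15)."""
    nz = p > 0
    k = nz.sum()                                  # Variety k(x)
    variety = k / k_taxonomy
    balance = 1.0 - gini(p)
    Z_x = Z[np.ix_(nz, nz)]                       # sub-matrix of Z for p_i(x) > 0
    disparity = (k**2 - Z_x.sum()) / (k * (k - 1))
    return variety * balance * disparity

# Toy example: 3 observed disciplines out of a 10-category taxonomy.
p = np.array([0.5, 0.3, 0.2, 0.0])
Z = np.array([[1.0, 0.6, 0.1, 0.2],
              [0.6, 1.0, 0.3, 0.2],
              [0.1, 0.3, 1.0, 0.2],
              [0.2, 0.2, 0.2, 1.0]])
print(round(lwb_diversity(p, Z, k_taxonomy=10), 3))
```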

Rousseau (2019) contested the division by \(k_{{\mathcal {I}}}\): by removing it from the formula, the measure would enjoy the property of Modularity.

LWB overlook that \({\mathcal {G}}\) is sensitive to small deviations from 0 in the ‘marginal’ disciplines of the distribution \(P({\textbf{x}})\) (see the clause of Eq. 17). This lack of robustness to trivial additions has significant implications for stylisations of \(P({\textbf{x}})\) relying on a stochastic, not exact, validity of the measurement. For instance, in semantic stylisations, practitioners often approximate \(p_i({\textbf{x}}) \sim 0\) to mitigate the risk of internal miscalibration of the distribution, e.g. an unnecessary inflation of \(k({\textbf{x}})\). As a consequence, \(\underset{LWB}{\Delta }({\textbf{x}})\) must assume access to an exact disciplinary classification for \(P({\textbf{x}})\), a condition that appears overly stringent even for the approach of the reference list (see its criticism in terms of validity in Sect. 2.3.1).

Rousseau (2019) raised another kind of criticism towards \(\underset{LWB}{\Delta }\): it does not properly account for Disparity, only for the average similarity among the non-zero disciplines in \({\textbf{x}}\). More generally, \(\underset{LWB}{\Delta }\) is a linear aggregation of three distinct measures of three sub-dimensions (Variety, Balance, and Dissimilarity), whereas Diversity and Entropy are non-linear aggregations of sub-dimensions.

Other criticisms to integration

Diversity, being a formalisation of the dimension of Integration of disciplinary contributions, is the most commonly used family of measures for IDR. Alternatives have been proposed, but as highlighted by Rousseau (2019), there are no perfect measures of IDR, and in particular no perfect measures of Integration. In this regard, criticisms can also be directed towards the relevance of these sub-dimensions in defining the integration of disciplines. These critiques reflect alternative viewpoints on a complex process of measurement and should be taken as insights for understanding in more detail how to approach the evaluation of a model of measurement of IDR.

The first and most important criticism regards the role of Variety in defining the mathematical scale of a measure of Integration. \(\underset{q=2}{\Delta }\) can never be higher than Variety:

$$\begin{aligned} \max \underset{q=2}{\Delta }({\textbf{x}}) = k({\textbf{x}}) \end{aligned}$$
(18)

and this is reflected in the implicit dependency of the theoretical limit of \(\underset{RS}{\Delta }\) on \(k({\textbf{x}})\). This property is concerning because \(k({\textbf{x}})\) depends necessarily on \(k_{{\mathcal {I}}}\), ergo on the granularity of \({\mathcal {I}}\); this implies that the scale of the Diversity of a body of research is not a stable universal feature but depends on the specificity of the taxonomy.

In the following formulation of Diversity:

$$\begin{aligned} \delta ({\textbf{x}}) = \frac{\underset{q=2}{\Delta }({\textbf{x}})}{k({\textbf{x}})} \end{aligned}$$
(19)

Variety is both in the numerator and in the denominator, so \(\delta\)-Diversity is a measure of Integration independent of Variety. Authors such as Frosini (1981) and Mcdonald and Dimmick (2003) consider \(\delta\) the universal measure of integration, since it does not depend on the granularity of the taxonomy. It is definitely a good measure to evaluate the variation in Diversity due to qualitative shifts across taxonomies at different levels of granularity.

A second line of criticism attacks the epistemic value of Balance in the integration of the parts. In studies based on the stylisation of references, Balance in referenced disciplines is consistently associated with papers that are less cited than papers with the same Diversity but less Balance (Larivière & Gingras, 2010; Wang et al., 2015; Yegros-Yegros et al., 2015; Leahey et al., 2017; Okamura, 2019; Chen et al., 2021; Zhang et al., 2021; Li et al., 2023). A possible explanation for this pattern is that citations depend at least partially on the prestige of the publishing outlet, and Balance is not a highly appreciated trait among peer reviewers of prestigious journals, because these journals exist within a specific disciplinary framework and scope (Lamont et al., 2006; Laudel, 2006; Siler et al., 2015; Bark et al., 2016; Urbanska et al., 2019). Rigorous reviewers may expect a strong nucleus of established references as a basis for any novel work. This nucleus of references may be drawn from journals of the same discipline, diminishing the disciplinary heterogeneity of the references. But this does not prevent authors from citing diverse sources as long as there is a strong central nucleus. In fact, peer reviewers could still appreciate forms of IDR as long as the proportions of references between disciplinary cores and peripheries arrange into a stable configuration.

Signals of preferences for disciplinary nuclei are not alien to the literature on the desirability of IDR (Feller, 2006; Langfeldt, 2006; Rafols et al., 2012; Lane et al., 2022; Seeber et al., 2022). Are these preferences based on prejudice and rituality, or on a good intuition about how to properly integrate past knowledge? As advanced by proponents of complex taxonomies, the idea of disciplinary cooperation seems to imply a form of parity in the relation between two disciplines, but this same principle does not seem to work very well once the cooperation involves more than two disciplines. For example, there are configurations of ecosystems with a dominant cluster of species that assures the coexistence of the others. These configurations are particularly stable, and species cooperate in synergy through the dominance of the cluster (Chesson & Huntly, 1993; McCann, 2000). Figure 3 represents an example of how an uneven distribution of ties results in a more stable and better connected configuration than a perfectly balanced one.

Fig. 3
figure 3

Example of internal relations of disciplinary content

The criticism of Balance introduces the final discussion, on the approximation “Disparity is dissimilarity”:

$$\begin{aligned} \text {Disparity} \sim 1 - z \end{aligned}$$

which contrasts with another intuitive fact: a discipline that is very similar to others is more interdisciplinary than a less similar discipline. A combination of two dissimilar disciplines within a body of research must be regarded as an instance of Integration, but the prevalence of an interdisciplinary category should not penalise the IDR of a body of research. While Rousseau et al. (2019) criticise Leydesdorff et al. (2019) for abandoning the specific z(i,j) for the generic \(\sum _i(Z_{{\textbf{x}}})\), I suggest that both these quantities should be accounted for in a proper measure of Disparity.

Disciplinary novelty is nonconformist

Generalised Diversity is apt at capturing aspects of the variability internal to the distribution \(P({\textbf{x}})\), but a fundamental problem remains: even considering the sub-dimension of Disparity, a research agenda can be highly heterogeneous in its disciplinary composition and still be conducted in a way that is not necessarily innovative in disciplinary terms. This can happen because a paper repeats previous interdisciplinary ideas that are redundant within a clustered space of discussion, creating the paradox that instead of promoting novelty, a new paper would just help establish previous interdisciplinary literature as a new (interdisciplinary) discipline. An example of a similar dynamic may regard the publication of a paper about a new algorithm for causal inference. Published in a journal of Data Science, it may be an interesting contribution that builds on cumulative knowledge to slightly improve an optimisation task. It would not be a particularly nonconformist innovation. However, a paper about the application of the same algorithm published in a prestigious journal of Communication Studies would send a strong signal about the perceived need for a disciplinary upgrade in methods, realising other goals of interdisciplinarity besides internal integration. Diversity is not capable of representing this specific kind of event, because it cannot incorporate expectations about \(P({\textbf{x}})\).

The idea behind the dimension of Disciplinary Novelty is to contrast the disciplinary distribution \(P({\textbf{x}})\) against a specific benchmark \(P(E)\) to establish whether \({\textbf{x}}\) is a nonconforming body of research. Published nonconforming research acts as a signal of disciplinary innovation (Goyanes et al., 2020). This contrast takes the form of a generic function of the statistical distance between distributions, i.e. a divergence:

$$\begin{aligned} \nabla = f(\left| P({\textbf{x}}) - P(E) \right| ,\dots ) \end{aligned}$$
(20)

In reality, the formulations of diversity and divergence may still be related. Consider \(p_i(E) \in P(E)\); the simplest and most intuitive specification for the f of Eq. 20 is the following:

$$\begin{aligned} \underset{\textit{Simple}}{\nabla ({\textbf{x}})}:= \sum \limits _i \left[ p_i({\textbf{x}}) - p_i(E)\right] ^2 \end{aligned}$$
(21)

For separated distributionsFootnote 6 it holds

$$\begin{aligned} \sum \limits _{i} [p_i({\textbf{x}}) - p_i(E)]^2 = \sum \limits _{i} [p_i({\textbf{x}})]^2 + \sum \limits _{i} [p_i(E)]^2 \end{aligned}$$
(22)

where the two sums on the right-hand side have two rates of repeat (Eq. 7) as arguments. This implication is important because through this equivalence it would be possible to derive formulas adding the factors of Disparity and Sensitivity to Eq. 21. In other terms, disciplinary divergence could be identified as a form of disciplinary diversity augmented by a prior expectation, which sounds sufficiently coherent with the original purpose of an index of disciplinary nonconformity from a benchmark.
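A quick numerical check of Eq. 22 on two invented distributions with disjoint supports:

```python
import numpy as np

# Two toy distributions over the same taxonomy but with disjoint supports.
p_x = np.array([0.7, 0.3, 0.0, 0.0])
p_E = np.array([0.0, 0.0, 0.4, 0.6])

simple_divergence = np.sum((p_x - p_E) ** 2)        # Eq. 21
repeat_rates = np.sum(p_x ** 2) + np.sum(p_E ** 2)  # right-hand side of Eq. 22

print(round(simple_divergence, 3), round(repeat_rates, 3))  # identical values
```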

Specification of the divergence and elicitation of the benchmark

Equation 21 has practically never been adopted in the literature, because it is too simple: it lacks a normalisation and it does not implement the concept of Z-similarities among the i-categories. Remarkably, its most common normalisation is the \(\chi ^2\) statistic:

$$\begin{aligned} \chi ^2({\textbf{x}}) = \sum \limits _i \frac{ \left[ p_i({\textbf{x}}) - p_i(E)\right] ^2}{p_i(E)} \end{aligned}$$
(23)

one of the most established measures for the identification of nonconformity of empirical observations to a theoretical benchmark. Neither \(\chi ^2\) nor its limit converges to a finite number for \(p_i(E) = 0\); ergo \(\chi ^2\) is impractical as a measure of Divergence within the context of IDR, because it forces the computation to ignore the cases where \(p_i({\textbf{x}}) > p_i(E) = 0\), inducing a bias in the measurement. This flaw is shared by many other measures of divergence within the extended framework of Entropy (i.e. Mutual Information, Kullback-Leibler). An exception that stands out for its mechanical simplicity is the formula for the Probabilistic Jaccard intersectional distance:

$$\begin{aligned} \underset{\text {PJ}}{\nabla }({\textbf{x}}) = 1 - \frac{\sum \nolimits _i \min [p_i({\textbf{x}}),p_i(E)]}{\sum \nolimits _i \max [p_i({\textbf{x}}),p_i(E)]} \end{aligned}$$
(24)

which is a special normalised variant of the generalised mutual entropy, enjoying a pure frequentist interpretation of the \(p_i\) values (Moulton & Jiang, 2018). Alternatively, Goyanes et al. (2020) propose the Hellinger distance

$$\begin{aligned} \underset{\text {Hel.}}{\nabla }({\textbf{x}}) = \frac{1}{\sqrt{2}} \cdot \sqrt{\sum \limits _{i} \left[ \sqrt{p_i({\textbf{x}})} - \sqrt{p_i(E)} \right] ^2} \end{aligned}$$
(25)

which depends on a Euclidean interpretation of the space of probabilities.
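A minimal sketch of Eqs. 24 and 25 on invented distributions; note that, unlike \(\chi ^2\), both remain well defined when some \(p_i(E) = 0\).

```python
import numpy as np

def probabilistic_jaccard(p_x, p_E):
    """Probabilistic Jaccard intersectional distance (Eq. 24)."""
    return 1.0 - np.minimum(p_x, p_E).sum() / np.maximum(p_x, p_E).sum()

def hellinger(p_x, p_E):
    """Hellinger distance (Eq. 25)."""
    return np.sqrt(np.sum((np.sqrt(p_x) - np.sqrt(p_E)) ** 2)) / np.sqrt(2.0)

p_x = np.array([0.5, 0.3, 0.2, 0.0])   # observed disciplinary distribution
p_E = np.array([0.6, 0.4, 0.0, 0.0])   # benchmark expectation

print(round(probabilistic_jaccard(p_x, p_E), 3))
print(round(hellinger(p_x, p_E), 3))
```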

Neither of these alternatives accounts for the similarity of the categories z(i,j). Uzzi et al. (2013) account for it implicitly by deriving directly a measure of ‘atypicality’ for a combination of two categories within a body of research \({\textbf{x}}\):

$$\begin{aligned} \underset{\textit{Uzzi}}{\dot{\nabla }}(p_{(i,j)},\textbf{x})= \frac{p_{(i,j)}({\textbf{x}}) - e_{(i,j)}({\textbf{x}})}{\sigma [e_{(i,j)}({\textbf{x}})]}; \; p_{(i,j)}({\textbf{x}}) > 0 \end{aligned}$$
(26)

where the normalisation \(\sigma [e_{(i,j)}({\textbf{x}})]\) is the standard deviation of the benchmark values conditional on the observed couples of categories in the body of research. Note that while measures of IDR are generally unique values for each analytical unit, in this case there is a \(\underset{\textit{Uzzi}}{\dot{\nabla }}(p_{(i,j)},\textbf{x})\) for each combination of \(p_{(i,j)}\) and \({\textbf{x}}\), hence the dot over the inverted triangle. Since the \(\underset{\textit{Uzzi}}{\dot{\nabla }}(p_{(i,j)},\textbf{x})\) are distributed within each \({\textbf{x}}\), instead of summarising the distribution with a parametric estimator the authors derive two non-parametric statistics for measuring the dimensions of conformity and novelty of the \({\textbf{x}}\) body of research.

The Z matrix can be accounted for explicitly in the formula of Divergence, by substituting \(p_i\) with a \(\psi _{(i,Z)}\) such that:

$$\begin{aligned} \psi _{(i,z)}({\textbf{x}}) = \frac{\sum \limits _{j} p_j({\textbf{x}}) \cdot z(i,j)}{\sum \limits _{i} \sum \limits _{j} p_j({\textbf{x}}) \cdot z(i,j)} \end{aligned}$$
(27)

Finally, it is worth discussing standard methods of elicitation of P(E):

  • Goyanes et al. (2020) suggest adopting a degenerate distribution that is Uniform for \(p_i({\textbf{x}}) > 0\) and 0 for \(p_i({\textbf{x}}) = 0\). This solution may hold some formal properties and corresponds to a sort of uninformative expectation conditional on the nominal heterogeneity of categories in \({\textbf{x}}\), but there are virtually no real applications where one should expect a body of research to be perfectly balanced, so the assumptions behind this choice seem baseless.

  • Uzzi et al. (2013) derive P(E) through a randomised permutation of the reference lists that preserves the number of citations of each body of research, but randomly swaps which paper cites which other. Random permutations are valid null models for networks, and this solution may be preferable for citational approaches to the stylisation of \(P({\textbf{x}})\).

  • Cantone and Nigthintigale (2024) present the rationale for equating P(E) to the P(X) of a higher-level body of research that contains the analytical unit. For example, a journal X contains a paper \({\textbf{x}}\). This solution comes in handy for semantic approaches to stylisation, where there is a strong relation between the analytical unit and its container.

Cohesion and other topological dimensions

The positional role of a body of research is a dual construct: a topological place (position) that is occupied for a functional reason (role). Measuring the position of the body of research aims at inferring its disciplinary ‘role’. The easiest way to grasp the ambivalence of position and role is through a metaphor from the sciences of organisation: knowing the role of an agent in an organisation, it is possible to predict in which area it will be found; and observing where an agent is located, it is possible to infer its role in the organisation. Indeed, all studies concerned with the positional role of a body of research depend on the topological maps of scientific knowledge pioneered by authors such as Boyack et al. (2005); Leydesdorff and Rafols (2009); Porter and Rafols (2009); Börner et al. (2012), and Carusi and Bianchi (2020), and the notion of disciplinary position as a relevant attribute of IDR primarily concerns dimensions like centrality/peripherality or the density of the close neighbourhood. Indeed, evolutionary theories of knowledge postulate that the growth of IDR is primarily driven by a process of filling the most evident gaps in the development of scientific knowledge (Campbell, 2017; Hodgson, 2022; Hou et al., 2023). This immaterial process, which contrasts hyper-specialised novelty in research with an integration of the whole of scientific knowledge (as opposed to a highly local integration aimed at presenting novel ideas), is the dimension of disciplinary Cohesion.

Betweenness Centrality

In a network, the Betweenness Centrality (BC) of a node A is the frequency with which the node is crossed by the shortest paths connecting nodes B and C, where B and C range (ideally) over all combinations of other nodes in the network. The estimation of the BC ratio is left to algorithmic optimisation, but in all cases a high BC of \({\textbf{x}}\), here stylised as a node in the network, would signal:

  • the role of \({\textbf{x}}\) as a bridge between dense portions of the graph, ergo as a connector of different parts of the network;

  • that if \({\textbf{x}}\) were hypothetically removed from the network, it would be much harder to cross between two extremities of the network than if another node were picked at random and removed.

In particular, a network with a high average BC is highly dense and cohesive, so the epistemological role of a node with a high BC can be identified in making the whole network more integrated: this operational definition links BC to the dimension of disciplinary Cohesion.
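As an illustration, BC can be computed on a toy journal-level network with a standard graph library; the graph below is invented, and networkx is used here only as one common choice, not as the tool adopted by the cited studies.

```python
import networkx as nx

# Toy undirected network among six journals.
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"),   # a dense cluster
    ("C", "D"),                           # C and D bridge the two clusters
    ("D", "E"), ("D", "F"), ("E", "F"),   # a second dense cluster
])

bc = nx.betweenness_centrality(G, normalized=True)
print(sorted(bc.items(), key=lambda kv: -kv[1]))
# The bridging nodes C and D obtain the highest BC: removing either of them
# disconnects the two clusters, which is the intuition behind reading BC as a
# signal of disciplinary Cohesion.
```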

BC is one of the few measures of IDR that does not stylise \({\textbf{x}}\) as a distribution of p-frequencies across i disciplinary categories. Being relatively independent of the methodological definition of the taxonomy, BC seemed a very promising approach for measuring IDR (Leydesdorff, 2007), yet aside from criticisms common to all network-based measures (e.g. statistical inefficiency in estimation) it also suffers from very specific problems of statistical confounding of the positional role. As remarked by Leydesdorff (2007); Leydesdorff and Rafols (2011), and Leydesdorff et al. (2018), all measures of centrality are correlated by design. This means that a node cannot be in the position of a bridge without also possessing other features that are disconnected from the bridging role. The degree of confounding in a network is not problematic for the measurement of IDR in itself, but it becomes problematic in a causal model, because it does not allow an easy identification strategy (see Sect. “Causality in measures of IDR”).

It is particularly worrisome how BC can be obfuscated by a high Degree Centrality, that is, the number of nodes that are first neighbours of \({\textbf{x}}\). In a citational network, this confounding would reify problematic dynamics: bodies of research linking more with others (e.g. citing them) just for the sake of reputational accreditation (Tahamtan et al., 2016; Varga, 2022) would benefit from an inflated measurement of IDR. BC shares with Diffusion the doubtfully desirable feature that bodies of research receiving more citations would also be considered more interdisciplinary than those receiving no citations at all, making it an unfit metric even for contingent evaluations of a few units of analysis (Feller, 2006; Bark et al., 2016). Leydesdorff (2007) tried to overcome the problem of confounding with a normalisation based on the cosine of the adjacency matrices of the nodes, and later Leydesdorff and Rafols (2011) demonstrated how the nominal BC would promote the most cited journals as the most interdisciplinary too, while the normalised BC would promote journals in the fields of Information Science and Science and Technology Studies as the most interdisciplinary, coherently with these two fields touching methodological and substantive topics which connect more or less all the other scientific disciplines.

Coherence and cohesiveness

Cohesiveness is the name adopted by Bone et al. (2020) to generalise the measure of disciplinary Coherence originally theorised by Rafols and Meyer (2010) and then re-formulated by Rafols (2014). The original rationale for introducing Coherence was to complete the definition of IDR:

Research that by integrating knowledge, increases the coherence of Science.

with a counterpart to Diversity. Lately, Coherence has come to substitute BC as a measure of the topological dimension of ‘Cohesion’, encompassing the role of a ‘bridge position’ originally attributed to BC. In either case, the role of a cohesive body of research is to make scientific knowledge more compact (Egghe & Rousseau, 2003; Rousseau et al., 2019), facilitating novel discoveries by lowering the cognitive costs associated with venturing outside one's own disciplinary boundaries.

The generalised formula of Cohesiveness tries to adhere to the operational definition through an isomorphism with the formalism of Rao-Stirling for a complex taxonomy:

$$\begin{aligned} \Omega ({\textbf{x}}) = \sum \limits _{(i,j)} w_{(i,j)}({\textbf{x}}) \cdot \left[ 1 - z(i,j)\right] \end{aligned}$$
(28)

where \(w_{(i,j)}\) is conceptualised as the relative ‘intensity’ existing between disciplines i and j once they meet around the body of research \({\textbf{x}}\). What does ‘intensity’ mean in the context of a topological measure of IDR? Rafols (2014) provides an operational definition of intensity for citational stylisations as the frequency of citations between categories i and j within the network of references of \({\textbf{x}}\). This definition connects Cohesiveness with the role of the analytical unit \({\textbf{x}}\) as a connector of i and j. More advanced approaches can distinguish which category cites which, but no valid operational definition has yet been proposed for semantic stylisations of IDR.
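A minimal sketch of Eq. 28, assuming the intensity weights \(w_{(i,j)}\) have already been obtained (e.g. as normalised counts of citations between categories within the references of \({\textbf{x}}\)); the numbers below are invented.

```python
import numpy as np

def cohesiveness(W, Z):
    """Generalised Cohesiveness of Eq. 28.

    W : matrix of relative intensities w_(i,j) between categories i and j.
    Z : matrix of similarities z(i, j).
    """
    return float(np.sum(W * (1.0 - Z)))

# Toy example with three categories.
W = np.array([[0.0, 0.3, 0.1],
              [0.3, 0.0, 0.1],
              [0.1, 0.1, 0.0]])          # off-diagonal intensities sum to 1
Z = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

print(round(cohesiveness(W, Z), 3))
# Intensity concentrated on dissimilar pairs (low z) raises Omega.
```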

An additional criticism of the measure is mentioned in Bone et al. (2020): differently from the parametrisations of Generalised Diversity, \(\Omega\) has neither a finite range nor a well-defined \(\arg \max\), so it could be harder to evaluate.

Timed Novelty

The dimension of Timed Novelty lies between the statistical characterisation of Novelty as nonconformity and the role of novel bodies of research in trailblazing future paths for research. So, not only does it involve the identification of a role, but it explicitly models this role within a spatio-temporal (not only spatial) topology. I included it as an example of a topological dimension that does not coincide with Cohesion, but still employs a measure relying on a network-based stylisation of \({\textbf{x}}\). This measure was presented by Wang et al. (2017) as an alternative to the non-parametric method of estimation of disciplinary Novelty originally presented by Uzzi et al. (2013). It can be better explained as an algorithm than as a proper mathematical formula (a minimal sketch in code follows the list):

  1. Given a network of scientific papers, set their date of publication t as an attribute of the nodes.

  2. For each occurrence \(p_{(i,j)} > 0\) within the sample of observational units, select exclusively those with \(\min (t)\). These are the first instances where a positive frequency \(p_{(i,j)}\) is observed within a unit of analysis. For each of these, record a \(t_0(i,j)\).

  3. The topological novelty of \({\textbf{x}}\) is equal to the

     $$\begin{aligned} \sum \limits _{(i,j)} t_0(i,j) \cdot [1-z(i,j)] \end{aligned}$$

     associated with \({\textbf{x}}\).
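The following is a minimal sketch of the procedure above; the papers, their category sets, and the z similarities are all invented, and in real applications the category pairs would be derived from the reference lists.

```python
from itertools import combinations

# Invented sample: each paper has a publication year and the disciplinary
# categories observed in its reference list.
papers = {
    "p1": {"year": 2001, "cats": {"BIO", "CHEM"}},
    "p2": {"year": 2005, "cats": {"BIO", "CS"}},
    "p3": {"year": 2010, "cats": {"BIO", "CHEM", "CS"}},
}
z = {frozenset(c): s for c, s in [
    (("BIO", "CHEM"), 0.7), (("BIO", "CS"), 0.2), (("CHEM", "CS"), 0.3)]}

# Step 2: record t0(i, j), the first year each pair appears in the sample.
t0 = {}
for paper in papers.values():
    for pair in map(frozenset, combinations(sorted(paper["cats"]), 2)):
        t0[pair] = min(t0.get(pair, paper["year"]), paper["year"])

# Step 3: the timed novelty of x is the sum of t0(i, j) * (1 - z(i, j))
# over the category pairs it contains (as in the formula above).
def timed_novelty(name):
    pairs = map(frozenset, combinations(sorted(papers[name]["cats"]), 2))
    return sum(t0[pair] * (1.0 - z[pair]) for pair in pairs)

for name in papers:
    print(name, round(timed_novelty(name), 1))
```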

However, the drawbacks of this method outnumber its merits:

  • It is not interpretable as a proper topological measure for any stylisation of the taxonomy that does not rely on observing the reference lists.

  • Moreover, the innovation of recombination requires calibration against a measure of Diffusion or scientific impact to be effective. How can a paper be a pioneer if nobody cited it?

  • It seems very sensitive to the granularity of the taxonomy, and inappropriate for low-granularity taxonomies.

  • It is sensitive to the characteristics of the sample of analytical units: in a sample containing only recent papers, their timed novelty scores may be very misleading.

In addition, topological novelty seems to fail to show empirical concordance with qualitatively assessed novelty in Bornmann et al. (2019) and Fontana et al. (2020), questioning the epistemic validity of this measure.

Criticism of the topological measures as statistical methods

Topological measures offer an alternative to the traditional frequentist stylisation of disciplinarity. But \(P({\textbf{x}})\) exists precisely to reduce the computational complexity of (to ‘stylise’) the observable facts; due to the scale of data involved in operating with sets of links, networks instead demand substantial computational resources. Despite these challenges, advancements in technology may offer hope for improved computational capabilities. In fact, the fundamental criticisms of topological measures concern the statistical, not the computational, efficiency of network-based stylisations.

The logic behind identifying the sampling frame follows a straightforward premise: a topological measure of IDR provides insights into a node's role by examining its connections to other nodes. Thus, it is essential to gather a finite sample of nodes such that the estimates of the topological measures approximate the values that would ideally be observed if the entire population were available. The issue of statistical efficiency revolves around determining the number of nodes required to obtain these representative statistics. Unfortunately, the answer can be simplified to just “a lot”: papers typically have a relatively small number of references; consequently, even when references are missing at random and the estimate is expected to be unbiased, the total measurement error due to missing references is rarely trivial. Network-based metrics are rarely unbiased for underpowered sample sizes of the links (Lee et al., 2006; Stumpf et al., 2005). More robust results may be obtained when collectives are considered as the analytical unit. In this case reference lists can be united, and the overall distribution of a sample of these lists can still yield representative results. However, even in this case the research design would still require sampling a remarkable quantity of papers for each journal, so the computational burden of the analysis would remain a relevant factor in the feasibility of the study.

This issue is compounded by how pre-existing sources handle the monitoring of bibliometric records. It is unrealistic to expect missing values to be mutually independent; databases may lack specific journals from certain disciplines. In practice, even relatively simple research designs necessitate access to computational resources for querying extensive networks.

Beyond the measurement: estimation in collectives, aggregation of measures, causal models

This concluding section addresses advanced topics. These may not always be necessary for the task of measuring IDR, but they are useful for understanding the advantages of thinking systemically about measurement. It explores methods for appropriately handling the estimation of collective units of analysis and emphasises the significance of causal thinking in constructing models of measurement and in relating IDR to other variables.

Estimation in collective analytical units

The term \({\textbf{x}}\) has been consistently employed without distinguishing between elementary papers and collectives. However, this distinction is important for timed measurements: papers are static bodies of research, and only collectives exhibit dynamism over time in their units of observation. This concept echoes the philosophical paradox of Theseus' Ship: the identity of a plank of wood can be reduced to a shaped object made of a material, but the ship can undergo complete replacement of all its planks and yet retain its original identity. Similarly, authors or journals can maintain a distinct disciplinary identity while their intellectual output inevitably evolves over time.

Disciplinary evolution is pertinent to the dimension of Integration because the idea that the absence of variation in frequencies in the presence of high nominal differences is a signal of synergy has been questioned (Leydesdorff & Ivanova, 2021). A reanalysis of the concept of Integration highlights the necessity of observing integration as a dynamic process, not as a finite output of metadata. An integrated system may be stable and efficient, but a system capable of integration evolves by evaluating, assimilating and finally integrating novel elements from its proximal environment, a process akin to a form of growth by learning. Maybe, for the practical purposes of measuring IDR (a “mode of research”, not an outcome of research), a method capable of observing the capacity of a collective to integrate novelty over time fits better than any static analysis of Effective Pluralism (Brigandt, 2013; O'Rourke et al., 2016).

In general, there are two methods to summarise a measured feature in a collective: the method of the union of the parts, or pooled estimation, and the method of averaging the measurements of the parts, or within estimation (a minimal sketch of both estimators follows the list). Let M be the measure and \({\textbf{X}}: \{{\textbf{x}}_1, {\textbf{x}}_2, \dots \}\) be the collective analytical unit; then

  • Pooled estimation:

    $$\begin{aligned} \underset{POOL}{M}({\textbf{X}}) = M \left( \bigcup {\textbf{x}}\right) \end{aligned}$$
    (29)

    where \(P \left( \bigcup {\textbf{x}}\right)\) is achievable by the methods of uniting reference (and citation) lists, or by averaging units of observation as in Eq. 12, etc. In general, unions of units of analysis involve operations on the units of observation.

  • Within estimation:

    $$\begin{aligned} \underset{WITH}{M}({\textbf{X}}) = {\bar{M}}({\textbf{x}}) \end{aligned}$$
    (30)

    In this case, the operation of averaging regards the outputs of the measure on the elementary units of analysis, not the units of observation.
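A minimal sketch of the two estimators for \(q = 2\) Diversity, assuming a naive (identity) Z and invented paper-level distributions:

```python
import numpy as np

def effective_diversity(p):
    """q = 2 Diversity with a naive (identity) Z: inverse of the repeat rate."""
    p = p[p > 0]
    return 1.0 / np.sum(p ** 2)

# Invented collective X: three papers over the same four-category taxonomy.
X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])

pooled = effective_diversity(X.mean(axis=0))            # Eq. 29
within = np.mean([effective_diversity(p) for p in X])   # Eq. 30
between = pooled - within                               # Eq. 31

print(round(pooled, 3), round(within, 3), round(between, 3))
print(round(between / pooled, 3))   # thematic share of Eq. 32
```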

Similarly to Variance, Diversity has a long tradition of studies on the distinction between these two quantities (Lieberson, 1969; Patil & Taillie, 1982; Mcdonald & Dimmick, 2003; Daly et al., 2018). In particular, it holds

$$\begin{aligned} \underset{POOL}{\Delta }({\textbf{X}}) > \underset{WITH}{\Delta }({\textbf{X}}); \; \forall {\textbf{X}} \end{aligned}$$

so the following positive quantity always exists

$$\begin{aligned} \underset{BETW}{\Delta }({\textbf{X}}) = \underset{POOL}{\Delta }({\textbf{X}}) - \underset{WITH}{\Delta }({\textbf{X}}) = \Delta (\bar{{\textbf{x}}}) - {\bar{\Delta }}({\textbf{x}}) \end{aligned}$$
(31)

acting as the between estimator of Diversity. Cassi et al. (2014, 2017) propose to consider exclusively Within Diversity as the proper measure of Integration in collectives. Pooled Diversity does not qualify as a measure of integration because it cannot distinguish between cases of co-occurrence (multidisciplinarity) and coexistence (interdisciplinarity) of disciplines within a collective (see the example in Fig. 4).

Fig. 4
figure 4

Pooling papers together confounds the internal patterns of coexistence of disciplines. Taken singularly, the papers produced by Group B would look much more interdisciplinary than the papers of Group A, but by pooling all the papers together as units of analysis, the Diversity of the two groups would be equal

Finally, Cassi and his co-authors propose to define Between Diversity as a measure of “thematic diversity” after recognising a formal isomorphism with formulas of multidimensional inertia, a statistic that captures the multidimensional ‘spread’ of the units of observation. Thematic diversity is the component of diversity unconnected to the formation of integrated knowledge. The simplest way to understand the dimension of thematic diversity is through the ratio

$$\begin{aligned} \frac{\underset{BETW}{\Delta }({\textbf{X}})}{\underset{POOL}{\Delta }({\textbf{X}})} \end{aligned}$$
(32)

that is, the percentageFootnote 7 of the unrealised potential of coexistence (collaboration) of disciplines within the collective.

Measures based on \(P({\textbf{x}})\) can easily adopt the within estimator, but the same does not apply to network-based measures, where it is not obvious whether the average of M is a superior estimator to the M of the union. Consider papers: their static nature renders them rather unsuitable for operational definitions involving BC or Cohesiveness, as the citational network of a paper expands solely through processes of diffusion, while its reference list is fixed. In contrast, the Cohesion of a collective unit grows through processes of citational integration and diffusion. Additionally, as already mentioned, collectives lend themselves to temporal discretisation, enabling dynamic evaluations of lagged differences (Rafols & Meyer, 2010). Here a question arises naturally: should the change be evaluated on temporally disjointed units (time-within), or on the cumulative state of the whole unit, differentiated by the marginal change (updating metadata by pooling)? Subtracting the average value of Semantic Diversity of a previous volume of a journal to identify the change over time makes sense. However, for measures of Cohesion, pooling emerges as the preferred method. Even if a paper might qualitatively lose social relevance over time, the information embedded in its citational network remains intact (or even grows). This fact suggests it is inappropriate to identify changes in Cohesion as if the publication of two volumes of the same journal were totally separate events. The contribution of the publication of a new volume (or its equivalent for other collectives) can be evaluated through the marginal shifts in the pooled citational network of the unit of analysis. These new possibilities necessitate further reflections on the role of causality in models of measurement of IDR.

Causality in measures of IDR

Causality is a complex philosophical theme that has accompanied the development of the scientific method since its beginnings. Any well-intentioned adoption of an indicator must at some point confront its use within a causal framework, and the promise of causality-based decision-making is a clause that helps to justify measurements. On this topic, the Introduction mentioned the dual role of IDR in research policies:

  • for the comparative evaluation of static entities to inform contingent decision-making or to present rankings aimed at incentivising the adoption of desirable social behaviours;

  • for assessing the validity and robustness of theories of intervention

Static measurement models are useful for comparing different units of analysis, but an effective policy evaluation requires justifying interventions based on a verifiable theory. In the context of intervention in research policies, measured variables must be understood in their causal scheme (Yue & Wilson, 2004; García-Romero, 2006).

The role of causality can be tricky in the context of models of measurement because it encompasses two potential situations:

  • canonical causal inference involves the correct specification of the estimation process for the parameters of effect size, relating distinct phenomena;

  • causal schemes of path analysis are also progressively more employed directly within functions of aggregation of dimensions of complex phenomena. This is the case for composite indexes made of synthetic values (Karagiannis, 2017; Greco et al., 2019).

These two situations are not even clearly distinct: articulated research designs are deployed to test complex theories, for example attempting to isolate the effect of one dimension of IDR from the others (see Fig. 5B).

Proceeding in order, there is nothing inherently special about causal inference involving IDR (or just one of its dimensions) as the exposure or the outcome variable. The role of parametric models in causal inference is well covered by an extensive literature (Pearl, 2015; Bellemare et al., 2017; Cinelli et al., 2022; Leszczensky & Wolbring, 2022; Kunicki et al., 2023). Anybody can draft a preliminary causal inference from an exposure (independent variable) to an outcome (dependent variable) by following a few general principles:

  1. Between every cause and every effect, there should be a time lag.

  2. All confounder variables must be controlled in the regression.

  3. No collider variable must be controlled in the regression.

  4. Mediators are controlled only if required by the research question. If one controls for a mediator, the estimated effect size of the exposure on the outcome must be evaluated as independent from the specific effect of the mediator.

These principles can be checked on the Directed Acyclic Graph (DAG) of Fig. 5A. DAGs do not encompass all possible cases of causal inference, but they do cover all cases where the causal relationship is not subject to complex (“cyclic”, here implying feedback loops) dynamics. A minimal simulated illustration of principle 2 follows.
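The sketch below uses entirely synthetic data, not tied to any bibliometric source, and contrasts a naive regression of an outcome on an exposure with a regression that also controls for the confounder.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Synthetic causal structure: Z confounds both exposure X and outcome Y;
# the true causal effect of X on Y is 0.5.
Z = rng.normal(size=n)                       # confounder
X = 0.8 * Z + rng.normal(size=n)             # exposure
Y = 0.5 * X + 1.2 * Z + rng.normal(size=n)   # outcome

def ols_coef(y, *regressors):
    """Least-squares coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(A, y, rcond=None)[0]

print(round(ols_coef(Y, X)[1], 2))      # naive estimate, biased upwards
print(round(ols_coef(Y, X, Z)[1], 2))   # adjusted estimate, close to 0.5
```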

Fig. 5
figure 5

Causal models in measurement of IDR. The squares represent measured quantities, while the ovals are conceptual variables

The second case for causality in measurements is related to the final step of modelling a measurement: the aggregation of the measures into a unique synthetic value for the unit of analysis. This operation is not strictly necessary, but sometimes a synthetic value is more interpretable than a list of metrics, even if it may hide significant variability across them. There are two major causal paradigms for aggregation (Hardin et al., 2008):

  • Reflective models (see Fig. 5C): measures of the construct of IDR are reflections of a latent feature that has a parametric role in the generative process of all the measures. This common generative parameter is, ideally, the closest object to the ideal synthetic measure of IDR, so the role of the analyst is to apply a method to estimate this hidden parameter from the evidence gathered through measuring the dimensions (Bagozzi, 1981; Edwards & Bagozzi, 2000).

  • Formative models (see Fig. 5D): the construct of IDR depends on the social consensus about the dimensions falling under its semantics. So the role of the analyst is to reconstruct an instance of aggregation of the measures that is coherent with the stated research design, and then to justify this specification of the aggregation against competing alternative models of aggregation (Diamantopoulos & Winklhofer, 2001; Podsakoff et al., 2006; Diamantopoulos et al., 2008).

As can be seen in Fig. 5, the immediate distinction between the two paradigms lies in the causal direction between the dimensions and the synthetic value of IDR. In reflective models, IDR causes the measurements, and the dimensions are just clusters of measurements with similar assumptions. In reflective models, a positive correlation between measures (represented in Fig. 5C through a dashed line) is a sign of the validity of the assumptions, which also acts as a test of them: the absence of correlations among variables implies the implausibility of the assumptions of the reflective model (Bagozzi, 2007), and measures negatively correlated with the others can even be removed from reflective models as not pertinent.

In formative models, a change in the weight of a dimension causes a change in the social perception of what IDR is about. In this case, positive correlations are problematic because they are signals of redundancy in the measurement model, and a miscalibration of the weights may bias the estimation of the synthetic value for the composite index of IDR (Coltman et al., 2008).

On a philosophical note, reflective models are associated with an essentialist conception of IDR: latent but stable. Formative models are associated with a constructionist conception of IDR: something emergent that can mutate following the complex social dynamics of epistemic recognition (Jarvis et al., 2003). Decision-makers may have an ambivalent affinity with the two philosophies of composite indexes: essentialist approaches assure that decisions follow a realist (and not arbitrary) line of conduct, while constructivist approaches allow transparency and theory in interventions through explicit methods of valuation of IDR, where all processes can be reproduced from the original measurements to the synthetic value (Hardin et al., 2008).

Estimation of reflective models

The estimation of a generative latent construct relies on the validation of a (mono)factorial model (Confirmatory Factor Analysis), eventually estimated and evaluated through a decomposition of the matrix of measures (e.g., a Principal Component Analysis, PCA), looking for the first component. In relating variables, these models fundamentally follow the same causal schemes as causal inference on Directed Acyclic Graphs (DAGs), formalising all the causal paths from the generative parameters to the measures through Structural Equation Models (SEM), with the difference of preferring the Partial Least Squares (PLS) estimator to a linear estimatorFootnote 8 (Guan & Ma, 2009). The final aim of the path analysis is to deconfound the estimates for the measures from the inherent collinearity induced by the common generative factor (Wetzels et al., 2009; Hair et al., 2011, 2020; Henseler et al., 2016). Clearly, within this scheme, there is only one true generative model (essentialism), even if the analysis can tolerate deviations from the hypothetical true model as long as the results are coherent and the factorial model is validated.
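A minimal sketch of the first-component route only (not a full CFA or PLS-SEM), on synthetic standardised measures; scikit-learn's PCA is assumed available and is used purely as an example of extracting the first component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n = 200

# Synthetic reflective structure: one latent factor generates three measures.
latent = rng.normal(size=n)
measures = np.column_stack([
    0.9 * latent + 0.3 * rng.normal(size=n),
    0.8 * latent + 0.4 * rng.normal(size=n),
    0.7 * latent + 0.5 * rng.normal(size=n),
])
measures = (measures - measures.mean(axis=0)) / measures.std(axis=0)

pca = PCA(n_components=1)
scores = pca.fit_transform(measures).ravel()   # first-component scores

# The first component recovers the latent factor up to sign and scale.
print(round(abs(np.corrcoef(scores, latent)[0, 1]), 2))
print(np.round(pca.explained_variance_ratio_, 2))
```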

Reflective models, while sometimes perceived as rigid, exhibit an inherent isomorphism with DAGs (Kunicki et al., 2023). This characteristic facilitates the assessment of articulated regressions, because the role of multiple variables in the causal model can be readily evaluated just by looking at the path diagram. Among exotic formative DAGs of IDR, a particularly remarkable case is that of models exploring causal hypotheses where one dimension of IDR acts as an intervening variable and another as the explainable outcome. As an example, thinking of D'Este and Robinson-García (2023), consider the following research question: “Do IDR papers exhibit greater diffusion across diverse disciplinary communities?” My interpretation of these cases suggests that since, potentially, numerous dimensions of IDR may contribute as concauses, but only one as the outcome, it must implicitly be assumed that the dimension involved in the outcome is left out of the operational definition of IDR when aggregating a monofactorial construct from the other measures of IDR.

Formative models are calibrated, not estimated

The PLS-SEM framework has been proposed for formative models (Wetzels et al., 2009), but there is a fundamental tension between the methodology and the assumptions of formative models: SEM still relies on latent variables measured through the decomposition of the observables into principal components. The assumptions of factor analysis and PCA are antithetical to formative models (Mazziotta & Pareto, 2019), so a formative SEM does not fall very far from a linear regression analysis with multiple independent variables (Greco et al., 2019), visually supported by a DAG.

In principle, if \(\ddot{M}\) is the composite index made of the M measures, any model that satisfies the following equation counts as a formative model

$$\begin{aligned} \ddot{M}({\textbf{x}}) = \bigoplus \left[ \ddot{\mu }_1({\textbf{x}}) \cdot \ddot{w}_1, \ddot{\mu }_2({\textbf{x}}) \cdot \ddot{w}_2, \dots \right] \end{aligned}$$
(33)

where \(\bigoplus\) is an aggregative function, the \(\ddot{\mu }\) are standardised measurements, and the \(\ddot{w}\) are weights assessing the relevance of each measure in the model (Rogge, 2018). Within this general framework, even simple functions such as the arithmetic or the geometric mean of the standardised measures count as valid aggregative functions. The causal structure is entirely constructed through the algorithm calibrating the weighting scheme, which typically aims at penalising correlated measures, but can follow many epistemological principles (Hagerty & Land, 2007; Zhou et al., 2010; Foster et al., 2013; Zanella et al., 2015; Becker et al., 2017).

It is necessary to pay attention to the implicit assumptions behind the aggregation function \(\bigoplus\). Linear functions like the arithmetic mean (or any other normalisation of sums) imply the assumption of compensability: the same state of the output (e.g. IDR) can be achieved through an exchange of states between two dimensions. If this is true, then it is logical that a policy can intervene to amplify one dimension at the cost of reducing another. So, by assuming a perfectly compensatory formative model of IDR, the analyst is also assuming the existence of an optimal state of the dimensions and is allowing policymakers to design interventions that increase exposure to one dimension at the sacrifice of another, as long as this is justified by a benefit-cost analysis aimed at reaching the optimal state. If the analyst rejects the assumption of compensability of the dimensions, then they can opt for non-compensatory functions (Vidoli et al., 2015). The Human Development Index is a case of a highly successful synthetic index that shifted from the compensatory arithmetic mean to a semi-compensatory alternative, the geometric mean, which penalises the measurement of the construct in the presence of strong deviations between the standardised measures.
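A minimal sketch of Eq. 33 with two common choices of aggregation function, on invented measures rescaled to a positive range (a requirement of the geometric mean); the equal weights are purely for illustration.

```python
import numpy as np

def aggregate(measures, weights, kind="arithmetic"):
    """Formative aggregation (Eq. 33) with two example aggregation functions."""
    m, w = np.asarray(measures), np.asarray(weights)
    if kind == "arithmetic":                    # fully compensatory
        return float(np.sum(w * m) / np.sum(w))
    if kind == "geometric":                     # semi-compensatory
        return float(np.prod(m ** w) ** (1.0 / np.sum(w)))
    raise ValueError(kind)

weights = [1.0, 1.0, 1.0]
balanced   = [0.6, 0.6, 0.6]        # three dimensions at the same level
unbalanced = [1.0, 0.7, 0.1]        # same arithmetic mean, one weak dimension

for m in (balanced, unbalanced):
    print(round(aggregate(m, weights, "arithmetic"), 3),
          round(aggregate(m, weights, "geometric"), 3))
# The arithmetic mean is identical for the two profiles (0.6), while the
# geometric mean penalises the unbalanced one.
```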

Conclusions

This study offered a design that, given access to data and a chosen taxonomy, allows the computation of statistics for a very wide array of research questions concerning IDR. Remarkably, differently from other reviews on methods of quantification of IDR, it says nothing about the selection of data providers and taxonomies. It also covers only a relatively small range of theories on the dimensions of IDR. For example, Leydesdorff and Ivanova (2021) elaborated a theory of measurement for the disciplinary Synergy of a body of research that combines theoretical insights from both Integration and Cohesion. While this theory has not been further developed, this is another occasion to remark that the dimensions of IDR are not very well demarcated, and lie in the eye of the beholder.

More precisely, these elements are highly specific to the research question. Anything defining (and defined by) the research question should be evaluated separately from the goodness of the design of the model of measurement. Nevertheless, for classifying papers (and other documents) and for calibrating similarity coefficients between disciplines, I would venture to say that we will rely increasingly on the technology of Large Language Models and less on citation networks. I bet on a future where certain processes of classification (emphasis on processes, not outputs) in scientometrics will be gradually standardised and made accessible through initiatives funded by government agencies engaged in international cooperation programmes and by private foundations supporting research.

Instead, clarity in the operational definition will be the distinctive feature of good models of measurement of IDR, especially regarding the quantity and quality of the dimensions. I stumbled upon an intriguing point for reflection in the taxonomy of dimensions of disciplinarity in Sugimoto and Weingart (2015). In this article, disciplines are defined as cognitive processes, grouping factors, semantic standards, boundaries of separation, reifications of traditions, and even proper institutions. Cognition can be integrated, and traditions can be challenged. Boundaries can be connected, but there are probably still gaps in understanding which dimension of interdisciplinarity pertains to the semantics of interdisciplinarity itself. As foreshadowed in the Introduction and in Sect. Methods of semantic classification and surveys, one cannot define oneself as ‘interdisciplinary’ and expect that this satisfies a formal criterion of classification, so I would encourage theoretical and qualitative research in this area. Similar considerations may hold for the concept of disciplines as institutions, since processes of institutionalisation encompass semantics, cognition, and physical and social distances (Bone et al., 2020).