1 Introduction

Corruption, the misuse of entrusted power for private gains, is a major obstacle to sustainable development, as it affects each of its five pillars: people, planet, prosperity, peace, and partnerships. Estimates of the costs of corruption show that corruption costs developing and developed countries more than the US$ 10 trillion required to end poverty by 2030 (World Economic Forum, 2016). These large estimates provide an immediate understanding of the magnitude and effects of corruption in relation to the Sustainable Development Goal (SDG) targets. Resources lost through corruption may instead be used to ensure that everyone has equal access to basic services like health care, education, clean water (UNODC, 2018). By diverting resources intended for public goods and services to personal gain, corruption leads to inefficiencies, misallocation of resources, and substandard service delivery. It creates barriers to economic development, by distorting markets and hindering fair competition. It discourages both domestic and foreign investment, limiting economic growth and stifling entrepreneurship. The indirect consequences of corruption are even larger. It undermines the rule of law by distorting legal and regulatory frameworks. It weakens accountability mechanisms and hampers the effective functioning of democratic systems. Corruption dwindles public institutions by compromising their integrity, undermining their effectiveness, and eroding public trust in them.

The adoption of the 2030 Agenda for Sustainable Development was a major breakthrough for the anti-corruption movement (UNODC, 2018) because it introduced for the first time an explicit link between corruption reduction and peaceful, just and inclusive societies by capitalising on the importance of promoting transparency, accountability, and anti-corruption. Indeed, tackling corruption is functional to develop effective, accountable, and transparent institutions, to guarantee efficient allocation of resources, to ensure fair and equal opportunities of access to resources, to boost public institutions and foster trust in them.

Tackling corruption is key to achieving particularly SDG 16, devoted to Peace, Justice and Strong Institutions. Goal 16 has five key targets, which are especially relevant for our research context:

  • Significantly reduce illicit financial and arms flows, strengthen the recovery and return of stolen assets and combat all forms of organised crime (Target 16.4);

  • Substantially reduce corruption and bribery in all their forms (Target 16.5);

  • Develop effective, accountable and transparent institutions (Target 16.6);

  • Ensuring responsive, inclusive, participatory and representative decision making (Target 16.7);

  • Ensure public access to information and protect fundamental freedoms, in accordance with national legislation and international agreements (Target 16.10).

The monitoring of these targets is entrusted to a pool of indicators substantially different in nature and capturing, both objectively and subjectively, a specific dimension of a complex and unobservable latent phenomenon (United Nations, 2016). Specifically, the monitoring of Target 16.4 is assigned to two indicators, the total value of inward and outward illicit financial flows (Indicator 16.4.1) and the proportion of seized, found or surrendered arms (Indicator 16.4.2). Direct experiences of corruption and bribery by individuals (Indicator 16.5.1) and businesses (Indicator 16.5.2) in their interactions with public officials are tasked with the monitoring of Target 16.5. Primary government expenditures as a proportion of original approved budget (Indicator 16.6.1) and the proportion of population satisfied with their last experience of public services (Indicator 16.6.2) are chosen within Target 16.6. Two further indicators are selected for the monitoring of Target 16.7, that is, the proportional representation of various demographic groups (i.e., by sex, age, disability) in national and local institutions, including the legislatures, the public service and the judiciary (Indicator 16.7.1) and the proportion of population who believe decision-making is inclusive and responsive (Indicator 16.7.2). Finally, the monitoring of Target 16.10 is given to the number of verified cases of killing, kidnapping, enforced disappearance, arbitrary detention and torture of journalists (Indicator 16.10.1) and to an indicator measuring the number of countries that adopt and implement a legal framework for public access to information (Indicator 16.10.2).

The choice of these indicators mirrors the multifaceted nature of corruption and the complexities of any actions directed at tackling it in view of developing peaceful, just and strong institutions. Corruption, like several other phenomena accounted for to improve sustainable development, is a complex latent phenomenon and very challenging to quantify. It encompasses a wide range of economic and social dimensions that are interconnected and can have indirect and long-term impacts not immediately quantifiable. Indeed, measurement instruments to quantify corruption are wide and diversified and embrace—other than experience-based measures, such as Indicators 16.5.1 and 16.5.2 included in the SDG 16—subjective or perception-based measures, judicial-based measures and statistical inference proxies (Gnaldi et al., 2021). Perception-based measures—such as the Transparency International’s Corruption Perception Index (CPI)—are widely used in comparative studies, but are questioned on many grounds. Although CPI scores generally correlate with objective measures such as citizen reported experience with bribery, the literature (Rose & Peiffer, 2012) underlines that perceptions may not reflect factual involvement in corruption crimes. Besides, perceptions can be driven by the general sentiment reflecting prior economic growth (Kurtz & Schrank, 2007) or media coverage of cases or scandals of corruption (Mancini et al., 2017). In recent years, several proposals have been made to identify more “objective” indicators of corruption and overcome the limitations of subjective measures, such as experience-based surveys (such as Transparency International’s Global Corruption Barometer), experimental approaches, conviction rates (Fiorino et al., 2012), context analyses (e.g. newspapers and press reports), and proxy approaches (e.g., Golden & Picci 2005, Olken 2007) including corruption risk indicators.

In this paper, we focus on these latest generation of corruption indicators, the so-called red flag indicators of corruption risks, and propose a method to assess their degree of validity. Red flags are proxy measures for corruption signalling risks of corruption, rather than actual corruption. They are expected to be correlated with corrupt practices, rather than perfectly matching them (Fazekas & Kocsis, 2020; OECD, 2019).

Red flags, as well as many indicators employed to measure sustainable development, can be interrelated. The presence of one red flag may indicate a higher likelihood of other associated risks, just as progress in one area can have implications for others when measuring sustainable development progress. Capturing these interlinkages and understanding the complex relationships among red flags require careful analysis and measurement frameworks that account for systemic relationships among them. Besides, red flags—as well as many other indicators employed to measure SDG targets—often involve subjective judgements and interpretation. Determining which factors or behaviours constitute red flags and how they should be assessed requires expert judgement and contextual understanding.

In line with the previous considerations, in this paper we propose a method to assess the degree of validity of red flags for corruption risk with respect to their internal coherence and based on a procedure grounded on the framework of multidimensional Item Response Theory (IRT) models (specifically, by means of the multidimensional graded response model). This framework is particularly appropriate for the context at issue as it allows us to study the extent to which corruption measures capture the higher-level theoretical construct by assuming that the associations among items (i.e., red flags) are represented by a latent and unobservable trait (corruption).

The article is organised as follows. In Sect. 2 we go into greater detail about red flag indicators and discuss the ways their validity is addressed in the literature and in the present paper. Section 3 is devoted to introduce the IRT framework and to describe the specific IRT model used in our case study, whereas Sect. 4, after introducing the data at hand and the selected red flag indicators, reports the results of the dimensionality assessment, whose discussion is provided in Sect. 5 together with the main conclusions.

2 Red Flag Indicators of Corruption Risk

In this section we further explore the broad framework of red flag indicators of corruption risk (Sect. 2.1) and examine how the literature and this paper tackle the issue of their validity (Sect. 2.2).

2.1 General Framework

International experts and policy makers who work in the field of corruption have come to recognise over the past 20 years that prevention and repression are both important components of effective corruption control  (Carloni, 2017; Gnaldi & Del Sarto, 2019; Gnaldi et al., 2021; Gallego et al., 2021). When a corrupt act is committed, repression steps in to punish it. Differently, corruption prevention seeks to identify and reduce opportunities for corruption. Concurrently, the so-called red flags of corruption risk, a new generation of indicators that are thought to be more accurate and effective than pre-existing measures, have been developed (Mungiu-Pippidi, 2016). Instead of ex-post assessing the “amount of corruption”, the primary goal of such metrics is an ex-ante identification of situations potentially vulnerable to corruption.

In order to identify corruption risks and establish practical mitigation solutions, corruption risk assessment systems are progressively built around corruption red flags. They have been increasingly empowered by both data availability in a machine-readable format and the development of new technologies based on the collection and cross-processing of public data sources (Gnaldi et al., 2021). For example, the reader can refer to the system set up by the Italian anticorruption authority (ANAC) within the project “Measurement of corruption risk at territorial-level and promotion of transparency”.Footnote 1

Red flags are typically used to measure the likelihood of corruption in the public procurement process, through which public authorities acquire goods, products, or services from businesses. A number of occurrences in the public procurement process—such as a delay or failure to award a contract, an overuse of emergency procedures, contract extensions, the repetition of small assignments for the same item, etc.—can indicate potentially risky scenarios to watch out for. In fact, one of the areas of government where corruption is most likely to occur is public procurement. Every stage of the procurement process is susceptible to integrity concerns (OECD, 2016), which are aggravated by the economic volume of transactions and involved financial stakes.

Corruption in public procurement aims at directing the contract secretly and repeatedly toward the preferred bidder. The primary aim of institutionalised grand corruption is rent extraction. Rents can be obtained in public procurement by pre-selecting businesses, who then make additional profit by charging higher than average market prices for the contract object delivery (Abdou et al., 2021). Information on the price and amount of procured deliveries—which is typically present in most administrative public procurement databases, though not comparable over time and space—would be needed in order to assess corruption risk by quantifying extra-profit.

As a result, other indicators of corruption risk are put forth, such as the circumstance that a procurement procedure receives a single offer by a single bidder. Indeed, it has been demonstrated that the number of bids submitted, particularly when there is only one submitted bid, is related to the likelihood of corruption (Klasnja, 2015), and this relationship has been widely used in the literature as a proxy for corruption. Additional frequent risk indicators are (Fazekas & Kocsis, 2020): non-open or exceptional procedure forms, such as direct awards, that allow for the suppression of competition and the steering of contracts to the chosen bidders (Auriol et al., 2016); little time given for bid advertisements, which may prevent unaffiliated bidders from developing suitable offers in time; absence of publication of the call for tenders (where publishing is voluntary); short time intervals to award contracts, as hasty decisions may indicate preconceived evaluation; subjective and difficult-to-quantify evaluation criteria (used in place of price-related evaluation criteria), which indicate some discretionary margins and may restrict accountability management systems.

2.2 The Validity Issue

The validity issue of red flag indicators of corruption risk is as strategic to monitor SDG achievements as it is complex and still largely unexplored in the international scientific literature. Indeed, the validity of a measurement tool concerns, in general terms, its ability to adequately reflect the concept to be measured (Adcock & Collier, 2001). Thus, corruption measures are valid when they detect corruption precisely in the cases where there has actually been corruption. But the evidence that there really was corruption is not straightforwardly inferable as corruption is a hidden and latent phenomenon. Recourse to final judgements for bribery offences would intuitively be the most direct way to assess the degree of validity of corruption measures. However, resorting to sentences presents various substantial and technical problems attributable to the fact that they detect only a very marginal part of emerged corruption. Moreover, judgements are expressed qualitatively and are included in documents whose format may not be easily accessible and directly analysable but with data scarping techniques and text mining methods. The latest would be needed to access basic information, for instance, on the type of crime, the actors of the offence (i.e., individuals, companies, public officials, contracting authorities, etc.), the period in which the offence was committed, etc. Furthermore, judgements are usually issued many years after the crime has been committed. Given this temporal misalignment, a retrospective evaluation of the validity of corruption measures based on them presupposes an efficient judicial system capable of keeping track of the crime actors, type of crime, etc. over time, other than a system which allows interested users to access freely judgement contents.

Validating corruption risk indicators is rarely addressed by the scientific literature on corruption measurement, with only a few exceptions. In Fazekas et al. (2018), red flags are validated with respect to their strength in predicting the single bidding indicator within a logistic regression modelling framework. In this procedure, the single offer—which measures the circumstance that a public contract receives only one bid—is the key criterion against which to verify the validity of a set of red flags. The procedure has the merit of being easily replicable, as the single offer indicator can be straightforwardly computed with data accessible in most Countries. On the other hand, it implicitly assumes that when a public contract receives only one bid we are faced with a certainly corrupt circumstance, an occurrence which, though, would need to be ascertained as well.

Another well-known procedure is the one described in Decarolis et al. (2019) and Decarolis and Giorgiantonio (2022). Employing maximum likelihood algorithms (i.e., LASSO, ridge regression and random forest), the authors suggest evaluating the reliability of red flags in terms of their ability to accurately predict a variable indicating whether a company’s owners or top executives have ever been under investigation by the police for engaging in corruption practices. Despite the strength of the variable expressing corruption risk and derived from police investigations, the procedure appears limited in its replicability, as police investigation data are confidential and hardly open.

Given the complexity and limitations of validating corruption measures relying on the above procedures, in this paper we opt for the adoption of a criterion of “minimum validity” (Bello y Villarino, J.-M., 2021), which leads to assessing the degree of validity of a set of indicators with respect to their level of internal coherence, based on specific statistical criteria. According to such a criterion, corruption measures can be valid when they are strongly correlated to the latent phenomenon they measure. Specifically, the interest is to assess to what extent corruption measures can capture the higher-level theoretical construct they intend to detect, while at the same time excluding irrelevant elements.

Coherently with the criterion above, in this paper we assess the validity of a set of red flag indicators though a procedure based on the framework of multidimensional Item Response Theory (IRT) models. By assuming that the associations between items (i.e., red flags) are represented by a latent and unobservable trait (corruption), this framework is especially suitable for the research context at hand because it enables us to investigate the degree to which corruption measures capture the higher-level theoretical construct. Besides, in their multidimensional version, IRT models assume that the multidimensional latent trait is measured by a multivariate latent variable. This latest extension is, again, very suitable for the purposes of this study as it accounts for the complexity of corruption, that is, for the circumstance that the relational structure among single indicators might mirror a complex and multidimensional configuration, rather than a simple and unidimensional one, where high correlations can be observed between sub-groups (or sub-dimension) of red flags.

3 Item Response Theory Models to Evaluate the Validity of Red Flags

In this section we first introduce the IRT framework from a general point of view (Sect. 3.1), then the specific model used for the analysis (i.e., the graded response model) is described (Sect. 3.2).

3.1 Preliminaries

IRT models are latent variable models for handling multivariate categorical data, assuming that associations among items (i.e., red flags in our case) may be represented by a latent trait (corruption risk here), which can be uni- or multidimensional, that is, measured by a uni- or multivariate latent variable (Bartolucci et al., 2015). IRT is broadly used in analysing students’ proficiency through learning assessment test, where the items (test questions) are binary variables (wrong/correct responses). In this regard, Rasch (Rasch, 1961) and two-parameter logistic (2PL; Birnbaum, 1968) models are the most widely known and employed IRT models. However, in other research fields items can have more than two categories, therefore IRT models for polytomous items are available (based on nominal or ordinal items).

Similarly to factor analysis for continuous variables, IRT can be employed for confirmatory and exploratory analysis of the dimensionality structure of the underlying latent variables when the manifest variables are categorical. Data dimensionality investigation is an essential step towards the validation of corruption risk indicators. Indeed, as clarified throughout this paper, the measurement of the risk of corruption through proxy indicators is valid to the extent that red flag indicators are strongly related to the latent concept (i.e., corruption) they intend to measure. When the phenomenon of interest is a one-dimensional latent variable, it is expected that the proxy indicators used for its measurement, in order to be valid, are all strongly correlated with each other, as an expression of a single underlying one-dimensional latent. Instead, when the latent phenomenon is complex and multidimensional—such as the risk of corruption—it is reasonable to expect that the red flag indicators are not all related with each other, but rather in sub-groups of indicators, each measuring a different sub-dimension and aspect of the same underlying risk.

IRT models are particularly appropriate methodological tools in the context at issue as they allow us to study the relational structure of single red flags (i.e., their dimensionality) and can be applied when one has no prior information about the number and composition of the sub-dimensions of the phenomenon under scrutiny. In the IRT case, the dimensionality structure can be estimated empirically by comparing nested models and rotating ascertained solutions to seek for more interpretable structures (Bock & Aitkin, 1981; Bock et al., 1988; Chalmers, 2012).

Moreover, the choice of this approach for evaluating the dimensionality structure of the red flags has been driven by the nature of the available data. As outlined in Sect. 4.2, our set of red flags, computed at the contracting authority level, displays skew distributions of varying intensities and cannot be considered normally distributed. As a consequence, any methods based on the normal distribution assumption is inappropriate for the data at hand and we opt for an IRT modelling framework. Specifically, we consider an IRT model for handling ordinal items, the graded response model (GRM) of Samejima (1969)—introduced in the next section—and we convert our elementary indicators into categorical (ordinal) indicators using a unified scale based on four categories, representing distinct levels of corruption risk.

3.2 The Multidimensional Graded Response Model

The multidimensional GRM proposed by Samejima (1969) can be formalised as follows. Let \(Y_{ij}\) be the value of indicator j as regards contracting authority i, with \(i=1,\ldots ,n\) and \(j=1,\ldots ,J\). Let us assume that each indicator is ordinal with \(C_j\) categories (labelled as 0, 1, ..., \(C_j-1\)). In our setting, each indicator has the same number of categories (\(C=4\)). For modelling ordinal responses and assuming multiple underlying latent traits, the multidimensional version of the GRM can be used and consists in the natural extension of the 2PL multidimensional IRT model for polytomous data.

Let us denote the multidimensional latent trait of contracting authority i by \(\varvec{\theta }_i=[\theta _{i1}, \ldots , \theta _{iD}]^\top\), which represents its unobservable corruption risk in public procurement with D dimensions (or factors). Like for its unidimensional counterpart, the multidimensional GRM formulation starts with the definition of the (conditional) probability that indicator j of contracting authority i is observed in category y, \(P(Y_{ij} = y \vert \varvec{\theta }_i)\), which can be written as the difference of two cumulative probabilities:

$$\begin{aligned} P(Y_{ij} = y \vert \varvec{\theta }_i) = P^*_{ij}(y) - P^*_{ij}(y+1), \end{aligned}$$

where \(P^*_{ij}(y)\) is the probability of observing category y or higher in indicator \(Y_{ij}\), namely \(P^*_{ij}(y)=P(Y_{ij} \ge y \vert \varvec{\theta }_i)\), with \(y=0,\ldots ,C-1\).

Cumulative probability \(P^*_{ij}(y)\) can be written using two parameters related to the indicator, as follows:

$$\begin{aligned} P^*_{ij}(y) = \frac{\exp (\varvec{a}_j^\top \varvec{\theta }_i + b_{jy})}{1+\exp (\varvec{a}_j^\top \varvec{\theta }_i + b_{jy})},\qquad y=1,\ldots ,C-1, \end{aligned}$$

where \(\varvec{a}_j=[a_{j1}, \ldots , a_{jD}]^\top\) is the vector of D slopes of indicator j (one for each dimension) and \(b_{jy}\) is the intercept of indicator j with reference to category y. Obviously, cumulative probability for the first response category (\(y=0)\) is not formally defined, as \(P^*_{ij}(0)=1\).

As we can see, for each indicator the model envisages as many slopes as the number of dimensions D, so as to give the indicator a chance, in theory, to “load” on all the dimensions. However, in order to ease the result interpretation, slopes can be transformed in a sort of loadings (like in factor analysis) and successively suitably rotated in order to reach a dimensional structure as simple as possible. Specifically, the loading of item j on dimension d is obtained as

$$\begin{aligned} l_{jd} = \frac{a_{jd}}{\sqrt{1+\sum _d a_{jd}^2}}, \end{aligned}$$

while other typical measures of factor analysis can be computed, such as communality \(h^2_j\), uniqueness \(u_j\) and SS loadings \(SS_d\), as follows:

$$\begin{aligned} h^2_j&= \sum _d l^2_{jd},\\ u_j&= 1 - h^2_j, \\ SS_d&= \sum _j l^2_{jd}. \end{aligned}$$

Moreover, indicator parameters can also be expressed with reference to difficulty and discrimination, as typically done in the IRT framework (Reckase, 2009). In particular, components of \(\varvec{a}_j\) can be considered as measures of the indicator discrimination with respect to each dimension. However, a measure of overall discrimination of item j over the D dimensions (denoted by \(\alpha _j\)) can be computed as follows:

$$\begin{aligned} \alpha _j = \sqrt{\sum _d a^2_{jd}}. \end{aligned}$$
(1)

Similarly, intercepts can be converted into overall difficulties as follows:

$$\begin{aligned} \beta _{jy} = \frac{-b_{jy}}{\alpha _j}, \qquad y=1,\ldots ,C-1. \end{aligned}$$
(2)

In particular, \(\beta _{jy}\) is the overall level of latent trait at which the probability of observing category y or higher in indicator j is equal to that related to category \(y-1\) or smaller.

Model estimation procedure using maximum likelihood methods is detailed in Bock et al. (1988) and implemented in the R package “mirt” (Chalmers, 2012).

4 Validity Assessment of Red Flag Indicators Through the Multidimensional Graded Response Model

This section is devoted to detail the main results of the paper about the validation of red flag indicators. After describing the data at hand and the selected red flag indicators in Sect. 4.1, an exploratory analysis of these data is presented in Sect. 4.2, whereas the final results on the validity of the red flag indicators using the graded response model is given in Sect. 4.3.

4.1 Data and Indicators

For this work, we rely on a pool of red flag indicators computed at the contracting body level, by aggregating information of the contracts managed by each contracting authority in the considered time period. Every tender managed by Italian contracting authorities flows into the Italian National Database of Public Contracts (BDNCP for short, standing for Banca Dati Nazionale dei Contratti Pubblici). It is managed by the Italian anticorruption authority and its open version is organised in several tables, one for each stage of the procurement process (ANAC, 2023). In the ANAC open data catalogue,Footnote 2 users can download any specific table about a particular procurement stage (in csv or json format) and including specific information on the contracts that have reached the selected phase. For example, the table on award notices (Aggiudicazioni) includes information on awarded contracts, such as award notice date, award value, award criterion, number of received and eligible bids, etc.

Given the above, it is not possible to get the entire BDNCP by a one-click download as users need to download the single tables and then merge them. Two keys are available to this aim: i. the contract unique identifier, called CIG (Codice Identificativo Gara), assigned to each procedure during the call for tenders phase; ii. the award ID, obtained once the contract reaches the award phase and allowing us to retrieve information on the award notice subsequent stages.

Tables 1a,b,and c list the fifteen selected red flags and report their label, the procurement stage, some formalisation details and the variables needed for their computation. Moreover, the reader can refer to Appendix A for a detailed description of each red flag.

Table 1 Selected red flags (1–15)

4.2 Exploratory Analysis

In this work, we consider data from the BDNCP about tenders with a call published in 2017, for a total number of more than 145 thousand contracts. The choice of the referenced year is justified by the need to include in the analyses an adequate portion of recent contracts reaching the final stages of the procurement process, whose average duration is five years.

According to the structure of the BDNCP open version (as outlined in Sect. 4.1), the final data matrix, which pertains to the portion of selected contracts, has manually been created as follows. Firstly, the list of contracts published in the selected year is obtained by downloading the related table about the call for tenders (Bando CIG in the ANAC open data catalogue). Then, using the contract unique identifier (the CIG), it is merged with the award notice table (Aggiudicazioni), hence connecting each contract with its own award ID. This secondary key is then used for retrieving the information of each contract (if available) related to the other procurement stages, required for the computation of the selected red flag indicators. Specifically, the tables about the following stages are considered: variants (Varianti), contract start (Avvio contratto), contract end (Fine contratto), contract economic framework (Quadro economico), and winners (Aggiudicatari).

Due to contract specific features and/or missing values, a sub-sample of 7,900 contracting bodies has been accounted for in our study. Missing values can occur due to several reasons. A major missing case occurs when contracting bodies manage a single procedure over a year, a quite typical circumstance for “small” bodies. Anytime this condition applies, the computation of the red flag at the contracting body level is omitted, as (at least) two contracts are needed. Missing values in an indicator may also occur when the information required for the computation of an indicator is not present in the database. In this case, the indicator cannot be calculated due to missing variable(s), even if the procurement procedure has got the formal characteristics that would allow us to compute it.

As a first exploratory tool to investigate the dimensionality of the selected red flags, we run an analysis of the correlation between pairs of indicators, relying on the Pearson’s linear correlation and the Spearman’s rank correlation coefficients. In this regard, we can observe from Fig. 1 the Pearson correlation plot between the fifteen indicators. Only significant correlations are shown, where the significance level is set at 5% and adjusted for multiple comparisons (105, in this case). Very similar results are obtained by means of the rank correlation and for this reason we omit to report them.

Fig. 1
figure 1

Linear correlation plot among red flags. Note: only significant correlations are reported, adjusted for multiple comparisons

As expected, the red flags belonging to the first six indicators are highly and positively correlated within pairs since each pair accounts for two facets of a mutual risk. In fact, the correlation coefficient is 0.86, 0.83 and 0.89 for indicator related to non open procedures (having root label non_open), single bidding (single_bid) and usage of MEAT criterion (MEAT), respectively. In addition, the two indicators on the use of non-open procedures (non_open_count and non_open_val) are negatively correlated with indicators about the use of MEAT as award criterion (MEAT_count and MEAT_val), albeit with a small extent (between 0.3 and 0.4 in absolute value). Moreover, two other groups of three connected red flags emerge from the correlation plot: a first group includes the indicators accounting for the exclusion of bids (excluded_bids, all_bids_excluded_but_one and excluded_bids_but_one), while the second, less evident (i.e., with lower correlations), contains red flags concerning procedures with variants (modifications), deviations in the contract economic value (amount_deviation) and duration (time_deviation).

In our data, the red flags at the contracting authority level, expressed as continuous variables, show distributions with different degrees of skewness (data not shown). Hence, they cannot be treated as normally distributed variables and a factor analysis approach cannot be considered a proper solution for exploring their dimensionality (Gao et al., 2008; Mahmoud & Khalifa, 2015; Mahmoud et al., 2019; Meyers et al., 2016).

Furthermore, given that the red flags are expressed either as proportions or as means, we transform them into ordinal indicators expressed into a common scale with the same number of categories, each representing a different degree of corruption risk. Specifically, the fifteen selected red flags are discretised into four categories of increasing corruption risk, labelled from 0 to 3, where 0 stands for “no or low risk”, 1 for “medium-low risk”, 2 for “medium-high risk” and 3 for “high risk”. Since there is no general and objective recommendation on how to identify corruption risk categories, thresholds are computed by relying on the quantiles of each indicator observed distribution.

With the only exception of the red flags concerned with the time for submitting bids and for selecting the winning bids (advertisement and evaluation), all other indicators have a positive polarity, so that the greater the indicator, the higher the risk of corruption. Accordingly, sample quartiles are computed and used as thresholds for the risk categories, as follows. Let \(X_{ij}\) be the value of red flag j for contracting authority i, with \(j=\{1,\ldots ,15\}{\setminus }\{7,8\}\) and \(i=1,\ldots ,n\). In addition, let \(q_j(0.25), q_j(0.50)\) and \(q_j(0.75)\) be the sample quartiles of red flag j (if overlapped, they are computed on the distribution of unique values). The discrete red flag \(Y_{ij}\) is obtained as follows:

$$\begin{aligned} Y_{ij} = {\left\{ \begin{array}{ll} 0, \qquad \text{ if } X_{ij} \le q_j(0.25), \\ 1, \qquad \text{ if } q_j(0.25)< X_{ij} \le q_j(0.50), \\ 2, \qquad \text{ if } q_j(0.50) < X_{ij} \le q_j(0.75), \\ 3, \qquad \text{ if } X_{ij} > q_j(0.75). \end{array}\right. } \end{aligned}$$

As far as indicators advertisement and evaluation are concerned, since both extremely low and extremely high values signal a risk of corruption (see Appendix A), a different categorisation procedure is carried out. Specifically, the distribution of these indicators is divided into eight parts by computing seven quantiles \(q_j(\tau )\), with \(\tau\) ranging from 0.125 to 0.875 with step 0.125 (1/8). Afterwards, \(Y_{ij}\) is built as follows (\(j=7,8\)):

$$\begin{aligned} Y_{ij} = {\left\{ \begin{array}{ll} 0, \qquad \text{ if } q_j(0.375)< X_{ij} \le q_j(0.625), \\ 1, \qquad \text{ if } q_j(0.25)< X_{ij} \le q_j(0.375) \text{ or } q_j(0.625)< X_{ij} \le q_j(0.75), \\ 2, \qquad \text{ if } q_j(0.125)< X_{ij} \le q_j(0.25) \text{ or } q_j(0.75) < X_{ij} \le q_j(0.875), \\ 3, \qquad \text{ if } X_{ij} \le q_j(0.125) \text{ or } X_{ij} > q_j(0.875). \end{array}\right. } \end{aligned}$$

In this way, the low risk condition (\(Y_{ij}=0\)) includes values of the indicator that lie around the median, whereas the high risk condition (\(Y_{ij}=3\)) includes values that lie on the extreme tails of the distribution.

The resulting set of \(Y_{ij}\) becomes the basis for studying the dimensionality structure of corruption risk in public procurement, by means of the multidimensional GRM introduced in Sect. 3.2.

4.3 Dimensional Solutions

The number of dimensions characterising the corruption risk in public procurement is obviously unknown. Consequently, several GRMs are estimated on the data at hand and model fitting is evaluated using classic penalised log-likelihood indexes, as well as model fitting relative improvement computed as the percentage difference of each index with respect to that of the previous model (with one dimension less). Specifically, in this work we consider the following indexes: Akaike Information Criterion (Akaike, 1973, AIC;), Bayesian Information Criterion (Schwarz, 1978, BIC;), sample size-adjusted BIC (Sclove, 1987, SABIC;) and Hannan-Quinn information criterion (Hannan & Quinn, 1979, HQC;), together with the classic likelihood ratio test.

The procedure for selecting the suitable number of dimensions begins with fitting the unidimensional model (\(D=1\)) and increasing until \(D=7\). No model is fitted with more dimensions, in order to ideally have sufficient space for allocating at least two red flags in each dimension.

Results (see Table 2) highlight \(D=7\) as optimal dimensional solution, according to all the indexes and the likelihood ratio test. However, if we consider the relative improvement of the model fitting (index values vs. number of dimensions shown in Fig. 2), it can be appreciated that the curves become flat in correspondence of five dimensions, implying that the improvements for models with six or seven dimensions are negligible (lower than 0.2%), despite higher absolute fitting (so-called elbow method).

Table 2 Model selection output: number of dimension (D), maximimised model log-likelihood (\({\hat{l}}\)), number of model parameters (# par.), penalised log-likelihood indexes (AIC, SABIC, HQC, BIC), likelihood ratio test statistic (\(\chi ^2\)), degrees of freedom (df) and p-value
Fig. 2
figure 2

Values of penalised log-likelihood indexes against number of dimensions of the model

Table 3 Output of the IRT model with \(D=5\) dimensions: oblimin-rotated loadings (cut at ±0.2), communalities (\(h^2\)), sum of squared (SS) loadings and proportion of explained variance by each dimension

By concurrently accounting for both a statistical criterion based on the above indexes and a criterion that is attentive to model parsimony (Del Sarto et al., 2023; Pohle et al., 2017; Montanari et al., 2018), the solution with \(D=5\) is selected as the best one. Table 3 reports the five-dimensional solution in terms of rotated loadings, communalities, sum of squared loadings, and proportion of explained variance. These results align with the exploratory analysis, based on correlation among red flags (Sect. 4.2). In fact, the first three pairs of red flags—related to the proportion of non open procedures, to the procedures receiving a single bid, and to the procedures awarded with the MEAT criterion—contribute to defining three distinct dimensions (D1, D2, and D3) due to high loadings and, consequently, the variability of these indicators is almost entirely explained (high communalities). Two further groups of red flags emerge, each defining a different dimension. The first group considers bid exclusion (excluded_bids, all_bids excluded_but_one and excluded_bids_but_one), which flows into dimension D4. The second group includes indicators based on variants and the deviation in contract economic values and duration (modifications, amount_deviation, and time_deviation), contributing to dimension D5.

Finally, indicators concerning the average time for submitting bids (advertisement) and for selecting winning bids (evaluation), together with the indicator on the homogeneity of the distribution of winners (winners_homog), do not consistently load into none of the five ascertained dimensions and show very low communalities, thus highlighting that these dimensions can explain only a very low share of their variability. Again, this is an expected result, since these three red flags have very low correlation coefficients with the other indicators (Sect. 4.2). We have also fitted the model with six dimension to confirm if excluded indicators are re-admitted to a six-dimensional solution, but none contributes to an additional risk dimension measurement.

Table 4 Output of the IRT model with \(D=4\) dimensions: oblimin-rotated loadings (cut at ±0.2), communalities (\(h^2\)), sum of squared (SS) loadings and proportion of explained variance by each dimension

Besides, the solution with four dimensions (Table 4) is also inspected to assess how the final five dimensions get clustered. We find again the three blocks of two indicators (D1, D2, and D3), while the remaining six red flags are clustered in D4. The indicators about the exclusion of bids have high loadings, while the proportion of procedures with variants has similar magnitudes on D3 and D4 (even if of opposite sign). Likewise for the five-dimension solution, the same three indicators (advertisement, evaluation and winners_homog) remain outside the four-dimensional solution.

Given all the previous considerations, we retain the five-dimensional solution as the most suitable for characterising corruption risk in public procurement. The correlation matrix among the five dimensions (not shown) displays poor correlations among them. The only exception involves dimensions D1 (related to non-open procedures) and D3 (MEAT criterion recurrence), which are are negatively correlated (−0.458), as, by the way, already shown in Fig. 1.

Finally, for each red flag overall discrimination and difficulties (IRT parameters) are obtained using formulas in (1) and (2) on the estimated parameters of the five-dimensional model. Looking in particular at the discriminating power of each red flag and considering the underlying corruption risk dimensional structure (Table 5), we can appreciate how red flags based on contract economic value have the highest overall discrimination parameter, both within and across dimensions. In this perspective, other important red flags include the one that considers procedures with all bids excluded but one and the indicator that takes into account modified contracts.

Table 5 Estimated IRT parameters for each red flag j: overall discrimination \({\hat{\alpha }}_j\) and difficulties \({\hat{\beta }}_{jy}\)

5 Discussion of Main Results

In this paper we assessed the validity of a set of red flag indicators through a procedure based on the framework of multidimensional IRT models, which allows us to account concurrently for the latent and complex facet of corruption. A pool of fifteen red flag indicators of corruption risk in public procurement was chosen and computed on data included in the Italian National Database of Public Contracts (BDNCP). Results showed that the risk of corruption is not a one-dimensional phenomenon, because the goodness of fit of the multidimensional models to the data was always better than that observed for the one-dimensional solution. This result is fully in line with our expectations as corruption risk is a complex phenomenon that occurs through behaviours, facts and circumstances that differ in nature, type and entity. Operationally, this result implies that the single measures of corruption risk (i.e., red flags) are not expected to be all related with each other, but rather in sub-groups, each measuring a different and specific risk category of the same underlying risk of corruption. Indeed, among the several multidimensional solutions we identified, the five- and four-dimensional solutions are to be preferred, on the basis of statistical, interpretability and model parsimony criteria. In the five-dimensional solution, the five categories of corruption risks refer in particular to:

  1. 1.

    risks related to the type of procurement procedure, or to the weight—in relative terms and in terms of economic value—of non-open procedures (for example, direct assignments or negotiated procedures) on the total number of procedures awarded;

  2. 2.

    risks related to the circumstance that a procurement procedure receives only one offer, or to the weight—in relative terms and in terms of economic value—of the procurement procedures that receive only one offer;

  3. 3.

    risks related to the contract award criterion, or to the weight—in relative terms and in economic value—of the procedures awarded with discretionary criteria, such as that of the most economically advantageous tender;

  4. 4.

    risks related to the circumstance that, during the evaluation of the offers received, a contracting authority excludes all but one offers, or to the weight in relative terms of the procurement procedures for which all the offers have been excluded except for one;

  5. 5.

    risks related to the circumstance that a contract undergoes variations during its implementation and to the extent of the variation between i. allotment and cleared amount, and ii. contract expected and actual duration.

The risk of corruption measured by the first and third category is greatly related to an excess of discretionary power exerted by contracting authorities in the public procurement decision process. Addressing excess of discretion is key to advance socioeconomic sustainability, which requires transparency and accountability in decision-making. An excess of discretion undermines these principles by allowing contracting authorities to make decisions without clear guidelines and oversight. Even though access to (and availability of) relevant information related to the public procurement decision-making process may still be formally guaranteed by contracting authorities, excess of discretionary power can weaken disclosure and accountability mechanisms (i.e., public consultations, complaint mechanisms, audit processes, etc.) and lead to favouritism, conflict of interest, and the misuse of resources. Unwarranted discretion can also erode the rule of law and undermine fairness in socio-economic systems. Indeed, when decisions are based on subjective judgement rather than objective criteria, fairness—a fundamental pillar of socio-economic sustainability, ensuring equal opportunities, protection of rights, and access to resources for all individuals and groups—may be compromised. The two above groups of red flags are therefore of particular interest of many targets under SDG 16 and beyond, such as (among others): Target 16.6 devoted to developing effective, accountable and transparent institutions; Target 16.3 devoted to promoting the rule of law and equal access to justice; Target 10.3 devoted to ensuring equal opportunity and reduce inequalities of outcome, by eliminating discriminatory laws, policies and practices.

The second and fourth categories identify in the lack of competitiveness the major mechanism behind corruption risks. Both dimensions point to the absence of competitive bids as a sign that collusive practices may have taken place among contracting authorities and companies, i.e., in the form of illegal agreements to ensure that a specific bidder wins the contract. Competition plays a crucial role in fostering economic growth, innovation, and equitable outcomes. Conversely, lack of competition can concentrate wealth and power in the hands of a few dominant players, exacerbating socio-economic inequalities and hindering sustainable development through higher prices, reduced consumer welfare, lower product diversity, and limited innovation. Hence, these two aggregations of red flags relate especially to SDG targets on issues conveying sustainable economic growth and innovation, such as: Target 8.2 on achieving higher levels of economic productivity through diversification, technological upgrading, and innovation; Target 8.8 which calls for promoting a stable and secure working environment, and implies a fair competition environment where businesses can thrive based on merit and innovation; Target 9.5 devoted to enhancing scientific research and innovation. Besides, the two aggregations may be of interest transversely of those SDGs and targets devoted to reducing inequalities, such as Goal 10 and Goal 5 devoted specifically to achieve gender equality.

Finally, the fifth group of red flags primarily reflects inefficiencies in the public procurement process. Efficiency is a fundamental concept throughout the 2030 Agenda, as it is closely linked to achieving sustainability and ensuring the responsible allocation and use of resources. This last group of red flags is therefore of interest of several SDGs and related targets, such as, among others: Target 9.4, which focuses on upgrading infrastructure and retrofitting industries to make them sustainable, with a particular emphasis on resource efficiency and adoption of clean and environmentally sound technologies; Target 12.2, which calls for achieving sustainable management and efficient use of natural resources.

Besides, the analysis of the correlation between sub-dimensions of corruption risk shows negligible (and close to zero) correlations between dimensions. The identified dimensions of the risk of corruption are therefore generally non-superimposable, and different from each other. Based on our findings, this implies in particular that:

  • the risk associated with the adoption of non-open procedures is neither related with the risk associated with the circumstance that a procurement procedure receives only one offer, nor with the risk associated with the circumstance that all offers, except one, are excluded;

  • the risk associated with the circumstance that a tender procedure receives only one offer is not related with the risk associated with the circumstance that a tender procedure is awarded with discretionary criteria, nor with the circumstance that, during the evaluation phase of the offers received, a contracting station excludes all but one offer;

  • the risk linked to the circumstance that a tender procedure is awarded with discretionary criteria is not associated with the risk linked to the exclusion of all offers except one.

The only important exception was given by the negative and by no means negligible correlation between the risk linked to the adoption of non-open procedures and the risk linked to the circumstance that a tender procedure is awarded with discretionary criteria. This suggests that non-open procedures are typically awarded through criteria other than the discretionary ones, such as that of the lowest price. Finally, from the five-dimensional solution it is possible to observe that the risk linked to the circumstance that a contract undergoes variations during implementation is negatively associated (albeit with a weak entity) both to the risk that the tender procedure receives a single offer, and to the risk linked to the circumstance that a tender procedure is awarded with discretionary criteria.

The last analysis concerned the estimation of the discriminating power of the red flag indicators within each sub-dimension of corruption risk. Since discrimination estimates can be interpreted as weights assigned to the single red flags, they help in identifying which red flag indicators contribute the most to differentiating units of analysis within each sub-dimension. In both the four- and five-dimensional solutions, the most important red flags were those accounting for the economic value of tendering procedures, and specifically: economic value of procedures awarded with discretionary criteria; economic value of unopened procedures; economic value of tender procedures that receive a single offer; bidding procedures in which all bids, but one, are excluded.

The present work can be extended in several ways. First, in order to check the robustness of the obtained results across time and space, the proposed validating procedure can be tested with data referred to other years and to other national contexts. Second, a multilevel component accounting for the nested structure of our data (i.e., contracts nested within contracting authorities) can be added to the proposed statistical model to get insights on the contribution of data gathered at different aggregation levels to the identified dimensionality solutions.