Introduction

Democracy, its causes, and its consequences are among the most extensively analysed issues in political research. Nevertheless, many theoretical propositions on democracy remain contested because of inconsistency in empirical findings. This paper argues that one underemphasised reason for these inconclusive findings is the use of different democracy measures, and it provides strong empirical support for the argument. Differences in measures of democracy are problematic because scholars have more and more measures of democracy to choose from, yet choosing the right measure and/or assessing whether the findings are generalisable across several measures does not play an important role in most research agendas.

The limitations of measures of democracy have been discussed in a rich body of literature, and scholars have identified problems related to reliability and validity (e.g. Bollen 1980; Bowman et al. 2005; Hadenius and Teorell 2005; Knutsen 2010; Giebler 2012; Pickel et al. 2015). More specifically, to give a few examples, some of the most established measures of democracy lack conceptual clarity and contextual equivalence (Barsh 1993), are volatile over time (Gleditsch and Ward 1997), contain systematic measurement errors (e.g. Bollen and Paxton 2000; Treier and Jackman 2008; Pemstein et al. 2010; Elff and Ziaja 2018), have little theoretical background (Besançon 2003), and display ideological bias (Giannone 2010). Elkins (2000) finds that dichotomous measures of democracy are less valid and reliable than graded measures. Munck and Verkuilen (2002) distinguish among three major methodological problems of democracy indices: conceptualisation, measurement, and aggregation. McMahon and Kornheiser (2010) show that not all democracy indices are strongly related to each other. Boese (2019) argues that V-Dem performs better than Freedom House and Polity IV in terms of definition, measurement scale, and aggregation.

Casper and Tufis (2003) and Högström (2013) investigate whether measures of democracy are interchangeable by regressing selected measures of democracy on structural predictors of democracy. Through panel (Casper and Tufis 2003) and cross-sectional (Högström 2013) regressions, both studies find that measures of democracy are not interchangeable and that the selection of measures can significantly determine the outcomes of a study. Both studies, however, have some limits that are addressed in my research. First and foremost, Casper and Tufis (2003) evaluate the interchangeability of three democracy indices (Vanhanen, Polity IV, and Freedom Rating) from 1975 to 1992, and Högström (2013) evaluates the interchangeability of two democracy indices (Polity IV and Freedom Rating) in 2009. It is evident that a more comprehensive picture of the matter requires analysing a larger number of democracy measures in multiple years. Moreover, updated information on more recent years is needed. My research builds primarily on these two studies with three key improvements: (1) a comparative evaluation of the statistical properties of the measures, (2) an analysis of the overall and over-time convergence of the measures, and (3) broader empirical foundations covering more recent data and some of the most “currently relevant” measures of democracy.

Data and research strategy

Selection of relevant measures of democracy

Since not all existing measures of democracy can be effectively compared and analysed in a single study, I begin my enquiry by selecting some of the most currently relevant measures of democracy. The selection is based on academic importance, measurement scale, availability, and coverage. Dichotomous indicators, such as Alvarez et al.’s (1996) Democracy and Dictatorship, are excluded, so that we capture not only the distinction between autocracy and democracy, but also different degrees of democracy. Ordinal measures that are hardly ever treated as interval, such as Skaaning et al.’s (2015) Lexical Index of Electoral Democracy, are excluded as well. Finally, measures that have limited time coverage, such as the Economist Intelligence Unit’s (2019) Democracy Index, have not been updated regularly in recent years, such as Coppedge et al.’s (2008) Contestation and Inclusiveness, or rely explicitly on other measures, such as Kaufmann et al.’s (2011) Voice and Accountability, are left out of the analysis.

To assess the academic importance of measures of democracy, I use the study by Coppedge et al. (2017), which reports the number of Google Scholar citations of several democracy indices. The Center for Systemic Peace’s Polity IV leads the pack with nearly 5,000 citations. Freedom House’s Civil Liberties and Political Rights are the second most “impactful” measures of democracy with more than 1,500 citations on Google Scholar. Vanhanen’s Competition and Participation are the next most cited measures of democracy that fulfil our selection criteria. The three above-mentioned measures have been widely used in political research, cover nearly all countries around the world over many decades, and are publicly available. Besides these three well-established measures, I also include the more recently developed V-Dem Polyarchy index, which has gained widespread recognition in the last few years. Thus, to sum up, this article analyses four measures of democracy: Polity IV (Marshall et al. 2019), Vanhanen Index (Vanhanen 2019), Freedom Rating (Freedom House 2019a), and V-Dem Polyarchy (Teorell et al. 2019). Table 1 presents an overview of these indices.

Table 1 Main characteristics of frequently used measures of democracy

Polity IV (PIV) was originally built upon Gurr’s (1974) and Eckstein and Gurr’s (1975) studies on political systems, with the objective of analysing whether common beliefs about the characteristics of state authorities have possible historical or cross-cultural patterns around the world. The Polity IV index is composed of the Institutional Democracy and Institutional Autocracy sub-indices, which measure, respectively, the democratic and the autocratic features of a country. The two sub-indices, which in turn are based on several more specific sub-components, are ultimately combined into the Polity IV index. The index has a 21-point scale from more autocratic to more democratic and provides annual data for all countries in the world with a population of at least 500,000 since 1800.

Vanhanen’s index of democratisation (VAN) measures democracy as a two-dimensional construct formed by the degree of electoral competition between parties and the level of voter turnout in elections (Vanhanen 2000). The index aggregates two equally weighted sub-indicators, Electoral Competition and Political Participation, which are both based on “hard” quantitative electoral data. Thus, according to Vanhanen (2000), the index provides more objective information than many other democracy indices. In theory, Vanhanen’s index ranges from 0 to 100 (from low to high democracy), although, in practice, no country has ever received a score higher than 49. Annual data is provided for most countries in the world since 1810.
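As a rough illustration of how the two sub-indicators can combine, the sketch below follows the multiplicative rule commonly attributed to Vanhanen (2000). It assumes that Competition is 100 minus the largest party's vote share, that Participation is the percentage of the total population that voted, and that the index is their product divided by 100. The function name and the example figures are hypothetical and are not taken from the dataset.

```python
def vanhanen_index(largest_party_share: float, turnout_share: float) -> float:
    """Index of democratisation computed from two percentages (0-100).

    Assumptions (not spelled out in the text above):
      competition   = 100 - largest party's share of the votes
      participation = voters as a percentage of the total population
      index         = competition * participation / 100
    """
    for value in (largest_party_share, turnout_share):
        if not 0.0 <= value <= 100.0:
            raise ValueError("shares must be percentages between 0 and 100")
    competition = 100.0 - largest_party_share
    participation = turnout_share
    return competition * participation / 100.0


# Hypothetical cases, not real country scores.
one_party_state = vanhanen_index(largest_party_share=100.0, turnout_share=70.0)
competitive_case = vanhanen_index(largest_party_share=40.0, turnout_share=60.0)
print(one_party_state, competitive_case)
```

Under this rule, a country without any electoral competition scores zero whatever its turnout, which is one plausible reason why many observations can share the minimum value.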

Freedom Rating (FH), developed and produced initially by regional studies specialist Gastil, is conceptually rooted in the principles of the Universal Declaration of Human Rights of 1948 (Freedom House 2019b). The index aggregates two main sub-indices, Political Rights and Civil Liberties, which in turn are based on several more specific sub-indicators. The final Freedom Rating score of a country is the average of these two main sub-indices. Annual data is available since 1972 for virtually all existing countries and some territories (e.g. Hong Kong, West Bank) in the world. The index measures democracy on a scale from 1 to 7, where a lower score stands for a higher level of democracy.

V-Dem’s Polyarchy index (VDEM), rooted in Dahl’s (1971) influential theory of democracy, quantifies democracy as a combination of “freedom of association, suffrage, clean elections, elected executive, and freedom of expression” (Coppedge et al. 2017: 10). In the V-Dem project’s conceptual scheme, the Polyarchy index is conceived to measure the fundamental aspects of democracy (Coppedge et al. 2017). The index synthesises 40 sub-indicators that are coded by country experts (mainly academics) and research assistants (Teorell et al. 2019). Polyarchy ranges on a continuous scale from 0 to 1, from low to high. Annual data covers nearly all countries of the world since 1789.

Research strategy

I start the quantitative analysis by exploring the statistical features and evaluating the distributional characteristics of the selected indices through violin plots. While we should not expect measures of democracy to follow a Gaussian distribution by nature, we would prefer, for statistical purposes, to have measures that are more or less normally distributed, for the simple reason that many common statistical analyses and tests assume such a distribution. An analysis of the distributions is an informative task per se. Yet it also reveals some more subtle weaknesses in the measures that are further discussed. In particular, the ability to differentiate between countries (Knutsen 2010), especially at the extremes of a scale (Goertz 2020), is a desirable property of democracy measures. Violin plots are especially helpful in detecting bumps within a distribution (Hintze and Nelson 1998), and as we shall see, some of the democracy measures tend to clump at extreme values. Such scales should be further extended to capture all interesting variation (Goertz 2020).

Second, after analysing the statistical features and distributions of the data, I run bivariate correlations to assess the strength of the linear associations among our measures of democracy. Since all measures quantify democracy, we expect them to be strongly related to one another. To get a general view of the associations, correlations are computed between pairs of democracy measures using all available observations since 1972. To examine whether these relationships differ from year to year and/or over time, correlations are also run at 5-year intervals. These correlations provide information about the statistical association and the convergence of the measures. If currently relevant measures of democracy are strongly associated with one another, we can conclude that their convergence is high.

Finally, through a battery of regression models, I analyse whether the choice of the measure of democracy affects the interpretation of the results, or in other words, whether measures of democracy are interchangeable or not. Following previous empirical literature on measurement interchangeability (e.g. Casper and Tufis 2003; Møller and Skaaning 2011; Högström 2013), I regress our measures one by one on the same set of predictors, keeping everything else equal. According to the interchangeability criterion, “equivalent measures should produce similar causal inferences” (Seawright and Collier 2014: 124). Hence, if the conclusions are not determined by the choice of the measure, we can claim that measures of democracy are interchangeable. To exclude selection bias, I run the regressions with only country-years common to our four measures of democracy.
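The restriction to common country-years can be sketched as a set intersection. This is a minimal illustration only: the coverage sets below are hypothetical placeholders (countries "A", "B", "C"), and the function name is an assumption rather than part of the original analysis.

```python
from typing import Dict, Set, Tuple

CountryYear = Tuple[str, int]


def common_country_years(coverage: Dict[str, Set[CountryYear]]) -> Set[CountryYear]:
    """Return the country-years that are observed in every measure."""
    samples = list(coverage.values())
    if not samples:
        return set()
    return set.intersection(*samples)


# Hypothetical coverage of each measure, for illustration only.
coverage = {
    "PIV": {("A", 2000), ("A", 2001), ("B", 2000)},
    "VAN": {("A", 2000), ("B", 2000), ("C", 2000)},
    "FH": {("A", 2000), ("A", 2001), ("B", 2000), ("C", 2000)},
    "VDEM": {("A", 2000), ("B", 2000)},
}
shared = common_country_years(coverage)
print(sorted(shared))
```

Estimating every model on the same observations means that differences between models cannot be attributed to differences in sample composition.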

The democracy indices are regressed on a set of common predictors of democracy based on both Högström’s (2013) and Casper and Tufis’ (2003) studies. These common or “structural” predictors are economic development, growth, trade openness, inflation, population, education, presidentialism, parliamentarism, and party fractionalisation. The first five variables are taken from the World Development Indicators dataset (World Bank 2017). Economic development is measured by the natural logarithm of GDP/capita in constant 2010 USD. Growth is measured by the annual GDP/capita growth (as a percentage). Trade openness is measured by the sum of exports and imports of goods and services as a share of GDP. Inflation is measured by annual change in consumer prices (as a percentage). Population is measured by the natural logarithm of the estimated total population. Education is measured by the average years of education among citizens older than 15 and is retrieved from the Varieties of Democracy dataset (Coppedge et al. 2019). Party fractionalisation is measured by the probability that two random draws from the lower legislative chamber will be from different parties and is taken from Henisz’s (2017) POLCON dataset. Presidentialism and parliamentarism are dummy variables coded according to Anckar and Fredriksson’s (2019) classification of political systems.
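Fractionalisation is conventionally understood as the probability that two randomly drawn seats belong to different parties, so that a single-party chamber scores zero. The sketch below uses the simple with-replacement approximation, one minus the sum of squared seat shares. This is an assumption made for illustration; the POLCON dataset may apply a slightly different formula, and the seat counts are hypothetical.

```python
from typing import Sequence


def party_fractionalisation(seats: Sequence[int]) -> float:
    """Probability that two randomly drawn seats belong to different parties.

    Assumption: draws are made with replacement, so the value equals
    1 - sum(share_i ** 2) over all parties in the chamber.
    """
    total = sum(seats)
    if total <= 0:
        raise ValueError("at least one seat is required")
    return 1.0 - sum((s / total) ** 2 for s in seats)


# Hypothetical chambers, not real legislatures.
single_party = party_fractionalisation([200])
two_equal_parties = party_fractionalisation([100, 100])
print(single_party, two_equal_parties)
```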

The selected set of variables allows us to assess the interchangeability of measures of democracy with a sample of nearly 130 countries from 1973 to 2017. To cope adequately with problems common to panel regressions in cross-national contexts, as recommended by Beck and Katz (1995), I estimate the models with ordinary least squares (OLS) and panel-corrected standard errors (PCSE), which correct for heteroskedasticity and contemporaneous correlation. Serial correlation is corrected with a lagged dependent variable on the right-hand side of the equation and problems of reverse causality are mitigated by lagging all independent variables by 1 year. Throughout the empirical analysis, to facilitate comparison, measures of democracy are normalised (min–max) and transformed to range from 0 (low) to 100 (high). FH is reversed so that higher scores indicate more democracy.
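The rescaling and lagging steps can be sketched as follows. Two assumptions are made here and are not confirmed by the text: the normalisation uses the theoretical endpoints of each scale rather than the observed minimum and maximum (for VAN the two choices give different results, since no score exceeds 49), and the panel is stored as a dictionary keyed by country and year. All names are hypothetical.

```python
from typing import Dict, Tuple


def rescale_0_100(score: float, scale_min: float, scale_max: float,
                  reverse: bool = False) -> float:
    """Min-max normalise a score to 0 (low democracy) - 100 (high democracy).

    Assumption: scale_min and scale_max are the theoretical scale endpoints.
    Set reverse=True for scales where lower values mean more democracy.
    """
    if scale_max <= scale_min:
        raise ValueError("scale_max must exceed scale_min")
    if not scale_min <= score <= scale_max:
        raise ValueError("score lies outside the scale")
    share = (score - scale_min) / (scale_max - scale_min)
    if reverse:
        share = 1.0 - share
    return 100.0 * share


def lag_one_year(panel: Dict[Tuple[str, int], float]) -> Dict[Tuple[str, int], float]:
    """Attach each value to the following year, i.e. build a 1-year lag."""
    return {(country, year + 1): value for (country, year), value in panel.items()}


fh_best = rescale_0_100(1, 1, 7, reverse=True)    # Freedom Rating, reversed
fh_middle = rescale_0_100(4, 1, 7, reverse=True)
piv_middle = rescale_0_100(0, -10, 10)            # 21-point Polity scale
vdem_middle = rescale_0_100(0.5, 0, 1)

# Hypothetical predictor values for one country.
lagged = lag_one_year({("A", 2000): 1.0, ("A", 2001): 2.0})
print(fh_best, fh_middle, piv_middle, vdem_middle, lagged)
```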

Results and discussion

Main statistical properties

Violin plots in Fig. 1 present the main statistical properties of our four measures of democracy in all common country-years from 1972 to 2018. The black rectangle in the middle of each “violin” represents the median, the white box around it represents the interquartile range, and the whiskers show the distribution of the remaining values, with the exception of possible outliers. The grey area around the white box illustrates the shape of the distribution at various levels of democracy.
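The median and the interquartile range drawn inside each violin can be computed with the Python standard library, as in the minimal sketch below. The scores are hypothetical, and the "inclusive" quantile method is an assumption, since the text does not state which quantile definition underlies the figure.

```python
from statistics import median, quantiles
from typing import Dict, Sequence


def box_summary(scores: Sequence[float]) -> Dict[str, float]:
    """Median and interquartile range of a set of democracy scores."""
    if len(scores) < 2:
        raise ValueError("at least two observations are required")
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {"median": median(scores), "q1": q1, "q3": q3, "iqr": q3 - q1}


# Hypothetical normalised scores (0-100), not real data.
hypothetical_scores = [0, 25, 50, 75, 100]
summary = box_summary(hypothetical_scores)
print(summary)
```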

Fig. 1 Violin plots of measures of democracy (1972–2018)

Overall, none of the selected measures seems to provide anything close to a fairly normal distribution of values. VAN has a substantially lower median value compared to the other measures and its distribution is heavily skewed to the right. In practice, as shown by the high frequency of observations at the bottom of the scale, this means that a large number of country-years in VAN are coded as highly autocratic. By contrast, the other three indices of democracy have a more or less pronounced bimodal distribution, with peaks near the two ends of the scale and fewer observations in the middle of the scale. Intermediate levels are particularly empty with PIV, as shown by its comparatively large interquartile range. The very extremes are more clumped with FH and PIV than with VDEM. FH’s median is precisely in the middle of the scale, PIV’s median is closer to the democratic end of the scale, and VDEM’s median is closer to the autocratic end of the scale.

Besides showing that measures of democracy are far from being Gaussian, the distributional characteristics reveal some interesting differences among the measures. VAN seems to be unable to capture variation at the bottom extreme of the scale, as confirmed by the fact that as many as 25.7% of all country-years from 1972 to 2018 have the minimum score with VAN. In 2000 alone, there are nearly 30 countries with the minimum possible level of democracy with VAN. Since the other measures are able to tell the difference between almost all these countries (VDEM) or at least between many of them (PIV and FH), such an agglomeration of country-years at the bottom extreme of the scale with VAN is more likely to reflect excessive compression than real equivalence between the observations. Furthermore, since all these countries have the lowest possible score in VAN, the index is not able to capture any decrease in the level of democracy in suspiciously many countries.

To a lesser extent, PIV and FH also suffer from similar distributional shortcomings at the very extremes of the scale. Both indices compress several countries especially at the maximum level of democracy, suggesting that the scales of PIV and FH should be further extended beyond their current upper extremes to capture a more nuanced view of highly democratic countries. In total, from 1972 to 2018, 18.6% of country-years in PIV and 14.1% of country-years in FH are coded with the maximum score, and for instance, both indices are unable to distinguish between more than 30 countries with maximum scores each year from 2010 to 2015. By construction, any possible increase in democracy in these countries has been left unnoticed by both PIV and FH. Furthermore, as before, we can confidently assume that such an agglomeration of observations reflects excessive compression instead of true equivalence, because both VAN and VDEM are able to distinguish between the vast majority of these countries.
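The shares of country-years sitting at the minimum or maximum of a scale can be computed with a simple count. The sketch below is illustrative only: the scores are hypothetical and do not reproduce the percentages reported above.

```python
from typing import Sequence, Tuple


def share_at_extremes(scores: Sequence[float], scale_min: float,
                      scale_max: float) -> Tuple[float, float]:
    """Percentage of observations at the scale minimum and at the scale maximum."""
    n = len(scores)
    if n == 0:
        raise ValueError("no observations supplied")
    at_min = sum(1 for s in scores if s == scale_min)
    at_max = sum(1 for s in scores if s == scale_max)
    return 100.0 * at_min / n, 100.0 * at_max / n


# Hypothetical normalised scores (0-100) for eight country-years.
hypothetical_scores = [0, 0, 0, 100, 50, 100, 75, 25]
floor_share, ceiling_share = share_at_extremes(hypothetical_scores, 0, 100)
print(floor_share, ceiling_share)
```

A large share at either end indicates that the index cannot register further movement in that direction for the affected cases.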

The violin plots discussed above show that none of the measures of democracy is normally distributed. Heavy non-normality is not problematic in itself but becomes limiting if users of these measures wish to employ them in statistical analyses and tests that assume normally distributed data. The evaluation of the distributional characteristics of the data also shows that some of the measures fail to capture variation between non-democratic cases at the bottom end of the scale (VAN) or between highly democratic cases at the top end of the scale (PIV and FH). Such an artificial agglomeration of cases at the extremes of the scale can be considered a shortcoming of VAN, FH, and PIV. From this perspective, VDEM is better constructed than the other measures.

Convergence

Table 2 reports bivariate correlations among our four democracy indices, when all available observations across countries and over years are taken into account. The bottom-left quadrant reports Pearson’s correlation coefficients, while the upper-right quadrant reports Spearman’s rank correlation coefficients. Both methods are used to add robustness to the results. Pearson’s correlations measure the strength of linear relationships and are more sensitive to outliers, while Spearman’s correlations measure the strength of monotonic relationships and are less sensitive to outliers. The choice of the estimator, however, does not substantially alter the results.
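The difference between the two estimators can be illustrated with a short standard-library sketch. Spearman's coefficient is computed here as Pearson's coefficient on average ranks, which is the usual definition in the presence of ties. The data are hypothetical and only serve to show that a perfectly monotonic but non-linear relationship yields a Spearman coefficient of 1 and a Pearson coefficient below 1.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long series with at least two values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson's coefficient on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores: y rises monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```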

Table 2 Pairwise correlation coefficients of measures of democracy (1972–2018)

All bivariate correlation coefficients, regardless of the chosen method, are higher than 0.80, indicating high convergence among measures of democracy. With Pearson’s method the weakest correlation is between VAN and PIV (0.82), while the strongest is between VDEM and FH (0.92). With Spearman’s method, instead, the weakest correlation is between VAN and FH (0.84) and the strongest is between VDEM and PIV (0.92).

In general, the high convergence of measures of democracy is confirmed by Fig. 2, which illustrates bivariate Pearson’s correlation coefficients among our indices at 5-year intervals, from 1975 to 2015. Despite being strong, however, these correlations are not always constant over time. For instance, the bivariate correlation between FH and PIV changes minimally from one period to another (0.88 in 1975 and 0.86 in 2015), whereas the bivariate correlation between VAN and VDEM changes considerably over time (0.92 in 1975 but only 0.77 in 2015). In most cases, measures of democracy have become less similar to one another, as shown by the predominantly decreasing lines in the plot. Heterogeneity among correlation coefficients has increased over time. In 1975, the correlation coefficients among measures of democracy range only between 0.84 and 0.92, but in 2015, they range from as low as 0.76 to as high as 0.95.
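Year-specific correlations of this kind can be obtained by grouping the observations by year before computing the coefficient. The sketch below assumes that the data are stored as a list of dictionaries with a "year" key and one key per measure; this layout, the function names, and the example rows are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, Iterable, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long series with at least two values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def correlations_by_year(rows: Iterable[dict], first: str, second: str,
                         years: Iterable[int]) -> Dict[int, float]:
    """Correlation between two measures, computed separately for each selected year."""
    wanted = set(years)
    grouped: Dict[int, List[tuple]] = defaultdict(list)
    for row in rows:
        a, b = row.get(first), row.get(second)
        # Keep only selected years in which both measures are observed.
        if row["year"] in wanted and a is not None and b is not None:
            grouped[row["year"]].append((a, b))
    return {
        year: pearson([p[0] for p in pairs], [p[1] for p in pairs])
        for year, pairs in sorted(grouped.items())
    }


# Hypothetical rows, not real scores.
rows = [
    {"year": 1975, "FH": 10, "VDEM": 12},
    {"year": 1975, "FH": 50, "VDEM": 48},
    {"year": 1975, "FH": 90, "VDEM": 95},
    {"year": 1976, "FH": 20, "VDEM": 80},   # ignored: not a 5-year mark
    {"year": 1980, "FH": 10, "VDEM": 90},
    {"year": 1980, "FH": 50, "VDEM": 50},
    {"year": 1980, "FH": 90, "VDEM": 10},
]
by_year = correlations_by_year(rows, "FH", "VDEM", range(1975, 2016, 5))
print(by_year)
```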

Fig. 2 Evolution of correlations over time: line plot of linear bivariate correlations among measures of democracy in 5-year intervals, 1975–2015

The findings of this section show that, overall, measures of democracy are convergent and strongly associated with one another. We would therefore expect the four indices to be highly interchangeable, even if we have also seen that their convergence has mostly decreased in the last decades. It seems that the wave of democratisations that began in the mid-1970s has not been captured in the same way by the measures. In particular, VAN has become considerably less associated with the other measures. Likewise, PIV has become less associated with the other measures, but not as much as VAN. The only exception is the relationship between FH and VDEM, which became slightly stronger from 1975 to 2005 and has remained relatively stable from 2005 on.

Interchangeability

Table 3 presents the results of the first set of longitudinal regressions of measures of democracy on structural predictors. All other things equal, the results show that different measures of democracy can lead to substantially different findings and interpretations. According to this first set of models, frequently used measures of democracy are not interchangeable at all.

Table 3 Longitudinal regressions of democracy on structural predictors (1973–2017)

The only similarities are that GDP/capita growth, inflation, and population are not significant predictors of democracy in any of the models. All the other associations depend meaningfully on the chosen measure of democracy. GDP/capita is a positive and significant predictor of democracy only when democracy is quantified with VAN (at the 99.9% level), VDEM (at the 99.9% level), or FH (at the 95% level). On the contrary, when democracy is quantified with PIV, there is no significant relationship between GDP/capita and the level of democracy. Trade openness is significantly and inversely related to democracy with PIV (at the 95% level) and VDEM (at the 99% level), but with VAN and FH the relationship is not significantly different from zero. Education has a significant positive effect on democracy with VAN, whereas with PIV, FH, and VDEM there is no significant relationship between education and democracy. Presidential systems are significantly and positively related to democracy only with VAN and VDEM (in both cases at the 95% level). Parliamentary systems are significantly and positively related to democracy only with FH (at the 95% level) and VAN (at the 99.9% level). There is a significant and positive relationship between party fractionalisation and democracy only with PIV and VAN. With FH and VDEM, instead, the relationship between party fractionalisation and democracy is not even close to conventional levels of statistical significance.
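The pattern described above can be checked mechanically by coding each association as positive, negative, or not significant and comparing the measures pair by pair. The coding below is transcribed from the description in the text for the full sample (1973–2017); the encoding scheme and function names are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple

MEASURES = ("PIV", "VAN", "FH", "VDEM")

# "+" positive and significant, "-" negative and significant, "0" not significant.
findings: Dict[str, Dict[str, str]] = {
    "GDP/capita": {"PIV": "0", "VAN": "+", "FH": "+", "VDEM": "+"},
    "Trade openness": {"PIV": "-", "VAN": "0", "FH": "0", "VDEM": "-"},
    "Education": {"PIV": "0", "VAN": "+", "FH": "0", "VDEM": "0"},
    "Presidentialism": {"PIV": "0", "VAN": "+", "FH": "0", "VDEM": "+"},
    "Parliamentarism": {"PIV": "0", "VAN": "+", "FH": "+", "VDEM": "0"},
    "Party fractionalisation": {"PIV": "+", "VAN": "+", "FH": "0", "VDEM": "0"},
}
# Growth, inflation, and population are not significant with any measure.
for predictor in ("Growth", "Inflation", "Population"):
    findings[predictor] = {measure: "0" for measure in MEASURES}


def agreeing_pairs(table: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Pairs of measures that lead to the same conclusion for every predictor."""
    return [
        (a, b)
        for a, b in combinations(MEASURES, 2)
        if all(row[a] == row[b] for row in table.values())
    ]


def contested_predictors(table: Dict[str, Dict[str, str]]) -> List[str]:
    """Predictors whose conclusion depends on the chosen measure."""
    return [name for name, row in table.items() if len(set(row.values())) > 1]


print(agreeing_pairs(findings))
print(contested_predictors(findings))
```

With this coding, no pair of measures agrees on all nine predictors, and six of the nine predictors yield conclusions that depend on the measure.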

The findings of the first set of regressions show clearly that measures of democracy are weakly interchangeable in our full sample of common country-years. Nevertheless, since we previously found that in general the convergence among measures of democracy has become weaker over time, I investigate whether the interchangeability of the measures follows a similar pattern as well. I divide the dataset into two similar parts (1973–1995 [T = 23]; 1996–2017 [T = 22]) and regress each measure of democracy on our selection of structural predictors. Given our earlier findings, we would expect the measures to be more interchangeable between 1973 and 1995 than between 1996 and 2017.

The expectations are confirmed by the regression estimates. In the first sample of years (1973–1995), the estimated results lead to similar interpretations nearly without any exceptions (Table 4). Only the effect of education depends meaningfully on the chosen measure of democracy: education is significantly and positively related to democracy with VAN and FH, but has no significant relationship with democracy with PIV and VDEM. No other prediction is statistically different from zero at conventional levels. This means that, in general, in the first sample of years (1973–1995), the choice of the measure does not play a decisive role in generating different results. In fact, from 1973 to 1995, PIV and VDEM lead exactly to the same conclusions across all predictors. Given these results, it would not be surprising to find that the weak interchangeability of the full sample (1973–2017) is driven by increased divergencies among measures in more recent years (1996–2017).

Table 4 Longitudinal regressions of democracy on structural predictors (1973–1995)

As suspected, in the second sample of years (1996–2017) the choice of the measure of democracy crucially affects the conclusions (Table 5). With VAN and VDEM, the relationship between GDP/capita and democracy is positive and significant (at the 99% level), whereas with the other two measures of democracy there is no evidence of a significant relationship between GDP/capita and democracy. Trade openness is a significant predictor of democracy only with PIV and VDEM, education is a significant predictor of democracy only with VAN, and population predicts democracy significantly only with VDEM. The relationship between presidential systems and democracy is significantly positive with PIV (at the 95% level) and VAN (at the 99% level), but non-significant with FH and VDEM. Parliamentary systems are positively related to democracy with VAN (at the 99.9% level), whereas with PIV, FH, and VDEM the estimates do not point to the existence of such a relationship. Party fractionalisation has a positive and significant effect on democracy with PIV and VAN, but not with FH and VDEM.

Table 5 Longitudinal regressions of democracy on structural predictors (1996–2017)

The coefficient plots in Fig. 3 provide a visual representation of the point estimates and 95% confidence intervals of selected independent variables in our battery of regression models. These plots illustrate clearly how the lack of interchangeability in the full sample of country-years (1973–2017, first row in Fig. 3) is driven by the weak interchangeability of the measures in the sample of most recent years (1996–2017, third row in Fig. 3). In the antecedent sample of country-years (1973–1995, second row in Fig. 3), on the contrary, the results are almost never different from zero, regardless of the chosen measure.

Fig. 3 Coefficient plots of selected predictors across regression models

Besides summarising the main results of the regressions, coefficient plots are particularly useful to analyse differences in confidence intervals. They allow us to notice immediately that, almost without exceptions, the predicted effects in models with VDEM as the dependent variable have systematically smaller standard errors, and thus narrower confidence intervals, compared to the other models. Being the cause of less precise predictions can be considered a limitation of any measure, since we can confidently assume that users of measures of democracy prefer precise inferences to imprecise inferences. This finding can be explained by the fact that in general VDEM captures much more variation across countries and over years than the other measures. Hence, VDEM seems to lead to more precise estimates than the other measures and is likely to bring more fruitful results in large-N cross-national studies than VAN, PIV, or FH.
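The link between standard errors and conclusions can be made concrete with a small sketch. It assumes a normal approximation, so that a 95% confidence interval is the point estimate plus or minus 1.96 standard errors; the estimates and standard errors below are hypothetical and do not come from the regression tables.

```python
from typing import Tuple


def confidence_interval(estimate: float, std_error: float,
                        z: float = 1.96) -> Tuple[float, float]:
    """Confidence interval under a normal approximation (z = 1.96 for 95%)."""
    if std_error < 0:
        raise ValueError("standard error cannot be negative")
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


def excludes_zero(interval: Tuple[float, float]) -> bool:
    """True if the whole interval lies on one side of zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Hypothetical case: the same point estimate with two different standard errors.
precise = confidence_interval(estimate=0.5, std_error=0.2)
imprecise = confidence_interval(estimate=0.5, std_error=0.3)
print(precise, excludes_zero(precise))
print(imprecise, excludes_zero(imprecise))
```

In this example the same estimated effect is distinguishable from zero only in the model with the smaller standard error, which is how differences in precision alone can change the substantive conclusion.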

We would expect highly convergent measures of democracy to be similarly related to structural predictors of democracy, but the findings show instead that in general the interchangeability of measures of democracy is weak. Choosing one measure instead of another can substantially affect the conclusions. In fact, in two of our three sets of regression models there is not even a single pair of measures that leads to the same conclusions across all predictors. This is problematic, since most scholars treat highly correlated measures of democracy as interchangeable. The findings also show that the interchangeability of these measures has weakened over time. While the measures lead to relatively similar conclusions in a longitudinal analysis from 1973 to 1995, the choice of the measure becomes crucial in a longitudinal analysis from 1996 to 2017, where all measures lead to different conclusions.

Conclusions

Through a comparative statistical analysis of four frequently used democracy indices, the study at hand contributes to the literature on the measurement of democracy and provides novel information regarding the statistical properties, convergence, and interchangeability of some of the most currently relevant measures.

First of all, the analysis of the statistical properties has shown that none of the democracy measures is normally distributed and some of the measures tend to clump at the extremes of the scale. In general, VAN compresses too many observations at the bottom extreme of the scale and PIV and FH compress too many observations at the top extreme of the scale. These three measures could be improved by extending their scales. VDEM, PIV, and FH are better able than VAN to distinguish between different degrees of democracy at the low end of the scale, whereas VDEM and VAN are better able than PIV and FH to distinguish between different degrees of democracy at the high end of the scale. There is no agglomeration of observations at the extremes with VDEM.

Regardless of these differences, the findings of the study at hand have shown that frequently used measures of democracy are highly convergent with one another. However, this convergence has not remained constant over time; instead, we have observed a pattern of decreasing bivariate association between most measures since the 1980s. While in 1980 and 1985 most bivariate correlation coefficients between the measures of democracy were at least 0.90, in 2015 they were below 0.90 nearly without exceptions. VAN in particular has become less related to the other measures, although it is important to stress that, overall, the surveyed measures of democracy are strongly related to one another throughout the analysed period of time.

Despite high correlations, the findings show that the interchangeability of measures of democracy is generally weak, lending further support to some of the previous claims on the contrasting relationship between correlation and interchangeability of measures of democracy. By substantially broadening the empirical foundations of earlier studies on the topic with additional measures and more extensive longitudinal coverage, this study provides novel knowledge on the interchangeability of currently relevant measures of democracy. The estimations from three sets of longitudinal regressions show not only that the choice of the measure of democracy significantly affects the conclusions, but also that the interchangeability of these measures has become weaker during the last few decades. On the whole, it is worrying that measures of democracy can be so differently related to common predictors of democracy, but it explains some of the inconsistent conclusions related to democracy, its causes, and its consequences.

If findings are not generalisable across measures, it becomes important to understand why. This comparative analysis has pointed out some statistical differences and shortcomings of the measures that scholars should take into consideration when using these measures. To give a few examples, we have seen that VAN fails to distinguish between highly autocratic countries, that PIV and FH compress countries at the extremes of the scale, and that VDEM captures most of the variation between countries. VDEM also seems to yield more precise estimates than the other measures. Moreover, scholars should remember that VAN has become less and less associated with the other measures since 1980, raising some doubts regarding its validity and reliability.

Individual country scores can provide more detailed insights into some of the increased divergencies among the measures. While in the early years of our sample it is hard to find severe disagreements about countries, in the last couple of decades there are disagreements even on the most obvious cases. For instance, virtually all experts would agree that Venezuela experienced a major democratic erosion during Chávez’s presidency. However, a comparison of the country-level scores in Venezuela from 1999 to 2013 shows that this “democratic backsliding” is captured only by VDEM, FH, and PIV. Strikingly, according to VAN, the level of democracy in Venezuela increased considerably during Chávez’s rule.

Similarly, it is commonly acknowledged that Hungary has suffered a democratic decline in recent years, but this decline is not captured by all our measures of democracy. From 2010 to 2018, the level of democracy in Hungary decreased sharply according to VDEM and FH, but remained stable and as high as possible according to PIV and even slightly increased according to VAN. Since there is little doubt that in practice both Venezuela and Hungary have suffered a decline in democracy, an inability to record such a pattern must be considered a shortcoming. Besides being aware of the weak interchangeability, scholars using measures of democracy should be aware of these deficiencies in individual country scores. In particular, these two examples reinforce our doubts regarding the validity of VAN and suggest that comparative studies of individual country scores could be valuable in understanding more thoroughly divergencies among measures of democracy.

To conclude, as we have seen in this study, measures of democracy are neither equivalent nor interchangeable despite high convergence. The choice of the measure matters, and ideally, researchers should carefully choose the measure of democracy that best matches their theory and defend the choice on theoretical grounds. When this is the case, the lack of interchangeability becomes a minor problem. If scholars are unwilling to make such a move, however, they should at the minimum test the robustness of their results with several frequently used measures of democracy (at least with VDEM, PIV, and FH). This is even more important in studies covering the last few decades, since the findings indicate that the interchangeability of measures of democracy has decreased. Alternatively, if theoretically appropriate, researchers could dig deeper into the sub-components of the measures and select a specific sub-component that matches their theory well. Whatever the chosen approach, the study at hand has shown convincingly that since measures of democracy are weakly interchangeable, findings are not likely to be valid across different measures, unless further evidence to generalise findings is provided.